DATA 710 - Midterm (Units 1-7)

Cards (40)

  • Five Vs of Big Data
    Volume, velocity, variety, veracity, and value
  • Veracity
    Accuracy, integrity, and fidelity of data
  • Pattern
Consistent trait or feature found in a dataset
  • Problem Solving Steps
    Recognizing, defining, structuring, analyzing, interpreting, implementing.
  • Descriptive/Diagnostic Analytics
Post-mortem analysis to determine what happened (descriptive) and why it happened (diagnostic)
  • Predictive Analytics
    Forecasts future outcomes based on historical data.
  • Prescriptive Analytics
    Recommends actions based on data analysis.
  • Supervised Learning
Starts with known problems and answers (labeled data) to learn solutions that generalize to new cases.
  • Unsupervised Learning
Explores unlabeled data without a predefined objective, looking for structure such as clusters.
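    A minimal sketch contrasting the two, assuming scikit-learn is available (the toy data is hypothetical):

      import numpy as np
      from sklearn.cluster import KMeans
      from sklearn.linear_model import LinearRegression

      X = np.array([[1], [2], [3], [4]])   # inputs
      y = np.array([2, 4, 6, 8])           # known answers (labels)

      # Supervised: learn the input -> output mapping from labeled examples
      model = LinearRegression().fit(X, y)
      print(model.predict([[5]]))          # ~[10.]

      # Unsupervised: look for structure in X alone, no labels given
      print(KMeans(n_clusters=2, n_init=10).fit_predict(X))  # e.g. [0 0 1 1]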
  • NARA Policies
Govern electronic archive management, focusing on authenticity, integrity, and chain of custody.
  • Good Data Quality
Data that is clear, concise, accurate, and trustworthy
  • Risks of Poor Data Quality
    Cost, time, and accuracy implications for analysis.
  • Causes of Poor Data Quality
    Data evolution, use evolution, bad initial design, missing data, redundant data, erroneous data
  • Handling Missing Data
    Fill null values using statistical analysis if relevant.
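    A minimal sketch of filling nulls with a statistic, assuming pandas (the "age" column is hypothetical):

      import pandas as pd

      df = pd.DataFrame({"age": [25.0, None, 31.0, 40.0]})

      # Replace the null with the column mean, if the mean is a
      # sensible stand-in for this variable
      df["age"] = df["age"].fillna(df["age"].mean())
      print(df)   # the null becomes 32.0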
  • Handling Erroneous Data
    Use ablation studies to assess data contributions.
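    A minimal sketch of the ablation idea, assuming NumPy (the values are hypothetical):

      import numpy as np

      data = np.array([9.8, 10.1, 10.0, 97.0])     # 97.0 looks erroneous

      with_suspect = data.mean()
      without_suspect = np.delete(data, 3).mean()  # drop the suspect value

      # A large shift means the suspect value drives the result
      print(with_suspect, without_suspect)         # ~31.7 vs ~9.97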
  • Data Quality Correction Questions
    Can you? Should you? Can you afford not to?
  • Minimizing Data Quality Problems
    Document purpose, clean data, test the design, and audit database queries.
  • Bad Data Qualities
    Inconsistent, inaccurate, incomplete, obsolete, duplicated entries.
  • Binning
    Grouping data into ranges for analysis.
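    A minimal sketch assuming pandas (the bin edges and labels are hypothetical):

      import pandas as pd

      ages = pd.Series([3, 17, 25, 40, 68])

      # Group continuous ages into labeled ranges
      bins = pd.cut(ages, bins=[0, 18, 65, 100],
                    labels=["minor", "adult", "senior"])
      print(bins.value_counts())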
  • Smoothing
    Removing noise and outliers to reveal underlying trends.
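    A minimal moving-average sketch, assuming pandas (the series is hypothetical):

      import pandas as pd

      noisy = pd.Series([10, 12, 11, 30, 12, 13, 11])  # 30 is a noise spike

      # A 3-point moving average damps the spike and exposes the trend
      print(noisy.rolling(window=3, center=True).mean())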
  • Generalization
    Aggregating detailed data into broader categories.
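    A minimal sketch assuming pandas (the city-to-region mapping is hypothetical):

      import pandas as pd

      cities = pd.Series(["Boston", "Austin", "Seattle", "Miami"])

      # Roll detailed values up into broader categories
      regions = cities.map({"Boston": "East", "Miami": "East",
                            "Austin": "South", "Seattle": "West"})
      print(regions.value_counts())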
  • Normalization
    Scaling data to a common range for comparison.
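    A minimal min-max scaling sketch, assuming pandas (the values are hypothetical):

      import pandas as pd

      s = pd.Series([10, 20, 25, 50])

      # Map all values onto the common range [0, 1]
      print((s - s.min()) / (s.max() - s.min()))  # 0.0, 0.25, 0.375, 1.0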
  • Aggregation
    Summarizing data to derive overall insights.
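    A minimal sketch assuming pandas (the table is hypothetical):

      import pandas as pd

      df = pd.DataFrame({"dept": ["A", "A", "B", "B"],
                         "sales": [100, 150, 90, 60]})

      # Summarize detail rows into one value per group
      print(df.groupby("dept")["sales"].sum())  # A: 250, B: 150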
  • Discretization
    Transforming continuous data into discrete intervals.
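    A minimal sketch assuming NumPy (the readings and boundaries are hypothetical):

      import numpy as np

      temps = np.array([12.3, 18.7, 22.1, 30.5])

      # Map continuous readings onto discrete interval indices
      print(np.digitize(temps, bins=[15, 25]))  # [0 1 1 2]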
  • Exploratory Data Analysis (EDA)
    Methods to assess data's viability, integrity, and usefulness with respect to the problem's objective
  • Hypothesis Testing (HT)
    Core part of EDA; statistical methods to validate assumptions about data.
  • T-Tests
Comparative tests of the average (mean) values of two datasets.
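    A minimal two-sample sketch assuming SciPy (the samples are hypothetical):

      from scipy import stats

      a = [5.1, 4.9, 5.3, 5.0]
      b = [5.8, 6.1, 5.9, 6.0]

      # Do the two groups share the same mean? A small p-value says no.
      t, p = stats.ttest_ind(a, b)
      print(t, p)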
  • Chi-Squared Tests
    Assessing differences between observed and expected data.
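    A minimal goodness-of-fit sketch assuming SciPy (the counts are hypothetical; observed and expected totals must match):

      from scipy import stats

      observed = [48, 52, 70, 30]   # counts seen in the data
      expected = [50, 50, 50, 50]   # counts a model predicts

      # A small p-value: observed counts differ from expectation
      chi2, p = stats.chisquare(observed, f_exp=expected)
      print(chi2, p)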
  • Regression Tests
Modeling relationships between variables by fitting lines or curves to the data.
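    A minimal line-fit sketch assuming NumPy (the points are hypothetical):

      import numpy as np

      x = np.array([1, 2, 3, 4, 5])
      y = np.array([2.1, 3.9, 6.2, 8.1, 9.9])

      # Fit y = m*x + b to model the relationship
      m, b = np.polyfit(x, y, deg=1)
      print(m, b)   # roughly m ~= 2, b ~= 0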
  • Metadata
    Data that provides information about other data.
  • Descriptive metadata
Describes the content and context of data, specific to the domain in which the data lives (arguably the most important type)
  • Technical metadata
Describes the information needed to make the data available to users; handles the interface between the data and the hardware
  • Operational metadata
    Captures all of the requirements for the care and upkeep of the data.
  • Organizational metadata
    Information related to the organization, policies, and reporting guidelines of data.
  • Ontology
    Standard vocabulary used in metadata.
  • Resource Description Framework (RDF)
    Primary language for defining ontologies.
  • Data Curation
    Ensures data is not only useful now, but also in the future
  • Digital Preservation
    Long-term storage of inactive data.
  • How does curation contribute to reproducibility of results?
Allows multiple analysts to use the same dataset
  • Security
    Access control policies for data authorization.