DATA 710 - Midterm (Units 1-7)
Created by alex
Cards (64)
Five Vs of Big Data
Volume, velocity, variety, veracity, and value
Veracity
Accuracy, integrity, and fidelity of data
Pattern
Consistent trait or feature found in a dataset
Problem Solving Steps
Recognizing, defining, structuring, analyzing, interpreting, and implementing.
Descriptive/Diagnostic Analytics
Post-mortem analysis to know what happened and why it happened
Predictive Analytics
Forecasts future outcomes based on historical data.
Prescriptive Analytics
Recommends actions based on data analysis.
Supervised Learning
Starts with known problems to find solutions.
Unsupervised Learning
Explores data without a predefined objective.
NARA Policies
Govern electronic archive management, focused on authenticity, integrity, and chain of custody.
Good Data Quality
Data that is clear, concise, accurate, and trustworthy
Risks of Poor Data Quality
Cost, time, and accuracy implications for analysis.
Causes of Poor Data Quality
Data evolution, use evolution, bad initial design, missing data, redundant data, and erroneous data
Handling Missing Data
Fill null values using statistical analysis if relevant.
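One common way to fill null values statistically is to substitute the mean or median of the observed values. The sketch below assumes missing entries are represented as `None`; the function name `fill_missing` and the sample data are illustrative, not from the course material.

```python
from statistics import mean, median

def fill_missing(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to estimate a fill value from
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

ages = [20, None, 30, 40, None]   # hypothetical column with two nulls
print(fill_missing(ages))          # [20, 30, 30, 40, 30]
```

The median is usually the safer choice when the column contains extreme values, since a single outlier can pull the mean far from typical entries.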
Handling Erroneous Data
Use ablation studies to assess data contributions.
Data Quality Correction Questions
Can you? Should you? Can you afford not to?
Minimizing Data Quality Problems
Document purpose, clean data, test the design, and audit database queries.
Bad Data Qualities
Inconsistent, inaccurate, incomplete, obsolete, or duplicated entries.
Binning
Grouping data into ranges for analysis.
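As a rough illustration of binning, the sketch below groups values into equal-width ranges and counts how many fall in each. Equal-width bins are an assumption made here; the helper name `bin_counts` and the sample ages are hypothetical.

```python
from collections import Counter

def bin_counts(values, width):
    """Count how many values fall into each [start, start + width) range."""
    counts = Counter((v // width) * width for v in values)
    return {(start, start + width): n for start, n in sorted(counts.items())}

ages = [3, 17, 22, 25, 38, 41, 45]
print(bin_counts(ages, 20))
# {(0, 20): 2, (20, 40): 3, (40, 60): 2}
```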
Smoothing
Removing noise and outliers to reveal underlying trends.
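A simple moving average is one of several smoothing techniques; it dampens noise by averaging each value with its neighbors. This is a minimal sketch assuming a fixed window size, and `moving_average` is an illustrative name.

```python
def moving_average(values, window=3):
    """Average each run of `window` consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

print(moving_average([1, 2, 3, 4, 5], window=3))  # [2.0, 3.0, 4.0]
```

Note that a moving average softens outliers rather than removing them; dropping outliers outright would need a separate rule, such as a threshold on distance from the median.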
Generalization
Aggregating detailed data into broader categories.
Normalization
Scaling data to a common range for comparison.
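Min-max scaling is one common form of normalization: it maps the smallest value to 0 and the largest to 1. The sketch below assumes that target range; `min_max_scale` is an illustrative name, and mapping a constant column to all zeros is a choice made here, not a standard.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: no spread to scale, so return zeros (assumed convention).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_scale([10, 20, 30]))  # [0.0, 0.5, 1.0]
```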
Aggregation
Summarizing data to derive overall insights.
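Aggregation can be sketched as a group-and-sum over records. The example assumes each record is a `(key, amount)` pair; the `total_by_key` helper and the sales figures are made up for illustration.

```python
from collections import defaultdict

def total_by_key(rows):
    """Sum the amounts of (key, amount) pairs per key."""
    totals = defaultdict(float)
    for key, amount in rows:
        totals[key] += amount
    return dict(totals)

sales = [("east", 100.0), ("west", 50.0), ("east", 25.0)]  # hypothetical records
print(total_by_key(sales))  # {'east': 125.0, 'west': 50.0}
```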
Discretization
Transforming continuous data into discrete intervals.
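Discretization can be illustrated by mapping a continuous value to a labeled interval using sorted cut points. The cut points and labels below are hypothetical, and treating each cut point as the inclusive lower edge of the next interval is an assumption.

```python
from bisect import bisect_right

def discretize(value, cut_points, labels):
    """Return the label of the interval that `value` falls in.

    `cut_points` must be sorted; a value equal to a cut point
    belongs to the interval that starts at that cut point.
    """
    if len(labels) != len(cut_points) + 1:
        raise ValueError("need exactly one more label than cut points")
    return labels[bisect_right(cut_points, value)]

cut_points = [18.0, 65.0]
labels = ["minor", "adult", "senior"]
print(discretize(17.9, cut_points, labels))  # minor
print(discretize(18.0, cut_points, labels))  # adult
print(discretize(70.2, cut_points, labels))  # senior
```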
Exploratory Data Analysis (EDA)
Methods to assess data's viability, integrity, and usefulness with respect to solving the problem's objective
Hypothesis Testing (HT)
Core part of EDA; statistical methods to validate assumptions about data.
T-Tests
Comparative tests for the average values of two datasets.
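To make the comparison of averages concrete, the sketch below computes Welch's t statistic, one variant of the two-sample t-test that does not assume equal variances. Choosing Welch's variant is an assumption here, and the function stops at the statistic; turning it into a p-value requires the t distribution, which the standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for the difference between two sample means."""
    # Standard error of the difference, using each sample's own variance.
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

group_a = [1, 2, 3, 4, 5]   # hypothetical samples
group_b = [2, 3, 4, 5, 6]
print(welch_t(group_a, group_b))  # -1.0
```

A statistic far from zero suggests the two means differ by more than sampling noise would explain.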
Chi-Squared Tests
Assessing differences between observed and expected data.
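The difference between observed and expected counts is summarized by the Pearson chi-squared statistic:

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

A minimal sketch of that formula follows, assuming counts are given as two equal-length lists with nonzero expected values. The name `chi_squared` and the sample counts are illustrative.

```python
def chi_squared(observed, expected):
    """Pearson chi-squared statistic: sum of (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [18, 22, 20]   # hypothetical counts per category
expected = [20, 20, 20]
stat = chi_squared(observed, expected)  # (4 + 4 + 0) / 20, about 0.4
```

The statistic is then compared against a chi-squared distribution with the appropriate degrees of freedom to judge significance.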
Regression Tests
Modeling relationships between variables using curves.
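The simplest curve used in regression is a straight line fitted by ordinary least squares. The sketch below assumes one predictor and one response; `fit_line` and the sample points are illustrative, and other curve families (polynomial, logistic) would need different fitting code.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]          # points lying exactly on y = 2x + 1
slope, intercept = fit_line(xs, ys)
print(slope, intercept)    # 2.0 1.0
```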
Metadata
Data that provides information about other data.
Descriptive metadata
Describes the content and context of data specific to the domain in which the data lives (arguably the most important type)
Technical metadata
Describes the information needed to make the metadata available to users, handles the interface between the data and the hardware
Operational metadata
Captures all of the requirements for the care and upkeep of the data.
Organizational metadata
Information related to the organization, policies, and reporting guidelines of data.
Ontology
Standard vocabulary used in metadata.
Resource Description Framework (RDF)
Primary language for defining ontologies.
Data Curation
Ensures data is not only useful now, but also in the future
Digital Preservation
Long-term storage of inactive data.
How does curation contribute to reproducibility of results?
Allows multiple analysts to use same dataset
Security
Access control policies for data authorization.