DATA 710 - Midterm (Units 1-7)

Cards (40)

  • Five Vs of Big Data
    Volume, velocity, variety, veracity, and value
  • Veracity
    Accuracy, integrity, and fidelity of data
  • Pattern
Consistent trait or feature found in a dataset
  • Problem Solving Steps
    Recognizing, defining, structuring, analyzing, interpreting, implementing.
  • Descriptive/Diagnostic Analytics
Post-mortem analysis to determine what happened (descriptive) and why it happened (diagnostic)
  • Predictive Analytics
    Forecasts future outcomes based on historical data.
  • Prescriptive Analytics
    Recommends actions based on data analysis.
  • Supervised Learning
Starts with known problems and answers (labeled data) to learn solutions that generalize to new cases.
  • Unsupervised Learning
Explores unlabeled data without a predefined objective, looking for structure such as clusters.
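    A minimal sketch contrasting the two, assuming scikit-learn is available (the toy data is hypothetical):

      import numpy as np
      from sklearn.cluster import KMeans
      from sklearn.linear_model import LinearRegression

      X = np.array([[1], [2], [3], [4]])   # inputs
      y = np.array([2, 4, 6, 8])           # known answers (labels)

      # Supervised: learn the input -> output mapping from labeled examples
      model = LinearRegression().fit(X, y)
      print(model.predict([[5]]))          # ~[10.]

      # Unsupervised: look for structure in X alone, no labels given
      print(KMeans(n_clusters=2, n_init=10).fit_predict(X))  # e.g. [0 0 1 1]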
  • NARA Policies
Govern electronic archive management, focusing on authenticity, integrity, and chain of custody.
  • Good Data Quality
Data that is clear, concise, accurate, and trustworthy
  • Risks of Poor Data Quality
    Cost, time, and accuracy implications for analysis.
  • Causes of Poor Data Quality
    Data evolution, use evolution, bad initial design, missing data, redundant data, erroneous data
  • Handling Missing Data
    Fill null values using statistical analysis if relevant.
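    A minimal sketch of filling nulls with a statistic, assuming pandas (the "age" column is hypothetical):

      import pandas as pd

      df = pd.DataFrame({"age": [25.0, None, 31.0, 40.0]})

      # Replace the null with the column mean, if the mean is a
      # sensible stand-in for this variable
      df["age"] = df["age"].fillna(df["age"].mean())
      print(df)   # the null becomes 32.0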
  • Handling Erroneous Data
    Use ablation studies to assess data contributions.
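    A minimal sketch of the ablation idea, assuming NumPy (the values are hypothetical):

      import numpy as np

      data = np.array([9.8, 10.1, 10.0, 97.0])     # 97.0 looks erroneous

      with_suspect = data.mean()
      without_suspect = np.delete(data, 3).mean()  # drop the suspect value

      # A large shift means the suspect value drives the result
      print(with_suspect, without_suspect)         # ~31.7 vs ~9.97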
  • Data Quality Correction Questions
    Can you? Should you? Can you afford not to?
  • Minimizing Data Quality Problems
    Document purpose, clean data, test the design, and audit database queries.
  • Bad Data Qualities
    Inconsistent, inaccurate, incomplete, obsolete, duplicated entries.
  • Binning
    Grouping data into ranges for analysis.
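    A minimal sketch assuming pandas (the bin edges and labels are hypothetical):

      import pandas as pd

      ages = pd.Series([3, 17, 25, 40, 68])

      # Group continuous ages into labeled ranges
      bins = pd.cut(ages, bins=[0, 18, 65, 100],
                    labels=["minor", "adult", "senior"])
      print(bins.value_counts())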
  • Smoothing
    Removing noise and outliers to reveal underlying trends.
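    A minimal moving-average sketch, assuming pandas (the series is hypothetical):

      import pandas as pd

      noisy = pd.Series([10, 12, 11, 30, 12, 13, 11])  # 30 is a noise spike

      # A 3-point moving average damps the spike and exposes the trend
      print(noisy.rolling(window=3, center=True).mean())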
  • Generalization
    Aggregating detailed data into broader categories.
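    A minimal sketch assuming pandas (the city-to-region mapping is hypothetical):

      import pandas as pd

      cities = pd.Series(["Boston", "Austin", "Seattle", "Miami"])

      # Roll detailed values up into broader categories
      regions = cities.map({"Boston": "East", "Miami": "East",
                            "Austin": "South", "Seattle": "West"})
      print(regions.value_counts())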
  • Normalization
    Scaling data to a common range for comparison.
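    A minimal min-max scaling sketch, assuming pandas (the values are hypothetical):

      import pandas as pd

      s = pd.Series([10, 20, 25, 50])

      # Map all values onto the common range [0, 1]
      print((s - s.min()) / (s.max() - s.min()))  # 0.0, 0.25, 0.375, 1.0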
  • Aggregation
    Summarizing data to derive overall insights.
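    A minimal sketch assuming pandas (the table is hypothetical):

      import pandas as pd

      df = pd.DataFrame({"dept": ["A", "A", "B", "B"],
                         "sales": [100, 150, 90, 60]})

      # Summarize detail rows into one value per group
      print(df.groupby("dept")["sales"].sum())  # A: 250, B: 150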
  • Discretization
    Transforming continuous data into discrete intervals.
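    A minimal sketch assuming NumPy (the readings and boundaries are hypothetical):

      import numpy as np

      temps = np.array([12.3, 18.7, 22.1, 30.5])

      # Map continuous readings onto discrete interval indices
      print(np.digitize(temps, bins=[15, 25]))  # [0 1 1 2]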
  • Exploratory Data Analysis (EDA)
    Methods to assess data's viability, integrity, and usefulness with respect to the problem's objective
  • Hypothesis Testing (HT)
    Core part of EDA; statistical methods to validate assumptions about data.
  • T-Tests
Comparative tests of the average (mean) values of two datasets.
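    A minimal two-sample sketch assuming SciPy (the samples are hypothetical):

      from scipy import stats

      a = [5.1, 4.9, 5.3, 5.0]
      b = [5.8, 6.1, 5.9, 6.0]

      # Do the two groups share the same mean? A small p-value says no.
      t, p = stats.ttest_ind(a, b)
      print(t, p)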
  • Chi-Squared Tests
    Assessing differences between observed and expected data.
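    A minimal goodness-of-fit sketch assuming SciPy (the counts are hypothetical; observed and expected totals must match):

      from scipy import stats

      observed = [48, 52, 70, 30]   # counts seen in the data
      expected = [50, 50, 50, 50]   # counts a model predicts

      # A small p-value: observed counts differ from expectation
      chi2, p = stats.chisquare(observed, f_exp=expected)
      print(chi2, p)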
  • Regression Tests
Modeling relationships between variables by fitting lines or curves to the data.
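    A minimal line-fit sketch assuming NumPy (the points are hypothetical):

      import numpy as np

      x = np.array([1, 2, 3, 4, 5])
      y = np.array([2.1, 3.9, 6.2, 8.1, 9.9])

      # Fit y = m*x + b to model the relationship
      m, b = np.polyfit(x, y, deg=1)
      print(m, b)   # roughly m ~= 2, b ~= 0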
  • Metadata
    Data that provides information about other data.
  • Descriptive metadata
Describes the content and context of data, specific to the domain in which the data lives (arguably the most important type)
  • Technical metadata
Describes the information needed to make the data available to users; handles the interface between the data and the hardware
  • Operational metadata
    Captures all of the requirements for the care and upkeep of the data.
  • Organizational metadata
    Information related to the organization, policies, and reporting guidelines of data.
  • Ontology
    Standard vocabulary used in metadata.
  • Resource Description Framework (RDF)
    Primary language for defining ontologies.
  • Data Curation
    Ensures data is not only useful now, but also in the future
  • Digital Preservation
    Long-term storage of inactive data.
  • How does curation contribute to reproducibility of results?
Allows multiple analysts to use the same dataset
  • Security
    Access control policies for data authorization.