UDAU

Cards (43)

  • Data
    A collection of discrete values that convey information, describing quantity, quality, fact, statistics, other basic units of meaning, or simply sequences of symbols that may be further interpreted
  • Datum
    A singular piece of data
  • Internal sources of data
    • Reports and records of the organization itself
  • External sources of data
    • Outside the organization
  • Secondary Data
    Data that are not collected firsthand but obtained from
    already published or unpublished sources
  • Primary Data
    Data collected for the first time. It is original and more
    reliable than secondary data.
  • Nominal
    level of measurement classifies data into mutually exclusive (nonoverlapping) categories in which no order or ranking can be imposed on the data.
  • Ordinal
    level of measurement classifies data into categories that can be ranked; however, precise differences between the ranks do not exist.
  • Interval
    level of measurement ranks data, and precise differences between units of measure do exist; however, there is no meaningful zero.
  • Ratio
    level of measurement possesses all the characteristics of interval measurement, and there exists a true zero. In addition, true ratios exist when the same variable is measured on two different members of the population.
  • Interval
    What level of measurement is SAT score?
  • Ratio
    What level of measurement is salary?
  • Nominal
    What level of measurement is ZIP code?
  • Ordinal
    What level of measurement is rating scale?
  • Ordinal
    What level of measurement is small, medium, large?
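The levels of measurement in the cards above can be illustrated with a short sketch. The sample values (the ZIP codes, the sizes, and the two salaries) and the `SIZE_RANK` mapping are made up for illustration and do not come from the deck.

```python
from statistics import mode

# Nominal: ZIP codes are labels, so counting them (and taking the mode)
# is meaningful, but ordering or averaging them is not.
zip_codes = ["10001", "94105", "10001", "60601"]  # hypothetical values
most_common_zip = mode(zip_codes)

# Ordinal: sizes can be ranked, but the gaps between ranks are undefined.
# SIZE_RANK is a hypothetical helper that only encodes the ordering.
SIZE_RANK = {"small": 0, "medium": 1, "large": 2}
sizes = ["large", "small", "medium", "small"]
sorted_sizes = sorted(sizes, key=SIZE_RANK.__getitem__)

# Ratio: salary has a true zero, so "twice as much" is a meaningful claim.
salary_a, salary_b = 60000, 30000  # hypothetical values
salary_ratio = salary_a / salary_b

print(most_common_zip, sorted_sizes, salary_ratio)
```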
  • data
    unorganized raw facts that need processing; without processing,
    they appear random and are of little use to people.
  • information
    a group of data that collectively carry a logical meaning
  • data
    measured in bits and bytes
  • quantitative data collection
    measurable type of data collection method
  • 7 characteristics that define data quality
    Accuracy and Precision
    Legitimacy and Validity
    Reliability and Consistency
    Timeliness and Relevance
    Completeness and Comprehensiveness
    Availability and Accessibility
    Granularity and Uniqueness
  • Data cleansing or data cleaning
    is the process of identifying and removing (or correcting) inaccurate records from a dataset.
  • true
    True or false: After cleaning, a dataset should be uniform with other related datasets in the operation.
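As a minimal sketch of the data cleansing idea above, the snippet below removes records with missing required fields and drops exact duplicates. The `clean_records` function, the field names, and the sample records are hypothetical and chosen only for illustration; real cleaning rules depend on the dataset.

```python
def clean_records(records, required_fields):
    """Drop records with missing required fields, then drop duplicates."""
    cleaned = []
    seen = set()
    for record in records:
        # Treat an absent, None, or empty-string field as missing.
        if any(record.get(field) in (None, "") for field in required_fields):
            continue
        # Two records are duplicates if all required fields match.
        key = tuple(record[field] for field in required_fields)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned


raw = [
    {"id": 1, "name": "Ana", "age": 31},
    {"id": 2, "name": "", "age": 40},      # missing name
    {"id": 1, "name": "Ana", "age": 31},   # duplicate of the first record
    {"id": 3, "name": "Ben", "age": None}, # missing age
]
cleaned = clean_records(raw, ["id", "name", "age"])
print(cleaned)
```

This sketch only removes bad records; correcting them instead (the other option named in the card) would need rules specific to each field.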
  • Exploratory data analysis, or EDA, is a (mainly) visual approach and philosophy that focuses on the initial ways by which one should explore a data set or experiment
  • Exploratory data analysis
    Provides a variety of tools for quickly summarizing and gaining insight into a set of data
  • Main aspects of EDA
    • Openness - A person exploring the data should be open to all possibilities prior to its exploration
    • Skepticism - One must ensure that the obvious story the data tells is not misleading
  • General purpose of EDA
    To take a general view of the given data without making any assumptions about it, and to get a feel for the data and what it might mean, rather than accepting or rejecting a premise about it before exploration begins
  • EDA
    Lets the data speak for itself instead of trying to force the data into some sort of pre-determined model
  • Uses of EDA
    • Catching mistakes and anomalies
    • Gaining new insights into data
    • Detecting outliers in data
    • Testing assumptions
    • Identifying important factors in the data
    • Understanding relationships
    • Helping figure out our next steps with respect to the data
  • Methods to summarize data
    • Numerical summarization
    • Data visualization
  • Numerical summarization
    Calculating measures like mean, median, mode, variance, and standard deviation
  • Numerical summarization example
    • Given the data: 7, 1, 8, 6, 5, 4, 4, 8, 8, 1
    • Find the mean, median, and mode
  • Finding the mean
    1. Add up all the values
    2. Divide the sum by the number of values
  • Finding the median (even-numbered situation)
    1. Arrange the values in ascending order
    2. Identify the two middle numbers
    3. Get the average of the middle numbers
  • Finding the median (odd-numbered situation)
    1. Arrange the values in ascending order
    2. Identify the middle number
  • Mode
    The value that occurs most frequently in the data set
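The mean, median, and mode steps above can be sketched in Python using the example data from the deck (7, 1, 8, 6, 5, 4, 4, 8, 8, 1). The variable names are illustrative; the standard library's `statistics` module offers `mean`, `median`, and `mode` as ready-made equivalents.

```python
data = [7, 1, 8, 6, 5, 4, 4, 8, 8, 1]

# Mean: add up all the values, then divide by the number of values.
mean_value = sum(data) / len(data)

# Median: sort, then take the middle value (odd count)
# or the average of the two middle values (even count).
ordered = sorted(data)
n = len(ordered)
mid = n // 2
if n % 2 == 0:
    median_value = (ordered[mid - 1] + ordered[mid]) / 2
else:
    median_value = ordered[mid]

# Mode: the value that occurs most frequently.
counts = {}
for value in data:
    counts[value] = counts.get(value, 0) + 1
mode_value = max(counts, key=counts.get)

print(mean_value, median_value, mode_value)  # 5.2 5.5 8
```

Here the sum is 52 over 10 values, the sorted middle pair is 5 and 6, and 8 appears three times.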
  • Finding the population variance
    1. Get the mean
    2. Subtract the mean from each value
    3. Square the values from step 2
    4. Add all the squared values
    5. Divide the sum by the number of values
  • Standard deviation
    The square root of the variance
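As a sketch of the population variance and standard deviation steps above, the snippet below reuses the example data from the deck. The variable names are illustrative; `statistics.pvariance` and `statistics.pstdev` in the standard library compute the same quantities.

```python
import math

data = [7, 1, 8, 6, 5, 4, 4, 8, 8, 1]

# Get the mean.
mean_value = sum(data) / len(data)

# Subtract the mean from each value, then square each difference.
squared_deviations = [(value - mean_value) ** 2 for value in data]

# Add the squared values and divide by the number of values.
population_variance = sum(squared_deviations) / len(data)

# The standard deviation is the square root of the variance.
population_std = math.sqrt(population_variance)

print(round(population_variance, 2), round(population_std, 4))  # 6.56 2.5612
```

Dividing by the number of values gives the population variance; a sample variance would divide by one less than that count.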
  • Data visualization
    A graphical representation of information and data that reveals hidden information through simple charts and diagrams
  • Common general types of data visualization
    • Charts
    • Tables
    • Graphs
    • Maps
    • Infographics
    • Dashboards
  • Boxplot
    A graph of a data set obtained by drawing a horizontal line from the minimum data value to Q1, drawing a horizontal line from Q3 to the maximum data value, and drawing a box whose vertical sides pass through Q1 and Q3 with a vertical line inside the box passing through the median or Q2
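The five values a boxplot is drawn from (minimum, Q1, median, Q3, maximum) can be computed as sketched below. The function name `five_number_summary` is hypothetical, and the quartiles here are taken as the medians of the lower and upper halves of the sorted data, which is an assumption: other quartile conventions exist and can give slightly different Q1 and Q3.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for at least two values."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd count, leave the middle value out of both halves.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


data = [7, 1, 8, 6, 5, 4, 4, 8, 8, 1]
summary = five_number_summary(data)
print(summary)  # (1, 4, 5.5, 8, 8)
```

For this data Q3 equals the maximum, so the line from Q3 to the maximum value would have zero length.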