A collection of discrete values that convey information, describing quantity, quality, fact, statistics, other basic units of meaning, or simply sequences of symbols that may be further interpreted
Datum
A singular piece of data
Internal sources of data
Reports and records of the organization itself
External sources of data
Outside the organization
Secondary Data
Data that are not collected originally but are obtained from already published or unpublished sources.
Primary Data
Data collected for the first time. They are original and generally regarded as more reliable than secondary data.
Nominal
level of measurement classifies data into mutually exclusive (nonoverlapping) categories in which no order or ranking can be imposed on the data.
Ordinal
level of measurement classifies data into categories that can be ranked; however, precise differences between the ranks do not exist.
Interval
level of measurement ranks data, and precise differences between units of measure do exist; however, there is no meaningful zero.
Ratio
level of measurement possesses all the characteristics of interval measurement, and there exists a true zero. In addition, true ratios exist when the same variable is measured on two different members of the population.
Interval
What level of measurement is SAT score?
Ratio
What level of measurement is salary?
Nominal
What level of measurement is ZIP code?
Ordinal
What level of measurement is rating scale?
Ordinal
What level of measurement is small, medium, large?
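A minimal sketch of what "can be ranked" means for an ordinal scale such as small, medium, large. The rank table SIZE_RANK and the function sort_by_size are hypothetical names introduced only for illustration; the numbers encode order, not precise differences between ranks.

```python
# Hypothetical rank table: the integers only encode order.
# The gap between 1 and 2 is NOT a measured difference.
SIZE_RANK = {"small": 1, "medium": 2, "large": 3}

def sort_by_size(labels):
    """Return the labels ordered from the lowest to the highest category."""
    return sorted(labels, key=SIZE_RANK.__getitem__)

ordered = sort_by_size(["large", "small", "medium", "small"])
print(ordered)  # ['small', 'small', 'medium', 'large']
```

Sorting is meaningful here because the categories have an order. The same operation would be meaningless for nominal data such as ZIP codes.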
data
unorganized raw facts that need processing; without it, they seem random and useless to humans.
information
a group of data that collectively carry a logical meaning
data
measured in bits and bytes
quantitative data collection
a data collection method that gathers measurable (numerical) data
7 characteristics that define data quality
• Accuracy and Precision
• Legitimacy and Validity
• Reliability and Consistency
• Timeliness and Relevance
• Completeness and Comprehensiveness
• Availability and Accessibility
• Granularity and Uniqueness
Data cleansing or data cleaning
is the process of identifying and removing (or correcting) inaccurate records from a dataset.
True
True or false: After cleaning, a dataset should be uniform with other related datasets in the operation.
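A minimal sketch of data cleaning, assuming records are (name, salary) pairs. The function clean_records, the sample records, and the rule that a missing or negative salary counts as inaccurate are all assumptions made for illustration, not a standard procedure.

```python
def clean_records(records):
    """Normalize names, then drop invalid salaries and duplicate records."""
    seen = set()
    cleaned = []
    for name, salary in records:
        name = name.strip().title()          # correct inconsistent formatting
        if salary is None or salary < 0:     # remove inaccurate records
            continue
        if (name, salary) in seen:           # remove duplicates
            continue
        seen.add((name, salary))
        cleaned.append((name, salary))
    return cleaned

# Hypothetical raw data with a duplicate, a negative value, and a missing value.
raw = [("ana", 5000), ("Ana ", 5000), ("Ben", -1), ("Cara", None), ("dan", 4200)]
cleaned = clean_records(raw)
print(cleaned)  # [('Ana', 5000), ('Dan', 4200)]
```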
Exploratory data analysis, or EDA, is a (mainly) visual approach and philosophy that focuses on the initial ways by which one should explore a data set or experiment
Exploratory data analysis
Provides a variety of tools for quickly summarizing and gaining insight into a set of data
Main aspects of EDA
Openness - A person exploring the data should be open to all possibilities prior to its exploration
Skepticism - One must ensure that the obvious story the data tells is not misleading
General purpose of EDA
To take a general view of the given data without making any assumptions about it, and to get a feel for the data and what it might mean, rather than accepting or rejecting some premise about it before exploration begins
EDA
Lets the data speak for itself instead of trying to force the data into some sort of pre-determined model
Uses of EDA
Catching mistakes and anomalies
Gaining new insights into data
Detecting outliers in data
Testing assumptions
Identifying important factors in the data
Understanding relationships
Helping figure out our next steps with respect to the data
Methods to summarize data
Numerical summarization
Data visualization
Numerical summarization
Calculating measures like mean, median, mode, variance, and standard deviation
Numerical summarization example
Given the data: 7, 1, 8, 6, 5, 4, 4, 8, 8, 1
Find the mean, median, and mode
Finding the mean
Add up all the values
Divide the sum by the number of values
Finding the median (even-numbered situation)
Arrange the values in ascending order
Identify the middle numbers
Get the average of the middle numbers
Finding the median (odd-numbered situation)
Arrange the values in ascending order
Identify the middle number
Mode
The value that occurs most frequently in the data set
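The steps for the mean, median, and mode can be sketched on the example data 7, 1, 8, 6, 5, 4, 4, 8, 8, 1. The variable names are illustrative, and the mode is taken with Python's standard statistics module.

```python
import statistics

data = [7, 1, 8, 6, 5, 4, 4, 8, 8, 1]

# Mean: add up all the values, then divide by the number of values.
mean = sum(data) / len(data)            # 52 / 10

# Median: arrange in ascending order, then take the middle value(s).
ordered = sorted(data)                  # [1, 1, 4, 4, 5, 6, 7, 8, 8, 8]
n = len(ordered)
mid = n // 2
if n % 2 == 0:
    # Even count: average the two middle numbers.
    median = (ordered[mid - 1] + ordered[mid]) / 2
else:
    # Odd count: the single middle number.
    median = ordered[mid]

# Mode: the value that occurs most frequently.
mode = statistics.mode(data)

print(mean, median, mode)  # 5.2 5.5 8
```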
Finding the population variance
Get the mean
Subtract the mean from each number
Square each of the resulting differences
Add all the squared values
Divide the sum by the number of values
Standard deviation
The square root of the variance
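The population variance steps, followed by the standard deviation, can be sketched on the same example data (7, 1, 8, 6, 5, 4, 4, 8, 8, 1). The variable names are illustrative. This divides by the number of values, so it is the population variance, not the sample variance.

```python
import math

data = [7, 1, 8, 6, 5, 4, 4, 8, 8, 1]

mean = sum(data) / len(data)                 # get the mean (5.2)
deviations = [x - mean for x in data]        # subtract the mean from each number
squared = [d ** 2 for d in deviations]       # square each difference
total = sum(squared)                         # add all the squared values (about 65.6)
variance = total / len(data)                 # divide by the number of values
std_dev = math.sqrt(variance)                # square root of the variance

print(round(variance, 2), round(std_dev, 2))  # 6.56 2.56
```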
Data visualization
A graphical representation of information and data that reveals hidden information through simple charts and diagrams
Common general types of data visualization
Charts
Tables
Graphs
Maps
Infographics
Dashboards
Boxplot
A graph of a data set obtained by drawing a horizontal line from the minimum data value to Q1, drawing a horizontal line from Q3 to the maximum data value, and drawing a box whose vertical sides pass through Q1 and Q3 with a vertical line inside the box passing through the median or Q2
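A minimal sketch of the five values a boxplot is drawn from (minimum, Q1, median, Q3, maximum). It assumes the "median of each half" convention for quartiles, which excludes the middle value when the count is odd. Other textbooks and software interpolate and may report different Q1 and Q3. The function names are hypothetical.

```python
def median_of(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 0:
        return (ordered[mid - 1] + ordered[mid]) / 2
    return ordered[mid]

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max); needs at least two values."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                                   # values below the median
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return (ordered[0], median_of(lower), median_of(ordered),
            median_of(upper), ordered[-1])

summary = five_number_summary([7, 1, 8, 6, 5, 4, 4, 8, 8, 1])
print(summary)  # (1, 4, 5.5, 8, 8)
```

In this data set Q3 equals the maximum, so the line from Q3 to the maximum has zero length.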