What is one example of where data was used in policy?
New Jersey in 1992 - increased the minimum wage from $4.25 to $5.05 increasing employment by 13% in NewJersey relative to Pennsylvania
What is one example of data being used in business?
Amazon'srecommendation system - 35% of Amazon's total sales are through this.
What is a variable?
Anything that can be measured and can differ across entities or across time.
What is a categorical variable?
A type of variable that can take on one of a limited, fixed number of values, which represent distinct categories or groups.
What is an example of a categorical variable?
Colours
What is a numerical variable?
A measurable variable in which arithmetic operations are applicable.
What is an ordinal variable?
A type of categorical variable that has a clear ordered relationship between its categories, meaning the categories can be ranked or sorted.
What is an example of an ordinal variable?
Education level
What is an important aspect of categories?
They can be numbered, but they're not numerical values.
What is an important aspect of ordinal variables?
Their values can be numbered but they are not usually numerical variables.
What is the value of the variable?
The possible outcomes or potential values that a variable can take based on its definition.
What is the realised (observed) value?
The actual value that the variable takes after the experiment or observation is made; it's the concrete outcome from a specific trial or data point.
What is the preferred condition to use a pie chart?
When there are around 3-5categories.
What can the shape of a histogram tell us?
The probability distribution.
What is a population?
A well-defined collection of units to which we want to generalise a set of findings or a statistical model.
What is a sample?
A much smallercollection of units derived form a population used to determine truths about that target population.
What are the two characteristics that make a sample good?
Representative - sample includes only members of the population being studied.
Random - every member of the population studied has an equal chance of being selected for the sample which prevents bias.
What are the three common measures of centre?
mean - average value
median - middle value
mode - most common value
What is the best use case for mode?
For categorical or discrete data
What is the best use case for mean?
For data that is symmetrically distributed without outliers.
What is the best use case for median?
For skeweddistributions when outliers are present.
What can the spread/variability of a distribution be described using?
The percentiles
What is the five-number summary?
Minimumvalue
Q1
Median
Q3
Maximumvalue
What are the ranges for outliers?
1.5 x IQR < Q1
1.5 x IQR > Q3
Why may outliers appear?
Measurement errors
Dataentry errors
Sampling errors
Naturalvariation (genuine rare events)
Skeweddistribution
Unexpected externalfactors
What are the two measures of spread?
Standard deviation
Variance
What does the Standard Deviation show?
The spread of observations around the mean.
What does the size of the value of the standard deviation and variance mean?
the standard deviation and variance will be large if the observations are widely spread about the mean and small if the observations are close to the mean.
Why do we use n - 1 instead of n in the formula for S.D.?
We use it to account for the constraint in data as the last deviation is already determined as the sum of deviations is zero.
What does a positive skew look like?
Tail of the distribution stretches longer to the right (towards higher values)
What is the order of the measures of centre in a positiveskew?
Mean > Median > Mode as the mean is pulled up by the higher values in the tail.
What does a negativeskew look like?
Tail of the distribution stretches longer to the left (towards lower values)
What is the order of the measures of centre in a negativeskew?
Mean < Median < Mode as the mean is pulled down by the lower values in the tail.