Sum of all the values divided by the number of observations
What is the median?
The value at the centre of the dataset, equal proportions above and below.
What is the mode?
The most frequently occurring value (or values) in the dataset
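The three measures above can be sketched in a few lines of Python. The data values below are made up for illustration and are not from these notes.

```python
import statistics

# Hypothetical example data (an assumption, not from the notes).
values = [2, 3, 3, 5, 7]

# Mean: sum of all the values divided by the number of observations.
mean = sum(values) / len(values)

# Median: the value at the centre of the sorted data.
median = statistics.median(values)

# Mode: the most frequently occurring value.
mode = statistics.mode(values)

print(mean, median, mode)
```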
What does unimodal mean?
A distribution which has only one peak, so it has a single most frequent value (e.g. a normal distribution)
What does bimodal mean?
A bimodal distribution has two peaks (modes) - this suggests the data may come from two different groups
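As a rough illustration, the number of modes in a small discrete dataset can be checked with `statistics.multimode` (Python 3.8+). Both datasets are invented for the example. For real continuous data, peaks are usually judged from a histogram rather than from exact repeated values, so treat this as a simplified sketch.

```python
import statistics

# Hypothetical datasets (assumptions for illustration only).
single_peak = [1, 2, 2, 2, 3]
two_peaks = [1, 1, 1, 2, 5, 5, 5]

# multimode returns every value that ties for most frequent.
unimodal_modes = statistics.multimode(single_peak)
bimodal_modes = statistics.multimode(two_peaks)

print(unimodal_modes)  # one mode, so unimodal
print(bimodal_modes)   # two modes, so bimodal
```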
What is a symmetric distribution?
This is when the left and right sides are mirror images of each other around a central point (usually a mean or median)
What are the key characteristics of a symmetric distribution?
The mean, median and mode are usually the same and lie in the middle. If you fold the distribution at its centre, both halves will match perfectly
What is a gaussian (normal) distribution?
A bell-shaped probability distribution that is symmetric around its mean.
It is symmetric - the left and right sides are mirror images
It is a bell-shaped curve - the highest point is at the mean
Probability decreases as you move away from the mean
mean = median = mode - all three measures of central tendency are the same
Continuous data
Independent observations
Vary from result to result in an unpredictable manner, although some values are more likely than others
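The symmetry and central-peak properties listed above can be checked numerically with `statistics.NormalDist` from the Python standard library. The mean of 100 and standard deviation of 15 are arbitrary values chosen for this sketch.

```python
from statistics import NormalDist

# Hypothetical parameters (assumptions for illustration only).
dist = NormalDist(mu=100, sigma=15)

# Density at equal distances either side of the mean.
left = dist.pdf(100 - 10)
right = dist.pdf(100 + 10)

# Density at the mean itself.
peak = dist.pdf(100)

print(left, right)   # equal: the two sides are mirror images
print(peak > left)   # the highest point is at the mean
print(dist.mean, dist.median, dist.mode)  # all three are the same
```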
What is a skewed distribution?
This is when the data is not distributed symmetrically around the mean. It can be either left or right skewed, where the tail is longer on the left or the right respectively
Negatively skewed
The peak is more to the right and the longer tail is on the left. The mode is the largest, followed by the median and then the mean
Positively skewed
The peak is more to the left and the longer tail is on the right. The mean is the largest, followed by the median and then the mode - the opposite of the negatively skewed graph
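The ordering of mean, median and mode described above can be seen on two small made-up datasets. This ordering is a common rule of thumb rather than a guarantee for every dataset, and the numbers here are assumptions chosen to show the pattern clearly.

```python
import statistics

# Hypothetical data with a long right tail (positively skewed).
positive_skew = [1, 2, 2, 3, 4, 10]

# Hypothetical data with a long left tail (negatively skewed).
negative_skew = [1, 7, 8, 9, 9, 10]

def summarise(data):
    """Return (mode, median, mean) for a list of numbers."""
    return (statistics.mode(data), statistics.median(data), statistics.mean(data))

pos_mode, pos_median, pos_mean = summarise(positive_skew)
neg_mode, neg_median, neg_mean = summarise(negative_skew)

# Positively skewed: mode < median < mean.
print(pos_mode, pos_median, pos_mean)

# Negatively skewed: mean < median < mode.
print(neg_mean, neg_median, neg_mode)
```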
How do you calculate standard deviation?
Calculate the mean of the data
Subtract the mean from each data point (so you know how much each point deviates from the mean)
Square the output of (step 2) for each data point
Add all the outputs of (step 3) together
Divide the output of (step 4) by n-1, where n is the number of data points
Take the square root of the output of (step 5)
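The six steps above translate directly into Python. The data is a made-up example, and the result is compared against `statistics.stdev`, which also divides by n-1.

```python
import math
import statistics

# Hypothetical example data (an assumption, not from the notes).
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Step 1: calculate the mean of the data.
mean = sum(data) / len(data)

# Step 2: subtract the mean from each data point.
deviations = [x - mean for x in data]

# Step 3: square each deviation.
squared = [d ** 2 for d in deviations]

# Step 4: add the squared deviations together.
total = sum(squared)

# Step 5: divide by n-1, where n is the number of data points.
variance = total / (len(data) - 1)

# Step 6: take the square root.
std_dev = math.sqrt(variance)

print(std_dev)
print(statistics.stdev(data))  # library result for comparison
```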
What is an estimator?
A rule (a function of the sample data) used to calculate an estimate of a population quantity
(e.g. the assumption of a normal distribution may be used to estimate the statistical properties of a population from a sample of it)
What makes a good estimator?
Unbiased - it does not yield systematic errors
Consistent - the estimates should converge to the true value as the sample gets larger
Doesn't need to be precise
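One way to see what "unbiased" means is to average an estimator over every possible sample. The sketch below uses a tiny invented population and all samples of size 2 drawn with replacement. It compares dividing by n-1 (as in the standard deviation steps) with dividing by n. The population and the helper name `variance_with_divisor` are assumptions made for this example.

```python
from itertools import product

# Hypothetical population (an assumption for illustration only).
population = [1, 2, 3, 4]

def variance_with_divisor(values, offset):
    """Sum of squared deviations from the mean, divided by (n - offset)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - offset)

# True population variance: divide by the full population size.
true_variance = variance_with_divisor(population, 0)

# Every possible ordered sample of size 2, drawn with replacement.
samples = list(product(population, repeat=2))

# Average of each estimator over all possible samples.
avg_n_minus_1 = sum(variance_with_divisor(s, 1) for s in samples) / len(samples)
avg_n = sum(variance_with_divisor(s, 0) for s in samples) / len(samples)

print(true_variance)   # the target value
print(avg_n_minus_1)   # matches the target on average: unbiased
print(avg_n)           # systematically too small: biased
```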
What is the gaussian (normal) distribution determined by?
The mean and standard deviation
What is the z-score?
The number of standard deviations an observation is above or below the mean
How do you convert a value to a z-score?
Subtract the mean
Divide by the standard deviation
Allows the comparison of observations from different normal distributions
What is the equation to calculate z score?
z = (x - μ) / σ, where x = the value you want to convert, μ = the mean of the dataset, σ = the standard deviation of the dataset
How do we interpret the z scores?
z = 0 - the value is exactly at the mean
z > 0 - the value is above the mean
z < 0 - the value is below the mean
|z| > 2 - the value is far away from the mean (unusual in a normal distribution)
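A minimal sketch of the conversion and the interpretation rules above. The function names `z_score` and `interpret` and the example numbers are hypothetical, chosen only for illustration.

```python
def z_score(x, mean, std_dev):
    """Subtract the mean, then divide by the standard deviation."""
    return (x - mean) / std_dev

def interpret(z):
    """Describe a z-score using the rules in the notes."""
    if abs(z) > 2:
        return "far from the mean"
    if z > 0:
        return "above the mean"
    if z < 0:
        return "below the mean"
    return "exactly at the mean"

# Hypothetical observations from two different normal distributions.
z_a = z_score(85, mean=70, std_dev=10)    # distribution A
z_b = z_score(130, mean=100, std_dev=15)  # distribution B

# Once converted, the two observations can be compared directly.
print(z_a, interpret(z_a))
print(z_b, interpret(z_b))
```

Here the second observation has the larger z-score, so it sits further above its own mean than the first one does, even though the raw values come from different scales.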
What is a p value?
How likely it is to get a result like this if H0 (the null hypothesis) is true
If the null hypothesis is true, the p value gives the probability of obtaining a test statistic at least as extreme as the one obtained
The smaller the p value, the stronger the evidence against H0
What is a confidence interval?
A range of values that is likely to contain the true population parameter (e.g., the mean) with a certain level of confidence (e.g., 95%)
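A rough sketch of calculating a 95% confidence interval for a mean. It assumes the normal approximation (a critical value of about 1.96). For small samples a t distribution is normally used instead, so treat this as an illustration only. The sample values are invented.

```python
import math
from statistics import NormalDist, mean, stdev

# Hypothetical sample (an assumption, not from the notes).
sample = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0, 5.1, 4.9]

n = len(sample)
sample_mean = mean(sample)

# Standard error of the mean: sample standard deviation / sqrt(n).
standard_error = stdev(sample) / math.sqrt(n)

# Critical value leaving 2.5% in each tail of the standard normal.
z_crit = NormalDist().inv_cdf(0.975)

lower = sample_mean - z_crit * standard_error
upper = sample_mean + z_crit * standard_error

print(lower, upper)
```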
What is a two tailed test?
A two-tailed test is a type of hypothesis test where you check for differences in both directions (greater than and less than). It is used when you want to determine whether a sample mean is significantly different (either higher or lower) from a population mean, rather than just greater or smaller.
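The difference between one-tailed and two-tailed p values can be sketched with the standard normal distribution. The test statistic z = 2.1 is a hypothetical value, and the sketch assumes the test statistic follows a standard normal distribution when H0 is true.

```python
from statistics import NormalDist

# Hypothetical test statistic (an assumption for illustration only).
z = 2.1

std_normal = NormalDist()  # mean 0, standard deviation 1

# One-tailed: probability of a value at least this far above the mean.
p_one_tailed = 1 - std_normal.cdf(z)

# Two-tailed: probability of a value at least this far from the mean
# in either direction (higher or lower).
p_two_tailed = 2 * (1 - std_normal.cdf(abs(z)))

print(p_one_tailed)
print(p_two_tailed)
```

The two-tailed p value is twice the one-tailed value because it counts extreme results on both sides of the mean.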
What does a 95% confidence interval mean?
A 95% confidence interval means that if you were to repeat an experiment or study many times, about 95% of the calculated confidence intervals from those repetitions would contain the true population parameter (like the true mean or proportion).
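The "repeat the study many times" idea can be simulated. The sketch below draws many samples from a normal population with a known mean, builds a 95% interval from each one, and counts how often the interval contains the true mean. All of the parameters are assumptions chosen for the example, and the population standard deviation is treated as known to keep the interval simple.

```python
import math
import random
from statistics import NormalDist

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population and study design (assumptions only).
TRUE_MEAN = 50.0
TRUE_SD = 10.0
SAMPLE_SIZE = 30
REPEATS = 2000

# Half-width of a 95% interval when the population SD is known.
z_crit = NormalDist().inv_cdf(0.975)
margin = z_crit * TRUE_SD / math.sqrt(SAMPLE_SIZE)

covered = 0
for _ in range(REPEATS):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(SAMPLE_SIZE)]
    sample_mean = sum(sample) / SAMPLE_SIZE
    # Does this repetition's interval contain the true mean?
    if sample_mean - margin <= TRUE_MEAN <= sample_mean + margin:
        covered += 1

coverage = covered / REPEATS
print(coverage)  # expected to be close to 0.95
```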