Statistics

Created by

Tory Harratt

Cards (127)

Statistics: The science of gathering, presenting, analysing and interpreting data
Some basic terms:
– Population – the whole
– Census – data gathered from the entire population
– Sample – a subset of the population
Descriptive statistics – summarizes the characteristics of a dataset
Inferential statistics – using sample data to reach conclusions about the population from which the sample was drawn.
Parameter
– Descriptive measure of the population
– Represented by Greek letters
e.g.
μ - population mean
σ² - population variance
σ - population standard deviation
Statistic
– Descriptive measure of a sample
– Represented by Roman letters
e.g.
x̄ - sample mean
s² - sample variance
s - sample standard deviation
Elements – the entities on which data are collected
Observations – the set of measurements obtained for a particular element
Variable – a characteristic of interest for the element
Levels of data - Nominal
Labels or names used to identify an attribute of the element.
Data cannot be ordered
Levels of data - Ordinal
Numbers/categories are used to indicate a rank or order.
Relative magnitudes are meaningful
Levels of data - Interval
Data shows the properties of ordinal data
Interval between the values is expressed in terms of a fixed unit of measure
Always numeric
Levels of data - Ratio
Has properties of interval data
The location of origin is zero – nothing exists for the variable at zero point
Examples: variables such as height, time and weight
Different types of dataset – Time series
Values of a variable at different points in time
Different types of dataset – Panel data
Observations on the same individuals over time
Bar chart: the value of each category in categorical data is represented by a bar
Pie chart: the area of each section represents the relative frequency
Graphical method - histogram
Shows data summarised in frequency density (frequency/class width)
It provides information about the shape of a distribution
Similar to a bar chart, but it corrects for differences in class widths. If all class widths are identical, a bar chart and histogram have the same shape
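A minimal sketch of the frequency density calculation behind a histogram. The class intervals and frequencies are made up for illustration.

```python
# Hypothetical class intervals (lower, upper) with made-up frequencies.
classes = [(0, 10), (10, 20), (20, 40), (40, 80)]
frequencies = [5, 10, 12, 8]

# Frequency density = frequency / class width.
densities = [f / (upper - lower) for (lower, upper), f in zip(classes, frequencies)]

for (lower, upper), d in zip(classes, densities):
    print(f"{lower}-{upper}: density {d:.2f}")
```

The 20-40 class has the highest frequency (12) but not the highest density, because it is twice as wide as the 10-20 class. This is the class-width correction that a plain bar chart of frequencies would miss.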
Ogive
Cumulative distribution plot - distribution curve in which the frequencies are cumulative
Group data into intervals:
Frequency: Number of times a particular value occurs
Cumulative frequency: the sum of successive frequencies
Relative frequency: Proportion of observations in each class
Cumulative relative frequency: The sum of successive relative frequencies
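The four quantities above can be sketched for a small, made-up dataset as follows (the data values are an assumption for illustration only):

```python
from collections import Counter
from itertools import accumulate

data = [2, 3, 3, 4, 4, 4, 5, 5, 6, 6]  # hypothetical observations

counts = Counter(data)
values = sorted(counts)                   # distinct values in order
freq = [counts[v] for v in values]        # frequency of each value
cum_freq = list(accumulate(freq))         # running total of frequencies
n = len(data)
rel_freq = [f / n for f in freq]          # proportion in each class
cum_rel_freq = [c / n for c in cum_freq]  # running total of proportions

for row in zip(values, freq, cum_freq, rel_freq, cum_rel_freq):
    print(row)
```

The last cumulative frequency equals the number of observations, and the last cumulative relative frequency equals 1.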
Arithmetic mean – population mean
Can be applied to interval and ratio data
Is affected by all the values in the dataset (potential problem: extreme values/outliers in the dataset)
For a population:
𝜇 = ∑𝑥𝑖 / N
Where:
𝜇 - population mean
𝑥𝑖 - value of the variable
𝑁 - number of observations
Arithmetic mean – sample mean
For a sample :
x̄ = (x1 + x2 + x3 + ⋯ + xn)/n
Or
x̄ = ∑ 𝑥𝑖 / n
Where
x̄ - sample mean
𝑥𝑖 - value of the variable
n - number of observations
Arithmetic mean – grouped data
Grouped data: values have been organised into a frequency distribution
For a population: 𝜇 = ∑ fi𝑥mid_i / N
For a sample: x̄ = ∑ fi𝑥mid_i / n
Where
fi : frequency in each class i
xmid_i : the mid-point value of each class i
N, n: number of observations (population, sample)
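A short sketch of the grouped-data mean, using made-up class intervals and frequencies. Each class is represented by its mid-point, so the result is an approximation of the mean of the underlying raw data.

```python
# Hypothetical frequency distribution: (lower, upper) limits and frequencies.
classes = [(0, 10), (10, 20), (20, 30)]
freqs = [4, 10, 6]

midpoints = [(lower + upper) / 2 for lower, upper in classes]  # 5, 15, 25
n = sum(freqs)                                                  # 20 observations

# Mean = sum of (frequency x mid-point) divided by number of observations.
grouped_mean = sum(f * x for f, x in zip(freqs, midpoints)) / n
print(grouped_mean)  # 16.0
```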
Median
The value that divides the ordered sample into two parts, with equal numbers of observations in each part
Can be applied to ordinal, interval and ratio data
Not affected by extreme values
Median – cont’d
The median’s position in an ordered row:
For a sample with an odd number of observations, the median is the ((n + 1)/2)th observation:
12, 13, 13, 16, 17, 19, 20
(7 + 1)/2 = 4th observation, so the median is 16
For a sample with an even number of observations, the median is the average of the (n/2)th and (n/2 + 1)th ordered observations:
12, 13, 13, 16, 17, 19, 20, 28
The median lies between 16 and 17: 16.5
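The two cases can be sketched in a few lines, using the two example samples from this card. The function name is only illustrative.

```python
def median(values):
    """Median of raw (ungrouped) data using the position rules on this card."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the (n + 1)/2 th observation (index mid when counting from 0).
        return ordered[mid]
    # Even n: average of the (n/2)th and (n/2 + 1)th observations.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([12, 13, 13, 16, 17, 19, 20]))      # 16
print(median([12, 13, 13, 16, 17, 19, 20, 28]))  # 16.5
```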
Median – grouped data
2 steps involved:
1) Identify the class interval that contains the median observation, i.e. the ((N + 1)/2)th observation
2) Calculate the value of the median using the following formula:
median = xl + (xu − xl) × (((N + 1)/2 − F) / f)
Where:
xl - lower limit of class interval
xu - upper limit of class interval
N - no. of observations
F - cumulative frequency of the class intervals up to, but not including, the one containing the median
f - the frequency for the class interval containing the median
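The two steps can be sketched as below. The frequency distribution is made up, and F is taken to be the cumulative frequency of the classes before the median class, which is the reading under which the interpolation works.

```python
# Hypothetical frequency distribution: (xl, xu) limits and frequencies.
classes = [(0, 10), (10, 20), (20, 30)]
freqs = [4, 10, 6]

N = sum(freqs)
position = (N + 1) / 2  # position of the median observation: 10.5

# Step 1: find the class containing the median observation.
F = 0  # cumulative frequency of the classes before the median class
for (xl, xu), f in zip(classes, freqs):
    if F + f >= position:
        break
    F += f

# Step 2: interpolate within the median class.
median = xl + (xu - xl) * (position - F) / f
print(xl, xu, F, f, median)  # 10 20 4 10 16.5
```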
Mode
The most frequently occurring value in the dataset
Can be applied to all levels of data (Nominal, Ordinal, Interval and Ratio)
Bi-modal: two values have the highest number of occurrences in the dataset
Multi-modal: three or more values have the highest number of occurrences in the dataset
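As an illustration, Python's standard library `statistics.multimode` (Python 3.8+) returns every value that shares the highest count, which covers the unimodal, bi-modal and nominal cases. The datasets are made up.

```python
from statistics import multimode

unimodal = multimode([12, 13, 13, 16, 17, 19, 20])  # one mode
bimodal = multimode([1, 1, 2, 2, 3])                # two modes
nominal = multimode(["red", "blue", "red"])         # works for labels too

print(unimodal, bimodal, nominal)
```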
Additional measures – quartiles
Quartiles divide a distribution into four equal parts
Procedure:
1) Identify the class interval which contains the quartile
2) Calculate the quartiles using the following formulae:
Q1 = xl + (xu − xl) × (((N + 1)/4 − F) / f)
Q3 = xl + (xu − xl) × ((3(N + 1)/4 − F) / f)
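A sketch of the quartile procedure for grouped data. The helper name `grouped_position_value` and the frequency distribution are assumptions for illustration; F is taken as the cumulative frequency of the classes before the one containing the quartile.

```python
def grouped_position_value(classes, freqs, position):
    """Interpolate the value at a given ordered position in grouped data."""
    F = 0  # cumulative frequency before the current class
    for (xl, xu), f in zip(classes, freqs):
        if F + f >= position:
            return xl + (xu - xl) * (position - F) / f
        F += f
    raise ValueError("position exceeds the number of observations")

# Hypothetical frequency distribution.
classes = [(0, 10), (10, 20), (20, 30)]
freqs = [4, 10, 6]
N = sum(freqs)

q1 = grouped_position_value(classes, freqs, (N + 1) / 4)      # position 5.25
q3 = grouped_position_value(classes, freqs, 3 * (N + 1) / 4)  # position 15.75
print(q1, q3)
```

Here Q1 falls in the 10-20 class and Q3 in the 20-30 class, so each formula uses the limits and frequency of a different class.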
Choice of measures of central location
• Depends on the purpose/question being asked
• Arithmetic mean is used the most widely
• If observations are symmetrically distributed (and unimodal), mean, median and mode are identical
• If data is not symmetrically distributed, the distribution is “skewed”
Range:
• The difference between the largest and smallest observation
• Simplest, easy to calculate
• Ignores all data points except the extremes
Interquartile Range:
• The difference between the 3rd and 1st quartiles (Q3 − Q1)
• Defines the range of the middle 50% of observations
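A small sketch of both measures for raw data. It assumes the standard library's `statistics.quantiles` with `method="exclusive"`, which places quartiles at positions based on (n + 1); other software may use a different quartile convention and give slightly different values.

```python
from statistics import quantiles

data = [12, 13, 13, 16, 17, 19, 20]  # example sample

data_range = max(data) - min(data)

# n=4 splits the data into quartiles; the result is [Q1, Q2, Q3].
q1, q2, q3 = quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

print(data_range, q1, q3, iqr)
```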
Mean Absolute Deviation (MAD)
The average of the absolute deviations from the mean
Focus on the dispersion around the central location of the data
For a population: MAD = ∑ | x - 𝜇 | / N
For a sample: MAD = ∑ | x - x̄ | / n
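A minimal sketch of the mean absolute deviation, using a small made-up dataset:

```python
data = [2, 4, 6, 8]  # hypothetical observations

n = len(data)
mean = sum(data) / n  # 5.0

# Average of the absolute deviations from the mean.
mad = sum(abs(x - mean) for x in data) / n
print(mad)  # 2.0
```

Taking absolute values stops positive and negative deviations from cancelling out; without them the deviations from the mean always sum to zero.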
Note: In Business/Investment Evaluation....
Mean considered a measure of return
Standard Deviation considered a measure of risk, as it measures the spread of an investment's performance around its mean.
The greater the standard deviation, the greater the investment's volatility.
Variance for a population
A measure that makes use of all of the information available
The population variance is given as:
σ² = ∑ (x − μ)² / N
or
for grouped data:
σ² = ∑ f (xmid − μ)² / N
Where:
σ² - variance of the population
x - each value in the dataset
xmid - mid-point value of each class interval for grouped data
𝜇 - population mean
N - number of observations
f - frequency in each class interval for grouped data
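Both versions of the population variance can be sketched directly from the definitions. All data values, mid-points and frequencies below are made up for illustration.

```python
# Raw data (treated as a whole population).
data = [2, 4, 4, 4, 5, 5, 7, 9]
N = len(data)
mu = sum(data) / N                                 # 5.0
variance = sum((x - mu) ** 2 for x in data) / N    # 32 / 8

# Grouped data: class mid-points and their frequencies.
midpoints = [5, 15, 25]
freqs = [4, 10, 6]
N_grouped = sum(freqs)                                           # 20
mu_grouped = sum(f * x for f, x in zip(freqs, midpoints)) / N_grouped
variance_grouped = sum(
    f * (x - mu_grouped) ** 2 for f, x in zip(freqs, midpoints)
) / N_grouped

print(variance, variance_grouped)  # 4.0 49.0
```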
Standard deviation for a population
The square root of variance.
Population standard deviation is given as:
σ = √( ∑ (x − μ)² / N )
for grouped data:
σ = √( ∑ f (xmid − μ)² / N )
Where:
𝜎 - standard deviation of the population
x - each value in the dataset
xmid - mid-point value of each class interval for grouped data
𝜇 - population mean
N - number of observations
f - frequency in each class interval for grouped data
Standard deviation for a sample
Sample standard deviation is given as:
s = √( ∑ (x − x̄)² / (n − 1) )
or
for grouped data:
s = √( ∑ f (xmid − x̄)² / (n − 1) )
Where:
s - standard deviation of the sample
x - each value in the dataset;
xmid - mid-point value of each class interval for grouped data
x̄ - sample mean
n - number of observations
f - frequency in each class interval for grouped data
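To see the effect of the two denominators, the sketch below compares the standard library's `statistics.pstdev` (divides by N) with `statistics.stdev` (divides by n − 1) on the same made-up values.

```python
import math
from statistics import pstdev, stdev

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

sigma = pstdev(data)  # population formula: divide by N
s = stdev(data)       # sample formula: divide by n - 1

# The sample formula written out by hand, for comparison.
n = len(data)
mean = sum(data) / n
manual_s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))

print(sigma, s, manual_s)
```

For the same data the sample standard deviation is always a little larger than the population one, and the gap shrinks as n grows.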
Variance and standard deviation – simplified formulae
Alternatively, the following simplified formulae can be used (recommended):
Variance of a population:
σ² = (∑ x² − Nμ²) / N
or for grouped data:
σ² = (∑ f xmid² − Nμ²) / N
Variance of a sample:
s² = (∑ x² − n x̄²) / (n − 1)
or for grouped data:
s² = (∑ f xmid² − n x̄²) / (n − 1)
Standard deviation is the square root of variance
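A quick numerical check that the shortcut form gives the same result as the definitional form. It assumes the sample version divides by n − 1, in line with the sample standard deviation formula; the data are made up.

```python
from statistics import pvariance, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations
n = len(data)
mean = sum(data) / n
sum_sq = sum(x * x for x in data)  # sum of squared values: 232

# Shortcut: (sum of squares - n * mean^2) over the appropriate denominator.
pop_var_shortcut = (sum_sq - n * mean ** 2) / n
sample_var_shortcut = (sum_sq - n * mean ** 2) / (n - 1)

# Library values computed from the definitional formulae.
print(pop_var_shortcut, pvariance(data))
print(sample_var_shortcut, variance(data))
```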
Coefficient of variation
A measure of relative dispersion (independent of units of measurements):
For a population:
coefficient of variation = σ/μ
For a sample:
coefficient of variation = s/x̄
Where:
𝜎 - standard deviation of the population
𝜇 - population mean
s - standard deviation of the sample
x̄ - sample mean
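The sketch below illustrates the "independent of units" point: the same made-up heights expressed in centimetres and in metres have different standard deviations but the same coefficient of variation.

```python
from statistics import mean, stdev

heights_cm = [160, 170, 180]  # hypothetical sample, in centimetres
heights_m = [1.6, 1.7, 1.8]   # the same sample, in metres

cv_cm = stdev(heights_cm) / mean(heights_cm)
cv_m = stdev(heights_m) / mean(heights_m)

print(stdev(heights_cm), stdev(heights_m))  # differ by a factor of 100
print(cv_cm, cv_m)                          # agree up to rounding
```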
Probability: a numerical measure of the likelihood that an event will occur
P(E) = Number of favourable outcomes / Total number of possible outcomes
Probability values are assigned on a scale from 0 to 1
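As an illustration of the counting formula, the sketch below works out the probability of rolling an even number on a fair six-sided die. The die example is an assumption, not taken from the cards, and the formula presumes all outcomes are equally likely.

```python
from fractions import Fraction

outcomes = [1, 2, 3, 4, 5, 6]                     # all possible outcomes
favourable = [o for o in outcomes if o % 2 == 0]  # event E: an even number

# P(E) = favourable outcomes / total possible outcomes
p_even = Fraction(len(favourable), len(outcomes))
print(p_even)  # 1/2
```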
The frequentist view
The probability of an outcome is the proportion of trials in which it occurs as the number of trials approaches infinity
Again, using the previous example: what is the probability of “heads” occurring on the toss of a coin?
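One way to explore the coin question under the frequentist view is a small simulation. This is only a sketch: it assumes a fair coin (probability of heads 0.5), and the function name and seed are illustrative.

```python
import random

def proportion_of_heads(trials, seed=0):
    """Simulate coin tosses and return the observed proportion of heads."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(trials))
    return heads / trials

for trials in (10, 100, 10_000, 100_000):
    print(trials, proportion_of_heads(trials))
```

With few tosses the proportion can be far from 0.5; as the number of tosses grows it tends to settle close to 0.5, which is the value the frequentist view treats as the probability of heads.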