Statistics

  • Statistics: The science of gathering, presenting, analysing and interpreting data
  • Some basic terms:
    – Population – the entire set of elements of interest
    – Census – data gathered from the entire population
    – Sample – a subset of the population
  • Descriptive statistics – summarizes the characteristics of a dataset
  • Inferential statistics – using sample data to reach conclusions about the population from which the sample was drawn.
  • Parameter
    Descriptive measure of the population
    – Represented by Greek letters
    e.g.
    μ - population mean
    σ² - population variance
    σ - population standard deviation
  • Statistic
    Descriptive measure of a sample
    – Represented by roman letters
    e.g.
    x̄ - sample mean
    s² - sample variance
    s - sample standard deviation
  • Elements – the entities on which data are collected
  • Observations – the set of measurements obtained for a particular element
  • Variable – a characteristic of interest for the element
  • Levels of data - Nominal
    • Labels or names used to identify an attribute of the element.
    • Data cannot be ordered
  • Levels of data - Ordinal
    • Numbers/categories are used to indicate a rank or order.
    • Relative magnitudes are meaningful
  • Levels of data - Interval
    • Data shows the properties of ordinal data
    • Interval between the values is expressed in terms of a fixed unit of measure
    • Always numeric
  • Levels of data - Ratio
    • Has properties of interval data
    • The location of origin is zero – nothing exists for the variable at zero point
    • Variables such as height, time and weight
  • Different types of dataset – Time series
    Values of a variable at different points in time
  • Different types of dataset – Panel data
    Observations on individuals over time
  • Bar chart: the value of each category in categorical data is represented by a bar
  • Pie chart: the area of each section represents the relative frequency
  • Graphical method - histogram
    • Shows data summarised in frequency density (frequency/class width)
    • It provides information about the shape of a distribution
    • Similar to a bar chart, but it corrects for differences in class widths. If all class widths are identical, a bar chart and histogram have the same shape
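    A minimal sketch of the frequency-density calculation behind a histogram. The class limits and frequencies are hypothetical, chosen only to show the effect of unequal class widths:

```python
# Hypothetical class limits (lower, upper) and class frequencies.
classes = [(0, 10), (10, 20), (20, 40), (40, 80)]
frequencies = [5, 12, 16, 8]

# Bar height in a histogram = frequency density = frequency / class width.
densities = [f / (upper - lower) for (lower, upper), f in zip(classes, frequencies)]

for (lower, upper), f, d in zip(classes, frequencies, densities):
    print(f"{lower}-{upper}: frequency = {f}, frequency density = {d}")
```

    In this sketch the 20-40 class has the highest frequency (16) but not the tallest bar, because it is twice as wide as the 10-20 class.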
  • Ogive
    Cumulative distribution plot - distribution curve in which the frequencies are cumulative
  • Group data into intervals:
    • Frequency: Number of times a particular value occurs
    • Cumulative frequency: the sum of successive frequencies
    • Relative frequency: Proportion of observations in each class
    • Cumulative relative frequency: The sum of successive relative frequencies
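    The four quantities above can be sketched as follows. The observations are made up for illustration, and each distinct value is treated as its own class:

```python
from collections import Counter
from itertools import accumulate

# Hypothetical raw observations (illustrative values only).
data = [2, 3, 3, 1, 2, 3, 4, 2, 3, 1]

counts = Counter(data)
values = sorted(counts)

freq = [counts[v] for v in values]                # frequency
cum_freq = list(accumulate(freq))                 # cumulative frequency
rel_freq = [f / len(data) for f in freq]          # relative frequency
cum_rel_freq = [c / len(data) for c in cum_freq]  # cumulative relative frequency

for row in zip(values, freq, cum_freq, rel_freq, cum_rel_freq):
    print(row)
```

    The last cumulative frequency equals the number of observations, and the last cumulative relative frequency equals 1.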
  • Arithmetic mean – population mean
    Can be applied to interval and ratio data
    Is affected by all the values in the dataset (potential problem: extreme values/outliers in the dataset)
    For a population:
    𝜇 = ∑𝑥𝑖 / N
    Where:
    𝜇 - population mean
    𝑥𝑖 - value of the variable
    𝑁 - number of observations
  • Arithmetic mean – sample mean
    For a sample :
    x̄ = (𝑥1 + 𝑥2 + 𝑥3 + ⋯ + 𝑥𝑛)/𝑛
    Or
    x̄ = ∑ 𝑥𝑖 / n
    Where
    x̄ - sample mean
    𝑥𝑖 - value of the variable
    n - number of observations
  • Arithmetic mean – grouped data
    Grouped data: values have been organised into a frequency distribution
    For a population: 𝜇 = ∑ fi𝑥mid_i / N
    For a sample: x̄ = ∑ fi𝑥mid_i / n
    Where
    fi : frequency in each class i
    xmid_i : the mid-point value of each class i
    N or n : number of observations (population or sample)
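    A short sketch of the sample mean and the grouped-data mean. The raw values, mid-points and frequencies are hypothetical:

```python
# Hypothetical raw sample.
sample = [12, 13, 13, 16, 17, 19, 20]
sample_mean = sum(sample) / len(sample)  # sum of x_i divided by n

# Hypothetical grouped data: class mid-points and class frequencies.
midpoints = [5, 15, 25, 35]
frequencies = [4, 10, 5, 1]

n = sum(frequencies)  # number of observations = sum of the class frequencies
grouped_mean = sum(f * x for f, x in zip(frequencies, midpoints)) / n

print(sample_mean)
print(grouped_mean)  # 16.5
```

    The grouped mean is an approximation, since every observation in a class is treated as if it sat at the class mid-point.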
  • Median
    • The value that divides the ordered sample into two parts, with equal numbers of observations in each part
    • Can be applied to ordinal, interval and ratio data
    • Not affected by extreme values
  • Median – cont’d
    The median’s position in the ordered data:
    Sample with an odd number of observations: the ((𝑛 + 1)/2)th observation
    12, 13, 13, 16, 17, 19, 20 → (7 + 1)/2 = 4th observation, so the median is 16
    Sample with an even number of observations: the average of the (𝑛/2)th and (𝑛/2 + 1)th ordered observations
    12, 13, 13, 16, 17, 19, 20, 28 → average of the 4th and 5th observations
    Median is between 16 and 17: 16.5
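    The two cases can be sketched in a few lines. The function name is illustrative; the standard library's statistics.median gives the same results:

```python
def median(values):
    """Return the median of a list of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the (n + 1)/2-th observation (1-based) sits at index n // 2.
        return ordered[mid]
    # Even n: average of the (n/2)-th and (n/2 + 1)-th observations.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([12, 13, 13, 16, 17, 19, 20]))      # 16
print(median([12, 13, 13, 16, 17, 19, 20, 28]))  # 16.5
```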
  • Median – grouped data
    2 steps involved:
    1) Identify the class interval that contains the median observation, i.e. the ((N + 1)/2)th observation
    2) Calculate the value of the median using the following formula:
    median = xl + (xu - xl) × [((N + 1)/2 - F) / f]
    Where:
    xl - lower limit of the class interval containing the median
    xu - upper limit of the class interval containing the median
    N - no. of observations
    F - cumulative frequency of the class intervals up to, but not including, the one containing the median
    f - the frequency for the class interval containing the median
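    A sketch of the two-step procedure, assuming F is the cumulative frequency of the classes before the median class. The function name, class limits and frequencies are hypothetical:

```python
def grouped_median(limits, frequencies):
    """limits: list of (lower, upper) class limits; frequencies: class counts."""
    n = sum(frequencies)
    position = (n + 1) / 2  # position of the median observation
    cumulative = 0          # F: cumulative frequency before the current class
    for (lower, upper), f in zip(limits, frequencies):
        # Step 1: the first class whose cumulative frequency reaches the position.
        if cumulative + f >= position:
            # Step 2: interpolate linearly within that class.
            return lower + (upper - lower) * (position - cumulative) / f
        cumulative += f
    raise ValueError("frequencies must contain at least one observation")

limits = [(0, 10), (10, 20), (20, 30), (30, 40)]
frequencies = [4, 10, 5, 1]
print(grouped_median(limits, frequencies))  # 16.5
```

    Here N = 20, so the median position is 10.5, which falls in the 10-20 class (F = 4, f = 10): 10 + 10 × (10.5 - 4) / 10 = 16.5.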
  • Mode
    The most frequently occurring value in the dataset
    Can be applied to all levels of data (Nominal, Ordinal, Interval and Ratio)
    Bi-modal: two values have the highest number of occurrences in the dataset
    Multi-modal: three or more values have the highest number of occurrences in the dataset
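    A brief illustration using the standard library's statistics.multimode (Python 3.8+), which returns every value that shares the highest count. The data are made up:

```python
from statistics import multimode

unimodal = multimode([1, 2, 2, 3])          # one mode
bimodal = multimode([1, 1, 2, 3, 3])        # two values tie for the highest count
nominal = multimode(["red", "blue", "red"])  # the mode also works for nominal data

print(unimodal)  # [2]
print(bimodal)   # [1, 3]
print(nominal)   # ['red']
```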
  • Additional measures – quartiles
    Quartiles divide a distribution into four equal parts
    Procedure:
    1. Identify the class interval which contains the quartile
    2. Calculate the quartiles using the following formulae:
    Q1 = xl + (xu - xl) × [((N + 1)/4 - F) / f]
    Q3 = xl + (xu - xl) × [(3(N + 1)/4 - F) / f]
    (xl, xu, F and f refer to the class interval containing the quartile)
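    A sketch of the quartile procedure for grouped data, assuming F is the cumulative frequency of the classes before the one containing the quartile. The function name and the data are hypothetical:

```python
def grouped_quantile(limits, frequencies, fraction):
    """Interpolated quantile for grouped data; fraction is 0.25 for Q1, 0.75 for Q3."""
    n = sum(frequencies)
    position = fraction * (n + 1)  # (N + 1)/4 for Q1, 3(N + 1)/4 for Q3
    cumulative = 0                 # F: cumulative frequency before the current class
    for (lower, upper), f in zip(limits, frequencies):
        if cumulative + f >= position:
            return lower + (upper - lower) * (position - cumulative) / f
        cumulative += f
    raise ValueError("position lies beyond the last class")

limits = [(0, 10), (10, 20), (20, 30), (30, 40)]
frequencies = [4, 10, 5, 1]

q1 = grouped_quantile(limits, frequencies, 0.25)
q3 = grouped_quantile(limits, frequencies, 0.75)
print(q1, q3)  # 11.25 23.5
```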
  • Choice of measures of central location
    • Depends on the purpose/question being asked
    • Arithmetic mean is the most widely used
    • If observations are symmetrically distributed (and unimodal), mean, median and mode are identical
    • If data are not symmetrically distributed, the distribution is “skewed”
  • Range:
    • The difference between the largest and smallest observation
    • Simple and easy to calculate
    • Ignores all data points except the extremes
  • Interquartile Range:
    • The difference between the 3rd and 1st quartiles (Q3 - Q1)
    • Defines the range of the middle 50% of observations
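    A small sketch of the range and interquartile range for ungrouped data. It relies on statistics.quantiles (Python 3.8+), whose default "exclusive" method is assumed here as the quartile convention; other software may place quartiles slightly differently:

```python
from statistics import quantiles

data = [12, 13, 13, 16, 17, 19, 20]

data_range = max(data) - min(data)  # largest minus smallest observation

# n=4 splits the data into quartiles and returns the three cut points.
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1                       # spread of the middle 50% of observations

print(data_range)  # 8
print(q1, q3)      # 13.0 19.0
print(iqr)         # 6.0
```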
  • Mean Absolute Deviation (MAD)
    The average of the absolute deviations from the mean
    Focus on the dispersion around the central location of the data
    For a population: MAD = ∑ | x - 𝜇 | / N
    For a sample: MAD = ∑ | x - x̄ | / n
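    A minimal sketch of the MAD formula, using made-up values:

```python
# Hypothetical observations.
data = [2, 4, 6, 8]

mean = sum(data) / len(data)

# Average of the absolute deviations from the mean.
mad = sum(abs(x - mean) for x in data) / len(data)

print(mean)  # 5.0
print(mad)   # 2.0
```

    The absolute deviations are 3, 1, 1 and 3, so MAD = 8 / 4 = 2.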
  • Note: In business/investment evaluation:
    The mean is considered a measure of return
    Standard deviation is considered a measure of risk, as it measures the variability of an investment's performance.
    The greater the standard deviation, the greater the investment's volatility.
  • Variance for a population
    A measure that makes use of all of the information available
    The population variance is given as:
    𝜎^2 = ∑ (x - 𝜇)^2 / N
    or
    for grouped data:
    𝜎^2 = ∑ f (xmid - 𝜇)^2 / N
    Where:
    𝜎^2 - variance of the population
    x - each value in the dataset
    xmid - mid-point value of each class interval for grouped data
    𝜇 - population mean
    N - number of observations
    f - frequency in each class interval for grouped data
  • Standard deviation for a population
    The square root of variance.
    Population standard deviation is given as:
    𝜎 = √[ ∑ (x - 𝜇)^2 / N ]
    for grouped data:
    𝜎 = √[ ∑ f (xmid - 𝜇)^2 / N ]
    Where:
    𝜎 - standard deviation of the population
    x - each value in the dataset
    xmid - mid-point value of each class interval for grouped data
    𝜇 - population mean
    N - number of observations
    f - frequency in each class interval for grouped data
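    A sketch of the population variance and standard deviation, for raw and for grouped data. All values are hypothetical and are treated as a complete population:

```python
import math

# Raw data.
data = [2, 4, 4, 4, 5, 5, 7, 9]
N = len(data)
mu = sum(data) / N
variance = sum((x - mu) ** 2 for x in data) / N
sigma = math.sqrt(variance)
print(mu, variance, sigma)  # 5.0 4.0 2.0

# Grouped data: class mid-points and class frequencies.
midpoints = [5, 15, 25, 35]
frequencies = [4, 10, 5, 1]
N_g = sum(frequencies)
mu_g = sum(f * x for f, x in zip(frequencies, midpoints)) / N_g
variance_g = sum(f * (x - mu_g) ** 2 for f, x in zip(frequencies, midpoints)) / N_g
sigma_g = math.sqrt(variance_g)
print(mu_g, variance_g)  # 16.5 62.75
```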
  • Standard deviation for a sample
    Sample standard deviation is given as:
    s = √[ ∑ (x - x̄)^2 / (n - 1) ]
    or
    for grouped data:
    s = √[ ∑ f (xmid - x̄)^2 / (n - 1) ]
    Where:
    s - standard deviation of the sample
    x - each value in the dataset
    xmid - mid-point value of each class interval for grouped data
    x̄ - sample mean
    n - number of observations
    f - frequency in each class interval for grouped data
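    A sketch of the sample standard deviation, checked against the standard library's statistics.stdev. The sample values are hypothetical:

```python
import math
import statistics

# Hypothetical sample.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(sample)
x_bar = sum(sample) / n

# The sample variance divides the squared deviations by n - 1, not n.
s_squared = sum((x - x_bar) ** 2 for x in sample) / (n - 1)
s = math.sqrt(s_squared)

print(s)
print(statistics.stdev(sample))  # standard-library sample standard deviation
```

    For the same values, dividing by n - 1 gives a slightly larger result than the population formula, which divides by N.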
  • Variance and standard deviation –simplified formulae
    Alternatively, the following simplified formulae can be used (recommended):
    Variance of a population:
    𝜎^2 = (∑ x^2 - N𝜇^2) / N
    or for grouped data:
    𝜎^2 = (∑ f xmid^2 - N𝜇^2) / N
    Variance of a sample:
    s^2 = (∑ x^2 - n x̄^2) / (n - 1)
    or for grouped data:
    s^2 = (∑ f xmid^2 - n x̄^2) / (n - 1)
    Standard deviation is the square root of variance
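    A quick numerical check that the shortcut form agrees with the definitional form. This sketch assumes the sample version divides by n - 1, to match the sample standard deviation formula; the data are hypothetical:

```python
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
mean = sum(data) / n
sum_of_squares = sum(x * x for x in data)  # sum of x^2

# Shortcut forms: (sum of x^2 - n * mean^2) divided by N or by n - 1.
pop_var_short = (sum_of_squares - n * mean ** 2) / n
sample_var_short = (sum_of_squares - n * mean ** 2) / (n - 1)

# Definitional forms, for comparison.
pop_var_def = sum((x - mean) ** 2 for x in data) / n
sample_var_def = sum((x - mean) ** 2 for x in data) / (n - 1)

print(math.isclose(pop_var_short, pop_var_def))        # True
print(math.isclose(sample_var_short, sample_var_def))  # True
```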
  • Coefficient of variation
    A measure of relative dispersion (independent of units of measurements):
    For a population:
    coefficient of variation = 𝜎/𝜇
    For a sample:
    coefficient of variation = s/x̄
    Where:
    𝜎 - standard deviation of the population
    𝜇 - population mean
    s - standard deviation of the sample
    x̄ - sample mean
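    A small sketch showing that the coefficient of variation does not depend on the unit of measurement. The heights are hypothetical, expressed first in centimetres and then in metres:

```python
import statistics

sample_cm = [150, 160, 170, 180, 190]    # hypothetical heights in cm
sample_m = [x / 100 for x in sample_cm]  # the same heights in metres

# Sample coefficient of variation: s divided by the sample mean.
cv_cm = statistics.stdev(sample_cm) / statistics.mean(sample_cm)
cv_m = statistics.stdev(sample_m) / statistics.mean(sample_m)

print(cv_cm)
print(cv_m)
```

    Both values agree up to floating-point rounding, although the standard deviations themselves differ by a factor of 100.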
  • Probability
    • A numerical measure of the likelihood that an event will occur
    P(E) = Number of favourable outcomes / Total number of possible outcomes (when all outcomes are equally likely)
    • Probability values are assigned on a scale from 0 to 1
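    A minimal sketch of the counting formula for P(E), assuming equally likely outcomes. The experiment (one roll of a fair six-sided die) is a hypothetical example:

```python
from fractions import Fraction

# Hypothetical experiment: one roll of a fair six-sided die.
outcomes = [1, 2, 3, 4, 5, 6]
favourable = [o for o in outcomes if o % 2 == 0]  # event E: an even number

# P(E) = number of favourable outcomes / total number of possible outcomes.
p_even = Fraction(len(favourable), len(outcomes))

print(p_even)         # 1/2
print(float(p_even))  # 0.5
```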
  • The frequentist view
    The proportion of trials in which an outcome occurs as the number of trials approaches infinity
    Again, using the previous example: what is the probability of “heads” occurring on the toss of a coin?
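    The frequentist idea can be illustrated with a simulation. This is only a sketch: it assumes a fair coin modelled by Python's pseudo-random generator, and the function name and seed are arbitrary choices:

```python
import random

def proportion_of_heads(trials, seed=0):
    """Simulate coin tosses and return the observed proportion of heads."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(trials))
    return heads / trials

for trials in (10, 1_000, 100_000):
    print(trials, proportion_of_heads(trials))
```

    As the number of tosses grows, the observed proportion tends to settle near 0.5, although any finite run will deviate from it slightly.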