Statistics

    • Statistics: The science of gathering, presenting, analysing and interpreting data
    • Some basic terms:
      – Population – the whole group of elements of interest
      – Census – data gathered from the entire population
      – Sample – a subset of the population
    • Descriptive statistics – summarizes the characteristics of a dataset
    • Inferential statistics – uses sample data to reach conclusions about the population from which the sample was drawn
    • Parameter
      Descriptive measure of the population
      – Represented by Greek letters
      e.g.
      μ - population mean
      σ² - population variance
      σ - population standard deviation
    • Statistic
      Descriptive measure of a sample
      – Represented by Roman letters
      e.g.
      x̄ - sample mean
      s² - sample variance
      s - sample standard deviation
    • Elements – the entities on which data are collected
    • Observations – the set of measurements obtained for a particular element
    • Variable – a characteristic of interest for the element
    • Levels of data - Nominal
      • Labels or names used to identify an attribute of the element.
      • Data cannot be ordered
    • Levels of data - Ordinal
      • Numbers/categories are used to indicate a rank or order.
      • Relative magnitudes are meaningful
    • Levels of data - Interval
      • Data shows the properties of ordinal data
      • Interval between the values is expressed in terms of a fixed unit of measure
      • Always numeric
    • Levels of data - Ratio
      • Has properties of interval data
      • The location of origin is zero – nothing exists for the variable at zero point
      • Examples: variables such as height, time and weight
    • Different types of dataset – Time series
      Values of a variable at different points in time
    • Different types of dataset – Panel data
      Observations on the same individuals over time
    • Bar chart: the value of each category in categorical data is represented by a bar
    • Pie chart: the area of each section represents the relative frequency
    • Graphical method - histogram
      • Shows data summarised in frequency density (frequency/class width)
      • It provides information about the shape of a distribution
      • Similar to a bar chart, but it corrects for differences in class widths. If all class widths are identical, a bar chart and histogram have the same shape
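A minimal sketch of the frequency density calculation behind a histogram, assuming each class is given as a (lower, upper) pair. The function name and the class limits and frequencies are made up for illustration.

```python
def frequency_density(class_limits, frequencies):
    """Return frequency / class width for each class."""
    return [f / (upper - lower) for (lower, upper), f in zip(class_limits, frequencies)]

# Hypothetical classes: the last class is twice as wide as the others.
limits = [(0, 10), (10, 20), (20, 40)]
freqs = [5, 10, 10]
print(frequency_density(limits, freqs))  # [0.5, 1.0, 0.5]
```

The second and third classes have the same frequency, but the wider class gets half the bar height, which is the correction a plain bar chart would not make.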
    • Ogive
      Cumulative distribution plot - distribution curve in which the frequencies are cumulative
    • Group data into intervals:
      • Frequency: Number of times a particular value occurs
      • Cumulative frequency: the sum of successive frequencies
      • Relative frequency: Proportion of observations in each class
      • Cumulative relative frequency: The sum of successive relative frequencies
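The four quantities above can be tabulated together. This is an illustrative sketch only: the helper name, class labels and frequencies are hypothetical.

```python
from itertools import accumulate

def frequency_table(class_labels, frequencies):
    """Return rows of (label, f, cumulative f, relative f, cumulative relative f)."""
    total = sum(frequencies)
    cumulative = list(accumulate(frequencies))  # running sum of frequencies
    return [
        (label, f, cum, f / total, cum / total)
        for label, f, cum in zip(class_labels, frequencies, cumulative)
    ]

# Hypothetical grouped data with 20 observations in total.
labels = ["0-10", "10-20", "20-30", "30-40"]
freqs = [4, 10, 4, 2]
rows = frequency_table(labels, freqs)
for row in rows:
    print(row)
```

The last cumulative frequency equals the number of observations, and the last cumulative relative frequency equals 1.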
    • Arithmetic mean – population mean
      Can be applied to interval and ratio data
      Is affected by all the values in the dataset (potential problem: extreme values/outliers in the dataset)
      For a population:
      𝜇 = ∑𝑥𝑖 / N
      Where:
      𝜇 - population mean
      𝑥𝑖 - value of the variable
      𝑁 - number of observations
    • Arithmetic mean – sample mean
      For a sample:
      x̄ = (𝑥1 + 𝑥2 + 𝑥3 + ⋯ + 𝑥𝑛)/𝑛
      Or
      x̄ = ∑ 𝑥𝑖 / n
      Where
      x̄ - sample mean
      𝑥𝑖 - value of the variable
      n - number of observations
    • Arithmetic mean – grouped data
      Grouped data: values have been organised into a frequency distribution
      For a population: 𝜇 = ∑ fi𝑥mid_i / N
      For a sample: x̄ = ∑ fi𝑥mid_i / n
      Where
      fi : frequency in each class i
      xmid_i : the mid-point value of each class i
      N (population) or n (sample): number of observations
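A short sketch of the grouped-data mean, assuming classes are given as (lower, upper) pairs so that the mid-point is their average. The function name and the data are hypothetical.

```python
def grouped_mean(class_limits, frequencies):
    """Mean of grouped data: sum(f_i * xmid_i) / n."""
    midpoints = [(lower + upper) / 2 for lower, upper in class_limits]
    n = sum(frequencies)  # number of observations
    return sum(f * x for f, x in zip(frequencies, midpoints)) / n

# Hypothetical frequency distribution with mid-points 5, 15 and 25.
limits = [(0, 10), (10, 20), (20, 30)]
freqs = [2, 5, 3]
print(grouped_mean(limits, freqs))  # (2*5 + 5*15 + 3*25) / 10 = 16.0
```

Because every observation is treated as if it sat at its class mid-point, the result is an approximation of the mean of the raw data.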
    • Median
      • The value that divides the ordered sample into two parts, with equal numbers of observations in each part
      • Can be applied to ordinal, interval and ratio data
      • Not affected by extreme values
    • Median – cont’d
      The median’s position in an ordered row:
      Sample with an odd number of observations: the ((n + 1)/2)th observation
      12, 13, 13, 16, 17, 19, 20: (7 + 1)/2 = 4th observation, so the median is 16
      Sample with an even number of observations: the average of the (n/2)th and (n/2 + 1)th ordered observations
      12, 13, 13, 16, 17, 19, 20, 28
      The median lies between the 4th and 5th observations (16 and 17): 16.5
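The position rule can be sketched in a few lines, reusing the two samples from this card. The function name is only illustrative; Python's own `statistics.median` does the same job.

```python
def median(values):
    """Median by position in the ordered sample."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the ((n + 1)/2)th observation (subtract 1 for 0-based indexing).
        return ordered[(n + 1) // 2 - 1]
    # Even n: average of the (n/2)th and (n/2 + 1)th observations.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(median([12, 13, 13, 16, 17, 19, 20]))      # 16
print(median([12, 13, 13, 16, 17, 19, 20, 28]))  # 16.5
```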
    • Median – grouped data
      2 steps involved:
      1) Identify the class interval that contains the median observation, i.e. the ((N + 1)/2)th observation
      2) Calculate the value of the median using the following formula:
      median = xl + (xu - xl) × [((N + 1)/2 - F) / f]
      Where:
      xl - lower limit of the class interval containing the median
      xu - upper limit of the class interval containing the median
      N - no. of observations
      F - cumulative frequency of the class intervals up to, but not including, the one containing the median
      f - the frequency for the class interval containing the median
    • Mode
      The most frequently occurring value in the dataset
      Can be applied to all levels of data (Nominal, Ordinal, Interval and Ratio)
      Bi-modal: two values have the highest number of occurrences in the dataset
      Multi-modal: three or more values have the highest number of occurrences in the dataset
    • Additional measures – quartiles
      Quartiles divide a distribution into four equal parts
      Procedure:
      1. Calculate the class interval which contains the quartile
      2. Calculate the quartiles using the following formulae:
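      Q1 = xl + (xu - xl) × [((N + 1)/4 - F) / f]
      Q3 = xl + (xu - xl) × [(3(N + 1)/4 - F) / f]
      where xl, xu and f refer to the class interval containing the quartile, and F is the cumulative frequency of the class intervals before it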
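The same interpolation used for the grouped median works for quartiles once the position is changed to (N + 1)/4 or 3(N + 1)/4. The sketch below assumes F is the cumulative frequency before the quartile class. The helper name and the data are hypothetical.

```python
def grouped_quantile(class_limits, frequencies, position):
    """Interpolate the value at a given position: xl + (xu - xl) * (position - F) / f."""
    cumulative_before = 0  # F: cumulative frequency before the current class
    for (lower, upper), f in zip(class_limits, frequencies):
        if f > 0 and cumulative_before + f >= position:
            return lower + (upper - lower) * (position - cumulative_before) / f
        cumulative_before += f
    raise ValueError("position lies beyond the last observation")

# Hypothetical distribution with N = 19.
limits = [(0, 10), (10, 20), (20, 30), (30, 40)]
freqs = [3, 8, 6, 2]
n = sum(freqs)
q1 = grouped_quantile(limits, freqs, (n + 1) / 4)      # position 5, class 10-20
q3 = grouped_quantile(limits, freqs, 3 * (n + 1) / 4)  # position 15, class 20-30
print(q1, q3)
```

Here Q1 = 10 + 10 × (5 - 3)/8 = 12.5 and Q3 = 20 + 10 × (15 - 11)/6, roughly 26.67.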
    • Choice of measures of central location
      • Depends on the purpose/question being asked
      • The arithmetic mean is the most widely used
      • If observations are symmetrically distributed (and unimodal), mean, median and mode are identical
      • If the data are not symmetrically distributed, the distribution is “skewed”
    • Range:
      • The difference between the largest and smallest observation
      • Simplest, easy to calculate
      • Ignores all data points except the extremes
    • Interquartile Range:
      • The difference between the 3rd and 1st quartiles (Q3 - Q1)
      • Defines the range of the middle 50% of observations
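A small sketch of both measures for ungrouped data. It leans on `statistics.quantiles` from the Python standard library with `method="exclusive"`, which, as far as I understand it, places the quartiles at the (n + 1)/4 and 3(n + 1)/4 positions. Other software may use a different quartile convention and give slightly different values. The function names are illustrative.

```python
import statistics

def data_range(values):
    """Largest observation minus smallest observation."""
    return max(values) - min(values)

def interquartile_range(values):
    """Q3 - Q1, the spread of the middle 50% of observations."""
    q1, _q2, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return q3 - q1

data = [12, 13, 13, 16, 17, 19, 20]
print(data_range(data))           # 20 - 12 = 8
print(interquartile_range(data))  # 19 - 13 = 6.0
```

Adding one extreme value such as 200 to the data would change the range drastically but barely move the interquartile range.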
    • Mean Absolute Deviation (MAD)
      The average of the absolute deviations from the mean
      Focus on the dispersion around the central location of the data
      For a population: MAD = ∑ | x - 𝜇 | / N
      For a sample: MAD = ∑ | x - x̄ | / n
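A minimal sketch of the MAD formula. The population and sample versions on this card share the same form, so a single function is enough. The function name and data are made up.

```python
def mean_absolute_deviation(values):
    """Average absolute deviation of the values from their mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n

# Mean is 5; absolute deviations are 3, 1, 1, 3.
print(mean_absolute_deviation([2, 4, 6, 8]))  # 2.0
```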
    • Note: In Business/Investment Evaluation...
      The mean is considered a measure of return
      The standard deviation is considered a measure of risk, as it measures how widely an investment's performance varies around its mean.
      The greater the standard deviation, the greater the investment's volatility.
    • Variance for a population
      A measure that makes use of all of the information available
      The population variance is given as:
      σ² = ∑ (x - μ)^2 / N
      or
      for grouped data:
      σ² = ∑ f (xmid - μ)^2 / N
      Where:
      σ² - variance of the population
      x - each value in the dataset
      xmid - mid-point value of each class interval for grouped data
      𝜇 - population mean
      N - number of observations
      f - frequency in each class interval for grouped data
    • Standard deviation for a population
      The square root of variance.
      Population standard deviation is given as:
      σ = square root of [ ∑ (x - μ)^2 / N ]
      for grouped data:
      σ = square root of [ ∑ f (xmid - μ)^2 / N ]
      Where:
      𝜎 - standard deviation of the population
      x - each value in the dataset
      xmid - mid-point value of each class interval for grouped data
      𝜇 - population mean
      N - number of observations
      f - frequency in each class interval for grouped data
    • Standard deviation for a sample
      Sample standard deviation is given as:
      s = square root of [ ∑ (x - x̄)^2 / (n - 1) ]
      or
      for grouped data:
      s = square root of [ ∑ f (xmid - x̄)^2 / (n - 1) ]
      Where:
      s - standard deviation of the sample
      x - each value in the dataset
      xmid - mid-point value of each class interval for grouped data
      x̄ - sample mean
      n - number of observations
      f - frequency in each class interval for grouped data
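A sketch of the grouped-data sample standard deviation with the n - 1 divisor, assuming classes are given as (lower, upper) pairs. The function name and the frequency distribution are hypothetical.

```python
import math

def grouped_sample_std(class_limits, frequencies):
    """s = sqrt( sum f * (xmid - mean)^2 / (n - 1) ) for grouped data."""
    midpoints = [(lower + upper) / 2 for lower, upper in class_limits]
    n = sum(frequencies)
    mean = sum(f * x for f, x in zip(frequencies, midpoints)) / n
    squared_deviations = sum(f * (x - mean) ** 2 for f, x in zip(frequencies, midpoints))
    return math.sqrt(squared_deviations / (n - 1))

# Hypothetical distribution: mid-points 5, 15, 25 and a grouped mean of 16.
limits = [(0, 10), (10, 20), (20, 30)]
freqs = [2, 5, 3]
s = grouped_sample_std(limits, freqs)
print(s)  # sqrt((2*121 + 5*1 + 3*81) / 9) = sqrt(490 / 9)
```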
    • Variance and standard deviation – simplified formulae
      Alternatively, the following simplified formulae can be used (recommended):
      Variance of a population:
      σ² = (∑ x^2 - Nμ^2) / N
      or for grouped data:
      σ² = (∑ f xmid^2 - Nμ^2) / N
      Variance of a sample:
      s² = (∑ x^2 - n x̄^2) / (n - 1)
      or for grouped data:
      s² = (∑ f xmid^2 - n x̄^2) / (n - 1)
      Standard deviation is the square root of variance
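A quick numerical check that the definitional and simplified forms agree. The sample version is assumed to divide by n - 1, matching the sample standard deviation card. All function names and the data are illustrative.

```python
import math

def population_variance(values):
    """Definitional form: sum (x - mu)^2 / N."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n

def population_variance_shortcut(values):
    """Simplified form: (sum x^2 - N * mu^2) / N."""
    n = len(values)
    mu = sum(values) / n
    return (sum(x * x for x in values) - n * mu ** 2) / n

def sample_variance(values):
    """Definitional form with the n - 1 divisor."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_variance_shortcut(values):
    """Simplified form: (sum x^2 - n * mean^2) / (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return (sum(x * x for x in values) - n * mean ** 2) / (n - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, squared deviations sum to 32
print(population_variance(data), population_variance_shortcut(data))  # 4.0 4.0
print(math.sqrt(population_variance(data)))                           # 2.0
print(sample_variance(data), sample_variance_shortcut(data))          # both 32/7
```

The simplified form needs only one pass over the data, but with very large values it can lose precision in floating-point arithmetic because it subtracts two large numbers.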
    • Coefficient of variation
      A measure of relative dispersion (independent of units of measurements):
      For a population:
      coefficient of variation = 𝜎/𝜇
      For a sample:
      coefficient of variation = s/x̄
      Where:
      𝜎 - standard deviation of the population
      𝜇 - population mean
      s - standard deviation of the sample
      x̄ - sample mean
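A sketch showing why the coefficient of variation is independent of units: the same hypothetical heights measured in centimetres and in metres give the same value. It uses `statistics.stdev` (sample standard deviation, n - 1 divisor) and `statistics.mean` from the standard library. The function name and data are made up.

```python
import statistics

def coefficient_of_variation(values):
    """Sample coefficient of variation: s / mean."""
    return statistics.stdev(values) / statistics.mean(values)

# Hypothetical heights, first in centimetres and then converted to metres.
heights_cm = [150, 160, 170, 180, 190]
heights_m = [h / 100 for h in heights_cm]

cv_cm = coefficient_of_variation(heights_cm)
cv_m = coefficient_of_variation(heights_m)
print(cv_cm, cv_m)  # the two values agree up to rounding
```

The standard deviations differ by a factor of 100 between the two lists, but so do the means, so the ratio is unchanged.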
    • Probability
      • A numerical measure of the likelihood that an event will occur
      • When all outcomes are equally likely: P(E) = number of favourable outcomes / total number of possible outcomes
      • Probability values are assigned on a scale from 0 to 1
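A small sketch of the counting formula for P(E), assuming every outcome in the sample space is equally likely. The function name and the coin and die examples are illustrative additions, not taken from the cards.

```python
from fractions import Fraction

def classical_probability(sample_space, is_favourable):
    """P(E) = favourable outcomes / total outcomes, for equally likely outcomes."""
    outcomes = list(sample_space)
    favourable = [o for o in outcomes if is_favourable(o)]
    return Fraction(len(favourable), len(outcomes))

coin = ["heads", "tails"]   # assumed fair coin
die = range(1, 7)           # assumed fair six-sided die

p_heads = classical_probability(coin, lambda o: o == "heads")
p_above_four = classical_probability(die, lambda o: o > 4)
print(p_heads, p_above_four)  # 1/2 1/3
```

Both results lie between 0 and 1, as the scale on this card requires.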
    • The frequentist view
      The probability of an outcome is the proportion of trials in which it occurs as the number of trials approaches infinity.
      Again, using the previous example: what is the probability of “heads” occurring on the toss of a coin?
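The frequentist idea can be illustrated by simulation. This sketch assumes a fair coin modelled with Python's pseudo-random generator. The function name, the seed and the numbers of tosses are arbitrary choices for illustration.

```python
import random

def relative_frequency_of_heads(num_tosses, seed=0):
    """Simulate coin tosses and return the proportion that came up heads."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = sum(1 for _ in range(num_tosses) if rng.random() < 0.5)
    return heads / num_tosses

# The proportion of heads should settle near 0.5 as the number of tosses grows.
for n in (10, 100, 10_000, 100_000):
    print(n, relative_frequency_of_heads(n))
```

With few tosses the proportion can be far from 0.5. With many tosses it tends to settle close to 0.5, which is the sense in which the frequentist view defines the probability of heads.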