Lec2

Cards (30)

  • Note that the course pack provided to you in any form is intended only for your use in connection with the course in which you are enrolled. It is not for distribution or sale. Permission should be obtained from your instructor for any use other than that for which it is intended.
  • At the end of this unit, the student will be able to:
    • Perform correlation analysis
    • Build linear models using simple or multiple linear regression analysis
    • Perform diagnostic checking on the adopted linear model
    • Perform remedial measures if necessary
    • Interpret results of linear regression analysis
  • Correlation analysis
    Determine if two measurements X and Y taken from the same sample or population are associated/related/dependent on each other
  • Pearson Product-Moment Correlation Coefficient
    A measure of the strength of the linear relationship existing between two variables, X and Y, that is independent of their respective scales of measurement
  • Assumptions for Pearson Product-Moment Correlation Coefficient
    • Both variables are measured at the interval or ratio level
    • There should be no significant outliers
    • The variables should be approximately normally distributed
  • Correlation coefficient (ρ)
    • Takes on values between -1 and 1, inclusive
    • A positive ρ means the line slopes upward to the right, a negative ρ means it slopes downward to the right
    • When ρ is 1 or -1, there is a perfect linear relationship between X and Y
    • A ρ close to 1 or -1 indicates a strong linear relationship, but does not necessarily imply X causes Y or Y causes X
    • If ρ = 0 then there is no linear correlation between X and Y, but there may still be a non-linear association
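  • The last point can be illustrated with a minimal sketch (hypothetical data): y = x² on a symmetric range is a perfect non-linear relationship, yet its Pearson correlation is 0.

    ```python
    import numpy as np

    # A deterministic but non-linear relationship (y = x^2 on a symmetric
    # range): Pearson correlation is 0 even though y depends entirely on x.
    x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
    y = x ** 2
    r = np.corrcoef(x, y)[0, 1]
    print(r)  # essentially 0 (up to floating-point error)
    ```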
  • Pearson product moment coefficient of correlation (r)
    • Used to estimate ρ based on a random sample
    • -1 < r < 1
    • Verbal description of strength of correlation: ±0.00-0.25 no/weak, ±0.26-0.50 moderately weak, ±0.51-0.75 moderately strong, ±0.76-1.00 strong to perfect
  • Scatterplots with approximate values of r
    • r ≈ 0, r ≈ ±0.5, r ≈ ±1
  • Computing Pearson correlation coefficient (r)
    1. Calculate Σxi, Σyi, Σxiyi, Σxi^2, Σyi^2
    2. Plug into formula: r = [n(Σxiyi) - (Σxi)(Σyi)] / √[(n(Σxi^2) - (Σxi)^2)(n(Σyi^2) - (Σyi)^2)]
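  • The two steps above can be sketched directly from the raw-sum formula (the sample x and y below are made up for illustration):

    ```python
    import math

    def pearson_r(x, y):
        """Pearson r from the raw-sum formula on the card."""
        n = len(x)
        sx, sy = sum(x), sum(y)
        sxx = sum(v * v for v in x)
        syy = sum(v * v for v in y)
        sxy = sum(a * b for a, b in zip(x, y))
        num = n * sxy - sx * sy
        den = math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
        return num / den

    # Hypothetical, nearly linear sample: r should be close to 1.
    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    y = [2.1, 3.9, 6.2, 8.0, 9.9]
    print(round(pearson_r(x, y), 4))
    ```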
  • Testing hypothesis about correlation coefficient
    Ho: ρ = ρ0
    Ha: ρ < ρ0, ρ > ρ0, or ρ ≠ ρ0
    Test statistic: t = (r - ρ0)√(n-2) / √(1-r^2); for the common special case ρ0 = 0 this is t = r√(n-2) / √(1-r^2)
    Critical region: t < -t_α(n-2), t > t_α(n-2), or |t| > t_{α/2}(n-2), respectively
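  • A minimal sketch of the two-sided test for the common case ρ0 = 0, with an assumed sample correlation r = 0.89 from n = 22 observations:

    ```python
    import math
    from scipy.stats import t as t_dist

    def corr_t_test(r, n, alpha=0.05):
        """Two-sided t test of Ho: rho = 0 with n - 2 degrees of freedom."""
        t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
        t_crit = t_dist.ppf(1 - alpha / 2, df=n - 2)
        return t_stat, t_crit, abs(t_stat) > t_crit

    # Hypothetical values: r = 0.89 observed in a sample of n = 22.
    t_stat, t_crit, reject = corr_t_test(r=0.89, n=22)
    print(t_stat, t_crit, reject)
    ```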
  • Simple linear regression
    Predicting a quantitative variable Y based on a single predictor variable X, assuming an approximately linear relationship
  • General equation of a straight line
    y = β0 + β1x, where β0 is the y-intercept and β1 is the slope
  • Deterministic model
    Linear model y = β0 + β1x where a value of x determines the value of y with no error
  • Probabilistic model
    Linear model y = β0 + β1x + ε where ε is a random error and the observed y varies randomly around the mean E(y|X=x) = β0 + β1x
  • Simple linear regression model
    Y = β0 + β1X + ε, where Y is the response variable, X is the explanatory/predictor variable, ε is the random error, β0 is the y-intercept, and β1 is the slope
  • Multiple regression model
    Y = β0 + β1X1 + β2X2 + ... + βkXk + ε, with multiple explanatory variables X1, X2, ..., Xk
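  • A minimal sketch of fitting such a model by least squares with two predictors; the data here are simulated (hypothetical GPA-like and exam-score-like columns), so the fitted coefficients should land near the true values used to generate Y:

    ```python
    import numpy as np

    # Simulated data: Y = 5 + 3*X1 + 0.1*X2 + noise (all values hypothetical).
    rng = np.random.default_rng(0)
    X1 = rng.uniform(2.0, 4.0, size=50)
    X2 = rng.uniform(60.0, 100.0, size=50)
    Y = 5.0 + 3.0 * X1 + 0.1 * X2 + rng.normal(0.0, 0.5, size=50)

    # Design matrix with a leading column of ones for the intercept b0.
    A = np.column_stack([np.ones_like(X1), X1, X2])
    coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
    print(coef)  # estimates of (b0, b1, b2)
    ```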
  • Examples of applications
    • Predicting sales from TV advertising expenditures
    • Predicting college grade from entrance exam score
  • Assumptions of the linear regression model
    • Response variable is measured at interval or ratio level
    • Relationship between response and predictor is linear
    • No significant outliers
    • Observations are independent
  • Fitting a simple linear regression model
    1. Construct a scatterplot of X versus Y
    2. Obtain the equation that best fits the data using least squares method
    3. Compute for b0 and b1
    4. Evaluate the model
    5. Obtain the residuals and check assumptions
    6. Interpret the model and use for predictions
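  • Steps 2 to 6 can be sketched with `scipy.stats.linregress`, here run on the first few (X, Y) pairs from the worked example later in this deck:

    ```python
    from scipy.stats import linregress

    # First few (X, Y) pairs from the worked example in this lecture.
    x = [2.6, 3.2, 3.0, 2.2, 2.8, 3.2, 2.9]
    y = [15.7, 18.6, 19.5, 15.0, 18.0, 20.0, 19.0]

    res = linregress(x, y)
    print(res.slope, res.intercept)   # b1 and b0
    print(res.rvalue ** 2)            # coefficient of determination R^2

    # Residuals: observed y minus predicted y; for a least-squares fit
    # with an intercept, they sum to zero.
    residuals = [yi - (res.intercept + res.slope * xi) for xi, yi in zip(x, y)]
    print(residuals)
    ```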
  • Residual
    The difference between the observed response and the predicted value of Y given the value of the predictor
  • Assumptions of the simple linear regression model
    • The response variable is measured at the interval or ratio level
    • The relationship between the response and predictor variable is linear
    • There are no significant outliers in the data
    • The observations are independent
    • The data show homoscedasticity
    • The residuals (errors) of the regression line are approximately normally distributed
  • Coefficient of determination (R^2)
    The proportion of the variability in the observed values of Y that can be explained by X
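  • A minimal sketch of R^2 computed as 1 - SSE/SST: a perfect fit gives 1, and a model that only predicts the mean gives 0 (the tiny datasets are made up for illustration):

    ```python
    def r_squared(y, y_hat):
        """R^2 = 1 - SSE/SST: share of Y's variability explained by the model."""
        y_bar = sum(y) / len(y)
        sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
        sst = sum((yi - y_bar) ** 2 for yi in y)
        return 1 - sse / sst

    print(r_squared([1, 2, 3], [1, 2, 3]))  # perfect fit -> 1.0
    print(r_squared([1, 2, 3], [2, 2, 2]))  # mean-only predictions -> 0.0
    ```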
  • Estimating b0 and b1
    1. b1 = [n(ΣXiYi) - (ΣXi)(ΣYi)] / [n(ΣXi^2) - (ΣXi)^2]
    2. b0 = y_bar - b1*x_bar
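  • The two estimation formulas above can be sketched directly; the data here are exactly linear (y = 2 + 3x, made up for illustration), so the fit should recover b0 = 2 and b1 = 3:

    ```python
    def fit_simple_ols(x, y):
        """b1 and b0 from the raw-sum least-squares formulas on the card."""
        n = len(x)
        sx, sy = sum(x), sum(y)
        sxy = sum(a * b for a, b in zip(x, y))
        sxx = sum(a * a for a in x)
        b1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
        b0 = sy / n - b1 * (sx / n)  # b0 = y_bar - b1 * x_bar
        return b0, b1

    # Exact line y = 2 + 3x, so the estimates should match exactly.
    x = [0.0, 1.0, 2.0, 3.0]
    y = [2.0, 5.0, 8.0, 11.0]
    print(fit_simple_ols(x, y))
    ```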
  • Diagnostic checks for assumptions
    • Linearity
    • Presence of outliers
    • Normality
    • Independence
    • Homogeneity/Homoscedasticity
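  • Two of these checks can be sketched numerically on simulated data (a Shapiro-Wilk test for residual normality and a hand-rolled Durbin-Watson statistic for independence; linearity and homoscedasticity are usually judged from residual-versus-fitted plots):

    ```python
    import numpy as np
    from scipy.stats import shapiro

    # Simulated data from a true linear model (all values hypothetical).
    rng = np.random.default_rng(1)
    x = rng.uniform(2.0, 4.0, size=40)
    y = 6.4 + 3.9 * x + rng.normal(0.0, 0.8, size=40)

    # Fit the line and form the residuals.
    b1, b0 = np.polyfit(x, y, deg=1)
    residuals = y - (b0 + b1 * x)

    # Normality of residuals: Shapiro-Wilk (Ho: residuals are normal).
    stat, p_normal = shapiro(residuals)

    # Independence: Durbin-Watson statistic; values near 2 suggest
    # no first-order autocorrelation in the residuals.
    dw = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

    print(p_normal, dw)
    ```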
  • It is important to verify the model assumptions first, before interpreting the results of the linear regression analysis
  • When extending simple linear regression to multiple independent variables, multicollinearity (correlation among the predictors) should be checked
  • Example: Investigating the relationship between GPI and starting salary
    1. Construct a scatterplot
    2. Obtain the equation using least squares method
    3. Compute b0 and b1
    4. Interpret the model
  • Data (worked example, rows i = 9 to 30; Xi = GPI, Yi = starting salary, Ŷi = fitted value, ei = Yi − Ŷi)
    i  | Xi  | Yi   | Xi^2   | Yi^2    | XiYi   | Ŷi     | ei     | ei^2
    9  | 2.6 | 15.7 |  6.760 | 246.490 | 40.820 | 16.632 | -0.932 | 0.868
    10 | 3.2 | 18.6 | 10.240 | 345.960 | 59.520 | 18.989 | -0.389 | 0.151
    11 | 3.0 | 19.5 |  9.000 | 380.250 | 58.500 | 18.203 |  1.297 | 1.682
    12 | 2.2 | 15.0 |  4.840 | 225.000 | 33.000 | 15.061 | -0.061 | 0.004
    13 | 2.8 | 18.0 |  7.840 | 324.000 | 50.400 | 17.417 |  0.583 | 0.339
    14 | 3.2 | 20.0 | 10.240 | 400.000 | 64.000 | 18.989 |  1.011 | 1.023
    15 | 2.9 | 19.0 |  8.410 | 361.000 | 55.100 | 17.810 |  1.190 | 1.416
    16 | 3.0 | 17.4 |  9.000 | 302.760 | 52.200 | 18.203 | -0.803 | 0.645
    17 | 2.6 | 17.3 |  6.760 | 299.290 | 44.980 | 16.632 |  0.668 | 0.446
    18 | 3.3 | 18.1 | 10.890 | 327.610 | 59.730 | 19.381 | -1.281 | 1.642
    19 | 2.9 | 18.0 |  8.410 | 324.000 | 52.200 | 17.810 |  0.190 | 0.036
    20 | 2.4 | 16.2 |  5.760 | 262.440 | 38.880 | 15.846 |  0.354 | 0.125
    21 | 2.8 | 17.5 |  7.840 | 306.250 | 49.000 | 17.417 |  0.083 | 0.007
    22 | 3.7 | 21.3 | 13.690 | 453.690 | 78.810 | 20.953 |  0.347 | 0.121
    23 | 3.1 | 17.2 |  9.610 | 295.840 | 53.320 | 18.596 | -1.396 | 1.948
    24 | 2.8 | 17.0 |  7.840 | 289.000 | 47.600 | 17.417 | -0.417 | 0.174
    25 | 3.5 | 19.6 | 12.250 | 384.160 | 68.600 | 20.167 | -0.567 | 0.321
    26 | 2.7 | 16.6 |  7.290 | 275.560 | 44.820 | 17.025 | -0.425 | 0.180
    27 | 2.6 | 15.0 |  6.760 | 225.000 | 39.000 | 16.632 | -1.632 | 2.663
    28 | 3.2 | 18.4 | 10.240 | 338.560 | 58.880 | 18.989 | -0.589 | 0.346
    29 | 2.9 | 17.3 |  8.410 | 299.290 | 50.170 | 17.810 | -0.510 | 0.260
    30 | 3.0 | 18.5 |  9.000 | 342.250 | 55.500 | 18.203 |  0.297 | 0.088
    (Two leading values in the original, -0.168 and 0.028, are the ei and ei^2 entries of a row truncated before i = 9.)
  • STAT106: ADVANCED STATISTICAL ANALYSIS