Lec2

Cards (30)

  • Note that the course pack provided to you in any form is intended only for your use in connection with the course in which you are enrolled. It is not for distribution or sale. Permission should be obtained from your instructor for any use other than that for which it is intended.
  • At the end of this unit, the student will be able to:
    • Perform correlation analysis
    • Build linear models using simple or multiple linear regression analysis
    • Perform diagnostic checking on the adopted linear model
    • Perform remedial measures if necessary
    • Interpret results of linear regression analysis
  • Correlation analysis
    Determine if two measurements X and Y taken from the same sample or population are associated/related/dependent on each other
  • Pearson Product-Moment Correlation Coefficient
    A measure of the strength of the linear relationship existing between two variables, X and Y, that is independent of their respective scales of measurement
  • Assumptions for Pearson Product-Moment Correlation Coefficient
    • Both variables are measured at the interval or ratio level
    • There should be no significant outliers
    • The variables should be approximately normally distributed
  • Correlation coefficient (ρ)
    • Takes on values between -1 and 1, inclusive
    • A positive ρ means the line slopes upward to the right, a negative ρ means it slopes downward to the right
    • When ρ is 1 or -1, there is a perfect linear relationship between X and Y
    • A ρ close to 1 or -1 indicates a strong linear relationship, but does not necessarily imply X causes Y or Y causes X
    • If ρ = 0 then there is no linear correlation between X and Y, but there may still be a non-linear association
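  • The last point can be illustrated with a minimal sketch (hypothetical data): y = x² on a symmetric range is a perfect non-linear relationship, yet its Pearson correlation is 0.

    ```python
    import numpy as np

    # A deterministic but non-linear relationship (y = x^2 on a symmetric
    # range): Pearson correlation is 0 even though y depends entirely on x.
    x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
    y = x ** 2
    r = np.corrcoef(x, y)[0, 1]
    print(r)  # essentially 0 (up to floating-point error)
    ```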
  • Pearson product moment coefficient of correlation (r)
    • Used to estimate ρ based on a random sample
    • -1 < r < 1
    • Verbal description of strength of correlation: ±0.00-0.25 no/weak, ±0.26-0.50 moderately weak, ±0.51-0.75 moderately strong, ±0.76-1.00 strong to perfect
  • Scatterplots with approximate values of r
    • r ≈ 0, r ≈ ±0.5, r ≈ ±1
  • Computing Pearson correlation coefficient (r)
    1. Calculate Σxi, Σyi, Σxiyi, Σxi^2, Σyi^2
    2. Plug into formula: r = [n(Σxiyi) - (Σxi)(Σyi)] / √[(n(Σxi^2) - (Σxi)^2)(n(Σyi^2) - (Σyi)^2)]
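  • The two steps above can be sketched directly from the raw-sum formula (the sample x and y below are made up for illustration):

    ```python
    import math

    def pearson_r(x, y):
        """Pearson r from the raw-sum formula on the card."""
        n = len(x)
        sx, sy = sum(x), sum(y)
        sxx = sum(v * v for v in x)
        syy = sum(v * v for v in y)
        sxy = sum(a * b for a, b in zip(x, y))
        num = n * sxy - sx * sy
        den = math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
        return num / den

    # Hypothetical, nearly linear sample: r should be close to 1.
    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    y = [2.1, 3.9, 6.2, 8.0, 9.9]
    print(round(pearson_r(x, y), 4))
    ```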
  • Testing hypothesis about correlation coefficient
    Ho: ρ = ρ0
    Ha: ρ < ρ0, ρ > ρ0, or ρ ≠ ρ0
    Test statistic: t = (r - ρ0)√(n-2) / √(1-r^2); for the common special case ρ0 = 0 this is t = r√(n-2) / √(1-r^2)
    Critical region: t < -t_α(n-2), t > t_α(n-2), or |t| > t_{α/2}(n-2), respectively
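  • A minimal sketch of the two-sided test for the common case ρ0 = 0, with an assumed sample correlation r = 0.89 from n = 22 observations:

    ```python
    import math
    from scipy.stats import t as t_dist

    def corr_t_test(r, n, alpha=0.05):
        """Two-sided t test of Ho: rho = 0 with n - 2 degrees of freedom."""
        t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
        t_crit = t_dist.ppf(1 - alpha / 2, df=n - 2)
        return t_stat, t_crit, abs(t_stat) > t_crit

    # Hypothetical values: r = 0.89 observed in a sample of n = 22.
    t_stat, t_crit, reject = corr_t_test(r=0.89, n=22)
    print(t_stat, t_crit, reject)
    ```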
  • Simple linear regression
    Predicting a quantitative variable Y based on a single predictor variable X, assuming an approximately linear relationship
  • General equation of a straight line
    y = β0 + β1x, where β0 is the y-intercept and β1 is the slope
  • Deterministic model
    Linear model y = β0 + β1x where a value of x determines the value of y with no error
  • Probabilistic model
    Linear model y = β0 + β1x + ε where ε is a random error and the observed y varies randomly around the mean E(y|X=x) = β0 + β1x
  • Simple linear regression model
    Y = β0 + β1X + ε, where Y is the response variable, X is the explanatory/predictor variable, ε is the random error, β0 is the y-intercept, and β1 is the slope
  • Multiple regression model
    Y = β0 + β1X1 + β2X2 + ... + βkXk + ε, with multiple explanatory variables X1, X2, ..., Xk
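  • A minimal sketch of fitting such a model by least squares with two predictors; the data here are simulated (hypothetical GPA-like and exam-score-like columns), so the fitted coefficients should land near the true values used to generate Y:

    ```python
    import numpy as np

    # Simulated data: Y = 5 + 3*X1 + 0.1*X2 + noise (all values hypothetical).
    rng = np.random.default_rng(0)
    X1 = rng.uniform(2.0, 4.0, size=50)
    X2 = rng.uniform(60.0, 100.0, size=50)
    Y = 5.0 + 3.0 * X1 + 0.1 * X2 + rng.normal(0.0, 0.5, size=50)

    # Design matrix with a leading column of ones for the intercept b0.
    A = np.column_stack([np.ones_like(X1), X1, X2])
    coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
    print(coef)  # estimates of (b0, b1, b2)
    ```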
  • Examples of applications
    • Predicting sales from TV advertising expenditures
    • Predicting college grade from entrance exam score
  • Assumptions of the linear regression model
    • Response variable is measured at interval or ratio level
    • Relationship between response and predictor is linear
    • No significant outliers
    • Observations are independent
  • Fitting a simple linear regression model
    1. Construct a scatterplot of X versus Y
    2. Obtain the equation that best fits the data using least squares method
    3. Compute for b0 and b1
    4. Evaluate the model
    5. Obtain the residuals and check assumptions
    6. Interpret the model and use for predictions
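  • Steps 2 to 6 can be sketched with `scipy.stats.linregress`, here run on the first few (X, Y) pairs from the worked example later in this deck:

    ```python
    from scipy.stats import linregress

    # First few (X, Y) pairs from the worked example in this lecture.
    x = [2.6, 3.2, 3.0, 2.2, 2.8, 3.2, 2.9]
    y = [15.7, 18.6, 19.5, 15.0, 18.0, 20.0, 19.0]

    res = linregress(x, y)
    print(res.slope, res.intercept)   # b1 and b0
    print(res.rvalue ** 2)            # coefficient of determination R^2

    # Residuals: observed y minus predicted y; for a least-squares fit
    # with an intercept, they sum to zero.
    residuals = [yi - (res.intercept + res.slope * xi) for xi, yi in zip(x, y)]
    print(residuals)
    ```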
  • Residual
    The difference between the observed response and the predicted value of Y given the value of the predictor
  • Assumptions of the simple linear regression model
    • The response variable is measured at the interval or ratio level
    • The relationship between the response and predictor variable is linear
    • There are no significant outliers in the data
    • The observations are independent
    • The data show homoscedasticity
    • The residuals (errors) of the regression line are approximately normally distributed
  • Coefficient of determination (R^2)
    The proportion of the variability in the observed values of Y that can be explained by X
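  • A minimal sketch of R^2 computed as 1 - SSE/SST: a perfect fit gives 1, and a model that only predicts the mean gives 0 (the tiny datasets are made up for illustration):

    ```python
    def r_squared(y, y_hat):
        """R^2 = 1 - SSE/SST: share of Y's variability explained by the model."""
        y_bar = sum(y) / len(y)
        sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
        sst = sum((yi - y_bar) ** 2 for yi in y)
        return 1 - sse / sst

    print(r_squared([1, 2, 3], [1, 2, 3]))  # perfect fit -> 1.0
    print(r_squared([1, 2, 3], [2, 2, 2]))  # mean-only predictions -> 0.0
    ```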
  • Estimating b0 and b1
    1. b1 = [n(ΣXiYi) - (ΣXi)(ΣYi)] / [n(ΣXi^2) - (ΣXi)^2]
    2. b0 = y_bar - b1*x_bar
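  • The two estimation formulas above can be sketched directly; the data here are exactly linear (y = 2 + 3x, made up for illustration), so the fit should recover b0 = 2 and b1 = 3:

    ```python
    def fit_simple_ols(x, y):
        """b1 and b0 from the raw-sum least-squares formulas on the card."""
        n = len(x)
        sx, sy = sum(x), sum(y)
        sxy = sum(a * b for a, b in zip(x, y))
        sxx = sum(a * a for a in x)
        b1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
        b0 = sy / n - b1 * (sx / n)  # b0 = y_bar - b1 * x_bar
        return b0, b1

    # Exact line y = 2 + 3x, so the estimates should match exactly.
    x = [0.0, 1.0, 2.0, 3.0]
    y = [2.0, 5.0, 8.0, 11.0]
    print(fit_simple_ols(x, y))
    ```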
  • Diagnostic checks for assumptions
    • Linearity
    • Presence of outliers
    • Normality
    • Independence
    • Homogeneity/Homoscedasticity
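  • Two of these checks can be sketched numerically on simulated data (a Shapiro-Wilk test for residual normality and a hand-rolled Durbin-Watson statistic for independence; linearity and homoscedasticity are usually judged from residual-versus-fitted plots):

    ```python
    import numpy as np
    from scipy.stats import shapiro

    # Simulated data from a true linear model (all values hypothetical).
    rng = np.random.default_rng(1)
    x = rng.uniform(2.0, 4.0, size=40)
    y = 6.4 + 3.9 * x + rng.normal(0.0, 0.8, size=40)

    # Fit the line and form the residuals.
    b1, b0 = np.polyfit(x, y, deg=1)
    residuals = y - (b0 + b1 * x)

    # Normality of residuals: Shapiro-Wilk (Ho: residuals are normal).
    stat, p_normal = shapiro(residuals)

    # Independence: Durbin-Watson statistic; values near 2 suggest
    # no first-order autocorrelation in the residuals.
    dw = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

    print(p_normal, dw)
    ```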
  • It is important to verify the model assumptions first, before interpreting the results of the linear regression analysis
  • When extending simple linear regression to multiple independent variables, multicollinearity (correlation among the predictors) should be checked
  • Example: Investigating the relationship between GPI and starting salary
    1. Construct a scatterplot
    2. Obtain the equation using least squares method
    3. Compute b0 and b1
    4. Interpret the model
  • Data (worked example, rows i = 9 to 30; Xi = GPI, Yi = starting salary, Ŷi = fitted value, ei = Yi − Ŷi)
    i  | Xi  | Yi   | Xi^2   | Yi^2    | XiYi   | Ŷi     | ei     | ei^2
    9  | 2.6 | 15.7 |  6.760 | 246.490 | 40.820 | 16.632 | -0.932 | 0.868
    10 | 3.2 | 18.6 | 10.240 | 345.960 | 59.520 | 18.989 | -0.389 | 0.151
    11 | 3.0 | 19.5 |  9.000 | 380.250 | 58.500 | 18.203 |  1.297 | 1.682
    12 | 2.2 | 15.0 |  4.840 | 225.000 | 33.000 | 15.061 | -0.061 | 0.004
    13 | 2.8 | 18.0 |  7.840 | 324.000 | 50.400 | 17.417 |  0.583 | 0.339
    14 | 3.2 | 20.0 | 10.240 | 400.000 | 64.000 | 18.989 |  1.011 | 1.023
    15 | 2.9 | 19.0 |  8.410 | 361.000 | 55.100 | 17.810 |  1.190 | 1.416
    16 | 3.0 | 17.4 |  9.000 | 302.760 | 52.200 | 18.203 | -0.803 | 0.645
    17 | 2.6 | 17.3 |  6.760 | 299.290 | 44.980 | 16.632 |  0.668 | 0.446
    18 | 3.3 | 18.1 | 10.890 | 327.610 | 59.730 | 19.381 | -1.281 | 1.642
    19 | 2.9 | 18.0 |  8.410 | 324.000 | 52.200 | 17.810 |  0.190 | 0.036
    20 | 2.4 | 16.2 |  5.760 | 262.440 | 38.880 | 15.846 |  0.354 | 0.125
    21 | 2.8 | 17.5 |  7.840 | 306.250 | 49.000 | 17.417 |  0.083 | 0.007
    22 | 3.7 | 21.3 | 13.690 | 453.690 | 78.810 | 20.953 |  0.347 | 0.121
    23 | 3.1 | 17.2 |  9.610 | 295.840 | 53.320 | 18.596 | -1.396 | 1.948
    24 | 2.8 | 17.0 |  7.840 | 289.000 | 47.600 | 17.417 | -0.417 | 0.174
    25 | 3.5 | 19.6 | 12.250 | 384.160 | 68.600 | 20.167 | -0.567 | 0.321
    26 | 2.7 | 16.6 |  7.290 | 275.560 | 44.820 | 17.025 | -0.425 | 0.180
    27 | 2.6 | 15.0 |  6.760 | 225.000 | 39.000 | 16.632 | -1.632 | 2.663
    28 | 3.2 | 18.4 | 10.240 | 338.560 | 58.880 | 18.989 | -0.589 | 0.346
    29 | 2.9 | 17.3 |  8.410 | 299.290 | 50.170 | 17.810 | -0.510 | 0.260
    30 | 3.0 | 18.5 |  9.000 | 342.250 | 55.500 | 18.203 |  0.297 | 0.088
    (Two leading values in the original, -0.168 and 0.028, are the ei and ei^2 entries of a row truncated before i = 9.)
  • STAT106: ADVANCED STATISTICAL ANALYSIS