A set of statistical techniques for assessing the relationship between one dependent variable (DV) and one or more independent variables (IVs)
Why Use Regression?
Prediction: fitting a predictive model to an observed dataset, then using that model to make predictions about an outcome from a new set of explanatory variables
Explanation: fit a model to explain the relationships between a set of variables
Explanatory variable (x)
Exposure, predictor, independent variable
Dependent variable (y)
Outcome, response
Univariate (aka simple) linear regression
Single explanatory variable
Multivariate (multiple) regression
Multiple explanatory variables
Linear Regression Characteristics
Dependent variable must be continuous
Explanatory variables can be continuous or categorical
Purpose of regression
Quantitative Description and Explanation of Relationships
Estimating (Predicting) Unknown Values of the Dependent Variable
Factors Affecting Regression
Linear Regression Models
A specific type of data modeling, where a straight line is fit "neatly in the middle" of the data to explain the relationship between the variables
Linear Regression Model
1. y = b0 + b1x
2. y is the predicted value (the criterion)
3. b1 is the slope of the line
4. b0 is the y-intercept (the value of y when x = 0)
5. x represents the value of the predictor
Linear Regression Models
Indicate a "trend" in the data based on a regression line
Ideal for making predictions: once we have a line, we can make predictions for the Y value (the criterion) for each X value (the predictor)
Commonly used for prediction and trend description
How to draw the regression line
Least squares (LS) method: The regression line is the "best fit line" that minimizes the sum of the squared deviations between each point and the line
Ordinary Least Squares (OLS)
A foundational method for fitting a regression line
Principle: Minimize the sum of squared differences between observed values and those predicted by the line
OLS Process
1. Calculate residuals (differences between observed and predicted values)
2. Square residuals and sum them up
3. Choose line parameters (slope, intercept) that minimize this sum
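A minimal sketch of this process in NumPy, using made-up example data; the closed-form slope and intercept below are the values that minimize the residual sum of squares.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form OLS estimates: slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# Steps 1-2: residuals and their sum of squares (the quantity OLS minimizes)
residuals = y - (intercept + slope * x)
ss_residuals = np.sum(residuals ** 2)

print(f"intercept={intercept:.3f}, slope={slope:.3f}, SSR={ss_residuals:.3f}")
```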
Simple Linear Regression Formula
y = β0 + β1X + ϵ
Y is the outcome variable
X is the predictor variable
β0 is the intercept (the value of Y when X = 0)
β1 is the slope (the change in Y for a one-unit increase in X)
ϵ is the error term (the difference between the observed and predicted values of Y)
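A hedged sketch of fitting this formula with statsmodels on synthetic data (the data-generating values 1.5 and 2.0 are illustrative assumptions, not from the notes); the fitted parameters estimate β0 and β1, and the residuals estimate ϵ.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=100)
y = 1.5 + 2.0 * X + rng.normal(scale=0.5, size=100)  # true β0 = 1.5, β1 = 2.0, ϵ ~ N(0, 0.5²)

X_design = sm.add_constant(X)        # adds the intercept column (β0)
model = sm.OLS(y, X_design).fit()

print(model.params)                  # [β0 estimate, β1 estimate]
print(model.resid[:5])               # estimated error terms (residuals)
```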
Standardized vs. Unstandardized Regression Coefficients
Interpreting Standardized vs. Unstandardized Regression Coefficients in Simple Regression
Unstandardized Coefficients (b): Direct interpretation in original units
Standardized Coefficients (β): Interpretation in terms of standard deviations
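An illustrative sketch (hypothetical data) of the difference: standardizing X and Y to z-scores turns the slope into a standardized coefficient, which in simple regression equals the Pearson correlation.

```python
import numpy as np

x = np.array([10.0, 12.0, 15.0, 18.0, 20.0, 25.0])
y = np.array([3.0, 3.5, 4.2, 5.1, 5.4, 6.8])

# Unstandardized slope b: change in y (original units) per one-unit increase in x
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# Standardized slope β: change in y (in SDs) per one-SD increase in x
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
beta = np.sum(zx * zy) / np.sum(zx ** 2)

print(f"b = {b:.3f} (original units), beta = {beta:.3f} (SD units)")
```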
R-squared (Coefficient of Determination)
Represents the proportion of variance in the dependent variable (Y) explained by the independent variable (X)
Ranges from 0 to 1
Higher R-squared values indicate a stronger relationship between the variables
Adjusted R-squared
Takes into account the number of predictors in the model
Adjusts for the inclusion of additional predictors that may not improve the model's explanatory power
More useful in multiple regression, but can also be used in simple regression for comparison purposes
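A small sketch (hypothetical y and predictions) of computing both quantities by hand, assuming n observations and p predictors.

```python
import numpy as np

y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
y_hat = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # predictions from a fitted model
n, p = len(y), 1                                # n observations, p predictors

ss_res = np.sum((y - y_hat) ** 2)               # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)            # total sum of squares

r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"R-squared = {r2:.3f}, adjusted R-squared = {adj_r2:.3f}")
```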
Assumptions of Linear Regression
Linearity: There is a linear relationship between the independent variables and the dependent variable
Independence: Observations are independent of each other
Homoscedasticity: The variance of the error terms is constant across all levels of the independent variables
Normality: The error terms are normally distributed
No multicollinearity: In multiple regression, the independent variables are not highly correlated with each other
Violations of linear regression assumptions can lead to inaccurate or biased estimates
It is important to check and address these assumptions when performing linear regression analysis
Linearity
The relationship between the independent variables and the dependent variable is linear
Check with scatterplots or residual plots
Address violations with data transformations or non-linear models
Independence
Observations are independent of each other
Often assumed in random samples or experiments
Check with the Durbin-Watson test for time series data
Address violations with alternative models (e.g., time series models)
Homoscedasticity
The variance of the error terms is constant across all levels of the independent variables
Check with residual plots
Address violations with weighted least squares, data transformations, or robust regression methods
Normality
The error terms are normally distributed
Check with histograms, Q-Q plots, or normality tests (e.g., Shapiro-Wilk test)
Address violations with data transformations or robust regression methods
No multicollinearity
In multiple regression, the independent variables are not highly correlated with each other
Check with correlation coefficients or the Variance Inflation Factor (VIF)
Address violations by removing or combining highly correlated variables, or using dimensionality reduction techniques (e.g., PCA)
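A hedged sketch of running several of these checks in one place, assuming statsmodels and scipy are available; the data are synthetic and the thresholds follow common rules of thumb.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor
from scipy import stats

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = 1.0 + 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(size=200)

X_design = sm.add_constant(X)
model = sm.OLS(y, X_design).fit()

# Normality of residuals: Shapiro-Wilk test
print("Shapiro-Wilk:", stats.shapiro(model.resid))

# Independence (time-ordered data): Durbin-Watson statistic (values near 2 suggest no autocorrelation)
print("Durbin-Watson:", durbin_watson(model.resid))

# Multicollinearity: VIF for each predictor column (excluding the constant)
for i in range(1, X_design.shape[1]):
    print(f"VIF, predictor {i}:", variance_inflation_factor(X_design, i))

# Linearity / homoscedasticity: inspect residuals vs. fitted values
# (e.g., plot model.fittedvalues against model.resid and look for curvature or fanning)
```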
Simple linear regression
Models the relationship between a single predictor and a response variable
Multiple regression
Extends simple linear regression to include multiple predictors, allowing for a more comprehensive understanding of relationships and improving prediction accuracy
Benefits of multiple regression
Assess the impact of multiple factors on a response variable
Control for confounding variables
Build more robust and accurate models
Simple linear regression model
y = β0 + β1X + ϵ
Multiple regression model
y = β0 + β1X1 + β2X2 + · · · + βnXn + ϵ
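An illustrative sketch of fitting a two-predictor version of this model with statsmodels on synthetic data; the coefficients used to generate y are assumptions for the example.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
X1 = rng.normal(size=150)
X2 = rng.normal(size=150)
y = 2.0 + 1.5 * X1 - 0.8 * X2 + rng.normal(size=150)

X_design = sm.add_constant(np.column_stack([X1, X2]))
model = sm.OLS(y, X_design).fit()

print(model.params)       # [β0, β1, β2] estimates
print(model.summary())    # coefficients, standard errors, R-squared, etc.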
Control variables
Variables included in the model to account for potential confounding factors
Implementing control in regression using Frisch-Waugh-Lovell Theorem
1. Regress the IV (X) on the control variables (Z): obtain the residuals
2. Regress the DV (Y) on the control variables (Z): obtain the residuals
3. Regress the residuals of Y on the residuals of X: the slope of this regression is the effect of X on Y, controlling for Z
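A minimal sketch of the Frisch-Waugh-Lovell procedure on synthetic data (the data-generating coefficients are illustrative); the partialled-out slope should match the coefficient on X from the full multiple regression.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
Z = rng.normal(size=(300, 2))                                # control variables
X = Z @ np.array([0.6, -0.4]) + rng.normal(size=300)        # IV, correlated with Z
y = 2.0 + 1.5 * X + Z @ np.array([0.7, 0.2]) + rng.normal(size=300)

# Step 1: regress X on Z, keep the residuals
x_resid = sm.OLS(X, sm.add_constant(Z)).fit().resid
# Step 2: regress Y on Z, keep the residuals
y_resid = sm.OLS(y, sm.add_constant(Z)).fit().resid
# Step 3: regress the Y-residuals on the X-residuals; the slope is the effect of X controlling for Z
fwl_slope = sm.OLS(y_resid, sm.add_constant(x_resid)).fit().params[1]

full_slope = sm.OLS(y, sm.add_constant(np.column_stack([X, Z]))).fit().params[1]
print(f"FWL slope: {fwl_slope:.3f}, full-model slope: {full_slope:.3f}")  # should agree
```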
Regression coefficients
β0 (Intercept): Expected value of y when all predictors are zero
β1, β2, . . . , βn (Coefficients): Expected change in y for a one-unit increase in the corresponding predictor, holding all other predictors constant
Holding predictors constant is an important conceptual framework for understanding the unique contribution of each predictor in multiple regression
Adding control blocks in multiple regression
1. Identify primary predictors of interest and potential control variables
2. Group control variables into meaningful blocks
3. Add control blocks sequentially to the regression model and evaluate changes in primary predictor coefficients
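A hedged sketch of this sequential (hierarchical) approach using the statsmodels formula API; the variable names and data-generating values are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 200
age = rng.normal(size=n)
income = rng.normal(size=n)
predictor = 0.5 * age + rng.normal(size=n)                       # predictor correlated with a control
outcome = 0.8 * predictor + 0.6 * age + 0.3 * income + rng.normal(size=n)
df = pd.DataFrame({"outcome": outcome, "predictor": predictor, "age": age, "income": income})

# Block 0: primary predictor only
m0 = smf.ols("outcome ~ predictor", data=df).fit()
# Block 1: add the demographic control block
m1 = smf.ols("outcome ~ predictor + age + income", data=df).fit()

# Compare the primary predictor's coefficient and R-squared across blocks
print(m0.params["predictor"], m0.rsquared)
print(m1.params["predictor"], m1.rsquared)
```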
Benefits of control blocks in multiple regression
Enhanced understanding of the relationships between predictors and the dependent variable
Identification of potential confounding factors
Systematic approach to adding control variables in the model
Improved model interpretability
Additive models assume the effects of predictors are independent, while interaction models allow the effects of predictors to depend on each other
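An illustrative sketch of the contrast using the statsmodels formula API on synthetic data; the interaction coefficient 0.8 used to generate y is an assumption for the example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["y"] = 1.0 + 0.5 * df.x1 + 0.3 * df.x2 + 0.8 * df.x1 * df.x2 + rng.normal(size=200)

additive    = smf.ols("y ~ x1 + x2", data=df).fit()   # effects of x1 and x2 assumed independent
interaction = smf.ols("y ~ x1 * x2", data=df).fit()   # x1 * x2 expands to x1 + x2 + x1:x2

print(interaction.params["x1:x2"])   # estimated interaction effect
```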