Cards (91)

  • Regression Analysis
    A set of statistical techniques that allow one to assess the relationship between one dependent variable (DV) and several independent variables (IVs)
  • Why Use Regression?
    • Prediction: fitting a predictive model to an observed dataset, then using that model to make predictions about an outcome from a new set of explanatory variables
    • Explanation: fitting a model to explain the relationships between a set of variables
  • Explanatory variable (x)

    Exposure, predictor, independent variable
  • Dependent variable (y)

    Outcome, response
  • Univariate (aka, simple) linear regression

    Single explanatory variable
  • Multiple (multivariable) regression
    Multiple explanatory variables
  • Linear Regression Characteristics

    • Dependent variable must be continuous
    • Explanatory variables can be continuous or categorical
  • Purpose of regression
    • Quantitative Description and Explanation of Relationships
    • Estimating (Predicting) Unknown Values of the Dependent Variable
  • Factors Affecting Regression
  • Linear Regression Models
    A specific type of data modeling, where a straight line is fit "neatly in the middle" of the data to explain the relationship between the variables
  • Linear Regression Model
    1. y = b0 + b1x
    2. y is the predicted value (the criterion)
    3. b1 is the slope of the line
    4. b0 is the y-intercept (the value of y when x = 0)
    5. x represents the value of the predictor
  • Linear Regression Models
    • Indicate a "trend" in the data based on a regression line
    • Ideal for making predictions: once we have a line, we can make predictions for the Y value (the criterion) for each X value (the predictor)
    • Commonly used for prediction and trend description
  • How to draw the regression line
    Least Squares (LS) Method: The regression line is the "best-fit line" that minimizes the sum of the squared deviations between each point and the line
  • Ordinary Least Squares (OLS)

    A foundational method for fitting a regression line<|>Principle: Minimize the sum of squared differences between observed values and those predicted by the line
  • OLS Process
    1. Calculate residuals (differences between observed and predicted values)
    2. Square residuals and sum them up
    3. Choose line parameters (slope, intercept) that minimize this sum
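
A minimal NumPy sketch of this process (the toy data and variable names are assumptions for illustration, not from the deck): it computes the closed-form slope and intercept for a single predictor and reports the sum of squared residuals that those estimates minimize.

```python
import numpy as np

# Toy data (made up for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form least-squares estimates for one predictor:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Steps 1-3 of the card: residuals, square them, sum them
predicted = b0 + b1 * x
residuals = y - predicted
sse = np.sum(residuals ** 2)          # the quantity OLS minimizes

print(f"intercept b0 = {b0:.3f}, slope b1 = {b1:.3f}, SSE = {sse:.3f}")
```
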
  • Simple Linear Regression Formula
    y = β0 + β1X + ϵ<|>y is the outcome variable<|>X is the predictor variable<|>β0 is the intercept (the value of y when X = 0)<|>β1 is the slope (the change in y for a one-unit increase in X)<|>ϵ is the error term (the difference between the observed and predicted values of y)
  • Standardized vs. Unstandardized Regression Coefficients
  • Interpreting Standardized vs. Unstandardized Regression Coefficients in Simple Regression
    • Unstandardized Coefficients (b): Direct interpretation in original units
    • Standardized Coefficients (β): Interpretation in terms of standard deviations
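
A small simulated sketch of the contrast (the data and variable names are assumptions): the slope fit to the raw variables is read in original units, the slope fit to z-scored variables is read in standard deviations, and in simple regression the standardized slope equals Pearson's r.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(50, 10, size=200)              # predictor in its original units
y = 3.0 + 0.5 * x + rng.normal(0, 5, 200)     # outcome in its original units

# Unstandardized slope b: change in y (original units) per one-unit change in x
b = np.polyfit(x, y, 1)[0]

# Standardized slope beta: change in y (in SDs) per one-SD change in x
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()
beta = np.polyfit(zx, zy, 1)[0]

# In simple regression the standardized slope equals the correlation coefficient
print(f"b = {b:.3f}, beta = {beta:.3f}, r = {np.corrcoef(x, y)[0, 1]:.3f}")
```
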
  • R-squared (Coefficient of Determination)

    Represents the proportion of variance in the dependent variable (Y) explained by the independent variable (X)<|>Ranges from 0 to 1<|>Higher R-squared values indicate a stronger relationship between the variables
  • Adjusted R-squared
    Takes into account the number of predictors in the model<|>Adjusts for the inclusion of additional predictors that may not improve the model's explanatory power<|>More useful in multiple regression, but can also be used in simple regression for comparison purposes
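
A short sketch of both quantities (the toy numbers are assumptions for illustration): R-squared compares residual variance to total variance, and the adjusted version applies a penalty based on sample size and number of predictors.

```python
import numpy as np

def r_squared(y, y_hat):
    """Proportion of variance in y explained by the model's predictions y_hat."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    """Penalize R-squared for the number of predictors p, given n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Tiny illustration with assumed numbers
y     = np.array([2.0, 4.0, 6.0, 9.0, 10.0])
y_hat = np.array([2.5, 3.8, 6.2, 8.4, 10.1])   # predictions from some fitted model
r2 = r_squared(y, y_hat)
print(r2)                                      # close to 1: strong fit on this toy data
print(adjusted_r_squared(r2, n=len(y), p=1))   # slightly lower after the penalty
```
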
  • Assumptions of Linear Regression
    • Linearity: There is a linear relationship between the independent variables and the dependent variable
    • Independence: Observations are independent of each other
    • Homoscedasticity: The variance of the error terms is constant across all levels of the independent variables
    • Normality: The error terms are normally distributed
    • No multicollinearity: In multiple regression, the independent variables are not highly correlated with each other
  • Violations of linear regression assumptions can lead to inaccurate or biased estimates
  • It is important to check and address these assumptions when performing linear regression analysis
  • Linearity
    • The relationship between the independent variables and the dependent variable is linear
    • Check with scatterplots or residual plots
    • Address violations with data transformations or non-linear models
  • Independence
    • Observations are independent of each other
    • Often assumed in random samples or experiments
    • Check with the Durbin-Watson test for time series data
    • Address violations with alternative models (e.g., time series models)
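
A minimal sketch of the Durbin-Watson check, assuming statsmodels is available (the residuals here are simulated rather than taken from a real fitted model).

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(1)
resid = rng.normal(size=100)            # residuals from a fitted model (simulated here)

# Values near 2 suggest no first-order autocorrelation;
# values toward 0 or 4 suggest positive or negative autocorrelation
print(durbin_watson(resid))
```
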
  • Homoscedasticity
    • The variance of the error terms is constant across all levels of the independent variables
    • Check with residual plots
    • Address violations with weighted least squares, data transformations, or robust regression methods
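
A small sketch of a residual-versus-fitted plot, assuming NumPy and matplotlib (the heteroscedastic data are simulated for illustration): a fan or funnel shape in the plot suggests the constant-variance assumption is violated.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 200)
y = 1.0 + 2.0 * x + rng.normal(0, 1 + 0.5 * x)   # error spread grows with x

slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x
residuals = y - fitted

# A fan or funnel shape here suggests heteroscedasticity
plt.scatter(fitted, residuals, s=10)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```
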
  • Normality
    • The error terms are normally distributed
    • Check with histograms, Q-Q plots, or normality tests (e.g., Shapiro-Wilk test)
    • Address violations with data transformations or robust regression methods
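
A brief sketch of the Shapiro-Wilk check, assuming SciPy is available (the residuals are simulated for illustration).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
residuals = rng.normal(size=100)        # residuals from a fitted model (simulated here)

stat, p = stats.shapiro(residuals)      # H0: residuals are normally distributed
print(f"W = {stat:.3f}, p = {p:.3f}")   # small p (< .05) would suggest non-normality

# A Q-Q plot gives a visual check of the same assumption, e.g.
# stats.probplot(residuals, dist="norm", plot=plt) with matplotlib
```
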
  • No multicollinearity
    • In multiple regression, the independent variables are not highly correlated with each other
    • Check with correlation coefficients or the Variance Inflation Factor (VIF)
    • Address violations by removing or combining highly correlated variables, or using dimensionality reduction techniques (e.g., PCA)
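
A short sketch of the VIF check, assuming statsmodels and pandas (the collinear predictors are simulated for illustration).

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # almost a copy of x1, so highly collinear
x3 = rng.normal(size=200)

X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# Rule of thumb: VIF above roughly 5-10 flags problematic multicollinearity
for i, name in enumerate(X.columns):
    if name == "const":
        continue                            # the intercept column is not of interest here
    print(name, round(variance_inflation_factor(X.values, i), 2))
```
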
  • Simple linear regression
    Models the relationship between a single predictor and a response variable
  • Multiple regression
    Extends simple linear regression to include multiple predictors, allowing for a more comprehensive understanding of relationships and improving prediction accuracy
  • Benefits of multiple regression
    • Assess the impact of multiple factors on a response variable
    • Control for confounding variables
    • Build more robust and accurate models
  • Simple linear regression model

    y = β0 + β1X + ϵ
  • Multiple regression model

    y = β0 + β1X1 + β2X2 + · · · + βnXn + ϵ
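
A minimal sketch of fitting this model, assuming statsmodels (the data and variable names are simulated and illustrative).

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 200
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
})
df["y"] = 1.0 + 2.0 * df["x1"] - 0.5 * df["x2"] + rng.normal(size=n)

X = sm.add_constant(df[["x1", "x2"]])    # adds the intercept column (beta_0)
model = sm.OLS(df["y"], X).fit()
print(model.params)                      # beta_0, beta_1, beta_2
print(model.rsquared, model.rsquared_adj)
```
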
  • Control variables
    Variables included in the model to account for potential confounding factors
  • Implementing control in regression using Frisch-Waugh-Lovell Theorem
    1. Regress the IV (X) on the Control Variables (Z): obtain the residuals
    2. Regress the DV (Y) on the Control Variables (Z): obtain the residuals
    3. Regress the residuals of Y on the residuals of X: the slope of this regression gives the effect of X on Y, controlling for Z
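
A small simulated sketch of the theorem (variable names are assumptions): the slope from the residual-on-residual regression matches the coefficient on X from the full multiple regression that includes Z.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 500
z = rng.normal(size=n)                       # control variable Z
x = 0.8 * z + rng.normal(size=n)             # IV X, correlated with Z
y = 2.0 * x + 1.5 * z + rng.normal(size=n)   # DV Y

# Full multiple regression: coefficient on X controlling for Z
full = sm.OLS(y, sm.add_constant(np.column_stack([x, z]))).fit()

# Frisch-Waugh-Lovell: residualize X and Y on Z, then regress residual on residual
rx = sm.OLS(x, sm.add_constant(z)).fit().resid   # X with Z partialled out
ry = sm.OLS(y, sm.add_constant(z)).fit().resid   # Y with Z partialled out
fwl = sm.OLS(ry, sm.add_constant(rx)).fit()

print(full.params[1], fwl.params[1])             # the two slopes match
```
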
  • Regression coefficients
    β0 (Intercept): Expected value of y when all predictors are zero<|>β1, β2, . . . , βn (Coefficients): Expected change in y for a one-unit increase in the corresponding predictor, holding all other predictors constant
  • Holding predictors constant is an important conceptual framework for understanding the unique contribution of each predictor in multiple regression
  • Adding control blocks in multiple regression
    1. Identify primary predictors of interest and potential control variables
    2. Group control variables into meaningful blocks
    3. Add control blocks sequentially to the regression model and evaluate changes in primary predictor coefficients
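
A brief sketch of adding a control block, assuming statsmodels and simulated data (the variable names "focal", "age", and "income" are illustrative): fit the primary predictor alone, then add the control block and compare the focal coefficient and R-squared.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 300
df = pd.DataFrame({
    "focal":  rng.normal(size=n),
    "age":    rng.normal(size=n),
    "income": rng.normal(size=n),
})
df["y"] = 1.0 + 0.5 * df["focal"] + 0.3 * df["age"] + 0.2 * df["income"] + rng.normal(size=n)

# Block 1: primary predictor only
m1 = sm.OLS(df["y"], sm.add_constant(df[["focal"]])).fit()
# Block 2: add the control block, then compare the focal coefficient
m2 = sm.OLS(df["y"], sm.add_constant(df[["focal", "age", "income"]])).fit()

print(m1.params["focal"], m1.rsquared)
print(m2.params["focal"], m2.rsquared)   # how the focal slope and R-squared change
```
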
  • Benefits of control blocks in multiple regression
    Enhanced understanding of the relationships between predictors and the dependent variable<|>Identification of potential confounding factors<|>Systematic approach to adding control variables in the model<|>Improved model interpretability
  • Additive models assume the effects of predictors are independent, while interaction models allow the effects of predictors to depend on each other
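
A minimal sketch of the distinction using statsmodels' formula interface (the data are simulated and the variable names assumed): the additive model fits y ~ x1 + x2, while the interaction model adds the x1:x2 product term so the effect of x1 can depend on x2.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(8)
n = 300
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
# The effect of x1 on y depends on x2 (an interaction), plus noise
df["y"] = 1.0 + 0.5 * df["x1"] + 0.3 * df["x2"] + 0.7 * df["x1"] * df["x2"] + rng.normal(size=n)

additive    = smf.ols("y ~ x1 + x2", data=df).fit()   # assumes independent effects
interaction = smf.ols("y ~ x1 * x2", data=df).fit()   # adds the x1:x2 product term

print(additive.params)
print(interaction.params)   # the x1:x2 coefficient captures the interaction
```
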