Linear Regression

A Brief Review

Brandon M. Greenwell, PhD

University of Cincinnati

About me

Ames housing data

  • Data describing the sale of individual residential property in Ames, Iowa from 2006 to 2010
  • There are 2930 observations on 81 variables involved in assessing home values:
    • 23 nominal
    • 23 ordinal
    • 14 discrete
    • 20 continuous
  • Paper: https://jse.amstat.org/v19n3/decock.pdf

Ames housing data

Show R code
ames <- AmesHousing::make_ames()  # install.packages("AmesHousing")

ames$Sale_Price <- ames$Sale_Price / 10000

head(ames)

Ames housing data

We’ll focus on a handful of variables:

  • Sale_Price - Sale price of the house / $10K (response variable)
  • Gr_Liv_Area - Above grade (ground) living area square feet
  • Overall_Qual - Rates the overall material and finish of the house

Statistical relationships

Show R code
# Simulate data from different SLR models
set.seed(101)  # for reproducibility
x <- seq(from = 0, to = 4, length = 100)
y <- cbind(
  1 + x + rnorm(length(x)),  # linear
  1 + (x - 2)^2 + rnorm(length(x)),  # quadratic
  1 + log(x + 0.1) + rnorm(length(x), sd = 0.3),  # logarithmic
  1 + rnorm(length(x))  # no association
)

# Scatterplot of X vs. each Y in a 2-by-2 grid
par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(x, y[, i], col = adjustcolor("cornflowerblue", alpha.f = 0.7),
       pch = 19, xlab = "X", ylab = "Y")
}

Are \(x\) and \(y\) correlated?

Show R code
set.seed(1051)  # for reproducibility
n <- 1000 
x <- rnorm(n)
y <- 1 + 0.001*x + rnorm(n)
plot(x, y, asp = 1, col = adjustcolor("black", alpha.f = 0.3))

Correlation is not causation

All models are wrong!

Also, see this talk by my old adviser, Thad Tarpey: “All Models are Right… most are useless.”

Modeling the mean response

  • Assume that \(Y \sim N\left(\mu, \sigma^2\right)\), where

\[\mu = \mu\left(x\right) = \beta_0 + \beta_1 x = E\left(Y|x\right)\]

  • In other words: \(Y \sim N\left(\beta_0 + \beta_1 x, \sigma^2\right)\)
  • Alternatively, we could write \(Y = \beta_0 + \beta_1 x + \epsilon\), where \(\epsilon \sim N\left(0, \sigma^2\right)\)
  • This is called the simple linear regression (SLR) model

The idea behind SLR

Image source

Simple Linear Regression (SLR)

Simple linear regression (SLR)

  • Data: \(\left\{(X_i, Y_i)\right\}_{i=1}^n\)

  • Model: \(Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\)

    • \(Y_i\) is a continuous response

    • \(X_i\) is a continuous predictor

    • \(\beta_0\) is the intercept of the regression line (also called the bias term)

    • \(\beta_1\) is the slope of the regression line

    • \(\epsilon_i \stackrel{iid}{\sim} D\left(0, \sigma^2\right)\)

Assumptions about the errors \(\epsilon_i\)

For \(i\) and \(j\) in \(\left\{1, 2, \dots, n\right\}\) and \(i \ne j\)

  1. \(\quad E\left(\epsilon_i\right) = 0\)

  2. \(\quad Var\left(\epsilon_i\right) = \sigma^2\) (homoscedasticity 😱)

  3. \(\quad Cov\left(\epsilon_i, \epsilon_j\right) = 0\) (independence)

Assumptions 1–3 can be summarized as \(\epsilon_i \stackrel{iid}{\sim} D\left(0, \sigma^2\right)\), where \(iid\) refers to independent and identically distributed.

Properties of SLR

  • Simple linear regression: \(Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\)

    • Assumes the model is linear in the regression coefficients \(\beta_0\) and \(\beta_1\)
  • The error term is a random variable; hence, \(Y_i\) is also a random variable (Why? 🤔)

    • What is \(E\left(Y_i|X_i\right)\) and \(Var\left(Y_i|X_i\right)\)?
  • \(Cor\left(Y_i, Y_j\right) = 0\) \(\forall i \ne j\) (Why? 🤔)

Inference for a single variable

  • Is it useful to test the hypothesis that Sale_Price = $160K?
  • No! Sale_Price is not a constant but a random variable whose value varies from home to home (and year to year, etc.)
  • We’re more interested in questions such as:

    • What is the chance that Sale_Price > $160K? (above the median sale price)
    • What is the chance that Sale_Price < $105K? (the lowest decile)
    • What is the chance that $129,500 < Sale_Price < $213,500? (within the IQR)
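Probabilities like these can be estimated directly from the empirical CDF. A quick sketch (assumes the AmesHousing package is installed and Sale_Price is in units of $10K, as on the earlier slide):

```r
# Estimate the probabilities above from the empirical CDF
ames <- AmesHousing::make_ames()
ames$Sale_Price <- ames$Sale_Price / 10000  # rescale to units of $10K
F_hat <- ecdf(ames$Sale_Price)
1 - F_hat(16)                # P(Sale_Price > $160K)
F_hat(10.5)                  # P(Sale_Price < $105K)
F_hat(21.35) - F_hat(12.95)  # P($129,500 < Sale_Price < $213,500)
```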

Distribution of Sale_Price

We can look at the histogram and empirical CDF:

Show R code
par(mfrow = c(1, 2), las = 1)
hist(ames$Sale_Price, breaks = 50, xlab = "Sale price ($)", freq = FALSE, main = "")
plot(ecdf(ames$Sale_Price), xlab = "Sale price ($)", main = "",
     col = adjustcolor(1, alpha.f = 0.1))

Normality tests 🤮

  • Normality tests, like the Shapiro-Wilk and Anderson-Darling tests, can also be used to assess normality

    • I STRONGLY ADVISE AGAINST USING THEM!
  • No real data are exactly normally distributed; what we care about is whether a normal approximation is close enough!

  • Normality tests provide a \(p\)-value, which only gives a yes/no conclusion and is heavily dependent on sample size.

What can we do if the normality assumption isn’t justified?

  • Try transformations

    • Logarithm or square root for positive data
    • Power transformation (like the well-known Box-Cox procedure)
  • Try a more appropriate distribution (e.g., Poisson or gamma distribution)

  • Try more advanced approaches, like the nonparametric bootstrap!
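As a brief sketch of the Box-Cox procedure mentioned above, `MASS::boxcox()` (MASS ships with R) profiles the likelihood over candidate power transformations; the simulated response below is log-normal, so the estimated power should land near zero (i.e., a log transform):

```r
# Box-Cox on simulated right-skewed data via MASS::boxcox()
set.seed(101)  # for reproducibility
x <- runif(100, min = 1, max = 10)
y <- exp(0.1 + 0.2 * x + rnorm(100, sd = 0.2))  # log-normal response
bc <- MASS::boxcox(y ~ x, plotit = FALSE)  # profile likelihood over lambda
(lambda <- bc$x[which.max(bc$y)])  # power maximizing the likelihood; ~0 => log
```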

Least squares (LS) estimation

Idea of LS is to find \(\beta_0\) and \(\beta_1\) so that the sum of squared residuals (i.e., errors) is minimized: \[SSE = \sum_{i=1}^n\left(y_i - \beta_0 - \beta_1x_i\right)^2\]

Pretty straightforward optimization problem that leads to a closed-form solution (but there's no point in memorizing the formulas!)
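As a quick sanity check (a sketch on simulated data), the closed-form LS estimates, slope \(\hat{\beta}_1 = S_{xy}/S_{xx}\) and intercept \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}\), can be compared with what `lm()` returns:

```r
# Closed-form LS estimates vs. lm() on simulated data
set.seed(101)  # for reproducibility
x <- runif(50, min = 0, max = 4)
y <- 1 + 2 * x + rnorm(50)
b1 <- cov(x, y) / var(x)      # slope: Sxy / Sxx
b0 <- mean(y) - b1 * mean(x)  # intercept
fit <- lm(y ~ x)
c(b0, b1)
coef(fit)  # same values
```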

Concept of LS estimation

Image source

Sale_Price and Gr_Liv_Area

Show R code
plot(Sale_Price ~ Gr_Liv_Area, data = ames, las = 1,
     col = adjustcolor(1, alpha.f = 0.3))

SLR fit

Show R code
summary(fit <- lm(Sale_Price ~ Gr_Liv_Area, data = ames))

Call:
lm(formula = Sale_Price ~ Gr_Liv_Area, data = ames)

Residuals:
    Min      1Q  Median      3Q     Max 
-48.347  -3.022  -0.197   2.273  33.432 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.3289634  0.3269703   4.064 4.94e-05 ***
Gr_Liv_Area 0.0111694  0.0002066  54.061  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.652 on 2928 degrees of freedom
Multiple R-squared:  0.4995,    Adjusted R-squared:  0.4994 
F-statistic:  2923 on 1 and 2928 DF,  p-value: < 2.2e-16

Is this a good fit?

Show R code
plot(Sale_Price ~ Gr_Liv_Area, data = ames,
     col = adjustcolor(1, alpha.f = 0.3))
abline(fit, lwd = 2, col = 2)  # add SLR fit

Which assumptions seem violated to some degree?

Residual diagnostics

  • The ordinary residual is defined as \(e_i = y_i - \hat{y}_i\) and can be regarded as the observed error
  • The residuals hold a lot of properties that make them useful for diagnosing potential issues with the model (e.g., suggesting potential transformations to try)
  • Many other kinds of residuals exist for different purposes (e.g., standardized, studentized, jackknife or PRESS residuals, etc.)
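The residual types mentioned above are all built into base R; a small sketch on simulated data (not the Ames data):

```r
# Common residual types in R (simulated data)
set.seed(101)  # for reproducibility
x <- runif(50)
y <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)
e <- residuals(fit)      # ordinary residuals: y - fitted(fit)
r.std <- rstandard(fit)  # standardized (internally studentized)
r.stu <- rstudent(fit)   # jackknife (externally studentized)
head(cbind(ordinary = e, standardized = r.std, studentized = r.stu))
```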

What can residual plots tell us?

  • Residuals vs. predictor values (checking non-linearity)

  • Residuals vs. fitted values (non-constant variance, non-linearity, and outliers)

  • Residuals vs. time or another sequence (checking independence)

  • Residuals vs. omitted predictor values (missing potentially important predictors)

  • Normal Q-Q plot of residuals (checking normality)

  • And much, much more!

Sale_Price ~ Gr_Liv_Area

Residual analysis:

Show R code
par(mfrow = c(2, 3), las = 1)
plot(fit, which = 1:6)

What assumptions appear to be in violation?

Let’s try a log transformation

Show R code
fit2 <- lm(log(Sale_Price) ~ Gr_Liv_Area, data = ames)
plot(log(Sale_Price) ~ Gr_Liv_Area, data = ames,
     col = adjustcolor(1, alpha.f = 0.3))
abline(fit2, lwd = 2, col = 2)  # add SLR fit

log(Sale_Price) ~ Gr_Liv_Area

Residual analysis:

Show R code
par(mfrow = c(2, 3), las = 1)
plot(fit2, which = 1:6)

Any better?

Multiple Linear Regression (MLR)

MLR in a nutshell 🥜

  • The (normal) multiple linear regression model assumes \(Y \sim N\left(\mu\left(\boldsymbol{x}\right), \sigma^2\right)\), where \[\mu\left(\boldsymbol{x}\right) = \beta_0 + \sum_{j=1}^p \beta_j x_j = \boldsymbol{x}^\top\boldsymbol{\beta}\]

  • LS estimation still provides unbiased estimate of \(\boldsymbol{\beta} = \left(\beta_0, \beta_1, \dots, \beta_p\right)^\top\): \(\hat{\boldsymbol{\beta}} = \left(\boldsymbol{X}^\top\boldsymbol{X}\right)^{-1}\boldsymbol{X}^\top\boldsymbol{y}\)

  • Fitted values: \(\hat{\boldsymbol{y}} = \boldsymbol{X}\left(\boldsymbol{X}^\top\boldsymbol{X}\right)^{-1}\boldsymbol{X}^\top\boldsymbol{y} = \boldsymbol{H}\boldsymbol{y}\)

  • \(\boldsymbol{H}\) is the well-known “hat matrix”
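A small sketch (simulated data) constructing \(\boldsymbol{H}\) by hand and checking that \(\boldsymbol{H}\boldsymbol{y}\) reproduces the fitted values from `lm()`:

```r
# Hat matrix by hand vs. lm() fitted values (simulated data)
set.seed(101)  # for reproducibility
n <- 30
x1 <- rnorm(n)
x2 <- rnorm(n)
X <- cbind(1, x1, x2)                    # model matrix (intercept column first)
y <- drop(X %*% c(1, 2, -1)) + rnorm(n)
H <- X %*% solve(crossprod(X)) %*% t(X)  # H = X (X'X)^{-1} X'
yhat <- drop(H %*% y)                    # fitted values via the hat matrix
all.equal(yhat, unname(fitted(lm(y ~ x1 + x2))))  # TRUE
```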

Polynomial regression

  • Polynomial regression is just a special case of the MLR model

  • A second order model in a single predictor: \[Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon\]

  • A \(k\)-th order model in a single predictor (typically \(k \le 3\)): \[Y = \beta_0 + \sum_{j=1}^k\beta_j X^j + \epsilon\]
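A second-order model can be fit as an ordinary MLR; a sketch on simulated data using either raw terms via `I()` or orthogonal polynomials via `poly()` (same fit, different basis):

```r
# Second-order polynomial fit two ways (simulated data)
set.seed(101)  # for reproducibility
x <- seq(from = 0, to = 4, length = 50)
y <- 1 + (x - 2)^2 + rnorm(50, sd = 0.5)
fit.raw <- lm(y ~ x + I(x^2))            # raw polynomial terms
fit.poly <- lm(y ~ poly(x, degree = 2))  # orthogonal polynomials
all.equal(fitted(fit.raw), fitted(fit.poly))  # TRUE: identical fitted values
```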

Example: paper strength data

Data concerning the strength of kraft paper and the percentage of hardwood in the batch of pulp from which the paper was produced.

Show R code
# Load the hardwood concentration data
url <- "https://bgreenwell.github.io/uc-bana7052/data/hardwood.csv"
hardwood <- read.csv(url)
Show R code
fit1 <- lm(TsStr ~ HwdCon, data = hardwood)
fit2 <- lm(TsStr ~ HwdCon + I(HwdCon^2), data = hardwood)

par(mfrow = c(1, 2), las = 1)
investr::plotFit(fit1, pch = 19, col.fit = "red2", main = "Linear Fit")
investr::plotFit(fit2, pch = 19, col.fit = "red2", main = "Quadratic Fit")

Categorical variables

  • Categorical variables can be handled in a number of ways in linear models, including dummy (indicator) encoding, effects coding, and other contrast schemes

R dummy encodes nominal factors by default:

Show R code
fit3 <- lm(log(Sale_Price) ~ Gr_Liv_Area + Central_Air + Paved_Drive, 
           data = ames)
summary(fit3)

Call:
lm(formula = log(Sale_Price) ~ Gr_Liv_Area + Central_Air + Paved_Drive, 
    data = ames)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.2344 -0.1345  0.0073  0.1453  0.8502 

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 1.428e+00  2.418e-02  59.059  < 2e-16 ***
Gr_Liv_Area                 5.183e-04  9.520e-06  54.440  < 2e-16 ***
Central_AirY                3.464e-01  2.068e-02  16.747  < 2e-16 ***
Paved_DrivePartial_Pavement 1.334e-01  3.733e-02   3.574 0.000358 ***
Paved_DrivePaved            3.085e-01  1.975e-02  15.618  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2574 on 2925 degrees of freedom
Multiple R-squared:  0.6018,    Adjusted R-squared:  0.6013 
F-statistic:  1105 on 4 and 2925 DF,  p-value: < 2.2e-16

How do you interpret the coefficients here?
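To see where the dummy columns in the output come from, here is a toy sketch of R's default treatment (dummy) encoding; the factor's first level becomes the reference category absorbed into the intercept:

```r
# R's default dummy (treatment) encoding for a nominal factor (toy example)
f <- factor(c("Dirt_Gravel", "Partial_Pavement", "Paved", "Paved"))
model.matrix(~ f)  # intercept + one indicator per non-reference level
```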

Coefficient of determination

The coefficient of determination is the proportion of the variance in the dependent variable that is predictable from the independent variables in the model.

R-squared (\(R^2\))

  • \(R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}\)

  • \(R^2\) will always increase as more terms are added to the model!

Adjusted R-squared (\(R_{adj}^2\))

  • \(R_{adj}^2 = 1 - \frac{MSE}{SST/\left(n - 1\right)}\)

  • Penalizes \(R^2\) if there are “too many” terms in the model

  • \(R_{adj}^2\) and \(MSE\) provide equivalent information
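Both formulas above are easy to verify by hand against `summary.lm()`; a sketch on simulated data:

```r
# R-squared and adjusted R-squared by hand vs. summary.lm() (simulated data)
set.seed(101)  # for reproducibility
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)
sse <- sum(residuals(fit)^2)     # error sum of squares
sst <- sum((y - mean(y))^2)      # total sum of squares
r2 <- 1 - sse / sst
r2.adj <- 1 - (sse / (n - 2)) / (sst / (n - 1))  # MSE / (SST / (n - 1))
c(r2, summary(fit)$r.squared)          # should agree
c(r2.adj, summary(fit)$adj.r.squared)  # should agree
```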

Variable/model selection

  • Variable/model selection is a very noisy problem! (Often best to avoid, if feasible)
  • Ask the domain experts about important variables (don’t just rely on algorithms)
  • P(selecting the “right” variables) = 0 (source)
  • “All subsets of variables are wrong, but some are useful!”
  • In regression settings, regularization (e.g., ridge regression and the LASSO) is often more useful! (Think about the impact of multicollinearity on variable selection)

Data splitting

  • If prediction is the goal (as opposed to inference and hypothesis testing), model performance should be assessed rigorously

  • Data splitting techniques are key, and the type of data splitting to use often depends on the situation (e.g., cross-sectional vs. time-series data)

  • In simplest terms, randomly split the data into two parts: A and B. Build a model on part A and assess how well it predicts the response in part B.

  • Leakage is a huge concern here, so data splitting ALWAYS has to be done carefully!
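The simple A/B split described above can be sketched in a few lines of base R (simulated data; the 70/30 ratio is just a common convention):

```r
# Simple random train/test split (simulated data)
set.seed(101)  # for reproducibility
n <- 200
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 2 * dat$x + rnorm(n)
trn.id <- sample(n, size = floor(0.7 * n))  # 70% of rows for training
fit <- lm(y ~ x, data = dat[trn.id, ])      # fit on part A
pred <- predict(fit, newdata = dat[-trn.id, ])  # predict part B
(rmse <- sqrt(mean((dat$y[-trn.id] - pred)^2)))  # test-set RMSE
```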

Linear Regression as a GLM

We can define the (normal) linear regression model as a specific type of Generalized Linear Model (GLM):

  1. Random Component: The response variable \(Y\) follows a normal distribution: \(Y \sim N(\mu, \sigma^2)\).

  2. Systematic Component: The linear predictor \(\eta\) is a linear function of the regressors: \(\eta = \boldsymbol{x}^\top\boldsymbol{\beta}\).

  3. Link Function: The link function \(g(\cdot)\) connects the mean response \(\mu\) to the linear predictor \(\eta\). For linear regression, we use the identity link: \[g(\mu) = \mu = \eta\]

This framework allows us to generalize to other types of response variables (e.g., binary, count) by changing the random component and the link function!
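A quick sketch confirming the equivalence: fitting the same simulated data with `lm()` and with `glm()` using the gaussian family and identity link yields identical coefficients:

```r
# Linear regression as a GLM: lm() vs. glm() with identity link (simulated data)
set.seed(101)  # for reproducibility
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
fit.lm <- lm(y ~ x)
fit.glm <- glm(y ~ x, family = gaussian(link = "identity"))
all.equal(coef(fit.lm), coef(fit.glm))  # TRUE
```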

Questions?
