Linear Regression

A Brief Review

Brandon M. Greenwell, PhD

University of Cincinnati

About me

Ames housing data

  • Data describing the sale of individual residential property in Ames, Iowa from 2006 to 2010
  • There are 2930 observations on 81 variables involved in assessing home values:
    • 23 nominal
    • 23 ordinal
    • 14 discrete
    • 20 continuous
  • Paper: https://jse.amstat.org/v19n3/decock.pdf

Ames housing data

Show R code
ames <- AmesHousing::make_ames()  # install.packages("AmesHousing")

ames$Sale_Price <- ames$Sale_Price / 10000

head(ames)

Ames housing data

We’ll focus on a handful of variables:

  • Sale_Price - Sale price of the house / $10K (response variable)
  • Gr_Liv_Area - Above grade (ground) living area square feet
  • Overall_Qual - Rates the overall material and finish of the house

Statistical relationships

Show R code
# Simulate data from different SLR models
set.seed(101)  # for reproducibility
x <- seq(from = 0, to = 4, length = 100)
y <- cbind(
  1 + x + rnorm(length(x)),  # linear
  1 + (x - 2)^2 + rnorm(length(x)),  # quadratic
  1 + log(x + 0.1) + rnorm(length(x), sd = 0.3),  # logarithmic
  1 + rnorm(length(x))  # no association
)

# Scatterplot of X vs. each Y in a 2-by-2 grid
par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(x, y[, i], col = adjustcolor("cornflowerblue", alpha.f = 0.7),
       pch = 19, xlab = "X", ylab = "Y")
}

Are \(x\) and \(y\) correlated?

Show R code
set.seed(1051)  # for reproducibility
n <- 1000 
x <- rnorm(n)
y <- 1 + 0.001*x + rnorm(n)
plot(x, y, asp = 1, col = adjustcolor("black", alpha.f = 0.3))

Correlation is not causation

All models are wrong!

Also, see this talk by my old adviser, Thad Tarpey: “All Models are Right… most are useless.”

Modeling the mean response

  • Assume that \(Y \sim N\left(\mu, \sigma^2\right)\), where

\[\mu = \mu\left(x\right) = \beta_0 + \beta_1 x = E\left(Y|x\right)\]

  • In other words: \(Y \sim N\left(\beta_0 + \beta_1 x, \sigma^2\right)\)
  • Alternatively, we could write \(Y = \beta_0 + \beta_1 x + \epsilon\), where \(\epsilon \sim N\left(0, \sigma^2\right)\)
  • This is called the simple linear regression (SLR) model

The idea behind SLR

Image source

Simple Linear Regression (SLR)

Simple linear regression (SLR)

  • Data: \(\left\{(X_i, Y_i)\right\}_{i=1}^n\)

  • Model: \(Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\)

    • \(Y_i\) is a continuous response

    • \(X_i\) is a continuous predictor

    • \(\beta_0\) is the intercept of the regression line (also called the bias term)

    • \(\beta_1\) is the slope of the regression line

    • \(\epsilon_i \stackrel{iid}{\sim} D\left(0, \sigma^2\right)\)

Assumptions about the errors \(\epsilon_i\)

For \(i\) and \(j\) in \(\left\{1, 2, \dots, n\right\}\) and \(i \ne j\)

  1. \(\quad E\left(\epsilon_i\right) = 0\)

  2. \(\quad Var\left(\epsilon_i\right) = \sigma^2\) (homoscedasticity 😱)

  3. \(\quad Cov\left(\epsilon_i, \epsilon_j\right) = 0\) (independence)

Assumptions 1–3 can be summarized as \(\epsilon_i \stackrel{iid}{\sim} D\left(0, \sigma^2\right)\), where \(iid\) refers to independent and identically distributed.

Properties of SLR

  • Simple linear regression: \(Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\)

    • Assumes the model is linear in the regression coefficients \(\beta_0\) and \(\beta_1\)
  • The error term is a random variable; hence, \(Y_i\) is also a random variable (Why? 🤔)

    • What is \(E\left(Y_i|X_i\right)\) and \(Var\left(Y_i|X_i\right)\)?
  • \(Cor\left(Y_i, Y_j\right) = 0\) \(\forall i \ne j\) (Why? 🤔)

Inference for a single variable

  • Is it useful to test the hypothesis that Sale_Price = $160K?
  • No! Sale_Price is not a constant but a random variable whose value varies from home to home (and year to year, etc.)
  • We’re more interested in questions such as:

    • What is the chance that Sale_Price > $160K? (above the median sale price)
    • What is the chance that Sale_Price < $105K? (the lowest decile)
    • What is the chance that $129,500 < Sale_Price < $213,500? (within the IQR)
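Probabilities like these can be estimated directly from the empirical CDF. A quick sketch (assumes the AmesHousing package is installed and Sale_Price is in units of $10K, as on the earlier slide):

```r
# Estimate the probabilities above from the empirical CDF
ames <- AmesHousing::make_ames()
ames$Sale_Price <- ames$Sale_Price / 10000  # rescale to units of $10K
F_hat <- ecdf(ames$Sale_Price)
1 - F_hat(16)                # P(Sale_Price > $160K)
F_hat(10.5)                  # P(Sale_Price < $105K)
F_hat(21.35) - F_hat(12.95)  # P($129,500 < Sale_Price < $213,500)
```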

Distribution of Sale_Price

We can look at the histogram and empirical CDF:

Show R code
par(mfrow = c(1, 2), las = 1)
hist(ames$Sale_Price, breaks = 50, xlab = "Sale price ($)", freq = FALSE, main = "")
plot(ecdf(ames$Sale_Price), xlab = "Sale price ($)", main = "",
     col = adjustcolor(1, alpha.f = 0.1))

Normality tests 🤮

  • Normality tests, like the Shapiro-Wilk and Anderson-Darling tests, can also be used to assess normality

    • I STRONGLY ADVISE AGAINST USING THEM!
  • No real data are exactly normally distributed; what we care about is whether a normal approximation is close enough!

  • Normality tests provide a \(p\)-value, which only gives a yes/no conclusion and is heavily dependent on sample size.

What can we do if the normality assumption isn’t justified?

  • Try transformations

    • Logarithm or square root for positive data
    • Power transformation (like the well-known Box-Cox procedure)
  • Try a more appropriate distribution (e.g., Poisson or gamma distribution)

  • Try more advanced approaches, like the nonparametric bootstrap!
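As a brief sketch of the Box-Cox procedure mentioned above, `MASS::boxcox()` (MASS ships with R) profiles the likelihood over candidate power transformations; the simulated response below is log-normal, so the estimated power should land near zero (i.e., a log transform):

```r
# Box-Cox on simulated right-skewed data via MASS::boxcox()
set.seed(101)  # for reproducibility
x <- runif(100, min = 1, max = 10)
y <- exp(0.1 + 0.2 * x + rnorm(100, sd = 0.2))  # log-normal response
bc <- MASS::boxcox(y ~ x, plotit = FALSE)  # profile likelihood over lambda
(lambda <- bc$x[which.max(bc$y)])  # power maximizing the likelihood; ~0 => log
```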

Least squares (LS) estimation

Idea of LS is to find \(\beta_0\) and \(\beta_1\) so that the sum of squared residuals (i.e., errors) is minimized: \[SSE = \sum_{i=1}^n\left(y_i - \beta_0 - \beta_1x_i\right)^2\]

Pretty straightforward optimization problem that leads to a closed-form solution (but there's no point in memorizing the formulas!)
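As a quick sanity check (a sketch on simulated data), the closed-form LS estimates, slope \(\hat{\beta}_1 = S_{xy}/S_{xx}\) and intercept \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}\), can be compared with what `lm()` returns:

```r
# Closed-form LS estimates vs. lm() on simulated data
set.seed(101)  # for reproducibility
x <- runif(50, min = 0, max = 4)
y <- 1 + 2 * x + rnorm(50)
b1 <- cov(x, y) / var(x)      # slope: Sxy / Sxx
b0 <- mean(y) - b1 * mean(x)  # intercept
fit <- lm(y ~ x)
c(b0, b1)
coef(fit)  # same values
```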

Concept of LS estimation

Image source

Sale_Price and Gr_Liv_Area

Show R code
plot(Sale_Price ~ Gr_Liv_Area, data = ames, las = 1,
     col = adjustcolor(1, alpha.f = 0.3))

SLR fit

Show R code
summary(fit <- lm(Sale_Price ~ Gr_Liv_Area, data = ames))

Call:
lm(formula = Sale_Price ~ Gr_Liv_Area, data = ames)

Residuals:
    Min      1Q  Median      3Q     Max 
-48.347  -3.022  -0.197   2.273  33.432 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.3289634  0.3269703   4.064 4.94e-05 ***
Gr_Liv_Area 0.0111694  0.0002066  54.061  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.652 on 2928 degrees of freedom
Multiple R-squared:  0.4995,    Adjusted R-squared:  0.4994 
F-statistic:  2923 on 1 and 2928 DF,  p-value: < 2.2e-16

Is this a good fit?

Show R code
plot(Sale_Price ~ Gr_Liv_Area, data = ames,
     col = adjustcolor(1, alpha.f = 0.3))
abline(fit, lwd = 2, col = 2)  # add SLR fit

Which assumptions seem violated to some degree?

Residual diagnostics

  • The ordinary residual is defined as \(e_i = y_i - \hat{y}_i\) and can be regarded as the observed error
  • The residuals hold a lot of properties that make them useful for diagnosing potential issues with the model (e.g., suggesting potential transformations to try)
  • Many other kinds of residuals exist for different purposes (e.g., standardized, studentized, jackknife or PRESS residuals, etc.)
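The residual types mentioned above are all built into base R; a small sketch on simulated data (not the Ames data):

```r
# Common residual types in R (simulated data)
set.seed(101)  # for reproducibility
x <- runif(50)
y <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)
e <- residuals(fit)      # ordinary residuals: y - fitted(fit)
r.std <- rstandard(fit)  # standardized (internally studentized)
r.stu <- rstudent(fit)   # jackknife (externally studentized)
head(cbind(ordinary = e, standardized = r.std, studentized = r.stu))
```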

What can residual plots tell us?

  • Residuals vs. predictor values (checking non-linearity)

  • Residuals vs. fitted values (non-constant variance, non-linearity, and outliers)

  • Residuals vs. time or another sequence (checking independence)

  • Residuals vs. omitted predictor values (missing potentially important predictors)

  • Normal Q-Q plot of residuals (checking normality)

  • And much, much more!

Sale_Price ~ Gr_Liv_Area

Residual analysis:

Show R code
par(mfrow = c(2, 3), las = 1)
plot(fit, which = 1:6)

What assumptions appear to be in violation?

Let’s try a log transformation

Show R code
fit2 <- lm(log(Sale_Price) ~ Gr_Liv_Area, data = ames)
plot(log(Sale_Price) ~ Gr_Liv_Area, data = ames,
     col = adjustcolor(1, alpha.f = 0.3))
abline(fit2, lwd = 2, col = 2)  # add SLR fit

log(Sale_Price) ~ Gr_Liv_Area

Residual analysis:

Show R code
par(mfrow = c(2, 3), las = 1)
plot(fit2, which = 1:6)

Any better?

Multiple Linear Regression (MLR)

MLR in a nutshell 🥜

  • The (normal) multiple linear regression model assumes \(Y \sim N\left(\mu\left(\boldsymbol{x}\right), \sigma^2\right)\), where \[\mu\left(\boldsymbol{x}\right) = \beta_0 + \sum_{j=1}^p \beta_j x_j = \boldsymbol{x}^\top\boldsymbol{\beta}\]

  • LS estimation still provides unbiased estimate of \(\boldsymbol{\beta} = \left(\beta_0, \beta_1, \dots, \beta_p\right)^\top\): \(\hat{\boldsymbol{\beta}} = \left(\boldsymbol{X}^\top\boldsymbol{X}\right)^{-1}\boldsymbol{X}^\top\boldsymbol{y}\)

  • Fitted values: \(\hat{\boldsymbol{y}} = \boldsymbol{X}\left(\boldsymbol{X}^\top\boldsymbol{X}\right)^{-1}\boldsymbol{X}^\top\boldsymbol{y} = \boldsymbol{H}\boldsymbol{y}\)

  • \(\boldsymbol{H}\) is the well-known “hat matrix”
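A small sketch (simulated data) constructing \(\boldsymbol{H}\) by hand and checking that \(\boldsymbol{H}\boldsymbol{y}\) reproduces the fitted values from `lm()`:

```r
# Hat matrix by hand vs. lm() fitted values (simulated data)
set.seed(101)  # for reproducibility
n <- 30
x1 <- rnorm(n)
x2 <- rnorm(n)
X <- cbind(1, x1, x2)                    # model matrix (intercept column first)
y <- drop(X %*% c(1, 2, -1)) + rnorm(n)
H <- X %*% solve(crossprod(X)) %*% t(X)  # H = X (X'X)^{-1} X'
yhat <- drop(H %*% y)                    # fitted values via the hat matrix
all.equal(yhat, unname(fitted(lm(y ~ x1 + x2))))  # TRUE
```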

Polynomial regression

  • Polynomial regression is just a special case of the MLR model

  • A second order model in a single predictor: \[Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon\]

  • A \(k\)-th order model in a single predictor (typically \(k \le 3\)): \[Y = \beta_0 + \sum_{j=1}^k\beta_j X^j + \epsilon\]
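A second-order model can be fit as an ordinary MLR; a sketch on simulated data using either raw terms via `I()` or orthogonal polynomials via `poly()` (same fit, different basis):

```r
# Second-order polynomial fit two ways (simulated data)
set.seed(101)  # for reproducibility
x <- seq(from = 0, to = 4, length = 50)
y <- 1 + (x - 2)^2 + rnorm(50, sd = 0.5)
fit.raw <- lm(y ~ x + I(x^2))            # raw polynomial terms
fit.poly <- lm(y ~ poly(x, degree = 2))  # orthogonal polynomials
all.equal(fitted(fit.raw), fitted(fit.poly))  # TRUE: identical fitted values
```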

Example: paper strength data

Data concerning the strength of kraft paper and the percentage of hardwood in the batch of pulp from which the paper was produced.

Show R code
# Load the hardwood concentration data
url <- "https://bgreenwell.github.io/uc-bana7052/data/hardwood.csv"
hardwood <- read.csv(url)
Show R code
fit1 <- lm(TsStr ~ HwdCon, data = hardwood)
fit2 <- lm(TsStr ~ HwdCon + I(HwdCon^2), data = hardwood)

par(mfrow = c(1, 2), las = 1)
investr::plotFit(fit1, pch = 19, col.fit = "red2", main = "Linear Fit")
investr::plotFit(fit2, pch = 19, col.fit = "red2", main = "Quadratic Fit")

Categorical variables

  • Categorical variables can be handled in a number of ways in linear models, including dummy (indicator) encoding, effects coding, and other contrast schemes

R dummy encodes nominal factors by default:

Show R code
fit3 <- lm(log(Sale_Price) ~ Gr_Liv_Area + Central_Air + Paved_Drive, 
           data = ames)
summary(fit3)

Call:
lm(formula = log(Sale_Price) ~ Gr_Liv_Area + Central_Air + Paved_Drive, 
    data = ames)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.2344 -0.1345  0.0073  0.1453  0.8502 

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 1.428e+00  2.418e-02  59.059  < 2e-16 ***
Gr_Liv_Area                 5.183e-04  9.520e-06  54.440  < 2e-16 ***
Central_AirY                3.464e-01  2.068e-02  16.747  < 2e-16 ***
Paved_DrivePartial_Pavement 1.334e-01  3.733e-02   3.574 0.000358 ***
Paved_DrivePaved            3.085e-01  1.975e-02  15.618  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2574 on 2925 degrees of freedom
Multiple R-squared:  0.6018,    Adjusted R-squared:  0.6013 
F-statistic:  1105 on 4 and 2925 DF,  p-value: < 2.2e-16

How do you interpret the coefficients here?
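To see where the dummy columns in the output come from, here is a toy sketch of R's default treatment (dummy) encoding; the factor's first level becomes the reference category absorbed into the intercept:

```r
# R's default dummy (treatment) encoding for a nominal factor (toy example)
f <- factor(c("Dirt_Gravel", "Partial_Pavement", "Paved", "Paved"))
model.matrix(~ f)  # intercept + one indicator per non-reference level
```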

Coefficient of determination

The coefficient of determination is the proportion of the variance in the dependent variable that is predictable from the independent variables in the model.

R-squared (\(R^2\))

  • \(R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}\)

  • \(R^2\) will always increase as more terms are added to the model!

Adjusted R-squared (\(R_{adj}^2\))

  • \(R_{adj}^2 = 1 - \frac{MSE}{SST/\left(n - 1\right)}\)

  • Penalizes \(R^2\) if there are “too many” terms in the model

  • \(R_{adj}^2\) and \(MSE\) provide equivalent information
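Both formulas above are easy to verify by hand against `summary.lm()`; a sketch on simulated data:

```r
# R-squared and adjusted R-squared by hand vs. summary.lm() (simulated data)
set.seed(101)  # for reproducibility
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)
sse <- sum(residuals(fit)^2)     # error sum of squares
sst <- sum((y - mean(y))^2)      # total sum of squares
r2 <- 1 - sse / sst
r2.adj <- 1 - (sse / (n - 2)) / (sst / (n - 1))  # MSE / (SST / (n - 1))
c(r2, summary(fit)$r.squared)          # should agree
c(r2.adj, summary(fit)$adj.r.squared)  # should agree
```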

Variable/model selection

  • Variable/model selection is a very noisy problem! (Often best to avoid, if feasible)
  • Ask the domain experts about important variables (don’t just rely on algorithms)
  • P(selecting the “right” variables) = 0 (source)
  • “All subsets of variables are wrong, but some are useful!”
  • In regression settings, regularization (e.g., ridge regression and the LASSO) is often more useful! (Think about the impact of multicollinearity on variable selection)

Data splitting

  • If prediction is the goal (as opposed to inference and hypothesis testing), model performance should be assessed rigorously

  • Data splitting techniques are key, and the type of data splitting to use often depends on the situation (e.g., cross-sectional vs. time-series data)

  • In simplest terms, randomly split the data into two parts: A and B. Build a model on part A and assess how well it predicts the response in part B.

  • Leakage is a huge concern here, so data splitting ALWAYS has to be done carefully!
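The simple A/B split described above can be sketched in a few lines of base R (simulated data; the 70/30 ratio is just a common convention):

```r
# Simple random train/test split (simulated data)
set.seed(101)  # for reproducibility
n <- 200
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 2 * dat$x + rnorm(n)
trn.id <- sample(n, size = floor(0.7 * n))  # 70% of rows for training
fit <- lm(y ~ x, data = dat[trn.id, ])      # fit on part A
pred <- predict(fit, newdata = dat[-trn.id, ])  # predict part B
(rmse <- sqrt(mean((dat$y[-trn.id] - pred)^2)))  # test-set RMSE
```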

Linear Regression as a GLM

We can define the (normal) linear regression model as a specific type of Generalized Linear Model (GLM):

  1. Random Component: The response variable \(Y\) follows a normal distribution: \(Y \sim N(\mu, \sigma^2)\).

  2. Systematic Component: The linear predictor \(\eta\) is a linear function of the regressors: \(\eta = \boldsymbol{x}^\top\boldsymbol{\beta}\).

  3. Link Function: The link function \(g(\cdot)\) connects the mean response \(\mu\) to the linear predictor \(\eta\). For linear regression, we use the identity link: \[g(\mu) = \mu = \eta\]

This framework allows us to generalize to other types of response variables (e.g., binary, count) by changing the random component and the link function!
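A quick sketch confirming the equivalence: fitting the same simulated data with `lm()` and with `glm()` using the gaussian family and identity link yields identical coefficients:

```r
# Linear regression as a GLM: lm() vs. glm() with identity link (simulated data)
set.seed(101)  # for reproducibility
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
fit.lm <- lm(y ~ x)
fit.glm <- glm(y ~ x, family = gaussian(link = "identity"))
all.equal(coef(fit.lm), coef(fit.glm))  # TRUE
```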

Questions?
