A Brief Review
University of Cincinnati
We’ll focus on a handful of variables:
Sale_Price - Sale price of the house / $10K (response variable)
Gr_Liv_Area - Above grade (ground) living area in square feet
Overall_Qual - Rates the overall material and finish of the house

# Simulate data from different SLR models
set.seed(101) # for reproducibility
x <- seq(from = 0, to = 4, length = 100)
y <- cbind(
  1 + x + rnorm(length(x)),                      # linear
  1 + (x - 2)^2 + rnorm(length(x)),              # quadratic
  1 + log(x + 0.1) + rnorm(length(x), sd = 0.3), # logarithmic
  1 + rnorm(length(x))                           # no association
)
# Scatterplot of X vs. each Y in a 2-by-2 grid
par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(x, y[, i], col = adjustcolor("cornflowerblue", alpha.f = 0.7),
       pch = 19, xlab = "X", ylab = "Y")
}

Also, see this talk by my old adviser, Thad Tarpey: “All Models are Right… most are useless.”
\[\mu = \mu\left(x\right) = \beta_0 + \beta_1 x = E\left(Y|x\right)\]
Data: \(\left\{(X_i, Y_i)\right\}_{i=1}^n\)
Model: \(Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\)
\(Y_i\) is a continuous response
\(X_i\) is a continuous predictor
\(\beta_0\) is the intercept of the regression line (also called the bias term)
\(\beta_1\) is the slope of the regression line
\(\epsilon_i \stackrel{iid}{\sim} D\left(0, \sigma^2\right)\)
For \(i\) and \(j\) in \(\left\{1, 2, \dots, n\right\}\) and \(i \ne j\)
\(\quad E\left(\epsilon_i\right) = 0\)
\(\quad Var\left(\epsilon_i\right) = \sigma^2\) (homoscedasticity 😱)
\(\quad Cov\left(\epsilon_i, \epsilon_j\right) = 0\) (independence)
Assumptions 1–3 can be summarized as \(\epsilon_i \stackrel{iid}{\sim} D\left(0, \sigma^2\right)\), where \(iid\) refers to independent and identically distributed.
Simple linear regression: \(Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\)
The error term is a random variable; hence, \(Y_i\) is also a random variable (Why? 🤔)
\(Cor\left(Y_i, Y_j\right) = 0\) \(\forall i \ne j\) (Why? 🤔)
Is Sale_Price = $160K? Sale_Price is not a constant, but a random variable whose value varies from home to home (and year to year, etc.). We’re more interested in questions such as:

Is Sale_Price > $160K? (above the median sale price)
Is Sale_Price < $105K? (in the lowest decile)
Is Sale_Price < $213,500? (within the IQR)

Can look at a histogram and the empirical CDF of Sale_Price:
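As a quick sketch of that idea (using simulated sale prices in place of the actual ames data, so the numbers are illustrative only), base R can draw both plots:

```r
# Simulated sale prices in $10K units; illustrative stand-in for ames$Sale_Price
set.seed(101)
sale_price <- exp(rnorm(1000, mean = log(16), sd = 0.4))

par(mfrow = c(1, 2))
hist(sale_price, breaks = 30, col = "cornflowerblue",
     main = "Histogram", xlab = "Sale_Price ($10K)")
plot(ecdf(sale_price), main = "Empirical CDF",
     xlab = "Sale_Price ($10K)", ylab = "F(x)")

# Empirical probability of exceeding $160K (16 in $10K units)
mean(sale_price > 16)
```

Since the simulated median is $160K by construction, the last line lands near 0.5; with the real data it would estimate P(Sale_Price > $160K).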
Normality tests, like the Shapiro-Wilk and Anderson-Darling tests, can also be used to assess normality
No data are exactly normally distributed; what we care about is whether a normal approximation is close enough!
Normality tests provide a \(p\)-value, which only gives a yes/no conclusion and is heavily dependent on sample size.
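For example, base R’s shapiro.test() returns such a p-value (the Anderson-Darling test lives in the add-on nortest package); here on simulated data:

```r
set.seed(101)
x_norm <- rnorm(100)  # truly normal sample
x_skew <- rexp(100)   # right-skewed (exponential) sample

shapiro.test(x_norm)$p.value  # typically large: no evidence against normality
shapiro.test(x_skew)$p.value  # tiny: strong evidence against normality
```

Note that with a huge sample even a trivially small departure from normality yields a tiny p-value, which is the sample-size dependence mentioned above.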
Try transformations
Try a more appropriate distribution (e.g., Poisson or gamma distribution)
Try more advanced approaches, like nonparametric bootstrapping!
Idea of LS is to find \(\beta_0\) and \(\beta_1\) so that the sum of squared residuals (i.e., errors) is minimized: \[SSE = \sum_{i=1}^n\left(y_i - \beta_0 - \beta_1x_i\right)^2\]
Pretty straightforward optimization problem that leads to a closed-form solution (but there’s no point in memorizing the formulas!)
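For the curious, the closed-form solution is easy to verify numerically; a sketch on simulated data:

```r
# Closed-form LS estimates for SLR: b1 = Sxy / Sxx, b0 = ybar - b1 * xbar
set.seed(101)
x <- runif(50, 0, 4)
y <- 1 + 2 * x + rnorm(50)

b1 <- cov(x, y) / var(x)      # slope estimate
b0 <- mean(y) - b1 * mean(x)  # intercept estimate

fit <- lm(y ~ x)
c(b0, b1)
coef(fit)  # matches the closed-form estimates
```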
Sale_Price and Gr_Liv_Area
Call:
lm(formula = Sale_Price ~ Gr_Liv_Area, data = ames)

Residuals:
    Min      1Q  Median      3Q     Max 
-48.347  -3.022  -0.197   2.273  33.432 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.3289634  0.3269703   4.064 4.94e-05 ***
Gr_Liv_Area 0.0111694  0.0002066  54.061  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.652 on 2928 degrees of freedom
Multiple R-squared:  0.4995,    Adjusted R-squared:  0.4994
F-statistic:  2923 on 1 and 2928 DF,  p-value: < 2.2e-16
Which assumptions seem violated to some degree?
Residuals vs. predictor values (non-linearity)
Residuals vs. fitted values (non-constant variance, non-linearity, and outliers)
Residuals vs. time or another sequence (independence)
Residuals vs. omitted predictor values (missing potentially important predictors)
Normal QQ plot of residuals (non-normality)
And much, much more!
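Most of these plots come for free in base R: calling plot() on a fitted lm object produces the standard diagnostic panels (sketched here on simulated data):

```r
set.seed(101)
x <- runif(100, 0, 4)
y <- 1 + 2 * x + rnorm(100)
fit <- lm(y ~ x)

par(mfrow = c(2, 2))
plot(fit)  # residuals vs. fitted, normal QQ, scale-location, residuals vs. leverage
```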
Residual analysis for Sale_Price ~ Gr_Liv_Area:
What assumptions appear to be in violation?
Residual analysis for log(Sale_Price) ~ Gr_Liv_Area:
Any better?
The (normal) multiple linear regression model assumes \(Y \sim N\left(\mu\left(\boldsymbol{x}\right), \sigma^2\right)\), where \[\mu\left(\boldsymbol{x}\right) = \beta_0 + \sum_{j=1}^p \beta_j x_j = \boldsymbol{x}^\top\boldsymbol{\beta}\]
LS estimation still provides unbiased estimate of \(\boldsymbol{\beta} = \left(\beta_0, \beta_1, \dots, \beta_p\right)^\top\): \(\hat{\boldsymbol{\beta}} = \left(\boldsymbol{X}^\top\boldsymbol{X}\right)^{-1}\boldsymbol{X}^\top\boldsymbol{y}\)
Fitted values: \(\hat{\boldsymbol{y}} = \boldsymbol{X}\left(\boldsymbol{X}^\top\boldsymbol{X}\right)^{-1}\boldsymbol{X}^\top\boldsymbol{y} = \boldsymbol{H}\boldsymbol{y}\)
\(\boldsymbol{H}\) is the well-known “hat matrix”
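A small sketch (simulated data) confirming that \(\boldsymbol{H}\boldsymbol{y}\) reproduces the fitted values, and that the diagonal of \(\boldsymbol{H}\) gives the leverages:

```r
set.seed(101)
n <- 20
x <- runif(n, 0, 4)
y <- 1 + 2 * x + rnorm(n)

X <- cbind(1, x)                       # design matrix (intercept column first)
H <- X %*% solve(t(X) %*% X) %*% t(X)  # hat matrix
yhat <- drop(H %*% y)

fit <- lm(y ~ x)
all.equal(yhat, unname(fitted(fit)))        # TRUE
all.equal(diag(H), unname(hatvalues(fit)))  # TRUE (leverages)
```

In practice lm() uses a QR decomposition rather than inverting \(\boldsymbol{X}^\top\boldsymbol{X}\), which is more numerically stable.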
Polynomial regression is just a special case of the MLR model
A second order model in a single predictor: \[Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon\]
A k-th order model in a single predictor (Typically \(k \le 3\)): \[Y = \beta_0 + \sum_{j=1}^k\beta_j X^j + \epsilon\]
Data concerning the strength of kraft paper and the percentage of hardwood in the batch of pulp from which the paper was produced.
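A sketch of fitting such a model with poly() (simulated data standing in for the actual kraft paper measurements; the true curve here is quadratic by construction):

```r
set.seed(101)
hardwood <- runif(50, 1, 15)  # percent hardwood (simulated)
strength <- 10 + 6 * hardwood - 0.3 * hardwood^2 + rnorm(50, sd = 2)

# raw = TRUE uses the raw powers X, X^2 (rather than orthogonal polynomials)
fit2 <- lm(strength ~ poly(hardwood, degree = 2, raw = TRUE))
coef(fit2)  # estimates near the true values 10, 6, -0.3
```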
Categorical variables can be handled in a number of ways in linear models, including dummy (treatment) coding, one-hot encoding, and effect coding
R dummy encodes nominal factors by default:
Call:
lm(formula = log(Sale_Price) ~ Gr_Liv_Area + Central_Air + Paved_Drive,
    data = ames)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.2344 -0.1345  0.0073  0.1453  0.8502 

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 1.428e+00  2.418e-02  59.059  < 2e-16 ***
Gr_Liv_Area                 5.183e-04  9.520e-06  54.440  < 2e-16 ***
Central_AirY                3.464e-01  2.068e-02  16.747  < 2e-16 ***
Paved_DrivePartial_Pavement 1.334e-01  3.733e-02   3.574 0.000358 ***
Paved_DrivePaved            3.085e-01  1.975e-02  15.618  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2574 on 2925 degrees of freedom
Multiple R-squared:  0.6018,    Adjusted R-squared:  0.6013
F-statistic:  1105 on 4 and 2925 DF,  p-value: < 2.2e-16
How do you interpret the coefficients here?
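To see exactly what R does under the hood, model.matrix() shows the dummy (treatment) encoding; the small factor below is a hypothetical stand-in mimicking Paved_Drive:

```r
drive <- factor(c("Dirt_Gravel", "Partial_Pavement", "Paved", "Paved"))
mm <- model.matrix(~ drive)
mm
# The first level (Dirt_Gravel) is absorbed into the intercept (reference level);
# each remaining level gets its own 0/1 indicator column
```

Each dummy coefficient is then the expected difference in the (log) response relative to the reference level, holding the other predictors fixed.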
The coefficient of determination is the proportion of the variance in the dependent variable that is predictable from the independent variables in the model.
R-squared (\(R^2\))
\(R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}\)
\(R^2\) will always increase as more terms are added to the model!
Adjusted R-squared (\(R_{adj}^2\))
\(R_{adj}^2 = 1 - \frac{MSE}{SST/\left(n - 1\right)}\)
Penalizes \(R^2\) if there are “too many” terms in the model
\(R_{adj}^2\) and \(MSE\) provide equivalent information
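Both quantities are easy to verify by hand against summary.lm(), using MSE = SSE/(n - p - 1) (simulated data):

```r
set.seed(101)
x <- runif(100, 0, 4)
y <- 1 + 2 * x + rnorm(100)
fit <- lm(y ~ x)

sse <- sum(residuals(fit)^2)
sst <- sum((y - mean(y))^2)
n <- length(y); p <- 1  # one predictor

r2 <- 1 - sse / sst
r2_adj <- 1 - (sse / (n - p - 1)) / (sst / (n - 1))

c(r2, summary(fit)$r.squared)          # identical
c(r2_adj, summary(fit)$adj.r.squared)  # identical
```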
If prediction is the goal (e.g., compared to inference and hypothesis testing), the model performance should be assessed rigorously
Data splitting techniques are key, and the type of data splitting to use often depends on the situation (e.g., cross-sectional vs. time-series data)
In simplest terms, randomly split the data into two parts: A and B. Build a model on part A and see how well it does at predicting the response in part B.
Leakage is a huge concern here, so data splitting ALWAYS has to be done carefully!
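A minimal sketch of a random 70/30 split for cross-sectional data (for time series you would split chronologically instead):

```r
set.seed(101)
dat <- data.frame(x = runif(200, 0, 4))
dat$y <- 1 + 2 * dat$x + rnorm(200)

train_id <- sample(nrow(dat), size = floor(0.7 * nrow(dat)))
train <- dat[train_id, ]   # part A: fit the model here
test  <- dat[-train_id, ]  # part B: held out for evaluation

fit <- lm(y ~ x, data = train)
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$y - pred)^2))  # test-set RMSE
rmse
```

Note that any preprocessing (transformations, imputation, etc.) must be fit on the training part only, or leakage creeps in.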
We can define the (normal) linear regression model as a specific type of Generalized Linear Model (GLM):
Random Component: The response variable \(Y\) follows a normal distribution: \(Y \sim N(\mu, \sigma^2)\).
Systematic Component: The linear predictor \(\eta\) is a linear function of the regressors: \(\eta = \boldsymbol{x}^\top\boldsymbol{\beta}\).
Link Function: The link function \(g(\cdot)\) connects the mean response \(\mu\) to the linear predictor \(\eta\). For linear regression, we use the identity link: \[g(\mu) = \mu = \eta\]
This framework allows us to generalize to other types of response variables (e.g., binary, count) by changing the random component and the link function!
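In R this equivalence is direct: glm() with the gaussian family and identity link reproduces lm() (simulated data):

```r
set.seed(101)
x <- runif(100, 0, 4)
y <- 1 + 2 * x + rnorm(100)

fit_lm  <- lm(y ~ x)
fit_glm <- glm(y ~ x, family = gaussian(link = "identity"))

coef(fit_lm)
coef(fit_glm)  # same estimates: LS and ML coincide under normal errors
```

Swapping in, e.g., family = poisson(link = "log") or binomial(link = "logit") gives the count and binary models mentioned above.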

BANA 7042: Statistical Modeling