Computational part (type and submit your answers in WORD only)

Please make sure that you include a cover page.

Question 1. (10 points) Alumni Donation Data (Multiple Linear Regression). Continue with the same data from homework 1 and fit a multiple linear regression model to the data, where the alumni giving rate is the response variable (\(Y\)), and the percentage of classes with fewer than 20 students (\(X_1\)) and Student/Faculty Ratio (\(X_2\)) as the predictors.

Explore various residual diagnostics and possible remedies, including but not limited to:

  1. Does the assumption of error normality appear to be violated?

  2. Does the assumption of constant error variance appear to be violated?

  3. Does there appear to be any \(Y\) or \(X\) outliers or influential points?

  4. Is there any concern about multicollinearity?

  5. What is the predicted alumni giving rate for an observation with (\(X_1 = 40\), \(X_2 = 5\))? Is there any concern about this prediction? Please explain.

Question 2. (10 points) Simulation Study (Simple Linear Regression). Assume mean function \(E\left(Y|X\right) = 10 + 5 X - 2X^2\)

  1. Generate data with \(X_1 \sim N\left(\mu = 3, \sigma = 0.5\right)\), sample size \(n = 100\), and error term \(\epsilon \sim N\left(\mu = 0, \sigma = 0.5\right)\).

  2. Fit a simple linear regression using just \(X\). What is the estimated regression equation? Please conduct model estimation, inference, and residual diagnostics. What do you conclude?

  3. Update the model from part b) by adding a quadratic term. Conduct model estimation, inference, and residual diagnostics. What do you conclude? Does this model seem to fit the data better? Please explain.

  4. What is the variance inflation factor (VIF) for the quadratic model? Any concern of multicollinearity?

  5. Now center the \(X\) variable and compare the VIF from d). What did you find? Which VIF is smaller? Please briefly explain the reason. Hint: you can center the \(X\) variable in R using the following pseudocode:

# Method 1: center x in the model formula
fit <- lm(y ~ x + I(x^2), data = my_data)
fit_centered <- lm(y ~ I(x - mean(x)) + I((x - mean(x))^2), data = my_data)

# Method 2: center x in the training data
my_data$x_centered <- my_data$x - mean(my_data$x)
fit_centered <- lm(y ~ x_centered + I(x_centered^2), data = my_data)

# Try it both ways to make sure you get the same answer!

Mathematics part (feel free to type or handwrite your solutions)

Question 3. (10 points)

  1. What are the assumptions of the (normal) linear regression model? (Be sure to include the assumptions of the random error term \(\epsilon\).) Please carefully state using bulleted points.

  2. Are the residuals from linear regression uncorrelated in general? Please explain. What is the distribution of residuals?

  3. Why do we plot the residuals against the fitted values (\(\widehat{Y}_i\)) rather than against the observed response values (\(Y_i\))? Please explain.

  4. What is multicollinearity? Computationally, what kind of problems can multicollinearity cause when using ordinary least squares? Please explain.