06. Linear models
2024
There are four main assumptions of the linear model specification \[Y_i \mid \mathbf{x}_i \sim \mathsf{normal}(\mathbf{x}_i\boldsymbol{\beta}, \sigma^2).\]
Our strategy is to create graphical diagnostic tools or perform hypothesis tests to ensure that there is no gross violation of the model's underlying assumptions.
The mean is \[\mathsf{E}(Y_i \mid \mathbf{x}_i)=\beta_0 + \beta_1x_{i1} + \cdots + \beta_p x_{ip}.\]
Implicitly, the effect of each explanatory variable is linear and additive, and all relevant covariates are included in the model.
Use ordinary residuals \(\boldsymbol{e}\), which are uncorrelated with fitted values \(\widehat{\boldsymbol{y}}\) and explanatory variables (i.e., columns of \(\mathbf{X}\)).
Any local pattern (e.g., quadratic trend, cycles, changepoints, subgroups) is indicative of misspecification of the mean model.
Use a local smoother (GAM or LOESS) to detect trends, as in the sketch below.
Look for patterns on the \(y\)-axis, not the \(x\)-axis!
Fix the mean model
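A minimal sketch of such a diagnostic plot, assuming the model linmod.college1 (fitted below) is available:
# Ordinary residuals against fitted values
plot(x = fitted(linmod.college1), y = resid(linmod.college1),
     xlab = "fitted values", ylab = "ordinary residuals")
# Add a LOESS-type smoother to detect trends in the y-direction
lines(lowess(x = fitted(linmod.college1), y = resid(linmod.college1)), col = 2)
abline(h = 0, lty = 2)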
The variance is the same for all observations, \(\mathsf{Va}(Y_i \mid \mathbf{x}_i) = \sigma^2\)
Typical heteroscedasticity patterns arise when the data are multiplicative (variability increases with the response level) or when the variance differs between subgroups.
Use externally studentized residuals \(r_i\), which have equal variance.
Graphical diagnostics: plot \(r_i\) (or \(|r_i|\)) against the fitted values or explanatory variables and look for changes in spread.
Hypothesis tests: Levene's test and the Breusch–Pagan score test, illustrated below.
# Extract externally studentized residuals
r <- rstudent(linmod.college1)
# Levene's test (an F-test from a one-way ANOVA of absolute deviations)
car::leveneTest(r ~ rank, center = "mean", data = college)
## Levene's Test for Homogeneity of Variance (center = "mean")
##        Df F value Pr(>F)
## group   2      50 <2e-16 ***
##       394
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Breusch-Pagan test (a score test)
car::ncvTest(linmod.college1, var.formula = ~ rank)
## Non-constant Variance Score Test
## Variance formula: ~ rank
## Chisquare = 70, Df = 2, p = 6e-16
Specify a model for the variance, e.g., a distinct variance parameter for each level of a categorical covariate.
Such a model specification enables the use of likelihood ratio tests.
The model can be fitted via restricted maximum likelihood using the function gls from package nlme.
For the college data, we set \(Y_i \sim \mathsf{normal}(\mathbf{x}_i\boldsymbol{\beta}, \sigma^2_{\texttt{rank}_i})\) with three different variance parameters. This seemingly corrects the heteroscedasticity.
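A sketch of this fit, assuming the response is salary and the same covariates as in linmod.college1:
library(nlme)
# Distinct variance per level of rank, fitted by REML (the default method)
gls_hetero <- gls(salary ~ rank + field + sex + service + years,
                  data = college,
                  weights = varIdent(form = ~ 1 | rank))
# Constant-variance counterpart, for a likelihood ratio test of homoscedasticity
gls_homo <- gls(salary ~ rank + field + sex + service + years, data = college)
anova(gls_homo, gls_hetero)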
The t.test and oneway.test functions in R perform hypothesis tests for equality of means: for observation \(i\) from group \(j\), the model specified is
\[Y_{ij} \sim \mathsf{normal}(\mu_j, \sigma^2_j).\] These tests are often labelled by software under the name of Welch (1947).
There is no exact expression for the null distribution unless the variances are equal; approximations due to Satterthwaite (1946) are used as benchmarks.
We need enough people in each subgroup to reliably estimate both the mean and variance!
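A sketch of these tests for the college data, assuming salary, rank and sex are columns of college:
# Welch's one-way ANOVA for equality of means with unequal variances
oneway.test(salary ~ rank, data = college, var.equal = FALSE)
# Welch's two-sample t-test (only two groups)
t.test(salary ~ sex, data = college, var.equal = FALSE)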
Economists often use sandwich estimators (White 1980), whereby we replace the estimator of the covariance matrix of \(\widehat{\boldsymbol{\beta}}\), usually \(S^2(\mathbf{X}^\top\mathbf{X})^{-1}\), by a sandwich estimator of the form
\[\widehat{\mathsf{Va}}_{\mathsf{HCE}}(\boldsymbol{\widehat{\beta}}) = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{\Omega}\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\] with \(\boldsymbol{\Omega}\) a diagonal matrix.
Popular choices are heteroscedasticity-consistent covariance matrices (MacKinnon and White 1985), e.g., taking \(\mathrm{diag}(\boldsymbol{\Omega})_i = e_i^2/(1-h_{ii})^2\), the so-called HC\({}_3\).
Replace \(\mathsf{Va}(\widehat{\boldsymbol{\beta}})\) by \(\widehat{\mathsf{Va}}_{\mathsf{HCE}}(\boldsymbol{\widehat{\beta}})\) in the formula of Wald tests.
vcov_HCE <- car::hccm(linmod.college1)
# Wald tests with sandwich matrix
w <- coef(linmod.college1) / sqrt(diag(vcov_HCE))
# Variance ratios
diag(vcov_HCE) / diag(vcov(linmod.college1))
##      (Intercept)    rankassociate         rankfull fieldtheoretical
##             0.27             0.29             0.62             0.99
##         sexwoman          service            years
##             0.41             2.19             1.76
# Compute p-values
pval <- 2 * pt(abs(w),
               df = linmod.college1$df.residual,
               lower.tail = FALSE)
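Equivalently (a sketch, assuming package lmtest is installed), the whole Wald table with the sandwich matrix can be obtained in a single call:
# Wald tests of the coefficients with the HC3 sandwich covariance matrix
lmtest::coeftest(linmod.college1, vcov. = vcov_HCE)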
Multiplicative data of the form \[\begin{align*} \left(\begin{matrix} \text{mean } \mu\end{matrix}\right) \times \left(\begin{matrix} \text{random error} \end{matrix}\right) \end{align*}\] tend to have higher variability when the response is larger.
A log-transformation of the response, \(\ln Y\), makes the model additive, assuming \(Y > 0\).
Write the log-linear model \[\begin{align*} \ln Y = \beta_0 +\beta_1 X_1 +\cdots + \beta_pX_p + \varepsilon \end{align*}\] in the original response scale as \[\begin{align*} Y &= \exp\left(\beta_0 +\beta_1 X_1 +\cdots + \beta_pX_p\right)\cdot \exp(\varepsilon), \end{align*}\] and thus \[\begin{align*} \mathsf{E}(Y \mid \mathbf{X}) = \exp(\beta_0 +\beta_1 X_1 +\cdots + \beta_pX_p) \times \mathsf{E}\{\exp(\varepsilon) \mid \mathbf{X}\}. \end{align*}\]
If \(\varepsilon \mid \mathbf{x} \sim \mathsf{normal}(\mu,\sigma^2)\), then \(\mathsf{E}\{\exp(\varepsilon) \mid \mathbf{x}\}= \exp(\mu+\sigma^2/2)\) and \(\exp(\varepsilon)\) follows a log-normal distribution.
In the absence of interaction or nonlinear terms for \(X_j\), an increase of one unit of \(X_j\) leads to an increase of \(\beta_j\) in \(\ln Y\), which translates into a multiplicative factor of \(\exp(\beta_j)\) on the original scale of \(Y\).
Compare the ratio of \(\mathsf{E}(Y \mid X_1=x+1)\) to \(\mathsf{E}(Y \mid X_1=x)\), \[\begin{align*} \frac{\mathsf{E}(Y \mid X_1=x+1, X_2, \ldots, X_p)}{\mathsf{E}(Y \mid X_1=x, X_2, \ldots, X_p)} = \frac{\exp\{\beta_1(x+1)\}}{\exp(\beta_1 x)} = \exp(\beta_1). \end{align*}\] Thus, \(\exp(\beta_1)\) represents the ratio of the mean of \(Y\) when \(X_1=x+1\) in comparison to that when \(X_1=x\), ceteris paribus (and provided this statement is meaningful).
The percentage change in the mean of \(Y\) is thus \(100\{\exp(\beta_1)-1\}\%\).
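A sketch of this interpretation on the data scale, assuming a strictly positive response salary and the same covariates as before:
# Log-linear model and multiplicative interpretation of the coefficients
logmod <- lm(log(salary) ~ rank + field + sex + service + years, data = college)
exp(coef(logmod))             # multiplicative factor per unit increase
100 * (exp(coef(logmod)) - 1) # percentage change in the mean response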
Consider the case where both \(Y\) and \(X_1\) are log-transformed, so that \[\begin{align*} Y= X_1^{\beta_1}\exp(\beta_0 + \beta_2X_2 + \cdots + \beta_pX_p + \varepsilon). \end{align*}\] Taking the derivative of \(Y\) with respect to \(X_1>0\), we get \[\begin{align*} \frac{\partial Y}{\partial X_1}&= \beta_1 X_1^{\beta_1-1}\exp(\beta_0 + \beta_2X_2 + \cdots + \beta_pX_p + \varepsilon)= \frac{\beta_1 Y}{X_1} \end{align*}\] and thus we can rearrange the expression so that \[\begin{align*} \frac{\partial X_1}{X_1}\beta_1 = \frac{\partial Y}{Y}; \end{align*}\] this is a partial elasticity, so \(\beta_1\) is interpreted as the percentage change in \(Y\) for each 1% increase in \(X_1\), ceteris paribus.
Independence typically follows from the sampling scheme (random sample); we need context to infer whether this assumption holds.
Typical violations include clustered or grouped data, repeated measurements on the same units, time series and spatial data.
Nearby things are more alike, so the amount of ‘new information’ is smaller than the sample size.
When observations are positively correlated, the estimated standard errors reported by the software are too small.
This means we are overconfident and will reject the null hypothesis more often than we should if the null is true (inflated Type I error, or false positives).
Chapter 6 will deal with correlated data.
The main idea is to assume instead that \[\boldsymbol{Y} \mid \mathbf{X} \sim \mathsf{normal}_n(\mathbf{X}\boldsymbol{\beta}, \boldsymbol{\Sigma})\] and model explicitly the \(n \times n\) variance matrix \(\boldsymbol{\Sigma}\), parametrized in terms of covariance parameters \(\boldsymbol{\psi}\).
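As a sketch (with hypothetical columns y, x, time and id of a data frame dat), such covariance structures can be specified with nlme::gls:
# First-order autoregressive correlation within units (hypothetical data)
nlme::gls(y ~ x, data = dat,
          correlation = nlme::corAR1(form = ~ time | id))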
For time series, we can look instead at a correlogram, i.e., a bar plot of the correlation between two observations \(h\) units apart as a function of the lag \(h\).
For \(y_1, \ldots, y_n\) and constant time lags \(h=0, 1, \ldots\) units, the autocorrelation at lag \(h\) is (Brockwell and Davis 2016, Definition 1.4.4) \[\begin{align*} r(h) = \frac{\gamma(h)}{\gamma(0)}, \qquad \gamma(h) = \frac{1}{n}\sum_{i=1}^{n-|h|} (y_i-\overline{y})(y_{i+h} - \overline{y}) \end{align*}\]
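A minimal sketch, with resid_ts a hypothetical vector of residuals ordered in time:
# Correlogram: bar plot of autocorrelations r(h) against the lag h
acf(resid_ts, lag.max = 20, main = "correlogram of residuals")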
Normality is without doubt the least important assumption.
Ordinary least squares estimators are best linear unbiased estimators (BLUE) if the data are independent and the variance is constant, regardless of normality.
They are still unbiased and consistent if the variance is misspecified.
Tests for parameters are valid provided that each coefficient estimator is based on a sufficient number of observations.
Produce Student quantile-quantile plots of externally studentized residuals \(R_i \sim \mathsf{Student}(n-p-2)\).
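A sketch using car::qqPlot, which for lm objects compares the externally studentized residuals to a Student reference distribution with \(n-p-2\) degrees of freedom:
# Quantile-quantile plot of externally studentized residuals
car::qqPlot(linmod.college1, id = FALSE,
            ylab = "externally studentized residuals")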
For strictly positive data, one can consider a Box–Cox transformation, \[\begin{align*} y(\lambda)= \begin{cases} (y^{\lambda}-1)/\lambda, & \lambda \neq 0\\ \ln(y), & \lambda=0. \end{cases} \end{align*}\] The cases \(\lambda=-1\) (reciprocal), \(\lambda=0\) (log) and \(\lambda=1\) (identity) are perhaps the most important because they yield interpretable models.
If we assume that \(\boldsymbol{Y}(\lambda) \sim \mathsf{normal}(\mathbf{X}\boldsymbol{\beta}, \sigma^2 \mathbf{I}_n)\), then the likelihood is \[\begin{align*} L(\lambda, \boldsymbol{\beta}, \sigma; \boldsymbol{y}, \mathbf{X}) &= (2\pi\sigma^2)^{-n/2} J(\lambda, \boldsymbol{y}) \times\\& \quad \exp \left[ - \frac{1}{2\sigma^2}\{\boldsymbol{y}(\lambda) - \mathbf{X}\boldsymbol{\beta}\}^\top\{\boldsymbol{y}(\lambda) - \mathbf{X}\boldsymbol{\beta}\}\right], \end{align*}\] where \(J\) denotes the Jacobian of the Box–Cox transformation, \(J(\lambda, \boldsymbol{y})=\prod_{i=1}^n y_i^{\lambda-1}\).
For each given value of \(\lambda\), the maximum likelihood estimator is that of the usual regression model, with \(\boldsymbol{y}\) replaced by \(\boldsymbol{y}(\lambda)\).
The profile log likelihood for \(\lambda\) is \[\begin{align*} \ell_{\mathsf{p}}(\lambda) = -\frac{n}{2}\ln(2\pi \widehat{\sigma}^2_\lambda) - \frac{n}{2} + (\lambda - 1)\sum_{i=1}^n \ln(y_i) \end{align*}\]
Box and Cox (1964) considered survival times for 48 animals from a randomized trial. The poisons data are balanced: each of the 3 poisons was administered with each of the 4 treatments to four animals.
We could consider a two-way ANOVA without interaction, given the few observations for each combination. The model would be of the form \[\begin{align*} Y &= \beta_0 + \beta_1 \texttt{poison}_2 + \beta_2\texttt{poison}_3 +\beta_3\texttt{treatment}_2 \\ &\qquad+ \beta_4\texttt{treatment}_3 +\beta_5\texttt{treatment}_4 + \varepsilon \end{align*}\]
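A sketch of this analysis, assuming the poisons data frame contains columns time, poison and treatment:
# Two-way model without interaction on the original response scale
linmod.poisons <- lm(time ~ poison + treatment, data = poisons)
# Profile log likelihood for the Box-Cox parameter, with 95% confidence interval
MASS::boxcox(linmod.poisons, lambda = seq(-2, 1, by = 0.1))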
The profile log likelihood for the Box–Cox transformation parameter suggests the value \(\lambda=-1\), which lies within the 95% confidence interval.
The reciprocal response \(Y^{-1}\) corresponds to the speed of action of the poison depending on both poison type and treatment.
The diagnostic plots for this model show no residual structure.
Outliers can impact the fit, the more so if they have high leverage.
Plot the Cook distance \(C_i\) as a function of the leverage \(h_{ii}\), where \[\begin{align*} C_i=\frac{r_i^2h_{ii}}{(p+1)(1-h_{ii})}. \end{align*}\]
Robust regression is less efficient (larger standard errors), but more resistant to outliers.
The theory of robust statistics is beyond the scope of the course.
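A sketch of a robust fit via M-estimation with MASS::rlm, assuming the same formula as before:
# Robust regression with Huber's M-estimator (the default psi function)
robmod <- MASS::rlm(salary ~ rank + field + sex + service + years, data = college)
summary(robmod)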
Comment about transformations
We cannot compare models fitted to \(Y_i\) versus \(\ln Y_i\) using, e.g., information criteria or tests, because the models have different responses.
We can use however the Box–Cox likelihood, which includes the Jacobian of the transformation, to assess the goodness of fit and compare the model with \(\lambda=0\) versus \(\lambda=-1\).
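A sketch of this comparison, reusing the hypothetical linmod.poisons fit sketched above:
# Profile log likelihood of the Box-Cox model at lambda = -1 and lambda = 0
bc <- MASS::boxcox(linmod.poisons, lambda = c(-1, 0), plotit = FALSE)
setNames(bc$y, nm = bc$x)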