4.5 Solutions

4.5.1 Exercise 7.1 - Study of growth hormones

We will use factor objects in the sequel. A factor encodes a matrix of binary indicator variables, with levels potentially identified using strings, so that the output is readable. R knows how to handle such a vector when it is passed to, e.g., the function lm. Since the columns of the full indicator matrix sum to \(\mathbf{1}_n\), the first level (in alphabetical order) is dropped by default to avoid collinearity with the intercept; the intercept then corresponds to the mean of this reference level.
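A minimal sketch of commands that could produce output of the kind shown below; the data frame name hormone is an assumption, since only the summary, the design matrices, and the confidence intervals appear in the original.

summary(hormone)
# indicator (dummy) matrix with one column per level of the factor
t(model.matrix(~ group - 1, data = hormone))
# default treatment contrasts: "control" is the reference level
model.matrix(~ x * group, data = hormone)
# 95% confidence intervals for the interaction model
# (only the rows for the group terms are displayed below)
confint(lm(y ~ x * group, data = hormone))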

##        y               x                 group   
##  Min.   :107.0   Min.   : 78.00   control   :10  
##  1st Qu.:122.0   1st Qu.: 93.75   thiouracil:10  
##  Median :140.0   Median :100.50                  
##  Mean   :142.4   Mean   :100.90                  
##  3rd Qu.:156.5   3rd Qu.:109.25                  
##  Max.   :185.0   Max.   :123.00
##                 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## groupcontrol    1 1 1 1 1 1 1 1 1  1  0  0  0  0  0  0  0  0  0  0
## groupthiouracil 0 0 0 0 0 0 0 0 0  0  1  1  1  1  1  1  1  1  1  1
## attr(,"assign")
## [1] 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
##    (Intercept)   x groupthiouracil x:groupthiouracil
## 1            1 114               0                 0
## 2            1 123               0                 0
## 3            1 111               0                 0
## 4            1 100               0                 0
## 5            1 104               0                 0
## 6            1 102               0                 0
## 7            1  94               0                 0
## 8            1 112               0                 0
## 9            1  90               0                 0
## 10           1 110               0                 0
## 11           1 109               1               109
## 12           1 101               1               101
## 13           1 100               1               100
## 14           1 100               1               100
## 15           1 101               1               101
## 16           1  92               1                92
## 17           1  95               1                95
## 18           1  93               1                93
## 19           1  78               1                78
## 20           1  89               1                89
## attr(,"assign")
## [1] 0 1 2 3
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
##                        2.5 %      97.5 %
## groupthiouracil   -87.550757 134.1308431
## x:groupthiouracil  -1.594919   0.6083507

4.5.2 Exercise 7.2 - Electric production of windmills

The dataset windmill contains measurements of the direct current (DC) output of a wind turbine over 25 separate fifteen-minute periods. We are interested in the relation between the output and the average wind speed (measured in miles per hour) during the recording period.

  1. Fit a linear model with wind speed as covariate and plot the standardized residuals against the fitted values. Do you notice any residual structure missed by the mean specification of the model? Try fitting a model using the reciprocal of wind speed as covariate. Comment on the adequacy of the two models.
  2. Predict, using both models in turn, the output of electricity given that the average wind speed in a given period is 5 miles per hour. Provide a prediction interval for each estimate.
  3. Produce a standard Gaussian quantile-quantile plot of the standardized residuals. Superimpose approximate pointwise confidence intervals.

There is some structure left in the model output ~ velocity, since the smallest residuals occur at both ends of the range of fitted values. There is less visible structure in the model with the reciprocal. The second model appears to fit better, since its \(\mathrm{R}^2\) value is 0.98 compared to 0.87 for the first model. Note that, in the second model, the intercept corresponds to the output under infinitely strong wind, since the reciprocal of the velocity then tends to zero.

## [1] TRUE

The predicted output is 1.34 units of electricity for the first model, while the point forecast is 1.59 for the model with the reciprocal velocity. Both prediction intervals overlap, but the second one, [1.39, 1.79], is considerably narrower than the first, [0.84, 1.84].
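A sketch of the two fits and of the prediction intervals at 5 miles per hour, assuming the columns of windmill are named output and velocity as in the calls shown later:

fit1 <- lm(output ~ velocity, data = windmill)
fit2 <- lm(output ~ I(1 / velocity), data = windmill)
newdat <- data.frame(velocity = 5)
# point forecasts with 95% prediction intervals
predict(fit1, newdata = newdat, interval = "prediction")
predict(fit2, newdata = newdat, interval = "prediction")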

## 
## Call:
## lm(formula = output ~ velocity - 1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.51276 -0.17156  0.08675  0.20282  0.36832 
## 
## Coefficients:
##          Estimate Std. Error t value Pr(>|t|)    
## velocity  0.25949    0.00715   36.29   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2364 on 24 degrees of freedom
## Multiple R-squared:  0.9821, Adjusted R-squared:  0.9814 
## F-statistic:  1317 on 1 and 24 DF,  p-value: < 2.2e-16

The function update changes the arguments of an existing linear model. Here, the . means keep all variables on the left- or right-hand side of the formula. You can also use the ~ . shorthand with a dataset to include all the remaining variables as covariates after specifying the response: for example, lm(output ~ ., data = windmill) would have velocity as the only covariate.
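For instance, the no-intercept fit whose summary is printed above could be obtained from the first model as follows (a sketch; fit1 is the hypothetical name of the initial fit):

fit1 <- lm(output ~ velocity, data = windmill)
# keep the same response and covariates, but drop the intercept
fit_noint <- update(fit1, . ~ . - 1)
summary(fit_noint)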

We notice first that the confidence interval for \(\beta_0\), the intercept, includes zero, so we cannot reject the null hypothesis that \(\beta_0=0\) at the 5% level.

The coefficient \(\beta_1\) corresponding to the effect of velocity has a smaller standard error than in the first model. Does this make sense? If a model is correctly specified, adding new variables that are unrelated to the response does not introduce bias, but it necessarily inflates the standard errors, by the Gauss–Markov theorem. However, if the intercept should truly be in the model (it may be needed, for instance, to absorb measurement errors) and \(\beta_0 \neq 0\), then the tests and confidence intervals of the simplified model are invalid.

The multiple \(\mathrm{R}^2_c\) goes through the roof, but it makes no sense here because it compares two models that are not nested (a model containing only a constant mean versus one with no constant at all). A consequence of removing the intercept is that the average of the residuals is no longer zero and that R reports a different value for Multiple R-squared.

If you remove the intercept of an lm object using -1, the value returned by summary under Multiple R-squared is the uncentered \(R^2\), not the centered \(R^2_c\)!
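A short sketch of the two definitions, reusing the hypothetical fit_noint from above:

y <- model.response(model.frame(fit_noint))
yhat <- fitted(fit_noint)
1 - sum((y - yhat)^2) / sum((y - mean(y))^2)  # centered R2_c, relative to the sample mean
1 - sum((y - yhat)^2) / sum(y^2)              # uncentered R2, reported by summary() without intercept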

We now produce the quantile-quantile plots using the results described in Section 4.4.

The simulated pointwise confidence intervals are narrower and account for the scaling of the standardized residuals.
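One possible implementation of the simulated envelope, assuming the chosen fit is stored in fit2 (a hypothetical name): simulate Gaussian responses from the fitted design, refit, and take pointwise quantiles of the sorted standardized residuals.

set.seed(2020)  # for reproducibility
X <- model.matrix(fit2)
n <- nrow(X)
B <- 999
# n x B matrix whose columns are sorted standardized residuals simulated under the model
sims <- replicate(B, sort(rstandard(lm(rnorm(n) ~ X - 1))))
env <- apply(sims, 1, quantile, probs = c(0.025, 0.975))
pts <- qqnorm(rstandard(fit2), plot.it = FALSE)
ord <- order(pts$x)
plot(pts$x[ord], pts$y[ord], xlab = "theoretical quantiles",
     ylab = "standardized residuals")
abline(a = 0, b = 1, lty = 2)         # reference line
lines(pts$x[ord], env[1, ], lty = 3)  # lower pointwise bound
lines(pts$x[ord], env[2, ], lty = 3)  # upper pointwise bound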

4.5.3 Exercise 7.3 - Air traffic

First load the data set and plot the observations.

##       1 
## 501.263
##         [,1]
## [1,] 501.263
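One way output of this form could arise: the same prediction computed once with predict and once by direct matrix multiplication, which explains the two different print formats (the object names below are hypothetical).

fit <- lm(passengers ~ ., data = airtraffic)        # hypothetical data frame and response
predict(fit)[1]                                     # fitted value of the first observation
model.matrix(fit)[1, , drop = FALSE] %*% coef(fit)  # same value, computed as x^T beta-hat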

We notice that the model does a good overall job of capturing the main features, but misses several things. The first is that the relationship is not quite linear: a residual plot shows a somewhat quadratic relation between the fitted values and the residuals. The second obvious feature not captured is the change in the variation (the amplitude of the wave pattern changes over time). Since the variance is increasing, a log-transformation may help stabilize it. The residuals do not appear close to normal (the last values are systematically too large) and there is some skewness. The last few points have large leverage and drive the curve up.

Let us now consider the log counts. The fit is much better and the quadratic pattern in the plot of residuals against fitted values is attenuated. While some points still have a high leverage value, they are not flagged as outliers.

One hypothesis of the linear model that is clearly violated here is the independence of the errors. Even though the residuals satisfy \(\mathsf{Var}(\boldsymbol{e})=\sigma^2\mathbf{M}_{\mathbf{X}}\), and so are not independent by construction, the positive dependence from one residual to the next observed here goes beyond this. This is confirmed by looking at the autocorrelation, which decays geometrically. This will be covered in MATH 342 (Time Series); for now, you should think of shocks carrying through to subsequent periods before the series reverts to the mean.
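A sketch of the diagnostic, assuming the log-count fit is stored in fit_log (a hypothetical name):

# sample autocorrelation function of the residuals;
# geometric decay is suggestive of an autoregressive structure
acf(resid(fit_log), main = "autocorrelation of residuals")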

Ignoring the serial dependence in the errors has consequences: the standard errors are too small (since the errors are positively correlated, there are effectively fewer independent units of information), so we are overconfident in our uncertainty quantification.
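One common way to see (and partially correct for) this, not part of the original output, is to compare the naive standard errors with heteroskedasticity and autocorrelation consistent (HAC) ones; a sketch using the sandwich and lmtest packages, with the hypothetical fit_log from above:

library(sandwich)
library(lmtest)
coeftest(fit_log)                              # naive i.i.d. standard errors
coeftest(fit_log, vcov. = NeweyWest(fit_log))  # HAC (Newey-West) standard errors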

4.5.4 Exercise 7.4 - Determinants of earnings

##                  2.5 %     97.5 %
## (Intercept) 2.29267414 2.51468758
## education   0.01030330 0.02940631
## pseduc      0.03325994 0.06318974

## 
## Call:
## lm(formula = lhwages ~ ., data = labour)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.10953 -0.25249  0.02989  0.28176  1.27908 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.200103   0.055375  39.731  < 2e-16 ***
## education   0.026589   0.004673   5.690 1.38e-08 ***
## genderMale  0.256471   0.014771  17.363  < 2e-16 ***
## pseduc      0.037459   0.007322   5.116 3.31e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4154 on 3183 degrees of freedom
## Multiple R-squared:  0.1897, Adjusted R-squared:  0.1889 
## F-statistic: 248.3 on 3 and 3183 DF,  p-value: < 2.2e-16
##                  2.5 %     97.5 %
## (Intercept) 2.09152815 2.30867726
## education   0.01742684 0.03575089
## genderMale  0.22750913 0.28543379
## pseduc      0.02310256 0.05181598

The coefficient for gender is still statistically significant at level \(\alpha=5\%\) after adjusting for the education level.
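A sketch of the two fits compared above; the data frame labour and the formula of the full model are taken from the printed call, while the reduced model is inferred from the first set of confidence intervals:

fit_small <- lm(lhwages ~ education + pseduc, data = labour)
confint(fit_small)                           # intervals without adjusting for gender
fit_full <- lm(lhwages ~ ., data = labour)   # adds gender as covariate
summary(fit_full)
confint(fit_full)                            # the interval for genderMale excludes zero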