3.3 Example: seasonal effects
Suppose we are interested in estimating the seasonal model for quarterly data. Since each observation is recorded on one trimester, it could be postulated that the time of the year affect the response. This effect can be built in the design and tested for.
Consider a design matrix whose columns include \((\mathbf{S}_1, \mathbf{S}_2, \mathbf{S}_3, \mathbf{S}_4)\). The entries of the seasonal dummies are 0/1 depending on the season; for example, \(\mathrm{S}_{i1}=1\) if the observation is recorded in the spring and \(\mathrm{S}_{i1}=0\) otherwise, so that, for time ordered response vectors starting in the spring, we can write \(\mathbf{S}_1=(1,0,0,0,1,0,0,\ldots)^\top\) and \(\mathbf{S}_1 + \mathbf{S}_2 + \mathbf{S}_3 + \mathbf{S}_4 = \mathbf{1}_n\). Thus, a regression model cannot include a mean and all four seasonal variables, but since \(\mathscr{S}(\mathbf{S}_1, \mathbf{S}_2, \mathbf{S}_3, \mathbf{S}_4) = \mathscr{S}(\mathbf{S}_1, \mathbf{S}_2, \mathbf{S}_3, \mathbf{1}_n) = \mathscr{S}(\mathbf{S}_2, \mathbf{S}_3, \mathbf{S}_4, \mathbf{1}_n)\), any set of four vectors can be used.
If we drop the constant vector \(\mathbf{1}_n\) from the regression (in R, by writing 0 or -1 on the right hand side of the lm formula), the coefficients \(\alpha_i\), \(i=1, \ldots, 4\) in regression
\[\boldsymbol{y} = \mathbf{S}_1\alpha_1 +\mathbf{S}_2\alpha_2  + \mathbf{S}_3\alpha_3 + \mathbf{S}_4\alpha_4 + \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\]
correspond to the intercept for season \(i\). If we include instead the dummy \(\mathbf{S}_1\) and fit
\[\boldsymbol{y} = \mathbf{1}_n\gamma_1 +\mathbf{S}_2\gamma_2  + \mathbf{S}_3\gamma_3 + \mathbf{S}_4\gamma_4 + \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon},\]
then the constant for season 1 is \(\alpha_1=\gamma_1\), and is \(\alpha_j=\gamma_1+\gamma_j\) for season \(j \in \{2, 3, 4\}\). The reference level is therefore the first season \(\gamma_1\) and the coefficients \(\{\gamma_j\}, j \in \{2, 3, 4\}\) are contrasts (the difference between the baseline and the seasonal level).
As just illustrated above, the interpretation of the coefficients when we have factors differs. Introducing factors lead to different intercepts for the different groups, whereas interactions with the factors lead to different slopes. Since we cannot include all levels of the factor as well as an intercept, the interpretation of the coefficient for the constant is the intercept of the baseline, i.e., the ommitted factor.
For simplicity, consider the simple case with an intercept and a single factor. Why do the coefficients \(\beta_1\) and \(\beta_1'\) associated to the first columns of models \[\begin{align*} \mathbf{X}\boldsymbol{\beta} = \begin{pmatrix} \mathbf{1}_{n_1} & \mathbf{0}_{n_1} \\ \mathbf{1}_{n_2} & \mathbf{1}_{n_2} \\ \end{pmatrix}\begin{pmatrix} \beta_1 & \beta_2\end{pmatrix}^\top, \qquad \mathbf{X}'\boldsymbol{\beta}' = \begin{pmatrix} \mathbf{1}_{n_1} & \mathbf{0}_{n_1} \\ \mathbf{0}_{n_2} & \mathbf{1}_{n_2} \\ \end{pmatrix}\begin{pmatrix} \beta_1' & \beta_2'\end{pmatrix}^\top \end{align*}\] have the same interpretation? For the matrix \(\mathbf{X}\) consisting of orthogonal regressors, this is clear. For the matrix \(\mathbf{X}'\), recall the FWL theorem: the regression coefficient \(\beta_1'\) is the same as that of the regression \(\mathbf{M}_{\mathbf{X}_2}\boldsymbol{y} = \mathbf{M}_{\mathbf{X}_2}\mathbf{1}_n + \boldsymbol{u}\) for \(\mathbf{X}_2 = (\mathbf{0}_{n_1}^\top, \mathbf{1}_{n_2}^\top)^\top\). The matrix \(\mathbf{M}_{\mathbf{X}_2}\) is a block matrix, whose first \(n_1 \times n_1\) block contains entries \(n_1^{-1}\) and the rest of the entries is zero. \(\mathbf{M}_{\mathbf{X}_2}\boldsymbol{y}\) does not affect the last \(n_2\) entries of \(\boldsymbol{y}\), while \(\mathbf{M}_{\mathbf{X}_2}\mathbf{1}_n = \mathbf{1}_n - \mathbf{X}_2\). This reasoning generalizes to more complex settings with a slope and other regressors.
The discussion is illustrated on a dataset consisting of quarterly measurements of the gas consumption in the United Kingdom, from 1960 to 1986.
data(UKgas)
quarter <- rep(1:4, length = length(UKgas)) #create vector 1, 2, 3, 4, 1, ...
is.factor(quarter)## [1] FALSE## [1] TRUE## [1] "factor"## [1] 1 2 3 4 1 2
## Levels: 1 2 3 4## [1] "1" "2" "3" "4"The first model is of the form \[ \boldsymbol{y} = \mathbf{Q}_1 \alpha_1 + \mathbf{Q}_2 \alpha_2 + \mathbf{Q}_3 \alpha_3 + \mathbf{Q}_4 \alpha_4 + \boldsymbol{\varepsilon}. \]
## quarterQ1 quarterQ2 quarterQ3 quarterQ4 
##  5.989307  5.586806  5.039595  5.700242##    quarterQ1 quarterQ2 quarterQ3 quarterQ4
## 1          1         0         0         0
## 2          0         1         0         0
## 3          0         0         1         0
## 4          0         0         0         1
## 5          1         0         0         0
## 6          0         1         0         0
## 7          0         0         1         0
## 8          0         0         0         1
## 9          1         0         0         0
## 10         0         1         0         0The model with all the quarterly dummies gives the quarterly average value \(\exp(\alpha_j)\) in quarter \(j\), \(j = 1, \ldots, 4\).
If we include an intercept, the first factor is selected as baseline and the coefficients of the quarters Q2 to Q4 are contrasts.
For this model, say
\[ \boldsymbol{y} = \mathbf{1}_n \gamma_1 + \mathbf{Q}_2 \gamma_2 + \mathbf{Q}_3 \gamma_3 + \mathbf{Q}_4 \gamma_4 + \boldsymbol{\varepsilon},
\]
#New parameterization, with a constant
gas_lm2 <- lm(log(UKgas) ~ quarter) #R drops collinear by default and fits with Q2-Q4
coef(gas_lm2)## (Intercept)   quarterQ2   quarterQ3   quarterQ4 
##   5.9893074  -0.4025014  -0.9497127  -0.2890659##    (Intercept) quarterQ2 quarterQ3 quarterQ4
## 1            1         0         0         0
## 2            1         1         0         0
## 3            1         0         1         0
## 4            1         0         0         1
## 5            1         0         0         0
## 6            1         1         0         0
## 7            1         0         1         0
## 8            1         0         0         1
## 9            1         0         0         0
## 10           1         1         0         0The estimated average gas consumption in millions of therms is \(\exp(\widehat{\gamma}_1)=\exp(\widehat{\alpha}_1)\) for the first quarter, \(\exp(\widehat{\alpha}_2)=\exp(\widehat{\gamma}_1+\widehat{\gamma}_2)\) for the second quarter, etc.
We can check that the interpretation is correct.
isTRUE(all.equal((coef(gas_lm2)[1] + coef(gas_lm2)[-1]), 
                 coef(gas_lm1)[-1], check.attributes = FALSE))## [1] TRUE