8.1 Bias and variance tradeoff

As we increase the penalty \(\lambda\), the values of the ridge coefficients are shrunk towards zero. The case \(\lambda \to \infty\) gives \(\hat{\boldsymbol{\beta}}_{\mathrm{ridge}}=\boldsymbol{0}_p\), whereas we retrieve the OLS estimator \(\hat{\boldsymbol{\beta}}\) when \(\lambda=0\).

The mean squared error of the ridge estimator is \[\begin{align*} \mathrm{MSE}(\hat{\boldsymbol{\gamma}}_{\mathrm{ridge}}^{\lambda}) &= \sigma^2 \mathrm{tr}\left\{(\mathbf{Z}^\top\mathbf{Z} + \lambda \mathbf{I}_p)^{-1}\mathbf{Z}^\top\mathbf{Z}(\mathbf{Z}^\top\mathbf{Z} + \lambda \mathbf{I}_p)^{-1}\right\} \\&\quad + \boldsymbol{\gamma}^\top \left\{ \mathbf{Z}^\top\mathbf{Z}(\mathbf{Z}^\top\mathbf{Z} + \lambda \mathbf{I}_p)^{-1} - \mathbf{I}_p \right\} \left\{ (\mathbf{Z}^\top\mathbf{Z} + \lambda \mathbf{I}_p)^{-1}\mathbf{Z}^\top\mathbf{Z} - \mathbf{I}_p \right\}\boldsymbol{\gamma}; \end{align*}\] the first term is the total variance of the estimator and the second is the squared norm of its bias.
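For concreteness, here is a minimal R sketch of this formula; the helper name `ridge_mse` and its arguments are ours for illustration, not from any package.

```r
# Illustrative helper: MSE of the ridge estimator for penalty lambda,
# given design matrix Z, true coefficients gamma and error variance sigma2.
ridge_mse <- function(lambda, Z, gamma, sigma2) {
  p <- ncol(Z)
  ZtZ <- crossprod(Z)                      # Z^T Z
  Rinv <- solve(ZtZ + lambda * diag(p))    # (Z^T Z + lambda I_p)^{-1}
  variance <- sigma2 * sum(diag(Rinv %*% ZtZ %*% Rinv))
  bias <- (Rinv %*% ZtZ - diag(p)) %*% gamma
  variance + sum(bias^2)                   # total variance + squared bias
}
```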

If we knew the true data generating mechanism (i.e., the vector \(\boldsymbol{\gamma}\) and the error variance \(\sigma^2\)), we could compute the mean squared error (MSE) of the ridge estimator exactly and find the penalty \(\lambda\) that achieves the optimal bias-variance tradeoff, i.e., that minimizes the MSE. This is illustrated below in an artificial example. As \(\lambda \to \infty\), the bias grows and eventually dominates the MSE.
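A sketch of such an artificial example is given below; the simulated design, the true coefficients (the first 10 of which are set exactly to zero, mirroring the example discussed later) and the grid of penalties are all arbitrary choices of ours, not the exact setup used for the figures.

```r
# Artificial example (assumed setup): standardized design, known gamma and sigma^2
set.seed(1234)
n <- 100; p <- 20
Z <- scale(matrix(rnorm(n * p), n, p))      # standardized design matrix
gamma <- c(rep(0, 10), runif(10, -2, 2))    # first 10 true coefficients are zero
sigma2 <- 4
lambdas <- seq(0.01, 50, length.out = 500)
mse <- sapply(lambdas, ridge_mse, Z = Z, gamma = gamma, sigma2 = sigma2)
lambdas[which.min(mse)]                     # penalty minimizing the exact MSE
```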


We can also look at the path of coefficient values \(\hat{\boldsymbol{\gamma}}_{\mathrm{ridge}}^{\lambda}\) as a function of \(\lambda\):
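A sketch of how such a path could be traced, reusing the simulated data above: we simulate a response, solve the ridge normal equations for every penalty on the grid, and draw one line per coefficient.

```r
# Simulate a response from the artificial model and trace the ridge path
y <- Z %*% gamma + rnorm(n, sd = sqrt(sigma2))
ridge_coef <- function(lambda, Z, y) {
  solve(crossprod(Z) + lambda * diag(ncol(Z)), crossprod(Z, y))
}
path <- sapply(lambdas, ridge_coef, Z = Z, y = y)   # p x length(lambdas) matrix
matplot(lambdas, t(path), type = "l", lty = 1,
        xlab = expression(lambda), ylab = "coefficient estimate")
```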

While the coefficient vector as a whole is shrunk towards zero, the first 10 coefficients, whose true values in \(\boldsymbol{\gamma}\) are exactly zero, have estimates that stabilize around other values. Note that increasing the penalty from \(\lambda_1\) to \(\lambda_2\), with \(\lambda_1 < \lambda_2\), does not necessarily imply that each individual coefficient estimate decreases in absolute value: we may have \(|\hat{\gamma}_j(\lambda_2)| > |\hat{\gamma}_j(\lambda_1)|\) for some \(j\).

The coefficients \(\hat{\boldsymbol{\gamma}}^\lambda\) can be computed from an augmented linear regression with response \((\boldsymbol{y}^\top, \boldsymbol{0}_p^\top)^\top\) and design matrix \([\mathbf{Z}^\top,\; \lambda^{1/2} \mathbf{I}_p]^\top\), i.e., \(\mathbf{Z}\) stacked on top of \(\lambda^{1/2}\mathbf{I}_p\). Alternatively, \[\hat{\boldsymbol{\gamma}}^\lambda = (\mathbf{Z}^\top\mathbf{Z} + \lambda \mathbf{I}_p)^{-1}\mathbf{Z}^\top\boldsymbol{y}.\]
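As a sanity check, still on the assumed simulated data, the augmented least squares fit can be compared with the closed-form expression; `lm.fit` is used without an intercept so that the extra \(p\) rows act as the penalty. The penalty value below is arbitrary.

```r
lambda <- 2                                    # arbitrary penalty for illustration
Zaug <- rbind(Z, sqrt(lambda) * diag(p))       # Z stacked on top of sqrt(lambda) I_p
yaug <- c(y, rep(0, p))                        # augmented response (y, 0_p)
coef_aug <- lm.fit(x = Zaug, y = yaug)$coefficients
coef_direct <- solve(crossprod(Z) + lambda * diag(p), crossprod(Z, y))
all.equal(unname(coef_aug), c(coef_direct))    # should return TRUE
```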

We can also use the singular value decomposition \(\mathbf{Z} = \mathbf{UDV}^\top\) to compute the coefficients, \[\hat{\boldsymbol{\gamma}}^\lambda = \sum_{j=1}^p \frac{d_j}{d_j^2+\lambda} (\mathbf{u}_j^\top\boldsymbol{y})\, \mathbf{v}_j,\] where \(\mathbf{u}_j\) and \(\mathbf{v}_j\) are the \(j\)th columns of \(\mathbf{U}\) and \(\mathbf{V}\), respectively, and \(d_j\) is the \(j\)th singular value. This representation is most useful for cross-validation, where we repeatedly change the value of \(\lambda\): the SVD of \(\mathbf{Z}\) only needs to be computed once.
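The SVD route can be sketched as follows, again reusing the simulated data and the `ridge_coef` helper defined above: the decomposition is computed once and recycled for every penalty on the grid.

```r
sv <- svd(Z)                                   # Z = U D V^T, computed only once
ridge_coef_svd <- function(lambda, sv, y) {
  # weights d_j / (d_j^2 + lambda) applied to u_j^T y, recombined through V
  sv$v %*% (sv$d / (sv$d^2 + lambda) * crossprod(sv$u, y))
}
coef_svd <- sapply(lambdas, ridge_coef_svd, sv = sv, y = y)
all.equal(c(coef_svd[, 1]), c(ridge_coef(lambdas[1], Z, y)))  # should return TRUE
```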

The two approaches give numerically identical coefficients.