8.1 Bias and variance tradeoff

As we increase the penalty λ, the values of the ridge coefficients are shrunk towards zero. The limiting case \lambda \to \infty gives \hat{\boldsymbol{\beta}}_{\mathrm{ridge}}=\boldsymbol{0}_p, whereas we retrieve the OLS estimator \hat{\boldsymbol{\beta}} when \lambda=0.

The mean squared error of the ridge estimator is \begin{align*} \mathrm{MSE}(\hat{\boldsymbol{\gamma}}_{\mathrm{ridge}}^{\lambda}) &= \sigma^2 \mathrm{tr}\left\{(\mathbf{Z}^\top\mathbf{Z} + \lambda \mathbf{I}_p)^{-1}\mathbf{Z}^\top\mathbf{Z}(\mathbf{Z}^\top\mathbf{Z} + \lambda \mathbf{I}_p)^{-1}\right\} \\&\quad + \boldsymbol{\gamma}^\top \left\{ \mathbf{Z}^\top\mathbf{Z}(\mathbf{Z}^\top\mathbf{Z} + \lambda \mathbf{I}_p)^{-1} - \mathbf{I}_p \right\} \left\{ (\mathbf{Z}^\top\mathbf{Z} + \lambda \mathbf{I}_p)^{-1}\mathbf{Z}^\top\mathbf{Z} - \mathbf{I}_p \right\}\boldsymbol{\gamma}. \end{align*}
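This expression follows from decomposing the mean squared error into the trace of the variance matrix plus the squared norm of the bias: assuming the linear model \boldsymbol{y} = \mathbf{Z}\boldsymbol{\gamma} + \boldsymbol{\varepsilon} with \mathrm{E}(\boldsymbol{\varepsilon}) = \boldsymbol{0}_n and \mathrm{Var}(\boldsymbol{\varepsilon}) = \sigma^2\mathbf{I}_n, we have \begin{align*} \mathrm{Var}(\hat{\boldsymbol{\gamma}}_{\mathrm{ridge}}^{\lambda}) &= \sigma^2 (\mathbf{Z}^\top\mathbf{Z} + \lambda \mathbf{I}_p)^{-1}\mathbf{Z}^\top\mathbf{Z}(\mathbf{Z}^\top\mathbf{Z} + \lambda \mathbf{I}_p)^{-1}, \\ \mathrm{E}(\hat{\boldsymbol{\gamma}}_{\mathrm{ridge}}^{\lambda}) - \boldsymbol{\gamma} &= \left\{(\mathbf{Z}^\top\mathbf{Z} + \lambda \mathbf{I}_p)^{-1}\mathbf{Z}^\top\mathbf{Z} - \mathbf{I}_p\right\}\boldsymbol{\gamma} = -\lambda(\mathbf{Z}^\top\mathbf{Z} + \lambda \mathbf{I}_p)^{-1}\boldsymbol{\gamma}, \end{align*} so the bias term of the MSE can equivalently be written \lambda^2 \boldsymbol{\gamma}^\top(\mathbf{Z}^\top\mathbf{Z} + \lambda \mathbf{I}_p)^{-2}\boldsymbol{\gamma}.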

If we knew the true data generating mechanism (i.e., the vector \boldsymbol{\gamma} and the error variance \sigma^2), we could compute the mean squared error (MSE) exactly and find the value of \lambda that achieves the optimal bias-variance tradeoff, i.e., that minimizes the MSE. This is illustrated below in an artificial example. As \lambda \to \infty, the bias grows and eventually dominates the MSE.
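For instance, here is a minimal sketch of this calculation in R; the design \mathbf{Z}, coefficient vector \boldsymbol{\gamma}, error variance \sigma^2 and grid of penalties below are all assumptions made for illustration, not the example used in the notes.

```r
# Sketch (assumed setup): compute the theoretical MSE of the ridge estimator
# on a grid of penalties and locate the value of lambda that minimizes it.
set.seed(1234)
n <- 100; p <- 20
Z <- scale(matrix(rnorm(n * p), nrow = n))  # centered and scaled design (assumed)
gamma <- c(rep(0, 10), rep(1, 10))          # true coefficients (assumed)
sigma_sq <- 4                               # true error variance (assumed)

ridge_mse <- function(lambda, Z, gamma, sigma_sq) {
  p <- ncol(Z)
  ZtZ <- crossprod(Z)
  Minv <- solve(ZtZ + lambda * diag(p))     # (Z'Z + lambda * I_p)^{-1}
  variance <- sigma_sq * sum(diag(Minv %*% ZtZ %*% Minv))
  bias <- (Minv %*% ZtZ - diag(p)) %*% gamma
  variance + sum(bias^2)
}

lambda_grid <- seq(0, 50, by = 0.01)
mse_vals <- sapply(lambda_grid, ridge_mse, Z = Z, gamma = gamma, sigma_sq = sigma_sq)
plot(lambda_grid, mse_vals, type = "l",
     xlab = expression(lambda), ylab = "mean squared error")
lambda_grid[which.min(mse_vals)]            # penalty minimizing the theoretical MSE
```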

We can also look at the path of the coefficient estimates \hat{\boldsymbol{\gamma}}_{\mathrm{ridge}}^{\lambda} as a function of \lambda.

While the coefficient vector as a whole is shrunk towards zero, the estimates of the first 10 coefficients of \boldsymbol{\gamma}, which are exactly zero, stabilize around another value. Note that increasing the penalty from \lambda_1 to \lambda_2 with \lambda_1 < \lambda_2 does not necessarily shrink every individual coefficient estimate: we may have |\hat{\gamma}_j(\lambda_1)| < |\hat{\gamma}_j(\lambda_2)| for some j even though \lambda_1 < \lambda_2. A sketch for tracing such a path is given below.
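A coefficient path of this type can be traced by recomputing the ridge solution (given in closed form below) over a grid of penalties; here is a minimal sketch with simulated data, where the data generating mechanism is again an assumption for illustration.

```r
# Sketch (assumed simulated data): trace the ridge coefficient path over a
# grid of penalties using the closed-form ridge solution given below.
set.seed(1234)
n <- 100; p <- 20
Z <- scale(matrix(rnorm(n * p), nrow = n))  # centered and scaled design (assumed)
gamma <- c(rep(0, 10), rep(1, 10))          # first 10 true coefficients are zero (assumed)
y <- Z %*% gamma + rnorm(n, sd = 2)         # simulated response (assumed)

lambda_grid <- exp(seq(log(0.01), log(1000), length.out = 200))
path <- sapply(lambda_grid, function(lambda) {
  solve(crossprod(Z) + lambda * diag(p), crossprod(Z, y))
})
# one curve per coefficient, penalty displayed on the log scale
matplot(lambda_grid, t(path), type = "l", lty = 1, log = "x",
        xlab = expression(lambda), ylab = "coefficient estimate")
```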

The coefficients \hat{\boldsymbol{\gamma}}^\lambda can be computed using an augmented linear regression, with response vector (\boldsymbol{y}^\top, \mathbf{0}_p^\top)^\top and design matrix [\mathbf{Z}^\top,\; \lambda^{1/2} \mathbf{I}_p]^\top, i.e., \mathbf{Z} stacked on top of \lambda^{1/2}\mathbf{I}_p: the least squares objective of the augmented problem is \|\boldsymbol{y} - \mathbf{Z}\boldsymbol{\gamma}\|^2 + \lambda\|\boldsymbol{\gamma}\|^2, which is precisely the ridge objective. Alternatively, \hat{\boldsymbol{\gamma}}^\lambda = (\mathbf{Z}^\top\mathbf{Z} + \lambda \mathbf{I}_p)^{-1}\mathbf{Z}^\top\boldsymbol{y}.
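As a quick sanity check of this equivalence, we can compare the augmented least squares fit with the closed-form solution, reusing the simulated \mathbf{Z} and \boldsymbol{y} from the sketch above (assumed data) and an arbitrary penalty value.

```r
# Check (assumed data from the sketch above): ridge coefficients obtained from
# an augmented least squares fit versus the closed-form solution.
lambda <- 5                                  # arbitrary penalty value (assumed)
Z_aug <- rbind(Z, sqrt(lambda) * diag(p))    # Z stacked on top of sqrt(lambda) * I_p
y_aug <- c(y, rep(0, p))                     # response (y, 0_p)
fit_aug <- lm(y_aug ~ Z_aug - 1)             # no intercept in the augmented regression
gamma_direct <- solve(crossprod(Z) + lambda * diag(p), crossprod(Z, y))
all.equal(unname(coef(fit_aug)), c(gamma_direct))  # should be TRUE up to numerical error
```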

We can also use the singular value decomposition \mathbf{Z} = \mathbf{UDV}^\top to compute the coefficients as \hat{\boldsymbol{\gamma}}^\lambda = \sum_{j=1}^p \frac{d_j}{d_j^2+\lambda} (\mathbf{u}_j^\top\boldsymbol{y})\, \mathbf{v}_j, where \mathbf{u}_j and \mathbf{v}_j are the jth columns of \mathbf{U} and \mathbf{V}, respectively, and d_j is the jth singular value. This is most useful for cross-validation, where we repeatedly change the value of \lambda, since the SVD of \mathbf{Z} needs to be computed only once.
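The SVD-based formula can be checked against the closed-form solution computed above (same assumed data and penalty as in the previous sketch).

```r
# Check (same assumed data and penalty as above): ridge coefficients from the
# singular value decomposition of Z versus the closed-form solution.
sv <- svd(Z)                                 # Z = U D V'
gamma_svd <- sv$v %*% (sv$d / (sv$d^2 + lambda) * crossprod(sv$u, y))
all.equal(c(gamma_svd), c(gamma_direct))     # should be TRUE up to numerical error
```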
