8.1 Bias and variance tradeoff
As we increase the penalty \(\lambda\), the values of the ridge coefficients are shrunk towards zero. The case \(\lambda \to \infty\) gives \(\hat{\boldsymbol{\beta}}_{\mathrm{ridge}}=\boldsymbol{0}_p\), whereas we retrieve the OLS estimator \(\hat{\boldsymbol{\beta}}\) when \(\lambda=0\).
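Before turning to the mean squared error, a minimal sketch with simulated data (the variables X, yy and the helper ridge_coef below are purely illustrative and not part of the running example) confirms these two limiting cases:
# Minimal sketch: ridge with lambda = 0 matches OLS, and a very large
# penalty shrinks all coefficients essentially to zero
set.seed(123)
n <- 50L; p <- 5L
X <- matrix(rnorm(n * p), ncol = p)
yy <- c(X %*% runif(p) + rnorm(n))
ridge_coef <- function(lambda){
  c(solve(crossprod(X) + lambda * diag(p), crossprod(X, yy)))
}
# lambda = 0 recovers the least squares coefficients
isTRUE(all.equal(ridge_coef(0), unname(coef(lm(yy ~ -1 + X)))))
# a very large penalty drives all coefficients towards zero
max(abs(ridge_coef(1e6)))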
The mean squared error of the ridge estimator is \[\begin{align*} \mathrm{MSE}(\hat{\boldsymbol{\gamma}}_{\mathrm{ridge}}^{\lambda}) &= \sigma^2 \mathrm{tr}\left\{(\mathbf{Z}^\top\mathbf{Z} + \lambda \mathbf{I}_p)^{-1}\mathbf{Z}^\top\mathbf{Z}(\mathbf{Z}^\top\mathbf{Z} + \lambda \mathbf{I}_p)^{-1}\right\} \\&\quad + \boldsymbol{\gamma}^\top \left\{ \mathbf{Z}^\top\mathbf{Z}(\mathbf{Z}^\top\mathbf{Z} + \lambda \mathbf{I}_p)^{-1} - \mathbf{I}_p \right\} \left\{ (\mathbf{Z}^\top\mathbf{Z} + \lambda \mathbf{I}_p)^{-1}\mathbf{Z}^\top\mathbf{Z} - \mathbf{I}_p \right\}\boldsymbol{\gamma}, \end{align*}\] where the first term is the trace of the variance matrix and the second is the squared bias.
If we knew the true data generating mechanism (i.e., the \(\boldsymbol{\gamma}\) vector and \(\sigma^2\)), we could compute the mean squared error (MSE) of the estimator exactly and find the bias-variance tradeoff that minimizes it. This is illustrated below in an artificial example. As \(\lambda \to \infty\), the squared bias grows and dominates the MSE.
# Function to compute the bias, variance and MSE of the ridge estimator
mse_ridge <- function(gamma, lambda, Z, sigmasq = 1){
  ZtZ <- crossprod(Z)
  p <- ncol(Z)
  W <- solve(ZtZ + lambda * diag(p))
  # Bias of the ridge estimator: {(Z'Z + lambda*I)^{-1} Z'Z - I} gamma
  bias <- c((W %*% ZtZ - diag(p)) %*% gamma)
  # Variance of each coefficient: diagonal of sigma^2 * W Z'Z W
  varia <- sigmasq * diag(crossprod(Z %*% W))
  list(bias = bias, variance = varia, mse = sum(bias^2 + varia))
}
set.seed(9876)
# Create fake data
Z <- matrix(rnorm(n = 20L*50L, mean = 0, sd = 1), ncol = 20L)
# Center and renormalize Z
Z <- apply(Z, 2, scale)
# Create coefficient vector
gamma <- c(rep(0, 10L), runif(10L))
# Create sequence of lambda and matrix to store results
nlam <- 401L
lambda_seq <- seq(0, 100, length = nlam)
mse <- matrix(0, nrow = nlam, ncol = 3)
gammahat <- matrix(0, nrow = nlam, ncol = ncol(Z))
for(i in 1:nlam){
  # Evaluate bias and variance for each lambda
  mse_i <- mse_ridge(gamma = gamma, lambda = lambda_seq[i], Z = Z)
  # Expected value of the ridge coefficients: gamma + bias
  gammahat[i,] <- gamma + mse_i$bias
  mse[i,1] <- sum(mse_i$bias^2)
  mse[i,2] <- sum(mse_i$variance)
  mse[i,3] <- mse_i$mse
}
# Plot the results as a function of lambda
matplot(lambda_seq, mse, type = "l", lty = 1,
        bty = "l", xlab = expression(lambda), col = 3:1,
        ylab = "Mean squared error decomposition")
# Horizontal line at the MSE of the unpenalized (lambda = 0) estimator
abline(h = mse[1,3], lty = 2)
legend(x = "topleft", legend = c("sq. bias", "variance", "mse"),
       col = 3:1, lty = 1, bty = "n")
# Compute the value of lambda that minimizes MSE
lambdaopt <- optimize(f = function(x){
    mse_ridge(gamma = gamma, lambda = x, Z = Z)$mse
  }, interval = c(0, 30))$minimum
lambdaopt
## [1] 7.227898
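For reference, the MSE of the ridge estimator at the optimal penalty can be set against that of the unpenalized (\(\lambda = 0\)) estimator; this quick comparison reuses the mse_ridge function defined above.
# Compare the MSE at the optimal penalty with the unpenalized (OLS) case
mse_ridge(gamma = gamma, lambda = lambdaopt, Z = Z)$mse
mse_ridge(gamma = gamma, lambda = 0, Z = Z)$mse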
We can also look at the path of coefficient values \(\hat{\boldsymbol{\gamma}}_{\mathrm{ridge}}^{\lambda}\) as a function of \(\lambda\):
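One way to draw this path (a sketch reusing the expected coefficient values stored in gammahat above) is the following:
# Plot the path of the expected ridge coefficients as a function of lambda
matplot(lambda_seq, gammahat, type = "l", lty = 1,
        bty = "l", xlab = expression(lambda),
        ylab = expression(hat(gamma)[ridge]))
abline(h = 0, lty = 2)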
While the coefficient vector as a whole is shrunk towards zero, the first 10 coefficients of \(\boldsymbol{\gamma}\), whose true values are exactly zero, stabilize around other values. Note that increasing the penalty from \(\lambda_1\) to \(\lambda_2\) with \(\lambda_1 < \lambda_2\) does not necessarily shrink every individual coefficient estimate: we may have \(|\hat{\gamma}_j(\lambda_2)| > |\hat{\gamma}_j(\lambda_1)|\) for some \(j\) even though \(\lambda_1 < \lambda_2\).
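As a quick check on the simulated example, the stored path gammahat can be inspected for such non-monotone behaviour (a sketch; a value of TRUE would indicate that at least one expected coefficient moves away from zero as \(\lambda\) increases):
# Any positive difference in absolute value means a coefficient moved
# away from zero when lambda increased
any(diff(abs(gammahat)) > 0)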
The coefficients \(\hat{\boldsymbol{\gamma}}^\lambda\) can be computed using an augmented linear regression, with response \((\boldsymbol{y}^\top, \boldsymbol{0}_p^\top)^\top\) and regressor matrix \([\mathbf{Z}^\top,\; \lambda^{1/2} \mathbf{I}_p]^\top\). Alternatively, \[\hat{\boldsymbol{\gamma}}^\lambda = (\mathbf{Z}^\top\mathbf{Z} + \lambda \mathbf{I}_p)^{-1}\mathbf{Z}^\top\boldsymbol{y}.\]
We can also use the singular value decomposition \(\mathbf{Z} = \mathbf{UDV}^\top\) to compute the coefficients, \[\hat{\boldsymbol{\gamma}}^\lambda = \sum_{j=1}^p \frac{d_j}{d_j^2+\lambda} (\mathbf{u}_j^\top\boldsymbol{y})\, \mathbf{v}_j,\] where \(\mathbf{u}_j\) and \(\mathbf{v}_j\) are the \(j\)th columns of \(\mathbf{U}\) and \(\mathbf{V}\), respectively. This is most useful for cross-validation, where we repeatedly change the value of \(\lambda\): the SVD of \(\mathbf{Z}\) does not change, so it only needs to be computed once; a numerical check appears after the augmented-regression comparison below.
# Create response vector from model
y <- Z %*% gamma + rnorm(nrow(Z))
# Compute ridge coefficients
ridge_lm_coef <- function(y, Z, lambda){
  Z <- scale(Z)
  c(solve(crossprod(Z) + lambda * diag(ncol(Z))) %*% crossprod(Z, y - mean(y)))
}
lambda0 <- 2
# Compare coefficients obtained from fitting the augmented regression
augmy <- c(y - mean(y), rep(0, ncol(Z)))
augmZ <- rbind(Z, sqrt(lambda0)*diag(ncol(Z)))
augmlm <- lm(augmy ~ -1 + augmZ)
isTRUE(all.equal(as.vector(augmlm$coef),
                 ridge_lm_coef(y = y, Z = Z, lambda = lambda0)))
## [1] TRUE
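The SVD representation above can likewise be checked numerically against the closed-form solution; the helper ridge_svd_coef below is a sketch introduced here for illustration.
# Compute ridge coefficients via the singular value decomposition of Z
ridge_svd_coef <- function(y, Z, lambda){
  Z <- scale(Z)
  svdZ <- svd(Z)
  # gamma-hat = sum_j d_j / (d_j^2 + lambda) * (u_j' y) v_j
  c(svdZ$v %*% (svdZ$d / (svdZ$d^2 + lambda) * crossprod(svdZ$u, y - mean(y))))
}
# Should agree with the closed-form ridge coefficients
isTRUE(all.equal(ridge_svd_coef(y = y, Z = Z, lambda = lambda0),
                 ridge_lm_coef(y = y, Z = Z, lambda = lambda0)))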