8.1 Bias and variance tradeoff
As we increase the penalty \lambda, the ridge coefficients are shrunk towards zero: in the limit \lambda\to\infty we get \hat{\boldsymbol{\beta}}_{\mathrm{ridge}}=\boldsymbol{0}_p, whereas we recover the OLS estimator \hat{\boldsymbol{\beta}} when \lambda=0.
The mean squared error of the ridge estimator is \begin{align*} \mathrm{MSE}(\hat{\boldsymbol{\gamma}}_{\mathrm{ridge}}^{\lambda}) &= \sigma^2 \mathrm{tr}\left\{(\mathbf{Z}^\top\mathbf{Z} + \lambda \mathbf{I}_p)^{-1}\mathbf{Z}^\top\mathbf{Z}(\mathbf{Z}^\top\mathbf{Z} + \lambda \mathbf{I}_p)^{-1}\right\} \\&\quad + \boldsymbol{\gamma}^\top \left\{ \mathbf{Z}^\top\mathbf{Z}(\mathbf{Z}^\top\mathbf{Z} + \lambda \mathbf{I}_p)^{-1} - \mathbf{I}_p \right\} \left\{ (\mathbf{Z}^\top\mathbf{Z} + \lambda \mathbf{I}_p)^{-1}\mathbf{Z}^\top\mathbf{Z} - \mathbf{I}_p \right\}\boldsymbol{\gamma}. \end{align*}
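This expression follows from decomposing the MSE into the trace of the variance matrix plus the squared norm of the bias, since \begin{align*} \mathsf{Var}(\hat{\boldsymbol{\gamma}}_{\mathrm{ridge}}^{\lambda}) &= \sigma^2(\mathbf{Z}^\top\mathbf{Z} + \lambda \mathbf{I}_p)^{-1}\mathbf{Z}^\top\mathbf{Z}(\mathbf{Z}^\top\mathbf{Z} + \lambda \mathbf{I}_p)^{-1}, \\ \mathsf{bias}(\hat{\boldsymbol{\gamma}}_{\mathrm{ridge}}^{\lambda}) &= \left\{(\mathbf{Z}^\top\mathbf{Z} + \lambda \mathbf{I}_p)^{-1}\mathbf{Z}^\top\mathbf{Z} - \mathbf{I}_p\right\}\boldsymbol{\gamma}. \end{align*} These are the bias and variance terms computed by the mse_ridge function below.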
If we knew the true data generating mechanism (i.e. the vector \boldsymbol{\gamma} and the variance \sigma^2), we could compute the mean squared error (MSE) of the estimator exactly and find the value of \lambda that achieves the optimal bias-variance tradeoff, meaning the one that minimizes the MSE. This is illustrated below in an artificial example. As \lambda \to \infty, the squared bias grows and eventually dominates the MSE.
# Function to compute MSE of ridge estimator
mse_ridge <- function(gamma, lambda, Z, sigmasq = 1){
  ZtZ <- crossprod(Z)
  p <- ncol(Z)
  W <- solve(ZtZ + lambda*diag(p))
  # Bias of the ridge estimator: {(Z'Z + lambda*I)^(-1) Z'Z - I} gamma
  bias <- c((W %*% ZtZ - diag(p)) %*% gamma)
  # Variance of each coefficient: diagonal of sigma^2 * W Z'Z W
  varia <- sigmasq * diag(crossprod(Z %*% W))
  list(bias = bias, variance = varia, mse = sum(bias^2 + varia))
}
set.seed(9876)
# Create fake data: n = 50 observations, p = 20 covariates
Z <- matrix(rnorm(n = 20L*50L, mean = 0, sd = 1), ncol = 20L)
# Center and standardize the columns of Z
Z <- apply(Z, 2, scale)
# Create coefficient vector whose first 10 entries are exactly zero
gamma <- c(rep(0, 10L), runif(10L))
# Create sequence of lambda and matrix to store results
nlam <- 401L
lambda_seq <- seq(0, 100, length = nlam)
mse <- matrix(0, nrow = nlam, ncol = 3)
gammahat <- matrix(0, nrow = nlam, ncol = ncol(Z))
for(i in 1:nlam){
  # Evaluate bias and variance for each lambda
  mse_i <- mse_ridge(gamma = gamma, lambda = lambda_seq[i], Z = Z)
  # Expected value of the ridge estimator: true gamma plus bias
  gammahat[i,] <- gamma + mse_i$bias
  mse[i,1] <- sum(mse_i$bias^2)   # squared bias
  mse[i,2] <- sum(mse_i$variance) # variance
  mse[i,3] <- mse_i$mse           # total MSE
}
# Plot the results as a function of lambda
matplot(lambda_seq, mse, type = "l", lty = 1,
        bty = "l", xlab = expression(lambda), col = 3:1,
        ylab = "Mean squared error decomposition")
abline(h = mse[1,3], lty = 2)
legend(x = "topleft", legend = c("sq. bias", "variance", "mse"),
       col = 3:1, lty = 1, bty = "n")
# Compute the value of lambda that minimizes MSE
lambdaopt <- optimize(f = function(x){
  mse_ridge(gamma = gamma, lambda = x, Z = Z)$mse
}, interval = c(0, 30))$minimum
lambdaopt
## [1] 7.227898
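Since the data generating mechanism is known here, we can also check the closed-form MSE against a Monte Carlo approximation. The sketch below simulates new responses from the model and averages the squared estimation error of the ridge estimator at \lambda equal to lambdaopt; the number of replications B is an arbitrary choice.
# Monte Carlo check of the theoretical MSE at lambda = lambdaopt
B <- 1e4L # number of simulated data sets (arbitrary)
W <- solve(crossprod(Z) + lambdaopt*diag(ncol(Z)))
mse_mc <- mean(replicate(B, {
  ysim <- Z %*% gamma + rnorm(nrow(Z))   # new response from the true model
  gammasim <- W %*% crossprod(Z, ysim)   # ridge estimate for this sample
  sum((gammasim - gamma)^2)              # squared estimation error
}))
# Compare with the closed-form expression
c(theoretical = mse_ridge(gamma = gamma, lambda = lambdaopt, Z = Z)$mse,
  montecarlo = mse_mc)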
We can also look at the path of coefficient values \hat{\boldsymbol{\gamma}}_{\mathrm{ridge}}^{\lambda} as a function of \lambda:
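A sketch of this plot using the quantities computed above: gammahat stores the expected value of each ridge coefficient (the true value plus the bias), so we trace its rows against lambda_seq.
# Plot the expected ridge coefficients stored in gammahat against lambda
matplot(lambda_seq, gammahat, type = "l", lty = 1,
        bty = "l", xlab = expression(lambda),
        ylab = "expected ridge coefficients")
abline(h = 0, lty = 2)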
While the vector of coefficients as a whole is shrunk towards zero, the first 10 coefficients of \boldsymbol{\gamma}, whose true values are exactly zero, stabilize around another value. Note that increasing the penalty from \lambda_1 to \lambda_2 with \lambda_1 < \lambda_2 does not necessarily shrink every individual coefficient: we may have |\hat{\gamma}_j(\lambda_2)| > |\hat{\gamma}_j(\lambda_1)| for some components j.
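This can be checked numerically from the gammahat matrix computed above (which contains expected values rather than estimates): at \lambda = 0 the ridge estimator is unbiased, so the expected values of the first 10 coefficients are exactly zero, and any nonzero bias at a larger penalty increases their absolute value.
# Is some expected |coefficient| larger at lambda = 100 than at lambda = 0?
any(abs(gammahat[nlam, ]) > abs(gammahat[1, ]))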
The coefficients \hat{\boldsymbol{\gamma}}^\lambda can be computed from an augmented linear regression with response vector (\boldsymbol{y}^\top, \mathbf{0}_p^\top)^\top and design matrix [\mathbf{Z}^\top,\; \lambda^{1/2} \mathbf{I}_p]^\top, obtained by stacking \lambda^{1/2}\mathbf{I}_p below \mathbf{Z}. Alternatively, \hat{\boldsymbol{\gamma}}^\lambda = (\mathbf{Z}^\top\mathbf{Z} + \lambda \mathbf{I}_p)^{-1}\mathbf{Z}^\top\boldsymbol{y}.
We can also use the singular value decomposition \mathbf{Z} = \mathbf{UDV}^\top to compute the coefficients, \hat{\boldsymbol{\gamma}}^\lambda = \sum_{j=1}^p \frac{d_j}{d_j^2+\lambda} (\mathbf{u}_j^\top\boldsymbol{y}) \mathbf{v}_j, where \mathbf{u}_j and \mathbf{v}_j are the jth columns of \mathbf{U} and \mathbf{V}, respectively, and d_j is the jth singular value. This is most useful for cross-validation, where we repeatedly change the value of \lambda: the SVD of \mathbf{Z} only needs to be computed once.
# Create response vector from model
y <- Z %*% gamma + rnorm(nrow(Z))
# Compute ridge coefficients
ridge_lm_coef <- function(y, Z, lambda){
  # Standardize the design matrix and center the response
  Z <- scale(Z)
  c(solve(crossprod(Z) + lambda*diag(ncol(Z))) %*% crossprod(Z, y - mean(y)))
}
lambda0 <- 2
# Compare with coefficients obtained by fitting the augmented linear regression
augmy <- c(y - mean(y), rep(0, ncol(Z)))
augmZ <- rbind(Z, sqrt(lambda0)*diag(ncol(Z)))
augmlm <- lm(augmy ~ -1 + augmZ)
isTRUE(all.equal(as.vector(augmlm$coef),
                 ridge_lm_coef(y = y, Z = Z, lambda = lambda0)))
## [1] TRUE
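The SVD formula above can be verified in the same way. The helper below (ridge_svd_coef is our own name) is a sketch that computes the ridge coefficients from a single decomposition of \mathbf{Z}, which could then be reused for any value of \lambda.
# Compute ridge coefficients via the singular value decomposition of Z
ridge_svd_coef <- function(y, Z, lambda){
  Z <- scale(Z) # standardize, as in ridge_lm_coef
  svdZ <- svd(Z)
  # gamma-hat = sum_j d_j/(d_j^2 + lambda) * (u_j' y) * v_j
  c(svdZ$v %*% (svdZ$d / (svdZ$d^2 + lambda) * crossprod(svdZ$u, y - mean(y))))
}
# Should agree with the direct formula, up to numerical error
isTRUE(all.equal(ridge_svd_coef(y = y, Z = Z, lambda = lambda0),
                 ridge_lm_coef(y = y, Z = Z, lambda = lambda0)))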