8 Penalized regression methods

There are two main reasons one may go down the route of penalized regression models: one is to trade off bias for reduced variance, in the hope that doing so will improve the predictive accuracy of the model. The other arises when the number of covariates \(p\) is large and the columns of \(\mathbf{X}\) are nearly collinear, so that \(\mathbf{X}^\top\mathbf{X}\) is close to being singular.

Throughout, we will work with centered and standardized inputs. One reason for doing so is to make our inference invariant to affine transformations of the covariates, much like for \(R^2_c\). We consider a design matrix \([\mathbf{1}_n \; \mathbf{X}]\) whose columns are rescaled so that \(\mathrm{Var}(\mathbf{X}_j)=1\) for \(j=1, \ldots, p\), and take \(\mathbf{Z} \equiv \mathbf{X} - \mathbf{1}_n \bar{\mathbf{X}}\), where \(\bar{\mathbf{X}}\) is the row vector of column means of \(\mathbf{X}\). This yields the orthogonal decomposition of the hat matrix into \(\mathbf{H}_{\mathbf{1}_n} + \mathbf{H}_{\mathbf{M}_{\mathbf{1}_n}\mathbf{X}}\), and since \(\mathbf{1}_n\) is orthogonal to the columns of \(\mathbf{Z}\), the least squares estimate of the intercept \(\beta_0\) is \(\bar{\boldsymbol{y}}\). Having scaled inputs \(\mathbf{Z}\) ensures that every covariate is penalized equally and that the model is invariant to changes of units of the regressors.
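
As an illustration, here is a minimal numpy sketch (simulated data, illustrative variable names) that standardizes the columns of \(\mathbf{X}\), centers them to obtain \(\mathbf{Z}\), and checks that the fitted intercept from regressing \(\boldsymbol{y}\) on \([\mathbf{1}_n \; \mathbf{Z}]\) is exactly \(\bar{\boldsymbol{y}}\).

```python
import numpy as np

rng = np.random.default_rng(2024)
n, p = 100, 5
# Simulated covariates on deliberately different scales
X = rng.normal(size=(n, p)) * rng.uniform(1, 10, size=p)
y = 2 + X @ rng.normal(size=p) + rng.normal(size=n)

# Rescale each column to unit variance, then center: Z plays the role of M_{1_n} X
X_scaled = X / X.std(axis=0, ddof=1)
Z = X_scaled - X_scaled.mean(axis=0)

# OLS on [1_n  Z]: because 1_n is orthogonal to the centered columns,
# the estimated intercept coincides with the sample mean of y
design = np.column_stack([np.ones(n), Z])
coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
assert np.isclose(coefs[0], y.mean())
```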

The two methods briefly covered in the course are ridge regression and the LASSO. The first imposes an \(l_2\) penalty on the coefficients, the second an \(l_1\) penalty. The advantage of the former is that the objective function is convex and differentiable, and the solution can be obtained from an augmented linear regression. We focus solely on ridge regression in the sequel.

Our objective function for ridge regression takes the form \[Q(\beta_0, \boldsymbol{\gamma}) = (\boldsymbol{y} - \beta_0 \mathbf{1}_n -\mathbf{Z}\boldsymbol{\gamma})^\top(\boldsymbol{y} - \beta_0 \mathbf{1}_n -\mathbf{Z}\boldsymbol{\gamma}) + \lambda \|\boldsymbol{\gamma}\|^2_2. \]
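
Since the columns of \(\mathbf{Z}\) are orthogonal to \(\mathbf{1}_n\), minimizing \(Q\) gives \(\widehat{\beta}_0 = \bar{\boldsymbol{y}}\) and \(\widehat{\boldsymbol{\gamma}}_\lambda = (\mathbf{Z}^\top\mathbf{Z} + \lambda \mathbf{I}_p)^{-1}\mathbf{Z}^\top\boldsymbol{y}\). The augmented linear regression mentioned above amounts to appending \(\sqrt{\lambda}\,\mathbf{I}_p\) as extra rows of the design and \(p\) zeros to the centered response. The following minimal numpy sketch on simulated data (variable names are illustrative) checks that the two routes agree:

```python
import numpy as np

rng = np.random.default_rng(2024)
n, p, lam = 100, 5, 2.5
X = rng.normal(size=(n, p))
y = 1 + X @ rng.normal(size=p) + rng.normal(size=n)

# Center and standardize the inputs as above
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
yc = y - y.mean()  # beta0_hat = mean(y), so work with the centered response

# Closed-form ridge solution: (Z'Z + lambda * I_p)^{-1} Z'y
gamma_closed = np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ yc)

# Same solution via an augmented least squares fit:
# stack sqrt(lambda) * I_p below Z and p zeros below the centered response
Z_aug = np.vstack([Z, np.sqrt(lam) * np.eye(p)])
y_aug = np.concatenate([yc, np.zeros(p)])
gamma_aug, *_ = np.linalg.lstsq(Z_aug, y_aug, rcond=None)

assert np.allclose(gamma_closed, gamma_aug)
```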