Regularization

Ridge

Add notes from missed class.

The \(\lambda I_p\) term adds \(\lambda\) to every diagonal entry of \(X^T X\), which makes \(X^T X + \lambda I_p\) invertible even when \(X^T X\) itself is singular.
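
A quick numpy sketch of this point (the collinear toy design below is made up for illustration): adding \(\lambda\) to the diagonal turns a singular \(X^T X\) into a strictly positive definite matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two perfectly collinear columns, so X^T X is singular and OLS is not unique.
x1 = rng.normal(size=50)
X = np.column_stack([x1, 2 * x1])

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))                  # 1 < p = 2: singular

# Adding lambda to the diagonal makes the matrix strictly positive definite.
lam = 0.1
print(np.linalg.eigvalsh(XtX + lam * np.eye(2)))   # every eigenvalue >= lambda > 0
```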


Shrinkage effect

If the features are mutually orthogonal, the centered and standardized design matrix satisfies \(x_j^T x_k = \mathbf{1}(j = k)\); that is, its columns are orthonormal.

The least squares estimator then simplifies to

\(\hat{\beta}^{OLS} = (X^T X)^{-1} X^T y = X^T y\)

because \(X^T X = I\). In other words, the predictors form an orthonormal basis.

In ridge regression, \(X^T X + \lambda I_p = (1 + \lambda) I_p\), a diagonal matrix with \(1 + \lambda\) on every diagonal entry.

\(\hat{\beta}^R_\lambda = (X^T X + \lambda I_p)^{-1} X^T y = \frac{1}{1 + \lambda} X^T y = \frac{\beta^{OLS}}{1 + \lambda}\)

Therefore, the larger \(\lambda\) is, the smaller the magnitude of each \(\hat{\beta}_j\): the shrinkage is by the constant factor \(\frac{1}{1 + \lambda}\), applied uniformly to every coefficient. In practice, however, the features are not orthogonal, so the coefficients are shrunk non-uniformly: directions with smaller eigenvalues of \(X^T X\) are shrunk more.

This shrinks the prediction by the same factor: \(\hat{y}\) is scaled by \(1/(1 + \lambda)\).
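
A minimal numpy check of this uniform shrinkage (the orthonormal design, coefficients, and noise level below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 100, 3, 2.0

# Orthonormal design: Q from a (reduced) QR decomposition satisfies X^T X = I_p.
X, _ = np.linalg.qr(rng.normal(size=(n, p)))
beta = np.array([3.0, -1.0, 0.5])
y = X @ beta + rng.normal(scale=0.1, size=n)

beta_ols = X.T @ y                                   # OLS, since (X^T X)^{-1} = I
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print(np.allclose(beta_ridge, beta_ols / (1 + lam)))            # uniform shrinkage of beta
print(np.allclose(X @ beta_ridge, (X @ beta_ols) / (1 + lam)))  # predictions shrink too
```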

We know that \(\hat{\beta}^{OLS}\) is unbiased: \(E\,\hat{\beta}^{OLS} = (X^T X)^{-1} X^T E(Y) = \beta\), because \(E(Y) = X\beta\).

However, ridge regression is a biased estimator: \(E\,\hat{\beta}_\lambda^R = \frac{\beta}{1 + \lambda} \neq \beta\) (in this orthonormal setting).

This is the tradeoff between bias and variance. Under collinearity the variance of OLS explodes, because \(X^T X\) is (nearly) singular and its inverse is unstable. Ridge regression regularizes the variance.

For OLS, \(MSE(\hat{y}^{OLS}) = E(\hat{y}^{OLS} - f(x))^2 = \sigma^2_{OLS}\), the variance of \(\hat{y}^{OLS}\), since OLS is unbiased.

But consider the ridge prediction \(\hat{y}_\lambda^R = \hat{y}^{OLS} / (1 + \lambda)\):

\(MSE(\hat{y}_\lambda^R) = E\left(\hat{y}^{OLS} / (1 + \lambda) - f(x)\right)^2 = \frac{\sigma^2_{OLS}}{(1 + \lambda)^2} + \frac{\lambda^2}{(1 + \lambda)^2}f^2(x)\)

In some cases \(\sigma^2_{OLS}\) can be very large due to the near-singularity of \(X^T X\). The shrinkage increases the bias but reduces the variance.

Proof:

\(E\left(\frac{\hat{y}^{OLS}}{1 + \lambda} - f(x)\right)^2\)

\(= E\left[\left(\frac{\hat{y}^{OLS}}{1 + \lambda} - \frac{f(x)}{1 + \lambda}\right) - \frac{\lambda}{1 + \lambda} f(x)\right]^2\)

\(= \frac{\sigma^2_{OLS}}{(1 + \lambda)^2} + \left(\frac{\lambda}{1 + \lambda}\right)^2 f^2(x)\)

The cross term vanishes because \(\frac{\hat{y}^{OLS}}{1 + \lambda} - \frac{f(x)}{1 + \lambda}\) has mean zero (\(\hat{y}^{OLS}\) is unbiased for \(f(x)\)).
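
A small Monte Carlo sanity check of this decomposition, treating \(\hat{y}^{OLS}\) at a fixed \(x\) as a normal draw with mean \(f(x)\) and variance \(\sigma^2_{OLS}\) (the numbers are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# At a fixed x, model yhat_OLS as unbiased for f(x) with variance sigma2 (illustrative values).
f_x, sigma2, lam = 2.0, 4.0, 1.5
yhat_ols = rng.normal(loc=f_x, scale=np.sqrt(sigma2), size=1_000_000)

mse_ridge = np.mean((yhat_ols / (1 + lam) - f_x) ** 2)
formula = sigma2 / (1 + lam) ** 2 + (lam / (1 + lam)) ** 2 * f_x ** 2

print(mse_ridge, formula)   # agree up to Monte Carlo error (both around 2.08 here)
```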

PCA

Write the SVD \(X = U D V^T\). Then \(X^T X = (V D U^T)(U D V^T) = V D^2 V^T\).

\(F = XV = UD\), where \(f_j = X v_j\) is the projection of the data onto the \(j\)-th PC direction; the columns \(f_j\) are the principal components.

Note \(f^T_j f_j = d^2_j\) and \(f^T_i f_j = 0\) for \(i \neq j\), because the columns of \(U\) are orthogonal.
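
These identities can be checked numerically with a thin SVD (random data for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
X -= X.mean(axis=0)                        # centered design matrix

# Thin SVD: X = U D V^T with U (n x p), D = diag(d), V (p x p) orthogonal.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
V, D = Vt.T, np.diag(d)

F = X @ V                                  # scores / principal components
print(np.allclose(F, U @ D))               # F = X V = U D
print(np.allclose(X.T @ X, V @ D**2 @ Vt)) # X^T X = V D^2 V^T
print(np.allclose(F.T @ F, D**2))          # f_j^T f_j = d_j^2, f_i^T f_j = 0 for i != j
```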

So, using the PCs in place of \(X\),

\(y - X\beta = y - U D V^T \beta = y - F \alpha = y - \sum_{j=1}^p \alpha_j f_j\)

Regression after PCA against all PCs

\(\alpha = V^T \beta\) and \(||\alpha||_2 = ||\beta||_2\), because \(V\) is orthogonal.

Therefore, the L2 norm of \(\beta\) is the same as the L2 norm of \(\alpha\).

The ridge regression problem

\(\min_\beta \; ||y - X\beta||_2^2 + \lambda ||\beta||_2^2\)

is equivalent to

\(\min_\alpha \; ||y - F\alpha||_2^2 + \lambda ||\alpha||_2^2\)

Here \(F = UD\) satisfies the property \(F^T F = D U^T U D = D^2\).

The ridge estimator

\(\hat{\alpha}_\lambda^R = (F^T F + \lambda I)^{-1} F^T y\)

\(= (D^2 + \lambda I)^{-1} D U^T y = \mathrm{diag}\left(\frac{d_j}{d_j^2 + \lambda}\right) U^T y\)

so that, coordinate-wise,

\(\hat{\alpha}_{\lambda, j}^R = \frac{d_j^2}{d_j^2 + \lambda} \, \hat{\alpha}_j^{OLS}\)

The smaller the \(d_j\), the more shrinkage.

The prediction is \(\hat{y}^R_\lambda = F \hat{\alpha}^R_\lambda\).
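
A numpy sketch tying the pieces together (random data for illustration): the ridge solution computed in PC coordinates, with each coordinate of \(\hat{\alpha}^{OLS}\) shrunk by \(d_j^2 / (d_j^2 + \lambda)\), matches the direct ridge formula.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, lam = 80, 5, 3.0
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)
y = rng.normal(size=n)

U, d, Vt = np.linalg.svd(X, full_matrices=False)

# Ridge in PC coordinates: alpha_j = d_j / (d_j^2 + lambda) * (u_j^T y).
alpha_ridge = (d / (d**2 + lam)) * (U.T @ y)
alpha_ols = (U.T @ y) / d                     # OLS in PC coordinates (X has full column rank)

# Coordinate-wise shrinkage by d_j^2 / (d_j^2 + lambda).
print(np.allclose(alpha_ridge, (d**2 / (d**2 + lam)) * alpha_ols))

# Mapping back via beta = V alpha recovers the direct ridge solution.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(np.allclose(Vt.T @ alpha_ridge, beta_ridge))
```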

Partial Least Squares

Gradually capture the information in \(X\) that corresponds to the information in \(y\).

\(z_1 = \sum_{j=1}^p \phi_j x_j\)

where \(\phi_j = \frac{\langle y, x_j \rangle}{\langle x_j, x_j \rangle} = \langle y, x_j \rangle\), since the predictors are standardized so that \(\langle x_j, x_j \rangle = 1\).

We adjust the predictors by removing the effect contained in \(z_1\):

\(e_j = x_j - \frac{\langle z_1, x_j \rangle}{\langle z_1, z_1 \rangle} z_1\)

We do not want duplicate information: the information already captured by \(z_1\) should not reappear in \(z_2\), which is built from the deflated predictors \(e_j\).
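
A sketch of this first PLS step under the conventions above (unit-norm predictors; random data for illustration). After deflation, the adjusted predictors are orthogonal to \(z_1\), so the next component carries no duplicate information.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 60, 4
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)
X /= np.linalg.norm(X, axis=0)              # unit-norm columns, so <x_j, x_j> = 1
y = rng.normal(size=n)
y -= y.mean()

# First PLS direction: weight each predictor by its inner product with y.
phi = X.T @ y                               # phi_j = <y, x_j>
z1 = X @ phi                                # z1 = sum_j phi_j x_j

# Deflate: remove from each x_j the part already explained by z1.
E = X - np.outer(z1, (z1 @ X) / (z1 @ z1))  # column j: x_j - <z1, x_j>/<z1, z1> * z1

# The deflated predictors are orthogonal to z1, so z2 adds no duplicate information.
print(np.allclose(E.T @ z1, 0.0))
```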

Degrees of freedom

We can evaluate the effective degrees of freedom from the trace of the hat matrix.

In OLS, \(\hat{y} = Hy\) with \(H = X(X^T X)^{-1}X^T\), and \(tr(H) = tr(X(X^T X)^{-1}X^T) = tr((X^T X)^{-1}X^T X) = tr(I_p) = p\).

Ridge regression maps \(y\) to the fitted values

\(\hat{y}^R_\lambda = X(X^T X + \lambda I_p)^{-1} X^T y = S_\lambda y\)

We measure the effective degrees of freedom of ridge regression by

\(tr(S_\lambda) = \sum_{j=1}^p \frac{d_j^2}{d_j^2 + \lambda} < p\)
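
A quick numerical check (random data for illustration) that \(tr(S_\lambda)\) equals \(\sum_j d_j^2/(d_j^2 + \lambda)\) and falls below \(p\):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, lam = 100, 6, 5.0
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)

# Smoother matrix S_lambda mapping y to the ridge fitted values.
S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)

d = np.linalg.svd(X, compute_uv=False)       # singular values of X
df_ridge = np.sum(d**2 / (d**2 + lam))       # tr(S_lambda) = sum_j d_j^2 / (d_j^2 + lambda)

print(np.trace(S), df_ridge, p)              # the traces agree and are strictly below p
```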

Lasso

The lasso minimizes

\((y - X\beta)^T (y - X\beta) + \lambda ||\beta||_1\)

where \(||\beta||_1 = |\beta_1| + \cdots + |\beta_p| = \sum_{j=1}^p |\beta_j|\). The constraint region \(||\beta||_1 \le t\) is therefore a diamond, since its boundary consists of linear pieces.

The L2 norm encourages small \(\beta_j\), but the L1 norm encourages sparse solutions (some \(\beta_j\) exactly zero).
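
To make the contrast concrete, here is a sketch in the orthonormal setting (made-up data; the closed-form lasso solution below assumes the objective scaling used above, which puts the soft threshold at \(\lambda/2\)):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, lam = 200, 5, 1.0

# Orthonormal design, so both estimators have simple closed forms.
X, _ = np.linalg.qr(rng.normal(size=(n, p)))
beta = np.array([2.0, -0.8, 0.1, 0.0, 0.0])
y = X @ beta + rng.normal(scale=0.1, size=n)

beta_ols = X.T @ y

# Ridge: uniform shrinkage toward zero, but never exactly zero.
beta_ridge = beta_ols / (1 + lam)

# Lasso for the objective (y - X beta)^T (y - X beta) + lam * ||beta||_1:
# soft-thresholding of the OLS coefficients at lam / 2 (assumes this scaling).
beta_lasso = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam / 2, 0.0)

print(beta_ridge)   # every coefficient shrunk but nonzero
print(beta_lasso)   # coefficients with |OLS estimate| < lam/2 are set exactly to zero
```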