Penalized Linear Regression and Model Selection
This lecture discusses some further considerations regarding Bayesian linear regression. Penalized regression methods can be thought of as applying “shrinkage” to regression parameters, pulling them towards zero to improve model fit and prediction.
Note that this is only meant to be an introduction to these topics. For a more comprehensive overview, Bayesian Data Analysis by Gelman et al. is a good reference.
Diabetes example
The motivating example used comes from Hoff, page 161. Consider a sample of \(n=442\) diabetes patients for whom a measure of disease progression after one year, \(y\), is recorded along with baseline measurements for ten other variables \(x_1, \cdots, x_{10}\).
Since it is believed the relationship is nonlinear with numerous interactions, we consider a total of \(p=64\) predictor variables, obtained from:
- main effects of the ten variables,
- interactions of the form \(x_i x_j\) for all \(i \neq j\), and
- quadratic terms \(x_i^2\).
We will fit regression models using a training sample of \(n_{\text{train}} = 342\) patients. The model is then tested by evaluating how well it predicts disease progression on the remaining test sample of \(n_{\text{test}} = 100\) patients.
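As a concrete (and hedged) illustration, the sketch below builds a design matrix of main effects, pairwise interactions, and quadratic terms and then splits the data. It assumes a hypothetical data frame diab with response y and baseline predictors x1, ..., x10; the exact set of 64 predictors used by Hoff may differ slightly from this construction.

```r
# Sketch only: construct main effects, pairwise interactions, and quadratic terms,
# then split into training (n = 342) and test (n = 100) sets.
# Assumes a hypothetical data frame `diab` with columns y, x1, ..., x10;
# the exact 64-column design matrix in Hoff may differ slightly from this.
set.seed(1)

x_main <- diab[, paste0("x", 1:10)]
X_int  <- model.matrix(~ .^2, data = x_main)[, -1]   # main effects + pairwise interactions
X_quad <- sapply(x_main, function(v) v^2)            # quadratic terms
colnames(X_quad) <- paste0(colnames(x_main), "_sq")
X <- scale(cbind(X_int, X_quad))                     # standardize the predictors

train_id   <- sample(nrow(diab), 342)
X_train    <- X[train_id, ];     X_test    <- X[-train_id, ]
diab_train <- diab[train_id, ];  diab_test <- diab[-train_id, ]
```

The object names (X_train, X_test, diab_train) are chosen to match those used in the JAGS code in footnote 1.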
Bayesian linear regression
Suppose we are interested in regressing \(Y_i\) on predictor variables \(x_{i1}, x_{i2}, \cdots, x_{ip}\) for \(i = 1, \cdots, n\). The sampling model in our Bayesian linear regression model looked like this:
\begin{gathered} Y_i \mid \mu_i, \sigma^2 \overset{ind}{\sim} \mathcal{N}(\mu_i, \sigma^2), \end{gathered}
where
\begin{equation} \mu_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} = \beta_0 + \sum_{j=1}^p \beta_j x_{ij} \tag{1} \end{equation}
Here the prior models are:
\begin{gathered} \beta_0, \cdots, \beta_p \overset{iid}{\sim} \mathcal{N}(0, \kappa^2) \text{ for some large } \kappa > 0 \\ \sigma^2 \sim \text{IG}(0.01, 0.01) \end{gathered}
Nothing new so far!
High-dimensional linear regression
Some issues can arise in linear regression when we have a large number (\(p\)) of predictor variables.
First, many of the predictor variables may have minimal impact on explaining the response variable. For example, in Eq. (1) many of the \(\beta_j\)’s may be equal or close to zero.
Second, there is no unique optimal set of regression coefficients when the number of predictors \(p\) is larger than the sample size \(n\). Recall that in the ordinary least squares formulation, the OLS regression coefficient estimate \(\hat\beta_{\text{OLS}}\) minimizes the following:
\begin{gathered} \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2 \end{gathered}
The solution in matrix notation is:
\begin{gathered} \hat{\boldsymbol{\beta}}_{\text{OLS}} = (\boldsymbol{X}'\boldsymbol{X})^{-1} \boldsymbol{X}' \boldsymbol{y} \end{gathered}
When \(p > n\), \(\boldsymbol{X}'\boldsymbol{X}\) is not full rank and is thus not invertible.
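To see the rank problem concretely, here is a small simulated check (not the diabetes data) showing that \(\boldsymbol{X}'\boldsymbol{X}\) is rank-deficient when \(p > n\):

```r
# Simulated illustration: with p > n, X'X has rank at most n and cannot be inverted.
set.seed(1)
n <- 20; p <- 30
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)

XtX <- crossprod(X)             # p x p matrix, but rank <= n
qr(XtX)$rank                    # 20, not 30
# solve(XtX, crossprod(X, y))   # fails: the system is (numerically) singular
```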
The third problem is overfitting: a model built on an initial set of training data may fit that data very well but generalize poorly to new data. This is closely tied to the bias-variance trade-off.
In the illustration above, the green model is “simple”, which means it’s biased but also has less variability in the estimated line. The red model is complex and not biased at all, but it’s highly variable depending on the observed training data. When given a new data set, we might trust the green model more than the red one as it generalizes better.
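The trade-off can also be seen in a small simulation (a hypothetical setup of my own, not from the lecture): a simple linear fit is biased but stable across training sets, while a high-degree polynomial fit is nearly unbiased but highly variable.

```r
# Rough sketch: compare a simple (biased, low-variance) fit with a flexible
# (low-bias, high-variance) fit over repeated training samples from the same truth.
set.seed(1)
f <- function(x) sin(2 * pi * x)
x_grid <- seq(0.05, 0.95, length.out = 100)

pred_simple  <- matrix(NA, 50, length(x_grid))
pred_complex <- matrix(NA, 50, length(x_grid))
for (r in 1:50) {
  x <- runif(30); y <- f(x) + rnorm(30, sd = 0.3)
  pred_simple[r, ]  <- predict(lm(y ~ x), data.frame(x = x_grid))
  pred_complex[r, ] <- predict(lm(y ~ poly(x, 15)), data.frame(x = x_grid))
}

mean((colMeans(pred_simple)  - f(x_grid))^2)   # squared bias: larger for the simple fit
mean((colMeans(pred_complex) - f(x_grid))^2)   # squared bias: smaller for the flexible fit
mean(apply(pred_simple,  2, var))              # variance: smaller for the simple fit
mean(apply(pred_complex, 2, var))              # variance: larger for the flexible fit
```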
Penalized linear regression
Penalized regression methods address these three issues by estimating biased regression coefficients in exchange for lower variance, which generally improves model fit and predictive performance. Common penalized regression methods include:
- Ridge regression
- LASSO
- Elastic net regression – a combination of ridge and LASSO
All of these involve adding a penalty term to the function being minimized, i.e. the optimal regression coefficients \(\hat\beta_{\text{pen}}\) minimize:
\begin{gathered} \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2 + \text{penalty}(\beta_1, \cdots, \beta_p) \end{gathered}
The penalty term differs across the three methods.
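For reference, these penalties take the following standard forms (the elastic net is shown in one common parameterization with mixing weight \(\alpha \in [0, 1]\); this summary follows standard references rather than anything specific to the lecture):
\begin{gathered} \text{ridge: } \lambda \sum_{j=1}^p \beta_j^2, \qquad \text{LASSO: } \lambda \sum_{j=1}^p |\beta_j|, \qquad \text{elastic net: } \lambda \left[ \alpha \sum_{j=1}^p |\beta_j| + (1 - \alpha) \sum_{j=1}^p \beta_j^2 \right] \end{gathered}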
Ridge regression
Ridge regression was introduced by Hoerl and Kennard in 1970. It produces biased estimates of regression coefficients which are “shrunk” towards zero (but not set exactly to zero). Ridge regression finds the optimal regression coefficients \(\hat\beta_{\text{ridge}}\) by minimizing:
\begin{gathered} \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^p \beta_j^2 \end{gathered}
where \(\lambda > 0\) is a constant weight term, and larger \(\lambda\) leads to more shrinkage. The value of \(\lambda\) is typically selected via cross-validation.
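In the frequentist setting, a common way to pick \(\lambda\) is cross-validation via the glmnet package, where alpha = 0 corresponds to the ridge penalty. This is only a sketch and assumes X_train and diab_train from the diabetes example are available.

```r
# Sketch: choose the ridge penalty lambda by 10-fold cross-validation with glmnet.
library(glmnet)

cv_ridge <- cv.glmnet(X_train, diab_train$y, alpha = 0, nfolds = 10)
cv_ridge$lambda.min                  # lambda with the smallest cross-validated error
coef(cv_ridge, s = "lambda.min")     # shrunken (but nonzero) coefficient estimates
```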
From a Bayesian perspective, consider the following specification for the Bayesian linear regression model:
\begin{gathered} Y_i \mid \mu_i, \sigma^2 \overset{ind}{\sim} \mathcal{N}(\mu_i, \sigma^2), \quad \mu_i = \beta_0 + \sum_{j=1}^p \beta_j x_{ij}, \end{gathered}
where
\begin{gathered} \beta_0 \sim \mathcal{N}(0, \kappa^2) \text{ for some } \kappa > 0, \\ \beta_1, \cdots, \beta_p \mid \sigma^2 \overset{iid}{\sim} \mathcal{N} \left(0, \frac{\sigma^2}{\lambda} \right), \\ \sigma^2 \sim \text{IG}(0.01, 0.01) \end{gathered}
Hoerl and Kennard showed that the posterior modes for the regression coefficients under this prior are equivalent to \(\hat\beta_{\text{ridge}}\), the solution obtained via traditional ridge regression.
A benefit here is that we obtain posteriors for our biased regression parameters and could compute credible intervals, which is more challenging in the frequentist formulation. We do lose the pure interpretability of the coefficients themselves as we’re introducing bias into those values.
Instead of fixing \(\lambda\) ahead of time (or using cross-validation as in the traditional setting), from the Bayesian perspective it might make more sense to place a prior distribution on \(\lambda\), e.g.
\begin{gathered} \beta_1, \cdots, \beta_p \mid \sigma^2, \lambda \overset{iid}{\sim} \mathcal{N} \left(0, \frac{\sigma^2}{\lambda} \right), \\ \sigma^2 \sim \text{IG}(0.01, 0.01) \\ \lambda \sim \text{Gamma}(a, b) \end{gathered}
We may use relatively uninformative choices such as \(a=1\) and \(b=2\) for the prior on \(\lambda\).
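In JAGS this prior looks as follows (excerpted from the full ridge-type model in footnote 1; recall that JAGS parameterizes the normal by its precision, so precision \(\lambda \tau^2\) corresponds to variance \(\sigma^2 / \lambda\)):

```r
# Prior block of the ridge-type JAGS model (see footnote 1 for the full model).
ridge_prior_block <- "
  for (j in 1:p) {
    beta[j] ~ dnorm(0, lambda*tau2)   # precision lambda*tau2, i.e. variance sig2/lambda
  }
  lambda ~ dgamma(1, 2)               # Gamma(a = 1, b = 2) prior on the shrinkage weight
  tau2 ~ dgamma(0.01, 0.01)
  sig2 = 1/tau2
"
```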
LASSO
The Least Absolute Shrinkage and Selection Operator (LASSO) finds the optimal regression coefficients \(\hat\beta_{\text{LASSO}}\) by minimizing:
\begin{gathered} \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^p |\beta_j| \end{gathered}
This method was popularized by Tibshirani in 1996. It simultaneously estimates regression coefficients while setting the “unimportant” ones exactly to zero, i.e. it performs variable selection. Similar to ridge regression, a larger \(\lambda\) leads to more regression coefficients being set to zero, and the value of \(\lambda\) is typically chosen via cross-validation.
In the Bayesian LASSO, we keep the same sampling model and consider the following priors:
\begin{gathered} \beta_0 \sim \mathcal{N}(0, \kappa^2) \text{ for some } \kappa > 0 \\ \beta_1, \cdots, \beta_p \mid \sigma^2 \overset{iid}{\sim} \text{Laplace}\left( 0, \frac{\sqrt{\sigma^2}}{\lambda} \right) \\ \sigma^2 \sim \text{IG}(0.01, 0.01) \end{gathered}
Park and Casella (2008) showed that the posterior modes for the regression coefficients under this prior are equivalent to \(\hat\beta_{\text{LASSO}}\), the solution obtained via the traditional LASSO formulation.
Some notes
From a Bayesian perspective, the posterior mode results are interesting but not necessarily useful, since one of the benefits of a Bayesian analysis is obtaining a full posterior distribution, rather than just a single point estimate.
In particular, this means that the Bayesian LASSO doesn’t really perform variable selection, but it does still shrink regression parameters for improved fit. It also makes more sense to place a prior distribution on \(\lambda\) (here, on \(\lambda^2\)) instead of fixing it, e.g. as suggested by Park and Casella:
\begin{gathered} \beta_1, \cdots, \beta_p \mid \sigma^2, \lambda \overset{iid}{\sim} \text{Laplace} \left( 0, \frac{\sqrt{\sigma^2}}{\lambda} \right) \\ \sigma^2 \sim \text{IG}(0.01, 0.01) \\ \lambda^2 \sim \text{Gamma}(a, b) \end{gathered}
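The corresponding JAGS prior block (excerpted from the LASSO-type model in footnote 1) uses the double-exponential distribution ddexp, whose second argument is a rate, so rate \(\sqrt{\lambda^2 \tau^2}\) matches the Laplace scale \(\sqrt{\sigma^2}/\lambda\) above:

```r
# Prior block of the LASSO-type JAGS model (see footnote 1 for the full model).
lasso_prior_block <- "
  for (j in 1:p) {
    beta[j] ~ ddexp(0, sqrt(lambda2*tau2))   # Laplace with rate lambda/sqrt(sig2)
  }
  lambda2 ~ dgamma(1, 2)                     # prior on lambda^2, as in Park and Casella
  tau2 ~ dgamma(0.01, 0.01)
  sig2 = 1/tau2
  lambda = sqrt(lambda2)
"
```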
Diabetes example
Now that the theory is in place, let’s look at the posterior credible intervals for the regression coefficients under the original prior specification, the Bayesian ridge, and the Bayesian LASSO for the diabetes example.[1]
[Figure 2: Posterior credible intervals for the regression coefficients under (a) the original prior, (b) the ridge-type prior, and (c) the LASSO-type prior.]
In Figure 2(a) we have the posterior CIs under the traditional Bayesian linear regression setting. Note that the first ten intervals correspond to the main effects, and the rest are interaction and quadratic terms. Typically with such regression models, we hope many of the higher-order terms are close to zero, because we probably don’t expect them to explain \(y\) much.
We can see that there’s a lot of variability in the coefficient values. One reason for this is that we have a large number of predictor variables relative to the sample size. Under the Bayesian ridge setting shown in Figure 2(b), most of the coefficients are pushed towards zero, and the posterior CIs are narrower for many of the coefficients. In the Bayesian LASSO we observe similar effects.
The coefficients are more concentrated around zero with less variability due to shrinkage. We’re essentially specifying a prior that says the coefficients are centered at zero with a common scale, so their estimates are pulled towards zero. This is a stronger prior than in the previous model.
Comparison of posteriors
So what effect does this have on individual regression coefficients? When the number of predictors is large, the estimation of the coefficients can be unstable. This can be seen by looking at trace plots and posterior densities of the regression coefficients. Here we take \(\beta_{57}\) as an example.
[Trace plots of \(\beta_{57}\) under the three priors.]
The trace plot under the original prior doesn’t look great: the instability in the chains comes from the increased variation in estimating regression coefficients when \(p\) is large. The trace plots for the Bayesian ridge and LASSO show much more convergent behavior, because we are restricting the values of \(\beta\) by introducing bias.
[Posterior densities of \(\beta_{57}\) under the three priors.]
For this specific coefficient, the posterior densities give roughly \(\hat\beta_{\text{OLS}, 57} = 0.4\) and \(\hat\beta_{\text{ridge}, 57} = \hat\beta_{\text{LASSO}, 57} = 0.15\). The ridge and LASSO give different posteriors for this regression coefficient, but both are pushed towards zero, as we saw in the CI plots.
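The plots above can be reproduced (roughly) from the coda output of the models in footnote 1; the sketch below pulls out the beta[57] samples and uses coda’s plotting helpers.

```r
# Sketch: extract beta[57] from each fit (mcmc.list objects from footnote 1)
# and inspect its traces and posterior density.
library(coda)

b57_orig  <- codaSamples1[, "beta[57]"]
b57_ridge <- codaSamples2[, "beta[57]"]
b57_lasso <- codaSamples3[, "beta[57]"]

traceplot(b57_orig,  main = "Original prior")     # unstable chains when p is large
traceplot(b57_ridge, main = "Ridge-type prior")   # much better mixing
densplot(b57_lasso,  main = "LASSO-type prior")   # posterior density, pulled towards zero
```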
Model comparison
One traditional way to compare these three models is to quantify how well they predict the disease progression response on a test sample. We can predict the response value \(\hat{y}_i\) under each model and compare it to the true value \(y_i\).
Posterior-based predictions
For the \(i\)-th patient in the test sample, we have its vector of covariates \(x_i\). The posterior predictive distribution for its response is:
\begin{gathered} p(\tilde{y}_i \mid x_i, \boldsymbol{y}) = \int p(\tilde{y}_i \mid x_i, \boldsymbol{\beta}, \sigma^2) \, p(\boldsymbol{\beta}, \sigma^2 \mid \boldsymbol{y}) \, d\boldsymbol{\beta} \, d\sigma^2 \end{gathered}
We sampled from this in JAGS and stored the values in the Y_pred column. We get a whole distribution of predictions for each patient, and we may take \(\hat{y}_i\) to be the posterior predictive mean, i.e.
\begin{gathered} \hat{y}_i = \mathbb{E}(\tilde{y}_i \mid x_i, \boldsymbol{y}) \end{gathered}
We can then quantify how well each of the three models predicts our test data by computing the squared-error loss:
\begin{gathered} \frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} (y_i - \hat{y}_i)^2 \end{gathered}
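As a hedged sketch, the posterior predictive means and the prediction error can be computed from the monitored Y_pred nodes in footnote 1 roughly as follows. Here diab_test$y stands for the held-out responses (a hypothetical name from the earlier splitting sketch), and the exact error definition behind the table below may differ, e.g. in its normalization.

```r
# Sketch: posterior predictive means and average squared-error loss for one model,
# using the Y_pred nodes monitored in the JAGS code from footnote 1.
library(coda)

samps <- as.matrix(codaSamples1)                             # pool the chains into a matrix
y_hat <- colMeans(samps[, paste0("Y_pred[", 1:100, "]")])    # posterior predictive means
mean((diab_test$y - y_hat)^2)                                # squared-error loss on the test set
```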
The prediction errors for the three models are summarized in the table below. Both of the penalized regression methods demonstrate improved predictive performance on the test sample.[2]
Prior | Prediction error |
---|---|
Original | 0.6057 |
Ridge-type | 0.5467 |
LASSO-type | 0.5320 |
Model selection
We compared the three models above by how good their predictions were. Given several candidate models, how can we select the “best” model? In general, we tend to favor models which are:
- Scientifically reasonable, perhaps backed by existing theory.
- Good at predicting test data. Sometimes this is not the model that makes the most scientific sense!
- Interpretable.
- Parsimonious, i.e. we generally prefer simpler models.
Different applications may place different weights on these criteria, and as a result, there are many ways to evaluate and select models. Here we introduce a few popular Bayesian techniques for model selection. Justifying the use of a model may involve using one or more of these techniques in conjunction with other, application-dependent considerations.
Bayes factor
Suppose we have two competing models denoted \(M_1\) and \(M_2\), with respective parameter vectors \(\theta_1\) and \(\theta_2\). The Bayes factor of \(M_1\) to \(M_2\) is defined as the ratio of posterior odds to prior odds:
\begin{gathered} \text{BF} = \frac{p(M_1 \mid y) \, / \, p(M_2 \mid y)}{p(M_1) \, / \, p(M_2)} \end{gathered}
It can be shown that the above formula reduces to the following:
\begin{gathered} \text{BF} = \frac{p(y \mid M_1)}{p(y \mid M_2)}, \end{gathered}
where
\begin{gathered} p(y \mid M_i) = \int p(y \mid \theta_i, M_i) \, p(\theta_i \mid M_i) \, d\theta_i \end{gathered}
is the marginal likelihood of our data under model \(M_i\).[3] In the integral we have the sampling model for model \(M_i\) and the prior for \(\theta_i\) under \(M_i\).
The idea is that if the Bayes factor is greater than one, then we would prefer model 1 to model 2, as the marginal likelihood of the data under model 1 is higher. A rule of thumb is that \(\text{BF} > 100\) constitutes strong evidence in favor of model \(M_1\).
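As a toy illustration (deliberately not the diabetes models), here is a conjugate Beta-Bernoulli comparison where both marginal likelihoods are available in closed form:

```r
# Toy example: Bayes factor for two Beta-Binomial models that differ only in their prior.
# M1: theta ~ Beta(1, 1) (flat);  M2: theta ~ Beta(10, 10) (concentrated near 0.5).
marginal_lik <- function(y, n, a, b) {
  # p(y | M) = choose(n, y) * B(a + y, b + n - y) / B(a, b)
  exp(lchoose(n, y) + lbeta(a + y, b + n - y) - lbeta(a, b))
}

y <- 8; n <- 10                                       # 8 successes in 10 trials
bf_12 <- marginal_lik(y, n, 1, 1) / marginal_lik(y, n, 10, 10)
bf_12                                                 # about 1.4: weak evidence for M1 here
```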
A few caveats with Bayes factors:
- A Bayes factor only compares two models at a time.
- The marginal likelihoods appearing in the Bayes factor are generally very difficult to compute or approximate unless the priors are conjugate.
- Bayes factors cannot be used for models with improper priors as the marginal likelihood can’t be computed.
Bayes factors are also problematic if the models are nested with \(\dim(\theta_1) < \dim(\theta_2)\), e.g.
\begin{gathered} M_1: \mu_i = \beta_0 + \beta_1 x_{1i} \\ M_2: \mu_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} \end{gathered}
\(M_1\) is nested within \(M_2\), since we could set \(\beta_2 = 0\) in \(M_2\) to get \(M_1\). If we have a large number of predictors, we often want to see whether including a predictor would increase model performance.
Deviance information criterion
Another thing we might do is compute criterion-based measures (similar to AIC and BIC), which systematically assign scores to different models. The score should penalize models with a large number of parameters. This is generally beneficial when models get complex and methods like stepwise regression become difficult.
Suppose we are interested in estimating parameters \(\theta\) under a model with sampling density \(p(y \mid \theta)\). Let \(\hat\theta\) be the posterior mean, perhaps estimated using MCMC. The deviance information criterion (DIC) for this model is defined as:
\begin{gathered} \text{DIC} = -2 \log p(y \mid \hat\theta) + 2 p_D, \end{gathered}
where the first term is called the deviance, and the second term involves \(p_D\), an estimate of the effective number of model parameters, which is used to penalize against overly complex models. Models with smaller values of DIC are preferred.
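One standard estimate of the effective number of parameters, due to Spiegelhalter et al. (2002) and stated here for completeness rather than taken from the lecture, is the posterior mean deviance minus the deviance evaluated at the posterior mean:
\begin{gathered} p_D = \overline{D(\theta)} - D(\hat\theta), \qquad D(\theta) = -2 \log p(y \mid \theta), \end{gathered}
where \(\overline{D(\theta)}\) is estimated by averaging the deviance over the posterior draws of \(\theta\).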
The DIC is preferred from a Bayesian perspective over other criteria like AIC or BIC because:
- DIC depends on the posterior mean, which incorporates prior information.
- DIC can be easily computed from samples generated via MCMC.
- AIC and BIC are more “frequentist-driven”, as they both rely on maximum likelihood estimates instead of the posterior mean.
In JAGS, the function dic.samples generates samples from the posterior while simultaneously computing the DIC for the model.[4]
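For example, reusing the helper function and model strings from footnote 1 (which has a dic.mode switch that calls dic.samples instead of coda.samples), the DICs can be computed roughly as follows:

```r
# Sketch: compute DIC for the three diabetes models via the helper in footnote 1.
dic1 <- run_jags_model(trad_diab,  initValues1, params1, dic.mode = TRUE)
dic2 <- run_jags_model(ridge_diab, initValues2, params2, dic.mode = TRUE)
dic3 <- run_jags_model(lasso_diab, initValues3, params3, dic.mode = TRUE)

dic1                    # prints the mean deviance, the penalty, and the penalized deviance
diffdic(dic1, dic2)     # rjags helper: difference in DIC between two models
```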
Other methods
Stochastic search methods are often used for large model spaces, i.e. many candidate models arising from a large number of predictors. Predictive checking looks at posterior predictive distributions and compares them to observed values; cross-validation is an example of this.
Footnotes

1. The R code for obtaining the posterior samples:
```r
library(rjags)   # provides jags.model(), coda.samples(), and dic.samples()

# Specify data, parameters, and initial values
p <- ncol(X_train)
dataList <- list(
  "n_train" = nrow(X_train),   # training sample size
  "p" = p,                     # number of variables
  "Y" = diab_train$y,          # response variable
  "X_train" = X_train,         # training data
  "n_test" = nrow(X_test),     # test sample size
  "X_test" = X_test            # test data
)

# JAGS settings
adaptSteps <- 1000       # number of steps to "tune" the samplers
burnInSteps <- 1000      # number of steps to "burn-in" the samplers
nChains <- 3             # number of chains to run
numSavedSteps <- 1000    # total number of steps in chains to save
thinSteps <- 20          # number of steps to "thin" (1 = keep every step)
nIter <- ceiling((numSavedSteps * thinSteps) / nChains)   # steps per chain

run_jags_model <- function(model, init.vals, params, model.data = dataList,
                           num.chains = nChains, adapt.steps = adaptSteps,
                           burnin.steps = burnInSteps, num.iters = nIter,
                           thin.steps = thinSteps, dic.mode = F) {
  # Define model
  m <- textConnection(model)
  jagsModel <- jags.model(m, data = model.data, inits = init.vals,
                          n.chains = num.chains, n.adapt = adapt.steps)
  close(m)

  # Burn-in
  if (burnin.steps > 0) {
    update(jagsModel, n.iter = burnin.steps)
  }

  if (dic.mode) {
    # Generate DIC samples
    dic.samples(jagsModel, variable.names = params,
                n.iter = num.iters, thin = thin.steps, type = "pD")
  } else {
    # Return posterior samples
    coda.samples(jagsModel, variable.names = params,
                 n.iter = num.iters, thin = thin.steps)
  }
}

#########################
# (1) Linear regression #
#########################
params1 <- c("beta_0", "beta", "sig2", "Y_pred")
initValues1 <- list("beta_0" = 0, "beta" = rep(0, p), "tau2" = 1)

# Run model in JAGS
trad_diab <- "
model {
  # Sampling model
  for (i in 1:n_train) {
    Y[i] ~ dnorm(mu[i], tau2)
    mu[i] = beta_0 + inprod(beta, X_train[i, ])
  }

  # Prior model
  beta_0 ~ dnorm(0, 1e-10)
  for (j in 1:p) {
    beta[j] ~ dnorm(0, 1)
  }
  tau2 ~ dgamma(0.01, 0.01)

  # Need to have model calculate sig2 = 1/tau2
  sig2 = 1/tau2

  # Prediction on test data
  for (i in 1:n_test) {
    mu_pred[i] = beta_0 + inprod(beta, X_test[i, ])
    Y_pred[i] ~ dnorm(mu_pred[i], tau2)
  }
}
"
codaSamples1 <- run_jags_model(trad_diab, initValues1, params1)

########################
## (2) Bayesian ridge ##
########################
params2 <- c("beta_0", "beta", "sig2", "lambda", "Y_pred")
initValues2 <- list("beta_0" = 0, "beta" = rep(0, p), "tau2" = 1, "lambda" = 0.1)

# Run model in JAGS
ridge_diab <- "
model {
  # Sampling model
  for (i in 1:n_train) {
    Y[i] ~ dnorm(mu[i], tau2)
    mu[i] = beta_0 + inprod(beta, X_train[i, ])
  }

  # Prior model
  beta_0 ~ dnorm(0, 1e-10)
  for (j in 1:p) {
    beta[j] ~ dnorm(0, lambda*tau2)
  }
  lambda ~ dgamma(1, 2)
  tau2 ~ dgamma(0.01, 0.01)

  # Need to have model calculate sig2 = 1/tau2
  sig2 = 1/tau2

  # Prediction on test data
  for (i in 1:n_test) {
    mu_pred[i] = beta_0 + inprod(beta, X_test[i, ])
    Y_pred[i] ~ dnorm(mu_pred[i], tau2)
  }
}
"
codaSamples2 <- run_jags_model(ridge_diab, initValues2, params2)

########################
## (3) Bayesian LASSO ##
########################
params3 <- c("beta_0", "beta", "sig2", "lambda", "Y_pred")
initValues3 <- list("beta_0" = 0, "beta" = rep(0, p), "tau2" = 1, "lambda2" = 0.1)

# Run model in JAGS
lasso_diab <- "
model {
  # Sampling model
  for (i in 1:n_train) {
    Y[i] ~ dnorm(mu[i], tau2)
    mu[i] = beta_0 + inprod(beta, X_train[i, ])
  }

  # Prior model
  beta_0 ~ dnorm(0, 1e-10)
  for (j in 1:p) {
    beta[j] ~ ddexp(0, sqrt(lambda2*tau2))
  }
  lambda2 ~ dgamma(1, 2)
  tau2 ~ dgamma(0.01, 0.01)

  # Need to have model calculate sig2 and lambda
  sig2 = 1/tau2
  lambda = sqrt(lambda2)

  # Prediction on test data
  for (i in 1:n_test) {
    mu_pred[i] = beta_0 + inprod(beta, X_test[i, ])
    Y_pred[i] ~ dnorm(mu_pred[i], tau2)
  }
}
"
codaSamples3 <- run_jags_model(lasso_diab, initValues3, params3)
```
2. In this case the LASSO seems to perform better than the ridge, but oftentimes the ridge performs slightly better. The LASSO is usually more interpretable as it sets some coefficients to zero.
3. In Bayes’ Theorem:
$$ p(\theta \mid y) = \frac{p(y \mid \theta) p(\theta)}{p(y)} $$
we often ignore the denominator, which is exactly this marginal likelihood value. This is sometimes called the model evidence.
4. The DICs of the three models in the diabetes example are:
```
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 342
##    Unobserved stochastic nodes: 166
##    Total graph size: 30131
##
## Initializing model
```
```
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 342
##    Unobserved stochastic nodes: 167
##    Total graph size: 30134
##
## Initializing model
```
```
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 342
##    Unobserved stochastic nodes: 167
##    Total graph size: 30136
##
## Initializing model
```
Original | Ridge | LASSO |
---|---|---|
779.49 | 763.42 | 761.33 |