Bayesian Linear Regression
Unlike many of the previous lectures, a lot of this might not feel very different from traditional approaches, because we place uninformative priors on most parameters. However, the Bayesian approach does have some benefits, which we will touch upon as we go through three regression models on the following data set.
London schools example
The `student.txt` data set from Goldstein et al.[1] contains externally standardized results of 1,978 students’ performance on the General Certificate of Secondary Education (GCSE), an exam taken by British students finishing Year 11 (about age 15). These students were drawn from 38 schools within inner London. In addition, the following information is included in the data:
- Standardized scores on the London Reading Test (LRT), an examination required for all British students at age 11.
- Students were rated into one of three groups based on a Verbal Reasoning (VR) test taken at age 11, with \(\text{VR}=1\) being the best score and \(\text{VR}=3\) being the worst score.
- Gender at birth, coded 0 if male and 1 if female.
Our goal is to model the relationship between GCSE performance and these covariates. With a relatively small number of covariates, we can examine their pairwise relationships with the response variable. We begin by ignoring which school each student attends for now, and plot GCSE against LRT.
Both variables are centered at zero. It seems in general a higher LRT score corresponds to a higher GCSE score.
Next we have the gender at birth variable. The GCSE scores by gender are summarized in the following table.
Gender | Sample size | Mean | Standard deviation |
---|---|---|---|
Female | 1209 | 0.0926 | 0.9700 |
Male | 769 | -0.1453 | 1.0351 |
There are fewer males than females in the sample. On average, females score above the overall mean (0.0926) and males score below it (-0.1453).
The final variable is VR, and we observe a similar trend: \(\text{VR}=1\) is the best-performing group, whereas \(\text{VR}=3\) performed the worst.
VR | Sample size | Mean | Standard deviation |
---|---|---|---|
1 | 521 | 0.8101 | 0.8716 |
2 | 1160 | -0.1310 | 0.8426 |
3 | 297 | -0.9083 | 0.7385 |
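Summary tables like the two above can be produced with a few lines of base R. The sketch below uses a small simulated data frame rather than `student.txt`, and the helper name `summarise_by_group` is made up for illustration.

```r
# Toy data only: these values are simulated, not taken from student.txt.
set.seed(1)
toy <- data.frame(
  Y  = rnorm(12),             # stand-in for standardized GCSE scores
  VR = rep(1:3, times = 4)    # stand-in for the VR group label
)

# Sample size, mean and standard deviation of y within each level of group.
summarise_by_group <- function(y, group) {
  data.frame(
    group = sort(unique(group)),
    n     = as.vector(tapply(y, group, length)),
    mean  = as.vector(tapply(y, group, mean)),
    sd    = as.vector(tapply(y, group, sd))
  )
}

vr_summary <- summarise_by_group(toy$Y, toy$VR)
print(vr_summary)
```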
Model 1 - multiple linear regression
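As always, we first need to specify our sampling model, the part of the model that relates \(Y_{ij}\) to the explanatory variables. Let \(Y_{ij}\) be the GCSE score of student \(i\) within school \(j\):
$$ Y_{ij} \mid \boldsymbol{\beta}, \sigma^2 \overset{ind}{\sim} \text{N}(\mu_{ij}, \sigma^2), $$
where \(\boldsymbol{\beta}\) is a collection of five regression coefficients \((\beta_1, \cdots, \beta_5)\), and the “regression function” is:
$$ \mu_{ij} = \beta_1 + \beta_2 \text{LRT}_{ij} + \beta_3 \text{VR1}_{ij} + \beta_4 \text{VR2}_{ij} + \beta_5 \text{Gender}_{ij}. $$
Here VR1 and VR2 are indicator variables for \(\text{VR}=1\) and \(\text{VR}=2\), so \(\text{VR}=3\) is the baseline category.
Assuming conditional independence across the samples, the probability density function is:
$$ f(\boldsymbol{y} \mid \boldsymbol{\beta}, \sigma^2) = \prod_{j} \prod_{i} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(y_{ij} - \mu_{ij})^2}{2\sigma^2} \right\}. $$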
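Next up is the prior model. The unknown quantities in the sampling model are \(\mu_{ij}\) and \(\sigma^2\), but the mean function \(\mu_{ij}\) is determined by the regression coefficients \(\beta_1, \cdots, \beta_5\), so we need to specify priors for six parameters.
Suppose we don’t have any prior information about the parameters. Then:
$$ \begin{gathered} \beta_k \overset{iid}{\sim} \text{N}(0, 10^{10}), \quad k = 1, \cdots, 5 \\ \sigma^2 \sim \text{IG}(0.01, 0.01) \end{gathered} $$
We’ve seen the inverse-gamma prior multiple times for the variance term. The normal priors are proper but virtually “flat”, so they behave almost like uniform priors. Treating the parameters as independent a priori, the joint prior density is:
$$ \pi(\boldsymbol{\beta}, \sigma^2) = \pi(\sigma^2) \prod_{k=1}^{5} \pi(\beta_k) \propto (\sigma^2)^{-(0.01 + 1)} \exp\left\{ -\frac{0.01}{\sigma^2} \right\} \prod_{k=1}^{5} \exp\left\{ -\frac{\beta_k^2}{2 \times 10^{10}} \right\}. $$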
We cannot write down the full posterior distribution in closed form. However, for our prior specification, it can be shown that the full conditional distributions do exist in closed form. As a consequence, a Gibbs sampler can be used to generate approximate samples from the posterior[2].
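To make the role of the full conditionals concrete, here is a minimal Gibbs sampler sketch for a normal linear model with independent \(\text{N}(0, 10^{10})\) priors on the coefficients and an \(\text{IG}(0.01, 0.01)\) prior on \(\sigma^2\). It runs on simulated data, not the London schools data, and the function name `gibbs_lm` and its arguments are hypothetical. It alternates between the standard conjugate updates, a multivariate normal for \(\boldsymbol{\beta}\) and an inverse-gamma for \(\sigma^2\); it is not a description of what JAGS does internally.

```r
set.seed(42)

# Simulated data (not the London schools data): intercept plus two predictors.
n <- 200
X <- cbind(1, rnorm(n), rbinom(n, 1, 0.5))
beta_true <- c(-0.5, 1.0, 0.3)
y <- as.vector(X %*% beta_true + rnorm(n, sd = 0.8))

gibbs_lm <- function(y, X, n_iter = 2000, prior_var = 1e10, a = 0.01, b = 0.01) {
  n <- nrow(X)
  p <- ncol(X)
  XtX <- crossprod(X)
  Xty <- as.vector(crossprod(X, y))
  beta <- rep(0, p)
  sig2 <- 1
  draws <- matrix(NA_real_, n_iter, p + 1,
                  dimnames = list(NULL, c(paste0("beta", 1:p), "sig2")))
  for (iter in seq_len(n_iter)) {
    # beta | sig2, y ~ N(m, V) with V = (X'X / sig2 + I / prior_var)^(-1)
    V <- solve(XtX / sig2 + diag(p) / prior_var)
    m <- as.vector(V %*% (Xty / sig2))
    beta <- m + as.vector(t(chol(V)) %*% rnorm(p))
    # sig2 | beta, y ~ IG(a + n / 2, b + RSS / 2)
    rss <- sum((y - as.vector(X %*% beta))^2)
    sig2 <- 1 / rgamma(1, shape = a + n / 2, rate = b + rss / 2)
    draws[iter, ] <- c(beta, sig2)
  }
  draws
}

post <- gibbs_lm(y, X)
post_mean <- colMeans(post[-(1:500), ])  # discard the first 500 draws as burn-in
print(round(post_mean, 3))
```

With priors this flat, the posterior means of the coefficients should land very close to the least-squares estimates, which is one way to sanity-check the sampler.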
Posterior samples via JAGS
Now we can draw posterior samples using JAGS. Again, the model specified below is without the effect of schools.
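A minimal sketch of how Model 1 could be written for JAGS is given below. It is adapted from the Model 3 listing in the footnotes by dropping the school terms, so the variable names (`Y`, `LRT`, `VR1`, `VR2`, `Gender`) follow that listing; the names `model1_string` and `make_data_list1` are hypothetical.

```r
# Model 1 in JAGS syntax: the Model 3 listing without alpha and school covariates.
model1_string <- "
model {
  # Likelihood: dnorm in JAGS takes a mean and a precision (tau2 = 1 / sig2)
  for (i in 1:n) {
    Y[i] ~ dnorm(mu[i], tau2)
    mu[i] <- beta[1] + beta[2] * LRT[i] + beta[3] * VR1[i] +
             beta[4] * VR2[i] + beta[5] * Gender[i]
  }
  # Priors
  for (k in 1:p) {
    beta[k] ~ dnorm(0, 1e-10)
  }
  tau2 ~ dgamma(0.01, 0.01)
  # Have the model report the variance as well
  sig2 <- 1 / tau2
}
"

# Builds the data list; `dat` is assumed to have the columns used in the footnotes.
make_data_list1 <- function(dat) {
  list(n = nrow(dat), p = 5, Y = dat$Y, LRT = dat$LRT,
       VR1 = dat$VR1, VR2 = dat$VR2, Gender = dat$Gender)
}
```

A gamma prior on the precision `tau2` corresponds to the inverse-gamma prior on \(\sigma^2\).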
Now we actually draw the samples (takes quite a while):
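The sampling step can be wrapped in a small helper that follows the same `jags.model`, `update`, `coda.samples` sequence as the Model 3 listing in the footnotes. This is only a sketch: the function name `draw_posterior_samples` and all default values are assumptions, and running it requires the `rjags` package and a JAGS installation.

```r
# Compile a JAGS model from a string, run burn-in, and return coda samples.
draw_posterior_samples <- function(model_string, data, inits, parameters,
                                   n_chains = 3, n_adapt = 1000,
                                   n_burnin = 1000, n_iter = 10000, thin = 1) {
  if (!requireNamespace("rjags", quietly = TRUE)) {
    stop("The rjags package (and a JAGS installation) is required.")
  }
  con <- textConnection(model_string)
  on.exit(close(con))
  jags_model <- rjags::jags.model(con, data = data, inits = inits,
                                  n.chains = n_chains, n.adapt = n_adapt)
  # Burn-in: advance the chains without storing samples
  if (n_burnin > 0) {
    update(jags_model, n.iter = n_burnin)
  }
  rjags::coda.samples(jags_model, variable.names = parameters,
                      n.iter = n_iter, thin = thin)
}
```

The returned object is an `mcmc.list`, which `summary()` and `plot()` from the `coda` package can work with.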
The posterior summaries of \(\boldsymbol{\beta}\) and \(\sigma^2\) are given below. The first thing we might notice is that VR1 (\(\beta_3\)) appears to have the strongest positive connection with GCSE scores. VR2 also has a slight positive effect.
The gender (\(\beta_5\)) boxplot is primarily above zero, which indicates that females generally have higher GCSE scores. LRT (\(\beta_2\)) also appears to have a very small but positive relationship with GCSE. Recall that LRT spanned (-30, 30), so the coefficient is the change in GCSE per single LRT point; we should consider standardizing the variable to get a better sense of the relationship between LRT and GCSE.
Finally, the intercept, which represents the baseline level (LRT = 0, male, VR = 3), is negative.
Notes
As with traditional linear regression methods, we should consider the following items.
Is the goal of the model explanation or prediction? If explanation, are there any predictor variables which are redundant or provide minimal information in explaining changes in the response?
Consider transformations if assumptions do not appear to be satisfied.
- Standardization of the predictor variables.
- If GCSE wasn’t normally distributed, maybe try log(GCSE).
- Other transformations of the predictors if we observe improved correlation with the response in the EDA.
Including higher-order interaction terms. For example,
$$ \mu_{ij} = \beta_1 + \beta_2 x_1 + \beta_3 x_2 + \beta_4 x_1 x_2 $$
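As a small illustration of two of these items, the sketch below standardizes a continuous predictor with `scale()` and builds a design matrix with an interaction column using `model.matrix()`. The data are simulated and the variable names are placeholders.

```r
set.seed(7)
x1 <- rnorm(6, mean = 0, sd = 10)   # a continuous predictor on a wide scale
x2 <- rep(c(0, 1), times = 3)       # a binary predictor

# Standardize: subtract the mean and divide by the standard deviation
x1_std <- as.vector(scale(x1))

# `x1_std * x2` expands to both main effects plus their product
X_int <- model.matrix(~ x1_std * x2)
print(colnames(X_int))
```

The four columns line up with \(\beta_1, \cdots, \beta_4\) in the interaction model above: an intercept, the two main effects, and the product term.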
There are of course other things to consider (e.g. regularization), and we’ll talk about some other types of priors in the next lecture.
Model 2 - adding a school effect
Our first linear regression model ignores which school each student attended. Should we account for a school effect in our model? We might be interested in which schools are on average performing better when the other variables are factored out.
The boxplot shows some variability across the schools. Some are almost centered around 1, indicating high performance. Note that the sample sizes also vary, and some schools have only one observation.
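So how could Model 1 be modified to account for school-specific average GCSE performance? We may specify the sampling model as:
$$ Y_{ij} \mid \boldsymbol{\alpha}, \boldsymbol{\beta}, \sigma^2 \overset{ind}{\sim} \text{N}(\mu_{ij}, \sigma^2), $$
where \(\boldsymbol{\alpha} = (\alpha_1, \cdots, \alpha_J)\), \(\boldsymbol{\beta} = (\beta_1, \cdots, \beta_5)\), and
$$ \mu_{ij} = \alpha_j + \beta_1 + \beta_2 \text{LRT}_{ij} + \beta_3 \text{VR1}_{ij} + \beta_4 \text{VR2}_{ij} + \beta_5 \text{Gender}_{ij}. $$
This is very similar to Model 1, except that the regression function now has an \(\alpha_j\) term that models the school-specific effect for school \(j\).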
Under this new model, we assume that the predictors have the same relationship with the response variable regardless of the school. This is called a random intercept model, where the slopes are the same but the intercepts differ.
Next we specify priors for all our parameters. For the \(\beta\)’s we would use exactly the same prior as in Model 1:
$$ \beta_k \overset{iid}{\sim} \text{N}(0, 10^{10}), \quad k = 1, \cdots, 5. $$
For the \(\alpha\)’s, since the schools were randomly selected from inner London, we would expect them to share some similarities after accounting for all other variables. So following the setup in the hierarchical model, the prior for the school effects could be:
$$ \alpha_j \mid \sigma_\alpha^2 \overset{iid}{\sim} \text{N}(0, \sigma_\alpha^2), \quad j = 1, \cdots, J, $$
where \(\sigma_\alpha^2\) models the variability between schools. Modeling the \(\alpha\)’s as dependent through the shared \(\sigma_\alpha^2\) encourages shrinkage of school effects, and improves estimation by borrowing information from other schools.
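To see how the shrinkage depends on school size, consider the full conditional of \(\alpha_j\) with \(\boldsymbol{\beta}\), \(\sigma^2\) and \(\sigma_\alpha^2\) held fixed. Its mean is the average residual of school \(j\) multiplied by the weight \(w_j = n_j \sigma_\alpha^2 / (n_j \sigma_\alpha^2 + \sigma^2)\), where \(n_j\) is the number of students in the school. The sketch below evaluates this weight; the function name is hypothetical and the variance values are illustrative, not estimates.

```r
# Weight on a school's own average residual in the conditional mean of alpha_j.
# Values near 0 mean strong shrinkage towards zero; values near 1 mean little.
shrinkage_weight <- function(n_j, sig2, sig2_alpha) {
  n_j * sig2_alpha / (n_j * sig2_alpha + sig2)
}

# Illustrative variances only
w <- shrinkage_weight(n_j = c(1, 10, 50), sig2 = 0.92, sig2_alpha = 0.08)
print(round(w, 3))
# 0.080 0.465 0.813
```

A school with a single student keeps only a small fraction of its raw deviation, while a school with 50 students keeps most of it.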
We also have priors for the variance parameters:
\begin{gathered} \sigma_\alpha^2 \sim \text{IG}(0.01, 0.01) \\ \sigma^2 \sim \text{IG}(0.01, 0.01) \end{gathered}
We assume all of these parameters are independent. The final Model 2 is a hierarchical linear model with school-specific random intercepts, i.e. treating the intercept as a random effect.
Posterior samples via JAGS
Using the code below, we can obtain posterior samples of the parameters.
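A minimal sketch of the Model 2 specification in JAGS syntax is given here. It is adapted from the Model 3 listing in the footnotes by centering the school effects at zero instead of regressing them on school-level covariates; the names `model2_string` and `parameters2` are hypothetical.

```r
model2_string <- "
model {
  # Likelihood: dnorm in JAGS takes a mean and a precision (tau2 = 1 / sig2)
  for (i in 1:n) {
    Y[i] ~ dnorm(mu[i], tau2)
    mu[i] <- alpha[School[i]] + beta[1] + beta[2] * LRT[i] +
             beta[3] * VR1[i] + beta[4] * VR2[i] + beta[5] * Gender[i]
  }
  # School-specific random intercepts, centered at zero
  for (j in 1:n.schools) {
    alpha[j] ~ dnorm(0, tau2.alpha)
  }
  # Priors
  for (k in 1:p) {
    beta[k] ~ dnorm(0, 1e-10)
  }
  tau2 ~ dgamma(0.01, 0.01)
  tau2.alpha ~ dgamma(0.01, 0.01)
  # Have the model report the variances as well
  sig2 <- 1 / tau2
  sig2.alpha <- 1 / tau2.alpha
}
"

# Parameters to monitor
parameters2 <- c("alpha", "beta", "sig2", "sig2.alpha")
```

The nested index `alpha[School[i]]` picks out the intercept of the school that student `i` attends, so `School` must be an integer vector with values from 1 to `n.schools`.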
The summaries for the \(\beta\)’s are very similar to the ones we got in Model 1. This is because Model 1 “averaged” all the school effects.
What’s more interesting is the school-specific intercepts. Any \(\alpha\) value that’s above zero indicates a school that outperformed the baseline. School 9 appears to be the “best” school in terms of the median, and the worst-performing school could be school 17. For schools 31-38 with smaller sample sizes, their school effects are pushed towards zero and the posterior samples have a lot of variation.
Figure 5 can be difficult to read with this many schools. An alternative is to produce an overall ranking of schools[3] by averaging over the sample-to-sample rankings of the school-specific effect parameters \(\alpha_1, \cdots, \alpha_{38}\).
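Averaging the sample-to-sample rankings can be sketched as follows: rank the schools within each posterior draw, then average the ranks per school. The draws below are simulated stand-ins for the MCMC output, with four hypothetical schools.

```r
set.seed(3)

# Simulated posterior draws: rows are MCMC iterations, columns are schools.
true_effects <- c(-0.5, 0, 0.2, 0.6)
alpha_draws <- sapply(true_effects, function(m) rnorm(1000, mean = m, sd = 0.15))
colnames(alpha_draws) <- paste0("alpha[", seq_along(true_effects), "]")

# Rank the schools within each draw (1 = lowest effect), then average the ranks.
rank_per_draw <- t(apply(alpha_draws, 1, rank))
mean_rank <- colMeans(rank_per_draw)

# Alternative: rank the posterior means instead (rank of mean, not mean of ranks).
rank_of_mean <- rank(colMeans(alpha_draws))

print(round(mean_rank, 2))
print(rank_of_mean)
```

The two summaries can disagree when posterior distributions overlap heavily or are skewed, but they give the same ordering when the effects are well separated.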
Finally, we can compute the posterior density for the intraclass correlation (ICC), defined by the quantity
$$ \text{ICC} = \frac{\sigma_\alpha^2}{\sigma_\alpha^2 + \sigma^2}, $$
to estimate the proportion of total variation due to school-specific effects. This is of interest because the schools are a random sample of a population of schools. If the ICC is high, then it means the grouping explains a lot of the total variation.
The ICC is centered around 0.08, which means about 8% of the total variation in GCSE scores is attributed to school effects, and the remaining 92% is associated with variation from student to student. We care about quantities like this more in random effect models (vs. fixed effect models) because we can generalize the result and make inferences about all schools instead of just those 38 schools.
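Because the ICC is a function of the two variance parameters, its posterior can be obtained by applying that function to each posterior draw. The sketch below uses simulated draws as stand-ins for the monitored `sig2` and `sig2.alpha` samples; the gamma shape and rate values are arbitrary choices that put the draws near 0.92 and 0.08.

```r
set.seed(11)

# Simulated stand-ins for posterior draws of the two variances.
sig2_draws       <- 1 / rgamma(4000, shape = 200, rate = 184)  # around 0.92
sig2_alpha_draws <- 1 / rgamma(4000, shape = 12,  rate = 0.9)  # around 0.08

# One ICC value per posterior draw
icc_draws <- sig2_alpha_draws / (sig2_alpha_draws + sig2_draws)

# Posterior mean, median and 95% credible interval
icc_summary <- c(mean = mean(icc_draws),
                 quantile(icc_draws, c(0.025, 0.5, 0.975)))
print(round(icc_summary, 3))
```

Calling `plot(density(icc_draws))` would show the posterior density itself.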
Model 3 - adding school-level covariates
The final question we may want to consider is what factors about the schools themselves drive GCSE performance. For example, which characteristics of school 9 contributed to its high GCSE performance?
A separate `school.txt` data set contains information about each of the 38 schools. Each school belongs to one of four denominations: public (baseline category), Church of England (CE), Roman Catholic (RC), or other. The school gender is either mixed gender (baseline category), girls only, or boys only.
School | CE | RC | Other | Girls | Boys |
---|---|---|---|---|---|
1 | 0 | 0 | 0 | 0 | 0 |
2 | 1 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 0 |
5 | 0 | 0 | 0 | 0 | 1 |
6 | 0 | 0 | 0 | 0 | 0 |
7 | 0 | 0 | 0 | 0 | 1 |
8 | 0 | 1 | 0 | 0 | 0 |
9 | 0 | 1 | 0 | 0 | 0 |
10 | 0 | 1 | 0 | 0 | 1 |
11 | 0 | 0 | 1 | 0 | 0 |
12 | 0 | 0 | 0 | 0 | 0 |
13 | 0 | 0 | 0 | 1 | 0 |
14 | 0 | 0 | 0 | 0 | 1 |
15 | 0 | 0 | 0 | 0 | 0 |
16 | 0 | 0 | 0 | 1 | 0 |
17 | 1 | 0 | 0 | 0 | 0 |
18 | 0 | 0 | 0 | 0 | 1 |
19 | 0 | 1 | 0 | 0 | 1 |
20 | 0 | 0 | 1 | 0 | 1 |
21 | 0 | 0 | 0 | 0 | 0 |
22 | 0 | 1 | 0 | 0 | 0 |
23 | 0 | 0 | 0 | 0 | 0 |
24 | 0 | 0 | 0 | 0 | 1 |
25 | 0 | 0 | 0 | 1 | 0 |
26 | 0 | 0 | 0 | 1 | 0 |
27 | 0 | 0 | 0 | 0 | 0 |
28 | 0 | 1 | 0 | 0 | 1 |
29 | 0 | 1 | 0 | 1 | 0 |
30 | 0 | 0 | 0 | 0 | 1 |
31 | 1 | 0 | 0 | 0 | 0 |
32 | 0 | 0 | 0 | 0 | 0 |
33 | 0 | 0 | 0 | 0 | 0 |
34 | 0 | 0 | 0 | 0 | 1 |
35 | 0 | 0 | 0 | 0 | 1 |
36 | 0 | 1 | 0 | 0 | 0 |
37 | 0 | 1 | 0 | 0 | 0 |
38 | 0 | 0 | 1 | 0 | 0 |
This additional information might help explain some of the variation in the school-specific baseline performance. Comparing GCSE scores by school denomination and school gender[4], we can see that public schools are somewhat average and RC schools seem to perform best; there’s not a large difference between the school gender types.
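To incorporate this into our model, we can alter the school-specific level of the previous hierarchical linear regression model (Model 2). The student level of our sampling model stays the same:
$$ Y_{ij} \mid \alpha_j, \boldsymbol{\beta}^{(\text{st})}, \sigma^2 \overset{ind}{\sim} \text{N}(\mu_{ij}^{(\text{st})}, \sigma^2), $$
and the regression function is the same as in Model 2, only with additional (st) superscripts indicating student-specific effects:
$$ \mu_{ij}^{(\text{st})} = \alpha_j + \beta_1^{(\text{st})} + \beta_2^{(\text{st})} \text{LRT}_{ij} + \beta_3^{(\text{st})} \text{VR1}_{ij} + \beta_4^{(\text{st})} \text{VR2}_{ij} + \beta_5^{(\text{st})} \text{Gender}_{ij}. $$
For the school-level sampling model:
$$ \alpha_j \mid \boldsymbol{\beta}^{(\text{sc})}, \sigma_\alpha^2 \overset{ind}{\sim} \text{N}(\mu_j^{(\text{sc})}, \sigma_\alpha^2). $$
The difference here is that instead of a normal distribution centered at zero, we have a \(\mu_j^{(\text{sc})}\) term, which regresses the school effects on the school-specific variables:
$$ \mu_j^{(\text{sc})} = \beta_1^{(\text{sc})} \text{CE}_j + \beta_2^{(\text{sc})} \text{RC}_j + \beta_3^{(\text{sc})} \text{Other}_j + \beta_4^{(\text{sc})} \text{Girls}_j + \beta_5^{(\text{sc})} \text{Boys}_j. $$
Note that there’s no intercept term in \(\mu_j^{(\text{sc})}\): the mean of the normal distribution stays at zero for the baseline category (public, mixed-gender schools), and the overall level is already captured by \(\beta_1^{(\text{st})}\).
With this, our prior model is:
$$ \begin{gathered} \beta_k^{(\text{st})} \overset{iid}{\sim} \text{N}(0, 10^{10}), \quad k = 1, \cdots, 5 \\ \beta_l^{(\text{sc})} \overset{iid}{\sim} \text{N}(0, 10^{10}), \quad l = 1, \cdots, 5 \\ \sigma_\alpha^2 \sim \text{IG}(0.01, 0.01) \\ \sigma^2 \sim \text{IG}(0.01, 0.01) \end{gathered} $$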
Posterior samples
With the posterior samples obtained via JAGS[5], we may first examine the student parameters.
Once again we get the same relationships as in Models 1 and 2, where VR1 is the best predictor of GCSE scores. The benefit is that we’ve now accounted for all the other variables.
The posterior summaries of the school-specific parameters \(\boldsymbol{\beta}^{(\text{sc})}\) show that RC schools tend to have higher GCSE performance. Girls-only and boys-only schools also tend to have higher GCSE performance when all other variables are accounted for, although the difference wasn’t as pronounced in Figure 7.
Final notes
We discussed three regression models in this lecture, starting from the traditional multiple regression model (Model 1). Combining linear regression with hierarchical modeling yields Bayesian formulations of linear mixed models (Models 2 & 3). We focused on models with random intercepts, but we could have also considered models with random slopes or random intercepts and slopes.
If we have many predictors in JAGS, it is much more convenient to use the `inprod` function when specifying the model, for example by writing the mean as `mu[i] = inprod(X[i, ], beta[])`, where `X[i, ]` is the \(i\)-th row of the design matrix.
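A sketch of what an `inprod`-based specification could look like is given below. The design matrix is built with `model.matrix()` from a small simulated data frame whose column names mirror those in the footnotes; the name `model_inprod_string` is hypothetical.

```r
# Toy predictors (simulated); column names mirror those used in the footnotes.
toy <- data.frame(LRT    = c(-3.2, 0.5, 4.1, 1.7),
                  VR1    = c(1, 0, 0, 1),
                  VR2    = c(0, 1, 0, 0),
                  Gender = c(0, 1, 1, 0))

# Design matrix with an intercept column followed by the four predictors
X <- model.matrix(~ LRT + VR1 + VR2 + Gender, data = toy)

model_inprod_string <- "
model {
  for (i in 1:n) {
    Y[i] ~ dnorm(mu[i], tau2)
    # Inner product of row i of the design matrix with the coefficient vector
    mu[i] <- inprod(X[i, ], beta[])
  }
  for (k in 1:p) {
    beta[k] ~ dnorm(0, 1e-10)
  }
  tau2 ~ dgamma(0.01, 0.01)
  sig2 <- 1 / tau2
}
"

# X, n = nrow(X) and p = ncol(X) would be passed to JAGS in the data list.
print(dim(X))
```

Adding or removing a predictor then only changes the formula passed to `model.matrix()`, not the model string.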
In the next lecture, we will talk about model selection and account for non-normal response data.
1. Goldstein, H., Rasbash, J., Yang, M., Woodhouse, G., Pan, H., Nuttall, D., & Thomas, S. (1993). A Multilevel Analysis of School Examination Results. Oxford Review of Education, 19(4), 425-433. Retrieved April 7, 2021, from http://www.jstor.org/stable/1050563
2. JAGS will recognize this automatically. In Bayesian sampling, a Gibbs sampler is usually preferable to the Metropolis-Hastings algorithm, because the latter needs to compute the acceptance ratio and is thus slower.
3. R code for generating the school rankings figure:

```r
library(tidyverse)  # dplyr, tidyr, stringr, forcats, ggplot2
library(ggpubr)     # ggscatter()

# Used rank of mean instead of mean of ranks.
# With the large sample size results are the same.
mcmc2 %>%
  select(starts_with("alpha")) %>%
  pivot_longer(everything(), names_to = "School", values_to = "Value") %>%
  # Keep only the school number from names such as "alpha[12]"
  mutate(School = str_replace(School, "^alpha\\[(\\d+)\\]$", "\\1")) %>%
  group_by(School) %>%
  summarise(
    post_mean = mean(Value),
    lowerCI = quantile(Value, c(0.025, 0.975))[1],
    upperCI = quantile(Value, c(0.025, 0.975))[2]
  ) %>%
  ungroup() %>%
  arrange(post_mean) %>%
  mutate(School = fct_inorder(School)) %>%
  ggscatter(x = "post_mean", y = "School", xlab = "") +
  geom_segment(aes(x = lowerCI, xend = upperCI, y = School, yend = School)) +
  xlim(-1, 1)
```
4. The R code for generating the boxplots comparing GCSE scores by school denomination and school gender:

```r
library(tidyverse)
library(ggpubr)     # ggboxplot()
library(patchwork)  # `|` for placing plots side by side

dat_p <- dat %>%
  select(GCSE = Y, School) %>%
  inner_join(schools, by = "School")

p1 <- dat_p %>%
  mutate(Denomination = case_when(
    CE == 1 ~ "CE",
    RC == 1 ~ "RC",
    Other == 1 ~ "Other",
    TRUE ~ "Public"
  )) %>%
  ggboxplot(x = "Denomination", y = "GCSE")

p2 <- dat_p %>%
  mutate(`School gender` = case_when(
    Girls == 1 ~ "Girls",
    Boys == 1 ~ "Boys",
    TRUE ~ "Mixed"
  )) %>%
  ggboxplot(x = "School gender", y = "GCSE")

p1 | p2
```
5. R code for specifying Model 3:

```r
p.schools <- ncol(schools) - 1

dataList3 <- list(
  "n" = n,
  "p" = p,
  "n.schools" = n.schools,
  "Y" = dat$Y,
  "LRT" = dat$LRT,
  "VR1" = dat$VR1,
  "VR2" = dat$VR2,
  "Gender" = dat$Gender,
  "School" = dat$School,
  "p.schools" = p.schools,
  "CE" = schools$CE,
  "RC" = schools$RC,
  "Other" = schools$Other,
  "Girls" = schools$Girls,
  "Boys" = schools$Boys
)

parameters3 <- c("alpha", "beta_st", "beta_sc", "sig2", "sig2.alpha")

initsValues3 <- list(
  "alpha" = rep(0, n.schools),
  "beta_st" = rep(0, p),
  "beta_sc" = rep(0, p.schools),
  "tau2" = 1,
  "tau2.alpha" = 1
)

model3 <- textConnection("
model {
  # Likelihood - in JAGS, normal distribution is parameterized by
  # mean theta and precision = tau2 = 1/sig2
  for (i in 1:n) {
    Y[i] ~ dnorm(mu_st[i], tau2)
    mu_st[i] = alpha[School[i]]+beta_st[1]+beta_st[2]*LRT[i]+beta_st[3]*VR1[i]+beta_st[4]*VR2[i]+beta_st[5]*Gender[i]
  }

  for(j in 1:n.schools){
    alpha[j] ~ dnorm(mu_sc[j],tau2.alpha)
    mu_sc[j] = beta_sc[1]*CE[j]+beta_sc[2]*RC[j]+beta_sc[3]*Other[j]+beta_sc[4]*Girls[j]+beta_sc[5]*Boys[j]
  }

  # Priors
  for (i in 1:p) {
    beta_st[i] ~ dnorm(0, 1e-10)
  }

  for(j in 1:p.schools){
    beta_sc[j] ~ dnorm(0, 1e-10)
  }

  tau2 ~ dgamma(0.01, 0.01)
  tau2.alpha ~ dgamma(0.01, 0.01)

  # Need to have model calculate variances
  sig2 = 1/tau2
  sig2.alpha = 1/tau2.alpha
}
")

jagsModel3 <- jags.model(model3,
                         data = dataList3,
                         inits = initsValues3,
                         n.chains = nChains,
                         n.adapt = adaptSteps)
close(model3)

if (burnInSteps > 0) {
  update(jagsModel3, n.iter = burnInSteps)
}

codaSamples3 <- coda.samples(jagsModel3,
                             variable.names = parameters3,
                             n.iter = nIter,
                             thin = thinSteps)
```