Estimation
Introduction
We start by defining what a linear model is. Suppose we’re interested in the response $Y$ in terms of three predictors^{1}, $X_1$, $X_2$ and $X_3$. One very general form for the model would be:
$$ Y = f(X_1, X_2, X_3) + \epsilon $$
where $f$ is some unknown function and $\epsilon$ is the error in this representation. Typically we don’t have enough data to estimate $f$ directly (even with just three predictors), so we usually have to assume that it has some more restricted form.
Linear model
One of the possibilities is perhaps a linear model:
$$ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \epsilon $$
where $\beta_i, i = 0, 1, 2, 3$ are unknown parameters, and $\beta_0$ is called the intercept term. The problem is thus reduced to the estimation of four parameters rather than the infinite-dimensional $f$. In a linear model the parameters enter linearly, meaning that the predictors do not have to be linear.
In other words, linear models can be curved. For example, the following models are all linear models:
| Model type | Formula |
| --- | --- |
| Null model | $Y_i = \beta_0 + \epsilon_i$ |
| Simple linear model | $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$ |
| Quadratic model | $Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \epsilon_i$ |
| Linear mixed model | $Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \epsilon_i$ |
| Mixed model w/ interaction | $Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i1} X_{i2} + \epsilon_i$ |
| | $Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 \log X_{i2} + \beta_3 X_{i1} X_{i2} + \epsilon_i$ |

However, this
$$ Y_i = \beta_0 + \beta_1 X_{i1}^{\beta_2} + \epsilon_i $$
is not a linear model. Some models can be transformed to linearity, for example
$$ Y_i = \beta_0 X_i^{\beta_1} \epsilon_i $$
can be linearized by taking logs.
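As a worked step, and assuming $Y_i$, $X_i$, $\beta_0$ and $\epsilon_i$ are all positive so that the logarithms exist, taking logs of both sides gives

$$ \log Y_i = \log\beta_0 + \beta_1 \log X_i + \log\epsilon_i $$

which is linear in the parameters $\beta_0^\ast = \log\beta_0$ and $\beta_1$, with $\log X_i$ as the predictor and $\log\epsilon_i$ as an additive error. The symbol $\beta_0^\ast$ is introduced here only for illustration.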
Matrix representation
All of the models above can be expressed as a general linear model, provided that we can decide what the predictors and the parameters are. Suppose there are $n$ observations in our data; we then have:
$$
\begin{equation}\label{eq:linearmodeltabular}
\begin{gathered}
Y_1 = \beta_0 + \beta_1 X_{11} + \beta_2 X_{12} + \cdots + \beta_k X_{1k} + \epsilon_1 \\
Y_2 = \beta_0 + \beta_1 X_{21} + \beta_2 X_{22} + \cdots + \beta_k X_{2k} + \epsilon_2 \\
\vdots \\
Y_n = \beta_0 + \beta_1 X_{n1} + \beta_2 X_{n2} + \cdots + \beta_k X_{nk} + \epsilon_n
\end{gathered}
\end{equation}
$$
We want a general solution to estimating the parameters of a linear model. For some special cases (e.g. simple linear regression) simple formulae can be found, but for a method that works in all cases we’ll need matrix algebra. The model in $\eqref{eq:linearmodeltabular}$ can be conveniently written in matrix notation as:
$$ \boldsymbol{y} = \boldsymbol{X\beta} + \boldsymbol{\epsilon} $$
where $\boldsymbol{y} = (y_1, \cdots, y_n)^\prime$, $\boldsymbol{\epsilon} = (\epsilon_1, \cdots, \epsilon_n)^\prime$, $\boldsymbol{\beta} = (\beta_0, \cdots, \beta_k)^\prime$ and
$$
\boldsymbol{X} = \begin{pmatrix}
1 & X_{11} & \cdots & X_{1k} \\
1 & X_{21} & \cdots & X_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
1 & X_{n1} & \cdots & X_{nk}
\end{pmatrix}
$$
The column of ones incorporates the intercept term. In the null model, where there’s no predictor, we simply have
$$ \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} \mu + \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix} $$
We can assume that $E[\boldsymbol{\epsilon}] = 0$ because otherwise we could just merge the nonzero expectation for the error term into the mean $\mu$.
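To make the matrix form concrete, here is a minimal R sketch that builds design matrices with `model.matrix()`. The toy variables `x1` and `x2` are made up for illustration and are not part of the original text.

```r
# Hypothetical toy data; the variable names are illustrative only
set.seed(1)
x1 <- runif(5)
x2 <- runif(5)

# Design matrix for Y = b0 + b1*X1 + b2*X2 + e:
# a column of ones (intercept) followed by the predictors
X_main <- model.matrix(~ x1 + x2)

# A quadratic model is still linear in the parameters:
# x1^2 simply becomes another column of X
X_quad <- model.matrix(~ x1 + I(x1^2))

# The null model has only the column of ones
X_null <- model.matrix(~ 1, data = data.frame(x1))

dim(X_main)       # rows = observations, columns = parameters
colnames(X_quad)
```

Each row of the design matrix corresponds to one observation and each column to one parameter, which is why a curved model such as the quadratic one still fits the $\boldsymbol{y} = \boldsymbol{X\beta} + \boldsymbol{\epsilon}$ template.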
Estimation
The unknown parameters in the model $\boldsymbol{y} = \boldsymbol{X\beta} + \boldsymbol{\epsilon}$ are the regression coefficients $\boldsymbol{\beta}$ and the error variance $\sigma^2$. The purpose of collecting the data is to estimate and make inferences about these parameters.
Estimating $\boldsymbol{\beta}$
The regression model partitions the response into a systematic component $X\beta$ and a random component $\epsilon$. We want to choose $\beta$ such that the systematic part explains as much of the response as possible, leaving just random variation in the residuals.
Geometrically speaking, the response $\boldsymbol{y}$ lies in an $n$-dimensional space ($\boldsymbol{y} \in \mathbb{R}^n$), while $\boldsymbol{\beta} \in \mathbb{R}^p$ where $n$ is the number of observations and $p$ is the number of parameters^{2}.
To find $\boldsymbol{\beta}$ such that $\boldsymbol{X\beta}$ is as close to $Y$ as possible, we look for predicted/fitted values $\hat{Y}$ in the $p$-dimensional model space that best represent the original $Y$. These are found by projecting $Y$ orthogonally onto the model space spanned by the columns of $X$. The fitted values are given by
$$ \boldsymbol{\hat{Y}} = \boldsymbol{X\hat\beta} = \boldsymbol{HY} $$
where $H$ is an orthogonal projection matrix. $\hat\beta$, the regression coefficients, are the best estimates of $\beta$ within the model space. The difference between the actual response $Y$ and the predicted response $\hat{Y}$ is denoted by $\hat\epsilon$ and is called the residual.
Least squares estimation
The estimation of $\boldsymbol{\beta}$ can also be considered from a non-geometric point of view. We may define the best estimate of $\boldsymbol{\beta}$ as the one which minimizes the sum of the squared errors:
$$ \sum\epsilon_i^2 = \boldsymbol{\epsilon^\prime\epsilon} = (\boldsymbol{y} - \boldsymbol{X\beta})^\prime(\boldsymbol{y} - \boldsymbol{X\beta}) $$
Expanding this, we get
$$ \boldsymbol{y^\prime y} - 2\boldsymbol{\beta^\prime X^\prime y} + \boldsymbol{\beta^\prime X^\prime X \beta} $$
Differentiating with respect to $\boldsymbol{\beta}$ and setting to zero, we find that $\boldsymbol{\hat\beta}$ satisfies
$$ \boldsymbol{X^\prime X \hat\beta} = \boldsymbol{X^\prime y} $$
These are called the normal equations. Now if $\boldsymbol{X^\prime X}$ is invertible,
$$
\begin{aligned}
\boldsymbol{\hat\beta} &= (\boldsymbol{X^\prime X})^{-1} \boldsymbol{X^\prime y} \\
\boldsymbol{X\hat\beta} &= \boldsymbol{X}(\boldsymbol{X^\prime X})^{-1} \boldsymbol{X^\prime y} \\
\boldsymbol{\hat{y}} &= \boldsymbol{Hy}
\end{aligned}
$$
where $\boldsymbol{H} = \boldsymbol{X}(\boldsymbol{X^\prime X})^{-1} \boldsymbol{X}^\prime$ is the $n \times n$ hat matrix. It is the orthogonal projection matrix onto the space spanned by the columns of $X$ mentioned above. The matrix is symmetric and idempotent. A few useful quantities can be represented using $H$:
- Predicted/fitted values: $\boldsymbol{\hat{y}} = \boldsymbol{Hy} = \boldsymbol{X\hat\beta}$
- Residuals: $\boldsymbol{\hat\epsilon} = \boldsymbol{y} - \boldsymbol{X\hat\beta} = \boldsymbol{y} - \boldsymbol{\hat{y}} = (\boldsymbol{I} - \boldsymbol{H})\boldsymbol{y}$
- Residual sum of squares (RSS): $\boldsymbol{\hat\epsilon^\prime\hat\epsilon} = \boldsymbol{y}^\prime(\boldsymbol{I} - \boldsymbol{H})^\prime(\boldsymbol{I} - \boldsymbol{H})\boldsymbol{y} = \boldsymbol{y}^\prime(\boldsymbol{I} - \boldsymbol{H})\boldsymbol{y}$
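
The quantities above can be checked numerically. The following is a minimal sketch on simulated data (the data-generating values are arbitrary assumptions) that solves the normal equations directly, forms the hat matrix, and compares the results with R's built-in `lm()`.

```r
# Simulated data; coefficients and noise level are arbitrary choices
set.seed(42)
n <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n, sd = 0.3)

X <- cbind(1, x1, x2)  # design matrix with a column of ones

# Solve the normal equations X'X b = X'y
# (solve(A, b) avoids forming the inverse explicitly)
beta_hat <- solve(crossprod(X), crossprod(X, y))

# Hat matrix H = X (X'X)^{-1} X'
H <- X %*% solve(crossprod(X)) %*% t(X)

y_hat <- H %*% y                           # fitted values
res   <- (diag(n) - H) %*% y               # residuals
rss   <- drop(t(y) %*% (diag(n) - H) %*% y)  # residual sum of squares

# Compare with lm()
fit <- lm(y ~ x1 + x2)
all.equal(as.vector(beta_hat), unname(coef(fit)))
all.equal(as.vector(y_hat), unname(fitted(fit)))
all.equal(rss, sum(residuals(fit)^2))

# H is symmetric and idempotent, and its trace equals the number of parameters
all.equal(H, t(H))
all.equal(H %*% H, H)
sum(diag(H))
```

Forming $\boldsymbol{H}$ explicitly is only practical for small $n$; it is done here purely to mirror the formulas.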
Later we’ll show that the ordinary least squares (OLS) estimator is the best possible estimate of $\boldsymbol{\beta}$ when the errors are uncorrelated and have equal variance, i.e. $Var(\boldsymbol{\epsilon}) = \sigma^2\boldsymbol{I}$. $\boldsymbol{\hat\beta}$ is unbiased and has variance $(\boldsymbol{X^\prime X})^{-1} \sigma^2$.
$\boldsymbol{\hat\beta}$ is a good estimate from several aspects. First, it results from an orthogonal projection onto the model space, so it makes sense geometrically. Second, if the errors are independent and identically normally distributed, it’s the maximum likelihood estimator. Finally, the Gauss-Markov theorem states that it’s the best linear unbiased estimate (BLUE).
Estimation of $\sigma^2$
We found the RSS to be $\boldsymbol{\hat\epsilon^\prime\hat\epsilon} = \boldsymbol{y}^\prime(\boldsymbol{I} - \boldsymbol{H})\boldsymbol{y}$. To find its expectation, we first rewrite the residual to be
$$ \boldsymbol{\hat\epsilon} = (\boldsymbol{I} - \boldsymbol{H})\boldsymbol{y} = (\boldsymbol{I} - \boldsymbol{H})\boldsymbol{X\beta} + (\boldsymbol{I} - \boldsymbol{H})\boldsymbol{\epsilon} $$
The first part is zero because
$$ (\boldsymbol{I} - \boldsymbol{H})\boldsymbol{X} = \boldsymbol{X} - \boldsymbol{X}(\boldsymbol{X^\prime X})^{-1} \boldsymbol{X}^\prime\boldsymbol{X} = \boldsymbol{X} - \boldsymbol{X} = 0 $$
Thus the RSS is
$$
\begin{aligned}
RSS &= \boldsymbol{\epsilon}^\prime(\boldsymbol{I} - \boldsymbol{H})^\prime(\boldsymbol{I} - \boldsymbol{H})\boldsymbol{\epsilon} \\
&= \boldsymbol{\epsilon}^\prime(\boldsymbol{I} - \boldsymbol{H})\boldsymbol{\epsilon} \\
&= \boldsymbol{\epsilon}^\prime\boldsymbol{\epsilon} - \boldsymbol{\epsilon}^\prime\boldsymbol{H\epsilon}
\end{aligned}
$$
We know that the errors are i.i.d. with mean 0 and variance $\sigma^2$, so
$$
E[\epsilon_i \epsilon_j] = \begin{cases}
0, & i \neq j, \\
\sigma^2, & \text{otherwise}
\end{cases}
$$
So the expectation is
$$
\begin{aligned}
E[RSS] &= E\left[ \boldsymbol{\epsilon}^\prime\boldsymbol{\epsilon} - \boldsymbol{\epsilon}^\prime\boldsymbol{H\epsilon} \right] \\
&= n\sigma^2 - E\left[ \boldsymbol{\epsilon}^\prime\boldsymbol{H\epsilon} \right] \\
&= n\sigma^2 - E\left[ Tr(\boldsymbol{\epsilon}^\prime\boldsymbol{H\epsilon}) \right] \\
&= n\sigma^2 - E\left[ Tr(\boldsymbol{H\epsilon}\boldsymbol{\epsilon}^\prime) \right] \\
&= n\sigma^2 - Tr\left(\boldsymbol{H} E\left[ \boldsymbol{\epsilon}\boldsymbol{\epsilon}^\prime \right] \right) \\
&= n\sigma^2 - Tr(\boldsymbol{H})\sigma^2 \\
&= n\sigma^2 - Tr \left(\boldsymbol{X}(\boldsymbol{X}^\prime\boldsymbol{X})^{-1} \boldsymbol{X}^\prime \right) \sigma^2 \\
&= n\sigma^2 - Tr \left(\boldsymbol{X}^\prime\boldsymbol{X}(\boldsymbol{X}^\prime\boldsymbol{X})^{-1} \right) \sigma^2 \\
&= n\sigma^2 - Tr(\boldsymbol{I}_{p}) \sigma^2 \\
&= (n - p)\sigma^2
\end{aligned}
$$
With this, we can easily see that
$$ \hat\sigma^2 = \frac{\boldsymbol{\hat\epsilon}^\prime \hat{\boldsymbol{\epsilon}}}{n - p} $$
is an unbiased estimator of $\sigma^2$. The $n - p$ here is referred to as the degrees of freedom of the model.
Sometimes we need the standard error for a specific parameter. In this case we may look at the diagonal of the variance-covariance matrix:
$$ se(\hat\beta_{i-1}) = \sqrt{(\boldsymbol{X}^\prime \boldsymbol{X})^{-1}_{ii}} \, \hat\sigma $$
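
As a quick numerical check of these two formulas, the sketch below computes $\hat\sigma^2$ and the standard errors by hand on simulated data (all values are arbitrary assumptions) and compares them with what `summary(lm(...))` reports.

```r
# Simulated data; the true coefficients and sd are arbitrary
set.seed(1)
n <- 40
x1 <- runif(n)
x2 <- runif(n)
y <- 2 + x1 - 3 * x2 + rnorm(n, sd = 0.5)

X <- cbind(1, x1, x2)
p <- ncol(X)  # number of parameters, including the intercept

XtX_inv  <- solve(crossprod(X))
beta_hat <- XtX_inv %*% crossprod(X, y)
res      <- y - X %*% beta_hat

# Unbiased estimate of the error variance: RSS / (n - p)
sigma2_hat <- sum(res^2) / (n - p)

# Standard errors: sqrt of the diagonal of (X'X)^{-1}, times sigma_hat
se <- sqrt(diag(XtX_inv)) * sqrt(sigma2_hat)

# Compare with lm()
fit <- lm(y ~ x1 + x2)
all.equal(sqrt(sigma2_hat), summary(fit)$sigma)
all.equal(unname(se), unname(summary(fit)$coefficients[, "Std. Error"]))
```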
Gauss-Markov theorem
To understand the theorem, we first need the concept of an estimable function. A linear combination of the parameters $\Psi = c^\prime \beta$ is estimable if and only if there exists a linear combination of the observations $a^\prime y$ such that
$$ E\left[ a^\prime y \right] = c^\prime \beta \quad \forall \beta $$
If $X$ is of full rank, then all linear combinations are estimable^{3}.
Suppose $E[\boldsymbol{\epsilon}] = 0$ and $Var(\boldsymbol{\epsilon}) = \sigma^2\boldsymbol{I}$. Suppose also that the structural part of the model, $E[Y] = X\beta$, is correct. Let $\Psi = c^\prime \beta$ be an estimable function. The Gauss-Markov theorem then states that in the class of all unbiased linear estimates of $\Psi$, $\hat\Psi = c^\prime \hat\beta$ has the minimum variance and is unique.
Proof
Suppose $a^\prime y$ is some unbiased estimate of $c^\prime \beta$ such that
$$
\begin{gathered}
E[a^\prime y] = c^\prime \beta \quad \forall \beta \\
a^\prime X\beta = c^\prime \beta \quad \forall \beta
\end{gathered}
$$
which means that $a^\prime X = c^\prime$. This implies $c$ must be in the column space of $X^\prime$, which in turn implies that $c$ is also in the column space of $X^\prime X$. This means there exists a $\lambda$ such that $c = X^\prime X \lambda$, so
$$ c^\prime \hat\beta = \lambda^\prime X^\prime X \hat\beta = \lambda^\prime X^\prime y $$
Now we can show that the LSE has the minimum variance. We may pick an arbitrary estimate $a^\prime y$ and compute its variance:
$$
\begin{aligned}
Var(a^\prime y) &= Var \left( a^\prime y - c^\prime \hat\beta + c^\prime \hat\beta \right) \\
&= Var \left( a^\prime y - \lambda^\prime X^\prime y + c^\prime \hat\beta \right) \\
&= Var \left( a^\prime y - \lambda^\prime X^\prime y\right) + Var \left(c^\prime \hat\beta \right) + 2Cov \left( a^\prime y - \lambda^\prime X^\prime y, \lambda^\prime X^\prime y \right)
\end{aligned}
$$
If we focus on the covariance term:
$$
\begin{aligned}
Cov \left( a^\prime y - \lambda^\prime X^\prime y, \lambda^\prime X^\prime y \right) &= Cov \left( (a^\prime - \lambda^\prime X^\prime)y, \lambda^\prime X^\prime y \right) \\
&= \left( a^\prime - \lambda^\prime X^\prime \right) \sigma^2 \boldsymbol{I} X\lambda \\
&= \sigma^2 \left( a^\prime X - \lambda^\prime X^\prime X \right)\lambda \\
&= \sigma^2 \left( c^\prime - c^\prime \right)\lambda \\
&= 0
\end{aligned}
$$
So
$$ Var(a^\prime y) = Var \left( a^\prime y - \lambda^\prime X^\prime y\right) + Var \left(c^\prime \hat\beta \right) \geq Var \left(c^\prime \hat\beta \right) $$
In other words, $c^\prime \hat\beta$ has minimum variance. The equality holds only when $a^\prime - \lambda^\prime X^\prime = 0$, which means $a^\prime y = \lambda^\prime X^\prime y = c^\prime \hat\beta$. The minimum-variance estimator is therefore unique, because the bound is attained only when $a^\prime y = c^\prime \hat\beta$.
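
To see the theorem at work, the sketch below compares the OLS slope in a simple regression with one competing linear unbiased estimator. The competitor, the slope of the line through the first and last observations, is an assumption chosen only for illustration. Under $Var(\boldsymbol{\epsilon}) = \sigma^2\boldsymbol{I}$, any linear estimator $a^\prime y$ has variance $\sigma^2 a^\prime a$, so the comparison needs no simulation.

```r
n <- 20
x <- seq(0, 1, length.out = n)
X <- cbind(1, x)

# Weights a such that a'y is the OLS slope:
# the second row of (X'X)^{-1} X'
a_ols <- (solve(crossprod(X)) %*% t(X))[2, ]

# Competing linear estimator (hypothetical, for illustration only):
# slope of the line through the first and last observations
a_alt <- numeric(n)
a_alt[1] <- -1 / (x[n] - x[1])
a_alt[n] <-  1 / (x[n] - x[1])

# Both satisfy a'X = c' with c = (0, 1)', so both are unbiased for beta_1
a_ols %*% X
a_alt %*% X

# Var(a'y) = sigma^2 * a'a; take sigma^2 = 1 for the comparison
var_ols <- sum(a_ols^2)
var_alt <- sum(a_alt^2)
c(ols = var_ols, alternative = var_alt)
```

The OLS weights give the smaller value of $a^\prime a$, as the theorem guarantees for any unbiased linear competitor.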
Implications
The Gauss-Markov theorem shows that the LSE is a good choice, but it does require the errors to be uncorrelated and homoscedastic (have equal variance). Even when these conditions hold, nonlinear or biased estimates may work better if the errors are non-normal. The theorem doesn’t tell us to use the LSE all the time; it just strongly suggests it unless there’s some strong reason to do otherwise.
Situations where estimators other than ordinary least squares should be considered are:
- When the errors are correlated or have unequal variance, generalized least squares should be used.
- When the error distribution is long-tailed, robust estimates^{4} might be used.
- When the predictors are highly correlated, biased estimates such as ridge regression might be preferable.
Goodness of fit
It’s useful to have some measure of how well the model fits the data. One common choice is $R^2$, the coefficient of determination or percentage of variance explained:
$$ R^2 = 1 - \frac{\sum(\hat{y}_i - y_i)^2}{\sum(y_i - \bar{y})^2} = 1 - \frac{RSS}{TSS} $$
where TSS stands for total sum of squares. $R^2$ ranges between 0 and 1, and values closer to 1 indicate better fits. An equivalent definition is
$$ R^2 = \frac{\sum(\hat{y}_i - \bar{y})^2}{\sum(y_i - \bar{y})^2} = \frac{\text{Regression SS}}{TSS} = corr^2(\hat{y}, y) $$
It should be noted that the first definition requires an intercept in the regression model. The TSS in the denominator is the residual sum of squares of the null model, which contains only an intercept, so $R^2$ compares the fitted model against that null model.
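
The equivalence of these definitions (for a model with an intercept) is easy to confirm numerically. A minimal sketch on simulated data, with arbitrary assumed parameter values:

```r
set.seed(7)
n <- 60
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)

fit   <- lm(y ~ x)
y_hat <- fitted(fit)

rss <- sum((y - y_hat)^2)     # residual sum of squares
tss <- sum((y - mean(y))^2)   # total sum of squares

r2_a <- 1 - rss / tss                     # first definition
r2_b <- sum((y_hat - mean(y))^2) / tss    # regression SS / TSS
r2_c <- cor(y_hat, y)^2                   # squared correlation

# All three agree with the value reported by summary()
c(r2_a, r2_b, r2_c, summary(fit)$r.squared)
```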
So what is a good value of $R^2$? It really depends on the area of application. In the biological and social sciences, variables tend to be much more weakly correlated with a lot of noise involved, so an $R^2$ of 0.6 might be considered good. In physics and engineering, where most data are gathered from closely controlled experiments, $R^2 = 0.6$ could be considered low.
$R^2$ also shouldn’t be the sole measure of fit. As shown in Figure 1^{5}, although the $R^2$ is roughly 0.70 in all four simulated datasets, the underlying relationship could be very different. B has smaller variation in $x$ and also smaller residual variation compared with A, so predictions would also have less variation. C looks like a really good fit except for one outlier, demonstrating how sensitive $R^2$ is to extreme values. The true relationship in D seems to be quadratic, showing that $R^2$ doesn’t tell us much about whether we have the right model.
An alternative measure of fit is $\hat\sigma$. This quantity is directly related to the standard errors of estimates of $\beta$ and predictions. $\hat\sigma$ is measured in the unit of the response, so it can be directly interpreted given the context of the dataset, but it’s hard to say if a $\hat\sigma$ value is large or small as it depends on the scale of the data.
Simple linear regression
As we said earlier, in a few simple models it’s possible to derive explicit formulae for the parameter estimates. In the null model $y = \mu + \epsilon$, we have $\boldsymbol{X} = \boldsymbol{1}$ and $\beta = \mu$, hence
$$
\begin{gathered}
\boldsymbol{X}^\prime \boldsymbol{X} = \boldsymbol{1}^\prime \boldsymbol{1} = n \\
\hat\beta = (\boldsymbol{X}^\prime \boldsymbol{X})^{-1} \boldsymbol{X}^\prime \boldsymbol{y} = \frac{1}{n}\boldsymbol{1}^\prime \boldsymbol{y} = \bar{y}
\end{gathered}
$$
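A short check of the null-model result in R, using made-up data: the matrix formula with a design matrix consisting of a single column of ones returns the sample mean, and so does `lm(y ~ 1)`.

```r
set.seed(3)
y <- rnorm(25, mean = 10, sd = 2)  # arbitrary simulated response

# Design matrix of the null model: one column of ones
X <- matrix(1, nrow = length(y), ncol = 1)

beta_hat <- solve(crossprod(X), crossprod(X, y))

all.equal(as.vector(beta_hat), mean(y))
all.equal(unname(coef(lm(y ~ 1))), mean(y))
```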
In simple linear regression (SLR) where we have only one predictor, the model is
$$ \begin{equation} \label{eq:simplelinearregression} y_i = \beta_0 + \beta_1 x_i + \epsilon_i \end{equation} $$
which in matrix notation is
$$
\begin{pmatrix}
y_1 \\ \vdots \\ y_n
\end{pmatrix} =
\begin{pmatrix}
1 & x_1 \\
\vdots & \vdots \\
1 & x_n
\end{pmatrix}
\begin{pmatrix}
\beta_0 \\ \beta_1
\end{pmatrix} +
\begin{pmatrix}
\epsilon_1 \\ \vdots \\ \epsilon_n
\end{pmatrix}
$$
Here $x_i$ and $y_i$ are observed values, $\beta_0$ is the intercept, and $\beta_1$ is the slope. $\epsilon_i$ is the error for data pair $i$, and the $\epsilon$’s are assumed to be independent (frequently assumed normal) random variables with mean 0 and standard deviation $\sigma_\epsilon$.
Estimating parameters
We may apply the formula $\boldsymbol{\hat\beta} = (\boldsymbol{X^\prime X})^{-1} \boldsymbol{X^\prime y}$ directly, but a simpler approach is to rewrite $\eqref{eq:simplelinearregression}$ as
$$ y_i = \underbrace{\beta_0 + \beta_1 \bar{x}}_{\beta_0^\ast} + \beta_1(x_i - \bar{x}) + \epsilon_i $$
So now we have
$$
\boldsymbol{X} = \begin{pmatrix}
1 & x_1 - \bar{x} \\
\vdots & \vdots \\
1 & x_n - \bar{x}
\end{pmatrix}, \quad
\boldsymbol{X}^\prime \boldsymbol{X} = \begin{pmatrix}
n & \sum x_i - n\bar{x} \\
\sum x_i - n\bar{x} & \sum (x_i - \bar{x})^2
\end{pmatrix} = \begin{pmatrix}
n & 0 \\
0 & \sum(x_i - \bar{x})^2
\end{pmatrix}
$$
The matrix $\boldsymbol{X^\prime X}$ is now diagonal. Applying the formula gives us
$$
\begin{aligned}
\begin{pmatrix}
\hat\beta_0^\ast \\ \hat\beta_1
\end{pmatrix} &= (\boldsymbol{X^\prime X})^{-1} \boldsymbol{X^\prime y} \\
&= \begin{pmatrix}
\frac{1}{n} & 0 \\
0 & \frac{1}{\sum(x_i - \bar{x})^2}
\end{pmatrix}
\begin{pmatrix}
1 & \cdots & 1 \\
x_1 - \bar{x} & \cdots & x_n - \bar{x}
\end{pmatrix}\boldsymbol{y} \\
&= \begin{pmatrix}
\frac{1}{n} & \cdots & \frac{1}{n} \\
\frac{x_1 - \bar{x}}{\sum(x_i - \bar{x})^2} & \cdots & \frac{x_n - \bar{x}}{\sum(x_i - \bar{x})^2}
\end{pmatrix}\boldsymbol{y}
\end{aligned}
$$
Solving the above gives us
$$
\begin{gathered}
\hat\beta_0^\ast = \frac{\sum y_i}{n} = \bar{y} \\
\hat\beta_1 = \frac{\sum(x_i - \bar{x})y_i}{\sum(x_i - \bar{x})^2} = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2}
\end{gathered}
$$
The least squares estimates of the parameters are:
$$
\begin{gathered}
\hat\beta_1 = \frac{SS_{xy}}{SS_x} \\
\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} \\
\hat\sigma_\epsilon = \sqrt{\frac{SSE}{n - 2}} = \sqrt{MSE}
\end{gathered}
$$
where
$$
\begin{aligned}
SS_{xy} &= \sum_{i=1}^n x_iy_i - \frac{\sum_{i=1}^n x_i \sum_{i=1}^n y_i}{n} = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) \\
SS_x &= \sum_{i=1}^n x_i^2 - \frac{1}{n} \left(\sum_{i=1}^n x_i \right)^2 = \sum_{i=1}^n (x_i - \bar{x})^2 \\
SS_y &= \sum_{i=1}^n y_i^2 - \frac{1}{n} \left(\sum_{i=1}^n y_i \right)^2 = \sum_{i=1}^n (y_i - \bar{y})^2
\end{aligned}
$$
The variance-covariance matrix is
$$
Cov\begin{pmatrix}
\hat\beta_0 \\ \hat\beta_1
\end{pmatrix} = \begin{pmatrix}
\frac{1}{n} + \frac{\bar{x}^2}{SS_x} & -\frac{\bar{x}}{SS_x} \\
-\frac{\bar{x}}{SS_x} & \frac{1}{SS_x}
\end{pmatrix} \sigma^2
$$
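The closed-form expressions for simple linear regression can be verified against `lm()`. The sketch below uses simulated data with arbitrary assumed parameters and plugs $\hat\sigma^2$ in for $\sigma^2$ when forming the variance-covariance matrix.

```r
set.seed(11)
n <- 30
x <- runif(n, 0, 10)
y <- 3 + 0.8 * x + rnorm(n, sd = 1)

# Sums of squares and cross-products
ss_xy <- sum((x - mean(x)) * (y - mean(y)))
ss_x  <- sum((x - mean(x))^2)

# Closed-form least squares estimates
b1 <- ss_xy / ss_x
b0 <- mean(y) - b1 * mean(x)

# Residual standard deviation with n - 2 degrees of freedom
sse       <- sum((y - b0 - b1 * x)^2)
sigma_hat <- sqrt(sse / (n - 2))

# Estimated variance-covariance matrix of (b0, b1)
V <- sigma_hat^2 * matrix(
  c(1 / n + mean(x)^2 / ss_x, -mean(x) / ss_x,
    -mean(x) / ss_x,           1 / ss_x),
  nrow = 2, byrow = TRUE
)

# Compare with lm()
fit <- lm(y ~ x)
all.equal(c(b0, b1), unname(coef(fit)))
all.equal(sigma_hat, summary(fit)$sigma)
all.equal(V, unname(vcov(fit)))
```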
Sum of squares
The naming conventions for the three sums of squares can be confusing. The residual sum of squares (RSS), also known as the sum of squared estimate of errors (SSE), is the sum of the squares of the residuals. The explained sum of squares (ESS), alternatively known as the model sum of squares or regression sum of squares (SSR), is the sum of the squares of the deviations of the predicted values from the mean of the response. The total sum of squares (TSS or SST) is defined as the sum over all squared differences between the observations and their overall mean.
$$
\begin{gathered}
SS_{\text{residual}} = RSS = SSE = \sum_{i=1}^n (y_i - \hat{y}_i)^2 \\
SS_{\text{regression}} = ESS = SSR = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 \\
SS_{\text{total}} = TSS = SST = \sum_{i=1}^n (y_i - \bar{y})^2 = SS_y
\end{gathered}
$$
It can be shown that $TSS = SSR + SSE$, meaning that the total sum of squares can be partitioned into two parts: a part that can be explained by the model, and a random part from the errors. We’ll get into the details in the ANOVA chapter.
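
The partition can be confirmed numerically before the formal proof. A minimal sketch on simulated data (arbitrary assumed values), for a model that includes an intercept:

```r
set.seed(5)
x <- rnorm(40)
y <- 0.5 - 1.5 * x + rnorm(40)

fit <- lm(y ~ x)

sse <- sum(residuals(fit)^2)             # residual sum of squares
ssr <- sum((fitted(fit) - mean(y))^2)    # regression sum of squares
sst <- sum((y - mean(y))^2)              # total sum of squares

# The total splits exactly into explained and residual parts
all.equal(sst, ssr + sse)
```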
Goodness of fit
For simple linear regression, $R^2 = r^2$ where $r$ is Pearson’s correlation between $x$ and $y$. Combined with the definition of $R^2$, we have
$$
\begin{gathered}
r^2 = \frac{SSR}{SST} = \frac{SS_{xy}^2}{SS_x SS_y} \\
SSR = r^2 \cdot SST = \frac{SS_{xy}^2}{SS_x}
\end{gathered}
$$
To predict $y$ for a given $x$ with the model, we may use
$$ \hat{y} = \hat\beta_0 + \hat\beta_1 x $$
Note that this is the best prediction for the response $y$ from $x$, not the other way around.
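The sketch below, on simulated data with arbitrary assumed values, checks that $R^2 = r^2$, makes predictions with `predict()`, and illustrates why the direction matters: regressing $x$ on $y$ does not give the inverse of the original slope. The new `x` values are hypothetical.

```r
set.seed(9)
x <- runif(50, 0, 5)
y <- 1 + 2 * x + rnorm(50)

fit <- lm(y ~ x)

# R^2 equals the squared Pearson correlation between x and y
all.equal(cor(x, y)^2, summary(fit)$r.squared)

# Predictions at new (hypothetical) x values, by predict() and by hand
new_x  <- data.frame(x = c(1, 2.5))
pred   <- predict(fit, newdata = new_x)
manual <- coef(fit)[1] + coef(fit)[2] * new_x$x
all.equal(unname(pred), unname(manual))

# Regressing x on y is a different fit: the two slopes multiply
# to r^2 rather than to 1, so one line is not the inverse of the other
fit_rev <- lm(x ~ y)
unname(coef(fit)[2] * coef(fit_rev)[2])
cor(x, y)^2
```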
Remarks
The most important formula in this chapter is
$$ \boldsymbol{\hat\beta} = (\boldsymbol{X^\prime X})^{-1} \boldsymbol{X^\prime y} $$
The most difficult part is the evaluation of $(\boldsymbol{X^\prime X})^{-1}$. In most cases ($p > 2$) we’ll use software to compute the matrix, with several caveats:
- The number of observations, $n$, must be at least as large as the number of parameters $p$, preferably much larger. A rule of thumb is 10-20 observations per parameter to be estimated.
- If two or more variables are exactly linearly dependent, $\boldsymbol{X^\prime X}$ is singular and its inverse does not exist, so the estimates are not unique. Even if predictors are only highly correlated, known as (multi)collinearity, algorithms might fail to invert the moment matrix.
- The set of parameter estimates $(\hat\beta_0, \cdots, \hat\beta_k)$ should all be used when making predictions. If some predictors are deleted from the model, we can’t simply use the remaining estimates to make the prediction. The regression model must be refit to obtain new estimates.
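
The second caveat can be illustrated with a small sketch. The data are simulated, and `x2` is deliberately constructed as an exact multiple of `x1`, an assumption made purely to trigger the problem.

```r
set.seed(2)
n <- 20
x1 <- rnorm(n)
x2 <- 2 * x1               # exactly linearly dependent on x1
y  <- 1 + x1 + rnorm(n)

X <- cbind(1, x1, x2)

# The design matrix has 3 columns but only rank 2
qr(X)$rank

# Inverting X'X directly will typically stop with a singularity error;
# try() keeps the script running so the outcome can be inspected
inv_try <- try(solve(crossprod(X)), silent = TRUE)
inherits(inv_try, "try-error")

# lm() detects the dependency and reports NA for the redundant coefficient
fit <- lm(y ~ x1 + x2)
coef(fit)
```

Rather than failing, `lm()` drops the aliased column, which is why a coefficient of `NA` in the output is usually a sign of linearly dependent predictors.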

^{1} Predictor variables are also called covariates, explanatory variables, or regressors.

^{2} The number of parameters is the number of predictors plus one, because we almost always need to include the intercept term. We might see $p-1$, $p$ and $p+1$ used in different contexts, so be careful.

^{3} See this article for an excellent explanation.

^{4} Robust estimates are typically not linear in $y$.

^{5} R code for plotting Figure 1:

```r
library(tidyverse)
library(ggpubr)

set.seed(42)

# A: baseline linear relationship
dat1 <- tibble(
  x = runif(30, 0, 1),
  y = x + rnorm(30, 0, 0.15),
  Group = "A"
)
# B: narrower range of x and smaller residual variation
dat2 <- tibble(
  x = runif(30, 0.3, 0.7),
  y = x + rnorm(30, 0, 0.12),
  Group = "B"
)
# C: near-perfect fit with occasional outliers
dat3 <- tibble(
  x = runif(30, 0, 1),
  y = x + rnorm(30, 0, 0.001) + rbinom(30, 1, 0.05),
  Group = "C"
)
# D: quadratic relationship
dat4 <- tibble(
  x = runif(30, 0, 1),
  y = 3 * (x - 0.37)^2,
  Group = "D"
)
dat <- bind_rows(dat1, dat2, dat3, dat4)

# One panel per group, each with its own fitted regression line
ggscatter(dat, x = "x", y = "y", color = "Group", palette = "jco",
          add = "reg.line", add.params = list(color = "gray25")) %>%
  facet(facet.by = "Group", nrow = 2) +
  theme(legend.position = "none")
```