Much of modern nonparametric statistics is highly computational - most of it has been developed in the last 30 years or so. We'll start with the bootstrap, then kernel density estimation, and finally modern regression approaches.


Bootstrapping is a widely used computational method, mainly because the theory is often intractable - you can't easily work out the properties of the estimator - or the theory is difficult to apply or work with - even checking the assumptions can be difficult.

It allows us to work with

  • (almost) arbitrary distributions at the population level
  • more general characteristics of the population, i.e. not just the mean


There are two main versions of the bootstrap: the parametric bootstrap and the non-parametric bootstrap.

A nice thing about the nonparametric bootstrap is that it's very close to being assumption free. One key assumption is that the sample (empirical) CDF is a good approximation to the true population CDF for samples that are not too small. That is, $S(x)$, the empirical CDF, should reflect the main characteristics of $F(x)$, the theoretical population CDF. We don't know $F(x)$, so we use $S(x)$ as a proxy for it.

The idea is based on resampling of data. In the real world, we have the population $F(x)$ and some parameter $\theta$. Repeated sampling is usually not feasible. In the bootstrap world, we have a sample $S(x)$ and an estimated parameter $\hat{\theta}$. Repeated sampling is totally feasible with enough computing power to do it.

Note that permutation testing is also a type of resampling. So what's the main difference? With permutation, we resample without replacement, which leads to exact tests. With the bootstrap, we resample with replacement from our original sample, which leads to approximate results. The bootstrap is the more general approach.


Suppose we have a sample of size $n$: $x_1, x_2, \cdots, x_n$ from some unknown population. A bootstrap sample would be a random sample of size $n$ sampled from $x_1, x_2, \cdots, x_n$ with replacement. Some $x_i$ will appear multiple times in a given bootstrap sample, and others not at all.

In principle, we could enumerate all possible bootstrap resamples; in practice, this number grows very quickly with $n$, e.g. when $n=10$ there are more than $90,000$ possible different bootstrap samples.

In real applications, we'll only ever take a sample of the possible bootstrap resamples - call the number of resamples $B$. Often $B \approx 500$ or $B \approx 1000$ will suffice; we rarely need more than this.

Standard notation: a typical bootstrap resample will be denoted $x_1^\ast, x_2^\ast, \cdots, x_n^\ast$, where each $x_i^\ast$ is equal to one of the original $x_i$. If $B$ bootstrap resamples are generated, the $b^{th}$ is denoted by $x_1^{\ast b}, x_2^{\ast b}, \cdots, x_n^{\ast b}$, $b = 1, \cdots, B$.

For each bootstrap sample, we compute the statistic of interest (getting $B$ values, one for each sample). We build up from these an approximate, empirical distribution. We use this to learn about the population quantity of interest, e.g. estimate the population parameter by the average of the $B$ bootstrap quantities.

  • $\theta$ - population parameter of interest
  • $S(x^\ast)$ - estimate of $\theta$ from the bootstrap sample $x^\ast$
    • The average of $S(x^\ast)$ values is an estimate for $\theta$
  • $\hat{\theta} = \frac{1}{B}\sum_{b=1}^B{S(x^{\ast b})}$
    • The book calls it $S(\cdot^\ast)$, which is non-typical; it's usually denoted $\hat{\theta}$ or $\hat{\theta}^\ast$

As $B$ increases, $S(\cdot^\ast)$ tends to the mean computed for the true bootstrap sampling distribution, which is based on all possible configurations. Even for $B \approx 500$ or $1000$, the estimator will be pretty good in general.

The estimator of the true bootstrap standard error

\[se_B(S) = \sqrt{\frac{1}{B-1} \sum_{b=1}^B{[S(x^{\ast b}) - S(\cdot^\ast)]^2}}\]

where $S(x^{\ast b})$ is the statistic computed on the $b^{th}$ bootstrap sample and $S(\cdot^\ast)$ is their average; $se_B(S)$ is simply the sample standard deviation of the $B$ bootstrap values. The easiest way to resample in R is to use the sample() function with replace = TRUE, e.g. sample(x, size = n, replace = TRUE) where x is the data with n observations.
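Putting the pieces together, here's a minimal sketch of computing $S(\cdot^\ast)$ and $se_B(S)$ for the mean; the data vector `x` below is just a simulated placeholder.

```r
set.seed(1)
x <- rexp(20, rate = 4)  # placeholder data; any numeric vector works

B <- 500
stat <- numeric(B)
for (b in seq_len(B)) {
  xb <- sample(x, size = length(x), replace = TRUE)  # bootstrap resample
  stat[b] <- mean(xb)                                # S(x^{*b})
}
theta.hat <- mean(stat)  # S(.^*): average of the B bootstrap values
se.B <- sd(stat)         # se_B(S): sd of the B bootstrap values
```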


Suppose we have a small dataset with $n=20$ observations from some unknown distribution. We consider bootstrap for the mean and the median.

\[\begin{aligned} 0.049584203 && 0.165223813 && 0.070872759 && 0.009547975 && 0.118294157 \\ 0.365906785 && 0.078496145 && 0.102532392 && 0.297899453 && 0.178715619 \\ 0.336178899 && 0.602171107 && 0.024036640 && 0.014285340 && 0.313490262 \\ 0.077453934 && 0.118797809 && 0.155527571 && 0.311395078 && 0.092584865 \end{aligned}\]

In the sample, we see $\bar{x} = 0.174$ and a median of $0.119$, which indicates skewness. A boxplot would also show this. We should be suspicious of a normal assumption, but the sample is small so it's hard to know. Maybe we don't want to use the mean (and its standard error) as our basis for inference, especially in light of the skewness. The median could be a better alternative, but we know little about the sampling distribution of the median (standard error, etc.). Some of these problems can be addressed via the bootstrap.

B <- 500
boot.mean <- vector("numeric", B)
boot.median <- vector("numeric", B)
for (i in 1:B) {
    boot.mean[i] <- mean(sample(x, size = length(x), replace = TRUE))
    boot.median[i] <- median(sample(x, size = length(x), replace = TRUE))
}

The mean and median of the boot.mean ($0.1718535$ and $0.1699468$, respectively) are both pretty similar to the data mean, $0.1741497$. The bootstrap standard deviation sd(boot.mean) $0.03249521$ matches pretty well with the standard error for the mean sd(x) / sqrt(length(x)) $0.03395322$. The bootstrap mean behaves very much like what we would "traditionally" do for a data set like this, i.e. compute the sample mean and the standard error.

What we couldn't do so easily is to look at the median of the sample. We can use the sample median as an estimator for the population median, but we don't have an easy estimator for the variability of the estimator! With bootstrap, it's quite easy to explore.

The mean(boot.median) $0.127123$ and median(boot.median) $0.118546$ are both pretty close to the sample median $0.118546$. However, we can also easily get, for example, sd(boot.median) $= 0.0417244$, or the distribution of the minimum, etc.

The data were actually generated from an exponential distribution with rate $\lambda = 4$. The true mean is $\theta = \lambda^{-1} = 0.25$, and the true median is $\lambda^{-1} \ln{2} \approx 0.173$. We've underestimated both slightly: the sample values of the mean and median happened to be on the small side, and the bootstrap inherits whatever bias the sample carries.

Regression Analysis with Bootstrap

We'll consider simple linear regression, but the same general ideas hold for multiple regression, generalized linear models, etc.

Suppose we have multivariate data: $(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)$ where $x_i$ is the explanatory variable, and $y_i$ is the response variable. In general, there are different types of bootstrap schemes that can be developed. These schemes depend on the types of assumptions you're willing to make about how the data might have been generated.

\[\begin{aligned}\text{Model: } Y_i = \beta_0 + \beta_1X_i + \epsilon_i && i = 1\cdots n\end{aligned}\]

where $\epsilon_i$ are uncorrelated, mean $0$, constant variance $\sigma^2$. In general, the $x_i$ may be a) fixed by design, b) randomly sampled, or c) simply observed. We're going to proceed as if they were fixed by design (it doesn't make that much difference).

The two common approaches for the bootstrap are model-based resampling (or residual resampling) and case-based resampling. Note that there are different versions of these as well.

Model-based (Residual) Resampling

  1. Set the $x^\ast$ as the original $x$ values.
  2. Fit the OLS (ordinary least squares) to the linear model and find residuals (some version of it).
  3. Resample the residuals, call them $\epsilon^\ast$.
  4. Define

\[y_i^\ast = \hat{\beta}_0 + \hat{\beta}_1 x_i + \epsilon_i^\ast\]

where $\hat{\beta}_0$ and $\hat{\beta}_1$ are the OLS estimates from the original $(x_i, y_i)$ pairs; $x_i$ are the original data; $\epsilon_i^\ast$ are the resampled residuals, and $y_i^\ast$ are the new responses.

  5. Fit OLS regression to $(x_1, y_1^\ast), \cdots, (x_n, y_n^\ast) \Rightarrow \hat{\beta}_0^{\ast b}, \hat{\beta}_1^{\ast b}, {S^2}^{\ast b}$ (intercept estimate, slope estimate and variance estimate, respectively), where $b = 1, \cdots, B$.
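The steps above can be sketched in R; the data here are simulated stand-ins, and we track only the slope for brevity.

```r
set.seed(2)
n <- 30
x <- seq_len(n)
y <- 1 + 0.5 * x + rnorm(n)

fit <- lm(y ~ x)          # step 2: OLS fit on the original data
e <- resid(fit)           # raw residuals (one common version)

B <- 200
beta1.star <- numeric(B)
for (b in seq_len(B)) {
  e.star <- sample(e, size = n, replace = TRUE)  # step 3: resample residuals
  y.star <- fitted(fit) + e.star                 # step 4: y* = b0 + b1*x + e*
  beta1.star[b] <- coef(lm(y.star ~ x))[2]       # step 5: refit, keep the slope
}
```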

Case-based Resampling

  1. Resample the data pairs $(x_i, y_i)$ directly, keeping the pairs together, with replacement. In R (or any other language), this can be done by sampling the indices.
  2. Fit the OLS regression model and get $\hat{\beta}_0^{\ast b}, \hat{\beta}_1^{\ast b}, {S^2}^{\ast b}$, $b = 1, \cdots, B$.
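A corresponding sketch for case-based resampling, again with simulated placeholder data; sampling the indices keeps each $(x_i, y_i)$ pair together.

```r
set.seed(3)
n <- 30
x <- runif(n, 0, 10)
y <- 2 + 1.5 * x + rnorm(n)

B <- 200
slope.star <- numeric(B)
for (b in seq_len(B)) {
  idx <- sample(seq_len(n), size = n, replace = TRUE)  # resample row indices
  slope.star[b] <- coef(lm(y[idx] ~ x[idx]))[2]        # pairs stay together
}
```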

Confidence Intervals

In either case, we get an empirical bootstrap distribution of $\hat{\beta}_0^\ast, \hat{\beta}_1^\ast, {S^2}^\ast$ values. We could use these to get confidence intervals for the parameters of interest. This is a widely useful application of the bootstrap in general - CIs when the theoretical derivation is difficult.

We'll have $B$ "versions" of the statistic of interest: $\hat{\theta}_1^\ast, \hat{\theta}_2^\ast, \cdots, \hat{\theta}_B^\ast$. This results in an empirical distribution. There are two simple, rather intuitive approaches to get CIs for $\theta$. One caveat is that it may not work well in practice - you may need to use "corrections" or improvements.

The first procedure is to mimic "normal-type" confidence intervals. Recall that for a sample of size $n$ from a normal distribution with variance $\sigma^2$, the $95\%$ CI for the mean, $\mu$, is given by $\bar{x} \pm 1.96\frac{\sigma}{\sqrt{n}}$. If $\sigma^2$ is unknown (the typical case), we replace $\sigma$ with $S$, the sample standard deviation, and $1.96$ by the appropriate quantile of $t_{n-1}$ ($2.5\%$ in each tail). We also use this when data are not necessarily normal, appealing to the central limit theorem (CLT).

In bootstrap, we can do something similar using bootstrap based estimates instead:

\[\hat{\theta}^\ast \pm 2 \cdot se^\ast(\hat{\theta}^\ast)\]

where $\hat{\theta}^\ast$ is the bootstrap estimate of $\theta$ (mean of bootstrap values), and $se^\ast(\hat{\theta}^\ast)$ is the bootstrap standard error estimate (standard deviation of bootstrap values). This forces the CI to be symmetric about $\hat{\theta}^\ast$.
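For instance, a normal-type bootstrap CI for the median (the data here are a simulated placeholder):

```r
set.seed(4)
x <- rexp(50, rate = 4)

B <- 1000
boot.med <- replicate(B, median(sample(x, length(x), replace = TRUE)))
ci <- mean(boot.med) + c(-2, 2) * sd(boot.med)  # theta* +/- 2 se*
```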

The second approach is to use the empirical bootstrap distribution directly. Suppose we have a histogram of $\hat{\theta}_1^\ast, \hat{\theta}_2^\ast, \cdots, \hat{\theta}_B^\ast$. We take the quantiles that cut off $2.5\%$ of the bootstrap values in each tail as the interval endpoints. This does not enforce symmetry around $\hat{\theta}^\ast$.

Here's an example in R for the second approach:

x <- c(1, 2, 3, 5, 8)
y <- c(3, 7, 9, 7, 12)

res <- vector("numeric", 1000)
for (i in seq(1000)) {
  xb <- sample(x, size = length(x), replace = TRUE)
  yb <- sample(y, size = length(y), replace = TRUE)
  res[i] <- mean(xb) / mean(yb)
}
quantile(res, c(0.025, 0.975))
#      2.5%     97.5% 
# 0.2444444 0.9476842 

Possible Bias in Bootstrap

Again, there are various procedures for dealing with the possible bias in bootstrap estimates, and Jackknife is one of them. We can use this to estimate bias and correct for it.

We have our sample $x_1, x_2, \cdots, x_n$. We're interested in some population characteristic $\theta$. We estimate it based on the entire sample by $\hat{\theta}$. We get $n$ additional estimates of $\theta$ by replacing the original sample by a set of subsamples, leaving out one observation in turn each time. The $i^{th}$ jackknife sample of size $n-1$ is given by:

\[\begin{aligned}x_1, x_2, \cdots, x_{i-1}, x_{i+1}, \cdots, x_n && i=1, \cdots, n\end{aligned}\]

On each subsample, we estimate $\theta$ by $\hat{\theta}_{(i)}$ (taking out the $i^{th}$ observation) $\Rightarrow \hat{\theta}_{(1)}, \hat{\theta}_{(2)}, \cdots, \hat{\theta}_{(n)}$. The mean of jackknife estimates:

\[\hat{\theta}_{(\cdot)} = \frac{1}{n}\sum_{i=1}^n{\hat{\theta}_{(i)}}\]

The Jackknife estimator is given as:

\[\hat{\theta}^J = n\hat{\theta} - (n-1)\hat{\theta}_{(\cdot)}\]

where $n\hat{\theta}$ is based on the whole sample, and $(n-1)\hat{\theta}_{(\cdot)}$ is based on $n$ Jackknife estimates.

The Jackknife estimator of bias is simply defined as:

\[\text{bias}(\hat{\theta}) = (n-1)\left(\hat{\theta}_{(\cdot)} - \hat{\theta}\right)\]
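As a concrete check, apply the jackknife to the plug-in variance $\frac{1}{n}\sum(x_i-\bar{x})^2$, whose bias is exactly $-\sigma^2/n$; the bias-corrected $\hat{\theta}^J$ recovers the usual unbiased sample variance (data simulated for illustration).

```r
set.seed(5)
x <- rnorm(25)
n <- length(x)

plugin.var <- function(v) mean((v - mean(v))^2)  # biased plug-in estimator
theta.hat <- plugin.var(x)
theta.i   <- sapply(seq_len(n), function(i) plugin.var(x[-i]))  # leave one out
theta.dot <- mean(theta.i)

bias.jack <- (n - 1) * (theta.dot - theta.hat)   # jackknife bias estimate
theta.J   <- n * theta.hat - (n - 1) * theta.dot # bias-corrected estimator
# theta.J equals var(x), the unbiased sample variance
```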


  1. There are also purely bootstrap-based ways of dealing with bias, e.g. Double bootstrap (bootstrapping the bootstrap samples), or jackknife after bootstrap, etc.
  2. The bootstrap is based on the "plug-in" principle, and the jackknife is based on the "leave one out" principle.

Parametric Bootstrap

We assume that the data come from some known distribution but with unknown parameters. The idea of the parametric bootstrap is to use the data to estimate the parameters, and then draw resamples from the estimated distribution. For example, we draw samples from $N(\bar{x}, S^2)$ or $Pois(\bar{x})$. This method is not used much in practice because of the strong distribution assumption.
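A quick sketch under a Poisson assumption (simulated data); the parametric bootstrap s.e. of the mean should approximate the theoretical $\sqrt{\bar{x}/n}$.

```r
set.seed(6)
x <- rpois(40, lambda = 3)

B <- 500
# resample from the *fitted* distribution Pois(xbar), not from the data
boot.mean <- replicate(B, mean(rpois(length(x), lambda = mean(x))))
se.param <- sd(boot.mean)  # compare with sqrt(mean(x) / length(x))
```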

Density Estimation

We want to learn about the distribution of the population from which the data were drawn. We want to formally estimate the shape of the distribution, i.e. get a "reliable" visual representation such as a histogram.


Some terminology: the subintervals of the histogram are called bins, and the width of each interval is called the binwidth. A small binwidth leads to more bins and shows local details, which may or may not be meaningful. A large binwidth shows a smoother, large-scale picture, but we may lose interesting information - a tradeoff between competing goals. The histogram has some drawbacks:

  1. Histograms are not smooth even if the underlying distribution is continuous - binning discretizes the result.
  2. Histograms are sensitive to the choice of the class interval.
  3. (related) Histograms depend on the choice of endpoints of the intervals. Both 2 and 3 are about the visual appearance.

Kernel Density Estimation

A smoother approach, which gets around some of the drawbacks of the histogram, is kernel density estimation. It gives a continuous estimate of the distribution and removes the dependence on endpoints (drawback 3), though the choice of bandwidth raises an issue analogous to the choice of binwidth (drawback 2).


For any $x$ in a local neighborhood of each data value, $x_i$, fitting is controlled in a way that depends on the distance of $x$ from $x_i$. Close-by points are weighted more. As the distance increases, the weight decreases.

The weights are determined by a function called the kernel, which has an associated bandwidth. The kernel, $K(u)$, determines how weights are calculated, and the bandwidth, $\Delta$ or $h$, determines scaling, or how near/far points are considered "close" enough to matter.

Let $\hat{f}(x)$ denote the kernel density estimator of $f(x)$, the PDF of the underlying distribution. $\hat{f}(x)$ is defined as

\[\hat{f}(x) = \frac{1}{n\Delta} \sum\limits_{i=1}^n {K\left(\frac{x-x_i}{\Delta}\right)}\]

where $n$ is the sample size, $K$ is the kernel, and $x_i$ is the observed data.
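The formula translates directly into R. A sketch of $\hat{f}$ at a single point, using a Gaussian kernel; it should agree closely with R's built-in density() at the same bandwidth.

```r
set.seed(7)
x <- rnorm(100)

# f-hat at point x0: average of scaled kernel evaluations over the data
fhat <- function(x0, data, delta) mean(dnorm((x0 - data) / delta)) / delta

d <- density(x, bw = 0.4, kernel = "gaussian")
# fhat(0, x, 0.4) is close to approx(d$x, d$y, xout = 0)$y
```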

To be called a kernel, $K(\cdot)$ has to satisfy certain properties: $K(\cdot)$ is a smooth function, such that

\[\begin{aligned} K(x) &\geq 0 && \int{xK(x)dx = 0} \\ \int{K(x)dx} &= 1 && \int{x^2K(x)dx > 0} \end{aligned}\]

The first two constraints make it a density, and the second set of constraints ensures it has mean $0$ and has a variance.

Commonly Used Kernels

Here we give four commonly used kernels[1]. The boxcar kernel

\[K(x) = \frac{1}{2}\mathbf{I}(x)\]

The Gaussian kernel

\[K(x) = \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}\]

The Epanechnikov kernel

\[K(x) = \frac{3}{4}(1-x^2)\mathbf{I}(x)\]

The tricube kernel is narrower than the previous ones

\[K(x) = \frac{70}{81}(1 - |x|^3)^3\mathbf{I}(x)\]

\[\mathbf{I}(x) = \begin{cases} 1 & |x| \leq 1 \\ 0 & |x| > 1 \end{cases}\]
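These four kernels can be written directly in R; each integrates to 1 over its support, as required by the kernel conditions above.

```r
I        <- function(x) as.numeric(abs(x) <= 1)          # indicator |x| <= 1
boxcar   <- function(x) 0.5 * I(x)
gaussian <- function(x) exp(-x^2 / 2) / sqrt(2 * pi)
epan     <- function(x) 0.75 * (1 - x^2) * I(x)
tricube  <- function(x) (70 / 81) * (1 - abs(x)^3)^3 * I(x)
```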

Choosing Delta

In practice, the choice of the kernel isn't that important, but the choice of $\Delta$ is crucial: it controls the smoothness of the kernel density estimator. When $\Delta$ is small, $\hat{f}(x)$ is more "wiggly" and shows local features. When $\Delta$ is large, $\hat{f}(x)$ is smoother - analogous to the binwidth in histograms.

Many criteria and methods have been proposed to choose the value for $\Delta$:

  1. Define a criterion for a "successful" smooth. Try a range of $\Delta$ values and see which is the best according to that criterion.
  2. Use cross-validation to pick $\Delta$. Split the data set into $k$ pieces, and try to fit on each piece. Use this to get predictions, and pick the $\Delta$ that does well over the cross-validation.
  3. A general rule of thumb takes

\[\Delta = \frac{1.06\hat{\sigma}}{n^{\frac{1}{5}}} \text{ where } \hat{\sigma} = \min\{S, \frac{IQR}{1.34}\}\]

where $n$ is the sample size, $S$ is the sample standard deviation, and $IQR$ is $Q_3 - Q_1$.
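This rule of thumb is essentially what R's bw.nrd() computes (same $1.06$ constant); a sketch on simulated data:

```r
set.seed(8)
x <- rnorm(200)

sigma.hat <- min(sd(x), IQR(x) / 1.34)     # min of S and IQR/1.34
delta <- 1.06 * sigma.hat / length(x)^(1/5)
# delta matches stats::bw.nrd(x)
```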

According to Wasserman in his book "All of Nonparametric Statistics", the main challenge in smoothing is picking the right amount. When we over-smooth ($\Delta$ too large), the result is biased, but the variance is small. When we under-smooth ($\Delta$ too small), the bias is small but the variance is large. This is the bias-variance tradeoff.

We want to actually minimize risk, which is $\text{bias}^2 + \text{variance}$, so that we get a balance between the two aspects.

Modern Nonparametric Regression

[Figure: over-smoothing and under-smoothing. The red line is severe oversmoothing, and the blue line is severe undersmoothing.]

Here[2], under-smoothing is interpolation, or connecting the dots. It gives a perfect fit to the data but is useless for anything else. Over-smoothing is ignoring $x$ altogether and fitting $\bar{y}$ as the predicted value everywhere - useless for inference. The question is: can we use the data to learn about the relationship between $x$ and $y$?

We'll focus on a single explanatory variable. Data: $(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)$. A simple linear model

\[\begin{aligned}y_i = \beta_0 + \beta_1x_i + \epsilon_i && i=1\cdots n\end{aligned}\]

relates $x$ and $y$ in a linear fashion. We make (in standard parametric approaches) certain assumptions about $\epsilon_i$: independence, $\epsilon_i \sim N(0, \sigma^2)$.

We don't want to make the normality (or any other distributional) assumption on $\epsilon_i$. We also don't want to restrict ourselves to a linear fit, which won't always be appropriate. The more general approach

\[\begin{aligned} y_i = g(x_i) + \epsilon_i && i = 1 \cdots n \end{aligned}\]


assuming $\text{median}(\epsilon_i)=0$. $g(x_i)$ is unspecified, and we want to learn this from the data. We're being model-free and distribution-free.

A popular method, due to Cleveland (1979)[3], is called "locally weighted scatterplot smoothing" - lowess or loess. The basic idea is to fit a linear (or quadratic) polynomial at each data point using weighted least squares (parameter estimation with different data points weighted differently), with weights that decrease as we move away from the $x_i$ at which we are fitting. Points more than a specified distance away are assigned zero weight. The weight function is maximal at the $x_i$ value at which the fit is being made, and is symmetric about that point.

Things to consider:

  • $d$: degree of polynomial to be fitted at each point (usually 1 or 2)
  • how many neighboring points to include (size of neighborhood)
  • $w(u)$: weight function

With these in hand, a localized (in the neighborhood), weighted least squares procedure is used to get an estimate of $g(\cdot)$ at $x_i$, call it $\hat{y}_i$. The window slides across each $x_i$, and fits a (weighted) linear ($d=1$) or quadratic ($d=2$) in each window[4].

[Figure: lowess example. Blue points are within the neighborhood of $x_i=-1$; an OLS regression is fit on these points.]

We're looking for the general pattern, not (necessarily) local structure. This guides us in the choice of bandwidth and $d$ (less critical). Often, within a small neighborhood, functions will be roughly linear (or quadratic), so usually $d=1$ is good enough. Note that if the function is very unsmooth (lots of quick ups and downs), this approach won't work well.

Choice of neighborhood size (bandwidth) involves a bit of trial and error, as it depends on $n$. The R default is $\frac{2}{3}$ of the data included in each sliding window, which will often oversmooth. For the weighting kernel (not as crucial), a popular choice is

\[w(u) = \begin{cases} (1 - |u|^3)^3 && |u| \leq 1 \\ 0 && |u| > 1 \end{cases}\]

This gives weights for weighted least squares. For points within the neighborhood, if the farthest points to be included in the "local" neighborhood are at a distance $\Delta$ from $x_i$, then $x_j$ (in the neighborhood) gets weight

\[w_j = \left(1 - \left|\frac{x_j - x_i}{\Delta} \right|^3\right)^3\]

This is done at each data point, so we get $n$ total fits. Another possibility is to robustify the procedure by down-weighting possible outliers. In R, the lowess() function takes the following default values:

  • Window of $\frac{2}{3}$ of data controlled by argument f
  • 3 iterations of robustification controlled by argument iter

There's also a formula-based loess() function which is pretty similar, but has some different defaults.
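A minimal usage sketch on simulated data; shrinking f makes the fit follow more local structure.

```r
set.seed(9)
x <- seq(-3, 3, by = 0.05)
y <- sin(x) + rnorm(length(x), sd = 0.2)

fit.default <- lowess(x, y)           # f = 2/3, iter = 3 (defaults)
fit.tighter <- lowess(x, y, f = 0.2)  # narrower window, wigglier fit
```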

Penalty Function

Recall our original problem: we had a generic model of

\[y_i = g(x_i) + \epsilon_i\]

Suppose, in analogy to simple linear regression, we estimate $g(\cdot)$ by choosing $\hat{g}(x)$ to minimize the sum of squares

\[\sum_{i=1}^n {(y_i - \hat{g}(x_i))^2}\]

In the linear case, $\hat{g}(x_i)$ would be $\hat{\beta}_0 + \hat{\beta}_1 x_i$. Minimizing the sum of squares over all linear functions yields the usual least squares estimator, which is possibly an oversmooth. Minimizing over all functions yields the "connect the dots" estimator of exact interpolation, which is definitely an undersmooth.

Lowess uses the local weighting to balance between these two extremes. A different approach to getting this balance is to minimize a penalized sum of squares:

\[U(\lambda) = \sum_{i=1}^n {(y_i - g(x_i))^2} + \lambda J(g) \qquad (\ast)\]

where $\lambda$ is the tuning parameter, and $J(g)$ is the penalty function for roughness. A common choice for $J(g)$ is

\[J(g) = \int [g''(x)]^2dx\]

which is the integral of the squared second derivative of $g$. $\lambda$ controls the amount of smoothing - the larger it is, the more smoothing is enforced.

Cubic Spline

We want the "optimal" solution to the problem of minimizing the penalized sum of squares $U(\lambda)$. Let $\eta_1 < \eta_2 < \cdots < \eta_k$ be a set of ordered points, or knots, contained in some interval $(a, b)$. A cubic spline is defined as a continuous function $r$ such that

  1. $r$ is a cubic polynomial over each $(\eta_1, \eta_2), (\eta_2, \eta_3), \cdots$
  2. $r$ has continuous first and second derivatives at the knots, which ensures smoothness of the curves at the knots.

It turns out that cubic splines are a solution to this problem. In general, an $M^{th}$-order spline is a piecewise polynomial of degree $M-1$ with $M-2$ continuous derivatives at the knots (which ensures the function looks smooth at the knots).

A spline that is linear beyond the boundary knots ($\eta_1$ and $\eta_k$) is called a natural spline.

There are no discontinuities (or jumps) at the knots - the pieces connect smoothly - and edge effects beyond $\eta_1$ and $\eta_k$ are handled through the linear fit.

Cubic splines ($M=4$) are the most commonly used and solve our penalized problem: the function $\hat{g}(x)$ that minimizes $U(\lambda)$ with penalty $\int[g''(x)]^2dx$ is a natural cubic spline with knots at the data points. This estimator $\hat{g}(x)$ is called a smoothing spline.

What does $\hat{g}(x)$ look like? We can construct a basis for a set of splines:

\[\begin{aligned} &h_1(x) = 1 && h_2(x) = x \\ &h_3(x) = x^2 && h_4(x) = x^3 \\ &h_j(x) = (x - \eta_{j-4})^3_+ && \text{ for } j = 5, \cdots, k+4 \end{aligned}\]

where $(x - \eta_{j-4})^3_+ = 0$ if $(x - \eta_{j-4})^3 < 0$. The set $\{h_1, h_2, \cdots, h_{k+4}\}$ is a basis for the set of cubic splines at these knots, called the truncated power basis. This means any cubic spline $g(x)$ with these same knots can be written as

\[g(x) = \sum_{j=1}^{k+4} {\beta_j h_j(x)}\]

where $h_j(x)$ is known, and we just need to estimate $\beta_j$ e.g. by least squares. There are also other sets of basis functions.
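A sketch of building the truncated power basis by hand and estimating the $\beta_j$ with lm(); the knots and data here are illustrative choices, not prescribed by the text.

```r
set.seed(10)
x <- seq(0, 1, length.out = 50)
y <- sin(2 * pi * x) + rnorm(50, sd = 0.1)

knots <- c(0.25, 0.5, 0.75)                 # k = 3 interior knots
pos <- function(u) pmax(u, 0)               # the (.)_+ truncation
H <- cbind(1, x, x^2, x^3,                  # h_1, ..., h_4
           sapply(knots, function(k) pos(x - k)^3))  # h_5, ..., h_{k+4}
fit <- lm(y ~ H - 1)                        # beta_j by least squares
```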


  1. $\lambda$, the tuning parameter, controls the roughness of the fit, so it effectively shrinks the coefficients of the basis functions.
  2. We need to be aware of and account for edge effects. The fit isn't going to look as good at the ends.
  3. $\lambda$ controls the bias-variance tradeoff in the spline setting.
    • $\lambda = 0 \Rightarrow$ no effect of the penalty term $\Rightarrow$ exact interpolation $\Rightarrow$ low bias, high variance
    • $\lambda \rightarrow \infty \Rightarrow$ perfectly smooth (linear) fit $\Rightarrow$ a line that passes as close to the data as possible $\Rightarrow$ high bias, low variance
    • Intermediate values of $\lambda$ balance the two, and we get a reasonably smooth function that passes near the data.
  4. In R, there's a library splines (and smoothing splines live in the base stats package).
    • smooth.spline() (in stats) places knots at the data points and fits a smoothing spline.
    • ns() is for general natural splines, and we need to specify the number and locations of knots (not an easy question to answer!).
    • bs() generates a basis function set for splines which we can then use for other things. This also requires specifying the number and locations of knots.
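A minimal smooth.spline() usage sketch on simulated data; by default it chooses the amount of smoothing by (generalized) cross-validation.

```r
set.seed(11)
x <- seq(0, 1, length.out = 100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.2)

fit <- smooth.spline(x, y)   # lambda picked by GCV by default
yhat <- predict(fit, x)$y    # fitted smooth at the original x
```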

  1. For graphical representations, see this Wikipedia page. ↩︎

  2. R code for generating the figure:

    library(ggplot2)
    library(tibble)
    df <- tibble(
      x = seq(-3, 3, by = 0.1),
      y = dnorm(x, 0, 1) + rnorm(length(x), 0, 0.1)
    )
    over.smoothing <- mean(df$y)
    ggplot(df, aes(x, y)) +
      geom_point() +
      geom_line(color = "blue") +
      geom_abline(slope = 0, intercept = over.smoothing, color = "red")
  3. Cleveland, William S. "Robust locally weighted regression and smoothing scatterplots." Journal of the American statistical association 74.368 (1979): 829-836. ↩︎

  4. R code for plotting:

    library(ggplot2)
    library(tibble)
    df <- tibble(
      x = seq(-3, 3, by = 0.1),
      y = dnorm(x, 0, 1) + rnorm(length(x), 0, 0.1),
      point.type = ifelse(x > -1.5 & x < -0.5, "Neighborhood", "Distant")
    )
    ggplot(df, aes(x, y)) +
      geom_point(aes(color = point.type)) +
      geom_smooth(data = subset(df, x > -1.5 & x < -0.5), method = "lm")