These are the notes I took for a Master's course in Nonparametric Statistics. The recommended textbook is Sprent, P. and Smeeton, N.C. (2007) Applied Nonparametric Statistical Methods, Fourth Edition.
Outline of the Course
We will start with an overview of some fundamentals of nonparametric statistics.
Then we will consider in turn methods for a single sample (location inference and others), for two samples (paired and independent), and for multiple samples. This will be followed by a discussion of correlation and concordance, as well as association and other related methods for categorical data. Finally, we will look at a variety of more "modern" nonparametric methods, such as the bootstrap, kernel density estimation and regression.
Some Basic Concepts
We want to move away from "standard" or "typical" approaches to statistical inference, where we assume that our data are drawn from some distributional family, e.g. the standard setup in which
\[X_1, X_2, \ldots, X_n \sim N(\mu, \sigma^2)\]
where $N(\mu, \sigma^2)$ is the Normal distributional family. Similarly, we could have $Pois(\lambda)$ for a Poisson distribution. In these cases, we're making assumptions about the underlying distribution. These assumptions may or may not be realistic or valid; in any case, they are restrictive.
Nonparametric (sometimes called "distribution-free") statistical methods aim to relax these assumptions about distributional forms. They are more general and more robust (they perform well in a wider range of applications), but we often, though not always, sacrifice power if the data truly come from a particular family, such as the Normal, for which optimal tests exist.
The term "nonparametric method" is also used in a variety of ways, which we want to examine:
- Classical approaches, e.g. based on
- Computational approaches, e.g.
- Modern regression (and other) approaches, e.g.
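The robustness/power trade-off described above can be illustrated with a small simulation. The following is only a sketch under assumptions that are not in the original notes: the sample size, the shift, the number of replications, the choice of a $t$ distribution with 2 degrees of freedom as the "heavy-tailed" case, and the helper name `estimate_power` are all made up for illustration. It compares how often the two-sample $t$-test and the Wilcoxon rank-sum test reject when there is a true location shift.

```r
# Sketch: rejection rates of the t-test and the Wilcoxon rank-sum test
# under a location shift, for Normal and heavy-tailed (t, 2 df) data.
set.seed(1)  # arbitrary seed, for reproducibility only

# `estimate_power` is a hypothetical helper (not from the notes).
# `rdist` is any function that draws n observations from the base distribution.
estimate_power <- function(rdist, n = 30, delta = 1, reps = 1000, alpha = 0.05) {
  rejections <- replicate(reps, {
    x <- rdist(n)
    y <- rdist(n) + delta  # same distribution, shifted by delta
    c(
      t_test   = t.test(x, y)$p.value < alpha,
      wilcoxon = wilcox.test(x, y)$p.value < alpha
    )
  })
  rowMeans(rejections)  # proportion of rejections for each test
}

power_normal <- estimate_power(function(n) rnorm(n))
power_heavy  <- estimate_power(function(n) rt(n, df = 2))

print(rbind(normal = power_normal, heavy_tailed = power_heavy))
```

Under Normal data one would typically expect the two tests to have similar rejection rates, with the $t$-test slightly ahead, while under heavy tails the rank-based test usually holds up better. The exact numbers depend on the settings chosen above.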
If we don't assume a distributional family, how can we proceed to do inference? What sorts of inferential questions can we ask and answer?
We do still need to make some assumptions (of course), but they can be weaker than what we're used to.
- Instead of normality, which is a strong assumption, we might assume that the true data distribution is merely symmetric.
- For comparing two samples, rather than assuming that both come from normally-distributed populations with possibly different means, we might assume that their distributions are the same (without specifying what it is) but with a shift in location:

    library(tidyverse)

    # Draw N values from a three-component Normal mixture
    N <- 1e+6
    components <- sample(1:3, size = N, replace = TRUE, prob = c(0.3, 0.5, 0.2))
    mus <- c(0, 10, 3)
    sds <- sqrt(c(0.2, 1, 3))
    samples <- rnorm(n = N, mean = mus[components], sd = sds[components])

    # Same distribution, shifted in location by 2
    tibble(weird_1 = samples, weird_2 = samples + 2) %>%
      gather(key = "dist", value = "value") %>%
      ggpubr::ggdensity(
        x = "value", color = "dist", fill = "dist",
        palette = "npg", alpha = 0.25
      )

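In symbols, the shift assumption says that if the first sample has distribution function $F$, then the second has distribution function $G(y) = F(y - \Delta)$ for some unknown shift $\Delta$, with $F$ left unspecified (the symbols $F$, $G$ and $\Delta$ are notation introduced here, not taken from the notes). As a minimal sketch of inference under this assumption, the code below draws two independent samples from the same mixture as in the example above, shifts one by 2, and uses `wilcox.test` with `conf.int = TRUE` to estimate the shift. The helper `r_mixture`, the sample size of 200 and the seed are assumptions made for illustration.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

# Draw n observations from the three-component Normal mixture used above.
# `r_mixture` is a hypothetical helper, not part of the original notes.
r_mixture <- function(n) {
  components <- sample(1:3, size = n, replace = TRUE, prob = c(0.3, 0.5, 0.2))
  mus <- c(0, 10, 3)
  sds <- sqrt(c(0.2, 1, 3))
  rnorm(n, mean = mus[components], sd = sds[components])
}

shift <- 2
x <- r_mixture(200)
y <- r_mixture(200) + shift  # same shape, shifted location

# Rank-based inference on the shift; no distributional family is assumed.
# With conf.int = TRUE, wilcox.test also returns an estimate of the
# difference in location between its first and second argument (here y - x).
fit <- wilcox.test(y, x, conf.int = TRUE)
print(fit$estimate)
print(fit$conf.int)
```

The estimate should land in the neighbourhood of the true shift of 2, even though the underlying distribution is far from Normal and was never specified to the test.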
First, we'll need to discuss some of the basic tools in nonparametric statistics.