# Correlation and Concordance

We’ve been focusing on location inference for quite a while. There are of course other kinds of inference, and what we often want are measures that summarize the strength of the relationship between variables, i.e. the strength of association or dependence.

Recall that we have the “classical” Pearson correlation coefficient between two random variables $X$ and $Y$. It’s a measure of linear association. Inference for $\rho$ (population parameter) based on $r$ (sample value) has an assumption of bivariate normality, i.e. $X$ and $Y$ are jointly normally distributed.

Can we be more general and relax the normality assumption? What about variables/measures that are not continuous (e.g. counts) and therefore can’t be normal? Monotonicity asks whether the two variables tend to increase together, or whether $Y$ tends to decrease as $X$ increases.

In the parametric/bivariate normal/linearity context:

$$\begin{aligned} +1 &= \text{perfect positive linear} \\ -1 &= \text{perfect negative linear} \end{aligned}$$

In the nonparametric/monotonicity setting, we’d like:

$$\begin{aligned} +1 &= \text{perfect increasing monotone} \\ -1 &= \text{perfect decreasing monotone} \end{aligned}$$

## Correlation in bivariate data

The key idea (again) is ranks, which requires a notion of ordering. Exact tests are based on permutation-type simulation. A simple scheme would be two paired samples (measurements) with $n$ observations on each. If we fix the order of one of the variables:

| V1 (ranks) | V2 |
| --- | --- |
| 1 | ? |
| 2 | ? |
| $\vdots$ | ? |
| $n$ | ? |

and look at all possible orderings of the ranks for the second variable, of which there are $n!$. Compute the measure of correlation for each ordering to build the empirical distribution.
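This enumeration can be sketched in a few lines. The sketch below (function names are my own, not from any particular library) fixes the ranks of the first variable, walks through all $n!$ orderings of the second variable's ranks, and computes a correlation measure for each, giving the exact permutation null distribution:

```python
import itertools
import numpy as np

def pearson(a, b):
    """Pearson correlation of two vectors (here applied to ranks)."""
    a, b = a - a.mean(), b - b.mean()
    return (a @ b) / np.sqrt((a @ a) * (b @ b))

def permutation_null(n, stat):
    """Empirical null distribution of a correlation measure
    over all n! orderings of the second variable's ranks."""
    r = np.arange(1, n + 1)            # ranks of variable 1, held fixed
    return np.array([stat(r, np.array(perm))
                     for perm in itertools.permutations(r)])

null = permutation_null(4, pearson)    # 4! = 24 orderings
```

With $n = 4$ there are only $24$ orderings; for larger $n$, enumerating $n!$ permutations quickly becomes infeasible, and one samples permutations at random instead.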

### Spearman rank correlation coefficient

A popular measure is the Spearman rank correlation coefficient. It’s essentially Pearson's correlation calculated on the ranks instead of the raw data. Some notation:

- $\rho_s$ - population value
- $r_s$ - sample value
- $(x_i, y_i)$ - paired observations, $i = 1, \ldots, n$
- $r_i$ - ranks assigned to the $x$ values, $i = 1, \ldots, n$
- $s_i$ - ranks assigned to the $y$ values, $i = 1, \ldots, n$

$$r_s = \frac{\sum\limits_{i=1}^n{(r_i - \bar{r})(s_i - \bar{s})}}{\sqrt{\sum\limits_{i=1}^n{(r_i - \bar{r})^2} \sum\limits_{i=1}^n {(s_i - \bar{s})^2}}}$$

If there are no ties,

$$r_s = 1 - \frac{6T}{n(n^2-1)}, \quad \text{where } T = \sum\limits_{i=1}^n(r_i - s_i)^2$$

If $r_i = s_i$ for all $i$, i.e. ranks on $x$ are equal to the ranks on $y$, $T = 0$ and $r_s = 1$. If ranks are perfectly reversed:

| $x$ | 1 | 2 | $\cdots$ | $n$ |
| --- | --- | --- | --- | --- |
| $y$ | $n$ | $n-1$ | $\cdots$ | 1 |

We have a perfect monotonically decreasing trend. Here $r_i + s_i = n + 1$ for all $i$. We can show that in this case $r_s = -1$. Intermediate cases (not perfect monotone decreasing/increasing) $\Rightarrow r_s$ is somewhere between $-1$ and $+1$.

### Kendall rank correlation coefficient

The other widely used measure is Kendall's tau ($\tau$), which is built on the Mann-Whitney formulation (the same counting idea that underlies the J-T test for ordered alternatives). It is often used as a measure of agreement between judges: how well do two judges agree on their rankings?

We first order the values of the first variable to get $r_i = i$ for all $i$. If there’s a positive rank association, ranks for the second variable, $s_i$, should also show an increasing trend; if there’s a negative rank association, $s_i$ should show a decreasing trend.

The order of the $1^{st}$ variable is fixed, so $r_i = i$. For the $s_i$, count concordances $n_c$ and discordances $n_d$, which are pairs that follow the ordering and pairs that reverse it, respectively. That is, for $i = 1 \cdots n-1$ and $j > i$, count a concordance ($+1$) if $s_j - s_i > 0$ and a discordance ($-1$) if $s_j - s_i < 0$. Our test statistic is

$$t_k = \frac{n_c - n_d}{n(n-1)/2}$$
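The pairwise count translates directly into code. A small sketch (the helper name is mine), assuming the first variable is already sorted so $r_i = i$ and there are no ties:

```python
def kendall_tau(s):
    """Kendall's t_k from the ranks s of the second variable,
    with the first variable sorted (r_i = i) and no ties."""
    n = len(s)
    n_c = n_d = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            if s[j] > s[i]:
                n_c += 1       # concordant pair: follows the ordering
            elif s[j] < s[i]:
                n_d += 1       # discordant pair: reverses the ordering
    return (n_c - n_d) / (n * (n - 1) / 2)
```

For example, $s = (1, 3, 2, 4)$ has $n_c = 5$ and $n_d = 1$ (only the pair $(3, 2)$ is discordant), so $t_k = 4/6 = 2/3$.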

Example: We have scores on two exam questions for 12 students:
