Categorical Data
In this chapter our focus is mostly on “count data” - the data are the numbers of units with particular attributes. The problems appear different at first sight, yet many are solved using procedures developed previously.
Attributes are characteristics that the units can have, and are typically qualitative or categorical. Data are arranged in a contingency table. The table has a dimension (row, column, layer, …) for each attribute. We consider multiple attributes simultaneously. Attributes can be
- Nominal or ordinal
  - Nominal - categories are just names, with no ordering
  - Ordinal - categories can be ordered
- Explanatory or response variables
The general setup is an $r \times c$ contingency table. $n_{ij}$ is the count in cell $(i, j)$. The total for row $i$ is $n_{i+}$ (also denoted $n_{i\cdot}$), the total for column $j$ is $n_{+j}$ (also denoted $n_{\cdot j}$), and the grand total is $n_{++}$, also denoted $n$ or $N$.

| | Column $1$ | $\cdots$ | Column $c$ | Total |
|---|---|---|---|---|
| Row $1$ | $n_{11}$ | $\cdots$ | $n_{1c}$ | $n_{1+}$ |
| $\vdots$ | $\vdots$ | | $\vdots$ | $\vdots$ |
| Row $r$ | $n_{r1}$ | $\cdots$ | $n_{rc}$ | $n_{r+}$ |
| Total | $n_{+1}$ | $\cdots$ | $n_{+c}$ | $n$ |
Inference is conditional: we treat the row and column marginals as if they were fixed, i.e. we condition on the observed row/column totals. Note that sometimes one (or both) of the marginals will be fixed by the study design, but not always.
Two by two table
We start with the row variable as the explanatory variable and the column variable as the response variable. The row variable will often have marginal totals that are fixed by design. Suppose the response variable has two levels, “Success” and “Failure”. We have
| | Response 1 (S) | Response 2 (F) | Total |
|---|---|---|---|
| Treated | $n_{11}$ | $n_{12}$ | $n_{1+}$ |
| Non-Treated | $n_{21}$ | $n_{22}$ | $n_{2+}$ |
| Total | $n_{+1}$ | $n_{+2}$ | $n$ |

where $n_{11}$ is the number of treated units with a success, and so on, and the row totals $n_{1+}$ and $n_{2+}$ are fixed by design.
We saw before that this design implies two independent Binomials:
- $n_{11} \sim \mathrm{Bin}(n_{1+}, p_1)$ for the Treated row
- $n_{21} \sim \mathrm{Bin}(n_{2+}, p_2)$ for the Non-Treated row

The null hypothesis of no treatment effect is $H_0: p_1 = p_2$. We haven’t said anything (yet) about the common probability of success under $H_0$.
In our example, we observe the counts $n_{11}$ and $n_{21}$ in the success column. Also, in total we have $n_{+1} = n_{11} + n_{21}$ successes. Using properties of conditional, independent, etc. probabilities, we can show the probability of observing $n_{11} = k$ given the marginal totals. Under $H_0$ this conditional distribution is the hypergeometric distribution:

$$P(n_{11} = k \mid n_{1+},\, n_{2+},\, n_{+1}) = \frac{\binom{n_{1+}}{k}\binom{n_{2+}}{n_{+1}-k}}{\binom{n}{n_{+1}}},$$

which does not depend on the unknown common success probability.
What might the cell counts be “expected” to look like under $H_0$? Under the hypergeometric distribution, $E(n_{11} \mid \text{margins}) = n_{1+} n_{+1} / n$; in general, the expected count for cell $(i, j)$ is $E_{ij} = n_{i+} n_{+j} / n$.
Odds ratio

Another useful construct is called the odds ratio, which is used a lot in medical settings. It can also be used as a measure of association between the treatment and response in a $2 \times 2$ table. The odds of success are $p_1/(1 - p_1)$ in the Treated row and $p_2/(1 - p_2)$ in the Non-Treated row, and the odds ratio is

$$\theta = \frac{p_1/(1 - p_1)}{p_2/(1 - p_2)}, \qquad \hat{\theta} = \frac{n_{11}\, n_{22}}{n_{12}\, n_{21}}.$$

Under $H_0: p_1 = p_2$, the odds ratio is $\theta = 1$; values far from 1 in either direction indicate association.
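As a quick illustration, the sample odds ratio can be computed directly from a $2 \times 2$ table in R; the counts below are hypothetical, not from the notes:

```r
# Hypothetical 2x2 table of counts
tab <- matrix(c(20, 10,
                 8, 22), nrow = 2, byrow = TRUE,
              dimnames = list(c("Treated", "Non-Treated"), c("S", "F")))
# sample odds ratio: (n11 * n22) / (n12 * n21)
or_hat <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])
or_hat        # far from 1 suggests association
log(or_hat)   # log odds ratio; 0 corresponds to no association
```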
Two response variables
A second possibility for how the table could have arisen: both attributes are some sort of response, e.g. both attributes could be side effects experienced when patients are given a treatment.
| | Level 1 | Level 2 | Total |
|---|---|---|---|
| Level 1 | $n_{11}$ | $n_{12}$ | $n_{1+}$ |
| Level 2 | $n_{21}$ | $n_{22}$ | $n_{2+}$ |
| Total | $n_{+1}$ | $n_{+2}$ | $n$ |
Only the overall sample size $n$ is fixed by design.

Now each individual has four possible outcomes, corresponding to the different patterns of response on the two attributes. This gives a multinomial model (multiple outcomes) with $n$ trials and cell probabilities $p_{ij}$, $i, j = 1, 2$.

Under independence, $p_{ij} = p_{i+}\, p_{+j}$, the product of the row and column marginal probabilities.

By properties of the multinomial, the expected count in cell $(i, j)$ is $n\, p_{ij}$, which under independence is estimated by $E_{ij} = n_{i+} n_{+j} / n$.
This is the same as in the first model, even though our assumptions about how the table came about were different. Also the odds ratio estimation will be the same.
Fixing nothing
A third possibility is that we collect data for a fixed period of time (or until some external stopping point), so that not even the overall sample size $n$ is fixed in advance.

The model here is independent Poissons, one for each of the four cells, with mean $\mu_{ij}$ for cell $(i, j)$. Conditioning on the observed total $n$ recovers the multinomial model above, so the conditional inference, expected counts, and odds ratio estimation are once again the same.
Note that this all extends pretty easily to a general $r \times c$ table with more than two rows and/or columns.
Unified framework for general tables
The unified framework for a general $r \times c$ table is the same as in the $2 \times 2$ case: however the table arose (margins fixed by design, only the total fixed, or nothing fixed), we base inference on the conditional distribution given the row and column totals, and the expected count in cell $(i, j)$ under no association is $E_{ij} = n_{i+} n_{+j} / n$.
Nominal attributes
Three approaches are commonly used when row and column categories are both nominal: Fisher's exact test, the Pearson chi-squared test, and the likelihood-ratio test. All are tests for a null of no association (or independence) between the two attributes. Under that null, all three tests have the same large-sample (asymptotic) reference distribution: $\chi^2$ with $(r - 1)(c - 1)$ degrees of freedom.
Importantly, the values of the three test statistics will differ on the same sample. So will their exact distributions. However, unless the choice of the significance level cutoff is critical, the three seldom lead to different conclusions.
Fisher’s exact test
We saw before that the R function `fisher.test()` does the job. Input either
- `x` - a two-dimensional contingency table (a matrix), or
- `x` and `y` - two vectors (observations for each attribute), from which the table can be built.
The function computes exact p-values. For larger tables the exact computation can become expensive, in which case an approximate p-value can be obtained by Monte Carlo simulation with `simulate.p.value = T`.

In practice, Fisher's exact test is often used when asymptotic theory is inappropriate (usually when sample sizes are small), or in the case of sparse tables with very small cell counts.
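A minimal sketch of the calls, using hypothetical tables (not data from the notes); `simulate.p.value` only matters for tables larger than $2 \times 2$:

```r
# Hypothetical 2x2 table: exact test
tab <- matrix(c(20, 10,
                 8, 22), nrow = 2, byrow = TRUE)
fisher.test(tab)   # exact p-value plus a (conditional MLE) odds-ratio estimate

# Hypothetical 3x3 table: Monte Carlo p-value when exact enumeration is costly
bigtab <- matrix(c(12, 7,  4,
                    5, 9, 11,
                    3, 6, 13), nrow = 3, byrow = TRUE)
fisher.test(bigtab, simulate.p.value = TRUE, B = 10000)
```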
Pearson’s chi-squared test
An alternative statistic for testing independence of row and column categories is the Pearson chi-squared statistic:

$$X^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(n_{ij} - E_{ij})^2}{E_{ij}},$$

where $E_{ij} = n_{i+} n_{+j} / n$ is the expected count in cell $(i, j)$ under independence. Under the null, $X^2$ is approximately $\chi^2$ with $(r - 1)(c - 1)$ degrees of freedom.

A caveat is that if there are cells with small expected counts (a common rule of thumb is $E_{ij} < 5$), the $\chi^2$ approximation can be poor, and categories may need to be pooled or an exact test used instead.
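A brief sketch with the same hypothetical table as above; `chisq.test()` on a matrix performs the test of independence, and its return value exposes the expected counts:

```r
tab <- matrix(c(20, 10,
                 8, 22), nrow = 2, byrow = TRUE)
res <- chisq.test(tab, correct = FALSE)  # no continuity correction, matching the formula above
res$expected    # E_ij = (row total * column total) / n
res$statistic   # Pearson X^2
res$p.value     # from chi-squared with (r-1)(c-1) df
```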
Likelihood-ratio test
An alternative statistic for testing association is the likelihood ratio:

$$G^2 = 2 \sum_{i=1}^{r} \sum_{j=1}^{c} n_{ij} \log\!\left(\frac{n_{ij}}{E_{ij}}\right),$$

where once again $E_{ij} = n_{i+} n_{+j} / n$ are the expected counts under independence, and the asymptotic null distribution is $\chi^2$ with $(r - 1)(c - 1)$ degrees of freedom.
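For comparison, a hand-computed sketch of both statistics on the same hypothetical table (note the $G^2$ formula breaks down if any observed count is zero):

```r
tab <- matrix(c(20, 10,
                 8, 22), nrow = 2, byrow = TRUE)
E  <- outer(rowSums(tab), colSums(tab)) / sum(tab)  # expected counts under independence
X2 <- sum((tab - E)^2 / E)                          # Pearson statistic
G2 <- 2 * sum(tab * log(tab / E))                   # likelihood-ratio statistic
df <- (nrow(tab) - 1) * (ncol(tab) - 1)
c(X2 = X2, G2 = G2,
  p_X2 = pchisq(X2, df, lower.tail = FALSE),
  p_G2 = pchisq(G2, df, lower.tail = FALSE))
```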
On the same data, these three will obviously give different numerical results. Note that since the three tests apply to nominal categories, reordering the rows or columns doesn’t affect the values of the test statistics. For ordered categories, there are more appropriate tests.
Ordinal attributes
The following cases are considered here:
- Nominal explanatory and ordered response.
- Ordered explanatory and ordered response, e.g. increasing dose of a drug and varied side-effects.
- Row and column attributes are both ordered responses, e.g. two different side-effects.
Nominal row and ordered column
We’re going to look at the number of patients receiving each drug who experience different levels of side effects:
| | None | Slight | Moderate | Severe | Fatal | Total |
|---|---|---|---|---|---|---|
| Drug A | 23 | 8 | 9 | 3 | 2 | 45 |
| Drug B | 42 | 8 | 4 | 0 | 0 | 54 |
| Total | 65 | 16 | 13 | 3 | 2 | 99 |
Here we’re not saying anything about how this study was designed. We don’t have to think about whether any margins are fixed.
Question: is there an association between drug type (A vs. B) and level of side effect, or are these two attributes independent?
Procedure: This is like a massive WMW situation with lots of ties! Both test statistic formulations can be applied.
In the Wilcoxon formulation we may use mid-ranks again, e.g. 65 subjects had no side effects, so all get the same mid-rank of 33. 16 are tied at “slight”, occupying ranks 66 through 81, so we set their common mid-rank to $(66 + 81)/2 = 73.5$, and similarly for the remaining categories.
In the Mann-Whitney formulation, there is no need to specify the mid-ranks. We just count, for each recipient of drug A, the number of drug B patients showing the same or more severe side effects, counting ties as 0.5. For instance, the 42 drug B patients with no side effects are tied with the 23 drug A patients with no side effects, contributing $23 \times 42 \times 0.5 = 483$ to the count.
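One way to carry this out in R (a sketch, assuming the severity levels are coded 1 = None through 5 = Fatal) is to expand the table into individual severity scores and let `wilcox.test()` handle the mid-ranks via its tie-corrected normal approximation:

```r
drugA <- rep(1:5, times = c(23, 8, 9, 3, 2))   # 45 drug A patients
drugB <- rep(1:5, times = c(42, 8, 4, 0, 0))   # 54 drug B patients
# exact = FALSE because exact p-values are unavailable in the presence of ties
wilcox.test(drugA, drugB, exact = FALSE)
```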
If the nominal explanatory variable has more than two values (e.g. 3 drugs instead of 2 in the previous example), the obvious extension to the Kruskal-Wallis test applies.
Ordered row and column
The row categories are ordinal explanatory, and the columns are ordered responses. We want to ask if there’s an association between the two attributes. The J-T test can again be applied, with each row as an ordered sample.

Question: We have data on side effects experienced at increasing dose levels of a drug. Do side effects increase with dose level?
| | None | Slight | Moderate | Severe |
|---|---|---|---|---|
| 100mg | 50 | 0 | 1 | 0 |
| 200mg | 60 | 1 | 0 | 0 |
| 300mg | 40 | 1 | 1 | 0 |
| 400mg | 30 | 1 | 1 | 2 |
Procedure: again, we score ties in any column as 0.5 (as with the previous analysis), compute the MW statistic for each pair of dose groups (lower dose vs. higher dose), and add them all up. That is, the J-T statistic is the sum of the pairwise Mann-Whitney counts over all ordered pairs of rows.

Conclusion: this has a two-sided approximate p-value of 0.4576 - no evidence of association. Even the one-sided p-value is pretty high. At first glance this result may seem counter-intuitive, but side effects are very rare, so it’s hard to discover much pattern.
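A sketch of the calculation, assuming severity coded 1 = None through 4 = Severe and dose groups ordered 100 < 200 < 300 < 400 mg: sum the pairwise Mann-Whitney counts over all ordered pairs of dose groups, with ties scored as 0.5. A normal approximation (or a package implementation such as `clinfun::jonckheere.test()`) can then be used for the p-value; the value obtained may differ slightly from the 0.4576 quoted above depending on how the tie correction is handled.

```r
counts <- rbind(c(50, 0, 1, 0),   # 100mg
                c(60, 1, 0, 0),   # 200mg
                c(40, 1, 1, 0),   # 300mg
                c(30, 1, 1, 2))   # 400mg
groups <- lapply(1:4, function(i) rep(1:4, times = counts[i, ]))

jt <- 0
for (i in 1:3) {
  for (j in (i + 1):4) {
    x <- groups[[i]]; y <- groups[[j]]
    # pairs where the higher-dose patient has the more severe outcome,
    # plus half the tied pairs
    jt <- jt + sum(outer(x, y, "<")) + 0.5 * sum(outer(x, y, "=="))
  }
}
jt   # the J-T statistic
```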
The Goodman-Kruskal statistic
What if both row and column attributes are “responses”? Do high responses in one classification tend to be associated with high responses in the other (positive association) or low responses (negative association)? Or maybe there is no association between the two responses?
We can use the J-T test here as well, of course. The problem with the J-T test is that it isn’t calibrated as we’d like for a measure of association. There has been a lot of work in this general class of problems - calibration of measures of association in general.
One measure is the Goodman-Kruskal gamma statistic. Take a pair of individuals falling in different rows and different columns. If the individual who is higher on one attribute is also higher on the other, the orderings agree (a concordant pair); if the individual who is higher on one attribute is lower on the other, the orderings disagree (a discordant pair). Let $C$ be the total number of concordant pairs in the table. Similarly, let $D$ be the total number of discordant pairs. We define our test statistic

$$\hat{\gamma} = \frac{C - D}{C + D},$$

which is calibrated to be between $-1$ and $+1$: values near $+1$ indicate positive association, values near $-1$ indicate negative association, and values near $0$ indicate little or no association.
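A small helper function sketching the computation of $C$, $D$, and $\hat{\gamma}$ from a table of two ordered responses; the table used below is made up for illustration:

```r
gk_gamma <- function(tab) {
  nr <- nrow(tab); nc <- ncol(tab)
  C <- 0; D <- 0
  for (i in 1:nr) for (j in 1:nc) {
    # concordant pairs: the other cell is higher on both attributes
    if (i < nr && j < nc) C <- C + tab[i, j] * sum(tab[(i + 1):nr, (j + 1):nc])
    # discordant pairs: higher on one attribute, lower on the other
    if (i < nr && j > 1)  D <- D + tab[i, j] * sum(tab[(i + 1):nr, 1:(j - 1)])
  }
  c(C = C, D = D, gamma = (C - D) / (C + D))
}

# Hypothetical 3x3 table of two ordered responses
tab <- matrix(c(20,  5,  3,
                 8, 12,  6,
                 2,  7, 15), nrow = 3, byrow = TRUE)
gk_gamma(tab)   # gamma near +1 / -1 / 0: positive / negative / no association
```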
Testing goodness of fit
Here we’ll only talk about the $\chi^2$ test, where we have an asymptotic $\chi^2$ distribution as our reference. It’s widely used as a goodness-of-fit test of data to any discrete distribution.
Instead of a model of independence or lack of association as before, we can consider goodness of fit to a hypothesized discrete distribution, which may be binomial, Poisson, uniform or some other discrete distribution. We compute expected cell counts under the hypothesized model, and compare to the observed counts in the data.
Sometimes, the parameter(s) of those distributions will not be known. In this case, we need to estimate the unknown parameters from the data.
All parameters are known
A computer program is supposed to generate random digits from 0 to 9. If it is doing so, we’ll get digits that look like i.i.d. observations on the values 0 to 9, each with probability 0.1.
We want to test $H_0$: each digit $0, 1, \ldots, 9$ appears with probability 0.1, against the alternative that the digits do not follow this discrete uniform distribution.

Generate some number, suppose 300, of digits from the program, and compare the observed counts to those expected under the discrete uniform on 0 to 9 under $H_0$. Suppose we observe:
Digit | Observed count |
---|---|
0 | 22 |
1 | 28 |
2 | 41 |
3 | 35 |
4 | 19 |
5 | 25 |
6 | 25 |
7 | 40 |
8 | 30 |
9 | 35 |
Compare this to the expected count of $300 \times 0.1 = 30$ for each digit.
In R, use the chisq.test()
function. In this case:
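A minimal sketch of the call, using the observed counts from the table above:

```r
obs <- c(22, 28, 41, 35, 19, 25, 25, 40, 30, 35)  # digits 0 through 9
chisq.test(obs)   # default null: all 10 categories equally likely
# gives X-squared = 17 on df = 9, p-value approx. 0.049
```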
The default is to test for uniform probabilities if no others are specified. This gives a p-value of approximately 0.049. For a different set of specified probabilities, e.g. as given by a binomial or Poisson, we need to pass a vector `p` of the same length as the data vector specifying these probabilities. Alternatively, we can supply a vector proportional to the expected counts together with `rescale.p = T`, which rescales it to sum to one.
Test with estimated parameters
If we have to estimate parameters, we lose a degree of freedom for each one. In this situation, R will compute the test statistic but won’t use the correct degrees of freedom. Use pchisq()
with the correct degrees of freedom in this case.
Suppose we want to test the goodness of fit to a binomial, but the probability of success $p$ is unknown and must be estimated from the data.
We have data on the first 18 major league baseball players to have 45 times at bat in 1970. The number of hits they got in their 45 times at bat are given as follows:
We will test the null hypothesis that these data follow a $\mathrm{Bin}(45, p)$ distribution for some common success probability $p$, estimated from the data.
With $\hat{p}$ taken to be the total number of hits divided by $18 \times 45$, we can compute the probability of each possible number of hits (0 through 45) under the fitted binomial, and the corresponding expected counts:
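A minimal sketch, assuming the 18 hit counts listed above are stored in a vector `hits`:

```r
# hits: numeric vector of length 18, each entry the number of hits in 45 at-bats
phat  <- sum(hits) / (18 * 45)                 # estimated probability of a hit
probs <- dbinom(0:45, size = 45, prob = phat)  # fitted binomial probabilities
expected <- 18 * probs                         # expected number of players per hit count
round(expected, 3)
```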
Here we multiply by 18 because there are 18 players. We get many small probabilities, which means we’d get many expected counts of essentially zero. Such small expected counts make the $\chi^2$ approximation unreliable, so we pool the extreme numbers of hits into wider bins:
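One possible binning (a sketch; the exact cutoffs are a modelling choice): pool 0-7 hits and 18-45 hits, keeping 8 through 17 as individual categories, which gives 12 bins:

```r
breaks   <- c(-1, 7:17, 45)                                  # 12 bins: 0-7, 8, ..., 17, 18-45
obs_bin  <- table(cut(hits, breaks = breaks))                # observed players per bin
prob_bin <- tapply(probs, cut(0:45, breaks = breaks), sum)   # binned binomial probabilities
```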
Now we can build the chi-square goodness-of-fit test:
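A sketch of the test call with the binned counts and probabilities; `rescale.p = TRUE` guards against the binned probabilities not summing exactly to one:

```r
gof <- chisq.test(obs_bin, p = prob_bin, rescale.p = TRUE)
gof   # R may warn about small expected counts in the extreme bins
```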
The value of the test statistic is correct, but we should have 10 degrees of freedom instead of 11, because estimating $p$ from the data costs one degree of freedom. So we compute the p-value ourselves with `pchisq()`:
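A sketch of the corrected p-value, with degrees of freedom $12 - 1 - 1 = 10$ (categories minus one, minus one estimated parameter):

```r
pchisq(gof$statistic, df = 10, lower.tail = FALSE)
```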
Here we specify lower.tail = F
because we want the probability to the right of our observed test statistic value, and by default the probability to the left is calculated.
This is the last part of the “classical” nonparametric statistics. Next, we’ll be focusing on topics in modern nonparametric statistics, which is also the finale of our nonparametric methods discussions.