Contingency tables are used to evaluate the interaction between two categorical (qualitative) variables. They are also called two-way tables or cross-tabulations.
Searching for relationships between two categorical variables is a very common goal for researchers. Think, for example, of the classic question that marketers ask: who is more likely to buy certain product categories, young or old people, men or women…
A two-way table is a table with rows and columns that helps organize data from categorical variables.
A marginal distribution shows how many total responses there are for each category of the variable. The marginal distribution of a variable can be determined by looking at the “Total” column (or row).
Let’s look at an example.
Note: I couldn’t think of anything particularly clever, so I created a table (with fictitious data, of course) of rare silliness, imagining that the two categorical variables concern education level and favorite sci-fi series…
We build the table in R and display it by typing its name:

scifi_fans

and we get something like this:

                star trek star wars doctor who
degree                 44        38         26
diploma                53        35         30
lower education        58        22         29
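The code that builds the table is not shown above; here is a minimal sketch that reproduces the printed counts, assuming the object name scifi_fans used throughout the text:

```r
# Build the contingency table as a matrix of counts
# (fictitious data, as in the text)
scifi_fans <- matrix(
  c(44, 38, 26,
    53, 35, 30,
    58, 22, 29),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("degree", "diploma", "lower education"),
    c("star trek", "star wars", "doctor who")
  )
)
scifi_fans  # print the table
```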
Remember? A marginal distribution shows how many total responses there are for each category of the variable (at the margins, precisely, where the Total column or row is...).
We can compute row totals in R with:

margin.table(scifi_fans, 1)

and column totals with:

margin.table(scifi_fans, 2)

We can also find the "grand total" with:

margin.table(scifi_fans)

Here is the table with totals:
          star trek star wars doctor who TOTAL
degree           44        38         26   108
diploma          53        35         30   118
lower ed.        58        22         29   109
TOTAL           155        95         85   335
So the marginal totals by education level are 108 for degree holders, 118 for diploma holders, 109 for lower education.
Likewise, the marginal totals by sci-fi series type are 155 for Star Trek, 95 for Star Wars, 85 for Doctor Who.
The grand total must be the same in both directions, in this case 335.
We could also have displayed a complete table with totals using just a few lines of R code.
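One way to do this, a sketch using base R's addmargins() (note that it labels the appended margins "Sum" rather than "TOTAL"):

```r
# Counts from the table above
scifi_fans <- matrix(c(44, 38, 26, 53, 35, 30, 58, 22, 29),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(c("degree", "diploma", "lower ed."),
                                     c("star trek", "star wars", "doctor who")))

# addmargins() appends row and column totals (labelled "Sum")
addmargins(scifi_fans)
```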
We can then ask ourselves (and answer): what percentage of degree holders has a soft spot for Doctor Who?
Elementary, Watson (oh wait, that was a different series...):
26/108 = 0.24 = 24% of degree holders prefer Doctor Who
And how many Star Wars fans hold a diploma?
35/95 = 0.37 = 37% of Star Wars fans are diploma holders
In R, we can directly obtain row proportions with:

prop.table(scifi_fans, 1)

and the result will be:

          star trek star wars doctor who
degree    0.4074074 0.3518519  0.2407407
diploma   0.4491525 0.2966102  0.2542373
lower ed. 0.5321101 0.2018349  0.2660550
(as we can see, the row totals add up to 1, or 100%)
or column proportions with:

prop.table(scifi_fans, 2)

and the result will be:

          star trek star wars doctor who
degree    0.2838710 0.4000000  0.3058824
diploma   0.3419355 0.3684211  0.3529412
lower ed. 0.3741935 0.2315789  0.3411765
(as we can see, the column totals add up to 1, or 100%)
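If we prefer readable percentages to long decimals, a small sketch (the rounding precision is an arbitrary choice):

```r
# Counts from the table above
scifi_fans <- matrix(c(44, 38, 26, 53, 35, 30, 58, 22, 29),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(c("degree", "diploma", "lower ed."),
                                     c("star trek", "star wars", "doctor who")))

# Row percentages: within each education level, the share of fans per series
round(prop.table(scifi_fans, 1) * 100, 1)

# Column percentages: within each series, the share per education level
round(prop.table(scifi_fans, 2) * 100, 1)
```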
As always, there is more than one way to get the result. We can also install the "gmodels" package and use the CrossTable function (we'll leave it to R's built-in help to show all the command options...):
install.packages("gmodels")
library(gmodels)
CrossTable(scifi_fans)

So what is all this good for? The answer is: for example, to compute conditional probability.
Before we see what it is and why it is an extremely useful concept in everyday life, we need a few preliminary definitions about probability.
An experiment is the process of measuring or making an observation.

An event is an outcome (or set of outcomes) of an experiment.
Key definition: the probability of an event is the ratio of the number of favorable cases to the number of possible cases:

\( P(A) = \frac{\text{number of favorable cases}}{\text{number of possible cases}} \)

Let us also recall that:
The conditional probability of an event A with respect to an event B is the probability that A occurs, given that B has occurred.
The formula is:
\( P(A|B) = \frac{P(A \text{ and } B)}{P(B)} \)

If a probability is based on one variable it is a marginal probability; if on two or more variables it is called a joint probability.
In other words:
A joint probability is the probability that someone selected from the entire group has two particular characteristics at the same time. That is, both characteristics occur jointly. We find a joint probability by taking the value of the cell at the intersection of A and B and dividing by the grand total.
To find a conditional probability, we take the value of the cell at the intersection of A and B and divide it by the marginal total of B, i.e. the variable expressing the event that has occurred.
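On the sci-fi table, the two recipes look like this in R (a sketch; the cell and margin chosen here, degree and Doctor Who, are just one example):

```r
# Counts from the sci-fi table above
scifi_fans <- matrix(c(44, 38, 26, 53, 35, 30, 58, 22, 29),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(c("degree", "diploma", "lower ed."),
                                     c("star trek", "star wars", "doctor who")))

grand_total <- sum(scifi_fans)

# Joint probability: cell at the intersection / grand total
# P(degree AND doctor who)
p_joint <- scifi_fans["degree", "doctor who"] / grand_total

# Conditional probability: same cell / marginal total of the given event
# P(doctor who | degree) = cell / row total of "degree"
p_cond <- scifi_fans["degree", "doctor who"] / sum(scifi_fans["degree", ])

round(c(joint = p_joint, conditional = p_cond), 3)
```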
It's time for a second example. We take the data from:
Ellis GJ and Stone LH. 1979. Marijuana Use in College: An Evaluation of a Modeling Explanation. Youth and Society 10:323-334.
The study asks whether a college student is more likely to smoke marijuana if their parents had used drugs in the past. Here is the table:
                     parents used   parents did not use   Total
student uses              125                 94            219
student does not use       85                141            226
Total                     210                235            445

Let's apply our knowledge to answer these questions.

1. A student is selected at random and their parents used drugs in the past. What is the probability that the student uses marijuana?

This is a case of conditional probability.
We recall \( P(A|B) = \frac{P(A \text{ and } B)}{P(B)} \), therefore
P(student uses given that parents used) = 125 / 210 ≈ 0.595 = 59.5%
2. A student is selected at random and does not use marijuana. What is the probability that their parents used it?
Here again we face a question that asks for a conditional probability. Therefore:
P(parents used given that student does not use) = 85 / 226 = 0.376 = 37.6%
3. What is the probability of selecting a student who does not use marijuana and whose parents used it in the past?
In this case we need to find a joint probability, so:
joint probability = \( \frac{\text{cell count}}{\text{grand total}} \), therefore \( \frac{85}{445} \approx 0.19 \).
The probability is approximately 19%.
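The three answers can be reproduced in R; a sketch assuming a matrix (here called drugs, a name not in the original study) built from the Ellis and Stone counts above:

```r
# Counts from Ellis & Stone (1979), as in the table above
drugs <- matrix(c(125, 94, 85, 141),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("uses", "does not use"),
                                c("used", "did not use")))

# 1. P(student uses | parents used): cell / column total of "used"
p1 <- drugs["uses", "used"] / sum(drugs[, "used"])                  # 125/210

# 2. P(parents used | student does not use): cell / row total
p2 <- drugs["does not use", "used"] / sum(drugs["does not use", ])  # 85/226

# 3. Joint P(student does not use AND parents used): cell / grand total
p3 <- drugs["does not use", "used"] / sum(drugs)                    # 85/445

round(c(p1, p2, p3), 3)
```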
If the outcomes of A and B influence each other, we say that the two variables are in a relationship of dependence; if they do not influence each other, we say the two variables are independent.
More rigorously: we can state that event B is independent of event A if:
P(B|A) = P(B)
or
P(A|B) = P(A)
If this is not the case, the events are dependent on each other.
Let's explain this better with an example.
Let A be the event that people enjoy cycling.
B expresses whether they enjoy roast lamb. (Makes perfect sense, right?)
We build our contingency table:
                 Likes cycling  Doesn't like cycling  Total
Likes roast lamb            95                    36    131
No roast lamb               15                    19     34
Total                      110                    55    165
Let's remember what it means for two events to be independent. It means this:
P(A | B) = P(A)
But in our case we see that
P(A) = 66.7%

because 110/165 = 0.667
P(A | B) = 72.5%
because 95/131 = 0.725
We recall that \( P(A|B) = \frac{P(A \text{ and } B)}{P(B)} = \frac{95/165}{131/165} = \frac{95}{131} \approx 0.725 \).

From the result it is clear that \( P(A) \neq P(A|B) \): the two events are NOT independent (therefore they are dependent).
After all, everyone knows that there is a clear dependence between loving cycling and loving roast lamb!
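The comparison can be checked in R; a sketch assuming a matrix (here called tastes, an illustrative name) built from the equally fictitious cycling/lamb counts above:

```r
# Counts from the cycling / roast lamb table
tastes <- matrix(c(95, 36, 15, 19),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("likes lamb", "no lamb"),
                                 c("likes cycling", "no cycling")))

# P(A): marginal probability of liking cycling
p_a <- sum(tastes[, "likes cycling"]) / sum(tastes)                           # 110/165

# P(A | B): probability of liking cycling given liking roast lamb
p_a_given_b <- tastes["likes lamb", "likes cycling"] / sum(tastes["likes lamb", ])  # 95/131

# They differ, so A and B are not independent
round(c(marginal = p_a, conditional = p_a_given_b), 3)
```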
For a comprehensive treatment of contingency tables, conditional probability, and the full machinery of categorical data analysis, Statistica by Newbold, Carlson and Thorne provides a rigorous yet accessible framework for applying these concepts in real-world settings.