
Contingency Tables and Conditional Probability

Contingency tables are used to evaluate the interaction between two categorical (qualitative) variables. They are also called two-way tables or cross-tabulations.

Searching for relationships between two categorical variables is a very common goal for researchers. Think, for example, of the classic question that marketers ask: who is more likely to buy certain product categories, young or old people, men or women…


Two-Way Tables and Marginal Distributions

A two-way table is a table with rows and columns that helps organize data from categorical variables:

  • Rows represent the possible categories for one qualitative variable, for example males and females.
  • Columns represent the possible categories for a second qualitative variable, for example whether someone likes pizza or not…

A marginal distribution shows how many total responses there are for each category of the variable. The marginal distribution of a variable can be determined by looking at the “Total” column (or row).

Let’s look at an example.

Note: I couldn’t think of anything particularly clever, so I created a table (with fictitious data, of course) of rare silliness, imagining that the two categorical variables concern education level and favorite sci-fi series…

We build the table in R:

scifi_fans <- matrix(c(44, 38, 26,
                       53, 35, 30,
                       58, 22, 29),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(c("degree", "diploma", "lower education"),
                                     c("star trek", "star wars", "doctor who")))
scifi_fans

and we get something like this:

                 star trek   star wars   doctor who
degree               44          38          26
diploma              53          35          30
lower education      58          22          29

Remember? A marginal distribution shows how many total responses there are for each category of the variable (at the margins, precisely, where the Total column or row is...).

We can compute row totals in R with:

margin.table(scifi_fans, 1)

and column totals with:

margin.table(scifi_fans, 2)

We can also find the "grand total" with:

margin.table(scifi_fans)

Here is the table with totals:

              star trek   star wars   doctor who   TOTAL
degree            44          38          26        108
diploma           53          35          30        118
lower ed.         58          22          29        109
TOTAL            155          95          85        335

So the marginal totals by education level are 108 for degree holders, 118 for diploma holders, 109 for lower education.

Likewise, the marginal totals by sci-fi series type are 155 for Star Trek, 95 for Star Wars, 85 for Doctor Who.

The grand total must be the same in both directions, in this case 335.
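These relationships are easy to check in R. A minimal sketch (the matrix is rebuilt here so the snippet runs on its own; the variable names are mine):

```r
# Rebuild the contingency table from the example above
scifi_fans <- matrix(c(44, 38, 26,
                       53, 35, 30,
                       58, 22, 29),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(c("degree", "diploma", "lower education"),
                                     c("star trek", "star wars", "doctor who")))

row_totals <- margin.table(scifi_fans, 1)  # 108 118 109
col_totals <- margin.table(scifi_fans, 2)  # 155 95 85

# The grand total is the same whichever way we sum
sum(row_totals)            # 335
sum(col_totals)            # 335
margin.table(scifi_fans)   # 335
```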

We could also have displayed a complete table with totals using just a few lines of R code:

scifi_fans_tot <- addmargins(scifi_fans)  # addmargins() labels the margins "Sum" rather than "TOTAL"
scifi_fans_tot

We can then ask ourselves (and answer): what percentage of degree holders has a soft spot for Doctor Who?
Elementary, Watson (oh wait, that was a different series...):

26/108 ≈ 0.24, so about 24% of degree holders prefer Doctor Who

And how many Star Wars fans hold a diploma?

35/95 ≈ 0.37, so about 37% of Star Wars fans are diploma holders
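The same two proportions can be read straight off the table in R. A quick sketch (rebuilding the matrix so it runs standalone; `p_deg_who` and `p_sw_dip` are names of my choosing):

```r
scifi_fans <- matrix(c(44, 38, 26,
                       53, 35, 30,
                       58, 22, 29),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(c("degree", "diploma", "lower education"),
                                     c("star trek", "star wars", "doctor who")))

# share of degree holders who prefer Doctor Who: cell / row total
p_deg_who <- scifi_fans["degree", "doctor who"] / margin.table(scifi_fans, 1)["degree"]

# share of Star Wars fans who hold a diploma: cell / column total
p_sw_dip <- scifi_fans["diploma", "star wars"] / margin.table(scifi_fans, 2)["star wars"]

round(c(p_deg_who, p_sw_dip), 2)   # 0.24 0.37
```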

In R, we can directly obtain row proportions with the function:

prop.table(scifi_fans, 1)

and the result will be:

                 star trek    star wars    doctor who
degree           0.4074074    0.3518519    0.2407407
diploma          0.4491525    0.2966102    0.2542373
lower ed.        0.5321101    0.2018349    0.2660550

(as we can see, the row totals add up to 1, or 100%)

or column proportions with:

prop.table(scifi_fans, 2)

and the result will be:

                 star trek    star wars    doctor who
degree           0.2838710    0.4000000    0.3058824
diploma          0.3419355    0.3684211    0.3529412
lower ed.        0.3741935    0.2315789    0.3411765

(as we can see, the column totals add up to 1, or 100%)
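We can confirm both parenthetical claims, that row proportions and column proportions each sum to 1, with a small check (again rebuilding the matrix so the snippet is self-contained):

```r
scifi_fans <- matrix(c(44, 38, 26,
                       53, 35, 30,
                       58, 22, 29),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(c("degree", "diploma", "lower education"),
                                     c("star trek", "star wars", "doctor who")))

# row proportions: each row sums to 1
rowSums(prop.table(scifi_fans, 1))   # 1 1 1

# column proportions: each column sums to 1
colSums(prop.table(scifi_fans, 2))   # 1 1 1
```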

As always, there is more than one way to get the result. We can also install the "gmodels" package and use the CrossTable function (we'll leave it to R's built-in help to show all the command options...):

install.packages("gmodels")
library(gmodels)
CrossTable(scifi_fans)

So what is all this good for? The answer is: for example, to compute conditional probability.


Conditional Probability

Before we see what it is and why it is an extremely useful concept in everyday life, we need a few preliminary definitions about probability.

An experiment is the process of making a measurement or an observation.
An event is a set of one or more possible outcomes of an experiment.

Key definition (the classical one): the probability of an event is the ratio of the number of favorable cases to the number of possible cases, provided all cases are equally likely:

\( P(A) = \frac {\text{number of favorable cases}}{\text{number of possible cases}}\\ \)

Let us also recall that:

  • The probability that two events both occur can never be greater than the probability that each event occurs separately.
  • If two possible events, A and B, are independent, then the probability that both occur is the product of their individual probabilities.
  • If an event can have a certain number of different and distinct possible outcomes (A, B, C, etc.), then the probability that A or B occurs equals the sum of the individual probabilities of A and B, and the sum of the probabilities of all possible outcomes (A, B, C, etc.) equals 1, i.e. 100%.
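As a concrete sanity check of these three rules, consider two fair six-sided dice (a made-up example, not from the survey data above):

```r
p_a <- 1/6   # P(first die shows a 6)
p_b <- 1/6   # P(second die shows a 6)

# Rule 2: independent events multiply
p_both <- p_a * p_b           # 1/36

# Rule 1: the joint probability is never larger than either individual one
p_both <= min(p_a, p_b)       # TRUE

# Rule 3: mutually exclusive outcomes add, and all outcomes sum to 1
p_one_or_two <- 1/6 + 1/6     # 1/3
sum(rep(1/6, 6))              # 1
```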

The conditional probability of an event A with respect to an event B is the probability that A occurs, given that B has occurred.

The formula is:

\( P(A|B) = \frac {P(A \text{ and } B)}{P(B)}\\ \)

If a probability involves a single variable it is a marginal probability; if it involves two or more variables it is called a joint probability.

  • The marginal probability of an event A is: \( P(A) = \frac {\text{marginal total of A}}{\text{grand total}}\\ \)
  • The joint probability of two events A and B is: \( P(A \text{ and } B) = \frac {\text{count in the cell (A and B)}}{\text{grand total}}\\ \)
  • The conditional probability of outcome A given the occurrence of condition B is: \( P(A|B) = \frac {\text{count in the cell (A and B)}}{\text{marginal total of B}}\\ \)

In other words:

A joint probability is the probability that someone selected from the entire group has two particular characteristics at the same time. That is, both characteristics occur jointly. We find a joint probability by taking the value of the cell at the intersection of A and B and dividing by the grand total.

To find a conditional probability, we take the value of the cell at the intersection of A and B and divide it by the marginal total of B, i.e. the total for the event we know has occurred.


It's time for a second example. We take the data from:
Ellis GJ and Stone LH. 1979. Marijuana Use in College: An Evaluation of a Modeling Explanation. Youth and Society 10:323-334.

The study asks whether a college student is more likely to smoke marijuana if their parents had used drugs in the past. Here is the table:

                       parents use   parents no use   Total
student uses               125              94         219
student does not use        85             141         226
Total                      210             235         445

Let's apply our knowledge to answer these questions:

  1. If the parents used soft drugs in the past, what is the probability that their child does the same in college?

This is a case of conditional probability.
We recall \( P(A|B) = \frac {P(A \text{ and } B)}{P(B)}\\ \), therefore

P(student uses given that parents used) = 125 / 210 ≈ 0.595 = 59.5%

  2. A student is selected at random and does not use marijuana. What is the probability that their parents used it?

Here again we face a question that asks for a conditional probability. Therefore:

P(parents used given that student does not use) = 85 / 226 = 0.376 = 37.6%

  3. What is the probability of selecting a student who does not use marijuana and whose parents used it in the past?

In this case we need to find a joint probability, so:

\( \frac {P(A \text{ and } B)}{\text{Total}}\\ \), therefore \( \frac {85}{445} = 0.19\\ \).

The probability is approximately 19%.
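All three answers can be reproduced in R from the Ellis and Stone counts. A sketch (the `drug_use` matrix and variable names are mine):

```r
drug_use <- matrix(c(125,  94,
                      85, 141),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("student uses", "student does not use"),
                                   c("parents use", "parents no use")))

grand_total <- margin.table(drug_use)   # 445

# 1. P(student uses | parents used): cell / column total
p1 <- drug_use["student uses", "parents use"] /
      margin.table(drug_use, 2)["parents use"]              # 125/210

# 2. P(parents used | student does not use): cell / row total
p2 <- drug_use["student does not use", "parents use"] /
      margin.table(drug_use, 1)["student does not use"]     # 85/226

# 3. Joint: P(student does not use AND parents used): cell / grand total
p3 <- drug_use["student does not use", "parents use"] / grand_total   # 85/445

round(c(p1, p2, p3), 3)   # 0.595 0.376 0.191
```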

Dependence and Independence

If the outcome of one variable changes the probabilities of the other, we say the two variables are in a relationship of dependence.
If, instead, knowing the outcome of one tells us nothing about the other, we say they are independent.

More rigorously: we can state that event B is independent of event A if:

P(B|A) = P(B)

or

P(A|B) = P(A)

If this is not the case, the events are dependent on each other.

Therefore:

  • P(A and B) = P(A) P(B) if and only if A and B are independent events.
  • P(A | B) = P(A) and P(B | A) = P(B) if and only if A and B are independent events.

Let's examine the independence of two categorical variables with an example.

Let A be the event that people enjoy cycling.
B expresses whether they enjoy roast lamb. (Makes perfect sense, right?)

We build our contingency table:

                  Likes cycling   Doesn't like cycling   Total
Likes roast lamb       95                36               131
No roast lamb          15                19                34
---------------------------------------------------------------
Total                 110                55               165

Let's remember what it means for two events to be independent. It means this:
P(A | B) = P(A)

But in our case we see that
P(A) ≈ 66.7%
because 110/165 ≈ 0.667

P(A | B) ≈ 72.5%
because 95/131 ≈ 0.725

We recall that \( P(A|B) = \frac {P(A \text{ and } B)}{P(B)}\\ \); here \( P(A \text{ and } B) = \frac{95}{165} \) and \( P(B) = \frac{131}{165} \), so the ratio reduces to \( \frac{95}{131} = 0.725\\ \).

From the result it is clear that \( P(A) \neq P(A|B) \): the two events are NOT independent (therefore they are dependent).

After all, everyone knows that there is a clear dependence between loving cycling and loving roast lamb!
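The same comparison can be scripted in R (a sketch with the cycling / roast lamb counts; `tastes` is a name of my choosing):

```r
tastes <- matrix(c(95, 36,
                   15, 19),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("likes lamb", "no lamb"),
                                 c("likes cycling", "no cycling")))

# P(A): overall share who like cycling
p_a <- margin.table(tastes, 2)["likes cycling"] / margin.table(tastes)   # 110/165

# P(A | B): share of lamb lovers who like cycling
p_a_given_b <- tastes["likes lamb", "likes cycling"] /
               margin.table(tastes, 1)["likes lamb"]                     # 95/131

# Independence would require these to be (essentially) equal
isTRUE(all.equal(as.numeric(p_a), as.numeric(p_a_given_b)))   # FALSE
```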


Further Reading

For a comprehensive treatment of contingency tables, conditional probability, and the full machinery of categorical data analysis, Statistica by Newbold, Carlson and Thorne provides a rigorous yet accessible framework for applying these concepts in real-world settings.
