The Chi-Square Test: Goodness of Fit and Test of Independence

In previous posts, we have seen different types of tests that we can use to analyze our data and test hypotheses.

The chi-square test was proposed by Karl Pearson in 1900. It is widely used to assess how well the observed distribution of a categorical variable matches an expected distribution (in this case, we talk about the “Goodness of Fit Test”), or to assess whether two categorical variables are independent of each other (and then we talk about the “Test of Independence”).

Such is the importance and widespread use of this test that it was listed by the magazine Scientific American among the 20 most important scientific discoveries of the 20th century.


The Goodness of Fit Test

The goodness of fit test concerns the distribution of a single categorical variable. It lets us check whether the observed frequencies differ significantly from the expected frequencies, and it is typically used when there are more than two possible outcomes.

The prerequisites for carrying out the test are very simple:

  1. The sample must be random;
  2. The observations must be independent of one another (one observation per subject);
  3. The expected frequency in each class should be at least 5.
    This last point sounds rather cryptic and deserves a few more words. When the variable is continuous, or is not nominal, and the individual sample observations are available, we have to decide the number of classes (also called “cells”) into which the distribution is divided. In practice, the classes must be chosen so that the theoretical (expected) frequency of each one is at least 5; classes that fall below this threshold are usually merged with adjacent ones.
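
The check on expected frequencies is easy to automate. What follows is only a minimal sketch: check_expected() is a hypothetical helper written for this illustration (it is not part of base R), and the sample size and proportions are made-up numbers.

```r
# Hypothetical helper, not part of base R: checks the "at least 5" rule
# on the expected frequencies of a goodness of fit test.
check_expected <- function(n, proportions, threshold = 5) {
  expected <- n * proportions          # expected count per class
  list(expected = expected,
       ok = all(expected >= threshold))
}

# Illustrative numbers: 40 observations, three classes
small_sample <- check_expected(40, c(0.6, 0.3, 0.1))
small_sample$expected   # the third class has an expected count below 5
small_sample$ok         # FALSE

# Merging the two smallest classes restores the condition
merged <- check_expected(40, c(0.6, 0.4))
merged$ok               # TRUE
```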

Understanding Through a Simple Example

As usual, to better understand what we are talking about, we will explain it with a super-simplified (and, I apologize, quite ridiculous…) example.

Suppose a study was conducted on electronics hobbyists who use Arduino boards. It was found that 50% own only one Arduino board, 30% have 2 to 4 boards, and 20% own 5 or more.

Let’s imagine that I conducted my own independent study and found these data: out of 150 hobbyists, I found that 90 owned only one Arduino, 30 had 2 to 4 boards, and 30 had 5 or more boards.

The null hypothesis is that the proportions I found are in line with those of the official study.
The alternative hypothesis is obviously that the collected data do not confirm the proportions of the official study.

I prepare my table by entering the data:

| | One Arduino | 2 to 4 boards | 5 or more boards | Total |
|---|---|---|---|---|
| Observed data | 90 | 30 | 30 | 150 |
| Expected data | 0.50 x 150 = 75 | 0.30 x 150 = 45 | 0.20 x 150 = 30 | 150 |

For the null hypothesis to stand, the difference between the expected and observed frequencies must be attributable to sampling variability at the chosen level of significance.

The χ2 statistic calculated from the sample data is given by:

\( \chi^2=\Sigma\frac{(f_o-f_e)^2}{f_e} \)

f_o = observed frequencies
f_e = expected frequencies

The degrees of freedom for the goodness of fit test are:

\( df=k-1 \)

k = number of categories (classes) of the variable

The formula \( df=(r-1)(c-1) \), where r and c are the numbers of rows and columns of a contingency table, applies instead to the test of independence discussed later.

Let’s apply this to our example. We start from the hypotheses:

\( H_0:\ the\ proportions\ are\ 0.5,\ 0.3,\ 0.2\\ H_a:\ the\ proportions\ are\ not\ 0.5,\ 0.3,\ 0.2 \)

We have:

\( n=150\\ df=3-1=2 \)

We find the critical χ2 value in the tables (df=2, α=0.05)
The value is: 5.99
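
The table lookup can also be reproduced in R. A minimal sketch, assuming the same degrees of freedom and significance level as above; qchisq() is the quantile function of the chi-square distribution in base R, while the variable names are my own:

```r
alpha <- 0.05
df <- 2

# Critical value: the (1 - alpha) quantile of the chi-square distribution
critical_value <- qchisq(1 - alpha, df = df)
round(critical_value, 2)
```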

Now I calculate the χ2 value for my data:

\( \chi^2=\frac{(90-75)^2}{75}+\frac{(30-45)^2}{45}+\frac{(30-30)^2}{30}=\frac{225}{75}+\frac{225}{45}+\frac{0}{30}=3+5+0=8 \)

We conclude then (since the calculated value is higher than the critical value) that we can reject the null hypothesis at the 5% significance level. That is, we can reject the assertion that the frequencies are distributed according to the proportion 50%, 30%, 20%.
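
For those who prefer to check the arithmetic with code, the same steps can be sketched in R. This is just the formula applied by hand, not a replacement for a proper test function; the variable names are illustrative.

```r
observed <- c(90, 30, 30)
expected <- 150 * c(0.50, 0.30, 0.20)   # 75, 45, 30

# Pearson statistic: sum of (observed - expected)^2 / expected
chi_sq <- sum((observed - expected)^2 / expected)

# Upper-tail probability of a chi-square distribution with 2 degrees of freedom
p_value <- pchisq(chi_sq, df = 2, lower.tail = FALSE)

# Decision at the 5% level: reject H0 if the statistic exceeds the critical value
reject_h0 <- chi_sq > qchisq(0.95, df = 2)
```

Comparing the statistic with the critical value and comparing the p-value with α are two equivalent ways of reaching the same decision.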

Making Life Easier with a Casio Scientific Calculator

With my fx calculator, I just need to choose “STAT” from the menu and enter the observed values in List 1 and the expected values in List 2 of the list editor.

Then I will choose:

[TEST]
[CHI]
[GoF]
Observed:List1
Expected:List2
df:2
[CALC]

and I will get both the chi-square value and the p-value (in this case, 0.01832, which is less than the alpha value of 0.05 I chose, confirming the conclusion that I can reject the null hypothesis and accept the alternative one).

Using R for the Goodness of Fit Test

In R, the example given is even easier to set up:

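# observed counts: one board, 2 to 4 boards, 5 or more boards
observed <- c(90, 30, 30)
# proportions stated by the null hypothesis
expected_proportion <- c(0.5, 0.3, 0.2)
# correct=FALSE has no effect here: the continuity correction only applies to 2x2 tables
chisq.test(observed, p = expected_proportion, correct = FALSE)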

and the result will be:

Chi-squared test for given probabilities
data: observed
X-squared = 8, df = 2, p-value = 0.01832
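
The object returned by chisq.test() contains more than the printed summary. As a small sketch (the name fit is my own), its components can be inspected to see which class contributes most to the statistic:

```r
observed <- c(90, 30, 30)
fit <- chisq.test(observed, p = c(0.5, 0.3, 0.2))

fit$expected    # expected counts under the null hypothesis
fit$residuals   # Pearson residuals: (observed - expected) / sqrt(expected)
fit$p.value     # the p-value as a plain number
```

The residual with the largest absolute value belongs to the “2 to 4 boards” class, which is the one that deviates most from the official proportions.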

The Test of Independence

The test of independence is commonly used to determine whether two factors are related to each other.

Generally, what we want to know is: “Is variable X independent of variable Y?”

Note: the answer we get from our test is only this, not how the variables are related.

In the case of the goodness of fit test, there is only one variable at play: the observed frequencies can therefore be listed in a single row, or column, of values in a table.

Tests of independence, on the other hand, involve two variables, and the object of the test is precisely the assumption that the two variables are statistically independent.

Since two variables are involved in the test, the observed frequencies are entered into a contingency table of the row x column type.
For example, I represent the data relating to the age and gender of enthusiasts of a given commercial brand:

| Age | Male | Female | Total |
|---|---|---|---|
| <35 | 66 | 54 | 120 |
| >=35 | 78 | 12 | 90 |
| Total | 144 | 66 | 210 |

We want to test the null hypothesis that the two qualitative variables, gender and age, are independent. Therefore, the alternative hypothesis predicts that there is a relationship between the two variables.

If the hypothesis of independence is true, the expected frequency of each cell must stand in the same proportion to its row total as the corresponding column total stands to the total sample size. In other words, the expected frequency of a cell is the product of its row total and its column total, divided by the sample size:

\( f_e=\frac{\Sigma_{row}\ \Sigma_{column}}{n} \)

\( df=(r-1)(c-1) \)

At this point, I proceed with my example:

\( f_e=\frac{\Sigma_{row}\ \Sigma_{column}}{n}=\frac{120\times 144}{210}\approx 82.3 \)

The 3 remaining frequencies can be easily obtained by subtraction from the row and column totals. In fact, a 2×2 table has df=1, meaning that the frequency of only one cell is free to vary.
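
The whole table of expected frequencies can also be obtained in one step. A minimal sketch in R, assuming the observed counts of the example; outer() multiplies every row total by every column total, and the object names are illustrative:

```r
observed <- matrix(c(66, 54, 78, 12), ncol = 2, byrow = TRUE,
                   dimnames = list(age = c("<35", ">=35"),
                                   gender = c("male", "female")))

# Expected count of each cell = row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 1)
```

The expected counts keep the same row and column totals as the observed table, which is why only one cell is free to vary.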

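I will get (values rounded to whole numbers):

| Age | Male | Female | Total |
|---|---|---|---|
| <35 | 82 | 38 | 120 |
| >=35 | 62 | 28 | 90 |
| Total | 144 | 66 | 210 |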

\( H_0:\ gender\ and\ age\ are\ independent\\ H_a:\ there\ is\ a\ relationship\ between\ gender\ and\ age \)

\( df=(2-1)(2-1)=1 \)

I choose a significance level of α=0.01

\( \chi^2_{critical}=6.63\ \)

I calculate the chi-square value and find:

\( \chi^2=23.9\ \)

Therefore, the null hypothesis of independence is rejected at the 1% significance level. The variables age and gender are dependent.
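
The calculation that leads to 23.9 can be sketched step by step in R. This is the plain Pearson formula applied to the unrounded expected frequencies, with no continuity correction; the variable names are my own.

```r
observed <- matrix(c(66, 54, 78, 12), ncol = 2, byrow = TRUE)
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson statistic summed over the four cells
chi_sq <- sum((observed - expected)^2 / expected)

# Critical value for alpha = 0.01 with 1 degree of freedom
critical <- qchisq(1 - 0.01, df = 1)

# Upper-tail p-value
p_value <- pchisq(chi_sq, df = 1, lower.tail = FALSE)

round(c(chi_sq = chi_sq, critical = critical), 2)
```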

The Test of Independence with Casio

To solve my example very easily with my Casio, I could have done this:

I load my table data into a matrix, which I call A:

[[66,54][78,12]]→[OPTN][MAT][MAT][ALPHA][A]

At this point, I move to the statistical functions:

[MENU][STAT]

[TEST][CHI][2WAY]

Observed:Mat A

Expected:Mat B

[CALC]
The result will be:

χ2=23.9299242
p=9.9907e-07
df=1

As can be seen from the very low p-value, I accept the alternative hypothesis and reject the null hypothesis.

The Test of Independence with R

I build my contingency table:

enthusiasts <- matrix(c(66,54,78,12),ncol=2,byrow=TRUE)
rownames(enthusiasts) <- c("less than 35","35 or more")
colnames(enthusiasts) <- c("male","female")
enthusiasts <- as.table(enthusiasts)
enthusiasts

I can calculate the row totals:
margin.table(enthusiasts,1)

and the column totals:
margin.table(enthusiasts,2)

the grand total is:
margin.table(enthusiasts)

I look at the expected values:
chisq.test(enthusiasts)$expected

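and test the hypothesis with the call below. Note that for 2×2 tables chisq.test() applies Yates' continuity correction by default, so the statistic it reports is slightly lower than the 23.93 computed by hand; adding correct=FALSE reproduces the manual result:
chisq.test(enthusiasts)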

The resulting very low p-value indicates that I can reject the null hypothesis of independence of the two variables.
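
A short sketch makes the role of the correct argument of chisq.test() visible on this 2×2 table. The object names are illustrative; the conclusion of the test is the same in both cases.

```r
enthusiasts <- matrix(c(66, 54, 78, 12), ncol = 2, byrow = TRUE)

# Default behaviour on a 2x2 table: Yates' continuity correction is applied
with_correction <- chisq.test(enthusiasts)

# Plain Pearson statistic, comparable with the manual and Casio results
without_correction <- chisq.test(enthusiasts, correct = FALSE)

unname(with_correction$statistic)
unname(without_correction$statistic)
```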

An SEO Example: Does CTR Depend on the Device?

Arduino hobbyists and brand enthusiasts are fine for understanding the mechanics, but the test of independence is at its best in the daily practice of anyone working with Search Console data. Let’s pick up the numbers we already met when discussing Simpson’s Paradox: in one month our site collected 10,000 impressions on Desktop with 550 clicks, and 20,000 impressions on Mobile with 500 clicks. The CTR is therefore 5.5% versus 2.5%: a difference that looks huge, but is it real, or could it be the product of chance?

Phrased in the language of this article, the question becomes: is the click independent of the device? We build the contingency table, with one important caveat: the cells must contain counts, never percentages. For each device we therefore need the clicks and the “no clicks” (the impressions that did not generate a click).

| Device | Clicks | No clicks | Total |
|---|---|---|---|
| Desktop | 550 | 9,450 | 10,000 |
| Mobile | 500 | 19,500 | 20,000 |
| Total | 1,050 | 28,950 | 30,000 |

The hypotheses are the usual ones:

\( H_0:\ click\ and\ device\ are\ independent\\ H_a:\ there\ is\ a\ relationship\ between\ click\ and\ device \)

I check in R, building the matrix of counts (I use correct=FALSE so that the result can be compared with a manual calculation):

ctr <- matrix(c(550, 9450, 500, 19500), ncol=2, byrow=TRUE)
rownames(ctr) <- c("Desktop", "Mobile")
colnames(ctr) <- c("click", "no click")
chisq.test(ctr, correct=FALSE)

the result will be:

Pearson's Chi-squared test
data:  ctr
X-squared = 177.65, df = 1, p-value < 2.2e-16

The p-value is infinitesimal: we reject the null hypothesis without hesitation. The click depends on the device, and the difference between the two CTRs cannot be attributed to chance.

N.B.: with the volumes typical of Search Console (tens of thousands of impressions) the chi-square test rejects the null hypothesis even for tiny, practically irrelevant differences. Statistical significance tells us that the difference is unlikely to be the product of chance, not that it matters: with very large samples the two things must be kept well apart.
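
To see how much the verdict depends on volume, here is a small sketch. The table ctr_small is hypothetical: it keeps exactly the same two CTRs (5.5% and 2.5%) but assumes one tenth of the impressions.

```r
ctr <- matrix(c(550, 9450, 500, 19500), ncol = 2, byrow = TRUE)

# Hypothetical scenario: identical CTRs, one tenth of the impressions
ctr_small <- ctr / 10

full  <- chisq.test(ctr, correct = FALSE)
small <- chisq.test(ctr_small, correct = FALSE)

# With the proportions held fixed, the statistic scales with the sample size
unname(full$statistic)
unname(small$statistic)
```

The CTRs have not changed, yet the statistic is ten times smaller: the chi-square value measures the evidence against independence, which grows with the amount of data, and not the size of the difference.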

Try It Yourself

To consolidate the mechanics, here is an exercise with made-up but realistic data. From the Search Console of an e-commerce site we extract one month of data, separating brand queries from non-brand ones:

Query typeClicksNo clicksTotal
| Brand | 240 | 1,760 | 2,000 |
| Non-brand | 540 | 17,460 | 18,000 |
| Total | 780 | 19,220 | 20,000 |

The question is the same as before: does the click depend on the query type? The exercise consists of formulating the hypotheses, choosing α=0.05, building the matrix in R and running the test (again with correct=FALSE). If everything goes smoothly, the chi-square should come out close to 389, with a microscopic p-value. And while we are at it: which of the two CTRs (12% versus 3%) “pulls” the result more? A look at the expected frequencies with chisq.test(...)$expected helps answer that.

One question remains open, though, and it is subtler than it seems: the test told us that the dependence exists, not how strong it is. As we have just seen, with large samples almost everything turns out significant: measuring the strength of an association requires other tools (such as Cramér’s V), and that will be the subject of an upcoming article dedicated to effect size and the power of tests.
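
As a small preview, and only as a sketch, Cramér’s V can be computed from the chi-square statistic with the usual formula V = sqrt(χ2 / (n × (k − 1))), where k is the smaller of the number of rows and columns. I apply it here to the Desktop/Mobile table; the variable names are my own.

```r
ctr <- matrix(c(550, 9450, 500, 19500), ncol = 2, byrow = TRUE)

chi_sq <- unname(chisq.test(ctr, correct = FALSE)$statistic)
n <- sum(ctr)          # total number of impressions
k <- min(dim(ctr))     # smaller dimension of the table (2 for a 2x2 table)

# Cramér's V: ranges from 0 (no association) to 1 (perfect association)
cramers_v <- sqrt(chi_sq / (n * (k - 1)))
round(cramers_v, 3)
```

Despite the enormous chi-square value, V stays well below 0.1, a reminder that a highly significant result and a strong association are different things.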

Further Reading

The chi-square test, with all its variants and applicability conditions, is covered in detail in Statistica by Newbold, Carlson and Thorne (Italian edition), together with the other tests we have met along this path.

And if the examples on these pages made you want to learn R properly, R for Data Science by Hadley Wickham (second edition, also freely readable online) is the starting point I recommend.
