
The Central Limit Theorem: Why Statistics Works (Even When Data Isn’t Normal)

Throughout the previous articles, we’ve had the chance to examine the normal distribution and its properties. And then we moved forward: we built confidence intervals, conducted hypothesis tests, calculated margins of error. In all these steps, the normal distribution was there, always present, like a quiet thread running through everything.

But there’s a question we may have asked ourselves without yet finding a satisfying answer: why does the normal distribution work so well, even when our data aren’t normal at all? Who said that organic traffic, conversion rates, or session durations follow a bell curve? In most cases, they don’t follow one at all.

The answer lies in one of the most elegant and powerful results in all of mathematics: the Central Limit Theorem (often abbreviated as CLT). It’s the theorem that, in a sense, justifies all of inferential statistics.


What Is the Central Limit Theorem

Let’s start with the formal statement, and then we’ll translate it into plain language.

The Central Limit Theorem states that if we draw sufficiently large samples from any population with finite mean \(\mu\) and finite standard deviation \(\sigma\), the distribution of the sample means will be approximately normal, regardless of the shape of the original distribution.

More precisely, the distribution of sample means \(\bar{X}\) tends to:

\(
\bar{X} \sim N\left(\mu, \, \frac{\sigma}{\sqrt{n}}\right)
\)

where:

  • \(\mu\) is the population mean
  • \(\sigma\) is the population standard deviation
  • \(n\) is the size of each sample
  • \(\frac{\sigma}{\sqrt{n}}\) is the standard error of the mean

In clearer, more direct terms: it doesn’t matter how strange, skewed, or bizarre the distribution of our original data is. If we take many samples and calculate the mean of each one, those means will arrange themselves into a bell curve. Always.
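Before the full simulation below, here is a minimal numeric check of the claim, a sketch using an assumed Uniform(0, 10) population rather than real data: the simulated means should center on \(\mu\) and spread according to \(\sigma/\sqrt{n}\).

```r
set.seed(1)

# Assumed population: Uniform(0, 10), so mu = 5 and sigma = 10/sqrt(12)
mu    <- 5
sigma <- 10 / sqrt(12)
n     <- 40

# Draw 20000 samples of size n and keep the mean of each
means <- replicate(20000, mean(runif(n, 0, 10)))

# Compare simulation with theory
c(simulated_mean = mean(means), simulated_se = sd(means),
  theoretical_se = sigma / sqrt(n))
```

The simulated mean of the means lands on \(\mu\), and their standard deviation matches \(\sigma/\sqrt{n}\), exactly what the formula above predicts.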


Why It Is So Important

This point must always be kept firmly in mind, because it’s the keystone of everything we’ve done so far.

When we calculate a confidence interval or conduct a hypothesis test, we don’t work with individual data points: we work with sample means. And the CLT guarantees that those means, as long as the sample is large enough, follow a normal distribution (or approximately so).

That’s why we can use the normal distribution and the t-distribution even when the original data aren’t normal. We’re not making a risky assumption: we’re leveraging a solid mathematical result.

In practice, the CLT is the reason why:

  • confidence intervals work
  • hypothesis tests are reliable
  • we can perform statistical inference on virtually any type of data

See It With Your Own Eyes: A Simulation in R

Theory is beautiful, but seeing the CLT in action is something else entirely. Let’s build a simulation in R that shows the theorem at work.

We’ll start from a decidedly non-normal distribution: an exponential distribution, which is strongly right-skewed (think of the distribution of time spent on a website: many very short visits, few very long ones).

Let’s simulate the repeated sampling process in R:

set.seed(42)

# Population: exponential distribution (mean = 1/lambda)
lambda <- 0.5
pop_mean <- 1 / lambda  # true mean = 2
pop_sd   <- 1 / lambda  # for the exponential, sd = 1/lambda = 2 as well

# We simulate 10000 samples of size n
n_campioni <- 10000

# Function to compute the sample means
simula_medie <- function(n) {
  replicate(n_campioni, mean(rexp(n, rate = lambda)))
}

# Try three different sample sizes
medie_n5   <- simula_medie(5)
medie_n30  <- simula_medie(30)
medie_n100 <- simula_medie(100)

# Plot the results
par(mfrow = c(2, 2))

# The original (exponential) distribution
hist(rexp(10000, rate = lambda), breaks = 50, probability = TRUE,
     main = "Original population\n(exponential)",
     col = "lightcoral", xlab = "Value", ylab = "Density")

# Sample means with n = 5
hist(medie_n5, breaks = 50, probability = TRUE,
     main = "Sample means (n = 5)",
     col = "lightyellow", xlab = "Mean", ylab = "Density")
curve(dnorm(x, mean = pop_mean, sd = pop_sd / sqrt(5)),
      add = TRUE, col = "red", lwd = 2)

# Sample means with n = 30
hist(medie_n30, breaks = 50, probability = TRUE,
     main = "Sample means (n = 30)",
     col = "lightgreen", xlab = "Mean", ylab = "Density")
curve(dnorm(x, mean = pop_mean, sd = pop_sd / sqrt(30)),
      add = TRUE, col = "red", lwd = 2)

# Sample means with n = 100
hist(medie_n100, breaks = 50, probability = TRUE,
     main = "Sample means (n = 100)",
     col = "lightblue", xlab = "Mean", ylab = "Density")
curve(dnorm(x, mean = pop_mean, sd = pop_sd / sqrt(100)),
      add = TRUE, col = "red", lwd = 2)

As you can see, the result is spectacular. The starting population is completely asymmetric (the exponential doesn’t remotely resemble a bell curve), and yet:

  • With n = 5, the means already begin to resemble a normal distribution, even if a bit of skewness remains
  • With n = 30, the distribution of the means is practically indistinguishable from a normal
  • With n = 100, the overlap with the theoretical curve is nearly perfect

It really is that simple: increase the sample size, and normality emerges on its own.


The Practical Rule: How Large Should n Be?

A legitimate question: “sufficiently large” is a rather vague term. In practice, how large does the sample need to be for the CLT to do its job?

The most common rule of thumb is n ≥ 30. With 30 or more observations, the distribution of sample means is generally well approximated by the normal, even if the original distribution is moderately skewed.
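One way to see why n ≈ 30 works for moderately skewed data: the skewness of the sample-mean distribution shrinks roughly as the population skewness divided by \(\sqrt{n}\). Here is a quick sketch (the `skewness` helper is defined inline, so no extra packages are assumed):

```r
set.seed(7)

# Simple moment-based skewness estimator (defined here to avoid packages)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Skewness of sample means from an exponential (population skewness = 2)
means_of <- function(n) replicate(10000, mean(rexp(n, rate = 0.5)))
sk <- sapply(c(5, 30, 100), function(n) skewness(means_of(n)))
sk
```

For the exponential the values come out close to \(2/\sqrt{n}\): clearly visible skew at n = 5, mild skew at n = 30, almost none at n = 100.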

But be careful: this rule has exceptions.

  • If the original distribution is already symmetric (even if not normal), smaller samples suffice (even n = 10-15 can be enough)
  • If the original distribution is strongly skewed (as with data containing many outliers, or exponential distributions with extreme parameters), larger samples may be needed (n = 50 or even more)

In everyday SEO and digital marketing practice, we usually work with samples well above 30 (hundreds or thousands of sessions, clicks, conversions), so the CLT is almost always on our side.


The CLT and the Standard Error

The theorem also tells us something valuable about the spread of sample means. The standard deviation of the distribution of means (i.e., the standard error) is:

\(
SE = \frac{\sigma}{\sqrt{n}}
\)

This has two important practical consequences:

  1. As n increases, the standard error decreases. The more data we collect, the more tightly our sample means cluster around the true mean. The improvement scales with \(\sqrt{n}\), which means (as we’ve already seen with confidence intervals) that to halve the standard error we must quadruple the sample size.
  2. The population’s variability matters. If our data are highly dispersed (high \(\sigma\)), larger samples are needed to obtain precise estimates. A website with highly variable traffic requires more days of observation for a reliable estimate of the daily mean.

Let’s verify in R that the observed standard error matches the theoretical formula:

# Theoretical standard error for n = 30
# (for the exponential, sigma = 1/lambda, which here equals the mean)
se_teorico <- (1 / lambda) / sqrt(30)

# Observed standard error from the simulation
se_osservato <- sd(medie_n30)

cat("Theoretical SE:", round(se_teorico, 4), "\n")
cat("Observed SE:", round(se_osservato, 4), "\n")
cat("Difference:", round(abs(se_teorico - se_osservato), 4), "\n")

The agreement is remarkable: the two values practically coincide. The CLT works exactly as promised.


A Practical Example: Daily Organic Traffic

Let’s apply the CLT to a concrete case. Suppose we monitor a website’s daily organic traffic for a year (365 days). Traffic data are never normal: they’re right-skewed (weekdays vs. weekends, seasonal peaks, anomalies).

Let’s simulate a realistic scenario in R:

set.seed(123)

# Simulate 365 days of traffic (log-normal distribution, typical of the web)
traffico <- round(rlnorm(365, meanlog = 6, sdlog = 0.5))

cat("Mean daily traffic:", round(mean(traffico)), "visits\n")
cat("Median:", round(median(traffico)), "visits\n")
cat("Standard deviation:", round(sd(traffico)), "visits\n")

# Take samples of 30 days and compute the mean of each
medie_mensili <- replicate(5000, mean(sample(traffico, 30, replace = TRUE)))

par(mfrow = c(1, 2))

hist(traffico, breaks = 30, probability = TRUE,
     main = "Daily traffic\n(365 days)",
     col = "lightcoral", xlab = "Visits", ylab = "Density")

hist(medie_mensili, breaks = 50, probability = TRUE,
     main = "Means of 30-day\nsamples",
     col = "lightblue", xlab = "Mean visits", ylab = "Density")
curve(dnorm(x, mean = mean(traffico), sd = sd(traffico) / sqrt(30)),
      add = TRUE, col = "red", lwd = 2)

# Normality test on the means (shapiro.test accepts at most 5000 values)
shapiro.test(medie_mensili)

Daily traffic is clearly asymmetric (the mean differs from the median, and the distribution has a long right tail). But the means of 30-day samples? Very close to normal, just as the CLT predicts. One caveat: with thousands of simulated means, a formal test like Shapiro–Wilk has enough power to flag even tiny deviations from normality, so a small p-value there doesn’t contradict what the histogram shows.

This is exactly why we can build reliable confidence intervals for mean traffic, even though individual days have a distribution that’s anything but normal.


When the CLT Is Not Enough

It would be dishonest not to mention the cases where the CLT has its limits. The theorem requires that the population have a finite mean and variance. There are distributions (such as the Cauchy distribution) that have neither a finite mean nor a finite variance, and for these the CLT simply doesn’t hold.
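A quick sketch makes the Cauchy case tangible. The mean of n standard Cauchy draws is itself standard Cauchy, so the spread of the sample means never shrinks, no matter how large n gets (we measure spread with the interquartile range, since the Cauchy has no finite standard deviation):

```r
set.seed(99)

# For the Cauchy, averaging does not help: the IQR of the sample
# means stays around 2 (the IQR of a standard Cauchy) for every n
for (n in c(10, 1000, 10000)) {
  m <- replicate(1000, mean(rcauchy(n)))
  cat("n =", n, "-> IQR of sample means:", round(IQR(m), 2), "\n")
}
```

Compare this with the exponential simulation earlier, where the spread of the means shrank by \(\sqrt{n}\) at every step.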

In SEO and marketing practice, this is rarely a problem: our data always have finite mean and variance. However, it’s worth remembering that:

  • With strongly skewed distributions and small samples (n < 20), the normal approximation can be insufficient. In these cases, it’s better to use non-parametric methods or bootstrap techniques
  • With extreme proportions (very close to 0 or 1), the CLT for proportions requires larger samples for the approximation to work. We’ve already discussed this in the article on confidence intervals
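As a sketch of the bootstrap alternative mentioned above (the sample here is simulated, not real order data): resample the observed values with replacement many times, then read a confidence interval straight off the quantiles of the resampled means, with no normality assumption at all.

```r
set.seed(2024)

# Hypothetical small, strongly skewed sample (e.g. order values in euros)
x <- round(rlnorm(15, meanlog = 3, sdlog = 1), 2)

# Percentile bootstrap: resample with replacement, collect the means
boot_means <- replicate(10000, mean(sample(x, replace = TRUE)))

# 95% percentile bootstrap confidence interval for the mean
quantile(boot_means, c(0.025, 0.975))
```

The interval can come out asymmetric around the sample mean, which is exactly what we want when the data are skewed and n is small.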

Try It Yourself

An e-commerce site records the following order amounts (in euros) over a month:

ordini <- c(12, 8, 45, 15, 22, 150, 9, 18, 35, 11,
            14, 200, 7, 19, 28, 13, 55, 10, 16, 95,
            8, 21, 42, 12, 17, 310, 9, 14, 25, 11)
  1. Calculate the mean and standard deviation of the orders. Does the distribution look normal?
  2. Use replicate() and sample() to generate 5000 sample means with n = 10 (sampling with replacement)
  3. Draw the histogram of the sample means. Does it resemble a normal distribution?
  4. Calculate the theoretical standard error (\(\frac{s}{\sqrt{n}}\)) and compare it with the standard deviation of the simulated means

Hint: replicate(5000, mean(sample(ordini, 10, replace = TRUE))) does almost all the work.


We’ve seen how the Central Limit Theorem is the hidden foundation of all inferential statistics: it’s the reason we can build confidence intervals, conduct hypothesis tests, and make reliable predictions, even when our data aren’t normal. But the CLT has also taught us that sample size is crucial. This opens the door to a very practical question: how much data do we need? That’s the problem of sample size and sampling, topics we’ll tackle in an upcoming article.


Further Reading

If you want to deepen your understanding of the role of the normal distribution and the Central Limit Theorem in statistical practice, The Art of Statistics by David Spiegelhalter is an excellent companion. Spiegelhalter manages to explain why the bell curve appears everywhere — from physical measurements to election polls — with a clarity that never sacrifices rigor.

Published by paolo