In everyday life, as in web analytics, we often have to make decisions based on incomplete information. How much data do I need to understand if this modification to the landing page worked? Are a thousand visits enough? Are ten thousand too many?
We can almost never measure the entire population (for example, all future visitors to a site). We have to work on a sample. And here lies the delicate balance: a sample that is too small leads to wrong conclusions, while one that is unnecessarily large wastes time and resources. So the question becomes: how much data do we really need?
How to Choose Who to Measure: Types of Sampling
Before figuring out how much data we need, we must understand how to collect it. There are three main methods (a short code sketch of each follows the list):
- Simple random sampling: Every user has exactly the same probability of being chosen. It’s the gold standard, what we try to achieve when we randomize users in an A/B test.
- Stratified sampling: We divide users into groups (e.g., Mobile and Desktop traffic) and randomly sample within each group, respecting the original proportions. It ensures that no important minority is ignored.
- Systematic sampling: We choose one user every k (e.g., one user every 10). Easy to implement, but risky when the data hide a cyclical pattern (imagine sampling once every 7 days: we would always land on the same weekday, and if that happens to be Monday the estimate is skewed from the start).
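To make the three methods concrete, here is a minimal R sketch. The visitor pool, the 60/40 device split, and the sample size of 500 are all made-up values for illustration:
# Hypothetical pool of 10,000 visitors, tagged by device (made-up data)
set.seed(42)
visitors <- data.frame(
  id = 1:10000,
  device = sample(c("mobile", "desktop"), 10000, replace = TRUE, prob = c(0.6, 0.4))
)

# Simple random sampling: every visitor has the same chance of selection
simple <- visitors[sample(nrow(visitors), 500), ]

# Stratified sampling: sample within each device group,
# respecting the original mobile/desktop proportions
stratified <- do.call(rbind, lapply(split(visitors, visitors$device), function(g) {
  g[sample(nrow(g), round(500 * nrow(g) / nrow(visitors))), ]
}))

# Systematic sampling: one visitor every k, from a random starting offset
k <- nrow(visitors) %/% 500
start <- sample(k, 1)
systematic <- visitors[seq(start, nrow(visitors), by = k), ]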
Sample Size: The Math Behind the Estimation
The intuition is straightforward: the smaller the effect we are looking for (or the more variable the data), the more observations we need to distinguish it from background noise. Sounds hard to formalize? It is simpler than it seems.
To calculate the exact number, we need three ingredients:
- Confidence level: How sure do we want to be? We usually use 95% (which corresponds to a Z-score of 1.96).
- Margin of error (E): The maximum error we are willing to accept (e.g., 1% or 0.01).
- Expected proportion (p): The estimated conversion rate. If we have no idea, we use 0.5 (50%): it represents maximum uncertainty and yields the largest possible sample, so it is the most conservative choice.
The formula to estimate a proportion (like the Conversion Rate) is:
n = (Z² × p(1 – p)) / E²
Let’s Calculate It in R and Python
Let’s run a quick example. We want to estimate the Conversion Rate of a new page with a margin of error of 1% (0.01) and a confidence level of 95% (Z = 1.96). To stay on the safe side, we set p = 0.5.
The examples below are in both R and Python — pick whichever language feels more familiar.
Let’s calculate it in R:
# Sample size calculation for a proportion
Z <- 1.96
p <- 0.5
E <- 0.01
n <- (Z^2 * p * (1-p)) / E^2
print(paste("Required size:", round(n)))
# Output: Required size: 9604
Let’s verify it in Python:
# Sample size calculation for a proportion
Z = 1.96
p = 0.5
E = 0.01
n = (Z**2 * p * (1-p)) / E**2
print(f"Required size: {round(n)}")
# Output: Required size: 9604
As we can see, around 9,604 users are needed to reach that precision. N.B.: if we accepted a margin of error of 2% (E = 0.02), the number would collapse to about 2,401. That is the effect of E squared in the denominator: doubling the margin of error divides the required sample by four. Worth keeping in mind whenever we decide which margin to accept.
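To see the quadratic effect directly, it is enough to rerun the same R calculation with the looser margin:
# Same calculation, but with a 2% margin of error
E <- 0.02
n <- (1.96^2 * 0.5 * 0.5) / E^2
print(round(n))
# Output: 2401, one quarter of the 9,604 required at E = 0.01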
From Estimation to A/B Testing
The formula seen so far estimates a single proportion. But in everyday CRO (Conversion Rate Optimization) work the actual problem is almost always a different one: comparing two proportions, as in an A/B test.
In that case the logic is the same, but the formula gets more complex because two new concepts come into play: the Effect Size (the minimum difference we want to detect) and the Statistical Power.
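As a rough sketch, base R's power.prop.test function handles this calculation. The numbers below are assumptions chosen purely for illustration: a 5% baseline conversion rate, a minimum detectable lift to 6%, the usual 95% confidence, and 80% power:
# Per-group sample size for a two-proportion A/B test (base R)
# Assumed scenario: baseline conversion 5%, minimum detectable lift to 6%
power.prop.test(p1 = 0.05, p2 = 0.06, sig.level = 0.05, power = 0.80)
# The printed n (roughly 8,000) is the size of EACH variant, not the total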
To skip the manual calculation, I built an interactive A/B test sample size calculator: it does the dirty work and also indicates how many days the test should run, given the page’s average traffic.
Sampling Error vs Bias
One point worth keeping firmly in mind before closing. Sampling error (the one the formula handles) is inevitable and shrinks as the data grow. But there is a far more insidious enemy, and no formula captures it: bias.
If we test a page only during the weekend, we might collect a million visits (sampling error practically zero), but the sample will not be representative of weekday users. So: no formula can save a sample that is biased at the source. A thousand observations gathered well beat a million gathered badly.
Try It Yourself
A product page receives roughly 10,000 impressions per month on Google, with an observed CTR of 3.5%. We want to estimate the true CTR with a margin of error of 1 percentage point (E = 0.01) and 95% confidence.
- Compute the required sample size with the formula above, first using p = 0.5 (conservative case) and then p = 0.035 (observed CTR).
- Compare the two results: how much does the data requirement change once we have a reasonable estimate of p?
- Given 10,000 impressions per month, how many months are needed to satisfy the conservative estimate?
- If we accepted a 2% margin (E = 0.02), how would the collection time change?
Hint: in R, a minimal function is enough, to be called twice with the two values of p:
sample_size <- function(Z, p, E) ceiling((Z^2 * p * (1-p)) / E^2)
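For example, the first question reduces to two calls:
sample_size(1.96, 0.5, 0.01)    # conservative case, p = 0.5
sample_size(1.96, 0.035, 0.01)  # using the observed CTR of 3.5%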
Now we know how to collect an adequate sample and how much data we need. One question remains: how do we use that sample to rigorously compare two versions of the same page? This is where actual A/B testing comes in, and it is the next step of the path.
Further Reading
To dig deeper into sampling, the biases that can distort it, and the logic of statistical inference, The Art of Statistics by David Spiegelhalter is the most suitable companion. Spiegelhalter devotes illuminating pages to real cases — flawed polls, convenience samples, misleading figures — showing how the mathematics of sampling means little without careful thought on how the data are collected.