The peeking problem: why sneaking a look at an A/B test inflates false positives

On 21 January 2015 Optimizely — one of the most widely used A/B testing platforms in the world — switched on a completely new statistical engine for all of its customers, the New Stats Engine.
It wasn’t a technical whim: the new engine was developed with statisticians from Stanford because the old one, built around a classic fixed-horizon t-test, had a flaw that affected anyone who looked at a test’s results before the end. And we always look at a test’s results before the end.

Optimizely had measured the problem themselves, simulating A/A tests — two identical variants, where by construction neither is better than the other, so any declared “winner” is a false alarm.
According to the figures published by Optimizely, on tests of 5,000 visitors anyone checking the numbers after every visitor saw 57% of A/A tests declare a false winner at least once; checking every 500 visitors the figure dropped to 26%, every 1,000 to 20%. Chilling numbers for a tool that is supposed to help us decide with rigour. The rewrite — sequential inference plus false discovery rate control, what they call always-valid — was meant precisely to bring the error, as they put it, “from over 30% to 5%”.

It’s the same deception we ran into at the close of the article on regression to the mean: there we selected the worst-performing pages, an extreme point in the space of the data, and let ourselves be fooled by their rebound. Here we select an extreme instant in time: we stop the moment the test proves us right. The mechanism is a cousin, the risk identical.

What peeking is

Anyone who runs an A/B test knows it well: the test is running, the data come in day after day, and the temptation to sneak a look at the dashboard is irresistible.
Peeking isn’t the mere act of looking: it’s looking while reserving the right to stop the test the moment the result becomes significant. It’s that “great, variant B has crossed the threshold, let’s wrap up here and declare the winner” said halfway through data collection.

The delicate point is that every look accompanied by the possibility of stopping is one more statistical test.
A single test with a 5% threshold accepts, by definition, a 5% chance of crying “winner” when in fact there’s no difference at all. But if we repeat that same test twenty times over the course of collection, and all we need is for just one of those twenty times to cross the threshold in order to stop and declare victory, then the chance of stumbling into a false positive is no longer 5%: it accumulates with every look.

This isn’t the usual multiplicity of someone comparing ten variants at once. Here the multiplicity is hidden in time: a single variant, looked at many times. It’s the same logic by which a single coin toss rarely gives a strange result, but if one is allowed to look after every toss and stop at the first favourable moment, sooner or later that moment arrives — and it gets mistaken for a signal.
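
To get a feel for how quickly the error accumulates, a back-of-the-envelope calculation helps. The sketch below assumes, unrealistically, that the looks are independent tests. In a real A/B test every look reuses the data of the previous ones, so the looks are correlated and the true rate is lower. The result is best read as a rough ceiling, not as the actual figure.

```r
# Chance of at least one false positive across k looks,
# ASSUMING the looks are independent tests (in a real A/B test they are not).
alpha <- 0.05
looks <- c(1, 5, 10, 20)
at_least_one <- 1 - (1 - alpha)^looks

for (i in seq_along(looks)) {
  cat(sprintf("%2d looks: %.1f%%\n", looks[i], 100 * at_least_one[i]))
}
#  1 looks: 5.0%
#  5 looks: 22.6%
# 10 looks: 40.1%
# 20 looks: 64.2%
```

Even with the caveat about correlation, the direction is clear: each extra chance to stop is an extra chance to be wrong.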

What peeking costs: a simulation

Words convince us up to a point; numbers convince us far more. I simulate in R an A/A test, that is two variants with exactly the same true conversion rate (10%): any difference that emerges is noise, and any declared “victory” is a false positive by construction.
I set the stage with three ingredients: the random number generator’s seed, fixed so that the numbers are reproducible; the function that computes the p-value of the comparison between two proportions; and the function that simulates a single experiment and reports whether at some point it declared a (false) winner:

set.seed(2025)

p_vero  <- 0.10    # same conversion rate for A and B (H0 true)
n_arm   <- 2000    # visitors per variant at the end of the test
n_sim   <- 4000    # number of simulated experiments
alpha   <- 0.05
sguardi <- 20      # how many times we "peek" during collection
look_at <- round(seq(n_arm / sguardi, n_arm, length.out = sguardi))

# p-value of a two-proportion, two-sided z-test
pval_ab <- function(xa, na, xb, nb) {
  pp <- (xa + xb) / (na + nb)
  se <- sqrt(pp * (1 - pp) * (1 / na + 1 / nb))
  2 * pnorm(-abs((xa / na - xb / nb) / se))
}

# one A/A experiment: TRUE if it declares a (false) winner
esperimento <- function(soglia, guarda) {
  a <- cumsum(rbinom(n_arm, 1, p_vero))
  b <- cumsum(rbinom(n_arm, 1, p_vero))
  for (k in guarda) {
    p <- pval_ab(a[k], k, b[k], k)
    if (!is.na(p) && p < soglia) return(TRUE)
  }
  FALSE
}

Let’s start from the correct behaviour: a single test, at the end, on the 2,000 visitors per variant. I run it 4,000 times and count how many declare a winner:

# fixed horizon: a single test, at the end
fisso <- mean(replicate(n_sim, esperimento(alpha, n_arm)))
cat(sprintf("Fixed horizon: %.1f%% false positives\n", 100 * fisso))
# Fixed horizon: 5.0% false positives

Out comes 5.0%: exactly the level we declared with the 5% threshold. The test, used as it should be, keeps its promise.
Now I change one thing only: instead of looking once at the end, I look twenty times during collection and stop at the first moment the p-value drops below 0.05. I add the intermediate looks and run again:

# peeking: a test at every look, stop at the first significant one
peek <- mean(replicate(n_sim, esperimento(alpha, look_at)))
cat(sprintf("Peeking (%d looks): %.1f%% false positives\n", sguardi, 100 * peek))
# Peeking (20 looks): 24.3% false positives

From 5.0% to 24.3%.
The same data, the same test, the same threshold: the only thing that changed is when we decided to look, and the false positive rate has nearly quintupled. Almost one A/A test in four, in which the two variants are identical by construction, convinces us we’ve found a winner. The 24.3% from our simulation and the 30% reported by Optimizely tell the same story with different data: peeking isn’t a venial sin, it’s the most effective way to fool ourselves.
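
A simulated rate is itself an estimate, so it is worth asking how much it could wobble from one batch of 4,000 experiments to the next. The sketch below uses the normal approximation to the binomial for an approximate 95% margin; `mc_margin` is a hypothetical helper introduced here, not part of the script above.

```r
# Approximate 95% margin of a proportion estimated from n_runs simulated experiments
# (normal approximation to the binomial; mc_margin is an illustrative helper)
mc_margin <- function(rate, n_runs) 1.96 * sqrt(rate * (1 - rate) / n_runs)

n_runs <- 4000   # same number of simulated experiments as in the article
for (rate in c(0.050, 0.243)) {
  cat(sprintf("%.1f%% +/- %.1f points\n", 100 * rate, 100 * mc_margin(rate, n_runs)))
}
# 5.0% +/- 0.7 points
# 24.3% +/- 1.3 points
```

The gap between 5% and 24% is far larger than either margin, so it cannot be blamed on the randomness of the simulation itself.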

Solution 1: the fixed horizon

The simplest cure is also the most annoying one: decide beforehand how much data to collect, and then have the discipline to wait until the end without stopping early, whatever the dashboard says in the meantime.
It’s what the simulation has just shown us: with a single test at the end, the false positive stays nailed to the promised 5%. No magic, just the elimination of opportunistic looks.

“How much data” isn’t a number plucked from thin air: it depends on how small a difference we want to be able to detect and on how much certainty we demand. It’s the sample size calculation, which is done with our significance calculator before the test is launched, and which rests on the concepts of effect size and power analysis.
Once that number is fixed, the fixed horizon is the safest road: no statistical correction to apply, no threshold to tweak. The price, though, is paid in patience — resisting the curiosity for days or weeks — and this, in operational reality, is exactly what almost no one manages to do.
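
As an illustration of what fixing the number beforehand can look like in R, here is a minimal sketch with `power.prop.test` from the base `stats` package. The inputs are assumptions chosen for the example, not values from this article’s simulation: a 10% baseline, 12% as the smallest improvement worth detecting, and 80% power.

```r
# Sample size per variant for a two-sided comparison of two proportions.
# ASSUMPTIONS for illustration: baseline 10%, smallest lift worth detecting 12%,
# significance level 5%, power 80%.
plan <- power.prop.test(p1 = 0.10, p2 = 0.12, sig.level = 0.05, power = 0.80)
n_per_variant <- as.integer(ceiling(plan$n))
cat(sprintf("Visitors needed per variant: %d\n", n_per_variant))
```

The required sample grows roughly with the inverse square of the difference to detect, so asking for half the lift calls for about four times the visitors.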

Solution 2: looking without cheating

And if monitoring on the fly really were necessary — because a test that’s going terribly needs stopping, because the stakeholders want updates?
Then the way is not to look in secret with the usual threshold, but to look openly with a stricter one. The idea is simple: if at every look we raise the bar, making it harder to cry “winner” on each occasion, we can arrange for the overall error — summed across all the looks — to stay at the 5% we wanted. I calibrate in R the per-look threshold, trying ever more stringent values on the same twenty looks as before:

# stricter per-look threshold that brings the overall error back to ~5%
for (sg in c(0.05, 0.02, 0.01, 0.005)) {
  fp <- mean(replicate(n_sim, esperimento(sg, look_at)))
  cat(sprintf("  threshold %.3f -> %.1f%% overall\n", sg, 100 * fp))
}
#   threshold 0.050 -> 25.1% overall
#   threshold 0.020 -> 11.7% overall
#   threshold 0.010 ->  6.6% overall
#   threshold 0.005 ->  3.3% overall

As we can see, the usual 0.05 threshold produces a 25.1% overall error: the peeking disaster again, a shade off the earlier 24.3% only because each run draws fresh random experiments. As we make the threshold stricter the error comes back down: at 0.01, five times more stringent than the standard threshold, it is 6.6%, and at 0.005 it is 3.3%, so the per-look threshold that restores the nominal 5% lies between the two. It’s the price to be paid for the right to peek: at every single look much more evidence is required, in exchange for the freedom to look often.
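
Instead of trying thresholds one by one, the calibration can be read directly off the simulation. The sketch below is self-contained and uses its own names (`rate`, `n_per_arm`, `checks`, `smallest_p`, all introduced here for illustration) so that it does not overwrite the objects of the script above. It records the smallest p-value seen across the twenty looks of each A/A experiment and takes the 5% quantile of those minima. The result is an estimate that depends on the seed and on the number of runs.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
rate      <- 0.10                # same true conversion rate in both arms (A/A)
n_per_arm <- 2000
n_looks   <- 20
n_runs    <- 2000
checks    <- round(seq(n_per_arm / n_looks, n_per_arm, length.out = n_looks))

# smallest p-value seen across all looks of one simulated A/A experiment
smallest_p <- function() {
  a  <- cumsum(rbinom(n_per_arm, 1, rate))[checks]
  b  <- cumsum(rbinom(n_per_arm, 1, rate))[checks]
  pp <- (a + b) / (2 * checks)                 # pooled conversion rate at each look
  se <- sqrt(pp * (1 - pp) * 2 / checks)       # standard error with equal arm sizes
  min(2 * pnorm(-abs((a - b) / checks / se)), na.rm = TRUE)
}

min_p <- replicate(n_runs, smallest_p())

# An experiment raises a false alarm at threshold t exactly when min_p < t,
# so the 5% quantile of min_p is the per-look threshold giving ~5% overall.
calibrated <- unname(quantile(min_p, 0.05))
cat(sprintf("Per-look threshold for ~5%% overall: %.4f\n", calibrated))
cat(sprintf("Overall error at that threshold: %.1f%%\n", 100 * mean(min_p < calibrated)))
```

Because the threshold is tuned on the same runs it is checked against, the second line is a consistency check, not an independent validation; a fresh batch of experiments would be needed for that.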

What we’ve just shown is a homemade, constant-threshold version of the idea. The “textbook” boundaries — more refined, with thresholds that change over the course of the test, like those of Pocock or O’Brien-Fleming — are obtained in R with the gsDesign package, and commercial tools like Optimizely use an always-valid variant (the so-called mSPRT) of the same underlying idea.
The fine mathematics changes, not the principle: to look often without cheating one must demand, at every look, more evidence than a single test would ask for.
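
To see what a threshold that changes over the course of the test can look like, here is a minimal sketch of a boundary with the O’Brien-Fleming shape: very strict at the first looks, close to the ordinary threshold at the last. It is calibrated by simulation on A/A experiments and is not the computation that gsDesign performs; all names (`fraction`, `largest_scaled_z`, `c_const`) are introduced here for illustration.

```r
set.seed(2)                      # arbitrary seed, only for reproducibility
rate <- 0.10; n_per_arm <- 2000; n_looks <- 20; n_runs <- 2000
checks   <- round(seq(n_per_arm / n_looks, n_per_arm, length.out = n_looks))
fraction <- checks / n_per_arm   # share of the planned data seen at each look

# largest value of |z| * sqrt(fraction) across the looks of one A/A experiment
largest_scaled_z <- function() {
  a  <- cumsum(rbinom(n_per_arm, 1, rate))[checks]
  b  <- cumsum(rbinom(n_per_arm, 1, rate))[checks]
  pp <- (a + b) / (2 * checks)
  se <- sqrt(pp * (1 - pp) * 2 / checks)
  z  <- (a - b) / checks / se
  max(abs(z) * sqrt(fraction), na.rm = TRUE)
}

# constant such that |z| > c_const / sqrt(fraction) occurs in ~5% of A/A runs
c_const <- unname(quantile(replicate(n_runs, largest_scaled_z()), 0.95))
z_bound <- c_const / sqrt(fraction)   # z needed to stop at each look
p_bound <- 2 * pnorm(-z_bound)        # the same boundary expressed as a p-value

for (k in c(1, 10, 20)) {
  cat(sprintf("look %2d: stop only if p < %.2e\n", k, p_bound[k]))
}
```

The design choice is the opposite of the constant threshold: almost no chance of stopping early on noise, and a final look that costs very little compared with a plain fixed-horizon test.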

A word of caution: a result seen during the test proves nothing on its own. What counts is whether the looks, and the right to stop at them, were decided in advance.
The same p-value below 0.05 means different things depending on whether it’s the only fixed-horizon test or the first of the twenty at which one reserved the right to stop. Without declaring in advance how and when the data will be examined, any “winner” that emerges on the fly is suspect.

Try it yourself

To feel the mechanism up close, let’s start from the script and change a single parameter: the number of sguardi (looks).
Let’s go from weekly monitoring (few looks) to daily monitoring (many looks) and re-run the peeking simulation. What to expect: the more frequently one peeks, the higher the false positive rate climbs — the frequency of looks is the fuel of the problem. Then let’s redo the threshold calibration with that new number of looks and check that, by choosing a strict enough threshold, the overall error comes back under control all the same. It’s the proof, first-hand, that peeking isn’t a curse: it’s just a bill that has to be paid.
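
A possible starting point for the exercise is sketched below. It is self-contained and wraps the simulation in a hypothetical helper, `false_positive_rate`, that takes the number of looks and the per-look threshold. The schedule is an assumption made for the example: a four-week test, so weekly monitoring means 4 looks and daily monitoring means 28.

```r
set.seed(3)                      # arbitrary seed, only for reproducibility
rate <- 0.10; n_per_arm <- 2000; n_runs <- 1000; alpha <- 0.05

# share of A/A experiments that cross the threshold at least once in n_looks looks
false_positive_rate <- function(n_looks, threshold = alpha) {
  checks <- round(seq(n_per_arm / n_looks, n_per_arm, length.out = n_looks))
  one_run <- function() {
    a  <- cumsum(rbinom(n_per_arm, 1, rate))[checks]
    b  <- cumsum(rbinom(n_per_arm, 1, rate))[checks]
    pp <- (a + b) / (2 * checks)
    se <- sqrt(pp * (1 - pp) * 2 / checks)
    p  <- 2 * pnorm(-abs((a - b) / checks / se))
    any(p < threshold, na.rm = TRUE)
  }
  mean(replicate(n_runs, one_run()))
}

# ASSUMPTION: a four-week test; 1 = fixed horizon, 4 = weekly, 28 = daily
schedules <- c(single = 1, weekly = 4, daily = 28)
rates <- sapply(schedules, false_positive_rate)
for (nm in names(rates)) {
  cat(sprintf("%-6s (%2d looks): %.1f%% false positives\n",
              nm, schedules[[nm]], 100 * rates[[nm]]))
}
```

For the second half of the exercise, calling `false_positive_rate(28, threshold = 0.005)` re-runs the daily schedule with a stricter bar; trying a few values shows where the overall error comes back near 5%.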


There’s one last trap in this family, perhaps the most insidious of all, because it doesn’t hide in our own data but in the data others tell us about. When we read an agency’s case study — “we increased conversions by 300% with this tactic” — we’re looking at a survivor: the thousand identical attempts that failed are something nobody mentions. It’s survivorship bias, the reason case studies lie even when they tell the truth, and it’s the next step on our journey through the pitfalls of marketing data.


Further Reading

On peeking, early stopping and sequential testing the reference — in English — remains Trustworthy Online Controlled Experiments by Ron Kohavi, Diane Tang and Ya Xu: written by people who led the experimentation platforms at Microsoft, Google and LinkedIn, it devotes explicit pages to all the ways a running A/B test can fool us, and to how to defend against them. It’s the book anyone who has to take online experiments seriously pulls out of the drawer.

Paolo Gironi
