
Regression to the Mean: the SEO Fix That Worked… by Accident

In the Israeli Air Force, Daniel Kahneman recounts, the flight instructors were sure of one thing: praising a cadet after an excellent manoeuvre made him worse, while scolding him after a terrible one made him better.
They had seen it happen a thousand times in the field, so it had to be true: with pilots, severity works and compliments backfire.
Except it wasn’t true. An exceptional manoeuvre, in either direction, is part skill and part luck, and the luck doesn’t repeat on the next attempt. After a brilliant flight a cadet tends to drift back toward his own average (so it looks as if the praise hurt); after a disastrous one he drifts back toward it too (so it looks as if the scolding helped). The instructors were crediting themselves with an effect that was just regression to the mean.

The very same illusion waits for us every time we look at a site’s data and decide whether one of our changes “worked”.

It’s worth clearing up the name straight away, because it misleads: regression to the mean is not linear regression, the model that fits a line between two variables. Here “regression” means going back, reverting: extreme values sliding back toward their average. The two share a word for historical reasons (Francis Galton coined the term while studying exactly this kind of reversion), but they are different things.

What regression to the mean is

The mechanism is simple, and once you see it you can’t unsee it.
Almost every number we measure is the sum of two parts: a “true”, stable value and a dose of noise — random fluctuation of the moment. A page’s average SERP position in a given month depends on its real relevance, but also on chance: the algorithm wobbling, a competitor who pushed hard that month, the query’s seasonality, a handful of clicks more or fewer.

Now, when we single out the extreme cases — the worst-performing pages, the worst month — we are almost always picking situations where the noise pushed everything in the same unfavourable direction.
At the next measurement that noise won’t repeat identically, and the value will tend to climb back toward its true mean — without anyone having done anything. It’s a purely statistical fact, not an SEO phenomenon: the more extreme a measurement, the more likely the next one is less extreme.
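The claim can be checked with a minimal sketch in R. It assumes a single hypothetical page with a stable true position of 20 and normal noise (standard deviation 8) on every measurement; the names true_pos, first, second and extreme are illustrative and do not come from any real dataset.

```r
set.seed(1)

true_pos <- 20                             # hypothetical stable "true" position

# 10,000 pairs of consecutive measurements: same true value, independent noise
first  <- true_pos + rnorm(10000, 0, 8)
second <- true_pos + rnorm(10000, 0, 8)

# keep only the cases whose first measurement was among the worst 10%
extreme <- first > quantile(first, 0.9)

# average of the selected first measurements vs. their follow-ups
round(c(first = mean(first[extreme]), second = mean(second[extreme])), 1)

# share of selected cases whose follow-up is less extreme than the first
round(mean(second[extreme] < first[extreme]), 2)
```

Under these assumptions the selected first measurements average well above 20, while the follow-ups of the very same cases average close to 20 again, and the large majority of follow-ups are less extreme than the measurement that got them selected.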

The optimization that worked by accident

Let’s see what this mechanism does in everyday work. Suppose we track the average SERP position of 300 pages over two consecutive months. I simulate the scenario in R, giving each page a stable “true” position and adding a random fluctuation each month:

set.seed(48)

# 300 pages, each with its "true" SERP position (stable over time)
pos_vera <- runif(300, 3, 40)

# two consecutive months: same true position, different random noise
mese1 <- pos_vera + rnorm(300, 0, 8)
mese2 <- pos_vera + rnorm(300, 0, 8)

# the 60 worst pages in month 1 (in the SERP "worse" = higher number)
peggiori <- order(mese1, decreasing = TRUE)[1:60]

round(mean(mese1[peggiori]), 1)   # starting average position
# [1] 39.1

Our 60 worst pages start from an average position of 39.1: at ten results per page, that is bottom-of-the-fourth-page territory. We decide to act (rewrite the titles, update the content, fix internal links) and we check again a month later.
Here’s the result, with no algorithm change in the meantime and, above all, with no real intervention at all in the simulation:

round(mean(mese2[peggiori]), 1)   # one month later
# [1] 33

From 39.1 to 33: about six positions gained. A jump that would look great in a report, and that anyone would be tempted to credit to the optimization just carried out.
Too bad there is no optimization in the code: the pages improved on their own, because they had been chosen precisely for being extreme and the noise that had sunk them in month 1 didn’t repeat. For scale: the average position of all 300 pages is about 22, and it’s toward that value that the worst ones are reverting.
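The size of the rebound can also be roughed out by hand. This is a minimal sketch, assuming the usual linear approximation in which the expected next value sits between the overall mean and the observed value, weighted by the share of variance that is stable rather than noise. The names var_true, var_noise, reliability and expected_m2 are mine, not part of the simulation above.

```r
# variance of the stable part: true positions are uniform between 3 and 40
var_true  <- (40 - 3)^2 / 12

# variance of the monthly noise: normal with standard deviation 8
var_noise <- 8^2

# share of the month-1 spread that is stable and carries over to month 2
reliability <- var_true / (var_true + var_noise)

# theoretical mean of a uniform(3, 40) true position
overall_mean <- (3 + 40) / 2

# approximate expected month-2 average for pages that averaged 39.1 in month 1
expected_m2 <- overall_mean + reliability * (39.1 - overall_mean)

round(c(reliability = reliability, expected_m2 = expected_m2), 2)
```

With these inputs the approximation lands within a few tenths of the 33 observed in month 2. It is only an approximation: with uniformly distributed true positions the exact relationship is not linear.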

How not to be fooled: the control group

If the improvement comes anyway, how can we tell whether our optimization had any real effect?
The answer is the same one we’d use for a drug: we need a control group. We split the 60 worst pages into two random halves: one we “optimize”, the other we deliberately leave alone. Then we compare how much they improve:

# split the 60 worst pages into two random groups
gruppo <- sample(rep(c("optimized", "control"), each = 30))

# average improvement (month1 - month2) in the two groups
round(tapply(mese1[peggiori] - mese2[peggiori], gruppo, mean), 1)
# control optimized
#     6.0       6.1

The “optimized” group gains 6.1 positions; the control group, left untouched, gains 6.0.
Practically the same improvement. That is exactly what should happen, because the simulation applies no real intervention to either group: all of the gain is regression to the mean, and the control exposes it by showing it would have happened anyway.

The lesson is this: an improvement, on its own, proves nothing.
When you act precisely on the pages (or campaigns) that were doing worst, part of their rebound is guaranteed regardless of you. Without a comparison against what you did not touch, you can’t know how much of the result is your doing and how much is just reversion toward the mean.
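To see the control group doing its job when an optimization does have an effect, here is a hedged sketch. It assumes, purely for illustration, that the intervention truly gains 3 positions (true_effect is a made-up value). It reruns the whole simulation 2000 times and compares the naive before/after gain of the optimized pages with the gain measured against the control. The helper one_run is hypothetical, too.

```r
set.seed(1)

true_effect <- 3   # assumption: the optimization really gains 3 positions

one_run <- function() {
  pos_vera <- runif(300, 3, 40)
  mese1 <- pos_vera + rnorm(300, 0, 8)
  mese2 <- pos_vera + rnorm(300, 0, 8)

  # the 60 worst pages of month 1, split at random into two groups of 30
  peggiori <- order(mese1, decreasing = TRUE)[1:60]
  gruppo <- sample(rep(c("optimized", "control"), each = 30))

  # apply the real effect to the optimized pages only (lower = better)
  treated <- peggiori[gruppo == "optimized"]
  mese2[treated] <- mese2[treated] - true_effect

  gain <- mese1[peggiori] - mese2[peggiori]
  by_group <- tapply(gain, gruppo, mean)

  c(naive      = unname(by_group["optimized"]),
    controlled = unname(by_group["optimized"] - by_group["control"]))
}

# average the two estimates over many simulated sites
res <- replicate(2000, one_run())
avg <- round(rowMeans(res), 1)
avg
```

Under these assumptions the naive figure overstates the effect by roughly the size of the regression, while the difference against the control averages out close to the 3 positions actually injected. In a single run with 30 pages per group the estimate is much noisier, which is why a real test needs enough pages to tell the two apart.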

It’s the same reasoning, it should be said, behind a properly run A/B test: you compare the variant against a concurrent control, not against the “before”. And it’s a close cousin of the trap we met discussing correlation and causation: here too we mistake a sequence in time (“I acted, then it improved”) for a causal link.

Try it yourself

To lock in the mechanism, try rebuilding the scenario changing a single detail: instead of the 60 worst pages, select the 60 best of month 1 (order(mese1)[1:60], without decreasing) and watch how they behave in month 2.

What to expect: the champions revert toward the mean too, but by getting worse, because top positions contain luck that doesn’t repeat. It’s the mirror image of the same phenomenon, and it explains why “that golden month” or “that page that was flying” so often can’t be reproduced: you haven’t lost your magic touch, the numbers are just drifting back toward their average.
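If you want something to compare your attempt against, one possible setup is sketched below. It reuses the seed and the construction of the simulation above; migliori is a name introduced here for the 60 best pages.

```r
set.seed(48)

# same construction as before: stable true position plus monthly noise
pos_vera <- runif(300, 3, 40)
mese1 <- pos_vera + rnorm(300, 0, 8)
mese2 <- pos_vera + rnorm(300, 0, 8)

# the 60 best pages in month 1 (lowest position numbers)
migliori <- order(mese1)[1:60]

# average position of the same pages in the two months
round(c(month1 = mean(mese1[migliori]), month2 = mean(mese2[migliori])), 1)
```

A higher number in month 2 than in month 1 means the best pages slipped back, with no intervention and no penalty involved.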

There’s an even more insidious variant of this trap, because it hides inside the very tools we use to decide with rigour. If we look at a test while it’s still running and stop as soon as the numbers please us, we are picking an extreme instant exactly as we picked the worst pages — and we’ll be fooled in the same way. It’s the peeking problem, and it’s the next stop on our tour of marketing-data pitfalls.