
Bayesian A/B Testing: not just “whether” B beats A, but “by how much”

In the article on classic A/B testing we saw how to compare two variants with the two-proportion test: we compute a statistic, get a p-value, and decide whether to reject the null hypothesis. It works, and it is the daily bread of anyone running online experiments. But there is a subtle gap between what the p-value tells us and what we actually want to know.
The p-value answers a convoluted question: “if A and B were identical, how likely would it be to observe a difference at least as large as this one?”. The question we care about in practice is a different, far more direct one: what is the probability that B is better than A? And, right after: by how much, and how much can we trust that “how much”?

The Bayesian approach answers both questions natively. In this article we apply it to the comparison of two variants, picking up the thread we left hanging when we estimated the conversion rate of a single variant.

What we will cover:

  • Two posteriors instead of one: a distribution per variant
  • What is the probability that B really wins
  • By how much it is better: the distribution of the difference
  • When to stop: expected loss and the peeking problem
  • Try it yourself
  • Further reading

Two posteriors instead of one

When we estimated the conversion rate of a single variant, we saw that — starting from a Beta prior and observing binomial data — the posterior is again a Beta distribution. The updating rule was simple arithmetic: if the prior is Beta(α, β) and we observe \( k \) conversions out of \( n \) sessions, the posterior is:

\( \mathrm{Beta}(\alpha + k,\ \beta + (n - k)) \)

In an A/B test we do not have a single proportion, we have two: one for the control (A) and one for the treatment (B). The mechanism, however, is identical: we build one posterior per variant, independently, applying the same rule twice.

Here is a quick example. We tested two versions of a landing page, assigning visitors at random:

  • Variant A (control): 90 conversions out of 1000 sessions → raw rate 9.0%
  • Variant B (treatment): 120 conversions out of 1000 sessions → raw rate 12.0%

We start from a non-informative Beta(1, 1) prior for both — “we know nothing, before the data every rate is equally plausible”. Applying the rule, the posterior of A is Beta(91, 911) and that of B is Beta(121, 881).
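The arithmetic behind these two posteriors can be made explicit with a minimal sketch. It applies the updating rule and reads the exact posterior mean and 95% credible interval directly from the Beta distribution, with no simulation. The helper names `beta_update` and `summarise_beta` are my own, introduced only for illustration.

```r
# Conjugate Beta-Binomial update: prior Beta(alpha, beta) plus k conversions
# out of n sessions gives Beta(alpha + k, beta + n - k).
# `beta_update` and `summarise_beta` are hypothetical helpers, not package functions.
beta_update <- function(alpha, beta, k, n) {
  c(alpha = alpha + k, beta = beta + (n - k))
}

pA <- beta_update(1, 1, k = 90,  n = 1000)   # Beta(91, 911)
pB <- beta_update(1, 1, k = 120, n = 1000)   # Beta(121, 881)

# Exact posterior mean and 95% credible interval of a Beta(a, b)
summarise_beta <- function(p) {
  a <- p[["alpha"]]
  b <- p[["beta"]]
  c(mean  = a / (a + b),
    lower = qbeta(0.025, a, b),
    upper = qbeta(0.975, a, b))
}

# Summaries in percent, one row per variant
print(round(100 * rbind(A = summarise_beta(pA), B = summarise_beta(pB)), 2))
```

With a uniform prior and a thousand sessions per variant, the posterior means sit very close to the raw rates: the data dominate the prior.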

I build the two posteriors in R, sampling them by simulation:

set.seed(42)
# Variant A: 90 conv / 1000 ; Variant B: 120 conv / 1000
cA <- 90; nA <- 1000; cB <- 120; nB <- 1000

# Posteriors with uniform Beta(1,1) prior
postA <- rbeta(1e5, 1 + cA, 1 + nA - cA)   # Beta(91, 911)
postB <- rbeta(1e5, 1 + cB, 1 + nB - cB)   # Beta(121, 881)

Now we hold two distributions, not two numbers. And this is precisely the point: instead of comparing 9.0% against 12.0% as if they were fixed values, we compare the whole uncertainty surrounding them. The operational questions become operations on these distributions.


What is the probability that B wins?

The first question — the one the p-value never answers directly — is the probability that B is genuinely better than A.
With the posteriors in hand, the calculation is almost trivial: we compare B’s samples with A’s, pair by pair, and count in what fraction of cases B exceeds A. That fraction is the probability we are after.

I compute in R the probability that B beats A:

cat("P(B>A) =", round(mean(postB > postA), 3), "\n")

Output: P(B>A) = 0.985.

There is a 98.5% probability that variant B converts better than variant A.
Notice the change of register compared to the frequentist version. We are not saying “the observed difference is unlikely under the null hypothesis”: we are saying, directly, that given the evidence collected it is almost certain that B is the better variant. This is exactly the statement we would want to base a decision on — and the Bayesian approach hands it over without circumlocutions.
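Since the simulated figure depends on random draws, it can be worth cross-checking it with a deterministic calculation. The sketch below approximates P(B>A) on a fine grid, using the fact that it equals the integral of B's posterior density times A's posterior cumulative distribution. It also estimates how much the simulated value can wobble with 1e5 draws. The grid resolution is an arbitrary choice of mine.

```r
# Deterministic check of P(B > A), no random sampling involved.
# P(B > A) = integral over x of f_B(x) * F_A(x), where f_B is the density of
# Beta(121, 881) and F_A is the CDF of Beta(91, 911).
n_grid <- 1e5                              # assumed grid resolution
x <- (seq_len(n_grid) - 0.5) / n_grid      # midpoints of equal-width cells on (0, 1)
p_grid <- sum(dbeta(x, 121, 881) * pbeta(x, 91, 911)) / n_grid

# Monte Carlo standard error of the simulated estimate with 1e5 paired draws
n_draws <- 1e5
mc_se <- sqrt(p_grid * (1 - p_grid) / n_draws)

cat("P(B>A), grid approximation =", round(p_grid, 4), "\n")
cat("Monte Carlo standard error =", signif(mc_se, 2), "\n")
```

If the grid value and the simulated value differ by more than a few standard errors, the usual suspects are a typo in the posterior parameters or too few draws.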


By how much it is better: the distribution of the difference

Knowing that B wins with 98.5% probability is not enough to decide. There is almost surely an improvement, but if it were two tenths of a percentage point, perhaps it would not be worth shipping the new page. The next question is therefore: by how much is it better?

The answer lives in the distribution of the difference between the two posteriors. We subtract, sample by sample, A’s rate from B’s: we obtain a new distribution, that of the uplift. From it we read both the typical value (the mean) and a credible interval that quantifies its uncertainty.

I compute in R the difference and its 95% interval:

diff <- postB - postA
cat("mean uplift (pct points) =", round(mean(diff)*100, 2), "\n")
cat("95% CI of difference =", round(quantile(diff, c(.025,.975))*100, 2), "\n")

Output: mean uplift = 3.00 pct points, 95% CI = [0.31, 5.68].

The expected gain is about 3 percentage points of conversion, with a 95% credible interval running from 0.31 to 5.68 points.
Here too the meaning is direct, not an abstract property of the procedure: there is a 95% probability that the true improvement of B over A lies between 0.3 and 5.7 percentage points. The interval does not touch zero, which is consistent with the earlier 98.5%: B is almost certainly superior. But the valuable figure is the width: the improvement could be modest (a few tenths of a point) or robust (over five points), and this spread is information the operational decision must keep in mind.
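The same draws also answer the practical question raised at the start of this section: is the gain large enough to be worth shipping? A minimal sketch follows, assuming a purely hypothetical business threshold of 1 percentage point (`min_uplift` is my own illustrative parameter). It also shows the relative uplift, which is often how the result gets reported.

```r
set.seed(42)
postA <- rbeta(1e5, 91, 911)    # posterior of A, Beta(91, 911)
postB <- rbeta(1e5, 121, 881)   # posterior of B, Beta(121, 881)
uplift <- postB - postA         # absolute uplift, draw by draw

# Hypothetical business threshold: ship only if the gain exceeds 1 pct point
min_uplift <- 0.01
p_worthwhile <- mean(uplift > min_uplift)

# Relative uplift: the gain as a fraction of A's own rate
rel_uplift <- uplift / postA

cat("P(uplift > 1 pct point) =", round(p_worthwhile, 3), "\n")
cat("median relative uplift (%) =", round(median(rel_uplift) * 100, 1), "\n")
```

The probability of clearing a minimum worthwhile gain is necessarily lower than P(B>A), and it is often the more honest number to bring to a shipping decision.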


When to stop: expected loss and the peeking problem

In the article on classic A/B testing we devoted space to one of the most insidious errors: peeking, that is, glancing at the interim data and stopping as soon as the difference looks significant. In the frequentist framework this inflates the false-positive rate, because each glance is effectively a new test on the same null hypothesis, and repeated tests multiply the chances of being wrong.

The Bayesian approach changes the nature of the problem. Here we are not repeating a test on a null hypothesis: we are updating a belief. Today’s posterior becomes tomorrow’s prior, and looking at the data as it arrives does not “consume” an error budget in the same way. This does not mean we can stop on a whim: we still need a stopping rule declared in advance. And the natural Bayesian rule is not “stop when P(B>A) is high”, but is based on expected loss.

The idea is this: if we choose B but A is in fact the better variant, we are wrong, and the size of the error is how much A beats B. If B really is the better one, the regret is zero. The expected loss of choosing B is the average of this “regret” over all the residual uncertainty, that is, over the whole joint posterior, including the many scenarios in which we lose nothing. In plain words: the chance of being wrong, weighted by how much being wrong would cost us.
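In symbols, writing \( \theta_A \) and \( \theta_B \) for the true conversion rates of the two variants (my notation, not used elsewhere in the article), the expected loss of choosing B can be written as:

\( \mathcal{L}(B) = \mathbb{E}\left[\max(\theta_A - \theta_B,\ 0)\right] = P(\theta_A > \theta_B) \cdot \mathbb{E}\left[\theta_A - \theta_B \mid \theta_A > \theta_B\right] \)

The second form makes the two ingredients visible: how likely it is that B is the wrong choice, and how large the shortfall would be in that case. Both expectations are taken over the joint posterior.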

I compute in R the expected loss of choosing B:

# Expected loss of choosing B: posterior mean of max(A - B, 0); draws where B is ahead count as zero
loss_B <- mean(pmax(postA - postB, 0))
cat("expected loss choosing B =", round(loss_B*100, 3), "pct points\n")

Output: expected loss choosing B = 0.007 pct points.

The expected loss of choosing B is a mere 0.007 percentage points: negligible. In plainer terms, the scenarios in which A is actually better are rare (about 1.5%, the complement of the 98.5% above), and weighted by that small probability the damage they would cause averages out to almost nothing. In practice the tolerance threshold is set before starting, for example “I stop when the expected loss drops below 0.01 points”, and the test runs until that threshold is reached.
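To put that figure in context, it helps to compute the expected loss of the opposite choice as well, and to separate "how often we would be wrong" from "how bad it would be". The sketch below does both. The variable names `loss_A`, `wrong` and `loss_B_if_wrong` are my own additions.

```r
set.seed(42)
postA <- rbeta(1e5, 91, 911)    # posterior of A, Beta(91, 911)
postB <- rbeta(1e5, 121, 881)   # posterior of B, Beta(121, 881)

loss_B <- mean(pmax(postA - postB, 0))   # cost of shipping B when A is better
loss_A <- mean(pmax(postB - postA, 0))   # cost of keeping A when B is better

# Average regret restricted to the draws where B is the wrong choice
wrong <- postA > postB
loss_B_if_wrong <- mean((postA - postB)[wrong])

cat("expected loss choosing A =", round(loss_A * 100, 3), "pct points\n")
cat("expected loss choosing B =", round(loss_B * 100, 3), "pct points\n")
cat("average regret given B is wrong =", round(loss_B_if_wrong * 100, 3), "pct points\n")
```

Two identities are worth keeping in mind. The difference between the two losses equals the mean uplift, and `loss_B` equals the fraction of "wrong" draws multiplied by the conditional regret. The conditional regret alone is not a stopping criterion: it ignores how unlikely the bad scenario is.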

A note of caution: the freedom to look at the data as it comes in is not a licence to stop whenever the result pleases us. The stopping rule — the expected-loss threshold, or a minimum level of P(B>A) — must be fixed before collecting the data, exactly as in the frequentist setting we fix the sample size. Rigour does not lie in the method we use, but in deciding the criterion before seeing the numbers.
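As a rough illustration of such a pre-declared rule, the sketch below simulates a test that is checked after every batch of sessions and stops when the expected loss of the better choice falls below the threshold. Everything here is an assumption made for the simulation: the "true" rates, the batch size, the cap on sessions, and the decision to stop on the smaller of the two losses.

```r
set.seed(42)

# Hypothetical simulation settings (none of these come from the article's data)
true_A <- 0.09; true_B <- 0.12   # assumed "true" rates used to generate sessions
batch <- 200                     # sessions added per variant between two looks
max_n <- 5000                    # hard cap on sessions per variant
threshold <- 0.0001              # 0.01 pct points, fixed before the test starts

cA <- 0; cB <- 0; n <- 0
repeat {
  # A new batch of sessions arrives for each variant
  cA <- cA + rbinom(1, batch, true_A)
  cB <- cB + rbinom(1, batch, true_B)
  n  <- n + batch

  # Update both posteriors (uniform Beta(1,1) prior) and recompute both losses
  postA <- rbeta(2e4, 1 + cA, 1 + n - cA)
  postB <- rbeta(2e4, 1 + cB, 1 + n - cB)
  loss_A <- mean(pmax(postB - postA, 0))
  loss_B <- mean(pmax(postA - postB, 0))

  # Stop when the better choice is cheap enough, or when the cap is reached
  if (min(loss_A, loss_B) < threshold || n >= max_n) break
}

choice <- if (loss_B < loss_A) "B" else "A"
cat("stopped at", n, "sessions per variant; choice:", choice, "\n")
```

The threshold, the batch size and the cap are all written down before the loop starts, and the loop never consults anything else: that is the discipline the paragraph above asks for.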


Try it yourself

A lead generation website tests two variants of its contact form. On variant A we observe 45 conversions out of 600 sessions; on B, 52 conversions out of 600 sessions.

  1. Build the two posteriors with a non-informative Beta(1, 1) prior: postA <- rbeta(1e5, 1 + 45, 1 + 555) and the analogue for B.
  2. Compute P(B>A): is B better with a probability high enough to convince you?
  3. Compute the mean uplift and 95% interval of the difference: does the interval touch zero?
  4. Compute the expected loss of choosing B. With these numbers (closer to each other than in the case above), how does it change compared to the article’s example?

Hint: the structure of the code is identical to the one we used. Only the starting counts change — and the result, far less clear-cut, is precisely why the interval and the expected loss matter more than a plain “B won”.
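If you prefer to reuse the code rather than retype it, one option is to wrap the four steps in a single function. The sketch below is one possible arrangement: `ab_summary` and its arguments are hypothetical names of mine, and it is run on the article's landing-page counts so that the exercise stays yours to do.

```r
# `ab_summary` is a hypothetical helper that bundles the article's four steps.
ab_summary <- function(cA, nA, cB, nB, draws = 1e5, seed = 42) {
  set.seed(seed)
  postA <- rbeta(draws, 1 + cA, 1 + nA - cA)   # uniform Beta(1,1) prior
  postB <- rbeta(draws, 1 + cB, 1 + nB - cB)
  uplift <- postB - postA
  list(
    p_B_beats_A = mean(uplift > 0),                     # probability that B wins
    mean_uplift = mean(uplift),                         # expected gain (proportion)
    ci_95       = quantile(uplift, c(0.025, 0.975)),    # 95% credible interval
    loss_B      = mean(pmax(-uplift, 0))                # expected loss of choosing B
  )
}

# The article's landing-page example; swap in the form counts for the exercise
res <- ab_summary(cA = 90, nA = 1000, cB = 120, nB = 1000)
str(res)
```

All quantities are returned as proportions; multiply by 100 to read them in percentage points.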


So far we have compared two variants at a fixed sample size: we collect the data, compute, decide. But if during the test one variant turns out to be clearly better, why keep sending half the traffic to the worse one? We can do better: allocate traffic adaptively, shifting it toward the winning variant while the test is still running. It is the leap from the test to the bandit, the subject of the next article.


Further reading

If you want to explore Bayesian A/B testing with a practical, code-oriented angle, Bayesian Methods for Hackers by Cameron Davidson-Pilon is the book I recommend. It tackles Bayesian reasoning starting from programming rather than formal mathematics, and devotes an explicit chapter to the Bayesian comparison of variants — probability that B wins, distribution of the difference, expected loss. It is written for those who learn better by reading code than proofs.

Paolo Gironi
