statistics

Multi-armed bandit: optimising the variants while the test is still running

In the article on Bayesian A/B testing we compared two variants at a fixed sample size: we collect the data for the whole planned duration, compute the probability that B beats A, and decide. It is a solid method, but it carries a cost that usually goes unmentioned.
That cost is the traffic that, for the entire duration of the test, we keep sending to the worse variant. If halfway through the experiment B is already winning hands down, every visitor assigned to A is a conversion we are probably throwing away. The fixed-sample test makes us pay for the information we gather: to learn which variant is better, we must keep showing the one we suspect to be the worse.

There is a way to cut this bill, and it is called a multi-armed bandit. The idea is to shift traffic adaptively toward the variant that is winning while the test is still running, instead of waiting for the final verdict. In this article we build one with one of the most elegant and practical algorithms, Thompson sampling, which is the natural continuation of the Bayesian reasoning we have followed so far.

What we will cover:

The exploration versus exploitation dilemma
Thompson sampling: letting the posterior decide
How much we really gain: the regret avoided
When a bandit makes sense, and when it does not
Try it yourself
Further reading

Exploration versus exploitation

The name comes from slot machines: a one-armed bandit is the casino machine, and let us imagine we have several of them in front of us, each with an unknown win probability different from the others. We have a limited number of tokens. At each play we must choose which arm to pull. What is the strategy that maximises the total winnings?

The translation for anyone doing SEO or marketing is immediate: the arms are the variants (three versions of a title tag, of a call to action, of a landing page), the tokens are the visitors, the winnings are the conversions. Each visitor must be assigned to a variant, and we want to maximise total conversions across the whole experiment.

Here arises the underlying tension, the one that makes the problem interesting. On one hand we would like to exploit the variant that so far seems the best, to bank as many conversions as possible. On the other we must keep exploring the others too, because “so far it seems the best” rests on little data and might be a wrong impression.
It is a delicate balance: too much exploration and we waste traffic on mediocre variants; too much exploitation and we risk crowning the wrong winner on the basis of an early stroke of luck. The exploration-exploitation dilemma is the heart of every multi-armed bandit problem: each choice is at once an opportunity for gain and an opportunity for learning, and the two pull in opposite directions.

Thompson sampling: letting the posterior decide

The classic A/B test solves the dilemma in the crudest possible way: it explores and nothing more, in equal parts, until the end. Half the traffic to A, half to B, no adaptation. Thompson sampling solves it in a far cleverer, and almost surprisingly simple, way.

Let us pick up the Bayesian thread. For each variant we keep a posterior on its conversion rate: as we saw when estimating the conversion rate, starting from a Beta prior and observing binary outcomes (conversion yes/no), the posterior of each arm is again a Beta distribution, updated with every visitor. At the start, when we know nothing, each arm begins from a non-informative Beta(1, 1) prior.

Thompson’s rule, in words before formulas, is this: instead of asking “which is the variant with the highest mean so far?”, at each visitor we sample a rate at random from the posterior of each arm, and play the arm that produced the highest sample. It is a way of choosing “in proportion to the probability of being the best”: a variant we are very uncertain about can still win the draw now and then (and that is how it keeps being explored), but as the data accumulate its samples concentrate and, if it really is worse, it almost stops being chosen on its own.

The elegance lies precisely here: there is no exploration parameter to tune by hand. The uncertainty of the posterior is the engine of exploration. The more uncertain an arm, the more its samples are spread out, the more often it happens to win the draw and get tried; the more certain it becomes, the more its samples tighten around the true value and the arm gets played (or avoided) decisively.

I simulate in R the whole process on three variants with true rates of 5%, 7% and 9% (which the algorithm of course does not know), over 5000 visitors:

set.seed(7)
true_rates <- c(0.05, 0.07, 0.09)   # 3 variants, the third is the best
K <- length(true_rates); N <- 5000
alpha <- rep(1, K); beta_ <- rep(1, K)   # Beta(1,1) prior per arm
pulls <- rep(0, K); rewards <- rep(0, K)
for (t in 1:N) {
  theta <- rbeta(K, alpha, beta_)        # sample a rate per arm
  arm <- which.max(theta)                 # play the best sampled arm
  r <- rbinom(1, 1, true_rates[arm])      # outcome (conv yes/no)
  alpha[arm] <- alpha[arm] + r; beta_[arm] <- beta_[arm] + (1 - r)
  pulls[arm] <- pulls[arm] + 1; rewards[arm] <- rewards[arm] + r
}

The loop is all there is. At each iteration we sample a theta per arm (rbeta), choose the maximum (which.max), simulate the outcome (rbinom) and update the parameters of the arm played: a conversion increases its alpha by one, a non-conversion its beta_. No other logic, no threshold to calibrate.

How much we gain: the regret avoided

Let us see where the simulation took us. The first thing to look at is how the 5000 visitors were distributed across the three arms:

cat("pulls per arm:", pulls, "\n")
cat("bandit total conv:", sum(rewards), "\n")

Output: pulls per arm = 359, 278, 4363; bandit total conv = 424.

Of the 5000 visitors, a full 4363 ended up on the best variant, and only a little over 600 in total on the two worse ones. The algorithm, without our having told it anything about the true rates, worked out on its own which arm to reward and steered the vast majority of the traffic there. It is exactly the exploitation we wanted, reached through just enough early exploration to tell the arms apart.

Now the comparison that reveals the size of the advantage. A classic A/B test would have split the 5000 visitors equally across the three variants, roughly 1667 each, for the whole duration. The expected conversions in that scenario are simply the mean of the three rates times the number of visitors:

exp_ab <- sum(true_rates) / K * N
cat("expected conv equal-split A/B:", round(exp_ab), "\n")
cat("regret avoided (approx):", round(sum(rewards) - exp_ab), "conv\n")

Output: expected conv equal-split A/B = 350; regret avoided (approx) = 74 conv.

An equal-split A/B would have taken home about 350 conversions; the bandit collected 424. The difference, about 74 more conversions on the very same traffic, is what is technically called the regret avoided: the regret, that is the conversions lost by playing the wrong arms, that the adaptive allocation saved us. In plainer terms, for the same number of visitors the bandit converted over 20% more, simply by not insisting on the weak variants.
Note that we did not have to sacrifice anything to get it: at the end of the experiment we still know the winner (indeed we know it with great confidence, given the 4363 data points gathered on it), and in the meantime we converted more. This is the structural advantage of the bandit over the fixed-sample test.

When it makes sense (and when it does not)

That said, a bandit is not the right answer to every question, and it is important to understand where it shines and where instead a traditional A/B test remains preferable.

The bandit is at its best when traffic is continuous and long-running (an always-on page, a campaign running for months) and when the variants to compare are many: there the adaptive allocation pays off, because there is time and volume to shift the traffic and plenty of weak variants to drop quickly. It is ideal for ongoing optimisations where the goal is to maximise conversions along the whole journey, not to take a clean statistical snapshot at a given moment.

When instead the goal is precisely that snapshot — a precise, unbiased estimate of how much a variant is better, perhaps to defend in front of a client or to use for a strategic decision — the fixed-sample test remains more suitable: precisely because it explores in equal parts, it gathers balanced data on all variants and produces sharper estimates of the effect. The bandit, by concentrating early on the winner, gathers little data on the losers and therefore estimates by how much they are worse less well.

A note of caution: the bandit assumes that the conversion rates stay stable over time. But the real world is often non-stationary — seasonality, shifts in audience, a promotion that kicks in. If the best variant changes after the algorithm has already concentrated on another, a naive Thompson sampling like the one shown here struggles to notice, because it has stopped exploring the alternatives. In changing contexts we need variants that “forget” old data (for example by gradually discounting past observations), otherwise we risk staying anchored to a winner that no longer is one.

Try it yourself

Simulation is a perfect playground for building intuition. Taking the code above, try changing the starting numbers and observe how the pulls redistribute:

Bring the true rates closer, for example true_rates <- c(0.07, 0.08, 0.09): with more similar variants, how much more traffic is needed before the best arm breaks away? Does the concentration of pulls stay this sharp?
Cut the traffic drastically (N <- 500): with few visitors does the bandit still have time to spot the winner, or does exploration eat up the whole budget?
Add variants: move to four or five arms. With more alternatives to discard, does the advantage in regret avoided over the equal-split A/B grow or shrink?

Hint: the structure of the loop never changes, you just modify true_rates and N. It is precisely by watching how the pulls vector tips (or fails to tip) as a function of the distance between the rates that you truly grasp what Thompson sampling does under the hood.

So far we have used Bayes to decide between variants: estimate a rate, compare two, allocate traffic adaptively. But the same machinery — a prior, some data, a posterior — also serves a different and very common task: classifying, that is assigning to each new observation the most probable label given its features. It is the leap from deciding to classifying, and the algorithm that does it with disarming elegance, once again starting from Bayes’ theorem, is Naive Bayes: the subject of the next article.

Keyword Clustering: grouping thousands of queries with K-means and hierarchical clustering

It happens with every reasonably serious project: you export the keyword list from Search Console…

6 hours ago

statistics

Expected vs Actual CTR: finding the pages that earn fewer clicks than their position deserves

Anyone who spends their days inside Search Console knows that little nagging feeling: a page…

1 day ago

statistics

Naive Bayes: classifying search intent with Bayes’ theorem

In the article on the multi-armed bandit we used Bayes to decide between variants: shifting…

2 days ago

statistics

Bayesian A/B Testing: not just “whether” B beats A, but “by how much”

In the article on classic A/B testing we saw how to compare two variants with…

4 days ago

statistics

Bayesian Conversion Rate Estimation: how much can we trust limited data

In the article on the foundations of Bayesian statistics, we saw how Bayesian updating works…

5 days ago

statistics

The peeking problem: why sneaking a look at an A/B test inflates false positives

On 21 January 2015 Optimizely — one of the most widely used A/B testing platforms…

1 week ago