{"id":3889,"date":"2026-06-27T18:05:23","date_gmt":"2026-06-27T17:05:23","guid":{"rendered":"https:\/\/www.gironi.it\/blog\/?p=3889"},"modified":"2026-06-28T10:16:39","modified_gmt":"2026-06-28T09:16:39","slug":"thompson-sampling-multi-armed-bandit","status":"publish","type":"post","link":"https:\/\/www.gironi.it\/blog\/en\/thompson-sampling-multi-armed-bandit\/","title":{"rendered":"Multi-armed bandit: optimising the variants while the test is still running"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">In the article on <a href=\"https:\/\/www.gironi.it\/blog\/en\/bayesian-ab-testing\/\">Bayesian A\/B testing<\/a> we compared two variants at a fixed sample size: we collect the data for the whole planned duration, compute the probability that B beats A, and decide. It is a solid method, but it carries a cost that usually goes unmentioned.<br> That cost is the traffic that, for the entire duration of the test, we keep sending to the worse variant. If halfway through the experiment B is already winning hands down, every visitor assigned to A is a conversion we are probably throwing away. <strong>The fixed-sample test makes us pay for the information we gather: to learn which variant is better, we must keep showing the one we suspect to be the worse.<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">There is a way to cut this bill, and it is called a <em>multi-armed bandit<\/em>. The idea is to shift traffic adaptively toward the variant that is winning <em>while the test is still running<\/em>, instead of waiting for the final verdict. 
In this article we build one using <em>Thompson sampling<\/em>, one of the most elegant and practical bandit algorithms and the natural continuation of the <a href=\"https:\/\/www.gironi.it\/blog\/en\/bayesian-statistics-how-to-learn-from-data-one-step-at-a-time\/\">Bayesian<\/a> reasoning we have followed so far.<\/p>\n\n\n\n<!--more-->\n\n\n\n<p class=\"wp-block-paragraph\"><strong>What we will cover<\/strong>:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><a href=\"#exploration-exploitation\">The exploration versus exploitation dilemma<\/a><\/li><li><a href=\"#thompson-sampling\">Thompson sampling: letting the posterior decide<\/a><\/li><li><a href=\"#regret-avoided\">How much we really gain: the regret avoided<\/a><\/li><li><a href=\"#when-it-makes-sense\">When a bandit makes sense, and when it does not<\/a><\/li><li><a href=\"#try-it-yourself\">Try it yourself<\/a><\/li><li><a href=\"#further-reading\">Further reading<\/a><\/li><\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"exploration-exploitation\">Exploration versus exploitation<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The name comes from slot machines: a <em>one-armed bandit<\/em> is the casino machine. Imagine we have several of them in front of us, each with its own unknown win probability. We have a limited number of tokens, and at each play we must choose which arm to pull. Which strategy maximises the total winnings?<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The translation for anyone doing SEO or marketing is immediate: the arms are the variants (three versions of a <em>title tag<\/em>, of a <em>call to action<\/em>, of a landing page), the tokens are the visitors, the winnings are the conversions. 
Each visitor must be assigned to a variant, and we want to maximise total conversions across the whole experiment.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Here arises the underlying tension, the one that makes the problem interesting. On one hand we would like to <strong>exploit<\/strong> the variant that so far seems the best, to bank as many conversions as possible. On the other we must keep <strong>exploring<\/strong> the others too, because &#8220;so far it seems the best&#8221; rests on little data and might be a wrong impression.<br> It is a delicate balance: too much exploration and we waste traffic on mediocre variants; too much exploitation and we risk crowning the wrong winner on the basis of an early stroke of luck. <strong>The exploration-exploitation dilemma is the heart of every multi-armed bandit problem: each choice is at once an opportunity for gain and an opportunity for learning, and the two pull in opposite directions.<\/strong><\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"thompson-sampling\">Thompson sampling: letting the posterior decide<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The classic A\/B test solves the dilemma in the crudest possible way: it explores and nothing more, in equal parts, until the end. Half the traffic to A, half to B, no adaptation. Thompson sampling solves it in a far cleverer, and almost surprisingly simple, way.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Let us pick up the Bayesian thread. For each variant we keep a posterior on its conversion rate: as we saw when estimating the conversion rate, starting from a <a href=\"https:\/\/www.gironi.it\/blog\/en\/the-beta-distribution-explained-simply\/\">Beta<\/a> prior and observing binary outcomes (conversion yes\/no), the posterior of each arm is again a Beta distribution, updated with every visitor. 
At the start, when we know nothing, each arm begins from a non-informative Beta(1, 1) prior, which is simply the uniform distribution on rates between 0 and 1.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Thompson&#8217;s rule, in words before formulas, is this: instead of asking &#8220;which is the variant with the highest mean so far?&#8221;, at each visitor we <strong>sample a rate at random from the posterior of each arm, and play the arm that produced the highest sample<\/strong>. It is a way of choosing &#8220;in proportion to the probability of being the best&#8221;: a variant we are very uncertain about can still win the draw now and then (and that is how it keeps being explored), but as the data accumulate its samples concentrate and, if it really is worse, it is chosen less and less often without any intervention on our part.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The elegance lies precisely here: there is no exploration parameter to tune by hand. The uncertainty of the posterior <em>is<\/em> the engine of exploration. The more uncertain an arm, the more its samples are spread out, the more often it happens to win the draw and get tried; the more certain it becomes, the more its samples tighten around the true value and the arm gets played (or avoided) decisively.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Let us simulate the whole process in R on three variants with true rates of 5%, 7% and 9% (which the algorithm of course does not know), over 5000 visitors:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>set.seed(7)\ntrue_rates &lt;- c(0.05, 0.07, 0.09)   # 3 variants, the third is the best\nK &lt;- length(true_rates); N &lt;- 5000\nalpha &lt;- rep(1, K); beta_ &lt;- rep(1, K)   # Beta(1,1) prior per arm\npulls &lt;- rep(0, K); rewards &lt;- rep(0, K)\nfor (t in 1:N) {\n  theta &lt;- rbeta(K, alpha, beta_)        # sample a rate per arm\n  arm &lt;- which.max(theta)                 # play the best sampled arm\n  r &lt;- rbinom(1, 1, true_rates[arm])      # outcome (conv yes\/no)\n  alpha[arm] &lt;- alpha[arm] + r; beta_[arm] &lt;- beta_[arm] + (1 - 
r)\n  pulls[arm] &lt;- pulls[arm] + 1; rewards[arm] &lt;- rewards[arm] + r\n}<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The loop is all there is. At each iteration we sample a <code>theta<\/code> per arm (<code>rbeta<\/code>), choose the maximum (<code>which.max<\/code>), simulate the outcome (<code>rbinom<\/code>) and update the parameters of the arm played: a conversion increases its <code>alpha<\/code> by one, a non-conversion its <code>beta_<\/code>. No other logic, no threshold to calibrate.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"regret-avoided\">How much we gain: the regret avoided<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Let us see where the simulation took us. The first thing to look at is how the 5000 visitors were distributed across the three arms:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>cat(\"pulls per arm:\", pulls, \"\\n\")\ncat(\"bandit total conv:\", sum(rewards), \"\\n\")<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Output: pulls per arm = 359, 278, 4363; bandit total conv = 424.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Of the 5000 visitors, a full 4363 ended up on the best variant, and only a little over 600 in total on the two worse ones.<\/strong> The algorithm, without our having told it anything about the true rates, worked out on its own which arm to reward and steered the vast majority of the traffic there. It is exactly the exploitation we wanted, reached through just enough early exploration to tell the arms apart.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Now the comparison that reveals the size of the advantage. A classic A\/B test would have split the 5000 visitors equally across the three variants, roughly 1667 each, for the whole duration. 
The expected conversions in that scenario are simply the mean of the three rates times the number of visitors:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>exp_ab &lt;- sum(true_rates) \/ K * N\ncat(\"expected conv equal-split A\/B:\", round(exp_ab), \"\\n\")\ncat(\"regret avoided (approx):\", round(sum(rewards) - exp_ab), \"conv\\n\")<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Output: expected conv equal-split A\/B = 350; regret avoided (approx) = 74 conv.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">An equal-split A\/B would have taken home about 350 conversions; the bandit collected 424. The difference, about <strong>74 more conversions on the very same traffic<\/strong>, is what we can call the <em>regret<\/em> avoided. Regret is the number of conversions lost compared with always playing the best arm: sending all 5000 visitors to the 9% variant would yield 450 expected conversions, so the equal split gives up about 100 of them and the bandit only about 26. (The figure is approximate because 424 is the result of a single simulated run, while 350 is an expectation.) In plainer terms, for the same number of visitors the bandit converted over 20% more, simply by not insisting on the weak variants.<br> Note that we did not have to give up the verdict to get it: at the end of the experiment we still know the winner (indeed we know it with great confidence, given the 4363 data points gathered on it), and in the meantime we converted more. 
This is the structural advantage of the bandit over the fixed-sample test.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"when-it-makes-sense\">When it makes sense (and when it does not)<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">That said, a bandit is not the right answer to every question, and it is important to understand where it shines and where a traditional A\/B test remains preferable.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The bandit is at its best when traffic is <strong>continuous and long-running<\/strong> (an always-on page, a campaign running for months) and when the variants to compare are <strong>many<\/strong>: there the adaptive allocation pays off, because there is time and volume to shift the traffic and plenty of weak variants to drop quickly. It is ideal for ongoing optimisations where the goal is to maximise conversions along the whole journey, not to take a clean statistical snapshot at a given moment.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">When the goal is precisely that snapshot (a precise, unbiased estimate of <em>how much<\/em> a variant is better, perhaps to defend in front of a client or to use for a strategic decision), the fixed-sample test remains more suitable: because it explores in equal parts, it gathers balanced data on all variants and produces sharper estimates of the effect. The bandit, by concentrating early on the winner, gathers little data on the losers and therefore measures <em>by how much<\/em> they are worse far less precisely. In the simulation above, the two weaker arms received only 359 and 278 visitors out of 5000.<\/p>\n\n\n\n<p class=\"has-light-gray-background-color has-background wp-block-paragraph\">A note of caution: the bandit as built here assumes that the conversion rates stay stable over time. But the real world is often non-stationary: seasonality, shifts in audience, a promotion that kicks in. 
If the best variant changes <em>after<\/em> the algorithm has already concentrated on another, a naive Thompson sampling like the one shown here is slow to notice, because it has all but stopped exploring the alternatives and its posteriors still carry the full weight of the old data. In changing contexts we need versions of the algorithm that &#8220;forget&#8221; old data (for example by gradually discounting past observations), otherwise we risk staying anchored to a winner that no longer is one.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"try-it-yourself\">Try it yourself<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Simulation is a perfect playground for building intuition. Taking the code above, try changing the starting numbers and observe how the <code>pulls<\/code> redistribute:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>Bring the true rates closer, for example <code>true_rates &lt;- c(0.07, 0.08, 0.09)<\/code>: with more similar variants, how much more traffic is needed before the best arm breaks away? Does the concentration of pulls stay this sharp?<\/li><li>Cut the traffic drastically (<code>N &lt;- 500<\/code>): with few visitors does the bandit still have time to spot the winner, or does exploration eat up the whole budget?<\/li><li>Add variants: move to four or five arms. With more alternatives to discard, does the advantage in <code>regret avoided<\/code> over the equal-split A\/B grow or shrink?<\/li><\/ol>\n\n\n\n<p class=\"wp-block-paragraph\">Hint: the structure of the loop never changes, you just modify <code>true_rates<\/code> and <code>N<\/code>. 
It is precisely by watching how the <code>pulls<\/code> vector tips (or fails to tip) as a function of the distance between the rates that you truly grasp what Thompson sampling does under the hood.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p class=\"wp-block-paragraph\">So far we have used Bayes to <em>decide between variants<\/em>: estimate a rate, compare two, allocate traffic adaptively. But the same machinery \u2014 a prior, some data, a posterior \u2014 also serves a different and very common task: <em>classifying<\/em>, that is assigning to each new observation the most probable label given its features. It is the leap from deciding to classifying, and the algorithm that does it with disarming elegance, once again starting from Bayes&#8217; theorem, is <em>Naive Bayes<\/em>: the subject of the next article.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"further-reading\">Further reading<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">If you want to explore bandits applied to website optimisation, <a href=\"https:\/\/www.amazon.it\/dp\/1449341330?tag=consulenzeinf-21\" rel=\"nofollow sponsored noopener\" target=\"_blank\"><em>Bandit Algorithms for Website Optimization<\/em><\/a> by John Myles White is the book I recommend. It is a slim, avowedly practical volume that compares A\/B tests and bandit algorithms (epsilon-greedy, softmax, UCB) precisely from the standpoint of someone optimising pages and conversions, with the code at hand. 
It is the ideal starting point for anyone wanting to move from the simulation we have seen to a bandit that actually runs on their own site.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This article is part of the <a href=\"https:\/\/www.gironi.it\/blog\/en\/bayesian-approach\/\">&#8220;The Bayesian Approach&#8221;<\/a> path, a guided route through the articles on Bayesian statistics and inference for SEO.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In the article on Bayesian A\/B testing we compared two variants at a fixed sample size: we collect the data for the whole planned duration, compute the probability that B beats A, and decide. It is a solid method, but it carries a cost that usually goes unmentioned. That cost is the traffic that, for &hellip; <a href=\"https:\/\/www.gironi.it\/blog\/en\/thompson-sampling-multi-armed-bandit\/\" class=\"more-link\">Read more<span class=\"screen-reader-text\"> &#8220;Multi-armed bandit: optimising the variants while the test is still running&#8221;<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_uag_custom_page_level_css":"","footnotes":""},"categories":[161],"tags":[],"class_list":["post-3889","post","type-post","status-publish","format-standard","hentry","category-statistics"],"lang":"en","translations":{"en":3889,"it":3888},"uagb_featured_image_src":{"full":false,"thumbnail":false,"medium":false,"medium_large":false,"large":false,"1536x1536":false,"2048x2048":false,"post-thumbnail":false},"uagb_author_info":{"display_name":"Paolo Gironi","author_link":"https:\/\/www.gironi.it\/blog\/author\/autore-articoli\/"},"uagb_comment_info":0,"uagb_excerpt":"In the article on Bayesian A\/B testing we compared two variants at a fixed sample size: we collect the data for the whole planned duration, compute the probability that B beats A, and decide. 
It is a solid method, but it carries a cost that usually goes unmentioned. That cost is the traffic that, for&hellip;","_links":{"self":[{"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/posts\/3889","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/comments?post=3889"}],"version-history":[{"count":4,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/posts\/3889\/revisions"}],"predecessor-version":[{"id":3936,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/posts\/3889\/revisions\/3936"}],"wp:attachment":[{"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/media?parent=3889"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/categories?post=3889"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/tags?post=3889"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}