  {"id":3847,"date":"2026-06-23T10:40:54","date_gmt":"2026-06-23T09:40:54","guid":{"rendered":"https:\/\/www.gironi.it\/blog\/?p=3847"},"modified":"2026-06-23T10:40:55","modified_gmt":"2026-06-23T09:40:55","slug":"peeking-problem-ab-testing","status":"publish","type":"post","link":"https:\/\/www.gironi.it\/blog\/en\/peeking-problem-ab-testing\/","title":{"rendered":"The peeking problem: why sneaking a look at an A\/B test inflates false positives"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">On 21 January 2015 Optimizely \u2014 one of the most widely used A\/B testing platforms in the world \u2014 switched on a completely new statistical engine for all of its customers, the <em>New Stats Engine<\/em>. <br>It wasn&#8217;t a technical whim: the old engine, built around a classic fixed-horizon t-test (<em>Fixed Horizon<\/em>), had a flaw that affected anyone who looked at a test&#8217;s results before the end; the new engine was developed with statisticians from Stanford to fix it. And we <em>always<\/em> look at a test&#8217;s results before the end.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Optimizely had measured the problem themselves, simulating A\/A tests \u2014 two identical variants, where by construction neither is better than the other, so any declared &#8220;winner&#8221; is a false alarm. <br>According to the figures published by Optimizely, on tests of 5,000 visitors anyone checking the numbers after <em>every<\/em> visitor saw <strong>57% of A\/A tests declare a false winner at least once<\/strong>; checking every 500 visitors the figure dropped to 26%, every 1,000 to 20%. Chilling numbers for a tool that is supposed to help us decide with rigour. 
The rewrite \u2014 sequential inference plus false discovery rate control, what they call always-valid \u2014 was meant precisely to bring the error, as they put it, &#8220;from over 30% to 5%&#8221;.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">It&#8217;s the same deception we ran into closing the article on <a href=\"https:\/\/www.gironi.it\/blog\/en\/regression-to-the-mean\/\">regression to the mean<\/a>: there we selected the worst-performing pages \u2014 an extreme instant in the <em>space<\/em> of the data \u2014 and let ourselves be fooled by their rebound. Here we select an extreme instant in <em>time<\/em>: we stop the moment the test proves us right. The mechanism is a cousin, the risk identical.<\/p>\n\n\n\n<!--more-->\n\n\n\n<h2 class=\"wp-block-heading\">What peeking is<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Anyone who runs an <a href=\"https:\/\/www.gironi.it\/blog\/en\/ab-testing-statistically-valid-experiments\/\">A\/B test<\/a> knows it well: the test is running, the data come in day after day, and the temptation to sneak a look at the dashboard is irresistible. <br><em>Peeking<\/em> isn&#8217;t the mere act of looking: it&#8217;s looking <em>while reserving the right to stop the test<\/em> the moment the result becomes significant. It&#8217;s that &#8220;great, variant B has crossed the threshold, let&#8217;s wrap up here and declare the winner&#8221; said halfway through data collection.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The delicate point is that every look accompanied by the possibility of stopping <strong>is one more statistical test<\/strong>. <br>A single test with a 5% threshold accepts, by definition, a 5% chance of crying &#8220;winner&#8221; when in fact there&#8217;s no difference at all. 
But if we repeat that same test twenty times over the course of collection, and all we need is for <em>just one<\/em> of those twenty times to cross the threshold in order to stop and declare victory, then the chance of stumbling into a false positive is no longer 5%: it <strong>accumulates<\/strong> with every look. If the twenty looks were independent, it would reach 1 &#8722; 0.95<sup>20<\/sup> &#8776; 64%; because successive looks share most of their data the real figure is lower, but it still sits well above 5%.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This isn&#8217;t the usual multiplicity of someone comparing ten variants at once. Here the multiplicity is hidden in <em>time<\/em>: a single comparison, looked at many times. It&#8217;s the same logic by which a fixed run of coin tosses rarely gives a strange result, but if one is allowed to look after every toss and stop at the first favourable moment, sooner or later that moment arrives \u2014 and it gets mistaken for a signal.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What peeking costs: a simulation<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Words convince us up to a point; numbers convince us far more. I simulate an A\/A test in R, that is, two variants with <strong>exactly the same<\/strong> true conversion rate (10%): any difference that emerges is noise, and any declared &#8220;victory&#8221; is a false positive by construction. 
<br>I set the stage by fixing the random number generator&#8217;s seed (so the numbers are reproducible), the function that computes the p-value of the comparison between two proportions, and the function that simulates a single experiment and reports whether at some point it declared a (false) winner:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>set.seed(2025)\n\np_vero  &lt;- 0.10    # same conversion rate for A and B (H0 true)\nn_arm   &lt;- 2000    # visitors per variant at the end of the test\nn_sim   &lt;- 4000    # number of simulated experiments\nalpha   &lt;- 0.05\nsguardi &lt;- 20      # how many times we \"peek\" during collection\nlook_at &lt;- round(seq(n_arm \/ sguardi, n_arm, length.out = sguardi))\n\n# p-value of a two-proportion, two-sided z-test\npval_ab &lt;- function(xa, na, xb, nb) {\n  pp &lt;- (xa + xb) \/ (na + nb)\n  se &lt;- sqrt(pp * (1 - pp) * (1 \/ na + 1 \/ nb))\n  2 * pnorm(-abs((xa \/ na - xb \/ nb) \/ se))\n}\n\n# one A\/A experiment: TRUE if it declares a (false) winner\nesperimento &lt;- function(soglia, guarda) {\n  a &lt;- cumsum(rbinom(n_arm, 1, p_vero))\n  b &lt;- cumsum(rbinom(n_arm, 1, p_vero))\n  for (k in guarda) {\n    p &lt;- pval_ab(a[k], k, b[k], k)\n    if (!is.na(p) &amp;&amp; p &lt; soglia) return(TRUE)\n  }\n  FALSE\n}<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Let&#8217;s start from the correct behaviour: a single test, at the end, on the 2,000 visitors per variant. I run it 4,000 times and count how many declare a winner:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># fixed horizon: a single test, at the end\nfisso &lt;- mean(replicate(n_sim, esperimento(alpha, n_arm)))\ncat(sprintf(\"Fixed horizon: %.1f%% false positives\\n\", 100 * fisso))\n# Fixed horizon: 5.0% false positives<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Out comes <strong>5.0%<\/strong>: exactly the level we declared with the 5% threshold. The test, used as it should be, keeps its promise. 
<br>Now I change one thing only: instead of looking once at the end, I look twenty times during collection and stop at the first moment the p-value drops below 0.05. I add the intermediate looks and run again:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># peeking: a test at every look, stop at the first significant one\npeek &lt;- mean(replicate(n_sim, esperimento(alpha, look_at)))\ncat(sprintf(\"Peeking (%d looks): %.1f%% false positives\\n\", sguardi, 100 * peek))\n# Peeking (20 looks): 24.3% false positives<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">From <strong>5.0%<\/strong> to <strong>24.3%<\/strong>. <br><strong>The same data-generating process, the same test, the same threshold: the only thing that changed is when we decided to look, and the false positive rate has nearly quintupled.<\/strong> Almost one A\/A test in four, in which the two variants are identical by construction, convinces us we&#8217;ve found a winner. The 24.3% from our simulation and the &#8220;over 30%&#8221; reported by Optimizely tell the same story with different data: peeking isn&#8217;t a venial sin, it&#8217;s the most effective way to fool ourselves.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Solution 1: the fixed horizon<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The simplest cure is also the most annoying one: decide <em>beforehand<\/em> how much data to collect, and then have the discipline to wait until the end without stopping early, whatever the dashboard says in the meantime. <br>It&#8217;s what the simulation has just shown us: with a single test at the end, the false positive rate stays nailed to the promised 5%. No magic, just the elimination of opportunistic looks.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">&#8220;How much data&#8221; isn&#8217;t a number plucked from thin air: it depends on how small a difference we want to be able to detect and on how much certainty we demand. 
It&#8217;s the sample size calculation, which is done before launching the test with our <a href=\"https:\/\/www.gironi.it\/blog\/en\/ab-test-significance-calculator\/\">significance calculator<\/a> and which rests on the concepts of <a href=\"https:\/\/www.gironi.it\/blog\/en\/effect-size-and-power-analysis\/\">effect size and power analysis<\/a>. <br>Once that number is fixed, the fixed horizon is the safest road: no statistical correction to apply, no threshold to tweak. The price, though, is paid in patience \u2014 resisting the curiosity for days or weeks \u2014 and this, in operational reality, is exactly what almost no one manages to do.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Solution 2: looking without cheating<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">And what if monitoring on the fly really is necessary \u2014 because a test that&#8217;s going terribly needs stopping, or because the stakeholders want updates? <br>Then the way is not to look in secret with the usual threshold, but to look <em>openly<\/em> with a stricter one. The idea is simple: if at every look we raise the bar, making it harder to cry &#8220;winner&#8221; on each occasion, we can arrange for the <strong>overall<\/strong> error \u2014 the chance that at least one of the looks raises a false alarm \u2014 to stay at the 5% we wanted. 
I calibrate the per-look threshold in R, trying ever more stringent values on the same twenty looks as before:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># stricter per-look threshold that brings the overall error back to ~5%\nfor (sg in c(0.05, 0.02, 0.01, 0.005)) {\n  fp &lt;- mean(replicate(n_sim, esperimento(sg, look_at)))\n  cat(sprintf(\"  threshold %.3f -&gt; %.1f%% overall\\n\", sg, 100 * fp))\n}\n#   threshold 0.050 -&gt; 25.1% overall\n#   threshold 0.020 -&gt; 11.7% overall\n#   threshold 0.010 -&gt;  6.6% overall\n#   threshold 0.005 -&gt;  3.3% overall<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">As we can see, the usual 0.05 threshold produces a 25.1% overall error (the peeking disaster again; it differs slightly from the earlier 24.3% only because these are fresh random draws), but as we make it stricter the error comes back down: <strong>somewhere between 0.01 and 0.005 \u2014 a threshold five to ten times more stringent than the standard one \u2014 the overall error returns to the nominal 5%.<\/strong> It&#8217;s the price to be paid for the right to peek: at every single look much more evidence is required, in exchange for the freedom to look often.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">What we&#8217;ve just shown is a homemade, constant-threshold version of the idea. The &#8220;textbook&#8221; boundaries (Pocock&#8217;s, which also keeps the per-look threshold constant but computes it exactly, and O&#8217;Brien-Fleming&#8217;s, which starts very strict and relaxes as the test proceeds) are obtained in R with the <code>gsDesign<\/code> package, and commercial tools like Optimizely use an <em>always-valid<\/em> variant (the so-called mSPRT) of the same underlying idea. 
<br>The fine mathematics changes, not the principle: to look often without cheating one must demand, at every look, more evidence than a single test would ask for.<\/p>\n\n\n\n<p class=\"has-light-gray-background-color has-background wp-block-paragraph\">A word of caution: <strong>a result seen during the test, on its own, proves nothing: what counts is when the decision to look was made.<\/strong> <br>The same p-value below 0.05 means different things depending on whether it&#8217;s the only fixed-horizon test or the first of the twenty at which one reserved the right to stop. Without declaring in advance how and when the data will be examined, any &#8220;winner&#8221; that emerges on the fly is suspect.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Try it yourself<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">To feel the mechanism up close, let&#8217;s start from the script and change a single parameter: the number of <code>sguardi<\/code> (looks). <br>Let&#8217;s go from weekly monitoring (few looks) to daily monitoring (many looks) and re-run the peeking simulation. What to expect: the more frequently one peeks, the higher the false positive rate climbs \u2014 the frequency of looks is the fuel of the problem. Then let&#8217;s redo the threshold calibration with that new number of looks and check that, by choosing a strict enough threshold, the overall error comes back under control all the same. It&#8217;s the proof, first-hand, that peeking isn&#8217;t a curse: it&#8217;s just a bill that has to be paid.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p class=\"wp-block-paragraph\">There&#8217;s one last trap in this family, perhaps the most insidious of all, because it doesn&#8217;t hide in our own data but in the data others tell us about. 
When we read an agency&#8217;s case study \u2014 &#8220;we increased conversions by 300% with this tactic&#8221; \u2014 we&#8217;re looking at a survivor: the thousand identical attempts that failed are something nobody mentions. It&#8217;s <em>survivorship bias<\/em>, the reason case studies lie even when they tell the truth, and it&#8217;s the next step on our journey through the pitfalls of marketing data.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">Further Reading<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">On peeking, early stopping and sequential testing the reference \u2014 in English \u2014 remains <a href=\"https:\/\/www.amazon.it\/dp\/1108724264?tag=consulenzeinf-21\" rel=\"nofollow sponsored noopener\" target=\"_blank\"><em>Trustworthy Online Controlled Experiments<\/em><\/a> by Ron Kohavi, Diane Tang and Ya Xu: written by people who led the experimentation platforms at Microsoft, Google and LinkedIn, it devotes explicit pages to all the ways a running A\/B test can fool us, and to how to defend against them. It&#8217;s the book anyone who has to take online experiments seriously pulls out of the drawer.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>On 21 January 2015 Optimizely \u2014 one of the most widely used A\/B testing platforms in the world \u2014 switched on a completely new statistical engine for all of its customers, the New Stats Engine. 
It wasn&#8217;t a technical whim: the old engine, built around a classic fixed-horizon t-test (Fixed Horizon) and developed with statisticians &hellip; <a href=\"https:\/\/www.gironi.it\/blog\/en\/peeking-problem-ab-testing\/\" class=\"more-link\">Read more<span class=\"screen-reader-text\"> &#8220;The peeking problem: why sneaking a look at an A\/B test inflates false positives&#8221;<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_uag_custom_page_level_css":"","footnotes":""},"categories":[161],"tags":[],"class_list":["post-3847","post","type-post","status-publish","format-standard","hentry","category-statistics"],"lang":"en","translations":{"en":3847,"it":3846},"uagb_featured_image_src":{"full":false,"thumbnail":false,"medium":false,"medium_large":false,"large":false,"1536x1536":false,"2048x2048":false,"post-thumbnail":false},"uagb_author_info":{"display_name":"Paolo Gironi","author_link":"https:\/\/www.gironi.it\/blog\/author\/autore-articoli\/"},"uagb_comment_info":0,"uagb_excerpt":"On 21 January 2015 Optimizely \u2014 one of the most widely used A\/B testing platforms in the world \u2014 switched on a completely new statistical engine for all of its customers, the New Stats Engine. 
It wasn&#8217;t a technical whim: the old engine, built around a classic fixed-horizon t-test (Fixed Horizon) and developed with statisticians&hellip;","_links":{"self":[{"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/posts\/3847","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/comments?post=3847"}],"version-history":[{"count":1,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/posts\/3847\/revisions"}],"predecessor-version":[{"id":3854,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/posts\/3847\/revisions\/3854"}],"wp:attachment":[{"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/media?parent=3847"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/categories?post=3847"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/tags?post=3847"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}