  {"id":3679,"date":"2026-06-12T20:47:32","date_gmt":"2026-06-12T19:47:32","guid":{"rendered":"https:\/\/www.gironi.it\/blog\/?p=3679"},"modified":"2026-06-12T20:47:33","modified_gmt":"2026-06-12T19:47:33","slug":"ab-test-significance-calculator","status":"publish","type":"post","link":"https:\/\/www.gironi.it\/blog\/en\/ab-test-significance-calculator\/","title":{"rendered":"A\/B Test Significance Calculator"},"content":{"rendered":"<p>Our A\/B test has run its course: variant B shows a higher conversion rate than variant A. The temptation to declare a winner and ship the change is strong. But first there is a question to answer, the same one that runs through this whole series: <strong>is the difference we observe a real signal, or just statistical noise?<\/strong><\/p>\n<p>This calculator is the natural complement of the <a href=\"https:\/\/www.gironi.it\/blog\/en\/ab-test-sample-size-calculator\/\">sample size calculator<\/a>: that one works <em>before<\/em> the test and tells us how many users we need; this one works <em>after<\/em> and tells us whether the result we obtained is statistically significant. 
If you have read the article on <a href=\"https:\/\/www.gironi.it\/blog\/en\/hypothesis-testing-a-step-by-step-guide\/\">hypothesis testing<\/a>, you will recognise the machinery at once: behind the scenes sits a z-test for comparing two proportions.<\/p>\n<p><!--more--><\/p>\n<p>Using it is immediate: we enter visitors and conversions for the two variants, choose a significance level, and the calculator returns the p-value, a verdict, and the confidence interval of the difference.<\/p>\n<div style=\"border: 1px solid #ccc;padding: 1.2em 1.5em;margin: 1.5em 0;border-radius: 6px\">\n<h3 style=\"margin-top: 0\">Contents<\/h3>\n<ul>\n<li><a href=\"#calculator\">The calculator<\/a><\/li>\n<li><a href=\"#formula\">The formula: how the calculation works<\/a><\/li>\n<li><a href=\"#verify-r\">Let&#8217;s verify it in R<\/a><\/li>\n<li><a href=\"#interpret\">How to read the result (without being fooled)<\/a><\/li>\n<li><a href=\"#further\">Further reading<\/a><\/li>\n<\/ul>\n<\/div>\n<hr \/>\n<h2 id=\"calculator\">The calculator<\/h2>\n<p>The preloaded values are the ones we will work through step by step below: replace them with the numbers from your own test.<\/p>\n<style>\n.sg-calc{max-width:620px;margin:2em auto;padding:1.5em 2em;background:#f8f8f8;border:1px solid #ddd;border-radius:8px;font-family:inherit}\n.sg-calc h3{margin:0 0 1em;color:#333;font-size:1.2em}\n.sg-calc fieldset{border:1px solid #ddd;border-radius:6px;margin:0 0 1em;padding:0.6em 1em 1em;background:#fff}\n.sg-calc legend{font-weight:700;font-size:0.95em;color:#333;padding:0 0.4em}\n.sg-calc label{display:block;margin:0.6em 0 0.3em;font-weight:600;color:#333;font-size:0.9em}\n.sg-calc input[type=number],.sg-calc select{width:100%;padding:8px 10px;border:1px solid #ccc;border-radius:4px;font-size:1em;box-sizing:border-box;background:#fff}\n.sg-calc input[type=number]:focus,.sg-calc select:focus{outline:none;border-color:#0073aa;box-shadow:0 0 0 2px rgba(0,115,170,0.15)}\n.sg-calc 
.sg-row{display:flex;gap:1.2em}\n.sg-calc .sg-col{flex:1}\n.sg-calc .sg-result{margin-top:1.5em;padding:1.2em;background:#fff;border:2px solid #ccc;border-radius:6px;text-align:center}\n.sg-calc .sg-result.sg-si{border-color:#2ecc71}\n.sg-calc .sg-result.sg-no{border-color:#e67e22}\n.sg-calc .sg-verdict{font-size:1.25em;font-weight:700;display:block;margin:0.2em 0;color:#333}\n.sg-calc .sg-si .sg-verdict{color:#2ecc71}\n.sg-calc .sg-no .sg-verdict{color:#e67e22}\n.sg-calc .sg-pvalue{font-size:1.05em;color:#333;margin-top:0.4em}\n.sg-calc .sg-detail{font-size:0.9em;color:#666;margin-top:0.6em;line-height:1.6}\n.sg-calc .sg-warn{color:#e74c3c;font-size:0.85em;margin-top:0.5em;display:none}\n@media(max-width:520px){.sg-calc .sg-row{flex-direction:column;gap:0}.sg-calc{padding:1em 1.2em}}\n<\/style>\n<div class=\"sg-calc\" id=\"sgCalc\">\n<h3>Significance calculator<\/h3>\n<div class=\"sg-row\">\n<div class=\"sg-col\">\n<fieldset>\n<legend>Variant A (control)<\/legend>\n<p><label for=\"sgNA\">Visitors<\/label><br \/>\n<input type=\"number\" id=\"sgNA\" value=\"8500\" min=\"1\" step=\"1\"><br \/>\n<label for=\"sgCA\">Conversions<\/label><br \/>\n<input type=\"number\" id=\"sgCA\" value=\"204\" min=\"0\" step=\"1\"><br \/>\n<\/fieldset>\n<\/div>\n<div class=\"sg-col\">\n<fieldset>\n<legend>Variant B<\/legend>\n<p><label for=\"sgNB\">Visitors<\/label><br \/>\n<input type=\"number\" id=\"sgNB\" value=\"8300\" min=\"1\" step=\"1\"><br \/>\n<label for=\"sgCB\">Conversions<\/label><br \/>\n<input type=\"number\" id=\"sgCB\" value=\"251\" min=\"0\" step=\"1\"><br \/>\n<\/fieldset>\n<\/div>\n<\/div>\n<p><label for=\"sgAlpha\">Significance (\u03b1)<\/label><br \/>\n<select id=\"sgAlpha\"><option value=\"0.01\">0.01 (99%)<\/option><option value=\"0.05\" selected>0.05 (95%)<\/option><option value=\"0.10\">0.10 (90%)<\/option><\/select><\/p>\n<div class=\"sg-result\" id=\"sgResult\">\n<span class=\"sg-verdict\" id=\"sgVerdict\">\u2014<\/span><\/p>\n<div class=\"sg-pvalue\" 
id=\"sgPvalue\"><\/div>\n<div class=\"sg-detail\" id=\"sgDetail\"><\/div>\n<\/div>\n<div class=\"sg-warn\" id=\"sgWarn\"><\/div>\n<\/div>\n<p><script>\n(function(){\n\/\/ Standard normal CDF, using the Abramowitz-Stegun polynomial approximation of erf\nfunction normCdf(z) {\n\tconst x = Math.abs(z) \/ Math.SQRT2;\n\tconst t = 1 \/ (1 + 0.3275911 * x);\n\tconst erf = 1 - (((((1.061405429 * t - 1.453152027) * t) + 1.421413741) * t - 0.284496736) * t + 0.254829592) * t * Math.exp(-x * x);\n\tconst phi = 0.5 * (1 + erf);\n\treturn z >= 0 ? phi : 1 - phi;\n}\nconst Z975 = 1.959964; \/\/ qnorm(0.975), for the 95% CI, as in prop.test\nfunction testSignificativita(nA, cA, nB, cB) {\n\tconst interi = [nA, cA, nB, cB];\n\tif (interi.some(v => !Number.isFinite(v) || v < 0) || nA === 0 || nB === 0 || cA > nA || cB > nB) {\n\t\treturn { valido: false, avvisi: [] };\n\t}\n\tconst pA = cA \/ nA;\n\tconst pB = cB \/ nB;\n\tconst diff = pB - pA;\n\tconst pooled = (cA + cB) \/ (nA + nB);\n\tconst sePooled = Math.sqrt(pooled * (1 - pooled) * (1 \/ nA + 1 \/ nB));\n\tconst z = sePooled > 0 ? diff \/ sePooled : 0;\n\tconst pValue = sePooled > 0 ? 2 * (1 - normCdf(Math.abs(z))) : 1;\n\tconst seDiff = Math.sqrt(pA * (1 - pA) \/ nA + pB * (1 - pB) \/ nB);\n\tconst avvisi = [];\n\tfor (const [c, n, nome] of [[cA, nA, 'A'], [cB, nB, 'B']]) {\n\t\tif (c < 5 || n - c < 5) {\n\t\t\tavvisi.push(`Variant ${nome} has fewer than 5 conversions (or non-conversions): the normal approximation is unreliable with numbers this small.`);\n\t\t}\n\t}\n\treturn {\n\t\tvalido: true,\n\t\tpA, pB, diff,\n\t\tlift: pA > 0 ? 
diff \/ pA : null,\n\t\tz, pValue,\n\t\tciLow: diff - Z975 * seDiff,\n\t\tciHigh: diff + Z975 * seDiff,\n\t\tsignificativo: (alpha) => pValue < alpha,\n\t\tavvisi,\n\t};\n}\n  var SG_LABELS = {\n    sigYes: 'Statistically significant difference at %CONF%',\n    sigNo: 'Difference not significant at %CONF%',\n    lift: 'relative lift',\n    ci: '95% CI of the difference:',\n    pp: 'percentage points',\n    invalid: 'Let\\u2019s check the data: conversions cannot exceed visitors.',\n    warnSmall: 'Warning: one variant has fewer than 5 conversions (or non-conversions): with numbers this small the normal approximation is unreliable.'\n  };\n  var L = SG_LABELS;\n  function fmtP(p){ return p < 0.0001 ? '&lt; 0.0001' : p.toFixed(4); }\n  function calc(){\n    var nA=parseInt(document.getElementById('sgNA').value,10);\n    var cA=parseInt(document.getElementById('sgCA').value,10);\n    var nB=parseInt(document.getElementById('sgNB').value,10);\n    var cB=parseInt(document.getElementById('sgCB').value,10);\n    var alpha=parseFloat(document.getElementById('sgAlpha').value);\n    var conf=Math.round((1-alpha)*100)+'%';\n    var box=document.getElementById('sgResult');\n    var warn=document.getElementById('sgWarn');\n    warn.style.display='none'; warn.textContent='';\n    var r=testSignificativita(nA,cA,nB,cB);\n    if(!r.valido){\n      box.className='sg-result';\n      document.getElementById('sgVerdict').innerHTML='&mdash;';\n      document.getElementById('sgPvalue').textContent='';\n      document.getElementById('sgDetail').innerHTML=L.invalid;\n      return;\n    }\n    var sig=r.significativo(alpha);\n    box.className='sg-result '+(sig?'sg-si':'sg-no');\n    document.getElementById('sgVerdict').textContent=(sig?L.sigYes:L.sigNo).replace('%CONF%',conf);\n    document.getElementById('sgPvalue').innerHTML='p-value: <strong>'+fmtP(r.pValue)+'<\/strong> &nbsp;&middot;&nbsp; z = '+r.z.toFixed(3);\n    var 
liftTxt=r.lift===null?'&mdash;':(r.lift>=0?'+':'')+(r.lift*100).toFixed(1)+'%';\n    document.getElementById('sgDetail').innerHTML=\n      'CR A: <strong>'+(r.pA*100).toFixed(2)+'%<\/strong> &nbsp;&middot;&nbsp; CR B: <strong>'+(r.pB*100).toFixed(2)+'%<\/strong> &nbsp;&middot;&nbsp; '+L.lift+': <strong>'+liftTxt+'<\/strong><br \/>'+\n      L.ci+' [' + (r.ciLow*100).toFixed(2) + '; ' + (r.ciHigh*100).toFixed(2) + '] ' + L.pp;\n    if(r.avvisi.length){\n      warn.textContent=L.warnSmall;\n      warn.style.display='block';\n    }\n  }\n  ['sgNA','sgCA','sgNB','sgCB','sgAlpha'].forEach(function(id){\n    document.getElementById(id).addEventListener('input',calc);\n    document.getElementById(id).addEventListener('change',calc);\n  });\n  calc();\n})();\n<\/script><\/p>\n<hr \/>\n<h2 id=\"formula\">The formula: how the calculation works<\/h2>\n<p>The reasoning is the classic hypothesis-testing one. We start from the <strong>null hypothesis<\/strong>: the two variants convert at the same rate, and the observed difference is due to chance. Then we measure how &#8220;surprising&#8221; that difference would be if the null hypothesis were true: if it is too surprising, the null hypothesis does not hold.<\/p>\n<p>There are three protagonists:<\/p>\n<ul>\n<li><strong>p&#770;<sub>A<\/sub><\/strong> and <strong>p&#770;<sub>B<\/sub><\/strong>: the observed conversion rates of the two variants (conversions divided by visitors).<\/li>\n<li><strong>p&#770;<\/strong>: the <em>pooled<\/em> proportion, i.e. the overall conversion rate computed by combining the data from both variants. Why pooled? 
Under the null hypothesis the two proportions coincide, and the best estimate of that single proportion uses all the available data.<\/li>\n<li><strong>n<sub>A<\/sub><\/strong> and <strong>n<sub>B<\/sub><\/strong>: the visitors of the two variants.<\/li>\n<\/ul>\n<p>The test statistic measures the observed difference in standard-error units:<\/p>\n\\( z = \\frac{\\hat{p}_B - \\hat{p}_A}{\\sqrt{\\hat{p}(1-\\hat{p})\\left(\\frac{1}{n_A} + \\frac{1}{n_B}\\right)}} \\)\n<p>The denominator is the <strong>standard error of the difference<\/strong>: it tells us how much the gap between the two rates would fluctuate if we repeated the test many times in a world where the variants are identical. The resulting z is read on the standard normal distribution: the <strong>p-value<\/strong> is the probability, if the variants really were identical, of observing a difference at least this extreme in either direction. &#8220;In either direction&#8221; is not a footnote: the test is <strong>two-tailed<\/strong>, because before looking at the data we do not know whether B will do better or worse than A.<\/p>\n<p>The reference values are always the same:<\/p>\n<ul>\n<li>|z| &gt; 1.645 &rarr; significant at 90%<\/li>\n<li>|z| &gt; 1.96 &rarr; significant at 95%<\/li>\n<li>|z| &gt; 2.576 &rarr; significant at 99%<\/li>\n<\/ul>\n<p><strong>Let&#8217;s work through an example<\/strong>, with the numbers preloaded in the calculator. 
Variant A received 8,500 visitors and 204 conversions; variant B received 8,300 visitors and 251 conversions:<\/p>\n<ul>\n<li>p&#770;<sub>A<\/sub> = 204 \/ 8,500 = 0.0240 (2.40%)<\/li>\n<li>p&#770;<sub>B<\/sub> = 251 \/ 8,300 = 0.0302 (3.02%) &mdash; a +26% relative lift<\/li>\n<li>pooled p&#770; = (204 + 251) \/ (8,500 + 8,300) = 455 \/ 16,800 = 0.0271<\/li>\n<li>standard error = &radic;[0.0271 &times; 0.9729 &times; (1\/8,500 + 1\/8,300)] = 0.00250<\/li>\n<li>z = (0.0302 &minus; 0.0240) \/ 0.00250 = <strong>2.49<\/strong><\/li>\n<\/ul>\n<p>So: z = 2.49 clears the 1.96 threshold and the p-value is 0.0127. The difference is <strong>significant at 95%<\/strong> &mdash; but, as you can see, not at 99% (0.0127 &gt; 0.01). Same result, two different verdicts depending on how strict we chose to be: the significance level must be decided <em>before<\/em> looking at the data, not after.<\/p>\n<hr \/>\n<h2 id=\"verify-r\">Let&#8217;s verify it in R<\/h2>\n<p>We check the calculation in R with <code>prop.test<\/code>, switching off the continuity correction to stay aligned with the manual computation:<\/p>\n<pre>prop.test(c(251, 204), c(8300, 8500), correct = FALSE)\n\n\t2-sample test for equality of proportions\n\twithout continuity correction\n\ndata:  c(251, 204) out of c(8300, 8500)\nX-squared = 6.2075, df = 1, p-value = 0.01272\nalternative hypothesis: two.sided\n95 percent confidence interval:\n 0.001325762 0.011156166\nsample estimates:\n    prop 1     prop 2\n0.03024096 0.02400000<\/pre>\n<p>The numbers match: the p-value is the same as the manual calculation, and the X-squared statistic is simply our z squared (2.49&sup2; &asymp; 6.21 &mdash; the chi-square test on a 2&times;2 table and the z-test on two proportions are the same test). As a bonus, R hands us the <strong>confidence interval of the difference<\/strong>: between 0.13 and 1.12 percentage points. 
That is the most valuable piece of information of all, and here is why.<\/p>\n<hr \/>\n<h2 id=\"interpret\">How to read the result (without being fooled)<\/h2>\n<p><strong>Significant does not mean important.<\/strong> This must always be kept firmly in mind: with very large samples, even tiny, commercially irrelevant differences can become statistically significant. Significance tells us the difference is unlikely to be due to chance alone &mdash; not that it is <em>big<\/em>. To understand how big it is, we look at the confidence interval of the difference: in our example it runs from +0.13 to +1.12 percentage points. If even the lower bound justifies the effort of shipping the change, we can proceed with confidence; if the interval includes negligible values, the &#8220;significant&#8221; verdict alone is not enough.<\/p>\n<p><strong>The p-value holds if the test stops when planned.<\/strong> The calculation assumes the sample size was fixed in advance (with the <a href=\"https:\/\/www.gironi.it\/blog\/en\/ab-test-sample-size-calculator\/\">sample size calculator<\/a>) and that the test stops there. Checking the results every day and stopping at the first p-value below 0.05 &mdash; the infamous <em>peeking<\/em> &mdash; dramatically inflates false positives: it is like flipping a coin until three heads come up in a row and declaring the coin rigged. We covered this in the <a href=\"https:\/\/www.gironi.it\/blog\/en\/guide-to-statistical-tests-for-a-b-analysis\/\">guide to statistical tests for A\/B analysis<\/a>.<\/p>\n<p><strong>N.B.<\/strong>: the calculator uses a two-tailed test, the standard, prudent choice. 
One-tailed versions exist and &#8220;reward&#8221; a directional hypothesis with halved p-values, but they should be used only when the direction of the effect is genuinely known a priori &mdash; which, in everyday A\/B testing practice, is almost never.<\/p>\n<hr \/>\n<h3 id=\"further\">You might also like<\/h3>\n<ul>\n<li><a href=\"https:\/\/www.gironi.it\/blog\/en\/ab-test-sample-size-calculator\/\">A\/B Test Sample Size Calculator<\/a><\/li>\n<li><a href=\"https:\/\/www.gironi.it\/blog\/en\/guide-to-statistical-tests-for-a-b-analysis\/\">Guide to Statistical Tests for A\/B Analysis<\/a><\/li>\n<li><a href=\"https:\/\/www.gironi.it\/blog\/en\/hypothesis-testing-a-step-by-step-guide\/\">Hypothesis Testing: a Step-by-Step Guide<\/a><\/li>\n<\/ul>\n<hr \/>\n<p>The p-value answers a single question: <em>does the effect exist?<\/em> It does not tell us how large it is, nor whether it is worth shipping. For that we need two more tools &mdash; <strong>effect size<\/strong> and <strong>power analysis<\/strong> &mdash; and that is exactly where this series is headed next.<\/p>\n<hr \/>\n<h3>Further reading<\/h3>\n<p>The most complete reference on running online experiments rigorously remains <a href=\"https:\/\/www.amazon.it\/dp\/1108724264?tag=consulenzeinf-21\" rel=\"nofollow sponsored noopener\" target=\"_blank\"><em>Trustworthy Online Controlled Experiments<\/em><\/a> by Ron Kohavi, Diane Tang and Ya Xu: the chapter on the pitfalls of interpreting results (peeking included) is worth the price of the book on its own.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Our A\/B test has run its course: variant B shows a higher conversion rate than variant A. The temptation to declare a winner and ship the change is strong. 
But first there is a question to answer, the same one that runs through this whole series: is the difference we observe a real signal, or &hellip; <a href=\"https:\/\/www.gironi.it\/blog\/en\/ab-test-significance-calculator\/\" class=\"more-link\">Read more<span class=\"screen-reader-text\"> &#8220;A\/B Test Significance Calculator&#8221;<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_uag_custom_page_level_css":"","footnotes":""},"categories":[161],"tags":[],"class_list":["post-3679","post","type-post","status-publish","format-standard","hentry","category-statistics"],"lang":"en","translations":{"en":3679,"it":3678},"uagb_featured_image_src":{"full":false,"thumbnail":false,"medium":false,"medium_large":false,"large":false,"1536x1536":false,"2048x2048":false,"post-thumbnail":false},"uagb_author_info":{"display_name":"Paolo Gironi","author_link":"https:\/\/www.gironi.it\/blog\/author\/autore-articoli\/"},"uagb_comment_info":0,"uagb_excerpt":"Our A\/B test has run its course: variant B shows a higher conversion rate than variant A. The temptation to declare a winner and ship the change is strong. 
But first there is a question to answer, the same one that runs through this whole series: is the difference we observe a real signal, or&hellip;","_links":{"self":[{"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/posts\/3679","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/comments?post=3679"}],"version-history":[{"count":1,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/posts\/3679\/revisions"}],"predecessor-version":[{"id":3681,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/posts\/3679\/revisions\/3681"}],"wp:attachment":[{"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/media?parent=3679"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/categories?post=3679"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/tags?post=3679"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}