A/B Test Significance Calculator

Our A/B test has run its course: variant B shows a higher conversion rate than variant A. The temptation to declare a winner and ship the change is strong. But first there is a question to answer, the same one that runs through this whole series: is the difference we observe a real signal, or just statistical noise?

This calculator is the natural complement of the sample size calculator: that one works before the test and tells us how many users we need; this one works after and tells us whether the result we obtained is statistically significant. If you have read the article on hypothesis testing, you will recognise the machinery at once: behind the scenes sits a z-test for comparing two proportions.

Using it is immediate: we enter visitors and conversions for the two variants, choose a significance level, and the calculator returns the p-value, a verdict, and the confidence interval of the difference.


The calculator

The preloaded values are the ones we will work through step by step below: replace them with the numbers from your own test.

[Interactive significance calculator: input fields for Variant A (control) and Variant B]







The formula: how the calculation works

The reasoning is the classic hypothesis-testing one. We start from the null hypothesis: the two variants convert at the same rate, and the observed difference is due to chance. Then we measure how “surprising” that difference would be if the null hypothesis were true: if it is too surprising, the null hypothesis does not hold.

There are three protagonists:

  • \( \hat{p}_A \) and \( \hat{p}_B \): the observed conversion rates of the two variants (conversions divided by visitors).
  • \( \hat{p} \): the pooled proportion, i.e. the overall conversion rate computed by combining the data from both variants. Why pooled? Under the null hypothesis the two proportions coincide, and the best estimate of that single proportion uses all the available data.
  • \( n_A \) and \( n_B \): the visitors of the two variants.
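
Written out, the three quantities look like this. The symbols \( x_A \) and \( x_B \) for the number of conversions in each variant are introduced here only for illustration; they do not appear elsewhere in the article.

\( \hat{p}_A = \frac{x_A}{n_A}, \qquad \hat{p}_B = \frac{x_B}{n_B}, \qquad \hat{p} = \frac{x_A + x_B}{n_A + n_B} \)

Note that the pooled proportion is not the simple average of the two rates: it weights each variant by its number of visitors.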

The test statistic measures the observed difference in standard-error units:

\( z = \frac{\hat{p}_B - \hat{p}_A}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}} \)

The denominator is the standard error of the difference: it tells us how much the gap between the two rates would fluctuate if we repeated the test many times in a world where the variants are identical. The resulting z is read on the standard normal distribution: the p-value is the probability of observing a difference at least this extreme, in either direction, by pure chance. “In either direction” is not a footnote: the test is two-tailed, because before looking at the data we do not know whether B will do better or worse than A.
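
For readers who prefer code to formulas, here is a minimal sketch of the same calculation in R. The function name, its arguments and the input numbers are our own illustration, not the calculator's actual code; the only assumption is the pooled two-tailed z-test described above.

```r
# Two-proportion z-test with pooled standard error (two-tailed).
# Function and argument names are illustrative, not the calculator's own.
two_prop_z_test <- function(conv_a, n_a, conv_b, n_b) {
  p_a <- conv_a / n_a                          # observed rate, variant A
  p_b <- conv_b / n_b                          # observed rate, variant B
  p_pool <- (conv_a + conv_b) / (n_a + n_b)    # pooled proportion under H0
  se <- sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
  z <- (p_b - p_a) / se
  p_value <- 2 * pnorm(-abs(z))                # both tails of the standard normal
  list(p_a = p_a, p_b = p_b, se = se, z = z, p_value = p_value)
}

# Hypothetical input: 1,000 visitors per variant, 100 vs 120 conversions
res <- two_prop_z_test(conv_a = 100, n_a = 1000, conv_b = 120, n_b = 1000)
res$z
res$p_value
```

The expression `2 * pnorm(-abs(z))` is what makes the test two-tailed: it adds up the area in both tails beyond |z|.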

The reference values are always the same:

  • |z| > 1.645 → significant at 90%
  • |z| > 1.96 → significant at 95%
  • |z| > 2.576 → significant at 99%
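
These thresholds are not magic numbers: they are the quantiles of the standard normal distribution that leave half of the chosen error probability in each tail. A short sketch in R (variable names are ours):

```r
conf_levels <- c(0.90, 0.95, 0.99)
alpha <- 1 - conf_levels          # significance levels: 0.10, 0.05, 0.01

# Two-tailed test: alpha is split evenly between the two tails
z_crit <- qnorm(1 - alpha / 2)
round(z_crit, 3)
# [1] 1.645 1.960 2.576
```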

Let’s work through an example, with the numbers preloaded in the calculator. Variant A received 8,500 visitors and 204 conversions; variant B 8,300 visitors and 251 conversions:

  • p̂_A = 204 / 8,500 = 0.02400 (2.40%)
  • p̂_B = 251 / 8,300 = 0.03024 (3.02%), a +26% relative lift
  • pooled p̂ = (204 + 251) / (8,500 + 8,300) = 455 / 16,800 = 0.02708
  • standard error = √[0.02708 × 0.97292 × (1/8,500 + 1/8,300)] = 0.002505
  • z = (0.03024 − 0.02400) / 0.002505 = 2.49

So: z = 2.49 clears the 1.96 threshold and the p-value is 0.0127. The difference is significant at 95% — but, as you can see, not at 99% (0.0127 > 0.01). Same result, two different verdicts depending on how strict we chose to be: the significance level must be decided before looking at the data, not after.
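
To see the "same result, two verdicts" effect directly, we can compare the p-value of this example against several significance levels at once. This is only a sketch: the variable names are ours, and the calculation is the pooled two-tailed z-test from the formula above.

```r
# Counts from the worked example
conv_a <- 204; n_a <- 8500
conv_b <- 251; n_b <- 8300

p_pool <- (conv_a + conv_b) / (n_a + n_b)
se <- sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z <- (conv_b / n_b - conv_a / n_a) / se
p_value <- 2 * pnorm(-abs(z))

# Significant if the p-value falls below the chosen alpha
alphas <- c("90%" = 0.10, "95%" = 0.05, "99%" = 0.01)
verdict <- p_value < alphas
verdict
#   90%   95%   99%
#  TRUE  TRUE FALSE
```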


Let’s verify it in R

We can check the calculation in R with prop.test, switching off the continuity correction to stay aligned with the manual computation:

prop.test(c(251, 204), c(8300, 8500), correct = FALSE)

	2-sample test for equality of proportions
	without continuity correction

data:  c(251, 204) out of c(8300, 8500)
X-squared = 6.2075, df = 1, p-value = 0.01272
alternative hypothesis: two.sided
95 percent confidence interval:
 0.001325762 0.011156166
sample estimates:
    prop 1     prop 2
0.03024096 0.02400000

The numbers match: the p-value is the same as the manual calculation, and the X-squared statistic is simply our z squared (before rounding z ≈ 2.4915, and 2.4915² ≈ 6.21): the chi-square test on a 2×2 table and the z-test on two proportions are the same test. As a bonus, R hands us the 95% confidence interval of the difference: between 0.13 and 1.12 percentage points. That is the most valuable piece of information of all, and here is why.
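
The interval can also be rebuilt by hand. One detail is worth noticing: the test uses the pooled standard error, because it reasons under the null hypothesis, while the interval uses the unpooled one, because it makes no assumption that the two rates are equal. The sketch below (variable names are ours) reproduces the interval reported by prop.test with a simple normal-approximation formula.

```r
conv_a <- 204; n_a <- 8500
conv_b <- 251; n_b <- 8300

p_a <- conv_a / n_a
p_b <- conv_b / n_b
diff <- p_b - p_a

# Unpooled standard error: each variant contributes its own variance
se_unpooled <- sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)

z_crit <- qnorm(0.975)                       # 95% confidence, two-sided
ci <- diff + c(-1, 1) * z_crit * se_unpooled
round(100 * ci, 2)                           # in percentage points
# [1] 0.13 1.12
```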


How to read the result (without being fooled)

Significant does not mean important. This must always be kept firmly in mind: with very large samples, even tiny, commercially irrelevant differences become statistically significant. Significance tells us the difference is unlikely to be due to chance alone, not that it is big. To understand how big it is, we look at the confidence interval of the difference: in our example it runs from +0.13 to +1.12 percentage points. If even the lower bound justifies the effort of shipping the change, we can proceed with confidence; if the interval includes negligible values, the “significant” verdict alone is not enough.
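
A quick illustration with made-up numbers: suppose two variants convert at 2.40% and 2.45%, with five million visitors each. These figures are purely hypothetical, chosen to show what a very large sample does to a very small difference.

```r
# Hypothetical test: 5,000,000 visitors per variant,
# 2.40% vs 2.45% conversion (120,000 vs 122,500 conversions)
n <- 5e6
res <- prop.test(c(122500, 120000), c(n, n), correct = FALSE)

res$p.value                      # far below 0.001
round(100 * res$conf.int, 3)     # interval of the difference, in percentage points
```

The p-value is tiny, yet the whole confidence interval sits below a tenth of a percentage point. Whether a lift that small is worth shipping is a business question, and the p-value cannot answer it.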

The p-value holds if the test stops when planned. The calculation assumes the sample size was fixed in advance (with the sample size calculator) and that the test stops there. Checking the results every day and stopping at the first p-value below 0.05 — the infamous peeking — dramatically inflates false positives: it is like flipping a coin until three heads come up in a row and declaring the coin rigged. We covered this in the guide to statistical tests for A/B analysis.
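
A small simulation makes the damage visible. The sketch below runs many A/A tests, where both variants share the same true rate and any "significant" result is a false positive by construction. All parameters (3% conversion rate, 10,000 visitors per variant, 20 evenly spaced looks, 1,000 repetitions) are arbitrary assumptions chosen for illustration.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

n_sims <- 1000       # simulated A/A tests
n_per_arm <- 10000   # planned visitors per variant
n_looks <- 20        # interim checks, evenly spaced
true_rate <- 0.03    # same true conversion rate for A and B
look_at <- (1:n_looks) * (n_per_arm %/% n_looks)

# Pooled two-tailed z-test for equal-sized arms (vectorised over looks)
p_value_at <- function(conv_a, conv_b, n) {
  p_pool <- (conv_a + conv_b) / (2 * n)
  se <- sqrt(p_pool * (1 - p_pool) * 2 / n)
  z <- (conv_b / n - conv_a / n) / se
  2 * pnorm(-abs(z))
}

one_test <- function() {
  cum_a <- cumsum(rbinom(n_per_arm, 1, true_rate))
  cum_b <- cumsum(rbinom(n_per_arm, 1, true_rate))
  p <- p_value_at(cum_a[look_at], cum_b[look_at], look_at)
  c(peeking = any(p < 0.05, na.rm = TRUE),  # stop at the first p < 0.05
    fixed = p[n_looks] < 0.05)              # look only once, at the end
}

results <- replicate(n_sims, one_test())
false_positive_rate <- rowMeans(results)
false_positive_rate
```

The "fixed" rate should land close to the nominal 5%, while the "peeking" rate comes out several times higher, even though nothing at all differs between the two variants.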

N.B.: the calculator uses a two-tailed test, the standard, prudent choice. One-tailed versions exist and “reward” a directional hypothesis with halved p-values, but they should be used only when the direction of the effect is genuinely known a priori — which, in everyday A/B testing practice, is almost never.
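
The halving is easy to verify on the z of our example. In this sketch the one-tailed alternative is assumed to be "B converts better than A"; the variable names are ours.

```r
z <- 2.4915  # z from the worked example, before rounding to 2.49

p_two_tailed <- 2 * pnorm(-abs(z))   # difference in either direction
p_one_tailed <- pnorm(-z)            # only "B better than A" counts as evidence

round(c(two_tailed = p_two_tailed, one_tailed = p_one_tailed), 4)
```

The one-tailed p-value comes out at about 0.0064, which would slip under the 0.01 threshold that the two-tailed test fails to clear. The data are identical; only the question has changed, which is exactly why the choice must be made before the test and not after.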


The p-value answers a single question: does the effect exist? It does not tell us how large it is, nor whether it is worth shipping. For that we need two more tools — effect size and power analysis — and that is exactly where this series is headed next.


Further reading

The most complete reference on running online experiments rigorously remains Trustworthy Online Controlled Experiments by Ron Kohavi, Diane Tang and Ya Xu: the chapter on the pitfalls of interpreting results (peeking included) is worth the price of the book on its own.
