Effect Size and Power Analysis: a Practical Guide in R

We closed the article on the A/B test significance calculator with a promise. We said that the p-value answers a single question — does the effect exist? — and that, on its own, it adds nothing else. It does not tell us how large the effect is, nor whether it is worth the effort of shipping it. It is time to keep that promise, because the two questions the p-value leaves hanging are exactly what separates reading data with method from stopping at the first threshold that glitters.

The two questions have precise names. The first — how big is it? — is the effect size. The second — with the data I have, could I even have seen an effect like this? — is the power of the test, and the reasoning that gets us to an answer is called power analysis. We examine them one at a time, as always with an example at hand.

Significant Doesn’t Mean Large

Let’s start with a situation that comes up more often than people running online tests would like. Suppose we tried two title tags on a very high-traffic page and collected one million sessions per variant. Variant A has a CTR of 3.00%, variant B of 3.05%: five hundredths of a percentage point of difference. Let’s check in R whether the gap is statistically significant:

# one million sessions per variant, CTR 3.00% vs 3.05%
prop.test(c(30000, 30500), c(1000000, 1000000), correct = FALSE)$p.value
# [1] 0.03899

The p-value is 0.039, below the 0.05 threshold. By the book, we should celebrate: the difference is “significant”. But let’s pause. Are we really about to rewrite the titles across the whole site to gain five hundredths of a point of CTR? That significant result hides an effect of laughable size, made detectable only by the sheer mass of data.

This is the crux: with a large enough sample, any nonzero difference becomes statistically significant, even the most trivial one. The p-value measures how hard the data are to reconcile with an effect of exactly zero; it does not measure how large the effect is. They are two different things, and conflating them is the mistake that leads to chasing wins that leave no trace on revenue. Effect size exists precisely to put magnitude back at the centre.
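To see the mechanism in numbers, here is a minimal sketch in base R that keeps the CTR gap fixed at 3.00% versus 3.05% and changes only the number of sessions per variant. The sample sizes in `n` are illustrative choices, not data from a real test.

```r
# Same CTR gap (3.00% vs 3.05%) tested at growing sample sizes.
# The sample sizes below are illustrative assumptions.
n <- c(1e4, 1e5, 1e6, 1e7)

p_values <- sapply(n, function(k) {
  clicks <- round(c(0.0300, 0.0305) * k)   # clicks for variants A and B
  prop.test(clicks, c(k, k), correct = FALSE)$p.value
})

data.frame(sessions_per_variant = n, p_value = signif(p_values, 3))
```

The observed difference never changes; only the p-value does, shrinking as the sample grows. The one-million row reproduces the 0.039 seen above.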

Effect Size: Measuring the “How Much”

The idea behind effect size is simple and, once seen, hard to forget: instead of asking only whether two groups differ, we measure by how much they differ, on a scale that does not depend on sample size. It is the difference between saying “B beats A” and saying “B beats A by half a standard deviation”. The first is news; the second is information you can decide on.

There are several effect-size measures, each tailored to a type of comparison. We look closely at two — one for means, one for proportions — because they cover most of the everyday work; the others we mention briefly at the end, with the right pointers.

Cohen’s d: the Effect Between Two Means

When we compare two means (the average time on page of two variants, the average session duration of two segments), the reference measure is Cohen’s d. The intuition is this: we take the difference between the two means and express it in “standard-deviation units”, so it becomes comparable across different contexts. A three-second difference weighs a lot if session durations differ from one another by only a few seconds, and almost nothing if they vary by minutes.

In formula, Cohen’s d is the ratio between the difference of the means and the combined standard deviation of the two groups:

\( d = \frac{\bar{x}_B - \bar{x}_A}{s_p} \)

where x̄_A and x̄_B are the group means and s_p is the pooled standard deviation, the square root of a weighted average of the two variances, which brings together the internal variability of both groups:

\( s_p = \sqrt{\frac{(n_A - 1)\,s_A^2 + (n_B - 1)\,s_B^2}{n_A + n_B - 2}} \)

with n_A, n_B the sample sizes and s_A, s_B the standard deviations of the two groups. The denominator is nothing more than the correct way to fuse two variabilities into a single reference measure.

Let’s do an example. We measured session duration (in seconds) on two versions of a page, twelve sessions per version. We compute Cohen’s d in R using the effsize package, which does the maths and also returns the qualitative label:

A <- c(48, 55, 52, 60, 46, 58, 51, 57, 49, 54, 53, 50)  # version A
B <- c(50, 58, 52, 62, 49, 57, 60, 53, 61, 51, 59, 54)  # version B

library(effsize)
cohen.d(B, A)

# Cohen's d
#
# d estimate: 0.6254922 (medium)
# 95 percent confidence interval:
#      lower      upper
# -0.2416187  1.4926030

The estimated d is 0.63, which effsize classifies as a medium effect. The conventional thresholds, proposed by Jacob Cohen, are 0.2 for a small effect, 0.5 for a medium one, 0.8 for a large one — but they should be taken for what they are: useful conventions to get oriented, not laws of nature. Cohen himself recommended interpreting them in light of one’s own field, not applying them blindly. In everyday SEO practice, a d of 0.63 on session duration is a change worth taking seriously.
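To check that nothing mysterious is going on inside the package, the two formulas given earlier can be applied by hand in base R. This is only a sketch of the arithmetic on the same twelve sessions per version; the variable names are ours.

```r
# Session durations from the example above
A <- c(48, 55, 52, 60, 46, 58, 51, 57, 49, 54, 53, 50)  # version A
B <- c(50, 58, 52, 62, 49, 57, 60, 53, 61, 51, 59, 54)  # version B

n_A <- length(A)
n_B <- length(B)

# Pooled standard deviation: square root of the weighted average of the variances
s_p <- sqrt(((n_A - 1) * var(A) + (n_B - 1) * var(B)) / (n_A + n_B - 2))

# Cohen's d: difference of the means expressed in pooled-SD units
d <- (mean(B) - mean(A)) / s_p
round(d, 4)
# [1] 0.6255
```

The means differ by 2.75 seconds and the pooled standard deviation is about 4.4 seconds, which is where the 0.63 comes from.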

There is, however, a detail worth the whole rest of the article, and it is already visible above: the confidence interval of d runs from −0.24 to 1.49. It crosses zero. In other words, with just twelve sessions per group, the estimated effect is medium, but the data are not enough to rule out that the true one is null. And indeed, if we feed the same numbers to a t-test, we find anything but a reassuring p-value:

t.test(B, A)
#
# 	Welch Two Sample t-test
# t = 1.5321, df = 21.9, p-value = 0.1398

A medium effect that the test declares not significant. This is not a contradiction: it is exactly the phenomenon that the power of a test exists to explain. Let’s hold that thought; we will come back to it shortly.

Effect Size for Proportions (CTR and Conversions)

Time on page is a mean, but the daily bread of anyone doing SEO is proportions: CTR, conversion rate, bounce rate. Here Cohen’s d does not apply directly, and the natural effect-size measure is Cohen’s h, built specifically for the difference between two proportions.

The technical detail that makes it reliable is a transformation — the arcsine of the square root of the proportion — that serves to stabilise the variability (in a proportion, variability depends on the value itself, and is greatest around 50%). The formula is:

\( h = 2\arcsin\sqrt{p_2} - 2\arcsin\sqrt{p_1} \)

where p₁ and p₂ are the two proportions compared. There is no need to compute it by hand: the ES.h function of the pwr package gives it to us. But before seeing it at work it is worth introducing the other half of the story, because that is where Cohen’s h shines.
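For readers who like to see a formula as arithmetic first, here is a minimal sketch of it in base R. The helper name `cohen_h` is hypothetical, not a function from pwr, and it follows the order of the formula above (second proportion minus first). As far as we can tell, ES.h subtracts its second argument from its first, so with either tool the sign simply depends on the order in which the proportions are passed.

```r
# Cohen's h following the formula above: h = 2*asin(sqrt(p2)) - 2*asin(sqrt(p1)).
# `cohen_h` is a hypothetical helper name, not a function from pwr.
cohen_h <- function(p1, p2) {
  stopifnot(p1 >= 0, p1 <= 1, p2 >= 0, p2 <= 1)
  2 * asin(sqrt(p2)) - 2 * asin(sqrt(p1))
}

# The CTR example from the opening section: 3.00% vs 3.05%
h_low <- cohen_h(0.0300, 0.0305)

# The same raw gap of 0.05 points, but around a 50% baseline
h_mid <- cohen_h(0.5000, 0.5005)

c(around_3_percent = h_low, around_50_percent = h_mid)
```

The same raw gap gives a larger h near 3% than near 50%: that is the arcsine transformation at work, weighting a difference by how variable a proportion is at that level.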

First, though, let’s close the effect-size chapter with an honest mention of the other measures. When the groups compared are more than two — the classic ANOVA scenario — the typical measure is eta squared (η²), which tells what fraction of the total variability is explained by the factor under study; we laid its foundations when discussing the analysis of variance. When instead the outcome is binary — converts / does not convert — effect size is often expressed as an odds ratio, the ratio between the odds of success, the same object that governs logistic regression. Different tools for different questions, but the underlying idea does not change: put a number on the magnitude, not just on the existence.

The Power of a Test: Could We Have Seen It?

Let’s go back to our medium effect declared not significant. How can a d of 0.63 produce a p-value of 0.14? The answer lies in a concept that closes the inferential circle: the power of a test.

When we run a hypothesis test we risk two kinds of error. The first, the type I error, is crying out for an effect that isn’t there: we keep it under control with the threshold α (usually 0.05). The second, the type II error, is its opposite and far more insidious: failing to see an effect that is in fact there. The probability of committing it is denoted by β, and power is its complement:

\( \text{power} = 1 - \beta \)

Put more plainly, power is the probability of noticing a real effect when it truly exists. A power of 0.80 — the standard people aim for — means that, if the effect exists at the hypothesised size, our test detects it four times out of five.

The crucial point is that power, the threshold α, effect size and sample size are not four independent knobs: they are bound by a constraint. Fix three of these values, and the fourth is determined. This is the entire idea of power analysis, and it is what makes it so useful: depending on which unknown we leave free, it answers two different operational questions.

And here is why our medium effect stayed invisible. With twelve sessions per group the power of the test was low, roughly 0.3 for an effect of this size: the test was, quite simply, short-sighted. A non-significant result, under these conditions, does not say “the effect isn’t there”; it says “I didn’t have good enough eyes to see it”. Confusing the two is one of the most expensive mistakes you can make reading an A/B test.
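Power can also be seen rather than computed. The sketch below simulates many experiments in which the effect really exists (a true d of 0.63, twelve sessions per group) and counts how often a t-test reaches p < 0.05. Normally distributed data, the seed and the number of replications are assumptions of the sketch, and `simulate_power` is a name of our own.

```r
# Monte Carlo estimate of power: how often does a t-test reach p < 0.05
# when the true effect really is d = 0.63 and each group has 12 sessions?
# Normality and the number of replications are assumptions of this sketch.
set.seed(42)

simulate_power <- function(d, n, alpha = 0.05, reps = 5000) {
  significant <- replicate(reps, {
    a <- rnorm(n, mean = 0, sd = 1)   # control group
    b <- rnorm(n, mean = d, sd = 1)   # treated group, shifted by d standard deviations
    t.test(b, a)$p.value < alpha
  })
  mean(significant)                   # share of simulated tests that detected the effect
}

power_12 <- simulate_power(d = 0.63, n = 12)
power_12
```

The share of simulated tests that come out significant is the power: in most of these experiments the effect is there and the test still misses it.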

Power Analysis in R: How Much Data You Need

The first question power analysis can settle is the one every test should face before starting: how much data do I need? Let’s pick up our medium effect again. Suppose we want to design a test able to detect a d of 0.63 with power 0.80 and threshold 0.05. We compute the required sample size in R with the pwr package:

library(pwr)
pwr.t.test(d = 0.63, sig.level = 0.05, power = 0.80, type = "two.sample")
#
#      Two-sample t test power calculation
#               n = 40.53396
#               d = 0.63
#       sig.level = 0.05
#           power = 0.8
#     alternative = two.sided
# NOTE: n is number in *each* group

We would need about 41 sessions per group, not twelve. That is why our test was mute: it was looking for a medium effect with less than a third of the data required. Power analysis, done upstream, would have spared us an inconclusive test. It is exactly the reasoning behind the sample size calculator: sample size and power are two sides of the same coin.
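The constraint between the four quantities can be tried out directly: leave a different one unspecified each time and let R solve for it. The sketch below uses power.t.test from base R's stats package rather than pwr, so the figures may differ from pwr's in the last decimals; setting sd = 1 is our way of expressing `delta` in Cohen's d units.

```r
# Same design, a different unknown each time (base R, stats package).
# With sd = 1, `delta` is expressed directly in Cohen's d units.

# 1) Unknown: sample size per group for d = 0.63 at power 0.80
n_needed <- power.t.test(delta = 0.63, sd = 1, sig.level = 0.05, power = 0.80)$n

# 2) Unknown: power with the twelve sessions per group we actually had
power_at_12 <- power.t.test(n = 12, delta = 0.63, sd = 1, sig.level = 0.05)$power

# 3) Unknown: smallest d detectable with n = 12 at power 0.80
d_min <- power.t.test(n = 12, sd = 1, sig.level = 0.05, power = 0.80)$delta

c(n_needed = n_needed, power_at_12 = power_at_12, d_min = d_min)
```

The third value is the most instructive in hindsight: with twelve sessions per group, only an effect far larger than our 0.63 could have been detected reliably.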

The second question is the mirror image and comes up after the fact, once the test is done: with the data I had, how much power did I really have? We see it better on a concrete case.

A Practical Case: the A/B Test That “Didn’t Work”

Suppose we tested two landing pages. A converted 60 visitors out of 1,500 (4.0%), B converted 78 out of 1,500 (5.2%). At a glance B looks clearly better: 1.2 percentage points of extra conversion is not nothing. Let’s check in R whether the difference holds:

prop.test(c(60, 78), c(1500, 1500), correct = FALSE)
#
# 	2-sample test for equality of proportions
# X-squared = 2.461, df = 1, p-value = 0.1167

The p-value is 0.117: above 0.05. By-the-book verdict: difference not significant, test failed, file it away. But now we know better than to stop here. Let’s compute the power that test actually had, starting from the observed effect size:

library(pwr)
h <- ES.h(0.052, 0.040)   # Cohen's h between the two proportions
h
# [1] 0.0574024

pwr.2p.test(h = h, n = 1500, sig.level = 0.05)
#               power = 0.3492384

Power was 0.35. In other words: even if B had genuinely been better by that much, we had a little over one chance in three of noticing it. The test did not “prove the two pages are equal”: it was simply too weak to rule. And how much data would have been needed to reach decent power?

pwr.2p.test(h = h, power = 0.80, sig.level = 0.05)
#               n = 4764.053

Almost 4,800 visitors per variant, against the 1,500 we had. The difference between a test that “didn’t work” and a test never really in a position to work is all here, and you only see it if you pair power with effect size. Beware, then, of downgrading a non-significant result to “no effect”: very often we are merely looking at an underpowered test.
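The two figures can be rebuilt by hand, which helps to see what drives them. The sketch below assumes the usual normal approximation on the arcsine scale, with equal group sizes and a two-sided test; we believe this is close to what pwr.2p.test does internally, but treat that as an assumption. The helper names are hypothetical.

```r
# Normal-approximation power for comparing two proportions on the arcsine scale.
# Helper names are hypothetical; equal group sizes and a two-sided test are assumed.
power_2p <- function(h, n, alpha = 0.05) {
  z_crit <- qnorm(1 - alpha / 2)
  shift  <- abs(h) * sqrt(n / 2)          # expected z statistic under the true effect
  pnorm(shift - z_crit) + pnorm(-shift - z_crit)
}

n_for_power <- function(h, power = 0.80, alpha = 0.05) {
  # Closed form that ignores the negligible far tail
  2 * ((qnorm(1 - alpha / 2) + qnorm(power)) / h)^2
}

h <- 2 * asin(sqrt(0.052)) - 2 * asin(sqrt(0.040))
power_2p(h, n = 1500)      # about 0.35
n_for_power(h)             # about 4,764 per variant
```

Everything hinges on the product of h and the square root of n / 2: a small effect needs a large sample before that product clears the critical value.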

Try It Yourself

To make the mechanism stick, here is an exercise with realistic data. We are designing an A/B test on a contact form. The current conversion rate (baseline) is 2.5%, and we would count it a success to bring it to 3.0%: half a point of improvement. We want a test with power 0.80 and threshold 0.05.

The task: compute the effect size with ES.h(0.030, 0.025), pass it to pwr.2p.test setting power = 0.80, and read off how many visitors per variant are needed. Then, as a cross-check, compute the power we would have if we stopped at 3,000 visitors per variant with pwr.2p.test(h = ..., n = 3000, ...).

To check your work: the effect size is h = 0.031, about 16,759 visitors per variant are needed for a power of 0.80, and with only 3,000 the power would collapse to 0.22. The moral is the one we now know: the smaller the effect we are chasing, the more data we need to see it — halving the minimum detectable difference does not double the sample required, it quadruples it.
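The quadrupling follows from the usual normal-approximation formula for the sample size per variant, which we state here as an assumption rather than derive:

\( n \approx 2\left(\frac{z_{1-\alpha/2} + z_{1-\beta}}{h}\right)^2 \)

Replacing h with h/2 gives

\( n' \approx 2\left(\frac{z_{1-\alpha/2} + z_{1-\beta}}{h/2}\right)^2 = 4\,n \)

With α = 0.05 and power 0.80 the numerator is about 1.96 + 0.84 = 2.80, so for h = 0.0306 we get n ≈ 2 × (2.80 / 0.0306)² ≈ 16,760, in line with the figure above. For small changes h is roughly proportional to the difference between the two proportions, which is why halving that difference roughly halves h.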

Effect size and power complete the triad that the p-value, on its own, left unfinished: no longer just “does the effect exist?”, but also “how big is it?” and “could I have seen it?”. These are the three questions that turn a test from a propitiatory rite into a decision tool. And all three, on closer inspection, depend on a choice that comes before the test: how much data to collect, and how. That is the terrain of experimental design and sampling: the point where statistics stops merely judging the numbers we put in front of it and begins to tell us which numbers to go and look for.
