{"id":3693,"date":"2026-06-17T08:45:42","date_gmt":"2026-06-17T07:45:42","guid":{"rendered":"https:\/\/www.gironi.it\/blog\/?p=3693"},"modified":"2026-06-18T14:19:34","modified_gmt":"2026-06-18T13:19:34","slug":"effect-size-and-power-analysis","status":"publish","type":"post","link":"https:\/\/www.gironi.it\/blog\/en\/effect-size-and-power-analysis\/","title":{"rendered":"Effect Size and Power Analysis: How Big Is the Effect (and How Much Data You Need)"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">We closed the article on the <a href=\"https:\/\/www.gironi.it\/blog\/en\/ab-test-significance-calculator\/\">A\/B test significance calculator<\/a> with a promise. We said that the p-value answers a single question \u2014 <em>does the effect exist?<\/em> \u2014 and that, on its own, it adds nothing else. It does not tell us how large the effect is, nor whether it is worth the effort of shipping it. It is time to keep that promise, because the two questions the p-value leaves hanging are exactly what separates reading data with method from stopping at the first threshold that glitters.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The two questions have precise names. The first \u2014 <em>how big is it?<\/em> \u2014 is the <strong>effect size<\/strong>. The second \u2014 <em>with the data I have, could I even have seen an effect like this?<\/em> \u2014 is the <strong>power<\/strong> of the test, and the reasoning that gets us to an answer is called <strong>power analysis<\/strong>. We examine them one at a time, as always with an example at hand.<\/p>\n\n\n\n<!--more-->\n\n\n\n<h2 class=\"wp-block-heading\">Significant Doesn&#8217;t Mean Large<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Let&#8217;s start with a situation that comes up more often than people running online tests would like. Suppose we tried two title tags on a very high-traffic page and collected one million sessions per variant. 
Variant A has a CTR of 3.00%, variant B of 3.05%: five hundredths of a percentage point of difference. Let&#8217;s check in R whether the gap is statistically significant:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># one million sessions per variant, CTR 3.00% vs 3.05%\nprop.test(c(30000, 30500), c(1000000, 1000000), correct = FALSE)$p.value\n# [1] 0.03899<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The p-value is 0.039, below the 0.05 threshold. By the book, we should celebrate: the difference is &#8220;significant&#8221;. But let&#8217;s pause. Are we really about to rewrite the titles across the whole site to gain five hundredths of a point of CTR? That significant result hides an effect of laughable size, made detectable only by the sheer mass of data.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>This is the crux<\/strong>: with a large enough sample, <em>any<\/em> nonzero difference becomes statistically significant, even the most trivial one. The p-value tells us how surprising the observed data would be if the true effect were zero; it does not measure how large the effect is. They are two different things, and conflating them is the mistake that leads to chasing wins that leave no trace on revenue. Effect size exists precisely to put magnitude back at the centre.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Effect Size: Measuring the &#8220;How Much&#8221;<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The idea behind effect size is simple and, once seen, hard to forget: instead of asking only <em>whether<\/em> two groups differ, we measure <em>by how much<\/em> they differ, on a scale that does not depend on sample size. It is the difference between saying &#8220;B beats A&#8221; and saying &#8220;B beats A by half a standard deviation&#8221;. The first is news; the second is information you can decide on.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">There are several effect-size measures, each tailored to a type of comparison. 
We look closely at two \u2014 one for means, one for proportions \u2014 because they cover most of the everyday work; the others we mention briefly further on, with the right pointers.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Cohen&#8217;s d: the Effect Between Two Means<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">When we compare two means \u2014 the average time on page of two variants, the average session duration of two segments \u2014 the reference measure is <strong>Cohen&#8217;s d<\/strong>. The intuition is this: we take the difference between the two means and express it in &#8220;standard-deviation units&#8221;, so it becomes comparable across different contexts. A three-second difference weighs a lot if session durations vary by only a few seconds, and almost nothing if they vary by minutes.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In formula, Cohen&#8217;s d is the ratio between the difference of the means and the combined standard deviation of the two groups:<\/p>\n\n\n\n\\( d = \\frac{\\bar{x}_B - \\bar{x}_A}{s_p} \\\\ \\)\n\n\n\n<p class=\"wp-block-paragraph\">where <em>x\u0304<\/em><sub>A<\/sub> and <em>x\u0304<\/em><sub>B<\/sub> are the group means and <em>s<\/em><sub>p<\/sub> is the <strong>pooled standard deviation<\/strong>, the square root of a weighted average of the two variances, which brings together the internal variability of both groups:<\/p>\n\n\n\n\\( s_p = \\sqrt{\\frac{(n_A - 1)\\,s_A^2 + (n_B - 1)\\,s_B^2}{n_A + n_B - 2}} \\\\ \\)\n\n\n\n<p class=\"wp-block-paragraph\">with <em>n<\/em><sub>A<\/sub>, <em>n<\/em><sub>B<\/sub> the sample sizes and <em>s<\/em><sub>A<\/sub>, <em>s<\/em><sub>B<\/sub> the standard deviations of the two groups. The denominator of <em>d<\/em> is therefore nothing more than a principled way to fuse two variabilities into a single reference measure.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Let&#8217;s do an example. We measured session duration (in seconds) on two versions of a page, twelve sessions per version. 
We compute Cohen&#8217;s d in R with the <code>effsize<\/code> package, which does the maths and also returns the qualitative label:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>A &lt;- c(48, 55, 52, 60, 46, 58, 51, 57, 49, 54, 53, 50)  # version A\nB &lt;- c(50, 58, 52, 62, 49, 57, 60, 53, 61, 51, 59, 54)  # version B\n\nlibrary(effsize)\ncohen.d(B, A)\n\n# Cohen's d\n#\n# d estimate: 0.6254922 (medium)\n# 95 percent confidence interval:\n#      lower      upper\n# -0.2416187  1.4926030<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The estimated d is <strong>0.63<\/strong>, which <code>effsize<\/code> classifies as a <strong>medium<\/strong> effect. The conventional thresholds, proposed by Jacob Cohen, are 0.2 for a small effect, 0.5 for a medium one, 0.8 for a large one \u2014 but they should be taken for what they are: useful conventions to get oriented, not laws of nature. Cohen himself recommended interpreting them in light of one&#8217;s own field, not applying them blindly. <em>In everyday SEO practice<\/em>, a d of 0.63 on session duration is a change worth taking seriously.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">There is, however, a detail on which the rest of the article hinges, and it is already visible above: the confidence interval of d runs from \u22120.24 to 1.49. It crosses zero. In other words, with just twelve sessions per group, the <em>estimated<\/em> effect is medium, but the data are not enough to rule out that the <em>true<\/em> one is zero. And indeed, if we feed the same numbers to a t-test, we find anything but a reassuring p-value:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>t.test(B, A)\n#\n# \tWelch Two Sample t-test\n# t = 1.5321, df = 21.9, p-value = 0.1398<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">A medium effect that the test declares <em>not<\/em> significant. This is not a contradiction: it is exactly the phenomenon that the power of a test exists to explain. 
Let&#8217;s hold that thought; we will come back to it shortly.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Effect Size for Proportions (CTR and Conversions)<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Time on page is a mean, but the daily bread of anyone doing SEO is proportions: CTR, conversion rate, bounce rate. Here Cohen&#8217;s d does not apply directly, and the natural effect-size measure is <strong>Cohen&#8217;s h<\/strong>, built specifically for the difference between two proportions.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The technical detail that makes it reliable is a transformation \u2014 the arcsine of the square root of the proportion \u2014 that serves to stabilise the variability (in a proportion, variability depends on the value itself, and is greatest around 50%). The formula is:<\/p>\n\n\n\n\\( h = 2\\arcsin\\sqrt{p_2} - 2\\arcsin\\sqrt{p_1} \\\\ \\)\n\n\n\n<p class=\"wp-block-paragraph\">where <em>p<\/em><sub>1<\/sub> and <em>p<\/em><sub>2<\/sub> are the two proportions compared. There is no need to compute it by hand: the <code>ES.h<\/code> function of the <code>pwr<\/code> package computes it for us. But before seeing it at work it is worth introducing the other half of the story, because that is where Cohen&#8217;s h shines.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">First, though, let&#8217;s close the effect-size chapter with a brief mention of the other measures. When the groups compared are more than two \u2014 the classic ANOVA scenario \u2014 the typical measure is <strong>eta squared<\/strong> (\u03b7\u00b2), which tells us what fraction of the total variability is explained by the factor under study; we laid its foundations when discussing the <a href=\"https:\/\/www.gironi.it\/blog\/en\/analysis-of-variance-anova-explained-simply\/\">analysis of variance<\/a>. 
When instead the outcome is binary \u2014 converts \/ does not convert \u2014 effect size is often expressed as an <strong>odds ratio<\/strong>, the ratio between the odds of success in the two groups, the same object that governs <a href=\"https:\/\/www.gironi.it\/blog\/en\/logistic-regression-predicting-the-outcome-of-an-event\/\">logistic regression<\/a>. Different tools for different questions, but the underlying idea does not change: put a number on the magnitude, not just on the existence.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The Power of a Test: Could We Have Seen It?<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Let&#8217;s go back to our medium effect declared not significant. How can a d of 0.63 produce a p-value of 0.14? The answer lies in a concept that closes the inferential circle: the <strong>power<\/strong> of a test.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">When we run a hypothesis test we risk two kinds of error. The first, the type I error, is declaring an effect that isn&#8217;t there: we keep it under control with the threshold \u03b1 (usually 0.05). The second, the type II error, is its opposite and far more insidious: <em>failing to see<\/em> an effect that is in fact there. The probability of committing it is denoted by \u03b2, and <strong>power<\/strong> is its complement:<\/p>\n\n\n\n\\( \\text{power} = 1 - \\beta \\\\ \\)\n\n\n\n<p class=\"wp-block-paragraph\">Put more plainly, power is the probability of detecting an effect when it truly exists. A power of 0.80 \u2014 the standard people aim for \u2014 means that, if the effect exists at the hypothesised size, our test detects it four times out of five.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The crucial point is that power, the threshold \u03b1, effect size and sample size are not four independent knobs: they are <strong>bound by a constraint<\/strong>. Fix three of these values, and the fourth is determined. 
This is the entire idea of power analysis, and it is what makes it so useful: depending on which unknown we leave free, it answers two different operational questions.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">And here is why our medium effect stayed invisible. With twelve sessions per group the power of the test was low (roughly 0.3 for a d of 0.63): the test was, quite simply, nearly <em>blind<\/em>. A non-significant result, under these conditions, does not say &#8220;the effect isn&#8217;t there&#8221;; it says &#8220;I didn&#8217;t have good enough eyes to see it&#8221;. Confusing the two is one of the most expensive mistakes you can make reading an A\/B test.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Power Analysis in R: How Much Data You Need<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The first question power analysis can settle is the one every test should face <em>before<\/em> starting: how much data do I need? Let&#8217;s pick up our medium effect again. To design a test able to detect a d of 0.63 with power 0.80 and threshold 0.05, we can use the <code>pwr<\/code> package in R:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>library(pwr)\npwr.t.test(d = 0.63, sig.level = 0.05, power = 0.80, type = \"two.sample\")\n#\n#      Two-sample t test power calculation\n#               n = 40.53396\n#               d = 0.63\n#       sig.level = 0.05\n#           power = 0.8\n#     alternative = two.sided\n# NOTE: n is number in *each* group<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">We would need about <strong>41 sessions per group<\/strong>, not twelve. That is why our test was mute: it was looking for a medium effect with less than a third of the data required. 
Power analysis, done <em>upstream<\/em>, would have spared us an inconclusive test \u2014 and it is exactly the reasoning behind the <a href=\"https:\/\/www.gironi.it\/blog\/en\/ab-test-sample-size-calculator\/\">sample size calculator<\/a>: sample size and power are two sides of the same coin.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The second question is the mirror image and comes up <em>after the fact<\/em>, once the test is done: with the data I had, how much power did I really have? We see it better on a concrete case.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">A Practical Case: the A\/B Test That &#8220;Didn&#8217;t Work&#8221;<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Suppose we tested two landing pages. A converted 60 visitors out of 1,500 (4.0%), B converted 78 out of 1,500 (5.2%). At a glance B looks clearly better \u2014 a point and two tenths of conversion more is not nothing. Let&#8217;s check in R whether the difference holds:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>prop.test(c(60, 78), c(1500, 1500), correct = FALSE)\n#\n# \t2-sample test for equality of proportions\n# X-squared = 2.461, df = 1, p-value = 0.1167<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The p-value is 0.117: above 0.05. By-the-book verdict: difference not significant, test failed, file it away. But now we know better than to stop here. Let&#8217;s compute the power that test actually had, starting from the observed effect size:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>library(pwr)\nh &lt;- ES.h(0.052, 0.040)   # Cohen's h between the two proportions\nh\n# [1] 0.0574024\n\npwr.2p.test(h = h, n = 1500, sig.level = 0.05)\n#               power = 0.3492384<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Power was <strong>0.35<\/strong>. In other words: even if B had genuinely been better by that much, we had a little over one chance in three of noticing it. The test did not &#8220;prove the two pages are equal&#8221;: it was simply too weak to rule. 
And how much data would have been needed to reach decent power?<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>pwr.2p.test(h = h, power = 0.80, sig.level = 0.05)\n#               n = 4764.053<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Almost <strong>4,800 visitors per variant<\/strong>, against the 1,500 we had. The difference between a test that &#8220;didn&#8217;t work&#8221; and a test never really in a position to work is all here \u2014 and you only see it if you pair power with effect size. <strong>Beware<\/strong>, then, of downgrading a non-significant result to &#8220;no effect&#8221;: very often we are merely looking at an underpowered test.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Try It Yourself<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">To make the mechanism stick, here is an exercise with realistic data. We are designing an A\/B test on a contact form. The current conversion rate (baseline) is <strong>2.5%<\/strong>, and we would count it a success to bring it to <strong>3.0%<\/strong>: half a percentage point of improvement. We want a test with power 0.80 and threshold 0.05.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The task: compute the effect size with <code>ES.h(0.030, 0.025)<\/code>, pass it to <code>pwr.2p.test<\/code> setting <code>power = 0.80<\/code>, and read off how many visitors per variant are needed. Then, as a cross-check, compute the power we would have if we stopped at 3,000 visitors per variant with <code>pwr.2p.test(h = ..., n = 3000, ...)<\/code>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To check your work: the effect size is <em>h<\/em> = 0.031, about <strong>16,759 visitors per variant<\/strong> are needed for a power of 0.80, and with only 3,000 the power would collapse to <strong>0.22<\/strong>. 
The moral is the one we now know: the smaller the effect we are chasing, the more data we need to see it \u2014 halving the minimum detectable difference does not double the sample required, it quadruples it.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p class=\"wp-block-paragraph\">Effect size and power complete the triad that the p-value, on its own, left unfinished: no longer just <em>does the effect exist?<\/em>, but also <em>how big is it?<\/em> and <em>could I have seen it?<\/em>. These are the three questions that turn a test from a propitiatory rite into a decision tool. And all three, on closer inspection, depend on a choice that comes <em>before<\/em> the test: how much data to collect, and how. That is the terrain of experimental design and <a href=\"https:\/\/www.gironi.it\/blog\/en\/sampling-and-sample-size-how-much-data-do-you-really-need\/\">sampling<\/a> \u2014 the point where statistics stops merely judging the numbers we put in front of it and begins to tell us which numbers to go and look for.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">Further Reading<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">On the rigorous use of effect size, power and sizing in the context of online experiments, the most complete reference remains <a href=\"https:\/\/www.amazon.it\/dp\/1108724264?tag=consulenzeinf-21&#038;ascsubtag=effect-size-and-power-analysis\" rel=\"nofollow sponsored noopener\" target=\"_blank\"><em>Trustworthy Online Controlled Experiments<\/em><\/a> by Ron Kohavi, Diane Tang and Ya Xu: the chapters on how to size a test and interpret its results are worth the purchase on their own. 
For an accessible take on the statistical reasoning behind these ideas \u2014 uncertainty, error, inference \u2014 <a href=\"https:\/\/www.amazon.it\/dp\/0241258766?tag=consulenzeinf-21&#038;ascsubtag=effect-size-and-power-analysis\" rel=\"nofollow sponsored noopener\" target=\"_blank\"><em>The Art of Statistics<\/em><\/a> by David Spiegelhalter is hard to beat.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>We closed the article on the A\/B test significance calculator with a promise. We said that the p-value answers a single question \u2014 does the effect exist? \u2014 and that, on its own, it adds nothing else. It does not tell us how large the effect is, nor whether it is worth the effort of &hellip; <a href=\"https:\/\/www.gironi.it\/blog\/en\/effect-size-and-power-analysis\/\" class=\"more-link\">Read more<span class=\"screen-reader-text\"> &#8220;Effect Size and Power Analysis: How Big Is the Effect (and How Much Data You Need)&#8221;<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_uag_custom_page_level_css":"","footnotes":""},"categories":[161],"tags":[],"class_list":["post-3693","post","type-post","status-publish","format-standard","hentry","category-statistics"],"lang":"en","translations":{"en":3693,"it":3691},"uagb_featured_image_src":{"full":false,"thumbnail":false,"medium":false,"medium_large":false,"large":false,"1536x1536":false,"2048x2048":false,"post-thumbnail":false},"uagb_author_info":{"display_name":"Paolo Gironi","author_link":"https:\/\/www.gironi.it\/blog\/author\/autore-articoli\/"},"uagb_comment_info":0,"uagb_excerpt":"We closed the article on the A\/B test significance calculator with a promise. We said that the p-value answers a single question \u2014 does the effect exist? \u2014 and that, on its own, it adds nothing else. 
It does not tell us how large the effect is, nor whether it is worth the effort of&hellip;","_links":{"self":[{"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/posts\/3693","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/comments?post=3693"}],"version-history":[{"count":3,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/posts\/3693\/revisions"}],"predecessor-version":[{"id":3707,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/posts\/3693\/revisions\/3707"}],"wp:attachment":[{"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/media?parent=3693"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/categories?post=3693"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/tags?post=3693"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}