<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>statistics &#8211; paologironi blog</title>
	<atom:link href="https://www.gironi.it/blog/en/category/statistics/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.gironi.it/blog</link>
	<description>Scattered notes on (retro) computing, data analysis, statistics, SEO, and things that change</description>
	<lastBuildDate>Fri, 13 Mar 2026 08:07:30 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	
	<item>
		<title>The Monte Carlo Method Explained Simply with Real-World Applications</title>
		<link>https://www.gironi.it/blog/en/the-monte-carlo-method-explained-simply-with-real-world-applications/</link>
					<comments>https://www.gironi.it/blog/en/the-monte-carlo-method-explained-simply-with-real-world-applications/#respond</comments>
		
		<dc:creator><![CDATA[autore-articoli]]></dc:creator>
		<pubDate>Wed, 11 Mar 2026 14:49:05 +0000</pubDate>
				<category><![CDATA[statistics]]></category>
		<guid isPermaLink="false">https://www.gironi.it/blog/?p=3512</guid>

					<description><![CDATA[What is the Monte Carlo method The story of the Monte Carlo method begins in the most unlikely way: with a mathematician in bed playing cards. In 1946, Stanisław Ulam, a Polish mathematician recovering from surgery, found himself playing solitaire to pass the time. Being a mathematician, he wondered: what are the chances of winning &#8230; <a href="https://www.gironi.it/blog/en/the-monte-carlo-method-explained-simply-with-real-world-applications/" class="more-link">Continue reading<span class="screen-reader-text"> "The Monte Carlo Method Explained Simply with Real-World Applications"</span></a>]]></description>
										<content:encoded><![CDATA[



<p><!-- ============================================================ --><br><!-- SECTION 1: WHAT IS THE MONTE CARLO METHOD (~500 words) --><br><!-- ============================================================ --></p>



<h2 class="wp-block-heading">What is the Monte Carlo method</h2>



<p>The story of the Monte Carlo method begins in the most unlikely way: with a mathematician in bed playing cards. In 1946, <strong>Stanisław Ulam</strong>, a Polish mathematician recovering from surgery, found himself playing solitaire to pass the time. Being a mathematician, he wondered: what are the chances of winning a game?</p>



<p>The problem was theoretically solvable: just enumerate every possible combination of cards and count the favorable ones. In practice, however, the number of combinations was so enormous that an exact calculation was completely impractical. Ulam then had an insight as simple as it was powerful: <strong>instead of computing the exact probability, why not simulate hundreds of games and count how many times you win?</strong></p>



<span id="more-3512"></span>



<p>The idea is disarmingly simple. If we play 1,000 games and win 230 of them, we can estimate the probability of winning at about 23%. The more games we simulate, the closer our estimate gets to the true value. This is, in essence, the <strong>Monte Carlo method</strong>: using random simulation to solve problems that would be too complex to tackle analytically.</p>
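<p>Ulam's simulate-and-count idea fits in a few lines of code. Solitaire is fiddly to program, so here is a sketch using a toy game of our own choosing instead: the classic problem of rolling at least one six in four dice throws, picked because its exact probability, 1 − (5/6)<sup>4</sup> ≈ 51.77%, is known and can be compared with the estimate:</p>

```python
import random

random.seed(1)  # fix the seed so the experiment is reproducible

def play_once():
    """One 'game': roll a die four times, win if at least one six appears."""
    return any(random.randint(1, 6) == 6 for _ in range(4))

n_games = 100_000
wins = sum(play_once() for _ in range(n_games))
estimate = wins / n_games

exact = 1 - (5 / 6) ** 4  # known in closed form for this toy game: ~0.5177
print(f"estimated: {estimate:.4f}  exact: {exact:.4f}")
```

<p>With 100,000 simulated games the estimate typically lands within a few tenths of a percentage point of the true value, exactly as the law of large numbers promises.</p>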



<p>Ulam shared the idea with his colleague <strong>John von Neumann</strong>, arguably the most brilliant mathematician of the 20th century, who immediately saw its potential. Von Neumann realized that <strong>ENIAC</strong> — one of the very first electronic computers, which filled an entire room — could run thousands of simulations in reasonable time. Together, they developed the method for a problem far more serious than solitaire: the <strong>diffusion of neutrons</strong> in atomic weapons, as part of the Manhattan Project at Los Alamos.</p>



<p>The name “Monte Carlo” was chosen as a code name, a reference to the famous <strong>Monte Carlo Casino</strong> in Monaco. Legend has it that the inspiration came from Ulam’s uncle, a notorious gambler. After all, the heart of the method is chance itself: generating random numbers to explore spaces of possibility too vast to traverse systematically.</p>



<p>From those early nuclear experiments of the 1940s, the Monte Carlo method has spread to every field of science and engineering. Today it is one of the most widely used computational tools in the world, from particle physics to finance, from cinematic rendering to drug discovery. Let’s see how it works.</p>



<p><!-- ============================================================ --><br><!-- SECTION 2: FUNDAMENTAL CONCEPTS (~300 words) --><br><!-- ============================================================ --></p>



<h2 class="wp-block-heading">Fundamental concepts</h2>



<p>The Monte Carlo method rests on a statistical principle we’ve encountered before: the <strong>law of large numbers</strong>. In simple terms, this law tells us that the average of a random sample approaches the population average as the sample grows. Translated into Monte Carlo language: <strong>the more simulations we run, the more accurate our result will be</strong>.</p>



<p>To run a Monte Carlo simulation, we need <strong>random numbers</strong>. In practice, computers don’t generate truly random numbers: they use deterministic algorithms that produce sequences of <strong>pseudo-random numbers</strong> with statistical properties indistinguishable from real randomness. In R, for example, the <code>runif()</code> function generates uniformly distributed numbers between 0 and 1.</p>



<p>A crucial aspect is the <strong>rate of convergence</strong>. The Monte Carlo estimation error decreases as <strong>1/√n</strong>, where n is the number of simulations. This means that to halve the error, we need to quadruple our simulations; to gain one more decimal digit of precision, we need 100 times more iterations. It’s not particularly efficient, but the beauty of the method lies in the fact that <strong>it works regardless of the problem’s complexity</strong>: whether the problem has 2 or 2,000 variables, the convergence rate remains the same.</p>



<p>In practice, we must always balance <strong>desired precision</strong> with <strong>available computational resources</strong>. Increasing the number of simulations comes at a cost in computation time. Fortunately, modern computers make this trade-off much more favorable than in the days of ENIAC.</p>



<p><!-- ============================================================ --><br><!-- SECTION 3: THE METHOD IN ACTION (~400 words) --><br><!-- ============================================================ --></p>



<h2 class="wp-block-heading">The Monte Carlo method in action</h2>



<p>Let’s see concretely how the Monte Carlo method is applied. The procedure follows four fundamental steps:</p>



<p><strong>1. Define the model.</strong> First, we identify the problem’s variables and the probability distributions that govern them. For instance, if we want to simulate an investment’s return, the model will include the expected return (mean) and volatility (standard deviation), typically assuming normally distributed returns.</p>



<p><strong>2. Generate random scenarios.</strong> Using a pseudo-random number generator, we produce thousands of possible scenarios. Each scenario represents an “alternative history”: one way things could play out.</p>



<p><strong>3. Compute the result for each scenario.</strong> For each scenario, we apply the model and obtain a result. If we’re simulating an investment, the result is the final portfolio value.</p>



<p><strong>4. Aggregate the results.</strong> Finally, we analyze the set of results: we compute the mean, the median, the percentiles. This gives us not just an estimate of the expected outcome, but an entire <strong>distribution of possibilities</strong>. And this is where Monte Carlo truly shines: it tells us not only “how much we’re likely to earn” but also “how much we could lose in the worst case.”</p>
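<p>The four steps above can be sketched as a generic routine. The function names and the toy model below are illustrative choices of ours, not code from any particular library:</p>

```python
import random
from statistics import mean, median, quantiles

def monte_carlo(generate_scenario, compute_result, n=10_000):
    """A generic Monte Carlo loop following the four steps above."""
    results = []
    for _ in range(n):
        scenario = generate_scenario()            # step 2: one random scenario
        results.append(compute_result(scenario))  # step 3: its outcome
    # step 4: aggregate into a distribution of possibilities
    return {
        "mean": mean(results),
        "median": median(results),
        "p5": quantiles(results, n=100)[4],  # 5th percentile: the bad cases
    }

# Step 1, the model: a yearly return drawn from a normal distribution
random.seed(7)
summary = monte_carlo(
    generate_scenario=lambda: random.gauss(0.08, 0.12),
    compute_result=lambda r: 10_000 * (1 + r),  # final value of 10,000 invested
)
print(summary)
```

<p>Swapping in a different model only means changing the two functions passed in; the loop and the aggregation stay the same.</p>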



<p>Let’s use a quick example to illustrate convergence. Imagine flipping a coin and trying to estimate the probability of heads. After 10 flips, we might get 7 heads (70%), an estimate far from the true 50%. After 100 flips, we’ll be closer, perhaps 53%. After 10,000 flips, our estimate will be very close to 50%. This is Monte Carlo in action: replacing a theoretical calculation with an experiment repeated thousands of times.</p>



<p>The power of the method lies in its <strong>flexibility</strong>. While analytical methods require closed-form solutions (which often don’t exist for complex problems), Monte Carlo only requires the ability to simulate the process. If we can write a program that generates one scenario, Monte Carlo gives us the distribution of outcomes.</p>



<p><!-- ============================================================ --><br><!-- SECTION 4: PRACTICAL EXAMPLES (~600 words) --><br><!-- ============================================================ --></p>



<h2 class="wp-block-heading">Practical examples: estimating π and portfolio returns</h2>



<h3 class="wp-block-heading">Example 1: estimating the value of π</h3>



<p>The most classic and pedagogically effective example of the Monte Carlo method is <strong>estimating the number π</strong>. The idea is elegant: consider a square of side 2 with a circle of radius 1 inscribed inside it. The area of the square is 4, the area of the circle is π. If we generate random points inside the square, the proportion falling inside the circle will be approximately π/4.</p>



<p>We compute this in R with 100,000 points:</p>



<pre class="wp-block-code"><code>set.seed(123)
n &lt;- 100000
x &lt;- runif(n, -1, 1)
y &lt;- runif(n, -1, 1)
inside &lt;- (x^2 + y^2) &lt;= 1
pi_estimate &lt;- 4 * sum(inside) / n
pi_estimate
# &#91;1] 3.13956</code></pre>



<p>The same in Python:</p>



<pre class="wp-block-code"><code>import random
random.seed(123)
n = 100000
inside = sum(1 for _ in range(n)
             if random.uniform(-1, 1)**2 + random.uniform(-1, 1)**2 &lt;= 1)
pi_estimate = 4 * inside / n
print(pi_estimate)
# 3.14268</code></pre>



<p>With 100,000 points we already get a reasonable estimate, though not extremely precise: we’re accurate to about two decimal places. As we mentioned, gaining another digit of precision would require roughly 100 times more points. The computer does all the heavy lifting.</p>



<h3 class="wp-block-heading">Example 2: estimating portfolio returns</h3>



<p>Let’s move to an example closer to real-world applications. Suppose we have a portfolio of three stocks with the following characteristics:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Stock</th><th>Expected Return</th><th>Standard Deviation</th><th>Portfolio Weight</th></tr></thead><tbody><tr><td>A</td><td>8%</td><td>12%</td><td>40%</td></tr><tr><td>B</td><td>10%</td><td>15%</td><td>30%</td></tr><tr><td>C</td><td>12%</td><td>18%</td><td>30%</td></tr></tbody></table></figure>



<p>We want to estimate the probability that the portfolio return exceeds 10%. We simulate in R with 10,000 scenarios:</p>



<pre class="wp-block-code"><code>set.seed(42)
sim_A &lt;- rnorm(10000, mean = 0.08, sd = 0.12)
sim_B &lt;- rnorm(10000, mean = 0.10, sd = 0.15)
sim_C &lt;- rnorm(10000, mean = 0.12, sd = 0.18)
sim_portfolio &lt;- 0.4 * sim_A + 0.3 * sim_B + 0.3 * sim_C
prob_result &lt;- mean(sim_portfolio &gt;= 0.10)
prob_result
# &#91;1] 0.4504</code></pre>



<p>The same in Python:</p>



<pre class="wp-block-code"><code>import random
random.seed(42)
n = 10000
count = 0
for _ in range(n):
    a = random.gauss(0.08, 0.12)
    b = random.gauss(0.10, 0.15)
    c = random.gauss(0.12, 0.18)
    ptf = 0.4 * a + 0.3 * b + 0.3 * c
    if ptf &gt;= 0.10:
        count += 1
print(count / n)
# 0.4479</code></pre>



<p>The result tells us there’s roughly a 45% chance of exceeding 10% return. Notice how Monte Carlo gives us not a single number, but an entire distribution: we could easily compute the median return, the worst-case 5th percentile, the probability of loss, and so on.</p>
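<p>As a sketch of that last point, here is how those extra statistics can be pulled out of the same simulation, using only Python's standard library (we bump the scenario count to 100,000 for smoother percentiles):</p>

```python
import random
from statistics import median, quantiles

random.seed(42)
n = 100_000
portfolio = []
for _ in range(n):
    a = random.gauss(0.08, 0.12)
    b = random.gauss(0.10, 0.15)
    c = random.gauss(0.12, 0.18)
    portfolio.append(0.4 * a + 0.3 * b + 0.3 * c)

print(f"median return:  {median(portfolio):.3f}")
print(f"5th percentile: {quantiles(portfolio, n=100)[4]:.3f}")   # worst plausible case
print(f"P(loss):        {sum(r < 0 for r in portfolio) / n:.3f}")
```

<p>The same array of simulated returns answers all three questions at once: no extra modelling required, just different summaries of the same distribution.</p>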



<p><!-- ============================================================ --><br><!-- SECTION 5: INTERACTIVE SIMULATOR (~200 words) --><br><!-- ============================================================ --></p>



<h2 class="wp-block-heading">Monte Carlo Simulator</h2>



<p>To make the concept even more tangible, we’ve built an <strong>interactive simulator</strong> that applies the Monte Carlo method to predict the future value of an investment. The underlying model is the <strong>Geometric Brownian Motion</strong> (GBM), the same model used in the famous Black-Scholes framework for options pricing.</p>



<p>Intuitively, an asset’s future price is computed as the current price multiplied by a random growth factor. The formula is:</p>



<p class="has-text-align-center"><strong>S(t+1) = S(t) × exp((μ − σ²/2) + σ × Z)</strong></p>



<p>where <strong>μ</strong> is the expected annual return (the “average growth”), <strong>σ</strong> is the volatility (how much the price fluctuates — our measure of uncertainty), and <strong>Z</strong> is a standard normal random number, drawn afresh at each annual step. Each simulation generates a different path: some scenarios see the portfolio grow substantially, others see it decline. The histogram shows the distribution of all possible outcomes.</p>
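<p>For the curious, the update rule above takes only a few lines to simulate. The parameters below (μ = 7%, σ = 15%, ten annual steps, an initial value of 10,000) are illustrative choices of ours, not the simulator's actual defaults:</p>

```python
import random
from math import exp
from statistics import mean, median

random.seed(0)

def final_value(s0, mu, sigma, years):
    """Apply S(t+1) = S(t) * exp((mu - sigma^2/2) + sigma * Z) once per year."""
    s = s0
    for _ in range(years):
        s *= exp((mu - sigma ** 2 / 2) + sigma * random.gauss(0, 1))
    return s

n = 50_000
finals = [final_value(10_000, mu=0.07, sigma=0.15, years=10) for _ in range(n)]

print(f"mean final value:   {mean(finals):,.0f}")    # near 10,000 * exp(0.07 * 10)
print(f"median final value: {median(finals):,.0f}")  # lower: the distribution is skewed
```

<p>Note how the mean ends up above the median: the multiplicative noise makes the distribution right-skewed, with a few very lucky paths pulling the average up.</p>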



<iframe src="https://www.gironi.it/utility/montecarlo-simulator-en/" width="100%" height="600" style="border:none;border-radius:12px;" loading="lazy" title="Monte Carlo Simulator"></iframe>



<p><!-- ============================================================ --><br><!-- SECTION 6: MODERN APPLICATIONS (~400 words) --><br><!-- ============================================================ --></p>



<h2 class="wp-block-heading">Modern applications of the Monte Carlo method</h2>



<p>From the nuclear physics of the 1940s, the Monte Carlo method has spread to domains that Ulam and von Neumann could never have imagined. Let’s look at some of the most fascinating applications.</p>



<p><strong>3D rendering and cinema.</strong> Every time we watch a Pixar film or a blockbuster with visual effects, we’re admiring Monte Carlo at work. The technique is called <strong>path tracing</strong>: to compute the color of each pixel, the software simulates millions of light rays bouncing between surfaces in the scene. Each ray follows a random path, and the average of thousands of paths produces the photorealistic image we see on screen.</p>



<p><strong>Finance and risk management.</strong> In the financial world, Monte Carlo is ubiquitous. Banks use it to calculate <strong>Value at Risk</strong> (VaR) — the maximum probable loss of a portfolio over a given time horizon. It’s the same principle as our simulator, applied to portfolios with hundreds of assets and complex correlations. Pricing exotic options that lack closed-form solutions also relies on Monte Carlo simulations.</p>



<p><strong>Drug discovery.</strong> In pharmaceutical research, Monte Carlo is used to simulate <strong>molecular docking</strong>: how a candidate molecule binds to a target protein. By simulating millions of possible spatial configurations, researchers identify the most promising compounds before synthesizing them in the lab, saving years of experimentation.</p>



<p><strong>Climate models.</strong> Models predicting climate change are inherently uncertain: they depend on emission scenarios, atmospheric feedback, ocean dynamics. Monte Carlo allows exploration of thousands of parameter combinations and generates the <strong>uncertainty bands</strong> we see in IPCC reports. Not a single prediction, but a distribution of possible futures.</p>



<p><strong>Artificial intelligence.</strong> In machine learning, a technique called <strong>Monte Carlo dropout</strong> uses simulation to estimate the uncertainty of a neural network’s predictions. And the famous <strong>AlphaGo</strong> by DeepMind, which in 2016 defeated the world Go champion, used <strong>Monte Carlo Tree Search</strong> (MCTS) to explore possible moves in a game with more configurations than atoms in the universe.</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Field</th><th>Example</th><th>What is simulated</th></tr></thead><tbody><tr><td>Cinema/3D</td><td>Path tracing (Pixar)</td><td>Light ray paths</td></tr><tr><td>Finance</td><td>Value at Risk</td><td>Market scenarios</td></tr><tr><td>Pharmaceuticals</td><td>Molecular docking</td><td>Spatial configurations</td></tr><tr><td>Climate</td><td>IPCC models</td><td>Parameter combinations</td></tr><tr><td>AI</td><td>AlphaGo (MCTS)</td><td>Possible moves</td></tr></tbody></table></figure>



<p><!-- ============================================================ --><br><!-- SECTION 7: ADVANTAGES AND LIMITATIONS (~300 words) --><br><!-- ============================================================ --></p>



<h2 class="wp-block-heading">Advantages and limitations of the Monte Carlo method</h2>



<p>Like any statistical tool, the Monte Carlo method has its strengths and limitations. Let’s examine them honestly.</p>



<p><strong>Flexibility.</strong> The greatest advantage is versatility: Monte Carlo applies to complex problems of any size and in any field, from finance to engineering, physics to biology. It doesn’t require closed-form solutions, only the ability to simulate the process.</p>



<p><strong>Accuracy.</strong> With a sufficient number of simulations, the estimate can be made arbitrarily precise. The more we run the method, the closer the result converges to the true value.</p>



<p><strong>Scalability.</strong> Unlike grid-based methods, which suffer from the “curse of dimensionality” (cost explodes with the number of variables), Monte Carlo maintains the same convergence rate regardless of the number of dimensions. This makes it the only practical tool for high-dimensional problems.</p>
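<p>To see this dimension-independence in action, the same hit-or-miss recipe used for π works unchanged in ten dimensions. Here we estimate the fraction of the cube [−1, 1]<sup>10</sup> occupied by the unit ball, whose exact volume (π<sup>5</sup>/120) is known, so the estimate can be checked:</p>

```python
import random
from math import pi

random.seed(99)
dim, n = 10, 200_000

hits = 0
for _ in range(n):
    # A random point in the 10-dimensional cube [-1, 1]^10
    point = [random.uniform(-1, 1) for _ in range(dim)]
    if sum(x * x for x in point) <= 1:  # inside the unit 10-ball?
        hits += 1

estimate = hits / n
exact = pi ** 5 / 120 / 2 ** dim  # known in closed form: ~0.00249
print(f"estimated fraction: {estimate:.5f}  exact: {exact:.5f}")
```

<p>A regular grid with just ten points per axis would already need 10<sup>10</sup> evaluations; the Monte Carlo estimate gets there with 200,000 random points.</p>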



<p>However, the method also presents <strong>significant limitations</strong>:</p>



<p><strong>Slow convergence.</strong> The 1/√n rate means that gaining one digit of precision requires 100 times more simulations. For problems demanding very high precision, this can be prohibitive.</p>



<p><strong>Computational cost.</strong> For complex problems (many variables, heavy models), each individual simulation may require significant time. Multiplied by thousands or millions of iterations, the cost becomes considerable.</p>



<p>To mitigate these limitations, <strong>variance reduction techniques</strong> have been developed over the years, enabling more precise results with fewer simulations:</p>



<ul class="wp-block-list">
<li><strong>Importance sampling</strong>: sampling from an alternative distribution that “concentrates” simulations in the most informative regions.</li>



<li><strong>Control variates</strong>: using a correlated variable with known expected value to reduce the estimate’s variance.</li>



<li><strong>Stratified sampling</strong>: dividing the space into homogeneous subgroups and sampling from each.</li>



<li><strong>Antithetic variates</strong>: exploiting pairs of negatively correlated random numbers to reduce variance.</li>
</ul>
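<p>Of these, antithetic variates is the easiest to demonstrate: for every uniform draw U we also use 1 − U, and since the two resulting estimates are negatively correlated, their average fluctuates less. A sketch on a deliberately simple toy integral, E[U<sup>2</sup>] = 1/3:</p>

```python
import random
from statistics import mean, pvariance

random.seed(5)

def plain(n):
    """Standard estimate of E[U^2] = 1/3 from n independent uniform draws."""
    return mean(random.random() ** 2 for _ in range(n))

def antithetic(n):
    """Same budget of n evaluations, taken as negatively correlated pairs (U, 1 - U)."""
    total = 0.0
    for _ in range(n // 2):
        u = random.random()
        total += (u ** 2 + (1 - u) ** 2) / 2
    return total / (n // 2)

# Run each estimator 2,000 times and compare how much the estimates scatter
plain_runs = [plain(100) for _ in range(2_000)]
anti_runs = [antithetic(100) for _ in range(2_000)]
print(f"variance of plain estimator:      {pvariance(plain_runs):.2e}")
print(f"variance of antithetic estimator: {pvariance(anti_runs):.2e}")
```

<p>With the same budget of 100 evaluations per estimate, the antithetic version's variance comes out several times smaller: more precision for free.</p>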



<p><!-- ============================================================ --><br><!-- CLOSING --><br><!-- ============================================================ --></p>



<p>The Monte Carlo method represents one of the most powerful tools in computational statistics. In future articles, we’ll explore how some of these techniques — particularly the <strong>bootstrap</strong>, a close relative of Monte Carlo — apply to concrete problems in statistical inference.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><!-- ============================================================ --><br><!-- FURTHER READING --><br><!-- ============================================================ --></p>



<h3 class="wp-block-heading">Further reading</h3>



<p>For a deeper dive into the Monte Carlo method and its applications in finance, <a href="https://www.amazon.com/dp/1441915753?tag=consulenzeinf-21" target="_blank" rel="nofollow noopener sponsored"><em>Monte Carlo Methods in Financial Engineering</em></a> by Paul Glasserman is the most comprehensive reference: it covers theory and practice with detailed examples in derivative pricing and risk management.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.gironi.it/blog/en/the-monte-carlo-method-explained-simply-with-real-world-applications/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>A/B Test Sample Size Calculator</title>
		<link>https://www.gironi.it/blog/en/ab-test-sample-size-calculator/</link>
		
		<dc:creator><![CDATA[paolo]]></dc:creator>
		<pubDate>Fri, 06 Mar 2026 08:07:28 +0000</pubDate>
				<category><![CDATA[statistics]]></category>
		<guid isPermaLink="false">https://www.gironi.it/blog/?p=3495</guid>

					<description><![CDATA[One of the most common questions when planning an A/B test is: how many users do I need to get a reliable result? The answer is not a magic number: it depends on the size of the effect we want to detect, the baseline conversion rate, and the level of statistical certainty we require. Calculating &#8230; <a href="https://www.gironi.it/blog/en/ab-test-sample-size-calculator/" class="more-link">Continue reading<span class="screen-reader-text"> "A/B Test Sample Size Calculator"</span></a>]]></description>
										<content:encoded><![CDATA[<p>One of the most common questions when planning an <strong>A/B test</strong> is: <em>how many users do I need to get a reliable result?</em> The answer is not a magic number: it depends on the size of the effect we want to detect, the baseline conversion rate, and the level of statistical certainty we require.</p>
<p>Calculating the <strong>sample size</strong> in advance is essential to avoid two classic mistakes: stopping the test too early and declaring a winner that does not exist, or letting it run too long, wasting traffic and time. In other words, it is about finding the right balance between resources and rigour.</p>
<p>If you have read the article on <a href="https://www.gironi.it/blog/en/guide-to-statistical-tests-for-a-b-analysis/">A/B Testing</a>, you will recall that <strong>power analysis</strong> is the statistical method that lets us determine this threshold. And if you have studied <a href="https://www.gironi.it/blog/en/confidence-intervals-what-they-are-how-to-calculate-them-and-what-they-do-not-mean/">confidence intervals</a>, you already know that significance level and test power are not abstract concepts but operational levers that directly affect sample size.</p>
<p><span id="more-3495"></span></p>
<p>The calculator below automates this process: simply enter your test parameters to instantly get the number of observations needed per variant and, if you know your daily traffic, an estimate of the test duration in days.</p>
<div style="border: 1px solid #ccc;padding: 1.2em 1.5em;margin: 1.5em 0;border-radius: 6px">
<h3 style="margin-top: 0">What We&#8217;ll Cover</h3>
<ul>
<li><a href="#calculator">The calculator</a></li>
<li><a href="#formula">The formula: how the calculation works</a></li>
<li><a href="#how-to-use">How to use the calculator</a></li>
<li><a href="#further-reading">Further reading</a></li>
</ul>
</div>
<hr />
<h2 id="calculator">The calculator</h2>
<p>Enter the parameters of your A/B test and the calculator will instantly return the required sample size.</p>
<style>
.ss-calc{max-width:620px;margin:2em auto;padding:1.5em 2em;background:#f8f8f8;border:1px solid #ddd;border-radius:8px;font-family:inherit}
.ss-calc h3{margin:0 0 1em;color:#333;font-size:1.2em}
.ss-calc label{display:block;margin:0.8em 0 0.3em;font-weight:600;color:#333;font-size:0.95em}
.ss-calc .ss-hint{font-size:0.82em;color:#777;margin:0.15em 0 0}
.ss-calc input[type=number],.ss-calc select{width:100%;padding:8px 10px;border:1px solid #ccc;border-radius:4px;font-size:1em;box-sizing:border-box;background:#fff}
.ss-calc input[type=number]:focus,.ss-calc select:focus{outline:none;border-color:#0073aa;box-shadow:0 0 0 2px rgba(0,115,170,0.15)}
.ss-calc .ss-row{display:flex;gap:1.2em}
.ss-calc .ss-col{flex:1}
.ss-calc .ss-result{margin-top:1.5em;padding:1.2em;background:#fff;border:2px solid #2ecc71;border-radius:6px;text-align:center}
.ss-calc .ss-result .ss-big{font-size:2em;font-weight:700;color:#2ecc71;display:block;margin:0.2em 0}
.ss-calc .ss-result .ss-label{font-size:0.85em;color:#666}
.ss-calc .ss-result .ss-total{font-size:1.1em;color:#333;margin-top:0.5em}
.ss-calc .ss-result .ss-days{font-size:1em;color:#0073aa;margin-top:0.4em;font-weight:600}
.ss-calc .ss-warn{color:#e74c3c;font-size:0.85em;margin-top:0.5em;display:none}
@media(max-width:520px){.ss-calc .ss-row{flex-direction:column;gap:0}.ss-calc{padding:1em 1.2em}}
</style>
<div class="ss-calc" id="ssCalcEn">
<h3>Sample Size Calculator</h3>
<p><label for="ssBaseEn">Baseline conversion rate (%)</label><br />
<input type="number" id="ssBaseEn" value="5" min="0.1" max="100" step="0.1"></p>
<p class="ss-hint">The current conversion rate of the control variant</p>
<p><label for="ssMdeEn">Minimum detectable effect &mdash; MDE (% relative)</label><br />
<input type="number" id="ssMdeEn" value="20" min="1" max="100" step="1"></p>
<p class="ss-hint">The smallest relative improvement we consider meaningful (e.g. 20% = from 5% to 6%)</p>
<div class="ss-row">
<div class="ss-col">
<label for="ssAlphaEn">Significance level (&alpha;)</label><br />
<select id="ssAlphaEn"><option value="0.01">0.01 (99%)</option><option value="0.05" selected>0.05 (95%)</option><option value="0.10">0.10 (90%)</option></select>
</div>
<div class="ss-col">
<label for="ssPowerEn">Power (1&minus;&beta;)</label><br />
<select id="ssPowerEn"><option value="0.80" selected>0.80</option><option value="0.85">0.85</option><option value="0.90">0.90</option><option value="0.95">0.95</option></select>
</div>
</div>
<p><label for="ssTrafficEn">Daily traffic <span style="font-weight:400;color:#999">(optional)</span></label><br />
<input type="number" id="ssTrafficEn" value="" min="1" step="1" placeholder="e.g. 1000"></p>
<p class="ss-hint">Total daily visitors to estimate test duration</p>
<div class="ss-result" id="ssResultEn">
<span class="ss-label">Sample size per variant</span><br />
<span class="ss-big" id="ssNEn">&mdash;</span>
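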
<div class="ss-total" id="ssTotalEn"></div>
<div class="ss-days" id="ssDaysEn"></div>
</div>
<div class="ss-warn" id="ssWarnEn"></div>
</div>
<p><script>
(function(){
  function qnorm(p){
    if(p<=0||p>=1)return NaN;
    if(p<0.5)return -qnorm(1-p);
    var t=Math.sqrt(-2*Math.log(1-p));
    var c0=2.515517,c1=0.802853,c2=0.010328;
    var d1=1.432788,d2=0.189269,d3=0.001308;
    return t-(c0+c1*t+c2*t*t)/(1+d1*t+d2*t*t+d3*t*t*t);
  }
  function calcSS(){
    var base=parseFloat(document.getElementById('ssBaseEn').value);
    var mde=parseFloat(document.getElementById('ssMdeEn').value);
    var alpha=parseFloat(document.getElementById('ssAlphaEn').value);
    var power=parseFloat(document.getElementById('ssPowerEn').value);
    var traffic=document.getElementById('ssTrafficEn').value;
    var warn=document.getElementById('ssWarnEn');
    warn.style.display='none';
    if(isNaN(base)||isNaN(mde)||base<=0||base>100||mde<=0||mde>100){
      document.getElementById('ssNEn').innerHTML='&mdash;';
      document.getElementById('ssTotalEn').textContent='';
      document.getElementById('ssDaysEn').textContent='';
      return;
    }
    var p1=base/100;
    var p2=p1*(1+mde/100);
    if(p2>1){
      warn.textContent='Warning: with these values the variant conversion rate would exceed 100%.';
      warn.style.display='block';
      document.getElementById('ssNEn').innerHTML='&mdash;';
      document.getElementById('ssTotalEn').textContent='';
      document.getElementById('ssDaysEn').textContent='';
      return;
    }
    var za=qnorm(1-alpha/2);
    var zb=qnorm(power);
    var diff=p1-p2;
    var n=Math.ceil((Math.pow(za+zb,2)*(p1*(1-p1)+p2*(1-p2)))/(diff*diff));
    document.getElementById('ssNEn').textContent=n.toLocaleString('en-US');
    document.getElementById('ssTotalEn').textContent='Total (2 variants): '+(n*2).toLocaleString('en-US')+' observations';
    if(traffic && parseInt(traffic)>0){
      var days=Math.ceil((n*2)/parseInt(traffic));
      document.getElementById('ssDaysEn').textContent='Estimated duration: about '+days+' days';
    }else{
      document.getElementById('ssDaysEn').textContent='';
    }
  }
  ['ssBaseEn','ssMdeEn','ssAlphaEn','ssPowerEn','ssTrafficEn'].forEach(function(id){
    document.getElementById(id).addEventListener('input',calcSS);
    document.getElementById(id).addEventListener('change',calcSS);
  });
  calcSS();
})();
</script></p>
<hr />
<h2 id="formula">The formula: how the calculation works</h2>
<p>The calculator uses the standard formula for comparing two proportions with a <strong>two-tailed z-test</strong>. Let us walk through it step by step.</p>
<p>We start with the parameters we enter:</p>
<ul>
<li><strong>p<sub>1</sub></strong>: the baseline conversion rate (control), expressed as a proportion. If our CR is 5%, then p<sub>1</sub> = 0.05.</li>
<li><strong>p<sub>2</sub></strong>: the expected conversion rate for the variant. If the minimum detectable effect (MDE) is 20% relative, then p<sub>2</sub> = p<sub>1</sub> &times; (1 + MDE/100) = 0.05 &times; 1.20 = 0.06.</li>
<li><strong>&alpha;</strong>: the significance level, i.e. the probability of declaring an effect when there is none (Type I error). With &alpha; = 0.05 we work at 95% confidence.</li>
<li><strong>1 &minus; &beta;</strong>: the power of the test, i.e. the probability of detecting an effect when it actually exists. With power 0.80, we have an 80% chance of catching the effect.</li>
</ul>
<p>The formula is:</p>
<p>\( n = \frac{\left[z_{\alpha/2} + z_{\beta}\right]^2 \cdot \left[p_1(1-p_1) + p_2(1-p_2)\right]}{(p_1 - p_2)^2} \)</p>
<p>Where z<sub>&alpha;/2</sub> and z<sub>&beta;</sub> are the <strong>quantiles of the standard normal distribution</strong>. For the most common values:</p>
<ul>
<li>&alpha; = 0.05 &rarr; z<sub>&alpha;/2</sub> = 1.96</li>
<li>&alpha; = 0.01 &rarr; z<sub>&alpha;/2</sub> = 2.576</li>
<li>&beta; = 0.20 (power 0.80) &rarr; z<sub>&beta;</sub> = 0.842</li>
<li>&beta; = 0.10 (power 0.90) &rarr; z<sub>&beta;</sub> = 1.282</li>
</ul>
<p><strong>Worked example.</strong> Suppose we have a baseline conversion rate of 3% and we want to detect a 20% relative increase (i.e. going from 3% to 3.6%), with &alpha; = 0.05 and power = 0.80:</p>
<ul>
<li>p<sub>1</sub> = 0.03, p<sub>2</sub> = 0.036</li>
<li>z<sub>&alpha;/2</sub> = 1.95996, z<sub>&beta;</sub> = 0.84162 (we keep the unrounded quantiles here, so the result matches the calculator)</li>
<li>Numerator: (1.95996 + 0.84162)<sup>2</sup> &times; [0.03 &times; 0.97 + 0.036 &times; 0.964] = 7.849 &times; 0.0638 = 0.5008</li>
<li>Denominator: (0.03 &minus; 0.036)<sup>2</sup> = 0.000036</li>
<li>n = 0.5008 / 0.000036 &asymp; <strong>13,911 per variant</strong></li>
</ul>
<p>So to detect a 20% relative effect on a 3% CR, we need roughly <strong>13,900 observations per variant</strong> (nearly 28,000 in total). These numbers are worth reflecting on: if our site gets 500 visitors a day, the test will take about 56 days. This is one of the reasons why, in practice, most A/B tests on medium-traffic sites take weeks, not days.</p>
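<p>The same formula is easy to reproduce in Python, using the standard library's <code>statistics.NormalDist</code> for the normal quantiles. This is a sketch equivalent to the calculator's JavaScript, not the code actually powering it:</p>

```python
from math import ceil
from statistics import NormalDist

def sample_size(base_rate, mde_relative, alpha=0.05, power=0.80):
    """Required n per variant for a two-tailed z-test on two proportions."""
    p1 = base_rate
    p2 = p1 * (1 + mde_relative)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # e.g. 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # e.g. 0.842 for power = 0.80
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# The worked example: 3% baseline, 20% relative MDE, alpha 0.05, power 0.80
print(sample_size(0.03, 0.20))  # -> 13911 per variant
```

<p>Halving the MDE roughly quadruples the required sample, which is the same 1/√n trade-off that governs every Monte Carlo estimate.</p>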
<hr />
<h2 id="how-to-use">How to use the calculator</h2>
<p><strong>How to choose the MDE.</strong> The minimum detectable effect is the trickiest parameter. Rather than asking &#8220;how much would we like the metric to improve&#8221;, we should ask: <em>what is the smallest improvement that would justify the effort of implementing the change?</em> An MDE of 5% relative requires enormous samples; an MDE of 50% is easy to detect but rarely realistic. The 10&ndash;30% range is a good starting point for most conversion rate tests.</p>
<p>An important detail: the MDE in the calculator is <strong>relative</strong>, not absolute. An MDE of 20% on a baseline CR of 5% means we are looking to detect a shift from 5% to 6% (one absolute percentage point, but 20% of the starting value).</p>
<p><strong>How to estimate daily traffic.</strong> The traffic to enter is that of the pages involved in the test, not the total site traffic. If the test is on the checkout page and it receives 300 visits per day, the correct value is 300. You can get this figure from your analytics tool (GA4, Matomo, or similar) by averaging the last 30 days to smooth out daily fluctuations.</p>
<hr />
<h3 id="further-reading">You might also like</h3>
<ul>
<li><a href="https://www.gironi.it/blog/en/guide-to-statistical-tests-for-a-b-analysis/">A/B Testing: A Guide to Statistical Tests for A/B Analysis</a></li>
<li><a href="https://www.gironi.it/blog/en/confidence-intervals-what-they-are-how-to-calculate-them-and-what-they-do-not-mean/">Confidence Intervals</a></li>
<li><a href="https://www.gironi.it/blog/en/hypothesis-testing-a-step-by-step-guide/">Hypothesis Testing</a></li>
</ul>
<hr />
<h3>Further reading</h3>
<p>The most comprehensive reference on the rigorous design of online experiments is: <a href="https://www.amazon.com/dp/1108724264" rel="nofollow sponsored noopener" target="_blank"><em>Trustworthy Online Controlled Experiments</em></a> by Ron Kohavi, Diane Tang and Ya Xu. It covers sample size, power analysis and much more, drawing on decades of practical experience at Microsoft and Google.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Understanding the Basics of Machine Learning: A Beginner&#8217;s Guide</title>
		<link>https://www.gironi.it/blog/en/understanding-the-basics-of-machine-learning-a-beginners-guide/</link>
					<comments>https://www.gironi.it/blog/en/understanding-the-basics-of-machine-learning-a-beginners-guide/#respond</comments>
		
		<dc:creator><![CDATA[autore-articoli]]></dc:creator>
		<pubDate>Sun, 01 Mar 2026 19:47:47 +0000</pubDate>
				<category><![CDATA[statistics]]></category>
		<guid isPermaLink="false">https://www.gironi.it/blog/understanding-the-basics-of-machine-learning-a-beginners-guide/</guid>

					<description><![CDATA[Introduction Machine Learning is changing the way we see the world around us. From weather prediction to medical diagnosis, from content recommendations on streaming platforms to financial fraud detection, Machine Learning is increasingly present in our daily lives. But what exactly is it, and how does it work? In this post, we will explore the &#8230; <a href="https://www.gironi.it/blog/en/understanding-the-basics-of-machine-learning-a-beginners-guide/" class="more-link">Continue reading<span class="screen-reader-text"> "Understanding the Basics of Machine Learning: A Beginner&#8217;s Guide"</span></a>]]></description>
										<content:encoded><![CDATA[<h2>Introduction</h2>
<p>Machine Learning is changing the way we see the world around us. From weather prediction to medical diagnosis, from content recommendations on streaming platforms to financial fraud detection, Machine Learning is increasingly present in our daily lives.</p>
<p>But what exactly is it, and how does it work? In this post, we will <strong>explore the fundamental concepts of Machine Learning and see how it can be used to solve real-world problems</strong>. We will also look at how to get started with Machine Learning, what resources are available, and how to use this technology to improve both work and everyday life.</p>
<p><span id="more-3484"></span></p>
<p style="background-color:#f0f0f0;padding:1em;border-radius:4px"><strong><em>Caveat</em></strong>: This article is a simple introduction to a vast subject. It is written for anyone who wants to understand the basic concepts of Machine Learning, without requiring advanced technical or mathematical knowledge. At the end of the post, we provide a set of useful resources for anyone who wishes to explore the topic further and continue what is a truly fascinating journey.</p>
<div style="border: 1px solid #ccc;padding: 1.2em 1.5em;margin: 1.5em 0;border-radius: 6px">
<h3 style="margin-top: 0">What We&#8217;ll Cover</h3>
<ul>
<li><a href="#what-is-ml">What Is Machine Learning</a></li>
<li><a href="#types">Supervised vs Unsupervised Learning</a></li>
<li><a href="#supervised-algorithms">Main Supervised Learning Algorithms</a></li>
<li><a href="#unsupervised-algorithms">Main Unsupervised Learning Algorithms</a></li>
<li><a href="#ml-process">The Machine Learning Process</a></li>
<li><a href="#getting-started">Getting Started: Tutorials and Resources</a></li>
<li><a href="#jupyter-colab">Jupyter Lab and Google Colab</a></li>
<li><a href="#further-reading">Further Reading</a></li>
</ul>
</div>
<h2 id="what-is-ml">What Is Machine Learning</h2>
<p>Machine Learning is a technology that allows machines to &#8220;learn&#8221; from data and improve their performance without being explicitly programmed. In other words, Machine Learning enables machines to &#8220;learn&#8221; from experience, just as humans do.</p>
<p>There are two main types of Machine Learning: <strong>supervised Machine Learning</strong> and <strong>unsupervised Machine Learning</strong>.</p>
<p>In supervised Machine Learning, the model is &#8220;trained&#8221; on a dataset that includes examples of both inputs and desired outputs. The model then uses these examples to make predictions on new data. In unsupervised Machine Learning, the model must &#8220;discover&#8221; on its own the structures and relationships within the data, without being guided by pre-defined examples.</p>
<p>Machine Learning is used across a wide range of applications, from weather prediction to medical diagnosis, from content recommendations to financial fraud detection. In general, the goal of Machine Learning is to automate decisions and predictions based on data, improving the efficiency and accuracy of the process.</p>
<h2 id="types">Types of Machine Learning: Supervised and Unsupervised</h2>
<p>As we have already seen, Machine Learning can be divided into two main categories: supervised Machine Learning and unsupervised Machine Learning.</p>
<p><strong>Supervised Machine Learning is the most common type of automated learning</strong> and is based on a <strong>set of already labeled data</strong>. In other words, the model is &#8220;trained&#8221; on a dataset that includes examples of both inputs and desired outputs. The model then uses these examples to learn to make inferences on new data. For example, a spam classifier could be trained on a set of emails labeled as &#8220;spam&#8221; or &#8220;not spam,&#8221; and then used to classify new incoming emails.</p>
<p><strong>Unsupervised Machine Learning</strong>, on the other hand, <strong>is based on a set of unlabeled data</strong>. In other words, the model must &#8220;learn&#8221; on its own to discover structures and relationships within the data. A typical example of this type of learning is clustering, where data is divided into groups (<em>clusters</em>) based on their similarities.</p>
<p style="background-color:#f0f0f0;padding:1em;border-radius:4px">In general, we can say that supervised Machine Learning uses labeled data to make predictions or classifications, while unsupervised Machine Learning uses unlabeled data to make discoveries or identify relationships within the data.</p>
<h3 id="supervised-algorithms">Main Supervised Learning Algorithms</h3>
<p>The main supervised Machine Learning algorithms are:</p>
<ul>
<li><strong>Linear Regression</strong>: used for <strong>quantitative predictions</strong> on a continuous variable. For example, predicting the price of a house based on its square footage.
<p>We have written dedicated posts on this topic, which may be very helpful for a proper understanding:<br /><strong><a href="https://www.gironi.it/blog/en/correlation-and-regression-analysis/" target="_blank" rel="noreferrer noopener">Correlation and Regression Analysis</a><br /><a href="https://www.gironi.it/blog/en/multiple-regression-analysis/" target="_blank" rel="noreferrer noopener">Multiple Regression Analysis</a></strong></li>
<li><strong>Logistic Regression</strong>: used for <strong>categorical variable predictions</strong>, i.e., when the output is one class among two or more possibilities. For example, predicting whether a patient has a certain disease or not.</li>
<li><strong>Decision Trees</strong>: used for both classification and regression. They consist of a decision graph where each node represents a decision and each branch represents an outcome.</li>
<li><strong>Random Forest</strong>: an ensemble of decision trees, each trained on a random subset of the data; the individual predictions are aggregated by majority vote (classification) or by averaging (regression).</li>
<li><strong>Gradient Boosting</strong>: an algorithm that builds decision trees in succession, each new tree correcting the errors of the previous ones.</li>
<li><strong>Support Vector Machine (SVM)</strong>: used for classification; it finds the boundary that maximises the margin between classes, and with kernel functions it can also handle data that is not linearly separable.</li>
<li><strong>k-Nearest Neighbors (k-NN)</strong>: classifies a new data point by looking at the classes of its k most similar points in the training data.</li>
<li><strong>Naive Bayes</strong>: a probabilistic classifier based on Bayes&#8217; theorem, with a simplifying assumption of independence between features.</li>
</ul>
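<p>As a toy illustration of the last ideas in this list, k-NN is simple enough to implement in a few lines of plain Python. This sketch is ours, for intuition only; in practice you would use a library such as scikit-learn:</p>
<pre><code class="language-python">from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest
    training points. `train` is a list of (features, label) pairs."""
    neighbours = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy labeled dataset: two well-separated clusters in 2D
train = [((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
         ((8, 8), "b"), ((8, 9), "b"), ((9, 8), "b")]
print(knn_predict(train, (2, 2)))  # -> "a"
print(knn_predict(train, (8, 7)))  # -> "b"</code></pre>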
<h3 id="unsupervised-algorithms">Main Unsupervised Learning Algorithms</h3>
<ul>
<li><strong>Clustering</strong>: used to divide data into groups or clusters based on their similarities. The most common clustering algorithm is k-means.</li>
<li><strong>Principal Component Analysis (PCA)</strong>: used to reduce the dimensionality of data, that is, to transform a set of correlated variables into a set of uncorrelated variables.</li>
<li><strong>Density-Based Spatial Clustering (DBSCAN)</strong>: used to find clusters based on the density of data points.</li>
<li><strong>Association Rule Mining (Apriori, FP-Growth)</strong>: used to find association rules between variables.</li>
<li><strong>Anomaly Detection Algorithms (One-class SVM, Isolation Forest)</strong>: used to detect elements that deviate from the norm.</li>
<li><strong>Self-Organizing Maps (SOM)</strong>: used to visualize hidden structures in the data.</li>
<li><strong>Structure Detection Algorithms (Spectral Clustering, Hierarchical Clustering)</strong>: used to uncover cluster structure that simple distance-based methods miss; hierarchical clustering additionally produces a tree (dendrogram) of nested clusters.</li>
</ul>
<p>These are some of the main unsupervised Machine Learning algorithms, but there are many others. As with supervised Machine Learning, the choice of algorithm depends on the characteristics of the specific problem and the nature of the data.</p>
<p style="background-color:#f0f0f0;padding:1em 1.2em;border-radius:4px;font-size:1.1em"><strong>In practice, choosing the right algorithm for a specific solution is a critical decision that can determine the success or complete failure of a data analysis effort.</strong></p>
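<p>To make the clustering idea concrete, here is a deliberately minimal k-means sketch in plain Python. This is our own illustrative code with naive initialisation; real implementations, such as scikit-learn&#8217;s <code>KMeans</code>, handle initialisation and convergence far more carefully:</p>
<pre><code class="language-python">from math import dist
from statistics import mean

def kmeans(points, k, iterations=20):
    """Minimal k-means on 2D points: assign each point to its
    nearest centroid, move each centroid to the mean of its
    cluster, and repeat."""
    centroids = list(points[:k])  # naive init: the first k points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            (mean(x for x, _ in c), mean(y for _, y in c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, k=2)
print(clusters)  # the three small points and the three large points</code></pre>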
<h2 id="ml-process">The Main Phases of the Machine Learning Process</h2>
<ol>
<li><strong>Data Collection</strong>: The first phase consists of gathering the data needed for the problem to be solved. This data must be cleaned, formatted, and prepared for processing.</li>
<li><strong>Data Analysis</strong>: Once the data has been collected, it is important to explore it in order to better understand the problem and identify any interesting relationships or characteristics.</li>
<li><strong>Model Selection</strong>: The next phase consists of choosing the most appropriate Machine Learning model for the problem at hand. There are many available algorithms, including decision trees, neural networks, and support vector machines (SVM).</li>
<li><strong>Model Training</strong>: Once the model has been selected, it must be &#8220;trained&#8221; using the training data. This process allows the model to &#8220;learn&#8221; from the data and become capable of making predictions on new data.</li>
<li><strong>Model Evaluation</strong>: Once trained, the model must be evaluated on a test dataset to verify its accuracy.</li>
<li><strong>Model Deployment</strong>: If the model has shown good performance, it can be used to solve the problem in question and deployed to a production environment.</li>
<li><strong>Monitoring and Maintenance</strong>: The model must be monitored to ensure it continues to function correctly and, if necessary, updated or replaced if performance declines.</li>
</ol>
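<p>Phases 4 and 5 hinge on one discipline: never evaluate the model on the data it was trained on. A minimal sketch of the split-and-score step in plain Python (the helper names are ours, for illustration):</p>
<pre><code class="language-python">import random

def train_test_split(data, test_fraction=0.25, seed=42):
    """Shuffle a dataset and split it into training and test sets."""
    rng = random.Random(seed)  # fixed seed: reproducible split
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

data = list(range(100))
train, test = train_test_split(data)
print(len(train), len(test))  # -> 75 25</code></pre>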
<h2 id="getting-started">Getting Started with Machine Learning: Tutorials and Resources</h2>
<p>Machine Learning is a rapidly evolving field, and there are many resources available for those who want to get started. Any list is necessarily incomplete and subject to personal preferences, but here are some good starting points:</p>
<p><strong>Tutorials:</strong> There are numerous tutorials available online that cover the basics of Machine Learning. For example, the scikit-learn data science website has a tutorial section that explains how to use the library to build some of the most common models.<br /><a href="https://scikit-learn.org/stable/tutorial/index.html" target="_blank" rel="noreferrer noopener">https://scikit-learn.org/stable/tutorial/index.html</a></p>
<p><strong>Books:</strong> There are many books on the subject, but some of the classics in the field include:<br />&#8220;<em>Introduction to Machine Learning</em>&#8221; by Alpaydin: <a href="https://www.amazon.com/Introduction-Machine-Learning-Adaptive-Computation/dp/0262028182" target="_blank" rel="noreferrer noopener">https://www.amazon.com/Introduction-Machine-Learning-Adaptive-Computation/dp/0262028182</a><br />&#8220;<em>Python Machine Learning</em>&#8221; by Raschka and Mirjalili: <a href="https://www.packtpub.com/data/python-machine-learning-third-edition" target="_blank" rel="noreferrer noopener">https://www.packtpub.com/data/python-machine-learning-third-edition</a></p>
<p><strong>Online Courses:</strong> There are many online courses that cover the basics of Machine Learning, such as the excellent course by Andrew Ng on Coursera:<br /><a href="https://www.coursera.org/learn/machine-learning" target="_blank" rel="noopener">https://www.coursera.org/learn/machine-learning</a><br />or the Machine Learning course by fast.ai:<br /><a href="https://www.fast.ai/" target="_blank" rel="noreferrer noopener">https://www.fast.ai/</a></p>
<p><strong>Tools:</strong> There are many tools and libraries that can be used to explore data and build models. Some of the most popular include:</p>
<p><strong>scikit-learn</strong>: a Machine Learning library for Python<br /><a href="https://scikit-learn.org/stable/" target="_blank" rel="noopener">https://scikit-learn.org/stable/</a><br /><strong>TensorFlow</strong>: a Machine Learning library developed by Google<br /><a href="https://www.tensorflow.org/" target="_blank" rel="noopener">https://www.tensorflow.org/</a><br /><strong>Keras</strong>: a high-level interface for building neural networks in TensorFlow<br /><a href="https://keras.io/" target="_blank" rel="noopener">https://keras.io/</a><br /><strong>PyTorch</strong>: an open-source Machine Learning library developed by Facebook<br /><a href="https://pytorch.org/" target="_blank" rel="noreferrer noopener">https://pytorch.org/</a></p>
<p>In general, we recommend starting with tutorials and online courses to become familiar with the basic concepts, and then continuing with books and tools to deepen understanding and develop practical skills. To become a good data scientist, it is also important to work with real data and not just tutorials or exercises. Seeking out Machine Learning projects or competitions can help build concrete experience.</p>
<h2 id="jupyter-colab">Experimenting with Code: Jupyter Lab and Google Colab</h2>
<p>Jupyter Lab and Google Colab are both <strong>free and powerful tools for data exploration</strong>, learning, and testing Machine Learning code.</p>
<p>How can we use both tools to create development environments and share our work with others?</p>
<p><strong>Jupyter Lab</strong> is the new interface for <strong>Jupyter Notebook</strong> that provides an integrated development environment for working with notebooks. <strong>It is an interactive development environment that allows us to write, run, and document Python and R code within a web browser</strong>. It is particularly useful for data analysis and for learning Machine Learning.</p>
<p>To get started, Jupyter Lab needs to be installed on a local machine. This can be done easily using <strong>Anaconda</strong>, a Python distribution that includes Jupyter Lab and many other data science libraries. Once installed, Jupyter Lab can be launched from the command line and a new notebook opened to write and run code. Jupyter Lab is available at: <a href="https://jupyter.org/" target="_blank" rel="noopener">https://jupyter.org/</a></p>
<p>It is also possible to test the environment directly in the browser with JupyterLite:</p>
<figure><img decoding="async" src="https://www.gironi.it/blog/wp-content/uploads/2023/01/jubyterLite-1024x459.png" alt="JupyterLite: try the Jupyter environment in your browser" /><figcaption>JupyterLite: try the Jupyter environment in your browser</figcaption></figure>
<p><strong>Google Colab</strong> is a <strong>cloud-based</strong> development environment that allows us to write and run Python and R code within a web browser <strong>without any installation</strong>. It is a very convenient option, because Colab can be accessed from any device with an internet connection and work can be shared with others simply by providing a link. It also allows the use of a GPU or TPU to make computations more powerful. Google Colab is available at: <a href="https://colab.research.google.com/" target="_blank" rel="noopener">https://colab.research.google.com/</a></p>
<figure><img decoding="async" src="https://www.gironi.it/blog/wp-content/uploads/2023/01/google-colab-1024x506.png" alt="Google Colab: test and share code in the cloud" /><figcaption>Google Colab: test and share code in the cloud</figcaption></figure>
<p>Both tools allow us to create a sequence of cells containing code and text. <strong>Code can be executed within cells and the results displayed directly in the notebook</strong>. This makes Jupyter Lab and Google Colab ideal for exploring data, learning Machine Learning, and sharing and documenting work.</p>
<hr />
<h3>You might also like</h3>
<ul>
<li><a href="https://www.gironi.it/blog/en/correlation-and-regression-analysis/">Correlation and Regression Analysis</a></li>
<li><a href="https://www.gironi.it/blog/en/logistic-regression/">Logistic Regression</a></li>
<li><a href="https://www.gironi.it/blog/en/how-to-use-decision-trees/">How to Use Decision Trees</a></li>
</ul>
<hr />
<h3 id="further-reading">Further Reading</h3>
<p>For a solid introduction to the statistical foundations of machine learning—including regression, model selection, and prediction—<a href="https://www.amazon.it/dp/8891906190?tag=consulenzeinf-21" rel="nofollow sponsored noopener" target="_blank"><em>Introduzione all&#8217;econometria</em></a> by Stock and Watson provides the quantitative framework that underpins many ML techniques. For a hands-on guide to online experimentation and A/B testing—essential skills for deploying ML models in production—<a href="https://www.amazon.it/dp/1108724264?tag=consulenzeinf-21" rel="nofollow sponsored noopener" target="_blank"><em>Trustworthy Online Controlled Experiments</em></a> by Kohavi, Tang and Xu is the definitive reference.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.gironi.it/blog/en/understanding-the-basics-of-machine-learning-a-beginners-guide/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>The Gini Index: What It Is, Why It Matters, and How to Compute It in R</title>
		<link>https://www.gironi.it/blog/en/the-gini-index-what-it-is-why-it-matters-and-how-to-compute-it-in-r/</link>
					<comments>https://www.gironi.it/blog/en/the-gini-index-what-it-is-why-it-matters-and-how-to-compute-it-in-r/#respond</comments>
		
		<dc:creator><![CDATA[autore-articoli]]></dc:creator>
		<pubDate>Sun, 01 Mar 2026 19:47:45 +0000</pubDate>
				<category><![CDATA[statistics]]></category>
		<guid isPermaLink="false">https://www.gironi.it/blog/the-gini-index-what-it-is-why-it-matters-and-how-to-compute-it-in-r/</guid>

					<description><![CDATA[The Gini coefficient is a measure of the degree of inequality in a distribution, and is commonly used to measure income distribution. These few words alone are enough to grasp the extraordinary importance of this index for economic and political studies, and why it is worth getting to know it a little more closely. What &#8230; <a href="https://www.gironi.it/blog/en/the-gini-index-what-it-is-why-it-matters-and-how-to-compute-it-in-r/" class="more-link">Continue reading<span class="screen-reader-text"> "The Gini Index: What It Is, Why It Matters, and How to Compute It in R"</span></a>]]></description>
										<content:encoded><![CDATA[<p>The Gini coefficient is a measure of <strong>the degree of inequality in a distribution</strong>, and is commonly used to <strong>measure income distribution</strong>.</p>
<p>These few words alone are enough to grasp the extraordinary importance of this index for economic and political studies, and why it is worth getting to know it a little more closely.</p>
<p><span id="more-3483"></span></p>
<div style="border: 1px solid #ccc;padding: 1.2em 1.5em;margin: 1.5em 0;border-radius: 6px">
<h3 style="margin-top: 0">What We&#8217;ll Cover</h3>
<ul>
<li><a href="#lorenz-curve">The Lorenz Curve</a></li>
<li><a href="#example">An Example</a></li>
<li><a href="#definition">The Definition of the Concentration Index R</a></li>
<li><a href="#r-code">Computing R&#8230; in R!</a></li>
<li><a href="#python">What If I Don&#8217;t Use R?</a></li>
<li><a href="#world-data">Gini Index Values Around the World</a></li>
<li><a href="#further-reading">Further Reading</a></li>
</ul>
</div>
<hr />
<div style="border: 1px solid silver; padding: 8px; background-color: #f8f8f8; font-size: small;">A preliminary note:<br /><strong>Income is a transferable variable</strong>.<br />A quantitative variable is said to be transferable when the overall increase in the phenomenon recorded across a given population can be redistributed among the statistical units without changing its total amount.</div>
<p>The index is one of the greatest achievements of <a href="https://en.wikipedia.org/wiki/Corrado_Gini" target="_blank" rel="noopener noreferrer">Corrado Gini</a>, one of the foremost Italian statisticians (who was, unfortunately, personally connected to the fascist regime: he inspired Mussolini&#8217;s famous &#8220;Ascension Day&#8221; speech of 1927 on the issues of birth rates and <a href="https://en.wikipedia.org/wiki/Eugenics" target="_blank" rel="noopener noreferrer">eugenics</a>).</p>
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="249" height="326" src="https://www.gironi.it/blog/wp-content/uploads/2020/11/Corrado_Gini-1.jpg" alt="Corrado Gini - the Gini index" /></figure>
<p>It was in 1912 that Gini published his article &#8220;<em><strong>Variabilità e mutabilità</strong></em>&#8221; (Variability and Mutability), in which he expanded on the work of <a href="https://en.wikipedia.org/wiki/Max_O._Lorenz" target="_blank" rel="noopener noreferrer">Max Otto Lorenz</a>, who as early as 1905 had introduced the famous curves (now known as &#8220;Lorenz curves&#8221;) describing the percentages of wealth held by increasing percentages of the population.</p>
<h2 id="lorenz-curve">The Lorenz Curve</h2>
<p>Lorenz introduced a highly effective graphical representation, placing on the horizontal axis the points P<sub>i</sub> (that is, the cumulative fraction of the first <em>i</em> income earners: P<sub>i</sub> = i / n) and on the vertical axis the corresponding values Q<sub>i</sub> (the cumulative fraction of income held by the first <em>i</em> income earners). Connecting these points produces the <strong>concentration curve</strong>, known as the <a href="https://en.wikipedia.org/wiki/Lorenz_curve" target="_blank" rel="noopener noreferrer"><strong>Lorenz curve</strong></a>.</p>
<figure class="aligncenter"><img loading="lazy" decoding="async" width="609" height="346" src="https://www.gironi.it/blog/wp-content/uploads/2018/05/Lorenz-curve1.png" alt="Lorenz curve" /></figure>
<p>The difference between P<sub>i</sub> and Q<sub>i</sub> measures, in proportion, the share of total income that the first <em>i</em> individuals lack in order to reach a state of equal distribution.<br />The larger this difference, the more the remaining <em>n &minus; i</em> individuals concentrate a significant portion of the total amount on themselves.</p>
<p>The measure of income inequality is the arithmetic mean of the normalised differences (that is, of the quantities (P<sub>i</sub> &minus; Q<sub>i</sub>) / P<sub>i</sub>, for i = 1, 2, 3, &hellip;, n &minus; 1).</p>
<p>Gini thus managed to develop, in his 1912 work and then in 1914, &#8220;his&#8221; coefficient, which <strong>measures the ratio of the area between the Lorenz curve and the 45-degree line to the whole area under the 45-degree line</strong>.</p>
<p>In practice, it indicates how much the corresponding Lorenz curve deviates from complete equality in the distribution of wealth.</p>
<p>In one sentence: the ratio of the area of concentration to its maximum (which is 0.5) coincides exactly with R.</p>
<h2 id="example">An Example</h2>
<p>Let us build the Lorenz curve: the vertical axis shows the income percentages of households, while the horizontal axis shows the percentages of households.<br />If 30% of households earned 30% of the income, 40% of households earned 40% of the income, and so on, we would have a perfectly equal distribution &mdash; that is, a straight line at 45 degrees.</p>
<figure class="aligncenter"><img loading="lazy" decoding="async" width="644" height="372" src="https://www.gironi.it/blog/wp-content/uploads/2018/05/gini-lorenz.png" alt="Lorenz curve and Gini index graph" /></figure>
<p>The Lorenz curve instead represents the actual distribution of income: the deviation of the Lorenz curve from the line of perfect equality (that is, from the 45-degree line) constitutes the measure of inequality in income distribution.</p>
<p>The ratio of the area between the line of perfect equality and the Lorenz curve (that is, the shaded area in the figure) to the area of triangle 0AB is the Gini coefficient.</p>
<h2 id="definition">The Definition of the Concentration Index R</h2>
<p>R can be defined independently of the Lorenz curve: it equals the <strong>normalised</strong> simple mean difference divided by its maximum, that is:</p>
<p>\(<br />
R = \frac{\text{mean absolute difference}}{2 \times \text{mean of values}} \\<br />
\)</p>
<p>R is therefore an index expressed as a number between the theoretical values 0 and 1 &mdash; theoretical because they correspond, respectively, to the case of perfect equity in wealth distribution (everyone has the same income) and the case of maximum inequality (a single unit holds all the income). It is a &#8220;pure&#8221; value that allows comparison between different countries or territorial areas, proving <strong>extraordinarily useful in the field of socio-economic analysis</strong>.</p>
<h2 id="r-code">Computing R&#8230; in R!</h2>
<p>Countless R libraries contain a function for calculating the Gini index (the most widely used package is probably &#8220;<em>ineq</em>&#8221;, easily found with a search on CRAN), which is not included among R&#8217;s base functions.</p>
<p>However, since the calculation itself is not particularly complex, we find it useful to present a version of the function below.</p>
<p><strong>1 &ndash; We start by computing the mean absolute difference</strong></p>
<pre><code class="language-r">Delta &lt;- function(variable) {
  n &lt;- length(variable)
  avg &lt;- mean(variable)
  sorted_variable &lt;- sort(variable)
  (4 * sum((1:n) * sorted_variable) / n - 2 * avg * (n + 1)) / (n - 1)
}</code></pre>
<p><strong>2 &ndash; Now obtaining the Gini concentration ratio is just one line!</strong></p>
<pre><code class="language-r">gini &lt;- Delta(variable) / (2 * mean(variable))</code></pre>
<h2 id="python">What If I Don&#8217;t Use R?</h2>
<p>Fair point. R is a fantastic tool, but not everyone uses it. An index as important as Gini can be useful to many people who do not deal with statistics every day and are not familiar with R. The most universal and widespread programming language, even among non-programmers, is Python. Naturally, as with R, there are many possible implementations of the Gini coefficient, but in this case too, doing it ourselves is simple and instructive.</p>
<p>The solution we liked best comes from a post on <a href="https://planspace.org/2013/06/21/how-to-calculate-gini-coefficient-from-raw-data-in-python/" target="_blank" rel="noopener noreferrer">planspace.org</a> &mdash; here is the function, 8 lines in all:</p>
<pre><code class="language-python">def gini(list_of_values):
    sorted_list = sorted(list_of_values)
    height, area = 0, 0
    for value in sorted_list:
        height += value
        area += height - value / 2.
    fair_area = height * len(list_of_values) / 2.
    return (fair_area - area) / fair_area</code></pre>
<p>First, the function sorts the list of values in ascending order. Then, a <code>for</code> loop accumulates the running total of the values (<code>height</code>) and the area under the Lorenz curve, approximated as a sum of unit-width trapezoids: at each step the strip under the curve contributes <code>height - value / 2.</code></p>
<p><code>fair_area</code> is the area that would lie under the curve if the distribution were perfectly equal: half of the final cumulative total multiplied by the number of values.</p>
<p>Finally, the Gini index is computed as the difference between the fair area and the actual area under the Lorenz curve, divided by the fair area.</p>
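<p>Two quick sanity checks confirm the behaviour at the extremes (the function is repeated here so the snippet runs on its own):</p>
<pre><code class="language-python">def gini(list_of_values):
    sorted_list = sorted(list_of_values)
    height, area = 0, 0
    for value in sorted_list:
        height += value
        area += height - value / 2.
    fair_area = height * len(list_of_values) / 2.
    return (fair_area - area) / fair_area

print(gini([100, 100, 100, 100]))  # perfect equality -> 0.0
print(gini([0, 0, 0, 400]))        # one unit holds everything -> 0.75
print(gini([1, 2, 3, 4]))          # -> 0.25</code></pre>
<p>Note that for a finite sample of n units this implementation tops out at (n &minus; 1)/n rather than exactly 1, which is why the maximally unequal four-element list returns 0.75 instead of 1.</p>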
<h2 id="world-data">Gini Index Values Around the World</h2>
<ul>
<li>For a general overview, we can <a href="http://www.oecd.org/social/income-distribution-database.htm" target="_blank" rel="noopener noreferrer">visit the website</a> of the Organisation for Economic Co-operation and Development (OECD).</li>
<li>A comparison of values across European countries is provided by <a href="http://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=ilc_di12" target="_blank" rel="noopener noreferrer">Eurostat</a>.</li>
<li>On the <a href="http://dati.istat.it/Index.aspx?QueryId=4836" target="_blank" rel="noopener noreferrer">ISTAT website</a> it is possible to compare Gini index data across the various Italian regions.</li>
</ul>
<hr />
<h3>You might also like</h3>
<ul>
<li><a href="https://www.gironi.it/blog/en/descriptive-statistics-measures-of-position/">Descriptive Statistics: Measures of Position</a></li>
<li><a href="https://www.gironi.it/blog/en/descriptive-statistics-measures-of-variability-or-dispersion/">Descriptive Statistics: Measures of Variability</a></li>
<li><a href="https://www.gironi.it/blog/en/the-data-the-4-scales-of-measurement/">The Data: The 4 Scales of Measurement</a></li>
</ul>
<hr />
<h3 id="further-reading">Further Reading</h3>
<p>For a brilliant, accessible exploration of statistical thinking—including how inequality measures like the Gini coefficient help us understand the world—<a href="https://www.amazon.it/dp/8806246623?tag=consulenzeinf-21" rel="nofollow sponsored noopener" target="_blank"><em>The Art of Statistics</em></a> by David Spiegelhalter offers a masterful blend of rigour and clarity.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.gironi.it/blog/en/the-gini-index-what-it-is-why-it-matters-and-how-to-compute-it-in-r/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Contingency Tables and Conditional Probability</title>
		<link>https://www.gironi.it/blog/en/contingency-tables-and-conditional-probability/</link>
					<comments>https://www.gironi.it/blog/en/contingency-tables-and-conditional-probability/#respond</comments>
		
		<dc:creator><![CDATA[autore-articoli]]></dc:creator>
		<pubDate>Sun, 01 Mar 2026 19:47:43 +0000</pubDate>
				<category><![CDATA[statistics]]></category>
		<guid isPermaLink="false">https://www.gironi.it/blog/contingency-tables-and-conditional-probability/</guid>

					<description><![CDATA[Contingency tables are used to evaluate the interaction between two categorical variables (qualitative). They are also called two-way tables or cross-tabulations. Searching for relationships between two categorical variables is a very common goal for researchers. Think, for example, of the classic question that marketers ask: who is more likely to buy certain product categories, young &#8230; <a href="https://www.gironi.it/blog/en/contingency-tables-and-conditional-probability/" class="more-link">Continue reading<span class="screen-reader-text"> "Contingency Tables and Conditional Probability"</span></a>]]></description>
										<content:encoded><![CDATA[<p><strong>Contingency tables</strong> are used to evaluate the <strong>interaction between two categorical variables</strong> (qualitative). They are also called two-way tables or cross-tabulations.</p>
<p>Searching for relationships between two categorical variables is a very common goal for researchers. Think, for example, of the classic question that marketers ask: who is more likely to buy certain product categories, young or old people, men or women&#8230;</p>
<p><span id="more-3482"></span></p>
<div style="border: 1px solid #ccc;padding: 1.2em 1.5em;margin: 1.5em 0;border-radius: 6px">
<h3 style="margin-top: 0">What We&#8217;ll Cover</h3>
<ul>
<li><a href="#two-way-tables">Two-Way Tables and Marginal Distributions</a></li>
<li><a href="#conditional-probability">Conditional Probability</a></li>
<li><a href="#dependence-independence">Dependence and Independence</a></li>
<li><a href="#further-reading">Further Reading</a></li>
</ul>
</div>
<hr />
<h2 id="two-way-tables">Two-Way Tables and Marginal Distributions</h2>
<p>A <strong>two-way table</strong> is a table with rows and columns that helps organize data from categorical variables:</p>
<ul>
<li><strong>Rows</strong> represent the possible categories for one qualitative variable, for example males and females.</li>
<li><strong>Columns</strong> represent the possible categories for a second qualitative variable, for example whether someone likes pizza or not&#8230;</li>
</ul>
<p>A <strong>marginal distribution</strong> shows how many total responses there are for each category of the variable. The marginal distribution of a variable can be determined by looking at the &#8220;Total&#8221; column (or row).</p>
<p>Let&#8217;s look at an example.</p>
<p><em>Note: I couldn&#8217;t think of anything particularly clever, so I created a table (with fictitious data, of course) of rare silliness, imagining that the two categorical variables concern education level and favorite sci-fi series&#8230;</em></p>
<p>We build the table in R:</p>
<pre><code class="language-r">scifi_fans <- matrix(c(44, 38, 26, 53, 35, 30, 58, 22, 29), ncol = 3, byrow = TRUE)
rownames(scifi_fans) <- c("degree", "diploma", "lower education")
colnames(scifi_fans) <- c("star trek", "star wars", "doctor who")
scifi_fans <- as.table(scifi_fans)
scifi_fans</code></pre>
<p>and we get something like this:</p>
<pre>                 star trek   star wars   doctor who
degree               44          38          26
diploma              53          35          30
lower education      58          22          29</pre>
<p><img decoding="async" src="https://www.gironi.it/blog/wp-content/uploads/2023/03/26e3bb37-2a8f-4f6c-9e5d-fddf6a1bb60f-1024x1024.jpeg" alt="Fantasy image for the sci-fi dataset used to discuss contingency tables and conditional probability" loading="lazy"/></p>
<p>Remember? A <strong>marginal distribution</strong> shows how many total responses there are for each category of the variable (at the margins, precisely, where the Total column or row is...).</p>
<p>We can compute row totals in R with:</p>
<pre><code class="language-r">margin.table(scifi_fans, 1)</code></pre>
<p>and column totals with:</p>
<pre><code class="language-r">margin.table(scifi_fans, 2)</code></pre>
<p>We can also find the "grand total" with:</p>
<pre><code class="language-r">margin.table(scifi_fans)</code></pre>
<p>Here is the table with totals:</p>
<pre>              star trek   star wars   doctor who   <strong>TOTAL</strong>
degree            44          38          26        <strong>108</strong>
diploma           53          35          30        <strong>118</strong>
lower ed.         58          22          29        <strong>109</strong>
<strong>TOTAL            155          95          85        335</strong></pre>
<p>So the marginal totals by education level are 108 for degree holders, 118 for diploma holders, 109 for lower education.</p>
<p>Likewise, the marginal totals by sci-fi series type are 155 for Star Trek, 95 for Star Wars, 85 for Doctor Who.</p>
<p>The grand total must be the same in both directions, in this case 335.</p>
<p>We could also have displayed a complete table with totals using just a few lines of R code:</p>
<pre><code class="language-r">scifi_fans <- matrix(c(44, 38, 26, 53, 35, 30, 58, 22, 29), ncol = 3, byrow = TRUE)

row_names <- c("degree", "diploma", "lower education")
col_names <- c("star trek", "star wars", "doctor who")
dimnames(scifi_fans) <- list(row_names, col_names)

# Compute column totals using apply
col_totals <- apply(scifi_fans, 2, sum)
# Add row with column totals using rbind
scifi_fans2 <- rbind(scifi_fans, col_totals)
# Compute row totals
row_totals <- apply(scifi_fans2, 1, sum)
# Add column with row totals
cont_table <- cbind(scifi_fans2, row_totals)

# Print the table
cont_table</code></pre>
<p>We can then ask ourselves (and answer): what percentage of degree holders has a soft spot for Doctor Who?<br />Elementary, Watson (oh wait, that was a different series...):</p>
<p><strong>26/108 = 0.24 = 24% of degree holders prefer Doctor Who</strong></p>
<p>And how many Star Wars fans hold a diploma?</p>
<p><strong>35/95 = 0.37 = 37% of Star Wars fans are diploma holders</strong></p>
<p>In R, we can directly obtain row proportions with the function:</p>
<pre><code class="language-r">prop.table(scifi_fans, 1)</code></pre>
<p>and the result will be:</p>
<pre>                 star trek    star wars    doctor who
degree           0.4074074    0.3518519    0.2407407
diploma          0.4491525    0.2966102    0.2542373
lower ed.        0.5321101    0.2018349    0.2660550</pre>
<p>(as we can see, the row totals add up to 1, or 100%)</p>
<p>or column proportions with:</p>
<pre><code class="language-r">prop.table(scifi_fans, 2)</code></pre>
<p>and the result will be:</p>
<pre>                 star trek    star wars    doctor who
degree           0.2838710    0.4000000    0.3058824
diploma          0.3419355    0.3684211    0.3529412
lower ed.        0.3741935    0.2315789    0.3411765</pre>
<p>(as we can see, the column totals add up to 1, or 100%)</p>
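<p>For readers who prefer Python, the same marginal totals and proportions can be computed with numpy (a sketch using the same fictitious data as the R example above):</p>

```python
import numpy as np

# Same fictitious data as the R example above
scifi_fans = np.array([[44, 38, 26],
                       [53, 35, 30],
                       [58, 22, 29]])

row_totals = scifi_fans.sum(axis=1)   # like margin.table(scifi_fans, 1)
col_totals = scifi_fans.sum(axis=0)   # like margin.table(scifi_fans, 2)
grand_total = scifi_fans.sum()        # like margin.table(scifi_fans)

row_props = scifi_fans / row_totals[:, None]  # like prop.table(scifi_fans, 1)
col_props = scifi_fans / col_totals           # like prop.table(scifi_fans, 2)

print(row_totals, col_totals, grand_total)    # [108 118 109] [155 95 85] 335
```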
<p>As always, there is more than one way to get the result. We can also install the "gmodels" package and use the CrossTable function (we'll leave it to R's built-in help to show all the command options...):</p>
<pre><code class="language-r">install.packages("gmodels")
library(gmodels)
scifi_fans <- matrix(c(44, 38, 26, 53, 35, 30, 58, 22, 29), ncol = 3, byrow = TRUE)
rownames(scifi_fans) <- c("degree", "diploma", "lower education")
colnames(scifi_fans) <- c("star trek", "star wars", "doctor who")

CrossTable(scifi_fans, prop.r = FALSE, prop.c = FALSE, prop.t = FALSE, prop.chisq = FALSE)</code></pre>
<p>So what is all this good for? The answer is: for example, to compute <strong>conditional probability</strong>.</p>
<hr />
<h2 id="conditional-probability">Conditional Probability</h2>
<p>Before we see what it is and why it is an extremely useful concept in everyday life, we need a few preliminary definitions about <a href="https://www.gironi.it/blog/en/first-steps-into-the-world-of-probability/">probability</a>.</p>
<p>An experiment is the process of making a measurement or an observation.<br />An event is an outcome, or set of outcomes, of an experiment.</p>
<p><strong>Key definition: <em><a href="https://www.gironi.it/blog/en/first-steps-into-the-world-of-probability/">the probability of an event is the ratio of the number of favorable cases to the number of possible cases</a></em></strong></p>
<p>\( P(A) = \frac {\text{number of favorable cases}}{\text{number of possible cases}}\\ \)</p>
<p>Let us also recall that:</p>
<ul>
<li>The probability that two events both occur can never be greater than the probability that each event occurs separately.</li>
<li>If two possible events, A and B, are independent, then the probability that both occur is the product of their individual probabilities.</li>
<li>If an event has a number of mutually exclusive possible outcomes (A, B, C, etc.), then the probability that A or B occurs equals the sum of the individual probabilities of A and B, and the sum of the probabilities of all possible outcomes (A, B, C, etc.) equals 1, i.e. 100%.</li>
</ul>
<p>The <strong>conditional probability</strong> of an event A with respect to an event B is the probability that A occurs, given that B has occurred.</p>
<p>The formula is:</p>
<p>\( P(A|B) = \frac {P(A \text{ and } B)}{P(B)}\\ \)</p>
<p>If a probability is based on <strong>one variable</strong> it is a <strong>marginal probability</strong>; if on <strong>two or more variables</strong> it is called a <strong>joint probability</strong>.</p>
<ul>
<li>The <strong>probability of an event</strong> P(A) is: \( \frac {\text{marginal total of A}}{\text{grand total}}\\ \)</li>
<li>The <strong>joint probability of two events</strong> P(A and B) is: \( \frac {\text{count in the cell } (A \text{ and } B)}{\text{grand total}}\\ \)</li>
<li>The <strong>conditional probability</strong> of outcome A given the occurrence of condition B is: \( \frac {P(A \text{ and } B)}{P(B)}\\ \)</li>
</ul>
<p>In other words:</p>
<p>A <strong>joint probability</strong> is the probability that someone selected from the entire group has two particular characteristics at the same time. That is, both characteristics occur jointly. We find a joint probability by taking the value of the cell at the intersection of A and B and dividing by the grand total.</p>
<p>To find a <strong>conditional probability</strong>, we take the value of the cell at the intersection of A and B and divide it by the marginal total of B, i.e. the variable expressing the event that has occurred.</p>
<hr />
<p>It's time for a second example. We take the data from:<br /><em>Ellis GJ and Stone LH. 1979. Marijuana Use in College: An Evaluation of a Modeling Explanation. Youth and Society 10:323-334.</em></p>
<p>The study asks whether a college student is more likely to smoke marijuana if their parents had used drugs in the past. Here is the table:</p>
<pre>                    parents    parents     <strong>Total</strong>
                      use      no use
student uses          125        94         <strong>219</strong>
student does not use   85       141         <strong>226</strong>
<strong>Total                 210       235         445</strong></pre>
<p>Let's apply our knowledge to answer these questions:</p>
<ol>
<li><strong><em>If the parents used soft drugs in the past, what is the probability that their child does the same in college?</em></strong></li>
</ol>
<p>This is a case of conditional probability.<br />We recall \( P(A|B) = \frac {P(A \text{ and } B)}{P(B)}\\ \), therefore</p>
<p>P(<em>student uses given that parents used</em>) = 125 / 210 = 0.59 = 59%</p>
<p>2. <strong><em>A student is selected at random and does not use marijuana. What is the probability that their parents used it?</em></strong></p>
<p>Here again we face a question that asks for a conditional probability. Therefore:</p>
<p>P(<em>parents used given that student does not use</em>) = 85 / 226 = 0.376 = 37.6%</p>
<p>3. <strong><em>What is the probability of selecting a student who does not use marijuana and whose parents used it in the past?</em></strong></p>
<p>In this case we need to find a joint probability, so:</p>
<p>we take the count in the cell and divide it by the grand total: \( \frac {85}{445} = 0.19\\ \).</p>
<p>The probability is approximately 19%.</p>
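<p>The three answers can be double-checked with a few lines of Python, reading the counts straight from the table:</p>

```python
# Counts from the Ellis & Stone (1979) table
uses_parents_use, uses_parents_no = 125, 94
no_parents_use, no_parents_no = 85, 141

total_parents_use = uses_parents_use + no_parents_use   # 210
total_student_no = no_parents_use + no_parents_no       # 226
grand_total = (uses_parents_use + uses_parents_no
               + no_parents_use + no_parents_no)        # 445

p1 = uses_parents_use / total_parents_use  # P(student uses | parents used)
p2 = no_parents_use / total_student_no     # P(parents used | student does not use)
p3 = no_parents_use / grand_total          # joint probability

print(round(p1, 2), round(p2, 3), round(p3, 2))  # 0.6 0.376 0.19
```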
<h2 id="dependence-independence">Dependence and Independence</h2>
<p>If the outcomes of A and B influence each other, we say that <strong>the two variables are in a relationship of dependence</strong>.<br />Otherwise, we say the two variables are independent.</p>
<p>More rigorously: we can state that event B is independent of event A if:</p>
<p>P(B|A) = P(B)</p>
<p>or</p>
<p>P(A|B) = P(A)</p>
<p>If this is not the case, the events are dependent on each other.</p>
<p>Therefore:</p>
<ul>
<li>P(A and B) = P(A) P(B) if and only if A and B are independent events.</li>
<li>P(A | B) = P(A) and P(B | A) = P(B) if and only if A and B are independent events.</li>
</ul>
<h3>Let's examine the independence of categorical variables...</h3>
<p>Let's explain this better with an example.</p>
<p>Let A be the event that people enjoy cycling.<br />B expresses whether they enjoy roast lamb. (Makes perfect sense, right?)</p>
<p>We build our contingency table:</p>
<pre>                  Likes cycling   Doesn't like cycling   <strong>Total</strong>
Likes roast lamb       95                36               <strong>131</strong>
No roast lamb          15                19                <strong>34</strong>
---------------------------------------------------------------
<strong>Total                 110                55               165</strong></pre>
<p>Let's remember what it means for two events to be independent. It means this:<br />P(A | B) = P(A)</p>
<p>But in our case we see that<br />P(A) = 66.7%<br />because 110/165 = 0.67</p>
<p>P(A | B) = 72.5%<br />because 95/131 = 0.725</p>
<p>We recall that \( P(A|B) = \frac {P(A \text{ and } B)}{P(B)}\\ \), therefore \( \frac {95}{131} = 0.725\\ \).</p>
<p>From the result it is clear that \( P(A) \neq P(A|B) \) -- the two events are NOT independent (therefore they are dependent).</p>
<p>After all, everyone knows that there is a clear dependence between loving cycling and loving roast lamb!</p>
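<p>The independence check itself takes only a couple of lines of Python, using the counts from the table:</p>

```python
# Counts from the cycling / roast lamb table
both = 95            # likes cycling AND likes roast lamb
likes_lamb = 131     # marginal total for roast lamb (B)
likes_cycling = 110  # marginal total for cycling (A)
total = 165

p_a = likes_cycling / total       # P(A)   ~ 0.667
p_a_given_b = both / likes_lamb   # P(A|B) ~ 0.725

# Independence would require P(A|B) == P(A)
print(abs(p_a - p_a_given_b) < 0.01)   # False: the events are dependent
```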
<hr />
<h3>You might also like</h3>
<ul>
<li><a href="https://www.gironi.it/blog/en/first-steps-into-the-world-of-probability/">First Steps into the World of Probability</a></li>
<li><a href="https://www.gironi.it/blog/en/the-chi-square-test/">The Chi-Square Test</a></li>
<li><a href="https://www.gironi.it/blog/en/bayesian-statistics-how-to-learn-from-data-one-step-at-a-time/">Bayesian Statistics</a></li>
</ul>
<hr />
<h3 id="further-reading">Further Reading</h3>
<p>For a comprehensive treatment of contingency tables, conditional probability, and the full machinery of categorical data analysis, <a href="https://www.amazon.it/dp/8891910651?tag=consulenzeinf-21" rel="nofollow sponsored noopener" target="_blank"><em>Statistica</em></a> by Newbold, Carlson and Thorne provides a rigorous yet accessible framework for applying these concepts in real-world settings.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.gironi.it/blog/en/contingency-tables-and-conditional-probability/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>The Poisson Distribution</title>
		<link>https://www.gironi.it/blog/en/the-poisson-distribution/</link>
					<comments>https://www.gironi.it/blog/en/the-poisson-distribution/#respond</comments>
		
		<dc:creator><![CDATA[autore-articoli]]></dc:creator>
		<pubDate>Sun, 01 Mar 2026 19:47:37 +0000</pubDate>
				<category><![CDATA[statistics]]></category>
		<guid isPermaLink="false">https://www.gironi.it/blog/the-poisson-distribution/</guid>

					<description><![CDATA[The Poisson distribution is a discrete probability distribution that describes the number of events occurring in a fixed interval of time or area. The Poisson distribution is useful for measuring how many events can occur within a given time horizon, such as the number of customers entering a shop in the next hour, or the &#8230; <a href="https://www.gironi.it/blog/en/the-poisson-distribution/" class="more-link">Continue reading<span class="screen-reader-text"> "The Poisson Distribution"</span></a>]]></description>
										<content:encoded><![CDATA[<p>The Poisson distribution is a <strong>discrete probability distribution</strong> that describes the number of events occurring in a fixed interval of time or area.</p>
<p>The Poisson distribution is useful for <strong>measuring how many events can occur within a given time horizon</strong>, such as the number of customers entering a shop in the next hour, or the number of pageviews on a website in the next minute, and so on.</p>
<figure class="aligncenter"><img fetchpriority="high" decoding="async" src="https://www.gironi.it/blog/wp-content/uploads/2020/11/Simeon_Poisson.jpg" alt="The Poisson Distribution: Siméon-Denis Poisson" width="400" height="469" /><figcaption><strong><a href="https://en.wikipedia.org/wiki/Sim%C3%A9on-Denis_Poisson" target="_blank" rel="noreferrer noopener">Siméon-Denis Poisson</a></strong></figcaption></figure>
<p><span id="more-3481"></span></p>
<div style="border: 1px solid #ccc;padding: 1.2em 1.5em;margin: 1.5em 0;border-radius: 6px">
<h3 style="margin-top: 0">What We&#8217;ll Cover</h3>
<ul>
<li><a href="#lambda">Lambda: The Average Rate of Events</a></li>
<li><a href="#poisson-binomial">Poisson and Binomial: A Side Note</a></li>
<li><a href="#practical-example">A Practical Example</a></li>
<li><a href="#seo-application">The Poisson Distribution Applied to SEO</a></li>
<li><a href="#alternative-models">Alternative Models for Web Traffic Analysis</a></li>
<li><a href="#clicks-example">Using Poisson for Website Click Estimates</a></li>
<li><a href="#further-reading">Further Reading</a></li>
</ul>
</div>
<hr />
<h2 id="lambda">Lambda: The Average Rate of Events</h2>
<p>An important element: <strong>each time interval is assumed to be independent of all others.</strong></p>
<p>We need to know the <strong>average number of events or the rate at which they occur within the time interval</strong>. We represent this value with the Greek letter <strong>lambda</strong>:</p>
<p>\( X \sim Po(\lambda) \\ \\ \)</p>
<p>To calculate the probability that there are r occurrences in a specific interval:</p>
<p>\( P (X=r) = \frac{e^{-\lambda} \lambda^{r}}{r!} \\ \\ \)</p>
<p>For example, if:</p>
<p>\( X \sim Po(2) \\ \\ r=3 \)</p>
<p>we get:</p>
<p>\( P (X=3) = \frac{e^{-2} \cdot 2^{3}}{3!} =\frac{e^{-2} \cdot 8}{6} = e^{-2} \cdot 1.333 = 0.180 \\ \\ \)</p>
<p>That is, 18%.</p>
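<p>One line of Python confirms the arithmetic:</p>

```python
import math

lam, r = 2, 3
# Poisson PMF computed from the formula above
p = math.exp(-lam) * lam**r / math.factorial(r)
print(round(p, 3))  # 0.18
```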
<hr />
<h2 id="poisson-binomial">Poisson and Binomial: A Side Note</h2>
<p>If</p>
<p>\( X \sim Po(\lambda_X) \\ Y \sim Po(\lambda_Y) \\ \\ \)</p>
<p>then</p>
<p>\( X + Y \sim Po(\lambda_X + \lambda_Y) \\ \\ \)</p>
<p>If</p>
<p>\( X \sim Bin(n,p) \\ \\ \)</p>
<p>and n is large and p is small, then we can approximate the <a href="https://www.gironi.it/blog/en/probability-distributions-discrete-distributions-and-the-binomial/">binomial</a> with the Poisson:</p>
<p>\( X \sim Po(n \cdot p) \\ \\ \)</p>
<h3>Differences Between the Poisson and Binomial Distributions</h3>
<p>The Poisson and binomial distributions are both discrete probability distributions, and both are often used to model counts of rare events. The main difference between the two lies in how the number of trials is treated.</p>
<p><strong>The binomial distribution is biparametric</strong>, meaning it is characterised by two parameters n and p, where n represents the number of trials and p the probability of success in each trial.</p>
<p>In contrast, <strong>the Poisson distribution is uniparametric</strong>, meaning it is characterised by a single parameter &lambda; representing the average number of events per interval.</p>
<p>Furthermore, the binomial distribution is used when the number of trials is finite and the number of successes cannot exceed n, whereas the Poisson distribution is used when the number of trials is essentially infinite.</p>
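<p>The approximation is easy to see numerically. A small Python sketch, with arbitrarily chosen n = 1000 and p = 0.002, so that np = 2:</p>

```python
import math

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def pois_pmf(k, lam):
    """Poisson probability of k events with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

n, p = 1000, 0.002   # n large, p small, lambda = n * p = 2
for k in range(5):
    print(k, round(binom_pmf(k, n, p), 4), round(pois_pmf(k, n * p), 4))
# The two columns agree to roughly three decimal places
```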
<hr />
<h2 id="practical-example">A Practical Example</h2>
<p>A vending machine malfunctions on average 3.4 times per week. What is the probability that the machine will <strong>not</strong> break down next week?</p>
<p>\( P (X=0) = \frac{e^{- \lambda} \cdot \lambda ^{r}}{r!} \\ \\ = \frac{e^{-3.4} \cdot 3.4 ^{0}}{0!} = \\ \frac{e^{-3.4} \cdot 1}{1} = 0.033 \\ \)</p>
<p>We notice that the probability is very low indeed &mdash; just 3.3%.</p>
<p>Note: X=0 because we are looking at the probability that the machine does <strong>not</strong> break down.</p>
<figure class="aligncenter"><img decoding="async" src="https://www.gironi.it/blog/wp-content/uploads/2023/03/DALL·E-2023-03-17-16.45.37-Una-distributore-automatico-di-bevande-in-stile-pop-art-300x300.png" alt="A vending machine: an example to explain the Poisson distribution" width="300" height="300" /><figcaption><em>a battered vending machine&hellip;</em></figcaption></figure>
<p>In R we would use the command:</p>
<pre><code class="language-r">dpois(0, 3.4)</code></pre>
<p>Now let us calculate the probability that the vending machine breaks down exactly 3 times during the next week.</p>
<p>\( P (X=3) = \frac{e^{-3.4} \cdot 3.4 ^{3}}{3!} = \frac{e^{-3.4} \cdot 39.304}{6} = 0.216 \\ \)</p>
<p>The probability is 21.6%.</p>
<p>Moving on to a third question: what are the expected value and the variance of the vending machine malfunctions?</p>
<p>\( E(X) = \lambda = 3.4 \\ Var(X) = \lambda = 3.4 \\ \)</p>
<p style="background-color:#f0f0f0;padding:1em;">As we can see, within the Poisson distribution lambda represents not only the mean but also the <strong>variance</strong>. This is known as the <strong>mean-variance equality property of the Poisson distribution</strong>.</p>
<p>Therefore, if lambda is large, the variance is also large and the distribution is more spread out in absolute terms (although the spread relative to the mean shrinks, since the coefficient of variation is \( 1/\sqrt{\lambda} \)); if lambda is small, both the mean and the variance are small.</p>
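<p>The mean-variance equality is easy to verify by simulation (a sketch using numpy's random generator; the seed is arbitrary):</p>

```python
import numpy as np

rng = np.random.default_rng(42)
draws = rng.poisson(lam=3.4, size=200_000)

# Sample mean and sample variance should both be close to lambda = 3.4
print(round(draws.mean(), 2), round(draws.var(), 2))
```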
<hr />
<h2 id="seo-application">The Poisson Distribution Applied to SEO</h2>
<p>There are several aspects that make the Poisson distribution potentially interesting for website traffic analysis. It is a <strong>simple and well-understood statistical model</strong> that can be readily applied to website traffic data &mdash; for example, to estimate the <strong>average rate of requests or visits per unit of time</strong> and to predict the <strong>probability of observing a certain number of requests or visits in the future</strong>.</p>
<p>However, we should keep in mind that there are also many limitations to using the Poisson distribution for SEO-oriented web traffic analysis.</p>
<p>First, the Poisson distribution <strong>assumes that events occur independently and at a constant rate, which may not always hold for website traffic</strong>. For example, website traffic might exhibit <strong>spikes</strong> or <strong>bursty patterns</strong> that the Poisson distribution cannot capture.</p>
<p>Second, the <strong>Poisson distribution is a memoryless process</strong>, meaning it does not account for any history of past events. This can be a limitation when analysing website traffic data that display <strong>trends or seasonality</strong>.</p>
<p>Third, the Poisson distribution assumes that <strong>events are discrete and countable</strong>, which may not always be appropriate for modelling continuous variables such as response time or page load time. Finally, the Poisson distribution is a simple model that may not capture all the complexities of real-world website traffic.</p>
<p>There are several alternative models for website traffic analysis that can be used when the Poisson distribution is not appropriate.</p>
<hr />
<h2 id="alternative-models">Alternative Models for Web Traffic Analysis</h2>
<p>One alternative is the <strong>Negative Binomial distribution</strong>, which can handle overdispersion and capture the spikes or bursty patterns often seen in website traffic data.</p>
<p>Another alternative is the <strong>Lognormal distribution</strong>, which can be used to model continuous variables such as response time or page load time.</p>
<p>The <strong>Exponential distribution</strong> can also be used to model the time intervals between requests or visits to a website.</p>
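<p>The last point deserves a quick illustration: if the gaps between visits are exponentially distributed with rate \( \lambda \), the number of visits per unit of time is Poisson-distributed with mean \( \lambda \). A simulation sketch (the rate and seed are arbitrary):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
rate = 10  # average visits per hour

# Exponential gaps between consecutive visits (in hours)
gaps = rng.exponential(scale=1 / rate, size=100_000)
arrivals = np.cumsum(gaps)

# Count the visits falling in each whole hour
n_hours = int(arrivals[-1])
counts, _ = np.histogram(arrivals, bins=np.arange(n_hours + 1))

# Mean and variance of the hourly counts are both close to the rate
print(round(counts.mean(), 1), round(counts.var(), 1))
```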
<hr />
<h2 id="clicks-example">Using Poisson for Website Click Estimates</h2>
<p>Suppose we have a website that receives on average 10 clicks per hour and we want to estimate the probability of getting a certain number of clicks in one hour using the Poisson distribution. We can use R to carry out the following steps:</p>
<ol>
<li>We start by loading the ggplot2 library and defining the average number of clicks per hour (our lambda):</li>
</ol>
<pre><code class="language-r">library(ggplot2)

# Average number of clicks per hour
lam <- 10</code></pre>
<p>2. We now compute the probability mass function (PMF) of the Poisson distribution for each possible number of clicks using the dpois() function. For example, to calculate the probability of getting exactly 15 clicks:</p>
<pre><code class="language-r">clicks <- 15
prob <- dpois(clicks, lam)
cat(paste("The probability of getting", clicks, "clicks per hour is", prob, "\n"))</code></pre>
<p>The output is:</p>
<pre>The probability of getting 15 clicks per hour is 0.0347180696306841</pre>
<p>3. We compute the PMF of the Poisson distribution for a range of possible click counts using dpois() and display the results in a chart. For example, to calculate the probability of getting from 0 to 30 clicks:</p>
<pre><code class="language-r">x <- 0:30
pmf <- dpois(x, lam)</code></pre>
<p>4. We now plot the probability for each possible number of clicks:</p>
<pre><code class="language-r">ggplot(data.frame(x=x, pmf=pmf), aes(x, pmf)) +
  geom_bar(stat="identity") +
  xlab("Number of clicks") +
  ylab("Probability") +
  ggtitle(paste("PMF of the Poisson distribution with lambda =", lam))</code></pre>
<figure class="aligncenter"><img decoding="async" src="https://www.gironi.it/blog/wp-content/uploads/2023/03/PMF-Poisson.png" alt="PMF of the Poisson distribution" /></figure>
<p>5. We compute the CDF of the Poisson distribution and plot it:</p>
<pre><code class="language-r"># Compute the CDF of the Poisson distribution
cdf <- ppois(x, lam)

# Plot the CDF
ggplot(data.frame(x=x, cdf=cdf), aes(x, cdf)) +
  geom_step() +
  xlab("Number of clicks") +
  ylab("Cumulative probability") +
  ggtitle(paste("CDF of the Poisson distribution with lambda =", lam))</code></pre>
<figure class="aligncenter"><img decoding="async" src="https://www.gironi.it/blog/wp-content/uploads/2023/03/CDF-Poisson.png" alt="CDF of the Poisson distribution" /></figure>
<p>6. We calculate the smallest number of clicks for which the cumulative probability reaches 90%:</p>
<pre><code class="language-r">q <- qpois(0.9, lam)
cat(paste("With 90% probability, the number of clicks per hour is at most", q, "\n"))</code></pre>
<p>The output is:</p>
<pre>With 90% probability, the number of clicks per hour is at most 14</pre>
<p>For convenience, here is the equivalent Python script:</p>
<pre><code class="language-python">import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import poisson

# Define the average number of clicks per hour
lam = 10

clicks = 15
prob = poisson.pmf(clicks, lam)
print(f"The probability of getting {clicks} clicks per hour is {prob}")

x = np.arange(0, 31)
pmf = poisson.pmf(x, lam)

plt.bar(x, pmf)
plt.xlabel('Number of clicks')
plt.ylabel('Probability')
plt.title(f'PMF of the Poisson distribution with lambda = {lam}')
plt.show()

cdf = poisson.cdf(x, lam)

plt.step(x, cdf)
plt.xlabel('Number of clicks')
plt.ylabel('Cumulative probability')
plt.title(f'CDF of the Poisson distribution with lambda = {lam}')
plt.show()

q = poisson.ppf(0.9, lam)
print(f"With 90% probability, the number of clicks per hour is at most {int(q)}")</code></pre>
<hr />
<h3>You might also like</h3>
<ul>
<li><a href="https://www.gironi.it/blog/en/probability-distributions-discrete-distributions-and-the-binomial/">Probability Distributions: Discrete Distributions and the Binomial</a></li>
<li><a href="https://www.gironi.it/blog/en/anomaly-detection-how-to-identify-outliers-in-data/">Anomaly Detection: How to Identify Outliers in Data</a></li>
<li><a href="https://www.gironi.it/blog/en/the-negative-binomial-distribution/">The Negative Binomial Distribution</a></li>
</ul>
<hr />
<h3 id="further-reading">Further Reading</h3>
<p>For an accessible yet thorough introduction to probability distributions—including the Poisson—<a href="https://www.amazon.it/dp/8867319396?tag=consulenzeinf-21" rel="nofollow sponsored noopener" target="_blank"><em>Finalmente ho capito la statistica</em></a> by Maurizio De Pra covers these topics in a clear and approachable style, ideal for building solid intuition before moving on to more advanced topics.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.gironi.it/blog/en/the-poisson-distribution/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>The Geometric Distribution</title>
		<link>https://www.gironi.it/blog/en/the-geometric-distribution/</link>
					<comments>https://www.gironi.it/blog/en/the-geometric-distribution/#respond</comments>
		
		<dc:creator><![CDATA[autore-articoli]]></dc:creator>
		<pubDate>Sun, 01 Mar 2026 19:47:35 +0000</pubDate>
				<category><![CDATA[statistics]]></category>
		<guid isPermaLink="false">https://www.gironi.it/blog/the-geometric-distribution/</guid>

					<description><![CDATA[After looking at the most famous discrete distribution, the Binomial, as well as the Poisson distribution and the Beta distribution, it is time to take a look at the geometric distribution. What We&#8217;ll Cover How Many Trials Until the First Success? Worked Examples Computing in R Further Reading How Many Trials Until the First Success? &#8230; <a href="https://www.gironi.it/blog/en/the-geometric-distribution/" class="more-link">Continue reading<span class="screen-reader-text"> "The Geometric Distribution"</span></a>]]></description>
										<content:encoded><![CDATA[<p>After looking at the most famous discrete distribution, the <a href="https://www.gironi.it/blog/en/probability-distributions-discrete-distributions-and-the-binomial/" target="_blank" rel="noreferrer noopener">Binomial</a>, as well as the <a href="https://www.gironi.it/blog/la-distribuzione-di-poisson/" target="_blank" rel="noreferrer noopener">Poisson distribution</a> and the <a href="https://www.gironi.it/blog/en/the-beta-distribution-explained-simply/" target="_blank" rel="noreferrer noopener">Beta distribution</a>, it is time to take a look at the <em><strong>geometric distribution</strong></em>.</p>
<p><span id="more-3480"></span></p>
<div style="border: 1px solid #ccc;padding: 1.2em 1.5em;margin: 1.5em 0;border-radius: 6px">
<h3 style="margin-top: 0">What We&#8217;ll Cover</h3>
<ul>
<li><a href="#how-many-trials">How Many Trials Until the First Success?</a></li>
<li><a href="#examples">Worked Examples</a></li>
<li><a href="#r-code">Computing in R</a></li>
<li><a href="#further-reading">Further Reading</a></li>
</ul>
</div>
<h2 id="how-many-trials"><strong>How Many Trials Until the First Success?</strong></h2>
<p>We use the geometric distribution when we perform independent trials, each of which can result in either success or failure, and <strong>we want to know how many trials are needed to obtain the first success</strong>.</p>
<p>In symbols:</p>
<p>\( X \sim Geo(p) \\ \\ \)</p>
<ul>
<li>\(X\) is the number of trials needed to obtain the first success.</li>
<li>\(r\) is a given number of trials (the trial on which the first success occurs).</li>
<li>\(p\) is the probability of success on each trial.</li>
<li>We also define, as is natural: \( q = 1 - p \)</li>
</ul>
<p>Here is where it gets interesting. We have:</p>
<p>\( \\ P(X=r) = p \times q ^ {r-1} \\ \)</p>
<p><strong>P(X=r) therefore denotes the probability that the first success occurs on trial number r.</strong><br />Let us continue our reasoning:</p>
<p>\( P(X > r) = q ^ {r} \)</p>
<p><strong>This allows us to calculate the probability that more than r trials are needed before the first success</strong>, as well as:</p>
<p>\( P(X \leq r) = 1 &#8211; q ^ {r} \\ \)</p>
<p>which helps us find the probability that r trials or fewer are needed to achieve the first success. The expected value is:</p>
<p>\( E(X) = \frac{1}{p} \\ \)</p>
<p>The <strong>variance</strong> is:</p>
<p>\( Var(X) = \frac{q}{p^{2}} \)</p>
<h2 id="examples">Worked Examples</h2>
<p>We know that the probability of an ice skater completing a course without incident is 0.4. Therefore:</p>
<p>\( X \sim Geo(0.4) \\ \)</p>
<p>X is the number of attempts our skater must make in order to complete a course without any incident.</p>
<p>We are now ready to apply our new knowledge.</p>
<figure class="aligncenter"><img decoding="async" src="https://www.gironi.it/blog/wp-content/uploads/2023/04/Firefly_anice-skater-glides-on-the-rink-ring.-The-ice-is-covered-in-numbers-representing-probabilities._art_42785-1024x745.jpg" alt="Artistic representation of the ice skater example for the geometric distribution" loading="lazy" /></figure>
<p>Let us calculate the expected number of attempts before achieving a success:</p>
<p>\( E(X) = \frac{1}{p} \\ \)<br />
therefore<br />
\( \frac{1}{0.4} = 2.5 \)</p>
<p>The variance in the number of attempts is quickly calculated:</p>
<p>\( Var(X) = \frac{q}{p^{2}} \\ \)<br />
that is<br />
\( \frac{0.6}{0.4^{2}} = \frac{0.6}{0.16} = 3.75 \\ \)</p>
<p>The probability of succeeding on the second attempt, after having failed the first:</p>
<p>\( P(X=2) = P \times q = 0.4 \times 0.6 = 0.24 \\ \)<br />
that is, 24%</p>
<p>The probability of succeeding in 4 attempts or fewer? Easy!</p>
<p>\( P(X \leq 4) = 1-q^{4} = 1 - 0.6^{4} = 1 - 0.1296 \\ \)</p>
<p>That is 0.8704, or 87%.</p>
<p>The probability of needing more than 4 attempts? A simple calculation:</p>
<p>\( P(X > 4) = q^{4} = 0.6^{4} \\ \)</p>
<p>That is 0.1296, or about 13%.</p>
<hr />
<h2 id="r-code">Computing in R</h2>
<p>Now that we have the formulas well in mind, we can let our laziness take over and use R to do the heavy lifting.</p>
<p>With P(X=2) and P=0.4:</p>
<pre><code class="language-r">dgeom(1, 0.4)</code></pre>
<p>where 1 is the number of failures before the first success.</p>
<p>P(X&lt;=4) and P=0.4:</p>
<pre><code class="language-r">pgeom(3, 0.4)</code></pre>
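<p>For the last formula, P(X &gt; 4), we do not even need to take a complement: <code>pgeom()</code> accepts a <code>lower.tail</code> argument (remember that R counts failures, so &#8220;more than 4 trials&#8221; means &#8220;more than 3 failures&#8221;):</p>
<pre><code class="language-r">pgeom(3, 0.4, lower.tail = FALSE)
# [1] 0.1296</code></pre>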
<p>Simple, quick, and fun!</p>
<hr />
<h3>You might also like</h3>
<ul>
<li><a href="https://www.gironi.it/blog/en/probability-distributions-discrete-distributions-and-the-binomial/">Probability Distributions: Discrete Distributions and the Binomial</a></li>
<li><a href="https://www.gironi.it/blog/en/the-negative-binomial-distribution/">The Negative Binomial Distribution</a></li>
<li><a href="https://www.gironi.it/blog/en/the-normal-distribution/">The Normal Distribution</a></li>
</ul>
<hr />
<h3 id="further-reading">Further Reading</h3>
<p>For an accessible yet thorough introduction to discrete probability distributions—including the geometric—<a href="https://www.amazon.it/dp/8867319396?tag=consulenzeinf-21" rel="nofollow sponsored noopener" target="_blank"><em>Finalmente ho capito la statistica</em></a> by Maurizio De Pra covers these topics in a clear and approachable style, ideal for building solid intuition.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.gironi.it/blog/en/the-geometric-distribution/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>A Brief (Personal) Manifesto for SEO</title>
		<link>https://www.gironi.it/blog/en/a-brief-personal-manifesto-for-seo/</link>
					<comments>https://www.gironi.it/blog/en/a-brief-personal-manifesto-for-seo/#respond</comments>
		
		<dc:creator><![CDATA[autore-articoli]]></dc:creator>
		<pubDate>Sun, 01 Mar 2026 19:47:33 +0000</pubDate>
				<category><![CDATA[statistics]]></category>
		<guid isPermaLink="false">https://www.gironi.it/blog/a-brief-personal-manifesto-for-seo/</guid>

					<description><![CDATA[The need I feel—the fruit of many years working in this field—is to affirm the decisive importance of basic scientific rigour in analysing traffic data, so that we can calibrate our SEO interventions with accuracy, and not merely &#8220;by gut feeling&#8221; (even though feelings do matter!). The tools available to the SEO professional are countless, &#8230; <a href="https://www.gironi.it/blog/en/a-brief-personal-manifesto-for-seo/" class="more-link">Continue reading<span class="screen-reader-text"> "A Brief (Personal) Manifesto for SEO"</span></a>]]></description>
										<content:encoded><![CDATA[<p>The need I feel—the fruit of many years working in this field—is to affirm the decisive importance of basic scientific rigour in analysing traffic data, so that we can calibrate our SEO interventions with accuracy, and not merely &#8220;by gut feeling&#8221; (even though feelings do matter!).</p>
<p>The tools available to the SEO professional are countless, and yet it is undeniable that a sense of disappointment lingers within us. Too often we deal with data of apparent strategic importance that turn out, when put to the test, to be fallacious or imprecise—mere red herrings.</p>
<p><span id="more-3479"></span></p>
<p>Raise your hand if you have never felt a sense of frustration using any—truly any—competitive analysis tool, to give a concrete example.</p>
<p>Wittgenstein stated that whereof one cannot speak, thereof one must be silent.</p>
<p>Too often we draw conclusions based on utterly unreliable data. The trap of eye-catching graphs, one-click conclusions, and special effects is always with us. The tool that serves up, ready to use, exactly the analysis we need simply does not exist.</p>
<p>Too often SEO is treated by its practitioners not so much as an art (that would be useful and is even necessary, though not sufficient) but as a sort of bag of tricks for apprentice storytellers, or as a stage for conjuring acts based on evanescent numbers. A sum of a few SEO tools and report templates good for every occasion. Or sometimes shamanistic practices combined into miracle recipes.</p>
<p>Back to statistics. SEO is obviously not just statistics, but it rests on numbers. Basic statistics, to begin with. The ABCs of descriptive statistics.</p>
<p>Let us examine what the only solid data we have tell us—the data that come from our own sites. Not sampled data, not data derived from presumed traffic based on average positions in search engine indices, without clinging to fanciful correlations presented as certainties on the basis of r values close to 0…</p>
<p>Our data. Let us look at what our data tell us.</p>
<p>Sometimes it seems that reasoning in &#8220;simple&#8221; terms (median and five-number summary, boxplots and histograms, hypothesis testing and moving averages) is reductive. Other times I meet professionals who consider even a basic grounding in statistics superfluous, ignoring the fact that statistics has a fundamental, irreplaceable virtue: it makes us sceptical, and exceedingly doubtful.</p>
<p>We know little, in truth. Little can be understood by looking at numbers generated mostly by what an algorithm—unknown in composition and in how its variables are weighted—compresses into positions in an index. That little, however, must be reasonably supported by data and by common sense.</p>
<p>The rest is art, precisely. Intuition, experience, and more. Only by tracing the boundaries between what is reasonably (statistically) expressed by the data in our possession and what our experience suggests can the insight, the decisive experiment, emerge.</p>
<p>Only in this way, I believe, can we provide a service of genuine value.</p>
<p>SEO is an intrinsically treacherous subject. Approaching it requires experience, a certain dose of courage, and a necessary sense of one&#8217;s own limits.</p>
<hr />
<h3>You might also like</h3>
<ul>
<li><a href="https://www.gironi.it/blog/en/the-data-the-4-scales-of-measurement/">The Data: The 4 Scales of Measurement</a></li>
<li><a href="https://www.gironi.it/blog/en/descriptive-statistics-measures-of-position/">Descriptive Statistics: Measures of Position</a></li>
<li><a href="https://www.gironi.it/blog/en/guide-to-statistical-tests-for-a-b-analysis/">Guide to Statistical Tests for A/B Analysis</a></li>
</ul>
<hr />
<h3>Further Reading</h3>
<p>For a brilliant, accessible exploration of why statistical thinking matters—and how it protects us from misleading data—<a href="https://www.amazon.it/dp/8806246623?tag=consulenzeinf-21" rel="nofollow sponsored noopener" target="_blank"><em>The Art of Statistics</em></a> by David Spiegelhalter shows how to reason clearly with numbers in a world full of uncertainty.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.gironi.it/blog/en/a-brief-personal-manifesto-for-seo/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Descriptive Statistics: Measures of Variability (or Dispersion)</title>
		<link>https://www.gironi.it/blog/en/descriptive-statistics-measures-of-variability-or-dispersion/</link>
					<comments>https://www.gironi.it/blog/en/descriptive-statistics-measures-of-variability-or-dispersion/#respond</comments>
		
		<dc:creator><![CDATA[autore-articoli]]></dc:creator>
		<pubDate>Sun, 01 Mar 2026 19:32:03 +0000</pubDate>
				<category><![CDATA[statistics]]></category>
		<guid isPermaLink="false">https://www.gironi.it/blog/?p=3468</guid>

					<description><![CDATA[Measures of variability are used to describe the degree of dispersion of observations around a central tendency index. In other words, measures of variability allow us to assess how data are spread around a central value, which may be represented, for example, by the mean or the median. They provide valuable information about the distribution &#8230; <a href="https://www.gironi.it/blog/en/descriptive-statistics-measures-of-variability-or-dispersion/" class="more-link">Continue reading<span class="screen-reader-text"> "Descriptive Statistics: Measures of Variability (or Dispersion)"</span></a>]]></description>
										<content:encoded><![CDATA[<p>Measures of variability are used to <strong>describe the degree of dispersion of observations around a central tendency index</strong>.</p>
<p>In other words, measures of variability allow us to <strong>assess how data are spread around a central value</strong>, which may be represented, for example, by the <a href="https://www.gironi.it/blog/en/descriptive-statistics-measures-of-position-and-central-tendency/#the-arithmetic-mean" target="_blank" rel="noreferrer noopener">mean</a> or the <a href="https://www.gironi.it/blog/en/descriptive-statistics-measures-of-position-and-central-tendency/#the-median" target="_blank" rel="noreferrer noopener">median</a>. They <strong>provide valuable information about the distribution of data</strong>, enabling a better understanding of the phenomenon under observation.</p>
<p>The techniques for measuring the variability of datasets are numerous. Among them, the most widely known (and most commonly used) are:</p>
<ul>
<li>the <a href="#range">range</a></li>
<li>the <a href="#mean-deviation">mean deviation</a> and the <a href="#variance">variance</a></li>
<li>the <a href="#standard-deviation">standard deviation</a></li>
<li>the <a href="#cv">coefficient of variation</a></li>
</ul>
<p>We will also visualise the concepts of central tendency and dispersion by revisiting <a href="#skewness">skewness</a> and introducing the concept of <a href="#kurtosis">kurtosis</a>.</p>
<p><span id="more-3468"></span></p>
<hr />
<div style="border: 1px solid #ccc;padding: 1.2em 1.5em;margin: 1.5em 0;border-radius: 6px">
<h3 style="margin-top: 0">What We&#8217;ll Cover</h3>
<ul>
<li><a href="#range">The Range</a></li>
<li><a href="#mean-deviation">The Mean Deviation</a></li>
<li><a href="#variance">Variance</a></li>
<li><a href="#standard-deviation">The Standard Deviation</a></li>
<li><a href="#cv">The Coefficient of Variation</a></li>
<li><a href="#skewness">The Shape of a Distribution</a></li>
<li><a href="#kurtosis">Kurtosis</a></li>
<li><a href="#further-reading">Further Reading</a></li>
</ul>
</div>
<hr />
<h2 id="range">The Range</h2>
<p>The range is the <strong>difference between the maximum and the minimum value</strong> of ungrouped data in a frequency distribution.</p>
<p>It is a very quick calculation, which in R can be computed as follows:</p>
<pre><code class="language-r">max(var) - min(var)</code></pre>
<p>The maximum and minimum can also be displayed with:</p>
<pre><code class="language-r">range(var)</code></pre>
<p>and they appear as the first and last terms in:</p>
<pre><code class="language-r">fivenum(var)</code></pre>
<p>For grouped data, the range is defined as the difference between the upper boundary of the highest class and the lower boundary of the lowest class.</p>
<p>A <strong>trimmed range</strong> is a range computed after removing a certain percentage of extreme values from both ends of the distribution (for instance, the range of the <em>middle 80 per cent</em> of the data).</p>
<hr />
<h2 id="mean-deviation">The Mean Deviation</h2>
<p>The mean deviation is a measure of variability based on the difference between each data point and the mean. If we calculated the average by summing the positive and negative differences between each value and the arithmetic mean, <strong>the result would always be zero</strong>. For this reason, we <strong>sum the absolute values of the differences</strong>:</p>
<p>\(<br />
MD = \frac{\Sigma|X - \mu|}{N} \\<br />
\)</p>
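<p>Though rarely used in practice, the mean deviation is a one-liner in R. A minimal sketch, using a small illustrative vector:</p>
<pre><code class="language-r">x <- c(24, 17, 21, 23, 15, 30)
mean(abs(x - mean(x)))
# [1] 4</code></pre>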
<p>Those &#8220;absolute values&#8221; are awkward to work with mathematically, which is why the mean deviation is not widely used. There is another way to eliminate negative values, and so we introduce the important concept of&#8230;</p>
<h2 id="variance">Variance</h2>
<p>Variance is analogous to the mean deviation, since it is based on the <strong>differences between each data point and the mean</strong>, but these differences are <strong>squared before being summed</strong>. Variance is denoted by the lowercase sigma squared symbol, and the formula is:</p>
<p>\(<br />
\sigma^{2}=\frac{\Sigma(X - \mu)^{2}}{N} \\ \\<br />
\)</p>
<p>R has the <code>var()</code> function for computing variance, but it uses <em>(n-1)</em> in the denominator. To obtain the variance with <em>N</em> in the denominator, we can define a custom function:</p>
<pre><code class="language-r">pop_var <- function(x) { var(x) * (1 - 1/length(x)) }</code></pre>
<p>In general, <strong>interpreting the value of a variance is difficult</strong> because <strong>the units it is expressed in are not the same as those of the original observations</strong>.</p>
<p>For this reason, the <em>standard deviation</em> was introduced.</p>
<h2 id="standard-deviation">The Standard Deviation: The Most Widely Used Measure of Variability</h2>
<p>The standard deviation is simply the <strong>square root of the variance</strong>:</p>
<p>\(<br />
\sigma = \sqrt{\frac{\Sigma(X - \mu)^{2}}{N}} \\ \\<br />
\)</p>
<p>The standard deviation is of fundamental utility in statistics, particularly (as we shall see) in conjunction with the <a href="https://www.gironi.it/blog/en/the-normal-distribution/">normal distribution</a>.</p>
<p>For grouped data, we assume that the midpoint of each class represents all the measurements within that class. The variance formula then becomes:</p>
<p>\(<br />
\sigma^{2}=\frac{\Sigma f(X - \mu)^{2}}{N} \\ \\<br />
\)</p>
<p>and the standard deviation:</p>
<p>\(<br />
\sigma = \sqrt{\frac{\Sigma f(X - \mu)^{2}}{N}} \\ \\<br />
\)</p>
<p>In R, the function for computing the standard deviation is <code>sd()</code>. However, R uses <em>(n-1)</em> in the denominator. So, if we want the population standard deviation (with <em>n</em> in the denominator), we can define a dedicated function:</p>
<pre><code class="language-r">pop_sd <- function(x) { sqrt(sum((x - mean(x))^2) / length(x)) }</code></pre>
<h2 id="cv">The Coefficient of Variation</h2>
<p>The coefficient of variation indicates the relative magnitude of the standard deviation with respect to the distribution mean. It is extremely useful for comparing phenomena expressed in different units of measurement, since <strong>the CV is a "pure" number, independent of the unit of measurement</strong>:</p>
<p>\(<br />
CV = \frac{\sigma}{\mu} \\<br />
\)</p>
<p>As is often the case in R, there is a ready-made function: we can use <code>cv()</code>, defined in an external library, <a href="https://cran.r-project.org/web/packages/labstatR/index.html" target="_blank" rel="noreferrer noopener">labstatR</a>. Its usage is straightforward:</p>
<pre><code class="language-r">library(labstatR)
data <- c(24, 17, 21, 23, 15, 30, 24, 21, 24, 19,
          25, 28, 22, 20, 14, 19, 26, 29, 23, 25,
          24, 18, 27, 21)
cv(data)
# [1] 0.1817708</code></pre>
<p>We can also calculate the value quite simply without resorting to external libraries:</p>
<pre><code class="language-r">data <- c(24, 17, 21, 23, 15, 30, 24, 21, 24, 19,
          25, 28, 22, 20, 14, 19, 26, 29, 23, 25,
          24, 18, 27, 21)
pop_sd <- function(x) { sqrt(sum((x - mean(x))^2) / length(x)) }
cv_data <- pop_sd(data) / mean(data)
cv_data
# [1] 0.1817708</code></pre>
<h2 id="skewness">The Shape of a Distribution</h2>
<p>Frequency distributions can take on the most varied shapes. Among all of them, the one by far most important in statistics is the <strong>normal distribution</strong>, also known as the <strong>bell curve</strong> or <strong>Gaussian distribution</strong>.</p>
<p>In a <strong><a href="https://www.gironi.it/blog/en/the-normal-distribution/">normal</a></strong> distribution, data are arranged <strong>symmetrically around the mean</strong>. In a very straightforward way, to describe the shape of the distribution we simply <strong>compare the mean with the median</strong>: if they are <strong>equal, the distribution is symmetric</strong>. If the mean is greater than the median, we have <strong>positive skewness</strong> (with a longer "tail" on the right); if the mean is less than the median, the skewness is <strong>negative</strong> (with the longer "tail" on the left).</p>
<p>The best-known formula for calculating the skewness of a distribution is <strong>Pearson's coefficient of skewness</strong>:</p>
<p>\(<br />
Skewness = \frac{3(\mu - med)}{\sigma} \\ \\<br />
\)</p>
<p>A perfectly symmetric distribution has a skewness value of 0. A right-skewed (positive) distribution has a positive value, while a left-skewed distribution has a negative value.</p>
<p>Skewness values typically fall between -3 and 3, and the fact that the standard deviation appears in the denominator makes the <strong>value independent of the unit of measurement</strong>.</p>
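<p>Pearson&#8217;s coefficient translates directly into R. Here is a sketch, reusing the <code>pop_sd</code> helper defined earlier and the same example data:</p>
<pre><code class="language-r">data <- c(24, 17, 21, 23, 15, 30, 24, 21, 24, 19,
          25, 28, 22, 20, 14, 19, 26, 29, 23, 25,
          24, 18, 27, 21)
pop_sd <- function(x) { sqrt(sum((x - mean(x))^2) / length(x)) }
3 * (mean(data) - median(data)) / pop_sd(data)
# negative: the mean is below the median</code></pre>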
<p>How do we calculate the skewness index in R? The simplest way is to use a library that provides the functions we need "ready to go":</p>
<pre><code class="language-r">library(moments)
data <- c(24, 17, 21, 23, 15, 30, 24, 21, 24, 19,
          25, 28, 22, 20, 14, 19, 26, 29, 23, 25,
          24, 18, 27, 21)
skewness(data)
# [1] -0.1918578</code></pre>
<p>The <code>moments</code> library serves us well. Let us see, however, how to calculate the index without relying on a library. It is very simple. The first step is to remember that <strong>R uses <em>n-1</em> in the denominator of the variance</strong>. We are reasoning about a population, however, with <em>n</em> in the denominator. So, let us define a function that gives us the value we need:</p>
<pre><code class="language-r">pop_var <- function(x) { var(x) * (1 - 1/length(x)) }</code></pre>
<p>At this point we can calculate the skewness index:</p>
<pre><code class="language-r">data <- c(24, 17, 21, 23, 15, 30, 24, 21, 24, 19,
          25, 28, 22, 20, 14, 19, 26, 29, 23, 25,
          24, 18, 27, 21)
pop_var <- function(x) { var(x) * (1 - 1/length(x)) }
z <- (data - mean(data)) / sqrt(pop_var(data))
skew <- mean(z^3)
skew
# [1] -0.1918578</code></pre>
<h2 id="kurtosis">Kurtosis</h2>
<p><strong>Kurtosis is the degree of peakedness of a distribution curve</strong>, relative to the normal distribution.</p>
<p>We have three cases:</p>
<ol>
<li>a <strong>tall</strong> curve, called <em><strong>leptokurtic</strong></em>, which is highly concentrated around its mean</li>
<li>a <strong>normal</strong> curve, called <em><strong>mesokurtic</strong></em></li>
<li>a <strong>low and flat</strong> curve, called <em><strong>platykurtic</strong></em>, with little concentration around its mean</li>
</ol>
<figure><img decoding="async" src="https://www.gironi.it/blog/wp-content/uploads/2018/10/curtosi.jpg" alt="Measures of variability: leptokurtic, mesokurtic, and platykurtic curves" /></figure>
<p>Kurtosis can be measured by dividing the fourth moment by the standard deviation raised to the fourth power. Sounds difficult? It is easier done than said. Here is the formula:</p>
<p>\(<br />
Kurtosis = \frac{\Sigma f(X - \mu)^{4}}{N\sigma^{4}} \\ \\<br />
\)</p>
<p>The kurtosis of a mesokurtic curve has a value of 3. Naturally, a kurtosis coefficient less than 3 indicates a platykurtic curve, while a value greater than 3 indicates a leptokurtic curve.</p>
<p>As with the skewness index, the <code>moments</code> library provides a convenient ready-made function:</p>
<pre><code class="language-r">library(moments)
data <- c(24, 17, 21, 23, 15, 30, 24, 21, 24, 19,
          25, 28, 22, 20, 14, 19, 26, 29, 23, 25,
          24, 18, 27, 21)
kurtosis(data)
# [1] 2.480035</code></pre>
<p>But we are not averse to computing it ourselves:</p>
<pre><code class="language-r">data <- c(24, 17, 21, 23, 15, 30, 24, 21, 24, 19,
          25, 28, 22, 20, 14, 19, 26, 29, 23, 25,
          24, 18, 27, 21)
pop_var <- function(x) { var(x) * (1 - 1/length(x)) }
z <- (data - mean(data)) / sqrt(pop_var(data))
kurt <- mean(z^4)
kurt
# [1] 2.480035</code></pre>
<p>We have seen that measures of central tendency alone are not enough: two datasets can have the same mean yet behave in profoundly different ways. Variability tells us <em>how much</em> the data fluctuate, and the shape of the distribution tells us <em>how</em> they are arranged. In the next article, we will take the first steps into the world of probability—the essential bridge between descriptive statistics and the inferential tools that let us draw conclusions from data.</p>
<hr />
<h3>You might also like</h3>
<ul>
<li><a href="https://www.gironi.it/blog/en/descriptive-statistics-measures-of-position-and-central-tendency/">Descriptive Statistics: Measures of Position and Central Tendency</a></li>
<li><a href="https://www.gironi.it/blog/en/the-normal-distribution/">The Normal Distribution</a></li>
<li><a href="https://www.gironi.it/blog/en/anomaly-detection-how-to-identify-outliers-in-your-data/">Anomaly Detection: How to Identify Outliers in Your Data</a></li>
</ul>
<hr />
<h3 id="further-reading">Further Reading</h3>
<p>If you want a gentle yet rigorous introduction to the concepts behind descriptive statistics—including variability, distributions, and the art of making sense of data—<a href="https://www.amazon.it/dp/8806246623?tag=consulenzeinf-21" rel="nofollow sponsored noopener" target="_blank"><em>The Art of Statistics</em></a> by David Spiegelhalter is an excellent starting point. Spiegelhalter has the rare ability to explain statistical concepts without dumbing them down, and his treatment of uncertainty and dispersion is particularly illuminating.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.gironi.it/blog/en/descriptive-statistics-measures-of-variability-or-dispersion/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Probability Distributions: Discrete Distributions and the Binomial</title>
		<link>https://www.gironi.it/blog/en/probability-distributions-discrete-distributions-and-the-binomial/</link>
					<comments>https://www.gironi.it/blog/en/probability-distributions-discrete-distributions-and-the-binomial/#respond</comments>
		
		<dc:creator><![CDATA[autore-articoli]]></dc:creator>
		<pubDate>Sun, 01 Mar 2026 19:31:56 +0000</pubDate>
				<category><![CDATA[statistics]]></category>
		<guid isPermaLink="false">https://www.gironi.it/blog/?p=3469</guid>

					<description><![CDATA[A random variable (also called a stochastic variable) is a variable that can take on different values depending on some random phenomenon. In many statistics textbooks it is simply abbreviated as r.v. It is a numerical value. When probability values are assigned to all the possible numerical values of a random variable x, the result &#8230; <a href="https://www.gironi.it/blog/en/probability-distributions-discrete-distributions-and-the-binomial/" class="more-link">Continue reading<span class="screen-reader-text"> "Probability Distributions: Discrete Distributions and the Binomial"</span></a>]]></description>
										<content:encoded><![CDATA[<p>A <strong>random variable</strong> (also called a stochastic variable) is a variable that can take on different values depending on some random phenomenon. In many statistics textbooks it is simply abbreviated as r.v. It is a numerical value.</p>
<p>When probability values are assigned to all the possible numerical values of a random variable x, the result is a <strong>probability distribution</strong>.</p>
<p style="background-color:#f0f0f0;padding:1em;">In even simpler terms: a random variable is a variable whose values are each associated with a probability of being observed. The set of all possible values of a random variable and their associated probabilities is called a <strong>probability distribution</strong>. The <strong>sum of all probabilities is 1</strong>.</p>
<p><span id="more-3469"></span></p>
<div style="border: 1px solid #ccc;padding: 1.2em 1.5em;margin: 1.5em 0;border-radius: 6px">
<h3 style="margin-top: 0">What We&#8217;ll Cover</h3>
<ul>
<li><a href="#discrete-vs-continuous">Discrete and Continuous Variables</a></li>
<li><a href="#bernoulli">The Bernoulli Random Variable</a></li>
<li><a href="#binomial">The Binomial Distribution</a></li>
<li><a href="#mean-variance">Mean, Expected Value, and Variance</a></li>
<li><a href="#probability-density-example">An Example: Computing the Probability Density</a></li>
<li><a href="#other-distributions">Other Discrete Distributions</a></li>
<li><a href="#further-reading">Further Reading</a></li>
</ul>
</div>
<hr />
<h2 id="discrete-vs-continuous">Discrete and Continuous Variables</h2>
<p>There are two main types of random variables: <strong>discrete</strong> and <strong>continuous</strong>.</p>
<ul>
<li>A <strong>discrete r.v.</strong> can take on a discrete (<strong>finite</strong> or countable) <strong>set of real numbers</strong>. That is, we could list all possible values in a table together with their respective probabilities. An example is the outcome of rolling a die: there are 6 possible outcomes, each with a probability of 1/6 (and the sum of all probabilities, of course, equals 1).</li>
<li>A <strong>continuous r.v.</strong> can take on <strong>all values within a real interval</strong>—that is, an infinite number of values within any given interval. The probability that X falls within a given interval is represented by the <strong>area under the probability distribution</strong>. In the case of a continuous random variable, probabilities are represented by means of a <strong>probability density function</strong>. The total area under the curve (i.e. the total probability) equals 1.</li>
</ul>
<p>Depending on the case, we deal with various types of distributions. These are the most common:</p>
<table>
<thead>
<tr>
<th>Discrete distributions</th>
<th>Continuous distributions</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<ul>
<li>Binomial</li>
<li>Poisson</li>
<li>Geometric</li>
</ul>
</td>
<td>
<ul>
<li><a href="https://www.gironi.it/blog/en/the-normal-distribution/">Normal</a></li>
<li>Uniform</li>
<li>Student&#8217;s t</li>
</ul>
</td>
</tr>
</tbody>
</table>
<hr />
<h2 id="bernoulli">Event Yes or Event No? The Bernoulli Random Variable</h2>
<p>Consider a trial in which we are only interested in verifying whether a certain event has occurred or not. The random variable generated by such a trial will take the value 1 if the event has occurred, 0 otherwise. This r.v. is called a <strong>Bernoulli random variable</strong>.</p>
<p>Any dichotomous trial can be represented by a Bernoulli random variable.</p>
<figure class="aligncenter"><a href="https://en.wikipedia.org/wiki/Jacob_Bernoulli" target="_blank" rel="noopener"><img decoding="async" src="https://www.gironi.it/blog/wp-content/uploads/2018/09/Jakob_Bernoulli-268x300.jpg" alt="Jakob Bernoulli - the binomial distribution" /></a><figcaption>This is Mr. Jakob Bernoulli. The details are on Wikipedia for those interested&#8230;</figcaption></figure>
<p>A bit of notation. We denote a Bernoulli r.v. as follows:</p>
<p>\( x \sim Bernoulli(\pi) \\ \)</p>
<p>Its mean is:</p>
<p>\( E(x)=\pi \\ \)</p>
<p>And its variance is:</p>
<p>\( V(x)=\pi(1-\pi) \\ \)</p>
<p><strong>All trials that produce only 2 possible outcomes generate Bernoulli random variables</strong> (for example, tossing a coin). Starting from this simple assumption, it is a very short step to the Binomial Distribution.</p>
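<p>A Bernoulli r.v. is easy to experiment with in R: <code>rbinom()</code> with <code>size = 1</code> draws Bernoulli trials. A quick sketch, with &#960; = 0.3 chosen purely for illustration:</p>
<pre><code class="language-r">set.seed(42)
x <- rbinom(10000, size = 1, prob = 0.3)
mean(x)                  # close to pi = 0.3
mean(x) * (1 - mean(x))  # close to pi * (1 - pi) = 0.21</code></pre>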
<hr />
<h2 id="binomial">The Binomial Distribution</h2>
<p>Rather than dwelling on the conceptual aspects—important as they are, and for which I refer to specialised texts—what I want to do here is show in practice, and as clearly as possible, what we are talking about. Let us start with a definition and then look at the characteristics and a few practical examples.</p>
<p><strong>The Binomial random variable can be understood as a sum of Bernoulli random variables.</strong></p>
<p>What does this mean? Simply that if we repeat the success–failure dichotomy of the Bernoulli random variable <em>n</em> times under the same conditions, the result will be a sequence of <em>n</em> independent sub-trials, each of which can be associated with a Bernoulli random variable.</p>
<p>What are <strong>the characteristics of the binomial distribution</strong>?</p>
<ul>
<li>There is a <strong>fixed number of trials</strong> (<em>n</em>).</li>
<li>Each trial has two possible outcomes: <strong>success</strong> or <strong>failure</strong>.</li>
<li>The <strong>probability of success</strong> (<em>p</em>) is <strong>the same</strong> for every trial.</li>
<li>The outcome of one trial does not affect any other (the trials are <strong>independent</strong>).</li>
</ul>
<p>If even one of these characteristics is absent, the binomial distribution does not apply.</p>
<p style="background-color:#f0f0f0;padding:1em;"><strong>From a practical standpoint, the binomial distribution allows us to calculate the probability of obtaining <em>r</em> successes in <em>n</em> independent trials.</strong></p>
<p>The probability of a certain number of successes, <em>r</em>, depends on <em>r</em> itself, on the number of trials <em>n</em>, and on the individual probability, which we denote by <em>p</em>.</p>
<p>The probability of <em>r</em> successes in <em>n</em> trials is given by:</p>
<p>\( \frac{n!}{r!(n-r)!} \times p^r (1-p)^{n-r} \\ \)</p>
<p>Looks difficult? It really is not (and in practice it turns out to be useful and even fun!).</p>
<div style="border:1px dotted silver; padding:8px;">
NOTE: The part<br />
\(<br />
\frac{n!}{r!(n-r)!} \\<br />
\)<br />
is called the <strong>binomial coefficient</strong>, and is found in textbooks written as:<br />
\(<br />
{n\choose k} \\<br />
\)
</div>
<p>First, let us recall that the symbol ! in mathematics denotes the <em>factorial</em>. As you will certainly remember, the factorial of 3, i.e. 3!, is: 3 &times; 2 &times; 1 = 6; the factorial of 4, i.e. 4!, is: 4 &times; 3 &times; 2 &times; 1 = 24; and so on (it will not escape notice that the factorial grows very, very quickly as the number increases&#8230;).</p>
<blockquote>
<p><strong>The factorial of a natural number is the product of all the positive integers from that number down to 1.</strong></p>
</blockquote>
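<p>Both the factorial and the binomial coefficient are available directly in base R, so there is no need to compute them by hand:</p>
<pre><code class="language-r">factorial(4)  # 4 * 3 * 2 * 1 = 24
choose(7, 3)  # 7! / (3! * 4!) = 35</code></pre>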
<p>With that said, let us first see how to find the mean—the centre of our distribution—and the variance. This way, we will have everything we need for a few practical examples.</p>
<hr />
<h2 id="mean-variance">Mean, Expected Value, and Variance of a Binomial Distribution</h2>
<p>Let <em>x</em> be our binomial random variable. We can write our problem as follows:</p>
<p>\( x \sim Binomial(n, p) \\ \)</p>
<p>The mean is:</p>
<p>\( E(x) = n \times p \\ \)</p>
<p>The variance is:</p>
<p>\( Var(x) = n \times p \times (1 - p) \\ \)</p>
<p>At this point, an example is in order. Let us calculate the variance of a distribution with size <em>n</em> = 10 and individual probability <em>p</em> = 0.5 (i.e. 50%). For instance, this could represent ten coin tosses.</p>
<p>\( x \sim Binomial(10, 0.5) \\ \)</p>
<p>So the variance will be:</p>
<p>\( Var(x) = 10 \times 0.5 \times (1 - 0.5) = 2.5 \\ \)</p>
<p>And the mean, naturally, will be:</p>
<p>\( E(x) = 10 \times 0.5 = 5 \\ \)</p>
<p><em>Side note: it is intuitive that if p = 1 - p = 0.5, the probability distribution will be symmetric. If p &lt; 0.5, it will be right-skewed, and if p &gt; 0.5, it will be left-skewed.</em></p>
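<p>Again, a quick simulation confirms the closed-form results for our coin-toss example (the number of replications is an arbitrary choice):</p>
<pre><code class="language-r">set.seed(123)
draws <- rbinom(100000, size = 10, prob = 0.5)

mean(draws)  # should be close to n * p = 5
var(draws)   # should be close to n * p * (1 - p) = 2.5</code></pre>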
<hr />
<h2 id="probability-density-example">An Example: Computing the Probability Density</h2>
<p>Let us now introduce the concept of <strong>probability density</strong> (for a discrete variable this is, strictly speaking, the <em>probability mass function</em>, although R keeps the &#8220;density&#8221; terminology), which is what we will use most often in real-world applications. This is when, for example, we want to know the probability that exactly two out of ten coin tosses come up heads.</p>
<p>To explain this more clearly, let us take a problem from a textbook:</p>
<p><em>If I cross a black mouse with a white one, there is a 3/4 probability that the offspring will be black and 1/4 that it will be white. What is the probability that out of 7 offspring, exactly 3 are white?</em></p>
<p>Let us write down the data straight away:</p>
<ul>
<li><em>n</em> = 7</li>
<li><em>r</em> = 3</li>
<li><em>p</em> = 1/4, i.e. 0.25</li>
</ul>
<p>And now? Shall we do the calculations by hand? Why not:</p>
<p>\( \frac{n!}{r!(n-r)!} \times p^r (1-p)^{n-r} \\ \)</p>
<p>therefore:</p>
<p>\( \frac{7!}{3!4!} \times 0.25^{3} \times 0.75^{4} = 35 \times 0.0049438 = 0.173 \\ \)</p>
<p>That is, 17.3%.</p>
<p>Doing calculations by hand is fun, but we are lazy and have R at our disposal. In R, the probability density is computed by a simple function:</p>
<p><strong>dbinom()</strong></p>
<p>The problem is therefore solved with the simple instruction:</p>
<pre><code class="language-r">dbinom(3, 7, 0.25)
# [1] 0.1730347</code></pre>
<p>which gives us 0.1730347, so the answer is once again 17.3% (after rounding).</p>
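<p>Since <strong>dbinom()</strong> is vectorised, we can also compute the whole probability mass function in one call and check that the probabilities over all possible outcomes sum to 1:</p>
<pre><code class="language-r">r <- 0:7
probs <- dbinom(r, size = 7, prob = 0.25)
round(probs, 4)

# The probabilities over all possible outcomes must sum to 1
sum(probs)</code></pre>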
<hr />
<h2 id="other-distributions">Other Discrete Distributions</h2>
<p>There are equally interesting questions that call upon other discrete distributions:</p>
<ul>
<li>How many trials should we expect before obtaining a success? This is where the <a href="https://www.gironi.it/blog/la-distribuzione-geometrica/">geometric distribution</a> enters the scene.</li>
<li>How many times can we expect an event to occur (or not) in a given time interval? That calls for the <a href="https://www.gironi.it/blog/la-distribuzione-di-poisson/">Poisson distribution</a>.</li>
<li>Are we sampling from a population without replacement? Then we use the <a href="https://www.gironi.it/blog/en/the-hypergeometric-distribution/">hypergeometric distribution</a>.</li>
</ul>
<p>As we can see, this is a vast and fascinating topic, which we will explore (lightly) across several articles. In the next one, we will look at another important distribution: the <a href="https://www.gironi.it/blog/en/the-beta-distribution-explained-simply/">beta distribution</a>, which plays a central role in Bayesian statistics.</p>
<hr />
<h3>You might also like</h3>
<ul>
<li><a href="https://www.gironi.it/blog/en/first-steps-into-the-world-of-probability/">First Steps into the World of Probability</a></li>
<li><a href="https://www.gironi.it/blog/en/the-hypergeometric-distribution/">The Hypergeometric Distribution</a></li>
<li><a href="https://www.gironi.it/blog/en/the-negative-binomial-distribution/">The Negative Binomial Distribution</a></li>
</ul>
<hr />
<h3 id="further-reading">Further Reading</h3>
<p>For an accessible yet thorough introduction to probability distributions and the reasoning behind them, <a href="https://www.amazon.it/dp/8867319396?tag=consulenzeinf-21" rel="nofollow sponsored noopener" target="_blank"><em>Finalmente ho capito la statistica</em></a> by Maurizio De Pra covers discrete distributions—including the binomial—in a clear and approachable style, ideal for building solid intuition before moving on to more advanced topics.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.gironi.it/blog/en/probability-distributions-discrete-distributions-and-the-binomial/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
