Inferential Statistics: a Complete Learning Path, from Tests to A/B Testing

Every time we watch a number move up or down — a landing page’s conversions, time on page, an email’s open rate — we ask ourselves, more or less consciously, a single question: is this effect real, or is it just chance striking a pose for the camera? Inferential statistics is born precisely here. It is the art of moving from the little we observe — a sample, a few weeks of data, two variants of a page — to a defensible statement about what we cannot see: the underlying reality, the whole population, the true effect. Without this step, SEO and marketing remain, as we have written elsewhere, a bag of tricks: we look at a number and decide by gut feeling.

Learning inferential statistics, however, does not mean piling up scattered formulas. It means walking a road that starts from understanding why we can infer anything at all from samples, passes through the tests that put our hypotheses to the proof, learns how to cope when the data refuse to follow the textbook rules, and arrives at controlled experiments — A/B tests — which remain the cleanest way to establish whether a move of ours really works.

This page is that road, in order. We do not re-explain the theory here: each stage is an article on the blog, and the order in which we have arranged them is the order in which it makes sense to read them. Anyone starting from scratch can follow them in sequence, from first to last; anyone with some grounding can jump to the group they need. The four sections that follow — the foundations, the classic tests, the cases where the data won’t cooperate, and finally measuring and experimenting — are the four movements of a single path. We start with the foundations.

The foundations

Before the tests, we need to understand what they rest on. The three stages in this section answer the basic questions: why a sample can tell us something about the population, how a hypothesis is formulated and put to the proof, and how the uncertainty of an estimate is quantified.
These are the bricks that hold up everything else: skipping them means using the tests without knowing what they actually promise.

The central limit theorem is the non-negotiable starting point. It explains why, by summing or averaging many small random variations, we obtain that bell curve which keeps reappearing everywhere in statistics. It is the reason we can say something sensible about an enormous population by looking at a modest sample, and it is the theoretical basis of nearly every test that comes afterwards.
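A quick simulation makes the idea tangible. The sketch below is only an illustration under assumptions of our own: the exponential distribution (strongly skewed, with mean 1 and standard deviation 1), the sample size of 50 and the function name `sample_means` are all hypothetical choices, not taken from the article.

```python
import random
import statistics
from math import sqrt

def sample_means(n, draws, seed=0):
    """Return `draws` sample means, each computed from `n` exponential values."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(draws)
    ]

n = 50  # assumed sample size, chosen only for the illustration
means = sample_means(n=n, draws=5000)

center = statistics.fmean(means)   # should sit close to the true mean, 1.0
spread = statistics.stdev(means)   # should sit close to 1 / sqrt(n)
predicted_spread = 1.0 / sqrt(n)

# Share of sample means within 1.96 predicted standard errors of the true mean;
# for a bell-shaped distribution this share is about 95%.
within = sum(abs(m - 1.0) <= 1.96 * predicted_spread for m in means) / len(means)

print(f"mean of sample means: {center:.3f}")
print(f"spread: {spread:.3f} (predicted {predicted_spread:.3f})")
print(f"share within 1.96 standard errors: {within:.3f}")
```

Although the individual values are far from bell-shaped, their averages behave almost as the normal curve predicts, which is what lets us reason about a population from a modest sample.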

Hypothesis testing is the logical procedure with which we turn the question “is this effect real?” into something decidable. Here we learn the concepts that will return in every later article: the null hypothesis, the p-value, the significance threshold, and what it means (and does not mean) to “reject the null hypothesis”. It is the grammar of the whole path.
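To see the grammar at work on a toy case, the sketch below computes an exact two-sided p-value for a coin that lands heads 60 times out of 100, under the null hypothesis that the coin is fair. The scenario, the function name `binomial_two_sided_p` and the convention of summing every outcome at most as likely as the observed one are our own assumptions for the example.

```python
from math import comb

def binomial_two_sided_p(successes, trials, p0=0.5):
    """Exact two-sided p-value under the null hypothesis P(success) = p0.

    Sums the probability of every outcome that is at most as likely
    as the one observed.
    """
    pmf = [
        comb(trials, k) * p0**k * (1 - p0) ** (trials - k)
        for k in range(trials + 1)
    ]
    observed = pmf[successes]
    # Small tolerance so that symmetric outcomes are not lost to rounding.
    return min(1.0, sum(p for p in pmf if p <= observed * (1 + 1e-9)))

p_value = binomial_two_sided_p(60, 100)
print(f"p-value: {p_value:.4f}")
print("reject the null at 5%" if p_value < 0.05 else "cannot reject the null at 5%")
```

The p-value lands a little above 0.05, so at the usual 5% threshold we cannot reject the null hypothesis. That is not the same as having shown that the coin is fair.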

Confidence intervals complete the picture by shifting attention from the yes/no of the test to the measure of uncertainty. Instead of a blunt answer, they give us a plausible range for the true value. Understanding how they are built — and above all what they do not mean — is what separates someone who reads a number from someone who knows how to interpret it.
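As a minimal sketch, here is a 95% interval for a conversion rate built with the normal approximation (often called the Wald interval). The figures, 120 conversions out of 2,000 visits, are invented, and the function name is hypothetical. Other constructions exist and behave better with small samples or rates close to 0 or 1.

```python
from math import sqrt
from statistics import NormalDist

def proportion_confidence_interval(successes, trials, level=0.95):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p_hat = successes / trials
    # Two-sided critical value, about 1.96 for a 95% level.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / trials)
    return p_hat - margin, p_hat + margin

# Invented data: 120 conversions out of 2,000 visits.
low, high = proportion_confidence_interval(120, 2000)
print(f"observed rate: {120 / 2000:.3f}")
print(f"95% interval: {low:.4f} to {high:.4f}")
```

The observed rate is 6%, but the interval runs from roughly 5% to 7%: the single number alone hides that uncertainty.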

The classic tests

With the foundations in place, we step into the toolbox proper. This section gathers the tests we meet in the great majority of real cases: comparing two means, working out whether two characteristics are associated, comparing several groups at once.
Each test has its field of use, and the difficulty is not the calculation — the software handles that — but choosing the right tool for the right question.

The t-distribution and hypothesis testing is the first concrete step. When samples are small and we do not know the population’s true variability, the bell curve is no longer enough: we need the t-distribution, a little more cautious because it accounts for how little we know. It is the bridge between the theory of the foundations and the applied tests.

The two-sample t-test is probably the most used test of all: it establishes whether two groups have genuinely different means. Here we learn the crucial distinction between dependent and independent samples (the same page measured before and after, versus two different pages), because picking the wrong version distorts the result.
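The sketch below shows how much the choice matters by computing both t statistics on the same invented numbers: six pages measured before and after a change. The data and function names are hypothetical. Only the statistics are computed here, since turning them into p-values requires the t-distribution, which Python's standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance

def paired_t(before, after):
    """t statistic for dependent samples: works on the per-item differences."""
    diffs = [a - b for a, b in zip(after, before)]
    return mean(diffs) / sqrt(variance(diffs) / len(diffs))

def welch_t(group_a, group_b):
    """t statistic for independent samples (Welch's version, unequal variances)."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_b) - mean(group_a)) / se

# Invented data: the same six pages, before and after a change.
before = [12, 15, 11, 14, 13, 16]
after = [14, 17, 12, 16, 15, 19]

t_paired = paired_t(before, after)
t_independent = welch_t(before, after)
print(f"paired t: {t_paired:.2f}")            # about 7.75
print(f"independent t: {t_independent:.2f}")  # about 1.60
```

Every page improves by a similar amount, so the paired statistic is large. Treating the same numbers as two unrelated groups buries that consistent gain under the page-to-page variability.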

The chi-square test changes the type of data: no longer means, but counts and categories. It serves to work out whether two characteristics are associated — the acquisition channel and conversion, for instance — or whether an observed distribution departs from the expected one. It is the tool of choice when our data are tables of frequencies.
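For a two-by-two table the whole calculation fits in a few lines. The sketch below uses invented counts for two acquisition channels and applies no continuity correction. The p-value shortcut through `math.erfc` is valid only for one degree of freedom, which is exactly the 2x2 case, so the function name `chi_square_2x2` is deliberately narrow.

```python
from math import erfc, sqrt

def chi_square_2x2(table):
    """Chi-square test of independence for a 2x2 table (no continuity correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count we would expect if channel and conversion were unrelated.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value

# Invented counts: rows are channels, columns are (converted, not converted).
table = [
    [50, 950],  # organic
    [80, 920],  # paid
]
statistic, p_value = chi_square_2x2(table)
print(f"chi-square: {statistic:.2f}, p-value: {p_value:.4f}")
```

With these numbers the statistic is about 7.4 and the p-value is below 0.01, so the association between channel and conversion would be hard to attribute to chance alone.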

Analysis of variance (ANOVA) extends the comparison to more than two groups in one go. When the variants to test are three, four or more, repeating many pairwise t-tests is a mistake, because every extra comparison is another chance of a false positive. ANOVA is the correct answer, and grasping its logic is the step that carries us from the elementary tests to the more structured ones.
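A little arithmetic shows why repeated pairwise tests are risky. The sketch below estimates the probability of at least one false positive when every pair of groups is tested at the 5% level. It assumes, as a simplification of our own, that the comparisons are independent, which pairwise tests on shared groups are not, so the figures are only indicative. The function name is hypothetical.

```python
from math import comb

def familywise_error_rate(groups, alpha=0.05):
    """Approximate chance of at least one false positive across all pairwise tests.

    Assumes independent comparisons, which is a simplification.
    """
    comparisons = comb(groups, 2)  # number of distinct pairs of groups
    return comparisons, 1 - (1 - alpha) ** comparisons

for groups in (2, 3, 4, 5):
    comparisons, rate = familywise_error_rate(groups)
    print(f"{groups} groups: {comparisons} comparisons, error rate about {rate:.3f}")
```

With four variants there are six comparisons, and the chance of at least one spurious "significant" result rises from 5% to roughly 26%.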

When the data won’t cooperate

Many of the classic tests rest on a comfortable but fragile assumption: that the data follow, at least roughly, the bell curve. In operational reality they often do not: skewed distributions, anomalous values and ordinal scales are common.
This section tackles that territory: how to recognise when the conditions of the parametric tests do not hold, and which tools to use in their place without giving up rigour.

Statistical parametric and non-parametric tests is the conceptual map of this fork. It explains what a parametric test really assumes, how to tell whether those assumptions are met, and why in many real cases it is wiser to rely on methods that ask nothing about the shape of the distribution. It is the step that teaches us not to apply a test with our eyes closed.

The Wilcoxon test is the concrete alternative to the t-test when the data are not normally distributed. Instead of the raw values it works on their ranks, and this makes it robust to outliers and skewed distributions. Knowing when to prefer it to the t-test is a skill that, in everyday practice, makes the difference between a solid conclusion and a fragile one.
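The sketch below illustrates why ranks are robust. It computes the rank-sum statistic used by the Wilcoxon rank-sum test for two independent samples, which is our own choice for the example (the signed-rank variant for paired data follows the same idea). The session durations are invented and the function name is hypothetical.

```python
from statistics import mean

def rank_sum(sample, other):
    """Sum of the ranks of `sample` within the pooled data (ties get average ranks)."""
    pooled = sorted(sample + other)

    def average_rank(value):
        first = pooled.index(value) + 1   # rank of the first occurrence
        count = pooled.count(value)       # how many tied values share it
        return first + (count - 1) / 2

    return sum(average_rank(v) for v in sample)

# Invented durations in minutes; group_a contains one extreme outlier.
group_b = [2.0, 2.5, 1.9, 3.0, 2.2]
group_a_outlier = [3.1, 2.8, 3.5, 4.0, 250.0]
group_a_tame = [3.1, 2.8, 3.5, 4.0, 4.5]

print(f"means: {mean(group_a_outlier):.2f} vs {mean(group_a_tame):.2f}")
print(f"rank sums: {rank_sum(group_a_outlier, group_b)} vs {rank_sum(group_a_tame, group_b)}")
```

The outlier moves the mean from about 3.6 to more than 50, yet the rank sum is 39 in both cases: the largest value is simply "the largest", however extreme it is.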

Measuring and experimenting

We reach the operational heart: it is not enough to know whether an effect exists, we need to know how large it is and how to design an experiment that measures it honestly.
This section gathers the stages that bring inferential statistics into the daily work of those who do SEO and marketing — A/B tests — and includes two ready-to-use calculators and the subtlest trap one falls into after having learned all the rest.

Effect size and power analysis moves the discussion from “is it significant?” to “how much does it matter, and how much data do I need to notice it?”. The effect size measures the real magnitude of an effect, while the power analysis tells us how large a sample is needed to catch it. It is the step that distinguishes a designed experiment from an improvised one.
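As a rough sketch of a power analysis, the function below applies a common normal-approximation formula for comparing two proportions with equal group sizes. The baseline of 5%, the target of 6%, the 5% significance level and the 80% power are assumptions chosen for the example, and the function name is hypothetical. Dedicated tools may use slightly different formulas and return slightly different numbers.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(p_control, p_variant, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant to detect p_control vs p_variant."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    variance_sum = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    effect = p_variant - p_control
    return ceil((z_alpha + z_power) ** 2 * variance_sum / effect**2)

# Assumed scenario: baseline conversion of 5%, hoping to detect a rise to 6%.
needed = sample_size_per_group(0.05, 0.06)
print(f"visitors per variant: {needed}")

# Halving the effect we want to detect roughly quadruples the sample.
print(f"for 5% vs 5.5%: {sample_size_per_group(0.05, 0.055)}")
```

Under these assumptions a one-point lift on a 5% baseline needs on the order of eight thousand visitors per variant, which is why small effects demand patience.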

The guide to statistical tests for A/B analysis draws together the whole path by applying it to the case that interests digital workers most. It shows how to choose the right test depending on the type of metric — conversions, durations, average values — and connects the tools seen so far to a concrete marketing decision.

A/B testing is the discipline of controlled experiments: two variants, random assignment, rigorous comparison. Here we see how the whole path — sampling, hypotheses, tests, effect size — converges into the cleanest method for establishing whether a change really works instead of relying on opinion.

So as not to redo the calculations by hand every time, two practical tools accompany this phase. The A/B test sample size calculator answers the question to ask before launching a test: how many visitors are needed to catch a difference of a certain magnitude with the desired certainty. The A/B test significance calculator comes in afterwards: given the numbers collected, it says whether the observed difference between the two variants is statistically solid or compatible with chance.
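To give an idea of what a significance check involves, here is a minimal sketch of a two-proportion z-test with a pooled standard error. It is not the code behind the calculators mentioned above, whose internals we do not know. The visitor and conversion counts are invented and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    # Under the null hypothesis both variants share one underlying rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Invented results: control converts 200 of 4,000, variant 250 of 4,000.
z, p_value = two_proportion_z_test(200, 4000, 250, 4000)
print(f"z: {z:.2f}, p-value: {p_value:.4f}")
```

With these numbers z is about 2.43 and the p-value about 0.015, so at the 5% threshold the difference would count as statistically significant, provided the sample size was fixed in advance.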

There is, finally, one last trap, and it arrives precisely when we believe we have everything under control. The peeking problem shows how peeking at the results of an A/B test before the end — stopping the moment the data prove us right — silently inflates false positives, even when each single test is done by the book. It is the stage that teaches us to distrust our own haste: the moment we decide to look counts as much as the result we see.
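The effect is easy to reproduce with a simulation of A/A tests, where both variants share the same true conversion rate and every "significant" result is therefore a false positive. The sketch below is an illustration under assumptions of our own: a 20% conversion rate, ten looks of 100 visitors per variant each, a pooled two-proportion z-test at every look, and hypothetical function names.

```python
import random
from math import sqrt
from statistics import NormalDist

def is_significant(conv_a, conv_b, n, alpha=0.05):
    """Two-sided pooled z-test for two variants with n visitors each."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return False
    z = (conv_b - conv_a) / n / se
    return 2 * (1 - NormalDist().cdf(abs(z))) < alpha

def simulate(runs=600, looks=10, batch=100, rate=0.2, seed=1):
    """Return (false-positive rate with peeking, false-positive rate at the end)."""
    rng = random.Random(seed)
    flagged_when_peeking = flagged_at_end = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        flagged = False
        for look in range(1, looks + 1):
            # Both variants have the same true rate: there is no real effect.
            conv_a += sum(rng.random() < rate for _ in range(batch))
            conv_b += sum(rng.random() < rate for _ in range(batch))
            result = is_significant(conv_a, conv_b, look * batch)
            flagged = flagged or result  # we would have stopped at this look
        flagged_when_peeking += flagged
        flagged_at_end += result         # result of the final look only
    return flagged_when_peeking / runs, flagged_at_end / runs

peeking_rate, final_rate = simulate()
print(f"false positives when stopping at the first significant look: {peeking_rate:.3f}")
print(f"false positives when testing once at the end: {final_rate:.3f}")
```

Testing once at the end keeps the false-positive rate near the nominal 5%, while stopping at the first significant look pushes it several times higher, even though each individual test is computed correctly.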

Where to start

If this is the first contact with the subject, there is only one entry point: the central limit theorem and, right after it, hypothesis testing. They are the two stages from which everything else takes on meaning; if we tackle the others out of sequence, sooner or later we always come back here.

This is the first of the thematic paths we are building to navigate the blog’s articles. Others will arrive, devoted to related themes — regression, time series, the pitfalls of marketing data — conceived like this one: not new explanations, but maps that line up what is already there. Inferential statistics, though, comes before them all: it is the toolbox from which every other path, sooner or later, ends up drawing.