Basic Statistics: A Learning Path for Describing Your Data

Before we even ask whether an effect is real, we need to know how to look at data for what it is: how many visits, how much time on page, how many conversions per channel. It is a seemingly trivial gesture, and yet this is where the difference is decided between those who make informed decisions and those who guess. Basic statistics is exactly this preliminary craft: bringing order to a heap of numbers, describing them without betraying them, and starting to reason about the uncertainty that every measurement carries with it. It is not the spectacular part — there are no tests yet, nor experiments — but it is the part without which everything else rests on nothing.

Learning it well, however, does not mean collecting scattered definitions. It means walking a road that starts from understanding how the data in front of us are made, passes through the correct ways to summarise them in a few sensible numbers, brushes against the first rudiments of probability — the language with which we speak of what is uncertain — and arrives at the bridge that leads towards statistics proper: the move from the sample we observe to an estimate of what we cannot see.

This page is that road, in order. We do not re-explain the theory here: each stage is an article on the blog, and the order in which we have arranged them is the order in which it makes sense to read them. Anyone starting from scratch can follow them in sequence, from first to last; anyone with some grounding can jump to the group they need. The three sections that follow (describing the data, the first steps into probability, and finally the move from the sample to the estimate) are the three movements of a single path, conceived for those who want to build the foundations before facing statistical tests and A/B experiments. We start where every analysis starts: from the data themselves.

Describing the data

Before any calculation comes the most neglected question: what type of data do we have in our hands? A colour, a grade, a temperature and a revenue figure are not treated the same way, and confusing them leads to averages that mean nothing.
Summarising data without first having understood it is the fastest way to obtain numbers that are precise and wrong.

The scales of measurement are the non-negotiable starting point. They explain the difference between nominal, ordinal, interval and ratio data: that is, between what we can only label, what we can order, what we can meaningfully subtract and average, and what we can also compare as ratios. It is the distinction that decides which tools we will be able to use in all the later stages: getting it wrong at the start means dragging the error all the way to the end.
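As a rough illustration of how the scale decides the summary, here is a minimal Python sketch. The three small datasets are invented for the example, and the pairing of each scale with one summary is a simplification, not a complete rule.

```python
from statistics import mean, median, mode

# Hypothetical data, one list per scale of measurement.
channels = ["email", "search", "social", "search"]  # nominal: labels only
ratings = [1, 2, 2, 4, 5]                           # ordinal: 1 = worst, 5 = best
revenue = [12.0, 18.5, 20.0, 49.5]                  # ratio: a true zero exists

# Nominal data can only be counted, so the mode is the usable summary.
most_common_channel = mode(channels)

# Ordinal data can be ranked, so the median is meaningful.
typical_rating = median(ratings)

# Ratio data supports sums and averages.
average_revenue = mean(revenue)

print(most_common_channel, typical_rating, average_revenue)
```

Replacing the channel labels with numeric codes would let us compute a mean, but the result would describe the coding scheme and not the visitors.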

The measures of position answer the most natural question we put to a set of numbers: where is the centre? Mean, median and mode look like synonyms and are nothing of the sort; each tells a different story, and choosing the right one — the median when there are extreme values, for instance — is what separates an honest summary from a misleading one.
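To see how the three measures can disagree, here is a small Python sketch. The time-on-page values are made up, and the last one is deliberately extreme.

```python
from statistics import mean, median, mode

# Hypothetical time on page, in seconds; the last visit is an extreme value.
seconds_on_page = [30, 35, 35, 40, 45, 50, 600]

centre_mean = mean(seconds_on_page)      # pulled upwards by the 600-second visit
centre_median = median(seconds_on_page)  # middle value of the sorted list
centre_mode = mode(seconds_on_page)      # most frequent value

print(f"mean={centre_mean:.1f} median={centre_median} mode={centre_mode}")
```

A single long visit is enough to push the mean well above every ordinary value, while the median and the mode stay where most of the data are.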

The measures of dispersion complete the portrait by shifting attention from the centre to the width. Two sets of data can have the same mean and behave in opposite ways: variance and standard deviation measure how far the values stray from the centre, and it is precisely this variability — not the mean — that is the heart of everything that comes afterwards, from tests to confidence intervals.
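The point about equal means and different behaviour can be checked in a few lines of Python. Both series below are invented. The sketch uses the population standard deviation (`pstdev`); for a sample we would normally use `stdev`, which divides by n - 1.

```python
from statistics import mean, pstdev

# Two hypothetical series of daily conversions with the same centre.
steady = [48, 49, 50, 51, 52]
erratic = [10, 30, 50, 70, 90]

mean_steady, mean_erratic = mean(steady), mean(erratic)
spread_steady, spread_erratic = pstdev(steady), pstdev(erratic)

print(f"means:  {mean_steady} vs {mean_erratic}")
print(f"spread: {spread_steady:.2f} vs {spread_erratic:.2f}")
```

The means are identical, yet the second series strays from its centre about twenty times as far as the first.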

The Gini index closes the section with a more specific but valuable tool: it measures how concentrated or unequally distributed a quantity is. Born to study income inequality, it comes in handy every time we want to know whether a few pages, a few products or a few customers weigh disproportionately on the total. It is a first taste of how a single number can capture the shape of an entire distribution.
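One common way to compute the index from raw values is sketched below in Python. The function name `gini` and the page-view figures are our own choices for illustration, and the formula assumes non-negative values with a positive total.

```python
def gini(values):
    """Gini index of non-negative values: 0 means perfectly equal shares."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total <= 0:
        raise ValueError("need at least one value and a positive total")
    # Weight each value by its rank in the sorted list (1 = smallest).
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical page views for four pages.
evenly_spread = gini([10, 10, 10, 10])
one_page_takes_all = gini([0, 0, 0, 100])

print(evenly_spread, one_page_takes_all)
```

With this formula the maximum for n values is (n - 1) / n, so four pages can reach at most 0.75; the ceiling approaches 1 as the number of items grows.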

The first steps into probability

Describing what has already happened is half the work; the other half is reasoning about what might happen. Probability is the language of uncertainty, and without a sketch of it the statistical tests remain formulas applied blindly.
Every statistical statement about a sample is, beneath the surface, a statement about probability: learning to handle it is what makes conclusions defensible.

Probability, permutations and combinations are the first brick. Here we learn to count in an orderly way (in how many ways the elements of a set can be arranged, and when order matters or does not) because, when all outcomes are equally likely, calculating a probability means counting the favourable cases and dividing by the possible ones. It looks like a combinatorial digression, and it is instead the basis of all probabilistic reasoning.
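The counting can be tried directly with Python's `math` module (version 3.8 or later). The scenario of five banners is hypothetical, and the last line assumes that every pair is equally likely to be drawn.

```python
from fractions import Fraction
from math import comb, factorial, perm

BANNERS = 5  # hypothetical number of banners to choose from

# All the ways to line up the five banners.
arrangements = factorial(BANNERS)

# Picking 2 of the 5 when order matters (first slot, second slot).
ordered_picks = perm(BANNERS, 2)

# Picking 2 of the 5 when order does not matter.
unordered_picks = comb(BANNERS, 2)

# Favourable cases over possible cases: one specific pair out of all pairs.
p_specific_pair = Fraction(1, unordered_picks)

print(arrangements, ordered_picks, unordered_picks, p_specific_pair)
```

Each unordered pair corresponds to two ordered picks, which is why the permutation count is twice the combination count here.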

Contingency tables and conditional probability bring probability onto the concrete ground of cross-tabulated data. When we relate two characteristics — the acquisition channel and conversion, for instance — the interesting question is not how probable an event is in absolute terms, but how probable it is given another. It is the concept behind much of marketing analysis, and grasping it here avoids crude errors further on.
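A minimal Python sketch of the difference between an absolute and a conditional probability follows. The channel names and counts are invented for the example.

```python
# Hypothetical contingency table: acquisition channel by conversion outcome.
table = {
    "email":  {"converted": 30, "not_converted": 170},
    "social": {"converted": 20, "not_converted": 780},
}

total_visits = sum(sum(row.values()) for row in table.values())
total_converted = sum(row["converted"] for row in table.values())

# Absolute probability: conversions over all visits.
p_converted = total_converted / total_visits


def p_converted_given(channel):
    """P(converted | channel): restrict the count to one row of the table."""
    row = table[channel]
    return row["converted"] / sum(row.values())


p_given_email = p_converted_given("email")
p_given_social = p_converted_given("social")

print(p_converted, p_given_email, p_given_social)
```

The overall rate hides the fact that, in this made-up table, an email visitor is six times as likely to convert as a social one.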

From the sample to the estimate

We reach the bridge. So far we have described the data we have; now we ask what we can say, starting from that small sample, about what we have not measured: the whole population, the true effect.
This section gathers the stages that turn descriptive statistics into something that looks beyond the sample, and that opens the door to inference proper.

The central limit theorem is the result that makes the whole leap possible. It explains why, by averaging many small random variations, we obtain that bell curve which keeps reappearing everywhere, and it is the reason we can say something sensible about an enormous population by looking at a modest sample. It is, at the same time, the last stage of the foundations and the first of inference.
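A quick simulation makes the idea tangible. This Python sketch draws from an exponential distribution, which is strongly skewed, and looks at how the sample means behave. The sample size, the number of repetitions and the seed are arbitrary choices of ours.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(42)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 50    # observations per sample
N_SAMPLES = 2000    # how many samples we draw

# Exponential with rate 1: mean 1, standard deviation 1, far from bell-shaped.
sample_means = [
    mean([rng.expovariate(1.0) for _ in range(SAMPLE_SIZE)])
    for _ in range(N_SAMPLES)
]

centre = mean(sample_means)
spread = stdev(sample_means)
expected_spread = 1 / sqrt(SAMPLE_SIZE)  # population sd divided by sqrt(n)

print(f"centre of the sample means: {centre:.3f}")
print(f"spread of the sample means: {spread:.3f} (theory: {expected_spread:.3f})")
```

The sample means cluster around the population mean, and their spread shrinks with the square root of the sample size; a histogram of `sample_means` would show the familiar bell shape even though the raw data are skewed.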

Confidence intervals are the practical translation of that idea. Instead of a blunt estimate — “the mean is 3.2” — they give us a plausible range for the true value, honestly declaring how much uncertainty we carry with us. Understanding how they are built, and above all what they do not mean, is what separates someone who reads a number from someone who knows how to interpret it.
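As a sketch of how such a range can be built, the Python function below uses the normal approximation, which is reasonable for large samples; with small samples a t-based interval would be more appropriate. The function name and the simulated ratings are assumptions made for the example.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(values, confidence=0.95):
    """Normal-approximation confidence interval for the mean of `values`."""
    n = len(values)
    centre = mean(values)
    standard_error = stdev(values) / sqrt(n)
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return centre - z * standard_error, centre + z * standard_error


# Hypothetical sample: 200 simulated ratings centred near 3.2.
rng = random.Random(7)
ratings = [rng.gauss(3.2, 0.8) for _ in range(200)]

low, high = mean_confidence_interval(ratings)
low_99, high_99 = mean_confidence_interval(ratings, confidence=0.99)

print(f"95% interval: {low:.2f} to {high:.2f}")
print(f"99% interval: {low_99:.2f} to {high_99:.2f}")
```

Asking for more confidence widens the interval: the price of being wrong less often is a less precise statement.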

Sampling and sample size close the circle by answering the most practical question of all: how much data is really needed? Collecting a representative sample, and knowing in advance how large it must be to say something solid, is the skill that distinguishes a designed analysis from an improvised one — and it is exactly the point from which the next path, the one on inference, starts.
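One standard planning formula, for estimating a proportion within a chosen margin of error, can be sketched in Python as follows. It assumes simple random sampling and the normal approximation; the function name and the default values are ours.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_proportion(margin, confidence=0.95, expected_p=0.5):
    """Observations needed to estimate a proportion within +/- `margin`.

    expected_p = 0.5 is the most cautious guess: it maximises p * (1 - p).
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return ceil(z ** 2 * expected_p * (1 - expected_p) / margin ** 2)


n_five_points = sample_size_for_proportion(0.05)
n_two_and_a_half_points = sample_size_for_proportion(0.025)

print(n_five_points, n_two_and_a_half_points)
```

Halving the margin of error roughly quadruples the required sample, which is why precision becomes expensive so quickly.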

Where to start

If this is the first contact with the subject, there is only one entry point: the scales of measurement and, right after them, the measures of position. They are the two stages from which everything else takes on meaning: until we know what type of data we have and where its centre falls, every later calculation risks resting on nothing. If we tackle the others out of sequence, sooner or later we always come back here.

This is one of the thematic paths we are building to navigate the blog’s articles: basic statistics is the starting point, the toolbox from which every other path, sooner or later, ends up drawing. The natural next step, once the foundations are solid, is the path devoted to inferential statistics: it is there that the leap from the sample to the population becomes the real craft of tests and experiments.