
Correlation: Pearson, Spearman and Kendall (and Why It Isn’t Causation)

Anyone who looks at a website’s data does this constantly, often without noticing: they spot two things that seem to move together. Pages that sit higher in the SERP get more clicks; pages where users linger longer convert more; longer articles appear to rank better. These are valuable hunches, but they stay vague until we answer a precise question: how closely do these pairs of numbers move together, and in what sense? We need an index that turns the impression “they go hand in hand” into a comparable measure. That index is correlation, and it is one of the most widely used (and most misunderstood) tools in all of applied statistics.

Let’s say right away what correlation is not, because this is where the trouble starts. Correlation measures whether and how much two variables are associated; it does not say that one causes the other, and it does not build a model to predict one from the other. That second step — prediction — is the job of regression, which we’ll cover separately. Here we stay on the previous rung: understanding, with a single number, whether two metrics travel together.

From Covariance to Correlation

The starting idea is simple. If two variables move together, then when one sits above its own mean the other tends to sit above its own mean too; when one drops below, the other follows. We can measure this tendency by multiplying, for each observation, the deviation of x from its mean by the deviation of y from its mean, and averaging the products. This is the covariance:

\( \text{cov}(x, y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \)

where x̄ and ȳ are the means of the two variables and n the number of observations. When the deviations share the same sign (both above or both below the mean) the product is positive; when they have opposite signs it is negative. A positive covariance thus signals that the two variables tend to grow together, a negative one that when one rises the other falls.
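As a minimal sketch, the formula can be checked by hand in R. The vectors x and y below are made-up illustrative values, not data from this article. One caveat worth knowing: R's built-in cov() divides by n − 1 rather than n, so it returns a slightly larger number than the formula above.

```r
# Illustrative values only (assumed for this sketch)
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 5, 4)

# Deviations of each observation from its own mean
dx <- x - mean(x)
dy <- y - mean(y)

# Covariance as defined above: average of the products (divides by n)
cov_n <- sum(dx * dy) / length(x)
cov_n
# [1] 3.2

# R's cov() uses the sample version, dividing by n - 1
cov(x, y)
# [1] 4
```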

Covariance, however, has a flaw that makes it useless as a yardstick: it depends on the units of measurement. The covariance between sessions and seconds-on-page is one number, the one between sessions and conversion rate another, and the two can’t be compared because they speak different languages. To get a clean measure we divide it by the two standard deviations, stripping it of units and forcing it into a fixed range. The result is the Pearson correlation coefficient:

\( r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \; \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \)

The numerator is nothing but the covariance (up to the factor n); the denominator is the product of the two standard deviations (up to the same factor n, which therefore cancels), and serves precisely to normalise. The result is a pure number between −1 and +1: it equals +1 when the points lie exactly on a rising line, −1 when they lie exactly on a falling line, and 0 when there is no linear association at all. The closer r gets to the extremes, the tighter the linear relationship.
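A short sketch can make both points concrete: the coefficient computed by hand from the formula matches R's cor(), and changing the unit of one variable changes the covariance but not the correlation. The sessions and seconds values are illustrative assumptions.

```r
# Illustrative values only (assumed for this sketch)
sessions <- c(120, 340, 210, 560, 430)
seconds  <- c(31, 55, 48, 44, 58)   # time on page, in seconds
minutes  <- seconds / 60            # same quantity, different unit

# Pearson's r straight from the formula
dx <- sessions - mean(sessions)
dy <- seconds - mean(seconds)
r_manual <- sum(dx * dy) / (sqrt(sum(dx^2)) * sqrt(sum(dy^2)))

all.equal(r_manual, cor(sessions, seconds))
# [1] TRUE

# Covariance depends on the unit: it shrinks by a factor of 60 ...
cov(sessions, seconds) / cov(sessions, minutes)
# [1] 60

# ... while the correlation is unchanged
all.equal(cor(sessions, seconds), cor(sessions, minutes))
# [1] TRUE
```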

Pearson: Linear Association (and Its Trap)

Let’s put it straight to work on a case every SEO knows by heart: the link between SERP position and CTR, the click-through rate. We all know that the further down the results page you go, the fewer clicks you get. Let’s take ten positions with their observed CTRs and compute Pearson’s coefficient in R:

pos <- 1:10
ctr <- c(28.5, 15.7, 11.0, 7.2, 8.0, 5.1, 4.0, 3.2, 2.8, 2.6)  # CTR % by position

cor(pos, ctr)
# [1] -0.852

The coefficient is −0.852: strong, negative, exactly as we expected. And yet something doesn’t add up. The link between position and CTR is iron-clad — it almost never happens that a lower position yields more clicks — and we’d expect a value even closer to −1. Why does Pearson stop at −0.85?

The answer is the most important point in the whole article. Pearson measures only the linear association, that is, how well the points line up along a straight line. But the CTR curve is not a straight line: it plummets from the first to the third position and then flattens out. The relationship is very strong, it’s just curved. Pearson, which looks for straight lines, reads that curvature as “imperfection” and lowers the grade. It isn’t wrong: it’s answering a question — “how linear is this?” — that in this case isn’t the right one.
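The effect is easy to reproduce with a synthetic curve. In the sketch below, y = 1/x is an assumed stand-in for a CTR-like shape (it is not the article's data): the relationship is perfectly decreasing, yet Pearson stays well short of −1, while a genuinely straight line hits −1 exactly.

```r
x <- 1:10

# A strictly decreasing but curved relationship (synthetic, for illustration)
y_curve <- 1 / x
all(diff(y_curve) < 0)   # every step goes down
# [1] TRUE
cor(x, y_curve)          # about -0.81: strong, but penalised for the curvature

# A strictly decreasing straight line, for comparison
y_line <- 10 - x
cor(x, y_line)
# [1] -1
```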

Spearman and Kendall: Monotonic Association

For many SEO relationships we care about something weaker than linearity: it’s enough to know whether, as one variable grows, the other grows systematically (or falls systematically), without insisting it does so at a constant pace. A relationship like this is called monotonic, and to measure it there’s Spearman’s rank correlation coefficient, denoted ρ (rho).

Spearman’s trick is elegant: instead of working on the values, it works on their ranks. It replaces each number with its place in the standings (the smallest becomes 1, the next 2, and so on) and then computes an ordinary Pearson on these ranks. This way the exact shape of the curve disappears — only the order matters — and what remains is how faithfully the order of x reproduces that of y. We compute it on the same data as before:

cor(pos, ctr, method = "spearman")
# [1] -0.988

Now the coefficient is −0.988, pressed up against −1. It’s the correct picture of the situation: as the position worsens, the CTR falls almost without exception. (That “almost” is no accident: in the data I left a small, realistic inversion, position 5 yielding more than position 4, as happens when a rich snippet inflates a result’s CTR; it’s exactly the kind of ripple that keeps ρ from reaching an exact −1.) Where Pearson saw a “good but not great” association, Spearman recognises the near-perfect monotonic relationship that is actually there.
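The “Pearson on the ranks” description can be verified directly. This sketch reuses the pos and ctr vectors from above (repeated here so the block runs on its own) and also applies the textbook shortcut ρ = 1 − 6 Σd² / (n(n² − 1)), which is valid only when there are no ties, as is the case here.

```r
pos <- 1:10
ctr <- c(28.5, 15.7, 11.0, 7.2, 8.0, 5.1, 4.0, 3.2, 2.8, 2.6)

# Replace each value with its rank (smallest = 1)
rank(ctr)
#  [1] 10  9  8  6  7  5  4  3  2  1

# Spearman is an ordinary Pearson computed on the ranks
rho_ranks <- cor(rank(pos), rank(ctr))
all.equal(rho_ranks, cor(pos, ctr, method = "spearman"))
# [1] TRUE

# Shortcut formula based on rank differences (no ties assumed)
d <- rank(pos) - rank(ctr)
n <- length(pos)
rho_formula <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))
round(rho_formula, 3)
# [1] -0.988
```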

There’s a third measure worth knowing, Kendall’s tau (τ). It too works on order, but with a different logic: across all pairs of observations, it counts how many are concordant (if x rises, y rises too) and how many discordant, then takes the balance. I compute it in R, again on the same data:

cor(pos, ctr, method = "kendall")
# [1] -0.956

Kendall returns −0.956, also close to the extreme but, as is typical, a touch smaller in absolute value than Spearman. In everyday practice the choice is less complicated than it seems: Pearson when we care about a linear relationship and the data have no heavy tails or outliers; Spearman when the relationship is monotonic but curved, or when the data are already ranks (positions, standings), or when a couple of outliers might throw Pearson off; Kendall when the observations are few or there are many ties, a situation in which its statistical properties hold up better.
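The pair-counting logic behind Kendall's τ can be sketched in a few lines. This is a hand-rolled illustration, not how R computes it internally; it uses the simple version (concordant minus discordant, divided by the number of pairs), which agrees with cor() here because the data contain no ties.

```r
pos <- 1:10
ctr <- c(28.5, 15.7, 11.0, 7.2, 8.0, 5.1, 4.0, 3.2, 2.8, 2.6)

# All 45 unordered pairs of observations (one pair per column)
pairs <- combn(length(pos), 2)
i <- pairs[1, ]
j <- pairs[2, ]

# +1 if x and y move in the same direction within the pair, -1 if opposite
direction <- sign(pos[j] - pos[i]) * sign(ctr[j] - ctr[i])

concordant <- sum(direction > 0)   # 1: positions 4 and 5
discordant <- sum(direction < 0)   # 44
tau_manual <- (concordant - discordant) / ncol(pairs)

round(tau_manual, 3)
# [1] -0.956
all.equal(tau_manual, cor(pos, ctr, method = "kendall"))
# [1] TRUE
```

The single concordant pair is the inversion between positions 4 and 5: it is the only place where a worse position comes with a higher CTR.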

The Correlation Matrix

We rarely have only two metrics to compare. More often we have a handful — sessions, average duration, conversions, bounce rate — and we’d like to see all the associations at a glance. R’s cor() function, applied to an entire data frame, returns the correlation matrix: the coefficient of each variable with every other. I build it on twelve example pages:

ga4 <- data.frame(
  sessions      = c(120, 340, 210, 560, 430, 780, 650, 290, 510, 880, 360, 720),
  avg_duration  = c(31,  55,  48,  44,  58,  63,  71,  52,  46,  68,  60,  64),
  conversions   = c(3,   8,   4,   21,  11,  24,  19,  9,   17,  29,  7,   22),
  bounce_rate   = c(70,  61,  66,  44,  57,  41,  46,  59,  52,  38,  63,  45)
)

round(cor(ga4), 2)
#              sessions avg_duration conversions bounce_rate
# sessions         1.00         0.73        0.98       -0.97
# avg_duration     0.73         1.00        0.58       -0.62
# conversions      0.98         0.58        1.00       -0.99
# bounce_rate     -0.97        -0.62       -0.99        1.00

It reads like a two-way table: the diagonal is all 1s (every variable is perfectly correlated with itself), and the matrix is symmetric because the correlation of x with y is the same as y with x. As we can see, sessions and conversions travel almost in unison (0.98: more traffic, more conversions — no surprise), bounce rate is negatively correlated with everything else, while average duration associates with conversions far less than intuition would suggest (0.58). A matrix like this is a precious starting map for deciding where to look. It helps to visualise it as a heatmap (with packages such as corrplot), where colour intensity makes the strong links jump out.
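With more than a handful of metrics the matrix gets hard to scan by eye. One possible way to turn it into a ranked list of pairs, using only base R, is sketched below; the ga4 data frame is repeated so the block runs on its own, and the names m, idx and pairs are just illustrative.

```r
ga4 <- data.frame(
  sessions     = c(120, 340, 210, 560, 430, 780, 650, 290, 510, 880, 360, 720),
  avg_duration = c(31,  55,  48,  44,  58,  63,  71,  52,  46,  68,  60,  64),
  conversions  = c(3,   8,   4,   21,  11,  24,  19,  9,   17,  29,  7,   22),
  bounce_rate  = c(70,  61,  66,  44,  57,  41,  46,  59,  52,  38,  63,  45)
)

m <- cor(ga4)

# Keep each pair once: the upper triangle, excluding the diagonal
idx <- which(upper.tri(m), arr.ind = TRUE)
pairs <- data.frame(
  var1 = rownames(m)[idx[, "row"]],
  var2 = colnames(m)[idx[, "col"]],
  r    = round(m[upper.tri(m)], 2)
)

# Strongest associations first, regardless of sign
pairs[order(-abs(pairs$r)), ]
```

Sorting by absolute value keeps strong negative links (such as conversions and bounce rate) at the top alongside strong positive ones.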

One warning, though, belongs here in bold, because it’s the heart of the matter: a correlation matrix is not a causal map. It tells us which numbers move together, not which moves which, nor whether what moves them is a third factor we don’t even have in the table.

Correlation Is Not Causation

It’s the most repeated phrase in statistics, and the most ignored in practice. It’s worth seeing where it trips us up, because in SEO the stumble is a daily one. Take the classic observation: longer articles rank better. Let’s measure the association between content length and a ranking score (higher = better placed):

# named content_length rather than length, to avoid masking base R's length()
content_length <- c(620, 850, 1100, 1300, 1500, 1800, 2100, 2400, 2800, 3200)
rank_score     <- c(3,   8,   6,    11,   9,    7,    14,   10,   16,   15)

cor(content_length, rank_score)
# [1] 0.842

A fine 0.842: the correlation is there, and it’s strong. The temptation to conclude “I’ll lengthen my articles and climb the rankings” is overwhelming, and almost always wrong. Faced with a correlation, before talking about cause we must put at least three possible explanations on the table. It could be a direct cause (length genuinely helps ranking). It could be reverse causation (pages that already rank well get more care and are expanded over time). Or, in the most frequent and most insidious case, there could be a confounding factor moving both: the site’s authority. An authoritative domain tends both to produce deeper (hence longer) content and to rank better (for reasons that have nothing to do with length). Length and ranking rise together not because one causes the other, but because a third element drags them both.

This hidden third element is the root of some of the most spectacular errors in data analysis: it can even flip the sign of a relationship when the data are aggregated the wrong way, the phenomenon known as Simpson’s paradox. Establishing a causal link is a craft of its own, requiring controlled experiments or dedicated techniques; correlation, on its own, will never get there. Its job is a different one, and a valuable one: flagging the pairs of metrics worth investigating more deeply.
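To see how a confounder can manufacture a correlation, here is a small simulation. Everything in it is an assumption made for illustration: authority is an invented score, and by construction length and ranking each depend on authority but not on each other. The last step removes authority's contribution from both variables with a simple linear fit (regression proper is covered separately) and correlates what is left.

```r
set.seed(42)   # fixed seed so the sketch is reproducible
n <- 500

# Hypothetical confounder: domain authority (standardised score)
authority <- rnorm(n)

# Both variables depend on authority, but NOT on each other
content_len <- 1500 + 500 * authority + rnorm(n, sd = 300)
rank_score  <- 10   + 3   * authority + rnorm(n, sd = 2)

# Raw correlation: clearly positive, despite no direct link
cor(content_len, rank_score)

# Correlation after removing what authority explains in each variable
res_len  <- resid(lm(content_len ~ authority))
res_rank <- resid(lm(rank_score ~ authority))
cor(res_len, res_rank)   # close to zero
```

In real data the confounder is rarely measured this cleanly, if it is measured at all, which is why a raw correlation alone cannot settle the question.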

Try It Yourself

To lock in the mechanism, here’s an exercise with realistic data. For ten pages we have the number of referring domains linking to them and their monthly organic traffic, and we want to understand how strongly the two are associated:

bl  <- c(5, 12, 8, 25, 18, 40, 33, 60, 52, 95)        # referring domains
org <- c(180, 240, 420, 510, 760, 690, 1250, 1100, 1900, 1650)  # organic sessions/month

The task: compute both Pearson’s coefficient with cor(bl, org) and Spearman’s with cor(bl, org, method = "spearman"), and reflect on why they differ.

To check your work: Pearson is 0.815 and Spearman 0.855. Both are high and tell the same underlying story (more referring domains, more traffic), but the fact that Spearman is a bit higher than Pearson hints at something: the relationship looks more monotonic than linear, which is consistent with each extra link beyond a certain threshold bringing less marginal traffic than a straight line would predict. With only ten pages and a gap this small, that is a clue to check rather than a conclusion. And, of course, neither number entitles us to say that buying backlinks will raise traffic: here too the site’s authority might be moving both things together.
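For reference, the two values can be reproduced with the following lines (the data are the same as in the exercise, repeated so the block runs on its own):

```r
bl  <- c(5, 12, 8, 25, 18, 40, 33, 60, 52, 95)                   # referring domains
org <- c(180, 240, 420, 510, 760, 690, 1250, 1100, 1900, 1650)   # organic sessions/month

round(cor(bl, org), 3)                       # Pearson
# [1] 0.815
round(cor(bl, org, method = "spearman"), 3)  # Spearman
# [1] 0.855
```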

With correlation we’ve learned to answer the question of whether, and how much, two metrics are associated — choosing Pearson, Spearman or Kendall each time depending on the shape of the link. It’s the indispensable rung before the next question, the one anyone analysing data eventually asks: given an association, can I use one variable to predict the other, and draw the line that ties them together? From here on we no longer just measure the strength of a link, we model it: this is the territory of linear regression, where the very coefficient r we’ve just met returns to the stage, this time in the service of prediction.