  {"id":3821,"date":"2026-06-19T08:10:34","date_gmt":"2026-06-19T07:10:34","guid":{"rendered":"https:\/\/www.gironi.it\/blog\/?p=3821"},"modified":"2026-06-19T08:10:35","modified_gmt":"2026-06-19T07:10:35","slug":"correlation","status":"publish","type":"post","link":"https:\/\/www.gironi.it\/blog\/en\/correlation\/","title":{"rendered":"Correlation: Pearson, Spearman and Kendall (and Why It Isn&#8217;t Causation)"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">Anyone who looks at a website&#8217;s data does it constantly, often without noticing: they spot that two things seem to move together. Pages that sit higher in the SERP get more clicks; the ones where users linger longer convert more; longer articles appear to rank better. These are valuable hunches, but they stay vague until we answer a precise question: <em>how much<\/em> do these pairs of numbers move together? And in what sense? We need an index that turns the impression &#8220;they go hand in hand&#8221; into a comparable measure. That index is <strong>correlation<\/strong>, and it is one of the most used \u2014 and most misunderstood \u2014 tools in all of applied statistics.<\/p>\n\n\n\n<!--more-->\n\n\n\n<p class=\"wp-block-paragraph\">Let&#8217;s say right away what correlation is <em>not<\/em>, because this is where the trouble starts. Correlation measures whether and how much two variables are associated; it does not say that one causes the other, and it does not build a model to predict one from the other. That second step \u2014 prediction \u2014 is the job of regression, which we&#8217;ll cover separately. Here we stay on the previous rung: understanding, with a single number, whether two metrics travel together.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">From Covariance to Correlation<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The starting idea is simple. 
If two variables move together, when one sits above its own mean the other tends to sit above its own too; when one drops below, the other follows. We can measure this tendency by multiplying, for each observation, the deviation of <em>x<\/em> from its mean by the deviation of <em>y<\/em> from its, and averaging the result. This is the <strong>covariance<\/strong>:<\/p>\n\n\n\n\\( \\text{cov}(x, y) = \\frac{1}{n} \\sum_{i=1}^{n} (x_i - \\bar{x})(y_i - \\bar{y}) \\\\ \\)\n\n\n\n<p class=\"wp-block-paragraph\">where <em>x\u0304<\/em> and <em>\u0233<\/em> are the means of the two variables and <em>n<\/em> the number of observations. When the deviations share the same sign (both above or both below the mean) the product is positive; when they have opposite signs it is negative. A positive covariance thus signals that the two variables tend to grow together, a negative one that when one rises the other falls.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Covariance, however, has a flaw that makes it useless as a yardstick: <strong>it depends on the units of measurement<\/strong>. The covariance between sessions and seconds-on-page is one number, the one between sessions and conversion rate another, and the two can&#8217;t be compared because they speak different languages. To get a clean measure we divide it by the two standard deviations, stripping it of units and forcing it into a fixed range. The result is the <strong>Pearson correlation coefficient<\/strong>:<\/p>\n\n\n\n\\( r = \\frac{\\sum_{i=1}^{n} (x_i - \\bar{x})(y_i - \\bar{y})}{\\sqrt{\\sum_{i=1}^{n} (x_i - \\bar{x})^2} \\; \\sqrt{\\sum_{i=1}^{n} (y_i - \\bar{y})^2}} \\\\ \\)\n\n\n\n<p class=\"wp-block-paragraph\">The numerator is nothing but the covariance (up to the factor <em>n<\/em>); the denominator is the product of the two spreads, and serves precisely to normalise. 
The result is a pure number between <strong>\u22121 and +1<\/strong>: it equals +1 when the points lie exactly on a rising line, \u22121 when they lie on a falling line, 0 when there is no linear association at all. The closer <em>r<\/em> gets to the extremes, the tighter the linear relationship.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Pearson: Linear Association (and Its Trap)<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Let&#8217;s put it straight to work on a case every SEO knows by heart: the link between <strong>SERP position<\/strong> and <strong>CTR<\/strong>, the click-through rate. We all know that the further down the results page you go, the fewer clicks you get. Let&#8217;s take ten positions with their observed CTRs and compute Pearson&#8217;s coefficient in R:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>pos &lt;- 1:10\nctr &lt;- c(28.5, 15.7, 11.0, 7.2, 8.0, 5.1, 4.0, 3.2, 2.8, 2.6)  # CTR % by position\n\ncor(pos, ctr)\n# [1] -0.852<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The coefficient is <strong>\u22120.852<\/strong>: strong, negative, exactly as we expected. And yet something doesn&#8217;t add up. The link between position and CTR is iron-clad \u2014 it almost never happens that a lower position yields more clicks \u2014 and we&#8217;d expect a value even closer to \u22121. Why does Pearson stop at \u22120.85?<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The answer is the most important point in the whole article. <strong>Pearson measures only the linear association<\/strong>, that is, how well the points line up along a <em>straight line<\/em>. But the CTR curve is not a straight line: it plummets from the first to the third position and then flattens out. The relationship is very strong, it&#8217;s just <em>curved<\/em>. Pearson, which looks for straight lines, reads that curvature as &#8220;imperfection&#8221; and lowers the grade. 
It isn&#8217;t wrong: it&#8217;s answering a question \u2014 &#8220;how linear is this?&#8221; \u2014 that in this case isn&#8217;t the right one.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Spearman and Kendall: Monotonic Association<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">For many SEO relationships we care about something weaker than linearity: it&#8217;s enough to know whether, as one variable grows, the other grows <em>systematically<\/em> (or falls systematically), without insisting it does so at a constant pace. A relationship like this is called <strong>monotonic<\/strong>, and to measure it there&#8217;s <strong>Spearman&#8217;s<\/strong> rank correlation coefficient, denoted \u03c1 (rho).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Spearman&#8217;s trick is elegant: instead of working on the values, it works on their <strong>ranks<\/strong>. It replaces each number with its place in the standings (the smallest becomes 1, the next 2, and so on) and then computes an ordinary Pearson on these ranks. This way the exact shape of the curve disappears \u2014 only the order matters \u2014 and what remains is how faithfully the order of <em>x<\/em> reproduces that of <em>y<\/em>. We compute it on the same data as before:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>cor(pos, ctr, method = \"spearman\")\n# [1] -0.988<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Now the coefficient is <strong>\u22120.988<\/strong>, pressed up against \u22121. It&#8217;s the correct picture of the situation: as the position worsens, the CTR falls almost without exception. (That &#8220;almost&#8221; is no accident: in the data I left a small, realistic inversion, position 5 yielding more than position 4, as happens when a rich snippet inflates a result&#8217;s CTR; it&#8217;s exactly the kind of ripple that keeps \u03c1 from reaching an exact \u22121.) 
Where Pearson saw a &#8220;good but not great&#8221; association, Spearman recognises the near-perfect monotonic relationship that is actually there.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">There&#8217;s a third measure worth knowing, <strong>Kendall&#8217;s tau<\/strong> (\u03c4). It too works on order, but with a different logic: across all pairs of observations, it counts how many are <em>concordant<\/em> (if <em>x<\/em> rises, <em>y<\/em> rises too) and how many <em>discordant<\/em>, then takes the balance. I compute it in R, again on the same data:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>cor(pos, ctr, method = \"kendall\")\n# [1] -0.956<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Kendall returns <strong>\u22120.956<\/strong>, also close to the extremes but typically a touch more conservative than Spearman. In everyday practice the choice is less complicated than it seems: <strong>Pearson<\/strong> when we care about a linear relationship and the data have no violent tails or outliers; <strong>Spearman<\/strong> when the relationship is monotonic but curved, or when the data are already ranks (positions, standings), or when a couple of outliers might throw Pearson off; <strong>Kendall<\/strong> when the observations are few or there are many ties, a situation in which its statistical properties hold up better.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The Correlation Matrix<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">We rarely have only two metrics to compare. More often we have a handful \u2014 sessions, average duration, conversions, bounce rate \u2014 and we&#8217;d like to see <em>all<\/em> the associations at a glance. R&#8217;s <code>cor()<\/code> function, applied to an entire data frame, returns the <strong>correlation matrix<\/strong>: the coefficient of each variable with every other. 
I build it on twelve example pages:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>ga4 &lt;- data.frame(\n  sessions      = c(120, 340, 210, 560, 430, 780, 650, 290, 510, 880, 360, 720),\n  avg_duration  = c(31,  55,  48,  44,  58,  63,  71,  52,  46,  68,  60,  64),\n  conversions   = c(3,   8,   4,   21,  11,  24,  19,  9,   17,  29,  7,   22),\n  bounce_rate   = c(70,  61,  66,  44,  57,  41,  46,  59,  52,  38,  63,  45)\n)\n\nround(cor(ga4), 2)\n#              sessions avg_duration conversions bounce_rate\n# sessions         1.00         0.73        0.98       -0.97\n# avg_duration     0.73         1.00        0.58       -0.62\n# conversions      0.98         0.58        1.00       -0.99\n# bounce_rate     -0.97        -0.62       -0.99        1.00<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">It reads like a two-way table: the diagonal is all 1s (every variable is perfectly correlated with itself), and the matrix is symmetric because the correlation of <em>x<\/em> with <em>y<\/em> is the same as <em>y<\/em> with <em>x<\/em>. As we can see, sessions and conversions travel almost in unison (0.98: more traffic, more conversions \u2014 no surprise), bounce rate is negatively correlated with everything else, while average duration associates with conversions far less than intuition would suggest (0.58). A matrix like this is a precious starting map for deciding where to look. It helps to visualise it as a <strong>heatmap<\/strong> (with packages such as <code>corrplot<\/code>), where colour intensity makes the strong links jump out.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">One warning, though, belongs here in bold, because it&#8217;s the heart of the matter: <strong>a correlation matrix is not a causal map<\/strong>. 
It tells us which numbers move together, not which moves which, nor whether what moves them is a third factor we don&#8217;t even have in the table.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Correlation Is Not Causation<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">It&#8217;s the most repeated phrase in statistics, and the most ignored in practice. It&#8217;s worth seeing where it trips us up, because in SEO the stumble is a daily one. Take the classic observation: longer articles rank better. Let&#8217;s measure the association between content length and a ranking score (higher = better placed):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>length     &lt;- c(620, 850, 1100, 1300, 1500, 1800, 2100, 2400, 2800, 3200)\nrank_score &lt;- c(3,   8,   6,    11,   9,    7,    14,   10,   16,   15)\n\ncor(length, rank_score)\n# [1] 0.842<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">A fine <strong>0.842<\/strong>: the correlation is there, and it&#8217;s robust. The temptation to conclude &#8220;I&#8217;ll lengthen my articles and climb the rankings&#8221; is overwhelming \u2014 and almost always wrong. Faced with a correlation, before talking about cause we must put at least three alternative explanations on the table. It could be a <strong>direct cause<\/strong> (length genuinely helps ranking). It could be <strong>reverse causation<\/strong> (pages that already rank well get more care and are expanded over time). Or \u2014 the most frequent and most insidious case \u2014 there could be a <strong>confounding factor<\/strong> moving both: the site&#8217;s authority. An authoritative domain tends both to produce deeper (hence longer) content and to rank better (for reasons that have nothing to do with length). 
Length and ranking rise together not because one causes the other, but because a third element drags them both.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This hidden third element is the root of some of the most spectacular errors in data analysis: it can even flip the sign of a relationship when the data are aggregated the wrong way, the phenomenon known as <a href=\"https:\/\/www.gironi.it\/blog\/en\/simpsons-paradox-in-seo-when-aggregate-data-can-lie\/\">Simpson&#8217;s paradox<\/a>. Establishing a causal link is a craft of its own, requiring controlled experiments or dedicated techniques; correlation, on its own, will never get there. Its job is a different one, and a valuable one: flagging the pairs of metrics worth investigating more deeply.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Try It Yourself<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">To lock in the mechanism, here&#8217;s an exercise with realistic data. For ten pages we have the number of referring domains linking to them and their monthly organic traffic, and we want to understand how strongly the two are associated:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>bl  &lt;- c(5, 12, 8, 25, 18, 40, 33, 60, 52, 95)        # referring domains\norg &lt;- c(180, 240, 420, 510, 760, 690, 1250, 1100, 1900, 1650)  # organic sessions\/month<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The task: compute both Pearson&#8217;s coefficient with <code>cor(bl, org)<\/code> and Spearman&#8217;s with <code>cor(bl, org, method = \"spearman\")<\/code>, and reflect on why they differ.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To check your work: Pearson is <strong>0.815<\/strong> and Spearman <strong>0.855<\/strong>. 
Both are high and tell the same underlying story (more referring domains, more traffic), but Spearman coming out a bit higher than Pearson hints at something: the relationship is more <em>monotonic<\/em> than <em>linear<\/em>, which is consistent with diminishing returns, where beyond a certain threshold each extra referring domain brings less additional traffic than a straight line would predict. With only ten pages and a gap of 0.04, that is a hint, not proof. And, of course, neither number entitles us to say that buying backlinks <em>will<\/em> raise traffic: here too the site&#8217;s authority might be moving both things together.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p class=\"wp-block-paragraph\">With correlation we&#8217;ve learned to answer the question of <em>whether, and how much, two metrics are associated<\/em> \u2014 choosing Pearson, Spearman or Kendall each time depending on the shape of the link. It&#8217;s the indispensable rung before the next question, the one anyone analysing data eventually asks: given an association, can I use one variable to <em>predict<\/em> the other, and draw the line that ties them together? 
From here on we no longer just measure the strength of a link, we model it: this is the territory of <a href=\"https:\/\/www.gironi.it\/blog\/en\/correlation-and-regression-analysis-linear-regression\/\">linear regression<\/a>, where the very coefficient <em>r<\/em> we&#8217;ve just met returns to the stage, this time in the service of prediction.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">Further Reading<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">On correlation, causation and the art of not confusing the two, the book I recommend most often is <a href=\"https:\/\/www.amazon.it\/dp\/0241258766?tag=consulenzeinf-21\" rel=\"nofollow sponsored noopener\" target=\"_blank\"><em>The Art of Statistics<\/em><\/a> by David Spiegelhalter: it walks through real cases where an association does \u2014 and does not \u2014 imply a cause, with exactly the clarity that anyone coming from applications needs.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Anyone who looks at a website&#8217;s data does it constantly, often without noticing: they spot that two things seem to move together. Pages that sit higher in the SERP get more clicks; the ones where users linger longer convert more; longer articles appear to rank better. 
These are valuable hunches, but they stay vague until &hellip; <a href=\"https:\/\/www.gironi.it\/blog\/en\/correlation\/\" class=\"more-link\">Read more<span class=\"screen-reader-text\"> &#8220;Correlation: Pearson, Spearman and Kendall (and Why It Isn&#8217;t Causation)&#8221;<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_uag_custom_page_level_css":"","footnotes":""},"categories":[161],"tags":[],"class_list":["post-3821","post","type-post","status-publish","format-standard","hentry","category-statistics"],"lang":"en","translations":{"en":3821,"it":3820},"uagb_featured_image_src":{"full":false,"thumbnail":false,"medium":false,"medium_large":false,"large":false,"1536x1536":false,"2048x2048":false,"post-thumbnail":false},"uagb_author_info":{"display_name":"Paolo Gironi","author_link":"https:\/\/www.gironi.it\/blog\/author\/autore-articoli\/"},"uagb_comment_info":0,"uagb_excerpt":"Anyone who looks at a website&#8217;s data does it constantly, often without noticing: they spot that two things seem to move together. Pages that sit higher in the SERP get more clicks; the ones where users linger longer convert more; longer articles appear to rank better. 
These are valuable hunches, but they stay vague until&hellip;","_links":{"self":[{"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/posts\/3821","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/comments?post=3821"}],"version-history":[{"count":1,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/posts\/3821\/revisions"}],"predecessor-version":[{"id":3823,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/posts\/3821\/revisions\/3823"}],"wp:attachment":[{"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/media?parent=3821"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/categories?post=3821"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/tags?post=3821"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}