{"id":3947,"date":"2026-06-30T07:42:57","date_gmt":"2026-06-30T06:42:57","guid":{"rendered":"https:\/\/www.gironi.it\/blog\/?p=3947"},"modified":"2026-06-30T07:42:59","modified_gmt":"2026-06-30T06:42:59","slug":"keyword-clustering","status":"publish","type":"post","link":"https:\/\/www.gironi.it\/blog\/en\/keyword-clustering\/","title":{"rendered":"Keyword Clustering: grouping thousands of queries with K-means and hierarchical clustering"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">It happens with every reasonably serious project: you export the keyword list from Search Console or a tool, and you find yourself facing thousands of rows. Three thousand, ten thousand queries. Reading them one by one is unthinkable, and grouping them by hand &#8220;by feel&#8221; is slow, subjective and impossible to reproduce.<br>Yet we need that grouping: we want to understand which big families of searches exist in our market, in order to decide where to create content, which pages to build, what to bet on.<br>The question is: can we let the data reveal the groups, instead of imposing them ourselves? Turning that mountain of queries into a few homogeneous sets is the job of <em>keyword clustering<\/em>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We have already tackled a close problem, classifying the <a href=\"https:\/\/www.gironi.it\/blog\/en\/naive-bayes-search-intent\/\">intent of a query with Naive Bayes<\/a> \u2014 but there we had an ingredient we lack today: a set of <em>already labelled<\/em> examples to learn from. Here nobody has handed us the labels. 
This is the territory of <em>clustering<\/em>, one of the most used tools of <a href=\"https:\/\/www.gironi.it\/blog\/en\/understanding-the-basics-of-machine-learning-a-beginners-guide\/\">machine learning<\/a>, and in this article we build it in R with its two classic algorithms: <em>K-means<\/em> and hierarchical clustering.<\/p>\n\n\n\n<!--more-->\n\n\n\n<p class=\"wp-block-paragraph\"><strong>What we will cover<\/strong>:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><a href=\"#without-labels\">Grouping without labels: the idea of clustering<\/a><\/li><li><a href=\"#k-means\">K-means: centroids and the problem of choosing k<\/a><\/li><li><a href=\"#reading-clusters\">Reading the clusters: who are these groups?<\/a><\/li><li><a href=\"#hierarchical\">Hierarchical clustering: the dendrogram<\/a><\/li><li><a href=\"#which-method\">Which method, and the traps<\/a><\/li><li><a href=\"#try-it-yourself\">Try it yourself<\/a><\/li><li><a href=\"#further-reading\">Further reading<\/a><\/li><\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"without-labels\">Grouping without labels: the idea of clustering<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The difference with Naive Bayes is one of principle, not of detail. There we did <em>supervised learning<\/em>: we had queries already marked as informational, navigational or transactional, and we taught the algorithm to recognise new ones. Here we do <em>unsupervised learning<\/em>: nobody told us which and how many groups exist. <strong>Clustering does not verify a label we already know: it looks for a structure we did not know was there.<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Grouping requires two things. 
The first is describing each keyword with numbers: in our example we will use search volume, cost per click (<em>cpc<\/em>), average position and word count.<br>The second is a notion of <em>distance<\/em>: two keywords are &#8220;close&#8221; if their numbers are alike. The most common distance is the Euclidean one, the same one we would use on a map, only computed in a four-dimensional space (one dimension per metric).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">There is, however, a trap to defuse straight away. Volume runs into the tens of thousands, while cpc stays around one euro or less: left as they are, the distances would be dominated by volume, and cpc would count for almost nothing.<br><strong>Before computing any distance we must put all the variables on the same scale<\/strong>, standardising them \u2014 in R with the <code>scale()<\/code> function, which subtracts the mean from each column and divides by the standard deviation. Only then does a difference of one standard deviation in cpc &#8220;weigh&#8221; the same as a difference of one standard deviation in volume.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"k-means\">K-means: centroids and the problem of choosing k<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The idea of K-means is almost naive in its simplicity. We decide into how many groups (k) we want to split the data; the algorithm places k representative points, the <em>centroids<\/em>, and then repeats two steps until things settle: it assigns each keyword to the nearest centroid, then moves each centroid to the centre of the keywords assigned to it. At each pass the groups grow a little more cohesive, until the assignments stop changing.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">What the algorithm tries to minimise, in words, is the internal spread of the groups: the sum of the (squared) distances of each point from the centroid of its own cluster. 
In a formula:<\/p>\n\n\n\n\\( \\text{WCSS} = \\sum_{k=1}^{K} \\sum_{x \\in C_k} \\lVert x - \\mu_k \\rVert^2 \\\\ \\)\n\n\n\n<p class=\"wp-block-paragraph\">where \\( C_k \\) is the k-th cluster, \\( \\mu_k \\) its centroid and the double sum runs over all points of all groups. The lower the WCSS (<em>within-cluster sum of squares<\/em>), the more compact the groups.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The sore point remains: we have to decide k ourselves, before starting. The <em>elbow method<\/em> helps: we try several values of k and watch how the WCSS falls. At first adding a cluster helps a lot, then the improvements become marginal; the &#8220;elbow&#8221; of the curve \u2014 the point where the descent flattens \u2014 suggests a reasonable k. I build the keyword table and compute the WCSS for 1 to 6 groups in R:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>kw &lt;- data.frame(\n  keyword = c(\"running shoes\",\"running shoes men\",\"nike pegasus\",\"nike pegasus 40\",\n              \"best trail running shoes 2026\",\"how to choose running shoes\",\n              \"running shoes deal\",\"buy trail shoes online\",\n              \"trail vs road running shoes\",\"running shoes overpronation\",\n              \"asics gel nimbus\",\"saucony endorphin\",\"trail running shoes review\",\n              \"discount running shoes\",\"running shoe store london\"),\n  volume   = c(40000,18000,12000,8000,2400,1900,3200,880,1300,2100,9000,4000,1500,2600,720),\n  cpc      = c(0.45,0.55,0.30,0.35,0.40,0.10,0.95,1.10,0.08,0.30,0.28,0.33,0.15,0.90,0.85),\n  position = c(3.1,4.2,2.0,5.5,8.1,11.2,6.0,9.4,14.0,7.3,2.5,6.8,12.1,5.9,4.7),\n  n_words  = c(2,3,2,3,5,5,3,4,6,3,3,2,3,3,4)\n)\n\n# standardise the four metrics (different scales -&gt; same weight)\nX &lt;- scale(kw[, c(\"volume\",\"cpc\",\"position\",\"n_words\")])\n\n# elbow method: WCSS for k from 1 to 6\nset.seed(1)\nwss &lt;- sapply(1:6, function(k) kmeans(X, centers = k, nstart = 
10)$tot.withinss)\nround(wss, 1)\n# [1] 56.0 34.4 20.7 12.4  8.7  6.9<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The drop is steep up to three groups (56 \u2192 34 \u2192 21) and then slows down markedly (21 \u2192 12 \u2192 9 \u2192 7). The elbow is never a sharp line \u2014 it is a reading, not a theorem \u2014 but here it reasonably points to <strong>k = 3<\/strong>. So I run K-means with three centroids:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>set.seed(1)\nkm &lt;- kmeans(X, centers = 3, nstart = 25)\nkw$cluster &lt;- km$cluster\ntable(km$cluster)\n# 1 2 3\n# 4 7 4<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Note: the argument <code>nstart = 25<\/code> runs the algorithm 25 times from different random initial centroids and keeps the best solution. K-means can get stuck in a local minimum depending on where it starts, and restarting several times is the standard defence. The <code>set.seed(1)<\/code> only serves to make the example reproducible (including the cluster numbering, which is itself arbitrary).<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"reading-clusters\">Reading the clusters: who are these groups?<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Having three groups is useless until we understand <em>what<\/em> they represent. 
The most direct way is to look at the average metrics of each cluster.<br>I compute them on the original scale (not the standardised one, which is unreadable):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>aggregate(kw[, c(\"volume\",\"cpc\",\"position\",\"n_words\")],\n          by = list(cluster = kw$cluster), FUN = mean)\n#   cluster volume  cpc position n_words\n# 1       1   1775 0.18    11.35    4.75\n# 2       2  13300 0.37     4.49    2.57\n# 3       3   1850 0.95     6.50    3.50<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Now the groups speak.<br><strong>Cluster 2<\/strong> collects the high-volume queries (13,300 searches on average): short (two or three words), well ranked (average position around 4.5) and with a modest cpc. They are the generic and brand <em>heads<\/em> \u2014 &#8220;running shoes&#8221;, &#8220;nike pegasus&#8221;, &#8220;asics gel nimbus&#8221;. <strong>Cluster 1<\/strong> has low volumes, long queries (almost five words), poor rankings (average position above 11) and minimal cpc: it is the <em>informational<\/em> long-tail \u2014 &#8220;how to choose running shoes&#8221;, &#8220;trail vs road running shoes&#8221;, &#8220;trail running shoes review&#8221;. <strong>Cluster 3<\/strong> stands out for by far the highest cpc (\u20ac0.95 on average) and contains the clearly <em>commercial<\/em> queries \u2014 &#8220;running shoes deal&#8221;, &#8220;buy trail shoes online&#8221;, &#8220;discount running shoes&#8221;, &#8220;running shoe store london&#8221;.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">It is worth pausing a moment on what just happened. 
<strong>Without giving the algorithm any label, the three groups that emerge closely echo a split by intent \u2014 informational queries, generic and brand heads, commercial queries \u2014 close to the one we had to teach Naive Bayes through examples.<\/strong><br>The alignment, mind you, is not magic: it emerges because the metrics we chose (cpc, length, position) <em>indirectly track<\/em> intent, not because clustering knows it. The algorithm has not seen a single word of the queries themselves. The reading is still immediately actionable: the informational cluster calls for articles and guides, the commercial one for product and offer pages, the high-volume one for robust pillar pages.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"hierarchical\">Hierarchical clustering: the dendrogram<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">K-means forced us to choose k in advance. <em>Hierarchical<\/em> clustering flips the approach: it decides nothing a priori, and builds instead a complete tree of groupings. It starts with each keyword as a group of its own, then merges the two closest groups, then the two closest among those that remain, and so on up to a single big group. 
The result is a <em>dendrogram<\/em>: a tree showing at what &#8220;height&#8221; (that is, at what distance) each merge happens.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I build it in R by first computing the distance matrix, then the tree:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>d  &lt;- dist(X)                       # euclidean distances between standardised keywords\nhc &lt;- hclust(d, method = \"ward.D2\") # hierarchical tree (Ward's criterion)\nplot(hc, labels = kw$keyword)       # the dendrogram<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The beauty of the dendrogram is that we make the decision on <em>how many<\/em> groups to keep <em>after<\/em> seeing it, simply by &#8220;cutting&#8221; it at a chosen height: a low cut leaves many small groups, a high cut a few large ones. I cut at three groups and compare with K-means:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>kw$cluster_hc &lt;- cutree(hc, k = 3)\ntable(kw$cluster_hc)\n# 1 2 3\n# 8 3 4<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The partition is not identical to the K-means one (here the groups have 8, 3 and 4 keywords), and that is normal: the two methods optimise different criteria and on little data the differences show.<br>But the substance of the groupings \u2014 the high-cpc transactional block, the high-volume heads, the informational tail \u2014 remains recognisable in both. When two different methods converge on the same story, we can trust that story a little more.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"which-method\">Which method, and the traps<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The choice between the two, in practice, follows a few simple rules. <strong>K-means<\/strong> is fast and efficient even on tens of thousands of keywords, but it wants to be told k and tends to build &#8220;spherical&#8221; groups of similar size. 
The <strong>hierarchical<\/strong> one does not ask for k in advance and gives the dendrogram \u2014 precious for <em>seeing<\/em> how the groups nest inside one another \u2014 but becomes heavy when the keywords are too many. The most common practice: explore with hierarchical on a sample to get a sense of the number of groups, then apply K-means to the whole set with the k thus identified.<\/p>\n\n\n\n<p class=\"has-light-gray-background-color has-background wp-block-paragraph\">A word of caution, the most important of all: <strong>clustering always finds groups, even when there are none.<\/strong> Even pure noise comes back dutifully split into k tidy clusters. The number of groups, the metrics chosen to describe the keywords, the standardisation: they are all <em>our<\/em> decisions, and each one changes the result. A grouping is not a truth discovered in the data, it is a working hypothesis that only makes sense if it survives the test of business common sense. If we cannot explain a cluster in words, it probably does not really exist.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">There is then a question of dimensions. Here we used four metrics, but in real operations the variables describing a keyword can be many more, and with many dimensions distances lose meaning (everything tends to look equally far). It is exactly the problem that <a href=\"https:\/\/www.gironi.it\/blog\/en\/principal-component-analysis-pca\/\">principal component analysis<\/a> knows how to ease, compressing many metrics into a few components before passing the baton to clustering.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"try-it-yourself\">Try it yourself<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The best way to understand clustering is to watch it change its answer as the choices change. 
Building on the code above:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>Skip the standardisation: run K-means directly on <code>kw[, c(\"volume\",\"cpc\",\"position\",\"n_words\")]<\/code> without <code>scale()<\/code>. Do the groups all collapse onto volume? It is the practical demonstration of why we standardise.<\/li><li>Change k: try four or five groups and re-read the averages. Does the informational cluster split into sensible sub-themes or are you just cutting noise?<\/li><li>Change the merging criterion of the hierarchical method: <code>method = \"complete\"<\/code> or <code>\"average\"<\/code> instead of <code>\"ward.D2\"<\/code>. Does the dendrogram change shape? And the groups cut at k=3?<\/li><\/ol>\n\n\n\n<p class=\"wp-block-paragraph\">A hint: always keep an eye on the per-cluster averages with <code>aggregate()<\/code>. It is there, and not in the code, that you decide whether a grouping is useful or just an elegant partition of nothing.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p class=\"wp-block-paragraph\">We grouped the keywords by how they <em>behave<\/em> \u2014 volume, cost, position.<br>But what matters most to an SEO is left out: their <em>meaning<\/em>. Two queries can have different metrics and mean the same thing, or similar metrics and opposite intents. Grouping by sense, and not only by numbers, means turning the very text of the queries into something measurable: it is the job of <em>text mining<\/em>, where words become vectors and similarity is computed on language. 
And that is where we will pick up next.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"further-reading\">Further reading<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">If you want to go deeper into clustering \u2014 K-means, hierarchical methods, the choice of the number of groups and the pitfalls of interpretation \u2014 <em><a href=\"https:\/\/www.amazon.it\/dp\/1461471370?tag=consulenzeinf-21\" rel=\"nofollow sponsored noopener\" target=\"_blank\">An Introduction to Statistical Learning<\/a><\/em> by James, Witten, Hastie and Tibshirani devotes a lucid chapter to unsupervised learning, with R labs that retrace exactly the steps we saw here. It is the reference I recommend to anyone who wants to move from the toy example to clustering on real data.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>It happens with every reasonably serious project: you export the keyword list from Search Console or a tool, and you find yourself facing thousands of rows. Three thousand, ten thousand queries. 
Reading them one by one is unthinkable, and grouping them by hand &#8220;by feel&#8221; is slow, subjective and impossible to reproduce. Yet we need that &hellip; <a href=\"https:\/\/www.gironi.it\/blog\/en\/keyword-clustering\/\" class=\"more-link\">Read more<span class=\"screen-reader-text\"> &#8220;Keyword Clustering: grouping thousands of queries with K-means and hierarchical clustering&#8221;<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_uag_custom_page_level_css":"","footnotes":""},"categories":[161],"tags":[],"class_list":["post-3947","post","type-post","status-publish","format-standard","hentry","category-statistics"],"lang":"en","translations":{"en":3947,"it":3946},"uagb_featured_image_src":{"full":false,"thumbnail":false,"medium":false,"medium_large":false,"large":false,"1536x1536":false,"2048x2048":false,"post-thumbnail":false},"uagb_author_info":{"display_name":"Paolo Gironi","author_link":"https:\/\/www.gironi.it\/blog\/author\/autore-articoli\/"},"uagb_comment_info":0,"uagb_excerpt":"It happens with every reasonably serious project: you export the keyword list from Search Console or a tool, and you find yourself facing thousands of rows. Three thousand, ten thousand queries. 
Reading them one by one is unthinkable, and grouping them by hand &#8220;by feel&#8221; is slow, subjective and impossible to reproduce. Yet we need that&hellip;","_links":{"self":[{"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/posts\/3947","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/comments?post=3947"}],"version-history":[{"count":1,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/posts\/3947\/revisions"}],"predecessor-version":[{"id":3948,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/posts\/3947\/revisions\/3948"}],"wp:attachment":[{"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/media?parent=3947"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/categories?post=3947"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/tags?post=3947"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}