Categories: statistics

Keyword Clustering: grouping thousands of queries with K-means and hierarchical clustering

It happens with every reasonably serious project: you export the keyword list from Search Console or a tool, and you find yourself facing thousands of rows. Three thousand, ten thousand queries. Reading them one by one is unthinkable, and grouping them by hand “by feel” is slow, subjective and impossible to reproduce.
Yet we need that grouping: we want to understand which big families of searches exist in our market, in order to decide where to create content, which pages to build, what to bet on.
The question is: can we let the data reveal the groups, instead of imposing them ourselves? Turning that mountain of queries into a few homogeneous sets is the job of keyword clustering.

We have already tackled a close problem, classifying the intent of a query with Naive Bayes — but there we had an ingredient we lack today: a set of already labelled examples to learn from. Here nobody has handed us the labels. This is the territory of clustering, one of the most used tools of machine learning, and in this article we build it in R with its two classic algorithms: K-means and hierarchical clustering.

What we will cover:

Grouping without labels: the idea of clustering
K-means: centroids and the problem of choosing k
Reading the clusters: who are these groups?
Hierarchical clustering: the dendrogram
Which method, and the traps
Try it yourself
Further reading

Grouping without labels: the idea of clustering

The difference with Naive Bayes is one of principle, not of detail. There we did supervised learning: we had queries already marked as informational, navigational or transactional, and we taught the algorithm to recognise new ones. Here we do unsupervised learning: nobody told us which and how many groups exist. Clustering does not verify a label we already know: it looks for a structure we did not know was there.

Grouping requires two things. The first is describing each keyword with numbers: in our example we will use search volume, cost per click (cpc), average position and word count.
The second is a notion of distance: two keywords are “close” if their numbers are alike. The most common distance is the Euclidean one, the same we would use on a map, only computed in a four-dimensional space (one per metric).

There is, however, a trap to defuse straight away. Volume is measured in tens of thousands, cpc in cents of a euro: left as they are, the distances would be dominated by volume, and cpc would count for almost nothing.
Before computing any distance we must put all the variables on the same scale, standardising them — in R with the scale() function, which subtracts the mean from each column and divides by the standard deviation. Only then does one euro of difference in cpc and ten thousand searches of difference in volume “weigh” the same.

K-means: centroids and the problem of choosing k

The idea of K-means is almost naive in its simplicity. We decide into how many groups (k) we want to split the data; the algorithm places k representative points, the centroids, and then repeats two steps until things settle: it assigns each keyword to the nearest centroid, then moves each centroid to the centre of the keywords assigned to it. At each pass the groups grow a little more cohesive, until they stop moving.

What the algorithm tries to minimise, in words, is the internal spread of the groups: the sum of the (squared) distances of each point from the centroid of its own cluster. In a formula:

\( \text{WCSS} = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x – \mu_k \rVert^2 \\ \)

where \( C_k \) is the k-th cluster, \( \mu_k \) its centroid and the double sum runs over all points of all groups. The lower the WCSS (within-cluster sum of squares), the more compact the groups.

The sore point remains: we have to decide k ourselves, before starting. A help comes from the elbow method: we try several values of k and watch how the WCSS falls. At first adding a cluster helps a lot, then the improvements become marginal; the “elbow” of the curve — the point where the descent flattens — suggests a reasonable k. I build the keyword table and compute the WCSS from 1 to 6 groups in R:

kw <- data.frame(
  keyword = c("running shoes","running shoes men","nike pegasus","nike pegasus 40",
              "best trail running shoes 2026","how to choose running shoes",
              "running shoes deal","buy trail shoes online",
              "trail vs road running shoes","running shoes overpronation",
              "asics gel nimbus","saucony endorphin","trail running shoes review",
              "discount running shoes","running shoe store london"),
  volume   = c(40000,18000,12000,8000,2400,1900,3200,880,1300,2100,9000,4000,1500,2600,720),
  cpc      = c(0.45,0.55,0.30,0.35,0.40,0.10,0.95,1.10,0.08,0.30,0.28,0.33,0.15,0.90,0.85),
  position = c(3.1,4.2,2.0,5.5,8.1,11.2,6.0,9.4,14.0,7.3,2.5,6.8,12.1,5.9,4.7),
  n_words  = c(2,3,2,3,5,5,3,4,6,3,3,2,3,3,4)
)

# standardise the four metrics (different scales -> same weight)
X <- scale(kw[, c("volume","cpc","position","n_words")])

# elbow method: WCSS for k from 1 to 6
set.seed(1)
wss <- sapply(1:6, function(k) kmeans(X, centers = k, nstart = 10)$tot.withinss)
round(wss, 1)
# [1] 56.0 34.4 20.7 12.4  8.7  6.9

The drop is steep up to three groups (56 → 34 → 21) and then slows down markedly (21 → 12 → 9 → 7). The elbow is never a sharp line — it is a reading, not a theorem — but here it reasonably points to k = 3. So I run K-means with three centroids:

set.seed(1)
km <- kmeans(X, centers = 3, nstart = 25)
kw$cluster <- km$cluster
table(km$cluster)
# 1 2 3
# 4 7 4

n.b. the argument nstart = 25 restarts the algorithm 25 times from different initial centroids, keeping the best solution: K-means can in fact get stuck in a local minimum depending on where it starts, and restarting several times is the standard defence. The set.seed(1) only serves to make the example reproducible (including the cluster numbering, which is itself arbitrary).

Reading the clusters: who are these groups?

Having three groups is useless until we understand what they represent. The most direct way is to look at the average metrics of each cluster.
I compute them on the original scale (not the standardised one, which is unreadable):

aggregate(kw[, c("volume","cpc","position","n_words")],
          by = list(cluster = kw$cluster), FUN = mean)
#   cluster volume  cpc position n_words
# 1       1   1775 0.18    11.35    4.75
# 2       2  13300 0.37     4.49    2.57
# 3       3   1850 0.95     6.50    3.50

Now the groups speak.
Cluster 2 collects the very high-volume queries (13,300 searches on average), short (two-three words), well positioned and with modest cpc: they are the generic and brand heads — “running shoes”, “nike pegasus”, “asics gel nimbus”. Cluster 1 has low volumes, long queries (almost five words), low positions and minimal cpc: it is the informational long-tail — “how to choose running shoes”, “trail vs road running shoes”, “trail running shoes review”. Cluster 3 stands out for a very high cpc (€0.95) and contains the clearly commercial queries — “running shoes deal”, “buy trail shoes online”, “discount running shoes”, “running shoe store london”.

It is worth pausing a moment on what just happened. Without giving the algorithm any label, the three groups that emerge closely echo a split by intent — informational research, generic and brand heads, commercial queries — close to the one that with Naive Bayes we had instead to teach it through examples.
The alignment, mind you, is not magic: it emerges because the metrics we chose (cpc, length, position) indirectly track intent, not because clustering knows it — of the queries’ meaning, here, it has not seen a single word. The reading is still immediately actionable: the informational cluster calls for articles and guides, the commercial one for product and offer pages, the high-volume one for robust pillar pages.

Hierarchical clustering: the dendrogram

K-means forced us to choose k in advance. Hierarchical clustering flips the approach: it decides nothing a priori, and builds instead a complete tree of groupings. It starts with each keyword as a group of its own, then progressively merges the two closest, then the two closest among those remaining, and so on up to a single big group. The result is a dendrogram: a tree showing at what “height” (that is, at what distance) each merge happens.

I build it in R by first computing the distance matrix, then the tree:

d  <- dist(X)                       # euclidean distances between standardised keywords
hc <- hclust(d, method = "ward.D2") # hierarchical tree (Ward's criterion)
plot(hc, labels = kw$keyword)       # the dendrogram

The beauty of the dendrogram is that we make the decision on how many groups to keep after seeing it, simply by “cutting” it at a chosen height: a low cut leaves many small groups, a high cut a few large ones. I cut at three groups and compare with K-means:

kw$cluster_hc <- cutree(hc, k = 3)
table(kw$cluster_hc)
# 1 2 3
# 8 3 4

The partition is not identical to the K-means one (here the groups have 8, 3 and 4 keywords), and that is normal: the two methods optimise different criteria and on little data the differences show.
But the substance of the groupings — the high-cpc transactional block, the high-volume heads, the informational tail — remains recognisable in both. When two different methods converge on the same story, we can trust that story a little more.

Which method, and the traps

The choice between the two, in practice, follows a few simple rules. K-means is fast and efficient even on tens of thousands of keywords, but it wants to be told k and tends to build “spherical” groups of similar size. The hierarchical one does not ask for k in advance and gives the dendrogram — precious for seeing how the groups nest inside one another — but becomes heavy when the keywords are too many. The most common practice: explore with hierarchical on a sample to get a sense of the number of groups, then apply K-means to the whole set with the k thus identified.

A word of caution, the most important of all: clustering always finds groups, even when there are none. Even pure noise comes back dutifully split into k tidy clusters. The number of groups, the metrics chosen to describe the keywords, the standardisation: they are all our decisions, and each one changes the result. A grouping is not a truth discovered in the data, it is a working hypothesis that only makes sense if it survives the test of business common sense. If we cannot explain a cluster in words, it probably does not really exist.

There is then a question of dimensions. Here we used four metrics, but in real operations the variables describing a keyword can be many more, and with many dimensions distances lose meaning (everything tends to look equally far). It is exactly the problem that principal component analysis knows how to ease, compressing many metrics into a few components before passing the baton to clustering.

Try it yourself

The best way to understand clustering is to watch it change its answer as the choices change. Building on the code above:

Skip the standardisation: run K-means directly on kw[, c("volume","cpc","position","n_words")] without scale(). Do the groups all collapse onto volume? It is the practical demonstration of why we standardise.
Change k: try four or five groups and re-read the averages. Does the informational cluster split into sensible sub-themes or are you just cutting noise?
Change the merging criterion of the hierarchical method: method = "complete" or "average" instead of "ward.D2". Does the dendrogram change shape? And the groups cut at k=3?

A hint: always keep an eye on the per-cluster averages with aggregate(). It is there, and not in the code, that you decide whether a grouping is useful or just an elegant partition of nothing.

We grouped the keywords by how they behave — volume, cost, position.
But what matters most to an SEO is left out: their meaning. Two queries can have different metrics and mean the same thing, or similar metrics and opposite intents. Grouping by sense, and not only by numbers, means turning the very text of the queries into something measurable: it is the job of text mining, where words become vectors and similarity is computed on language. And that is where we will pick up next.

Expected vs Actual CTR: finding the pages that earn fewer clicks than their position deserves

Anyone who spends their days inside Search Console knows that little nagging feeling: a page…

1 day ago

statistics

Naive Bayes: classifying search intent with Bayes’ theorem

In the article on the multi-armed bandit we used Bayes to decide between variants: shifting…

2 days ago

statistics

Multi-armed bandit: optimising the variants while the test is still running

In the article on Bayesian A/B testing we compared two variants at a fixed sample…

3 days ago

statistics

Bayesian A/B Testing: not just “whether” B beats A, but “by how much”

In the article on classic A/B testing we saw how to compare two variants with…

4 days ago

statistics

Bayesian Conversion Rate Estimation: how much can we trust limited data

In the article on the foundations of Bayesian statistics, we saw how Bayesian updating works…

5 days ago

statistics

The peeking problem: why sneaking a look at an A/B test inflates false positives

On 21 January 2015 Optimizely — one of the most widely used A/B testing platforms…

1 week ago