statistics

Naive Bayes: classifying search intent with Bayes’ theorem

In the article on the multi-armed bandit we used Bayes to decide between variants: shifting traffic toward the one converting best while the test is still running. Now we take a step sideways, while staying within the same line of reasoning: instead of choosing between options, we want to classify, that is to attach to each new observation the most probable label given its features.
The concrete case is one that anyone doing SEO knows well: the intent behind a search query. Someone searching “how to bake a cake” wants to learn something; someone searching “buy shoes online” is ready to pull out a credit card. They are two different worlds, and they call for different content: a guide, a tutorial, a glossary for the first; a product page, a price list, a clearly visible call to action for the second. Getting the intent wrong means answering the right question in the wrong way.

The trouble is that queries are many and always new, and classifying them by hand does not scale. We need a method that learns from a handful of labelled examples and then copes on its own with queries it has never seen. The algorithm that does this with almost disarming elegance is Naive Bayes, and — as the name hints — it starts once again from the Bayes’ theorem that has accompanied us throughout this path.

What we will cover:

From the theorem to the classifier
Training on labelled queries
Classifying new queries
The limits of “naive”
Try it yourself
Further reading

From the theorem to the classifier

Let us pick up the thread. Bayes’ theorem tells us how to update the probability of a hypothesis in the light of the observed data. Here the hypothesis is “this query belongs to the informational class” (or transactional), and the data are the words that make up the query. We want, in other words, the probability of a class given the text: in symbols, P(class | words).

Computing it directly would be a nightmare, because the possible combinations of words are boundless. Bayes lets us flip the problem around: instead of asking how probable the class is given the words, we ask how probable those words are given the class — a question the training data can answer. The rule, first in words and then in a formula, is that the probability of a class given a query is proportional to the prior probability of the class times the probability of observing those words if the class were that one:

\( P(\text{class} \mid \text{words}) \propto P(\text{class}) \cdot \prod_i P(\text{word}_i \mid \text{class}) \\ \)

Let us unpack the symbols. P(class) is the prior: how frequent that class is to begin with, before looking at the text (if half of the example queries are informational, the prior is 0.5). P(word | class) is how often that word appears in the queries of that class. The symbol ∏ is simply the product: we multiply the contributions of all the words in the query. The sign ∝ (“proportional to”) reminds us that we are dropping a constant denominator, identical across classes: since in the end we only care which class wins, we can ignore it with no harm.

And here it is, the spot where “naive” hides. Multiplying the probabilities of the individual words as if they were independent of one another amounts to assuming that, given the class, the presence of one word says nothing about the presence of the others. It is a plainly false assumption about real language — “credit” and “card” certainly do not appear at random independently of each other — and it is precisely this naivety that gives the algorithm its name. The surprising thing, as we shall see, is that despite starting from so crude an assumption the method works beautifully in practice.

Training on labelled queries

Training a Naive Bayes means one thing only: counting. For each class we count how many times each word appears in the example queries, and from those counts we derive the P(word | class). No optimisation, no iterations: we scan the data once and fill a table of frequencies.

Let us start from a small set of hand-labelled queries, five informational and five transactional. I train the classifier in R like this:

train <- list(
  info  = c("how to bake a cake", "what is statistics", "seo guide for beginners",
            "free r tutorial", "how bayesian works"),
  trans = c("buy shoes online", "iphone price deal", "best cheap hosting",
            "purchase seo course", "gym membership discount")
)
tok <- function(s) unlist(strsplit(tolower(s), "\\s+"))
vocab <- unique(unlist(lapply(unlist(train), tok)))
counts <- lapply(train, function(docs) {
  w <- table(factor(unlist(lapply(docs, tok)), levels = vocab)); w + 1
})
tots <- sapply(counts, sum); V <- length(vocab)
prior <- sapply(train, length) / length(unlist(train))   # 0.5 / 0.5

Let us see what happens line by line. The tok function splits each query into lowercase words (a spartan tokenisation, but enough). vocab is the vocabulary, the list of all distinct words seen in training. For each class, table(factor(...)) counts the occurrences of each vocabulary word; tots is the total of the counts per class, and prior here is 0.5 and 0.5 because the two classes have the same number of examples.

There is a detail in that w + 1 worth pausing on, because it is the trick that holds the whole edifice up. If a word never appears in the queries of a class, its count is zero, and with it the entire product of probabilities would collapse to zero: a single unknown word would be enough to drive the class probability to zero, wiping out the contribution of all the others. It is the classic case where “multiplying by zero” ruins the party. The fix is called Laplace smoothing: we add 1 to the count of every word, in every class, before computing the proportions. No word any longer has exactly zero probability, only a very small one.

The price of smoothing is that the per-class totals gain the number of vocabulary words (the units added one per word): that is why tots is not the raw token counts of the two classes but the smoothed totals. With these numbers, for example, the word “buy” (present once among the transactional queries, never among the informational ones) ends up clearly more probable under trans than under info — and it is exactly this imbalance that pushes a query toward the right intent. A neutral word like “seo”, appearing once on each side, stays roughly balanced between the two classes and does not move the needle.

Classifying new queries

With the frequency table ready, classifying a new query is genuinely child’s play: we tokenise it, sum the logarithms of the probabilities of each word (summing logarithms instead of multiplying probabilities avoids the product of many tiny numbers going into numerical underflow, but the result is the same), add the logarithm of the prior, and pick the class with the highest score. I compute it in R:

classify <- function(query) {
  w <- tok(query)
  logp <- log(prior)
  for (cl in names(train)) {
    p <- counts[[cl]][w]; p[is.na(p)] <- 1            # out-of-vocab word: neutral weight
    logp[cl] <- logp[cl] + sum(log(as.numeric(p) / tots[cl]))
  }
  names(which.max(logp))
}
cat("'buy seo course discounted' ->", classify("buy seo course discounted"), "\n")
cat("'how to learn statistics'   ->", classify("how to learn statistics"), "\n")

The output is clear-cut:

'buy seo course discounted' -> trans
'how to learn statistics'   -> info

As we can see, the classifier assigns the first query to the transactional intent and the second to the informational one, exactly as a human would have. It is worth noting how it gets there. In the first query “buy” and “discounted” pull decisively toward trans, “seo” stays neutral, and not even “course” — which in training appeared in the transactional “purchase seo course” — rows against it: the verdict is solid. In the second, “statistics” is a markedly informational word, and “learn”, though out of vocabulary, does no harm thanks to that p[is.na(p)] <- 1 which assigns never-seen words a neutral weight, identical for both classes: having nothing to say, it simply does not vote.

A handful of example queries per side is very little, and yet the mechanism is already all here. In a real case it is enough to replace the handfuls of labelled queries with a few hundred or thousand queries pulled from Search Console and annotated for intent, and the same code — unchanged in structure — becomes an intent classifier you can actually use.

The limits of “naive”

Before diving in, though, it is worth knowing where the method shows its seams, because its simplicity has a flip side.

The most obvious limit is precisely the independence assumption it takes its name from. By treating each word as detached from the others, Naive Bayes ignores order and context entirely: to it “cheap running shoes” and “the economics of running shoes” are the same bag of words. In intent classification this matters little, but in subtler tasks it can mislead. Then there is the question of out-of-vocabulary words: a query made only of terms never seen in training would be decided by the prior alone, that is by a pure coin toss — the sign that the example dataset is too thin and needs widening.

A note of caution that holds for any classifier, and all the more for one trained on little data: the model learns exactly what we show it, biases included. If the transactional example queries all contain the word “buy”, the classifier will associate purchase intent with that term and will struggle on an equally transactional but lexically different “add to cart”. The quality and representativeness of the labelled data matter more than the sophistication of the algorithm: a Naive Bayes fed with varied, balanced examples beats a sophisticated model trained badly.

That said, Naive Bayes remains a genuinely valuable tool to have in the box: it is extremely fast to train, needs little data to get going, is interpretable (we can always go and look at which words drove the decision) and in text classification it holds its own against far more complex models. It is often the baseline to beat before bringing in heavier artillery — and this is where the door opens onto machine learning proper, where classification is done with trees, regressions and networks that drop the independence assumption in exchange for more power (and less transparency).

Try it yourself

The code above is a perfect playground for building intuition. Search queries are not only informational or transactional: at least a third family is missing, the navigational one (someone searching “facebook login” or “gironi blog” just wants to reach a specific site). Here are a few changes to try:

Add a nav class to train with four or five navigational queries (“facebook login”, “youtube”, “amazon sign in”, “gmail inbox”), then retrain: the prior will no longer be 0.5 but roughly a third per class. How do the classifications of the earlier queries change?
Feed the classifier an ambiguous query like “iphone review” (informational? transactional?) and see which way it leans. Did the verdict make sense, given the words in training?
Remove the smoothing (replace w + 1 with w) and try classifying a query with a rare word: what happens to the score when a count is zero? It is the quickest way to see with your own eyes why Laplace is needed.

The nice part is that the structure of the code never changes: adding a class just means lengthening the train list, and everything else — vocabulary, counts, prior, classification rule — adapts on its own.

With this we close the Bayesian thread we have unspooled article after article: from estimating a conversion rate to comparing variants, from the adaptive allocation of traffic to this last leap, from deciding to classifying. The same theorem, reworn each time in a different guise, has proved a surprisingly sturdy common thread. From here the road forks toward machine learning in the full sense — decision trees, logistic regression, neural networks — where the methods give up the most comfortable assumptions in exchange for power, and where Bayes stays in the background as the grammar by which, in the end, we always learn from data.

Keyword Clustering: grouping thousands of queries with K-means and hierarchical clustering

It happens with every reasonably serious project: you export the keyword list from Search Console…

4 hours ago

statistics

Expected vs Actual CTR: finding the pages that earn fewer clicks than their position deserves

Anyone who spends their days inside Search Console knows that little nagging feeling: a page…

1 day ago

statistics

Multi-armed bandit: optimising the variants while the test is still running

In the article on Bayesian A/B testing we compared two variants at a fixed sample…

3 days ago

statistics

Bayesian A/B Testing: not just “whether” B beats A, but “by how much”

In the article on classic A/B testing we saw how to compare two variants with…

4 days ago

statistics

Bayesian Conversion Rate Estimation: how much can we trust limited data

In the article on the foundations of Bayesian statistics, we saw how Bayesian updating works…

5 days ago

statistics

The peeking problem: why sneaking a look at an A/B test inflates false positives

On 21 January 2015 Optimizely — one of the most widely used A/B testing platforms…

1 week ago