Throughout this journey, we’ve examined tools to describe data, test hypotheses, and build models. But there’s a question that comes before all others—one that’s too often ignored: are these data reliable?
In any dataset—daily sessions, organic clicks, conversion rates—some values don’t behave like the others: they deviate abnormally from the rest of the distribution. In statistics, we call them outliers, or anomalous values.
Let’s make one point clear immediately: an anomalous value isn’t necessarily an error. It can be a measurement error, certainly (a broken tracking tag, a bot inflating sessions). But it can also be the most important signal in the entire dataset: a Google algorithm update, content going viral, a technical issue crushing traffic. The issue isn’t eliminating anomalies—it’s recognizing them and then deciding what to do about them.
In this article, we’ll examine three statistical methods for identifying outliers, from the most intuitive to the most formal. For each, we’ll look at the logic, the limitations, and practical application with R.
The Working Dataset
To make things concrete, let’s build a simulated but realistic dataset: daily website sessions over the course of a year. The data roughly follows a normal distribution with mean 250 and standard deviation 50, but with five anomalies intentionally injected—three sharp drops and two spikes.
Let’s generate the data in R:
set.seed(42)
n <- 365
sessions <- round(rnorm(n, mean = 250, sd = 50))
sessions[sessions < 0] <- 0
# Inject 5 realistic anomalies
sessions[45] <- 38 # day 45: technical issue
sessions[120] <- 580 # day 120: viral article
sessions[200] <- 22 # day 200: Google update
sessions[300] <- 510 # day 300: social media mention
sessions[350] <- 15 # day 350: server down
Let’s visualize the trend with a simple time plot:
plot(1:n, sessions, type = "l", col = "steelblue",
xlab = "Day", ylab = "Sessions",
main = "Daily Sessions - One Year of Traffic")
abline(h = mean(sessions), col = "red", lty = 2)
By eye, we can spot some peaks and drops. But where do we draw the line between natural variation and anomaly? We need objective criteria.
Method 1: The Z-Score
We encountered the z-score when discussing the normal distribution. The z-score tells us how many standard deviations a value is from the mean:
\(z = \frac{x - \mu}{\sigma}\)
where \(x\) is the observed value, \(\mu\) is the mean, and \(\sigma\) is the standard deviation. A value with a z-score of 2 lies two standard deviations from the mean; one with a z-score of -3 lies three standard deviations below the mean.
Recall the empirical rule: in a normal distribution, approximately 99.7% of data falls within three standard deviations of the mean. A value with |z| > 3 is therefore extremely rare—less than 0.3% probability under the normality assumption.
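Under the normality assumption, we can compute that tail probability directly with `pnorm`:

```r
# Probability that a standard normal value falls more than
# 3 standard deviations from the mean (both tails)
p_tail <- 2 * pnorm(-3)
round(p_tail, 4)  # about 0.0027, i.e. less than 0.3%
```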
Let’s calculate z-scores for our dataset and identify anomalies:
z <- (sessions - mean(sessions)) / sd(sessions)
# Conservative threshold: |z| > 3
anomalies_z3 <- which(abs(z) > 3)
cat("Anomalous days (|z| > 3):", anomalies_z3, "\n")
cat("Sessions:", sessions[anomalies_z3], "\n")
cat("Z-scores:", round(z[anomalies_z3], 2), "\n")
The result:
Anomalous days (|z| > 3): 45 120 200 300 350
Sessions: 38 580 22 510 15
Z-scores: -3.75 5.92 -4.03 4.67 -4.16
With the |z| > 3 threshold, the z-score identifies exactly the five anomalies we injected: no false positives, no false negatives.
But be careful: if we lower the threshold to |z| > 2, anomalies jump to 14. Many of those values are simply data in the tail of the distribution, not real anomalies. The choice of threshold isn’t a technical detail—it’s an analytical decision that depends on how willing we are to tolerate false alarms.
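To see how sensitive the result is to the threshold, we can recount the flagged days on the same simulated series (regenerated here so the snippet stands alone):

```r
# Rebuild the dataset from the article's seed and injections
set.seed(42)
n <- 365
sessions <- round(rnorm(n, mean = 250, sd = 50))
sessions[sessions < 0] <- 0
sessions[c(45, 120, 200, 300, 350)] <- c(38, 580, 22, 510, 15)
z <- (sessions - mean(sessions)) / sd(sessions)

# Count the days flagged at each threshold
sum(abs(z) > 3)  # only the injected anomalies
sum(abs(z) > 2)  # many more, mostly ordinary tail values
```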
There’s an important limitation to this method. The z-score assumes data follows (at least approximately) a normal distribution. If the distribution is heavily skewed—and web traffic data often is, with long right tails—the mean and standard deviation can be distorted by the very outliers we’re trying to find. It’s a vicious cycle: anomalies influence the statistics we use to detect them.
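A toy example (numbers invented for illustration) makes the distortion concrete: a single extreme value drags the mean and inflates the standard deviation, while the median barely moves.

```r
x_clean <- c(48, 50, 51, 49, 52, 50, 47, 53, 50, 49)
x_dirty <- c(x_clean, 500)  # one extreme value appended

mean(x_clean); mean(x_dirty)      # mean jumps from 49.9 to about 91
sd(x_clean); sd(x_dirty)          # sd explodes, shrinking every z-score
median(x_clean); median(x_dirty)  # median stays at 50 in both cases
```

Because the inflated sd shrinks all z-scores, the outlier partially hides itself: its own z-score ends up barely above 3.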
Method 2: IQR and Tukey’s Method
Measures of position—quartiles and median—offer an approach that doesn’t require assumptions about the distribution’s shape. Tukey’s method, named after the great statistician John Tukey, uses the interquartile range (IQR) as its measuring stick.
The IQR, as we saw when discussing measures of variability, is the difference between the third quartile (\(Q_3\), the 75th percentile) and the first quartile (\(Q_1\), the 25th percentile). It represents the spread of the central 50% of data—the “solid” part of the distribution, immune to the tails.
Tukey’s rule is simple: a value is considered anomalous if it falls outside the so-called fences:
\(\text{anomalous if } x < Q_1 - 1.5 \cdot IQR \quad \text{or} \quad x > Q_3 + 1.5 \cdot IQR\)
Why 1.5? Tukey didn’t choose this value randomly. For a normal distribution, fences at 1.5 IQR correspond approximately to 2.7 standard deviations from the mean—a reasonably conservative threshold that captures about 0.7% of observations in the tails. Strict enough to avoid too many false positives, sensitive enough not to miss important anomalies.
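We can verify Tukey’s calibration from the standard normal quantiles:

```r
# For a standard normal, Q1 and Q3 are the 25th and 75th percentiles
q1 <- qnorm(0.25)
q3 <- qnorm(0.75)
iqr <- q3 - q1                 # about 1.349
upper_fence <- q3 + 1.5 * iqr  # about 2.698 standard deviations
tail_prob <- 2 * pnorm(-upper_fence)

round(upper_fence, 3)  # ~2.698
round(tail_prob, 3)    # ~0.007, i.e. about 0.7% of observations
```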
Let’s apply the method to our dataset:
Q1 <- quantile(sessions, 0.25)
Q3 <- quantile(sessions, 0.75)
IQR_val <- Q3 - Q1
lower_limit <- Q1 - 1.5 * IQR_val
upper_limit <- Q3 + 1.5 * IQR_val
cat("Q1:", Q1, " Q3:", Q3, " IQR:", IQR_val, "\n")
cat("Lower limit:", lower_limit, "\n")
cat("Upper limit:", upper_limit, "\n")
anomalies_iqr <- which(sessions < lower_limit | sessions > upper_limit)
cat("Anomalous days:", anomalies_iqr, "\n")
cat("Sessions:", sessions[anomalies_iqr], "\n")
The result:
Q1: 215 Q3: 282 IQR: 67
Lower limit: 114.5
Upper limit: 382.5
Anomalous days: 45 59 118 120 200 300 350
Sessions: 38 100 385 580 22 510 15
Tukey’s method finds 7 anomalies: our 5 injected ones plus two borderline values (day 59 with 100 sessions and day 118 with 385). Are these truly anomalous? 100 sessions is indeed low for a site with mean 250, and 385 is high relative to the quartiles. The decision, once again, is up to the analyst.
R offers an elegant way to visualize anomalies with Tukey’s method—the boxplot:
boxplot(sessions, main = "Daily Sessions",
ylab = "Sessions", col = "lightblue", outline = TRUE)
# Points beyond the whiskers are anomalies according to Tukey
The major advantage of this method over the z-score is robustness: median and quartiles aren’t influenced by outliers. We don’t need to assume data is normal. Tukey’s method works even with skewed distributions—and for those working with web data, this is no small feature.
The limitation: the method doesn’t distinguish between “large” and “enormous” anomalies. A value just outside the fence and one completely off the charts receive the same treatment—they’re both “anomalous,” period.
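One common refinement, also due to Tukey, partially addresses this: a second pair of fences at 3 × IQR (the “outer fences”), so that values between 1.5 and 3 IQR beyond the quartiles count as mild outliers and values beyond 3 IQR as extreme ones. A sketch on the same simulated series, regenerated so the snippet stands alone:

```r
# Rebuild the dataset from the article's seed and injections
set.seed(42)
sessions <- round(rnorm(365, mean = 250, sd = 50))
sessions[sessions < 0] <- 0
sessions[c(45, 120, 200, 300, 350)] <- c(38, 580, 22, 510, 15)

Q1 <- quantile(sessions, 0.25)
Q3 <- quantile(sessions, 0.75)
iqr <- Q3 - Q1

# Inner fences (1.5 * IQR) and outer fences (3 * IQR)
mild    <- which(sessions < Q1 - 1.5 * iqr | sessions > Q3 + 1.5 * iqr)
extreme <- which(sessions < Q1 - 3 * iqr   | sessions > Q3 + 3 * iqr)

setdiff(mild, extreme)  # flagged, but close to the fence
extreme                 # far outside the distribution
```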
Method 3: Grubbs’ Test
The first two methods rely on empirical rules: z-score thresholds, IQR thresholds. But if we want a formal approach—with a proper hypothesis test—we can turn to Grubbs’ test.
The idea is this: we take the most extreme value in the dataset (the one furthest from the mean) and ask whether it’s compatible with the rest of the data, or whether it’s “too” extreme to be due to chance.
The hypotheses are:
- \(H_0\): there are no outliers in the dataset
- \(H_1\): the most extreme value is an outlier
The test statistic is:
\(G = \frac{\max_i |x_i - \bar{x}|}{s}\)
where \(\bar{x}\) is the mean and \(s\) is the standard deviation. In other words, \(G\) is the maximum absolute z-score. The critical value is derived from the Student’s t distribution with \(n-2\) degrees of freedom.
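Base R has no ready-made function for this critical value, but it can be computed from the t quantile. This is a sketch of the standard two-sided formula, not a function from the `outliers` package:

```r
# Two-sided Grubbs critical value for sample size n at level alpha:
# G_crit = ((n-1)/sqrt(n)) * sqrt(t^2 / (n-2+t^2)),
# where t is the alpha/(2n) quantile of the t distribution with n-2 df
grubbs_critical <- function(n, alpha = 0.05) {
  t2 <- qt(alpha / (2 * n), df = n - 2)^2
  ((n - 1) / sqrt(n)) * sqrt(t2 / (n - 2 + t2))
}

grubbs_critical(365)  # reject H0 if G exceeds this value (roughly 3.7-3.8)
```

For our 365-day series, the observed G of about 5.92 is well beyond this critical value, which is why the p-value below comes out so small.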
Let’s apply the test in R using the outliers package:
library(outliers)
result <- grubbs.test(sessions)
print(result)
The result:
Grubbs test for one outlier
data: sessions
G = 5.9228, U = 0.9037, p-value = 2.339e-07
alternative hypothesis: highest value 580 is an outlier
The test identifies 580 (the day 120 spike, our “viral article”) as an outlier, with a p-value that is virtually zero. The evidence is overwhelming: that value isn’t compatible with the rest of the distribution.
But we must keep in mind a fundamental limitation of Grubbs’ test: it tests only one outlier at a time—the most extreme one. If we suspect multiple anomalies (as in our case), we need to apply the test iteratively: remove the identified outlier, recalculate, test again.
Let’s do it:
data <- sessions
outliers_found <- c()
for (i in 1:5) {
  g <- grubbs.test(data)
  if (g$p.value < 0.05) {
    # Extract the outlier value quoted in the alternative-hypothesis text
    outlier_val <- as.numeric(regmatches(g$alternative,
                                         regexpr("[0-9.]+", g$alternative)))
    outliers_found <- c(outliers_found, outlier_val)
    data <- data[data != outlier_val]  # note: removes every occurrence of that value
    cat("Iteration", i, "- Outlier:", outlier_val,
        "- p-value:", format(g$p.value, digits = 3), "\n")
  } else {
    cat("Iteration", i, "- No outlier (p =", round(g$p.value, 3), ")\n")
    break
  }
}
This iterative approach is effective but treacherous: every time we remove a value, we change the distribution. The mean and standard deviation shift, and what wasn’t anomalous before might become so. It’s a procedure to use with caution and awareness.
Comparing the Three Methods
We’ve applied three methods to the same dataset. Let’s see what each found:
| Day | Sessions | Simulated Event | Z-score (|z|>3) | IQR/Tukey | Grubbs |
|---|---|---|---|---|---|
| 45 | 38 | Technical issue | Yes | Yes | Yes (iter.) |
| 120 | 580 | Viral article | Yes | Yes | Yes (1st iter.) |
| 200 | 22 | Google update | Yes | Yes | Yes (iter.) |
| 300 | 510 | Social mention | Yes | Yes | Yes (iter.) |
| 350 | 15 | Server down | Yes | Yes | Yes (iter.) |
| 59 | 100 | (none) | No | Yes | No |
| 118 | 385 | (none) | No | Yes | No |
All three methods find the five injected anomalies. Tukey’s method is the most sensitive: it also flags two borderline values that the other methods let pass. The z-score with threshold 3 is precise but depends on the normality assumption. Grubbs is the most formal but requires the iterative approach for multiple anomalies.
The important lesson: there’s no universally right method. There’s the right method for those data and that question. In daily practice, a sensible approach is to apply more than one method and focus on values that are flagged consistently.
Let’s summarize the three methods in R:
# Create a summary for each day
summary_df <- data.frame(
day = 1:n,
sessions = sessions,
z_score = round(z, 2),
anomaly_z = abs(z) > 3,
anomaly_iqr = sessions < lower_limit | sessions > upper_limit
)
# Show only rows anomalous by at least one method
anomalous <- summary_df[summary_df$anomaly_z | summary_df$anomaly_iqr, ]
print(anomalous)
Try It Yourself
An e-commerce site has tracked CTR on its product pages for 30 days. Here’s the data:
ctr <- c(3.2, 2.8, 3.1, 2.9, 3.0, 3.3, 2.7, 3.1, 2.8, 3.0,
0.4, 3.2, 2.9, 3.1, 2.8, 3.0, 2.9, 7.8, 3.1, 2.7,
3.0, 3.2, 2.8, 3.1, 2.9, 3.0, 2.8, 3.1, 3.0, 2.9)
Days 11 and 18 look suspicious. Apply all three methods: z-score with threshold |z| > 3, Tukey’s method, and Grubbs’ test. Do all three agree? Which of the two values is more clearly anomalous, and why?
So far, we’ve treated each observation as independent from the others. We’ve asked: “is this value compatible with the overall distribution?” But web traffic data has a temporal structure: trends, seasonality, weekly cycles. A 30% drop in December might be perfectly normal for a B2B site, while the same drop in September would be alarming.
Distinguishing a real anomaly from simple seasonality requires different tools—time series decomposition into trend, seasonal component, and residual. That will be the topic of a future article.
Further Reading
For those who want to deepen their understanding of outliers and statistical reasoning about unexpected data, The Art of Statistics by David Spiegelhalter tackles the problem with clarity and numerous real-world examples.
For a more formal treatment of outlier tests (Grubbs, Rosner, Dixon), the textbook Statistica by Newbold, Carlson, and Thorne offers comprehensive coverage with exercises.