Throughout this journey, we’ve examined tools to describe data, test hypotheses, and build models. But there’s a question that comes before all others—one that’s too often ignored: are these data reliable?
In any dataset—daily sessions, organic clicks, conversion rates—some values don’t behave like the others: they deviate abnormally from the rest of the distribution. In statistics, we call them outliers, or anomalous values.
Let’s make one point clear immediately: an anomalous value isn’t necessarily an error. It can be a measurement error, certainly (a broken tracking tag, a bot inflating sessions). But it can also be the most important signal in the entire dataset: a Google algorithm update, content going viral, a technical issue crushing traffic. The issue isn’t eliminating anomalies—it’s recognizing them and then deciding what to do about them.
In this article, we’ll examine three statistical methods for identifying outliers, from the most intuitive to the most formal. For each, we’ll look at the logic, the limitations, and practical application with R.
The Working Dataset
To make things concrete, let’s build a simulated but realistic dataset: daily website sessions over the course of a year. The data roughly follows a normal distribution with mean 250 and standard deviation 50, but with five anomalies intentionally injected—three sharp drops and two spikes.
Let’s generate the data in R:
set.seed(42)
n <- 365
sessions <- round(rnorm(n, mean = 250, sd = 50))
sessions[sessions < 0] <- 0
# Inject 5 realistic anomalies
sessions[45] <- 38 # day 45: technical issue
sessions[120] <- 580 # day 120: viral article
sessions[200] <- 22 # day 200: Google update
sessions[300] <- 510 # day 300: social media mention
sessions[350] <- 15 # day 350: server down
Let’s visualize the trend with a simple time plot:
plot(1:n, sessions, type = "l", col = "steelblue",
xlab = "Day", ylab = "Sessions",
main = "Daily Sessions - One Year of Traffic")
abline(h = mean(sessions), col = "red", lty = 2)
By eye, we can spot some peaks and drops. But where do we draw the line between natural variation and anomaly? We need objective criteria.
Method 1: The Z-Score
We encountered the z-score when discussing the normal distribution. The z-score tells us how many standard deviations a value is from the mean:
\(z = \frac{x - \mu}{\sigma}\)
where \(x\) is the observed value, \(\mu\) is the mean, and \(\sigma\) is the standard deviation. A value with a z-score of 2 lies two standard deviations from the mean; one with a z-score of -3 lies three standard deviations below the mean.
Recall the empirical rule: in a normal distribution, approximately 99.7% of data falls within three standard deviations of the mean. A value with |z| > 3 is therefore extremely rare—less than 0.3% probability under the normality assumption.
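Under the normality assumption, we can compute that tail probability directly with `pnorm`:

```r
# Probability that a standard normal value falls more than
# 3 standard deviations from the mean (both tails)
p_tail <- 2 * pnorm(-3)
round(p_tail, 4)  # about 0.0027, i.e. less than 0.3%
```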
Let’s calculate z-scores for our dataset and identify anomalies:
z <- (sessions - mean(sessions)) / sd(sessions)
# Conservative threshold: |z| > 3
anomalies_z3 <- which(abs(z) > 3)
cat("Anomalous days (|z| > 3):", anomalies_z3, "\n")
cat("Sessions:", sessions[anomalies_z3], "\n")
cat("Z-scores:", round(z[anomalies_z3], 2), "\n")
The result:
Anomalous days (|z| > 3): 45 120 200 300 350
Sessions: 38 580 22 510 15
Z-scores: -3.75 5.92 -4.03 4.67 -4.16
With the |z| > 3 threshold, the z-score identifies exactly the five anomalies we injected: no false positives, no false negatives.
But be careful: if we lower the threshold to |z| > 2, anomalies jump to 14. Many of those values are simply data in the tail of the distribution, not real anomalies. The choice of threshold isn’t a technical detail—it’s an analytical decision that depends on how willing we are to tolerate false alarms.
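To see how sensitive the result is to the threshold, we can recount the flagged days on the same simulated series (regenerated here so the snippet stands alone):

```r
# Rebuild the dataset from the article's seed and injections
set.seed(42)
n <- 365
sessions <- round(rnorm(n, mean = 250, sd = 50))
sessions[sessions < 0] <- 0
sessions[c(45, 120, 200, 300, 350)] <- c(38, 580, 22, 510, 15)
z <- (sessions - mean(sessions)) / sd(sessions)

# Count the days flagged at each threshold
sum(abs(z) > 3)  # only the injected anomalies
sum(abs(z) > 2)  # many more, mostly ordinary tail values
```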
There’s an important limitation to this method. The z-score assumes data follows (at least approximately) a normal distribution. If the distribution is heavily skewed—and web traffic data often is, with long right tails—the mean and standard deviation can be distorted by the very outliers we’re trying to find. It’s a vicious cycle: anomalies influence the statistics we use to detect them.
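A toy example (numbers invented for illustration) makes the distortion concrete: a single extreme value drags the mean and inflates the standard deviation, while the median barely moves.

```r
x_clean <- c(48, 50, 51, 49, 52, 50, 47, 53, 50, 49)
x_dirty <- c(x_clean, 500)  # one extreme value appended

mean(x_clean); mean(x_dirty)      # mean jumps from 49.9 to about 91
sd(x_clean); sd(x_dirty)          # sd explodes, shrinking every z-score
median(x_clean); median(x_dirty)  # median stays at 50 in both cases
```

Because the inflated sd shrinks all z-scores, the outlier partially hides itself: its own z-score ends up barely above 3.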
Method 2: IQR and Tukey’s Method
Measures of position—quartiles and median—offer an approach that doesn’t require assumptions about the distribution’s shape. Tukey’s method, named after the great statistician John Tukey, uses the interquartile range (IQR) as its measuring stick.
The IQR, as we saw when discussing measures of variability, is the difference between the third quartile (\(Q_3\), the 75th percentile) and the first quartile (\(Q_1\), the 25th percentile). It represents the spread of the central 50% of data—the “solid” part of the distribution, immune to the tails.
Tukey’s rule is simple: a value is considered anomalous if it falls outside the so-called fences:
\(\text{anomalous if } x < Q_1 - 1.5 \cdot IQR \quad \text{or} \quad x > Q_3 + 1.5 \cdot IQR\)
Why 1.5? Tukey didn’t choose this value randomly. For a normal distribution, fences at 1.5 IQR correspond approximately to 2.7 standard deviations from the mean—a reasonably conservative threshold that captures about 0.7% of observations in the tails. Strict enough to avoid too many false positives, sensitive enough not to miss important anomalies.
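We can verify Tukey’s calibration from the standard normal quantiles:

```r
# For a standard normal, Q1 and Q3 are the 25th and 75th percentiles
q1 <- qnorm(0.25)
q3 <- qnorm(0.75)
iqr <- q3 - q1                 # about 1.349
upper_fence <- q3 + 1.5 * iqr  # about 2.698 standard deviations
tail_prob <- 2 * pnorm(-upper_fence)

round(upper_fence, 3)  # ~2.698
round(tail_prob, 3)    # ~0.007, i.e. about 0.7% of observations
```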
Let’s apply the method to our dataset:
Q1 <- quantile(sessions, 0.25)
Q3 <- quantile(sessions, 0.75)
IQR_val <- Q3 - Q1
lower_limit <- Q1 - 1.5 * IQR_val
upper_limit <- Q3 + 1.5 * IQR_val
cat("Q1:", Q1, " Q3:", Q3, " IQR:", IQR_val, "\n")
cat("Lower limit:", lower_limit, "\n")
cat("Upper limit:", upper_limit, "\n")
anomalies_iqr <- which(sessions < lower_limit | sessions > upper_limit)
cat("Anomalous days:", anomalies_iqr, "\n")
cat("Sessions:", sessions[anomalies_iqr], "\n")
The result:
Q1: 215 Q3: 282 IQR: 67
Lower limit: 114.5
Upper limit: 382.5
Anomalous days: 45 59 118 120 200 300 350
Sessions: 38 100 385 580 22 510 15
Tukey’s method finds 7 anomalies: our 5 injected ones plus two borderline values (day 59 with 100 sessions and day 118 with 385). Are these truly anomalous? 100 sessions is indeed low for a site with mean 250, and 385 is high relative to the quartiles. The decision, once again, is up to the analyst.
R offers an elegant way to visualize anomalies with Tukey’s method—the boxplot:
boxplot(sessions, main = "Daily Sessions",
ylab = "Sessions", col = "lightblue", outline = TRUE)
# Points beyond the whiskers are anomalies according to Tukey
The major advantage of this method over the z-score is robustness: median and quartiles aren’t influenced by outliers. We don’t need to assume data is normal. Tukey’s method works even with skewed distributions—and for those working with web data, this is no small feature.
The limitation: the method doesn’t distinguish between “large” and “enormous” anomalies. A value just outside the fence and one completely off the charts receive the same treatment—they’re both “anomalous,” period.
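One common refinement, also due to Tukey, partially addresses this: a second pair of fences at 3 × IQR (the “outer fences”), so that values between 1.5 and 3 IQR beyond the quartiles count as mild outliers and values beyond 3 IQR as extreme ones. A sketch on the same simulated series, regenerated so the snippet stands alone:

```r
# Rebuild the dataset from the article's seed and injections
set.seed(42)
sessions <- round(rnorm(365, mean = 250, sd = 50))
sessions[sessions < 0] <- 0
sessions[c(45, 120, 200, 300, 350)] <- c(38, 580, 22, 510, 15)

Q1 <- quantile(sessions, 0.25)
Q3 <- quantile(sessions, 0.75)
iqr <- Q3 - Q1

# Inner fences (1.5 * IQR) and outer fences (3 * IQR)
mild    <- which(sessions < Q1 - 1.5 * iqr | sessions > Q3 + 1.5 * iqr)
extreme <- which(sessions < Q1 - 3 * iqr   | sessions > Q3 + 3 * iqr)

setdiff(mild, extreme)  # flagged, but close to the fence
extreme                 # far outside the distribution
```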
Method 3: Grubbs’ Test
The first two methods rely on empirical rules: z-score thresholds, IQR thresholds. But if we want a formal approach—with a proper hypothesis test—we can turn to Grubbs’ test.
The idea is this: we take the most extreme value in the dataset (the one furthest from the mean) and ask whether it’s compatible with the rest of the data, or whether it’s “too” extreme to be due to chance.
The hypotheses are:
- \(H_0\): there are no outliers in the dataset
- \(H_1\): the most extreme value is an outlier
The test statistic is:
\(G = \frac{\max_i |x_i - \bar{x}|}{s}\)
where \(\bar{x}\) is the mean and \(s\) is the standard deviation. In other words, \(G\) is the maximum absolute z-score. The critical value is derived from the Student’s t distribution with \(n-2\) degrees of freedom.
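Base R has no ready-made function for this critical value, but it can be computed from the t quantile. This is a sketch of the standard two-sided formula, not a function from the `outliers` package:

```r
# Two-sided Grubbs critical value for sample size n at level alpha:
# G_crit = ((n-1)/sqrt(n)) * sqrt(t^2 / (n-2+t^2)),
# where t is the alpha/(2n) quantile of the t distribution with n-2 df
grubbs_critical <- function(n, alpha = 0.05) {
  t2 <- qt(alpha / (2 * n), df = n - 2)^2
  ((n - 1) / sqrt(n)) * sqrt(t2 / (n - 2 + t2))
}

grubbs_critical(365)  # reject H0 if G exceeds this value (roughly 3.7-3.8)
```

For our 365-day series, the observed G of about 5.92 is well beyond this critical value, which is why the p-value below comes out so small.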
Let’s apply the test in R using the outliers package:
library(outliers)
result <- grubbs.test(sessions)
print(result)
The result:
Grubbs test for one outlier
data: sessions
G = 5.9228, U = 0.9037, p-value = 2.339e-07
alternative hypothesis: highest value 580 is an outlier
The test identifies 580 (the day 120 spike, our “viral article”) as an outlier, with a p-value that is virtually zero. The evidence is overwhelming: that value isn’t compatible with the rest of the distribution.
But we must keep in mind a fundamental limitation of Grubbs’ test: it tests only one outlier at a time—the most extreme one. If we suspect multiple anomalies (as in our case), we need to apply the test iteratively: remove the identified outlier, recalculate, test again.
Let’s do it:
data <- sessions
outliers_found <- c()
for (i in 1:5) {
  g <- grubbs.test(data)
  if (g$p.value < 0.05) {
    # Extract the outlier value quoted in the alternative-hypothesis text
    outlier_val <- as.numeric(regmatches(g$alternative,
                                         regexpr("[0-9.]+", g$alternative)))
    outliers_found <- c(outliers_found, outlier_val)
    data <- data[data != outlier_val]  # note: removes every occurrence of that value
    cat("Iteration", i, "- Outlier:", outlier_val,
        "- p-value:", format(g$p.value, digits = 3), "\n")
  } else {
    cat("Iteration", i, "- No outlier (p =", round(g$p.value, 3), ")\n")
    break
  }
}
This iterative approach is effective but treacherous: every time we remove a value, we change the distribution. The mean and standard deviation shift, and what wasn’t anomalous before might become so. It’s a procedure to use with caution and awareness.
Comparing the Three Methods
We’ve applied three methods to the same dataset. Let’s see what each found:
| Day | Sessions | Simulated Event | Z-score (|z|>3) | IQR/Tukey | Grubbs |
|---|---|---|---|---|---|
| 45 | 38 | Technical issue | Yes | Yes | Yes (iter.) |
| 120 | 580 | Viral article | Yes | Yes | Yes (1st iter.) |
| 200 | 22 | Google update | Yes | Yes | Yes (iter.) |
| 300 | 510 | Social mention | Yes | Yes | Yes (iter.) |
| 350 | 15 | Server down | Yes | Yes | Yes (iter.) |
| 59 | 100 | (none) | No | Yes | No |
| 118 | 385 | (none) | No | Yes | No |
All three methods find the five injected anomalies. Tukey’s method is the most sensitive: it also flags two borderline values that the other methods let pass. The z-score with threshold 3 is precise but depends on the normality assumption. Grubbs is the most formal but requires the iterative approach for multiple anomalies.
The important lesson: there’s no universally right method. There’s the right method for those data and that question. In daily practice, a sensible approach is to apply more than one method and focus on values that are flagged consistently.
Let’s summarize the three methods in R:
# Create a summary for each day
summary_df <- data.frame(
day = 1:n,
sessions = sessions,
z_score = round(z, 2),
anomaly_z = abs(z) > 3,
anomaly_iqr = sessions < lower_limit | sessions > upper_limit
)
# Show only rows anomalous by at least one method
anomalous <- summary_df[summary_df$anomaly_z | summary_df$anomaly_iqr, ]
print(anomalous)
Try It Yourself
An e-commerce site has tracked CTR on its product pages for 30 days. Here’s the data:
ctr <- c(3.2, 2.8, 3.1, 2.9, 3.0, 3.3, 2.7, 3.1, 2.8, 3.0,
0.4, 3.2, 2.9, 3.1, 2.8, 3.0, 2.9, 7.8, 3.1, 2.7,
3.0, 3.2, 2.8, 3.1, 2.9, 3.0, 2.8, 3.1, 3.0, 2.9)
Days 11 and 18 look suspicious. Apply all three methods: z-score with threshold |z| > 3, Tukey’s method, and Grubbs’ test. Do all three agree? Which of the two values is more clearly anomalous, and why?
So far, we’ve treated each observation as independent from the others. We’ve asked: “is this value compatible with the overall distribution?” But web traffic data has a temporal structure: trends, seasonality, weekly cycles. A 30% drop in December might be perfectly normal for a B2B site, while the same drop in September would be alarming.
Distinguishing a real anomaly from simple seasonality requires different tools—time series decomposition into trend, seasonal component, and residual. That will be the topic of a future article.
Further Reading
For those who want to deepen their understanding of outliers and statistical reasoning about unexpected data, The Art of Statistics by David Spiegelhalter tackles the problem with clarity and numerous real-world examples.
For a more formal treatment of outlier tests (Grubbs, Rosner, Dixon), the textbook Statistica by Newbold, Carlson, and Thorne offers comprehensive coverage with exercises.