
The Poisson Distribution

The Poisson distribution is a discrete probability distribution that describes the number of events occurring in a fixed interval of time or area.

The Poisson distribution is useful for modelling how many events are likely to occur within a given time horizon, such as the number of customers entering a shop in the next hour or the number of pageviews on a website in the next minute.

Siméon-Denis Poisson


Lambda: The Average Rate of Events

An important assumption: events occur independently of one another, so the count in one time interval does not affect the count in any other non-overlapping interval.

We need to know the average number of events or the rate at which they occur within the time interval. We represent this value with the Greek letter lambda:

\( X \sim Po(\lambda) \\ \\ \)

To calculate the probability that there are r occurrences in a specific interval:

\( P (X=r) = \frac{e^{-\lambda} \lambda^{r}}{r!} \\ \\ \)

For example, if:

\( X \sim Po(2) \\ \\ r=3 \)

we get:

\( P (X=3) = \frac{e^{-2} \cdot 2^{3}}{3!} =\frac{e^{-2} \cdot 8}{6} = e^{-2} \cdot 1.333 = 0.180 \\ \\ \)

That is, 18%.
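As a quick check of the arithmetic, the formula can be evaluated directly in Python. This is a minimal sketch using only the standard library; the function name `poisson_pmf` is our own choice for illustration, not a library function.

```python
import math

def poisson_pmf(r, lam):
    """Probability of exactly r events when the average rate is lam."""
    return math.exp(-lam) * lam ** r / math.factorial(r)

# Example from the text: X ~ Po(2), r = 3
print(round(poisson_pmf(3, 2), 3))  # 0.18
```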


Poisson and Binomial: A Side Note

If X and Y are independent and

\( X \sim Po(\lambda_x) \\ Y \sim Po(\lambda_y) \\ \\ \)

then

\( X + Y \sim Po(\lambda_x + \lambda_y) \\ \\ \)
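The additivity property can be checked numerically. The sketch below assumes that X and Y are independent, computes the distribution of their sum by convolution, and compares it with a single Poisson whose rate is the sum of the two rates. The rates 1.5 and 2.5 and the helper names are illustrative choices, not values from the text.

```python
import math

def poisson_pmf(r, lam):
    """Probability of exactly r events when the average rate is lam."""
    return math.exp(-lam) * lam ** r / math.factorial(r)

def sum_pmf(k, lam_x, lam_y):
    """P(X + Y = k) for independent X ~ Po(lam_x) and Y ~ Po(lam_y), by convolution."""
    return sum(poisson_pmf(i, lam_x) * poisson_pmf(k - i, lam_y) for i in range(k + 1))

lam_x, lam_y = 1.5, 2.5  # illustrative rates
for k in range(6):
    direct = poisson_pmf(k, lam_x + lam_y)
    convolved = sum_pmf(k, lam_x, lam_y)
    print(k, round(direct, 6), round(convolved, 6))
```

The two columns should agree up to floating-point rounding.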

If

\( X \sim Bin(n,p) \\ \\ \)

and n is large and p is small, then we can approximate the binomial with the Poisson:

\( X \sim Po(n \cdot p) \\ \\ \)
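To get a feel for how close the approximation is, the two probability mass functions can be compared side by side. This is a rough sketch with illustrative values (n = 1000, p = 0.003, chosen by us); it is not a rule for when the approximation is acceptable.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Probability of exactly k events when the average rate is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

n, p = 1000, 0.003  # illustrative: many trials, small success probability
lam = n * p

for k in range(7):
    print(k, round(binomial_pmf(k, n, p), 5), round(poisson_pmf(k, lam), 5))

# Largest pointwise gap between the two distributions over k = 0..20
max_diff = max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(21))
print("largest difference:", max_diff)
```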

Differences Between the Poisson and Binomial Distributions

The Poisson and binomial distributions are both discrete probability distributions that model counts of events. The main difference between the two concerns the number of trials and successes.

The binomial distribution is biparametric, meaning it is characterised by two parameters n and p, where n represents the number of trials and p the probability of success in each trial.

In contrast, the Poisson distribution is uniparametric, meaning it is characterised by a single parameter λ representing the average number of events per interval.

Furthermore, the binomial distribution is used when the number of trials is finite and the number of successes cannot exceed n, whereas the Poisson distribution is used when the number of trials is very large or effectively unbounded, so the count has no fixed upper limit.


A Practical Example

A vending machine malfunctions on average 3.4 times per week. What is the probability that the machine will not break down next week?

\( P (X=0) = \frac{e^{- \lambda} \cdot \lambda ^{r}}{r!} \\ \\ = \frac{e^{-3.4} \cdot 3.4 ^{0}}{0!} = \\ \frac{e^{-3.4} \cdot 1}{1} = 0.033 \\ \)

The probability is very low: just 3.3%.

Note: X=0 because we are looking at the probability that the machine does not break down.


In R we would use the command:

dpois(0, 3.4)

Now let us calculate the probability that the vending machine breaks down exactly 3 times during the next week.

\( P (X=3) = \frac{e^{-3.4} \cdot 3.4 ^{3}}{3!} = \frac{e^{-3.4} \cdot 39.304}{6} = 0.219 \\ \)

The probability is 21.9%.
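The same two vending machine probabilities can be reproduced in Python without rounding any intermediate value. This is a minimal standard-library sketch; `poisson_pmf` is a helper name of our own, playing the role of R's `dpois()`.

```python
import math

def poisson_pmf(r, lam):
    """Probability of exactly r events when the average rate is lam."""
    return math.exp(-lam) * lam ** r / math.factorial(r)

lam = 3.4  # average malfunctions per week, from the example

p_none = poisson_pmf(0, lam)   # no breakdown next week
p_three = poisson_pmf(3, lam)  # exactly three breakdowns next week

print(f"P(X=0) = {p_none:.4f}")
print(f"P(X=3) = {p_three:.4f}")
```

Keeping full precision for the exponential term matters here: rounding it to three decimals before multiplying shifts the third decimal of the final result.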

Moving on to a third question: what are the expected value and the variance of the vending machine malfunctions?

\( E(X) = \lambda = 3.4 \\ Var(X) = \lambda = 3.4 \\ \)

As we can see, within the Poisson distribution lambda represents not only the mean but also the variance. This is known as the mean-variance equality property of the Poisson distribution.

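It follows that as lambda grows, the variance grows with it, so the distribution becomes wider in absolute terms. Relative to the mean, however, it becomes more concentrated: the standard deviation is √λ, so the ratio of standard deviation to mean is 1/√λ, which shrinks as lambda increases. Conversely, when lambda is small the variance is small in absolute terms, but the spread is large relative to the mean.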
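The link between lambda, the variance and the concentration around the mean can be explored numerically. The sketch below sums over the probability mass function to recover the mean and variance, then reports the standard deviation divided by the mean as a measure of relative spread. The helper names, the chosen lambdas and the truncation point `r_max` are our own assumptions.

```python
import math

def poisson_pmf(r, lam):
    # Log-space evaluation avoids overflow for large r or lam
    return math.exp(-lam + r * math.log(lam) - math.lgamma(r + 1))

def moments(lam, r_max=400):
    """Mean and variance obtained by summing over the PMF up to r_max."""
    probs = [poisson_pmf(r, lam) for r in range(r_max + 1)]
    mean = sum(r * p for r, p in enumerate(probs))
    var = sum((r - mean) ** 2 * p for r, p in enumerate(probs))
    return mean, var

for lam in (1, 10, 100):
    mean, var = moments(lam)
    rel_spread = math.sqrt(var) / mean
    print(f"lambda={lam}: mean={mean:.3f}, variance={var:.3f}, sd/mean={rel_spread:.3f}")
```

Mean and variance both track lambda, while the sd/mean ratio falls as lambda increases.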


The Poisson Distribution Applied to SEO

There are several aspects that make the Poisson distribution potentially interesting for website traffic analysis. It is a simple and well-understood statistical model that can be readily applied to website traffic data — for example, to estimate the average rate of requests or visits per unit of time and to predict the probability of observing a certain number of requests or visits in the future.

However, we should keep in mind that there are also many limitations to using the Poisson distribution for SEO-oriented web traffic analysis.

First, the Poisson distribution assumes that events occur independently and at a constant rate, which may not always hold for website traffic. For example, website traffic often shows spikes and bursts of correlated visits that a constant-rate Poisson model cannot capture.

Second, the Poisson process that underlies the distribution is memoryless, meaning it does not account for any history of past events. This can be a limitation when analysing website traffic data that display trends or seasonality.

Third, the Poisson distribution assumes that events are discrete and countable, which may not always be appropriate for modelling continuous variables such as response time or page load time. Finally, the Poisson distribution is a simple model that may not capture all the complexities of real-world website traffic.
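One simple diagnostic for these limitations is the dispersion index, the ratio of sample variance to sample mean. For Poisson data it should be close to 1; a much larger value suggests the constant-rate assumption does not hold. The hourly counts below are invented purely for illustration and do not come from any real site.

```python
import statistics

# Hypothetical hourly visit counts, invented for illustration (note the three spikes)
hourly_visits = [8, 12, 9, 11, 35, 40, 10, 7, 9, 38, 11, 10]

mean = statistics.mean(hourly_visits)
variance = statistics.variance(hourly_visits)  # sample variance
dispersion_index = variance / mean

print(f"mean={mean:.2f}, variance={variance:.2f}, index={dispersion_index:.2f}")
if dispersion_index > 1.5:  # arbitrary threshold, for illustration only
    print("Variance well above the mean: a plain Poisson model may fit poorly")
```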

There are several alternative models for website traffic analysis that can be used when the Poisson distribution is not appropriate.


Alternative Models for Web Traffic Analysis

One alternative is the Negative Binomial distribution, which can handle overdispersion (variance greater than the mean) and is therefore better suited to traffic data with spikes or bursts.
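To illustrate overdispersion, the sketch below evaluates a Negative Binomial with the same mean as a Poisson with lambda = 10 and shows that its variance is larger. It uses the "failures before the r-th success" parameterisation with integer r. The values r = 5 and p = 1/3 are our own illustrative choices.

```python
import math

def neg_binomial_pmf(k, r, p):
    """P(k failures before the r-th success), success probability p, integer r."""
    return math.comb(k + r - 1, k) * p ** r * (1 - p) ** k

r, p = 5, 1 / 3  # illustrative values chosen so that the mean is 10
ks = range(0, 401)  # truncation point; the tail beyond it is negligible here
probs = [neg_binomial_pmf(k, r, p) for k in ks]

mean = sum(k * q for k, q in zip(ks, probs))
var = sum((k - mean) ** 2 * q for k, q in zip(ks, probs))
print(f"mean={mean:.2f}, variance={var:.2f}")
```

With these values the variance comes out at about three times the mean, whereas a Poisson with the same mean would have a variance of 10.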

Another alternative is the Lognormal distribution, which can be used to model continuous variables such as response time or page load time.

The Exponential distribution can also be used to model the time intervals between requests or visits to a website.
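The connection between exponential gaps and Poisson counts can be seen in a small simulation: if the time between visits is exponential with rate 10 per hour, the number of visits per hour should have mean and variance close to 10. This is a sketch under that idealised assumption; the function name, the seed and the number of simulated hours are arbitrary.

```python
import random
import statistics

def simulate_hourly_counts(rate_per_hour, hours, seed=0):
    """Count events per hour when gaps between events are exponential."""
    rng = random.Random(seed)
    counts = [0] * hours
    t = rng.expovariate(rate_per_hour)  # time of the first event, in hours
    while t < hours:
        counts[int(t)] += 1             # assign the event to its hour bucket
        t += rng.expovariate(rate_per_hour)
    return counts

counts = simulate_hourly_counts(rate_per_hour=10, hours=2000)
print(f"mean={statistics.mean(counts):.2f}, variance={statistics.variance(counts):.2f}")
```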


Using Poisson for Website Click Estimates

Suppose we have a website that receives on average 10 clicks per hour and we want to estimate the probability of getting a certain number of clicks in one hour using the Poisson distribution. We can use R to carry out the following steps:

1. We start by loading the ggplot2 library and defining the average number of clicks per hour (our lambda):

library(ggplot2)

# Average number of clicks per hour
lam <- 10

2. We now compute the probability mass function (PMF) of the Poisson distribution for each possible number of clicks using the dpois() function. For example, to calculate the probability of getting exactly 15 clicks:

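clicks <- 15
prob <- dpois(clicks, lam)
cat(paste("The probability of getting", clicks, "clicks per hour is", prob), "\n")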

The output is:

The probability of getting 15 clicks per hour is 0.0347180696306841

3. We compute the PMF of the Poisson distribution for a range of possible click counts using dpois() and display the results in a chart. For example, to calculate the probability of getting from 0 to 30 clicks:

x <- 0:30
pmf <- dpois(x, lam)

4. We now plot the probability for each possible number of clicks:

ggplot(data.frame(x=x, pmf=pmf), aes(x, pmf)) +
  geom_bar(stat="identity") +
  xlab("Number of clicks") +
  ylab("Probability") +
  ggtitle(paste("PMF of the Poisson distribution with lambda =", lam))

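5. We compute the cumulative distribution function (CDF) of the Poisson distribution and plot it:

# Compute the CDF of the Poisson distribution
cdf <- ppois(x, lam)

# Plot the CDF as a step function
ggplot(data.frame(x=x, cdf=cdf), aes(x, cdf)) +
  geom_step() +
  xlab("Number of clicks") +
  ylab("Cumulative probability")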

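6. We calculate the 90th percentile, that is, the smallest number of clicks k such that P(X ≤ k) is at least 90%:

q <- qpois(0.9, lam)
cat(paste("For a 90% probability, the number of clicks must be up to", q), "\n")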

The output is:

For a 90% probability, the number of clicks must be up to 14

For convenience, here is the equivalent Python script:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import poisson

# Define the average number of clicks per hour
lam = 10

# Probability of exactly 15 clicks in one hour
clicks = 15
prob = poisson.pmf(clicks, lam)
print(f"The probability of getting {clicks} clicks per hour is {prob}")

# PMF for every click count from 0 to 30
x = np.arange(0, 31)
pmf = poisson.pmf(x, lam)

plt.bar(x, pmf)
plt.xlabel('Number of clicks')
plt.ylabel('Probability')
plt.title(f'PMF of the Poisson distribution with lambda = {lam}')
plt.show()

# CDF over the same range, drawn as a step function
cdf = poisson.cdf(x, lam)

plt.step(x, cdf)
plt.xlabel('Number of clicks')
plt.ylabel('Cumulative probability')
plt.title(f'CDF of the Poisson distribution with lambda = {lam}')
plt.show()

# 90th percentile: smallest k with P(X <= k) >= 0.9 (returned as a float)
q = poisson.ppf(0.9, lam)
print(f"For a 90% probability, the number of clicks must be up to {q}")



Further Reading

For an accessible yet thorough introduction to probability distributions, including the Poisson, Finalmente ho capito la statistica by Maurizio De Pra covers the subject in a clear and approachable style, ideal for building solid intuition before moving on to more advanced material.
