In previous articles, we’ve examined statistical inference from a precise and coherent perspective: formulate a hypothesis, collect data, calculate a p-value, construct a confidence interval. We’ve conducted hypothesis tests, compared variants with A/B testing, and seen with the Central Limit Theorem why all of this works even when data isn’t normal.
This approach—called frequentist—has a clear logic: the parameter we want to estimate is a fixed value (even if unknown), and we “chase” it with data. But there’s another way to think about uncertainty, one that allows us to update our beliefs as new data arrives. It’s called the Bayesian approach, and in this article we’ll build its foundations.
Let’s start with a concrete example. Imagine we’ve just launched an advertising campaign and we don’t know the true click rate. We have an initial opinion based on experience (“click rates usually fall between 0% and 20%”), and then data starts coming in. The Bayesian approach lets us combine our initial opinion with the observed data to get an updated estimate—and repeat this process every time new information arrives.
Before diving into the mechanics, let’s clarify the conceptual difference between the two approaches. This isn’t a war: they’re two different ways of answering the same questions.
| | Frequentist | Bayesian |
|---|---|---|
| The parameter | Is a fixed (unknown) value | Is a random variable with a distribution |
| Probability | Relative frequency of an event over infinite repetitions | Degree of belief about an event |
| Uncertainty | Expressed through confidence intervals | Expressed through credible intervals |
| Prior data | Does not enter the model | Incorporated through the prior |
| Interpretation of CI | “If we repeated 100 times, 95 intervals would contain the parameter” | “There’s a 95% probability the parameter lies in this interval” |
The frequentist approach, the one we’ve used so far, treats the parameter as a fixed number and reasons about the distribution of the data. The Bayesian approach flips the perspective: it treats the data as fixed (we’ve observed them, they don’t change) and reasons about the distribution of the parameter—that is, how plausible we consider the various values the parameter might take.
The practical advantage of the Bayesian approach is that it can incorporate prior knowledge. If we know something about the parameter before collecting data (from experience, from previous studies, from common sense), we can use that knowledge. And then update it.
The heart of the Bayesian approach is a formula dating back to 1763, to Reverend Thomas Bayes. Let’s start from conditional probability: the probability of A given B.
We know from probability theory that:

\(P(A \mid B) = \frac{P(A \cap B)}{P(B)}\)

and symmetrically:

\(P(B \mid A) = \frac{P(A \cap B)}{P(A)}\)

From these two relations, by extracting \(P(A \cap B)\) from the second and substituting into the first, we obtain Bayes' Theorem:

\(P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}\)

So far it's algebra. The magic happens when we apply this formula to our problem: estimating a parameter \(\theta\) (for example, the click rate) from observed data. The theorem becomes:

\(P(\theta \mid \text{data}) = \frac{P(\text{data} \mid \theta) \, P(\theta)}{P(\text{data})}\)

Each piece of this formula has a specific name and role:

- \(P(\theta \mid \text{data})\) is the posterior: what we believe about the parameter after seeing the data.
- \(P(\text{data} \mid \theta)\) is the likelihood: how probable the observed data are under each value of the parameter.
- \(P(\theta)\) is the prior: what we believed about the parameter before seeing the data.
- \(P(\text{data})\) is the evidence: the overall probability of the data, which acts as a normalizing constant.

Since the evidence is constant (it doesn't depend on \(\theta\)), we can write the fundamental relationship:

\(P(\theta \mid \text{data}) \propto P(\text{data} \mid \theta) \, P(\theta)\)

In words: the posterior is proportional to the likelihood multiplied by the prior. The more data we collect, the more the likelihood "dominates" and the posterior concentrates around the values supported by the data. But with little data, the prior matters—and it matters a lot.
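The proportionality can be made concrete with a grid approximation. This sketch is mine, not part of the article's simulation: the grid step and the use of `dbinom` as the likelihood are illustrative choices. We evaluate prior times likelihood at each candidate value of \(\theta\) and normalize by the sum, which plays the role of the evidence.

```r
# Grid approximation: posterior ~ likelihood x prior, evaluated on a
# discrete grid of candidate click rates (an illustrative sketch)
theta <- seq(0, 0.20, by = 0.01)                     # candidate parameter values
prior_grid <- rep(1 / length(theta), length(theta))  # uniform prior over the grid

# Likelihood of observing 13 clicks out of 100 for each candidate value
likelihood <- dbinom(13, size = 100, prob = theta)

# Posterior: multiply and normalize (the sum plays the role of the evidence)
unnormalized <- likelihood * prior_grid
posterior_grid <- unnormalized / sum(unnormalized)

# The posterior peaks at the grid value closest to the observed 13%
theta[which.max(posterior_grid)]
```

The normalizing sum never needs to be computed analytically: dividing by it is enough to turn "proportional to" into a proper probability distribution over the grid.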
Let’s move to practice. We’ve launched an advertising campaign: the ad was shown 100 times and received 13 clicks. What is the true click rate?
The prior: we know almost nothing, but from experience we believe click rates typically fall between 0% and 20%. We model this uncertainty with a uniform distribution on [0, 0.20].
The approach: we use a simulation. We generate many plausible values from the prior, simulate the data each value would produce, and keep only those compatible with what we actually observed (13 clicks out of 100). What remains is the posterior.
```r
set.seed(42)
n_samples <- 100000
n_ads_shown <- 100
n_clicks_observed <- 13

# 1. Generate samples from the prior: uniform between 0 and 0.20
proportion_clicks <- runif(n_samples, min = 0.0, max = 0.20)

# 2. For each proportion value, simulate how many clicks we would get
n_visitors <- rbinom(n_samples, size = n_ads_shown, prob = proportion_clicks)

# 3. Build the data frame with prior and simulated data
prior <- data.frame(proportion_clicks, n_visitors)

# 4. Conditioning: keep only the samples compatible with 13 clicks
posterior <- prior[prior$n_visitors == n_clicks_observed, ]

cat("Samples in the prior:", nrow(prior), "\n")
cat("Samples in the posterior:", nrow(posterior), "\n")
cat("Posterior mean:", round(mean(posterior$proportion_clicks) * 100, 1), "%\n")
cat("Posterior median:", round(median(posterior$proportion_clicks) * 100, 1), "%\n")
```

Result: of the 100,000 initial samples, approximately 4,700 survive the conditioning step (the exact number varies with the simulation). The mean and median of the posterior are about 13.4%: a value very close to the 13 clicks out of 100 we observed.
Let’s visualize the transformation from prior to posterior:
```r
par(mfrow = c(1, 2))

# Prior
hist(prior$proportion_clicks, breaks = 30, probability = TRUE,
     main = "Prior\n(uniform 0-20%)",
     col = "lightyellow", xlab = "Click rate",
     ylab = "Density", xlim = c(0, 0.25))

# Posterior
hist(posterior$proportion_clicks, breaks = 30, probability = TRUE,
     main = "Posterior\n(after 13/100 clicks)",
     col = "lightblue", xlab = "Click rate",
     ylab = "Density", xlim = c(0, 0.25))
```

The difference is striking. The prior is a flat distribution (uniform): all values between 0% and 20% are considered equally plausible. The posterior, instead, concentrates around 13%, with a bell-shaped form. The data have "informed" our uncertainty.
The 95% credible interval:
```r
ci_95 <- quantile(posterior$proportion_clicks, probs = c(0.025, 0.975))
cat("95% credible interval:", round(ci_95[1] * 100, 1), "% -",
    round(ci_95[2] * 100, 1), "%\n")
```

The 95% credible interval is approximately 7.7% – 19.1%. This means exactly what it sounds like: there's a 95% probability that the true click rate falls between 7.7% and 19.1%.
Comparison with the frequentist CI:
```r
prop_test <- prop.test(13, 100, correct = FALSE)
cat("Frequentist 95% CI:", round(prop_test$conf.int[1] * 100, 1), "% -",
    round(prop_test$conf.int[2] * 100, 1), "%\n")
```

The frequentist 95% CI is approximately 7.8% – 21.0%. The numbers are similar, but the interpretation is different: the frequentist CI tells us that "if we repeated the sampling 100 times, 95 intervals would contain the true parameter." The Bayesian credible interval tells us directly the probability that the parameter lies in the interval. The latter is the interpretation that most people think they're giving to the confidence interval—but which, in the frequentist approach, is technically incorrect.
This is where the Bayesian approach reveals its elegance. Suppose the campaign continues: after additional days, we have 150 new impressions and 20 new clicks. How do we update our estimate?
The principle is simple: the posterior we just calculated becomes the new prior. We don’t have to start from scratch; we pick up where we left off.
```r
# The previous posterior becomes the new prior
prior_aggiornato <- posterior

# New data: 150 impressions, 20 clicks
n_ads_nuovi <- 150
n_clicks_nuovi <- 20

# Simulate the data using the proportions from the updated prior
n_samples_aggiornato <- nrow(prior_aggiornato)
prior_aggiornato$n_visitors <- rbinom(n_samples_aggiornato,
                                      size = n_ads_nuovi,
                                      prob = prior_aggiornato$proportion_clicks)

# Conditioning: keep only the samples compatible with 20 clicks
posterior_aggiornato <- prior_aggiornato[prior_aggiornato$n_visitors == n_clicks_nuovi, ]

cat("Samples in the updated posterior:", nrow(posterior_aggiornato), "\n")
cat("Mean:", round(mean(posterior_aggiornato$proportion_clicks) * 100, 1), "%\n")

# New credible interval
ci_aggiornato <- quantile(posterior_aggiornato$proportion_clicks, probs = c(0.025, 0.975))
cat("95% credible interval:", round(ci_aggiornato[1] * 100, 1), "% -",
    round(ci_aggiornato[2] * 100, 1), "%\n")
```

In total we've now observed 33 clicks on 250 impressions (13.2%). The mean of the updated posterior is about 13.5%, and the 95% credible interval has narrowed to approximately 9.6% – 17.9% (compared to the previous 7.7% – 19.1%). The distribution has "tightened": we have more data, so we're more certain.
Let’s visualize the evolution:
```r
par(mfrow = c(1, 3))

# Original prior
hist(runif(10000, 0, 0.20), breaks = 30, probability = TRUE,
     main = "1. Original prior\n(uniform 0-20%)",
     col = "lightyellow", xlab = "Click rate",
     ylab = "Density", xlim = c(0, 0.25))

# Posterior after the first data (13/100)
hist(posterior$proportion_clicks, breaks = 30, probability = TRUE,
     main = "2. After 13/100 clicks",
     col = "lightblue", xlab = "Click rate",
     ylab = "Density", xlim = c(0, 0.25))

# Posterior after the second batch (33/250 total)
hist(posterior_aggiornato$proportion_clicks, breaks = 30, probability = TRUE,
     main = "3. After 33/250 clicks",
     col = "lightgreen", xlab = "Click rate",
     ylab = "Density", xlim = c(0, 0.25))
```

The visual message is immediate: the distribution shifts and tightens. From total uncertainty (anything between 0% and 20%), through a reasonable estimate (centered on 13%), we arrive at a more precise estimate around 13.2%. The more data we collect, the more the posterior concentrates around the true value.
This is Bayesian updating: an iterative process in which information accumulates. We don’t throw away anything we knew before; we integrate it with the new evidence.
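The "posterior becomes the prior" loop above can be packaged as a small helper. This is an illustrative sketch, not code from the examples above: the function name `bayes_update` is my own, and it works on a bare vector of sampled rates rather than a data frame.

```r
# One step of simulation-based Bayesian updating (illustrative helper):
# take the current samples for the click rate, simulate the new batch of
# data under each sample, and keep only the samples that reproduce it.
bayes_update <- function(samples, n, k) {
  simulated <- rbinom(length(samples), size = n, prob = samples)
  samples[simulated == k]
}

set.seed(42)
samples <- runif(100000, 0, 0.20)   # start from the uniform prior

samples <- bayes_update(samples, n = 100, k = 13)  # first batch: 13/100
samples <- bayes_update(samples, n = 150, k = 20)  # second batch: 20/150

cat("Surviving samples:", length(samples), "\n")
cat("Posterior mean:", round(mean(samples) * 100, 1), "%\n")
```

Each call shrinks the vector, which is exactly the sample-depletion issue discussed later in the article: the helper makes the iteration tidy, but it doesn't make it free.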
In the previous examples we used a uniform prior: “all values between 0% and 20% are equally plausible.” This is called a non-informative prior (or weakly informative): it doesn’t express a strong preference for any particular value.
But in practice we often do know something. If we’ve already managed many advertising campaigns, we know that click rates typically fall between 5% and 15%, with a concentration around 10%. We can express this knowledge with an informative prior, for example a distribution centered on 0.10 with reduced spread.
Let’s compare the two approaches on the same data (13 clicks on 100 impressions):
```r
set.seed(42)
n_samples <- 100000
n_ads_shown <- 100
n_clicks_observed <- 13

# --- Non-informative prior: uniform (0, 0.20) ---
prior_flat <- runif(n_samples, min = 0.0, max = 0.20)
sim_flat <- rbinom(n_samples, size = n_ads_shown, prob = prior_flat)
posterior_flat <- prior_flat[sim_flat == n_clicks_observed]

# --- Informative prior: centered on 10%, concentrated between 5% and 15% ---
# We use a beta(20, 180) distribution, which has mean ~10% and reduced variance
prior_info <- rbeta(n_samples, shape1 = 20, shape2 = 180)
sim_info <- rbinom(n_samples, size = n_ads_shown, prob = prior_info)
posterior_info <- prior_info[sim_info == n_clicks_observed]

# Comparison
cat("=== Non-informative prior (uniform) ===\n")
cat("Posterior mean:", round(mean(posterior_flat) * 100, 1), "%\n")
cat("95% credible interval:", round(quantile(posterior_flat, 0.025) * 100, 1), "% -",
    round(quantile(posterior_flat, 0.975) * 100, 1), "%\n\n")
cat("=== Informative prior (centered on 10%) ===\n")
cat("Posterior mean:", round(mean(posterior_info) * 100, 1), "%\n")
cat("95% credible interval:", round(quantile(posterior_info, 0.025) * 100, 1), "% -",
    round(quantile(posterior_info, 0.975) * 100, 1), "%\n")
```

The posterior with the informative prior is slightly "pulled" toward 10% (our past experience), while the one with the uniform prior follows the data more closely. With 13 clicks on 100, the difference is modest; but with 5 clicks on 20, it would be much more pronounced.
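To see how pronounced the pull becomes with little data, we can rerun the same comparison with the what-if figures of 5 clicks on 20 impressions (a sketch with hypothetical numbers, not an experiment from the article):

```r
set.seed(42)
n_samples <- 100000

# Non-informative prior: uniform (0, 0.20)
prior_flat_small <- runif(n_samples, 0, 0.20)
sim <- rbinom(n_samples, size = 20, prob = prior_flat_small)
posterior_flat_small <- prior_flat_small[sim == 5]

# Informative prior: beta(20, 180), centered on 10%
prior_info_small <- rbeta(n_samples, 20, 180)
sim <- rbinom(n_samples, size = 20, prob = prior_info_small)
posterior_info_small <- prior_info_small[sim == 5]

# With only 20 observations the gap between the two posteriors is wide
cat("Flat prior, posterior mean:", round(mean(posterior_flat_small) * 100, 1), "%\n")
cat("Informative prior, posterior mean:", round(mean(posterior_info_small) * 100, 1), "%\n")
```

Note that 5/20 = 25% lies above the flat prior's 20% ceiling, so that posterior piles up against the upper bound, while the informative posterior stays anchored near 10%: with so little evidence, the prior you choose visibly shapes the answer.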
Let’s visualize:
```r
par(mfrow = c(1, 2))

hist(posterior_flat, breaks = 30, probability = TRUE,
     main = "Posterior with\nnon-informative prior",
     col = "lightyellow", xlab = "Click rate",
     ylab = "Density", xlim = c(0, 0.25))

hist(posterior_info, breaks = 30, probability = TRUE,
     main = "Posterior with\ninformative prior (10%)",
     col = "lightcoral", xlab = "Click rate",
     ylab = "Density", xlim = c(0, 0.25))
```

This is a fundamental property of Bayesian inference: with little data, the prior matters a lot; with a lot of data, the prior gets "overwhelmed" by the data. If we had 10,000 impressions and 1,300 clicks, the two posteriors would be practically identical, regardless of the chosen prior. In the long run, data always wins.
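The "data overwhelms the prior" claim can be checked directly with the hypothetical 1,300 clicks on 10,000 impressions. A sketch (I bump the number of simulations, since exactly matching 1,300 clicks becomes rarer as the sample grows):

```r
set.seed(42)
n_samples <- 1000000   # more simulations: exact matches are rarer with large n

# The same two priors as before
prior_flat_big <- runif(n_samples, 0, 0.20)
prior_info_big <- rbeta(n_samples, 20, 180)

# Simulate 10,000 impressions per candidate rate; keep matches of 1,300 clicks
post_flat_big <- prior_flat_big[rbinom(n_samples, 10000, prior_flat_big) == 1300]
post_info_big <- prior_info_big[rbinom(n_samples, 10000, prior_info_big) == 1300]

cat("Flat prior, posterior mean:", round(mean(post_flat_big) * 100, 2), "%\n")
cat("Informative prior, posterior mean:", round(mean(post_info_big) * 100, 2), "%\n")
```

With this much data both posterior means land within a fraction of a percentage point of 13%, whichever prior we started from.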
This is where the paths diverge clearly. In the article on confidence intervals we saw a fundamental point: the 95% confidence interval does not mean there’s a 95% probability that the parameter lies in the interval. It’s a property of the procedure, not of the individual interval.
The Bayesian 95% credible interval, on the other hand, means exactly what it sounds like: there’s a 95% probability that the parameter lies in that interval. It’s a direct statement about what we don’t know, not a statement about the procedure.
Let’s review the numbers from our example (13 clicks on 100 impressions):
| | Frequentist (CI) | Bayesian (credible interval) |
|---|---|---|
| Interval | ~7.8% – 21.0% | ~7.7% – 19.1% |
| Interpretation | “If we repeated the experiment 100 times, 95 intervals would contain the true parameter” | “There’s a 95% probability that the true parameter lies in this interval” |
| The parameter | Is a fixed value; the interval is random | Our belief about the parameter is described by a distribution |
| Depends on the prior? | No | Yes |
The numbers are similar—and this is no coincidence. With large samples and non-informative priors, the two approaches converge. But the interpretation is profoundly different, and the credible interval is much more intuitive: “there’s a 95% probability that the click rate is between 7.7% and 19.1%” is a statement anyone can understand and use to make decisions.
There’s no absolute winner. The choice depends on the context:
The Bayesian approach works particularly well when:

- we have reliable prior knowledge (past campaigns, previous studies) that we want to incorporate;
- data arrive sequentially and we want to update the estimate as they come in;
- we need statements that read directly as probabilities ("there's a 95% probability that...").

The frequentist approach remains preferable when:

- we have no reliable prior information, or want to avoid the subjectivity of choosing a prior;
- we need standardized, widely accepted procedures (classic hypothesis tests, p-values);
- the sample is large enough that the two approaches converge anyway and the simpler machinery suffices.
In practice, many professionals use both approaches depending on the context. A/B testing, for example, can be conducted in a frequentist manner (as we saw in the dedicated article) or in a Bayesian manner—and some testing platforms use precisely the Bayesian approach to update results in real time.
We’ve seen how Bayesian updating works with simulation: generate samples, simulate data, filter. It’s a powerful and intuitive method, but it has a practical limitation: at each step we lose samples. After two updates, of the initial 100,000 samples only a few remain.
The good news is that, for the case of proportions (click rates, conversion rates, success percentages), there’s an elegant analytical solution. The Beta distribution, which we’ve already encountered, is the natural distribution for describing our uncertainty about a proportion. And when the prior is a Beta distribution and the data are binomial (success/failure), the posterior is still a Beta distribution—with updated parameters.
This means the entire Bayesian update reduces to a simple operation on the parameters, without the need for simulations. But that’s a story for the next article.
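As a preview of that shortcut, here is a sketch under the assumption of a Beta(1, 1) prior, i.e. uniform on [0, 1] (unlike the truncated uniform on [0, 0.20] used in the simulations above): after observing k successes in n trials, the posterior is Beta(1 + k, 1 + n − k), no simulation required.

```r
# Conjugate update sketch: Beta prior + binomial data -> Beta posterior.
# Assumes a Beta(1, 1) prior, i.e. uniform on [0, 1].
a_prior <- 1
b_prior <- 1
n <- 100   # impressions
k <- 13    # clicks

a_post <- a_prior + k        # add the successes
b_post <- b_prior + (n - k)  # add the failures

cat("Posterior: Beta(", a_post, ",", b_post, ")\n")
cat("Posterior mean:", round(a_post / (a_post + b_post) * 100, 1), "%\n")
cat("95% credible interval:",
    round(qbeta(0.025, a_post, b_post) * 100, 1), "% -",
    round(qbeta(0.975, a_post, b_post) * 100, 1), "%\n")
```

The posterior mean lands close to the simulation's 13.4%, and the whole update was two additions: why this works, and how to choose the Beta parameters to encode an informative prior, is the subject of the next article.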
An e-commerce site has a historical conversion rate around 3%. After a product page redesign, out of 200 visits 10 conversions are observed (5%).
Hint: the code is nearly identical to the ad campaign example. Change the prior (runif(n, 0, 0.10)), the number of visits (200), and the number of observed conversions (10). For question 4, count how many posterior samples are above 0.03 and divide by the total.
If you want to explore Bayesian statistics with an accessible and surprisingly fun approach, Bayesian Statistics the Fun Way by Will Kurt is a read I recommend. Kurt manages to explain priors, posteriors, and Bayesian updating with concrete examples that don’t require a math degree—and he uses R for the computational side, exactly as we do here. It’s the ideal book for anyone who wants to understand Bayesian logic before tackling the formal theory.