In previous articles, we’ve examined statistical inference from a precise and coherent perspective: formulate a hypothesis, collect data, calculate a p-value, construct a confidence interval. We’ve conducted hypothesis tests, compared variants with A/B testing, and seen with the Central Limit Theorem why all of this works even when data isn’t normal.
This approach—called frequentist—has a clear logic: the parameter we want to estimate is a fixed value (even if unknown), and we “chase” it with data. But there’s another way to think about uncertainty, one that allows us to update our beliefs as new data arrives. It’s called the Bayesian approach, and in this article we’ll build its foundations.
Let’s start with a concrete example. Imagine we’ve just launched an advertising campaign and we don’t know the true click rate. We have an initial opinion based on experience (“click rates usually fall between 0% and 20%”), and then data starts coming in. The Bayesian approach lets us combine our initial opinion with the observed data to get an updated estimate—and repeat this process every time new information arrives.
Before diving into the mechanics, let’s clarify the conceptual difference between the two approaches. This isn’t a war: they’re two different ways of answering the same questions.
| | Frequentist | Bayesian |
|---|---|---|
| The parameter | Is a fixed (unknown) value | Is a random variable with a distribution |
| Probability | Relative frequency of an event over infinite repetitions | Degree of belief about an event |
| Uncertainty | Expressed through confidence intervals | Expressed through credible intervals |
| Prior data | Does not enter the model | Incorporated through the prior |
| Interpretation of CI | “If we repeated 100 times, 95 intervals would contain the parameter” | “There’s a 95% probability the parameter lies in this interval” |
The frequentist approach, the one we’ve used so far, treats the parameter as a fixed number and reasons about the distribution of the data. The Bayesian approach flips the perspective: it treats the data as fixed (we’ve observed them, they don’t change) and reasons about the distribution of the parameter—that is, how plausible we consider the various values the parameter might take.
The practical advantage of the Bayesian approach is that it can incorporate prior knowledge. If we know something about the parameter before collecting data (from experience, from previous studies, from common sense), we can use that knowledge. And then update it.
The heart of the Bayesian approach is a formula dating back to 1763, to Reverend Thomas Bayes. Let’s start from conditional probability: the probability of A given B.
We know from probability theory that:

\(P(A \mid B) = \frac{P(A \cap B)}{P(B)}\)

and symmetrically:

\(P(B \mid A) = \frac{P(A \cap B)}{P(A)}\)

From these two relations, by extracting \(P(A \cap B)\) from the second and substituting into the first, we obtain Bayes' Theorem:

\(P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}\)

So far it's algebra. The magic happens when we apply this formula to our problem: estimating a parameter \(\theta\) (for example, the click rate) from observed data. The theorem becomes:

\(P(\theta \mid \text{data}) = \frac{P(\text{data} \mid \theta) \, P(\theta)}{P(\text{data})}\)

Each piece of this formula has a specific name and role:

- \(P(\theta \mid \text{data})\) is the posterior: what we believe about the parameter after seeing the data.
- \(P(\text{data} \mid \theta)\) is the likelihood: how probable the observed data are under each value of the parameter.
- \(P(\theta)\) is the prior: what we believed about the parameter before seeing the data.
- \(P(\text{data})\) is the evidence: the overall probability of the data, which acts as a normalizing constant.

Since the evidence is constant (it doesn't depend on \(\theta\)), we can write the fundamental relationship:

\(P(\theta \mid \text{data}) \propto P(\text{data} \mid \theta) \, P(\theta)\)

In words: the posterior is proportional to the likelihood multiplied by the prior. The more data we collect, the more the likelihood "dominates" and the posterior concentrates around the values supported by the data. But with little data, the prior matters—and it matters a lot.
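The proportionality can be made concrete with a grid approximation. This sketch is mine, not part of the article's simulation: the grid step and the use of `dbinom` as the likelihood are illustrative choices. We evaluate prior times likelihood at each candidate value of \(\theta\) and normalize by the sum, which plays the role of the evidence.

```r
# Grid approximation: posterior ~ likelihood x prior, evaluated on a
# discrete grid of candidate click rates (an illustrative sketch)
theta <- seq(0, 0.20, by = 0.01)                     # candidate parameter values
prior_grid <- rep(1 / length(theta), length(theta))  # uniform prior over the grid

# Likelihood of observing 13 clicks out of 100 for each candidate value
likelihood <- dbinom(13, size = 100, prob = theta)

# Posterior: multiply and normalize (the sum plays the role of the evidence)
unnormalized <- likelihood * prior_grid
posterior_grid <- unnormalized / sum(unnormalized)

# The posterior peaks at the grid value closest to the observed 13%
theta[which.max(posterior_grid)]
```

The normalizing sum never needs to be computed analytically: dividing by it is enough to turn "proportional to" into a proper probability distribution over the grid.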
Let’s move to practice. We’ve launched an advertising campaign: the ad was shown 100 times and received 13 clicks. What is the true click rate?
The prior: we know almost nothing, but from experience we believe click rates typically fall between 0% and 20%. We model this uncertainty with a uniform distribution on [0, 0.20].
The approach: we use a simulation. We generate many plausible values from the prior, simulate the data each value would produce, and keep only those compatible with what we actually observed (13 clicks out of 100). What remains is the posterior.
```r
set.seed(42)
n_samples <- 100000
n_ads_shown <- 100
n_clicks_observed <- 13

# 1. Generate samples from the prior: uniform between 0 and 0.20
proportion_clicks <- runif(n_samples, min = 0.0, max = 0.20)

# 2. For each proportion value, simulate how many clicks we would get
n_visitors <- rbinom(n_samples, size = n_ads_shown, prob = proportion_clicks)

# 3. Build the data frame with prior and simulated data
prior <- data.frame(proportion_clicks, n_visitors)

# 4. Conditioning: keep only the samples compatible with 13 clicks
posterior <- prior[prior$n_visitors == n_clicks_observed, ]

cat("Samples in the prior:", nrow(prior), "\n")
cat("Samples in the posterior:", nrow(posterior), "\n")
cat("Posterior mean:", round(mean(posterior$proportion_clicks) * 100, 1), "%\n")
cat("Posterior median:", round(median(posterior$proportion_clicks) * 100, 1), "%\n")
```

Result: of the 100,000 initial samples, approximately 4,700 survive the conditioning step (the exact number varies with the simulation). The mean and median of the posterior are about 13.4%: a value very close to the 13 clicks out of 100 we observed.
Let’s visualize the transformation from prior to posterior:
```r
par(mfrow = c(1, 2))

# Prior
hist(prior$proportion_clicks, breaks = 30, probability = TRUE,
     main = "Prior\n(uniform 0-20%)",
     col = "lightyellow", xlab = "Click rate",
     ylab = "Density", xlim = c(0, 0.25))

# Posterior
hist(posterior$proportion_clicks, breaks = 30, probability = TRUE,
     main = "Posterior\n(after 13/100 clicks)",
     col = "lightblue", xlab = "Click rate",
     ylab = "Density", xlim = c(0, 0.25))
```

The difference is striking. The prior is a flat distribution (uniform): all values between 0% and 20% are considered equally plausible. The posterior, instead, concentrates around 13%, with a bell-shaped form. The data have "informed" our uncertainty.
The 95% credible interval:
```r
ci_95 <- quantile(posterior$proportion_clicks, probs = c(0.025, 0.975))
cat("95% credible interval:", round(ci_95[1] * 100, 1), "% -",
    round(ci_95[2] * 100, 1), "%\n")
```

The 95% credible interval is approximately 7.7% – 19.1%. This means exactly what it sounds like: there's a 95% probability that the true click rate falls between 7.7% and 19.1%.
Comparison with the frequentist CI:
```r
prop_test <- prop.test(13, 100, correct = FALSE)
cat("Frequentist 95% CI:", round(prop_test$conf.int[1] * 100, 1), "% -",
    round(prop_test$conf.int[2] * 100, 1), "%\n")
```

The frequentist 95% CI is approximately 7.8% – 21.0%. The numbers are similar, but the interpretation is different: the frequentist CI tells us that "if we repeated the sampling 100 times, 95 intervals would contain the true parameter." The Bayesian credible interval tells us directly the probability that the parameter lies in the interval. The latter is the interpretation that most people think they're giving to the confidence interval—but which, in the frequentist approach, is technically incorrect.
This is where the Bayesian approach reveals its elegance. Suppose the campaign continues: after additional days, we have 150 new impressions and 20 new clicks. How do we update our estimate?
The principle is simple: the posterior we just calculated becomes the new prior. We don’t have to start from scratch; we pick up where we left off.
```r
# The previous posterior becomes the new prior
prior_aggiornato <- posterior

# New data: 150 impressions, 20 clicks
n_ads_nuovi <- 150
n_clicks_nuovi <- 20

# Simulate the data using the proportions from the updated prior
n_samples_aggiornato <- nrow(prior_aggiornato)
prior_aggiornato$n_visitors <- rbinom(n_samples_aggiornato,
                                      size = n_ads_nuovi,
                                      prob = prior_aggiornato$proportion_clicks)

# Conditioning: keep only the samples compatible with 20 clicks
posterior_aggiornato <- prior_aggiornato[prior_aggiornato$n_visitors == n_clicks_nuovi, ]

cat("Samples in the updated posterior:", nrow(posterior_aggiornato), "\n")
cat("Mean:", round(mean(posterior_aggiornato$proportion_clicks) * 100, 1), "%\n")

# New credible interval
ci_aggiornato <- quantile(posterior_aggiornato$proportion_clicks, probs = c(0.025, 0.975))
cat("95% credible interval:", round(ci_aggiornato[1] * 100, 1), "% -",
    round(ci_aggiornato[2] * 100, 1), "%\n")
```

In total we've now observed 33 clicks on 250 impressions (13.2%). The mean of the updated posterior is about 13.5%, and the 95% credible interval has narrowed to approximately 9.6% – 17.9% (compared to the previous 7.7% – 19.1%). The distribution has "tightened": we have more data, so we're more certain.
Let’s visualize the evolution:
```r
par(mfrow = c(1, 3))

# Original prior
hist(runif(10000, 0, 0.20), breaks = 30, probability = TRUE,
     main = "1. Original prior\n(uniform 0-20%)",
     col = "lightyellow", xlab = "Click rate",
     ylab = "Density", xlim = c(0, 0.25))

# Posterior after the first data (13/100)
hist(posterior$proportion_clicks, breaks = 30, probability = TRUE,
     main = "2. After 13/100 clicks",
     col = "lightblue", xlab = "Click rate",
     ylab = "Density", xlim = c(0, 0.25))

# Posterior after the second batch (33/250 total)
hist(posterior_aggiornato$proportion_clicks, breaks = 30, probability = TRUE,
     main = "3. After 33/250 clicks",
     col = "lightgreen", xlab = "Click rate",
     ylab = "Density", xlim = c(0, 0.25))
```

The visual message is immediate: the distribution shifts and tightens. From total uncertainty (anything between 0% and 20%), through a reasonable estimate (centered on 13%), we arrive at a more precise estimate around 13.2%. The more data we collect, the more the posterior concentrates around the true value.
This is Bayesian updating: an iterative process in which information accumulates. We don’t throw away anything we knew before; we integrate it with the new evidence.
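The "posterior becomes the prior" loop above can be packaged as a small helper. This is an illustrative sketch, not code from the examples above: the function name `bayes_update` is my own, and it works on a bare vector of sampled rates rather than a data frame.

```r
# One step of simulation-based Bayesian updating (illustrative helper):
# take the current samples for the click rate, simulate the new batch of
# data under each sample, and keep only the samples that reproduce it.
bayes_update <- function(samples, n, k) {
  simulated <- rbinom(length(samples), size = n, prob = samples)
  samples[simulated == k]
}

set.seed(42)
samples <- runif(100000, 0, 0.20)   # start from the uniform prior

samples <- bayes_update(samples, n = 100, k = 13)  # first batch: 13/100
samples <- bayes_update(samples, n = 150, k = 20)  # second batch: 20/150

cat("Surviving samples:", length(samples), "\n")
cat("Posterior mean:", round(mean(samples) * 100, 1), "%\n")
```

Each call shrinks the vector, which is exactly the sample-depletion issue discussed later in the article: the helper makes the iteration tidy, but it doesn't make it free.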
In the previous examples we used a uniform prior: “all values between 0% and 20% are equally plausible.” This is called a non-informative prior (or weakly informative): it doesn’t express a strong preference for any particular value.
But in practice we often do know something. If we’ve already managed many advertising campaigns, we know that click rates typically fall between 5% and 15%, with a concentration around 10%. We can express this knowledge with an informative prior, for example a distribution centered on 0.10 with reduced spread.
Let’s compare the two approaches on the same data (13 clicks on 100 impressions):
```r
set.seed(42)
n_samples <- 100000
n_ads_shown <- 100
n_clicks_observed <- 13

# --- Non-informative prior: uniform (0, 0.20) ---
prior_flat <- runif(n_samples, min = 0.0, max = 0.20)
sim_flat <- rbinom(n_samples, size = n_ads_shown, prob = prior_flat)
posterior_flat <- prior_flat[sim_flat == n_clicks_observed]

# --- Informative prior: centered on 10%, concentrated between 5% and 15% ---
# We use a beta(20, 180) distribution, which has mean ~10% and reduced variance
prior_info <- rbeta(n_samples, shape1 = 20, shape2 = 180)
sim_info <- rbinom(n_samples, size = n_ads_shown, prob = prior_info)
posterior_info <- prior_info[sim_info == n_clicks_observed]

# Comparison
cat("=== Non-informative prior (uniform) ===\n")
cat("Posterior mean:", round(mean(posterior_flat) * 100, 1), "%\n")
cat("95% credible interval:", round(quantile(posterior_flat, 0.025) * 100, 1), "% -",
    round(quantile(posterior_flat, 0.975) * 100, 1), "%\n\n")
cat("=== Informative prior (centered on 10%) ===\n")
cat("Posterior mean:", round(mean(posterior_info) * 100, 1), "%\n")
cat("95% credible interval:", round(quantile(posterior_info, 0.025) * 100, 1), "% -",
    round(quantile(posterior_info, 0.975) * 100, 1), "%\n")
```

The posterior with the informative prior is slightly "pulled" toward 10% (our past experience), while the one with the uniform prior follows the data more closely. With 13 clicks on 100, the difference is modest; but with 5 clicks on 20, it would be much more pronounced.
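To see how pronounced the pull becomes with little data, we can rerun the same comparison with the what-if figures of 5 clicks on 20 impressions (a sketch with hypothetical numbers, not an experiment from the article):

```r
set.seed(42)
n_samples <- 100000

# Non-informative prior: uniform (0, 0.20)
prior_flat_small <- runif(n_samples, 0, 0.20)
sim <- rbinom(n_samples, size = 20, prob = prior_flat_small)
posterior_flat_small <- prior_flat_small[sim == 5]

# Informative prior: beta(20, 180), centered on 10%
prior_info_small <- rbeta(n_samples, 20, 180)
sim <- rbinom(n_samples, size = 20, prob = prior_info_small)
posterior_info_small <- prior_info_small[sim == 5]

# With only 20 observations the gap between the two posteriors is wide
cat("Flat prior, posterior mean:", round(mean(posterior_flat_small) * 100, 1), "%\n")
cat("Informative prior, posterior mean:", round(mean(posterior_info_small) * 100, 1), "%\n")
```

Note that 5/20 = 25% lies above the flat prior's 20% ceiling, so that posterior piles up against the upper bound, while the informative posterior stays anchored near 10%: with so little evidence, the prior you choose visibly shapes the answer.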
Let’s visualize:
```r
par(mfrow = c(1, 2))

hist(posterior_flat, breaks = 30, probability = TRUE,
     main = "Posterior with\nnon-informative prior",
     col = "lightyellow", xlab = "Click rate",
     ylab = "Density", xlim = c(0, 0.25))

hist(posterior_info, breaks = 30, probability = TRUE,
     main = "Posterior with\ninformative prior (10%)",
     col = "lightcoral", xlab = "Click rate",
     ylab = "Density", xlim = c(0, 0.25))
```

This is a fundamental property of Bayesian inference: with little data, the prior matters a lot; with a lot of data, the prior gets "overwhelmed" by the data. If we had 10,000 impressions and 1,300 clicks, the two posteriors would be practically identical, regardless of the chosen prior. In the long run, data always wins.
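The "data overwhelms the prior" claim can be checked directly with the hypothetical 1,300 clicks on 10,000 impressions. A sketch (I bump the number of simulations, since exactly matching 1,300 clicks becomes rarer as the sample grows):

```r
set.seed(42)
n_samples <- 1000000   # more simulations: exact matches are rarer with large n

# The same two priors as before
prior_flat_big <- runif(n_samples, 0, 0.20)
prior_info_big <- rbeta(n_samples, 20, 180)

# Simulate 10,000 impressions per candidate rate; keep matches of 1,300 clicks
post_flat_big <- prior_flat_big[rbinom(n_samples, 10000, prior_flat_big) == 1300]
post_info_big <- prior_info_big[rbinom(n_samples, 10000, prior_info_big) == 1300]

cat("Flat prior, posterior mean:", round(mean(post_flat_big) * 100, 2), "%\n")
cat("Informative prior, posterior mean:", round(mean(post_info_big) * 100, 2), "%\n")
```

With this much data both posterior means land within a fraction of a percentage point of 13%, whichever prior we started from.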
This is where the paths diverge clearly. In the article on confidence intervals we saw a fundamental point: the 95% confidence interval does not mean there’s a 95% probability that the parameter lies in the interval. It’s a property of the procedure, not of the individual interval.
The Bayesian 95% credible interval, on the other hand, means exactly what it sounds like: there’s a 95% probability that the parameter lies in that interval. It’s a direct statement about what we don’t know, not a statement about the procedure.
Let’s review the numbers from our example (13 clicks on 100 impressions):
| | Frequentist (CI) | Bayesian (credible interval) |
|---|---|---|
| Interval | ~7.8% – 21.0% | ~7.7% – 19.1% |
| Interpretation | “If we repeated the experiment 100 times, 95 intervals would contain the true parameter” | “There’s a 95% probability that the true parameter lies in this interval” |
| The parameter | Is a fixed value; the interval is random | Our belief about the parameter is described by a distribution |
| Depends on the prior? | No | Yes |
The numbers are similar—and this is no coincidence. With large samples and non-informative priors, the two approaches converge. But the interpretation is profoundly different, and the credible interval is much more intuitive: “there’s a 95% probability that the click rate is between 7.7% and 19.1%” is a statement anyone can understand and use to make decisions.
There’s no absolute winner. The choice depends on the context:
The Bayesian approach works particularly well when:

- we have reliable prior knowledge (past campaigns, previous studies) that we want to incorporate;
- data arrive sequentially and we want to update the estimate as they come in;
- we need statements that read directly as probabilities ("there's a 95% probability that...").

The frequentist approach remains preferable when:

- we have no reliable prior information, or want to avoid the subjectivity of choosing a prior;
- we need standardized, widely accepted procedures (classic hypothesis tests, p-values);
- the sample is large enough that the two approaches converge anyway and the simpler machinery suffices.
In practice, many professionals use both approaches depending on the context. A/B testing, for example, can be conducted in a frequentist manner (as we saw in the dedicated article) or in a Bayesian manner—and some testing platforms use precisely the Bayesian approach to update results in real time.
We’ve seen how Bayesian updating works with simulation: generate samples, simulate data, filter. It’s a powerful and intuitive method, but it has a practical limitation: at each step we lose samples. After two updates, of the initial 100,000 samples only a few remain.
The good news is that, for the case of proportions (click rates, conversion rates, success percentages), there’s an elegant analytical solution. The Beta distribution, which we’ve already encountered, is the natural distribution for describing our uncertainty about a proportion. And when the prior is a Beta distribution and the data are binomial (success/failure), the posterior is still a Beta distribution—with updated parameters.
This means the entire Bayesian update reduces to a simple operation on the parameters, without the need for simulations. But that’s a story for the next article.
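As a preview of that shortcut, here is a sketch under the assumption of a Beta(1, 1) prior, i.e. uniform on [0, 1] (unlike the truncated uniform on [0, 0.20] used in the simulations above): after observing k successes in n trials, the posterior is Beta(1 + k, 1 + n − k), no simulation required.

```r
# Conjugate update sketch: Beta prior + binomial data -> Beta posterior.
# Assumes a Beta(1, 1) prior, i.e. uniform on [0, 1].
a_prior <- 1
b_prior <- 1
n <- 100   # impressions
k <- 13    # clicks

a_post <- a_prior + k        # add the successes
b_post <- b_prior + (n - k)  # add the failures

cat("Posterior: Beta(", a_post, ",", b_post, ")\n")
cat("Posterior mean:", round(a_post / (a_post + b_post) * 100, 1), "%\n")
cat("95% credible interval:",
    round(qbeta(0.025, a_post, b_post) * 100, 1), "% -",
    round(qbeta(0.975, a_post, b_post) * 100, 1), "%\n")
```

The posterior mean lands close to the simulation's 13.4%, and the whole update was two additions: why this works, and how to choose the Beta parameters to encode an informative prior, is the subject of the next article.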
An e-commerce site has a historical conversion rate around 3%. After a product page redesign, out of 200 visits 10 conversions are observed (5%).
Hint: the code is nearly identical to the ad campaign example. Change the prior (runif(n, 0, 0.10)), the number of visits (200), and the number of observed conversions (10). For question 4, count how many posterior samples are above 0.03 and divide by the total.
If you want to explore Bayesian statistics with an accessible and surprisingly fun approach, Bayesian Statistics the Fun Way by Will Kurt is a read I recommend. Kurt manages to explain priors, posteriors, and Bayesian updating with concrete examples that don’t require a math degree—and he uses R for the computational side, exactly as we do here. It’s the ideal book for anyone who wants to understand Bayesian logic before tackling the formal theory.