<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>statistics &#8211; paologironi blog</title>
	<atom:link href="https://www.gironi.it/blog/en/category/statistics/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.gironi.it/blog</link>
	<description>Scattered notes on (retro) computing, data analysis, statistics, SEO, and things that change</description>
	<lastBuildDate>Fri, 13 Mar 2026 08:07:30 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	
	<item>
		<title>The Monte Carlo Method Explained Simply with Real-World Applications</title>
		<link>https://www.gironi.it/blog/en/the-monte-carlo-method-explained-simply-with-real-world-applications/</link>
					<comments>https://www.gironi.it/blog/en/the-monte-carlo-method-explained-simply-with-real-world-applications/#respond</comments>
		
		<dc:creator><![CDATA[autore-articoli]]></dc:creator>
		<pubDate>Wed, 11 Mar 2026 14:49:05 +0000</pubDate>
				<category><![CDATA[statistics]]></category>
		<guid isPermaLink="false">https://www.gironi.it/blog/?p=3512</guid>

					<description><![CDATA[What is the Monte Carlo method The story of the Monte Carlo method begins in the most unlikely way: with a mathematician in bed playing cards. In 1946, Stanisław Ulam, a Polish mathematician recovering from surgery, found himself playing solitaire to pass the time. Being a mathematician, he wondered: what are the chances of winning &#8230; <a href="https://www.gironi.it/blog/en/the-monte-carlo-method-explained-simply-with-real-world-applications/" class="more-link">Continue reading<span class="screen-reader-text"> "The Monte Carlo Method Explained Simply with Real-World Applications"</span></a>]]></description>
										<content:encoded><![CDATA[



<p><!-- ============================================================ --><br><!-- SECTION 1: WHAT IS THE MONTE CARLO METHOD (~500 words) --><br><!-- ============================================================ --></p>



<h2 class="wp-block-heading">What is the Monte Carlo method</h2>



<p>The story of the Monte Carlo method begins in the most unlikely way: with a mathematician in bed playing cards. In 1946, <strong>Stanisław Ulam</strong>, a Polish mathematician recovering from surgery, found himself playing solitaire to pass the time. Being a mathematician, he wondered: what are the chances of winning a game?</p>



<p>The problem was theoretically solvable: just enumerate every possible combination of cards and count the favorable ones. In practice, however, the number of combinations was so enormous that an exact calculation was completely impractical. Ulam then had an insight as simple as it was powerful: <strong>instead of computing the exact probability, why not simulate hundreds of games and count how many times you win?</strong></p>



<span id="more-3512"></span>



<p>The idea is disarmingly simple. If we play 1,000 games and win 230 of them, we can estimate the probability of winning at about 23%. The more games we simulate, the closer our estimate gets to the true value. This is, in essence, the <strong>Monte Carlo method</strong>: using random simulation to solve problems that would be too complex to tackle analytically.</p>
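<p>Ulam's simulate-and-count idea fits in a few lines of code. Solitaire is fiddly to program, so here is a sketch using a toy game of our own choosing instead: the classic problem of rolling at least one six in four dice throws, picked because its exact probability, 1 − (5/6)<sup>4</sup> ≈ 51.77%, is known and can be compared with the estimate:</p>

```python
import random

random.seed(1)  # fix the seed so the experiment is reproducible

def play_once():
    """One 'game': roll a die four times, win if at least one six appears."""
    return any(random.randint(1, 6) == 6 for _ in range(4))

n_games = 100_000
wins = sum(play_once() for _ in range(n_games))
estimate = wins / n_games

exact = 1 - (5 / 6) ** 4  # known in closed form for this toy game: ~0.5177
print(f"estimated: {estimate:.4f}  exact: {exact:.4f}")
```

<p>With 100,000 simulated games the estimate typically lands within a few tenths of a percentage point of the true value, exactly as the law of large numbers promises.</p>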



<p>Ulam shared the idea with his colleague <strong>John von Neumann</strong>, arguably the most brilliant mathematician of the 20th century, who immediately saw its potential. Von Neumann realized that <strong>ENIAC</strong> — one of the very first electronic computers, which filled an entire room — could run thousands of simulations in reasonable time. Together, they developed the method for a problem far more serious than solitaire: the <strong>diffusion of neutrons</strong> in atomic weapons, as part of the Manhattan Project at Los Alamos.</p>



<p>The name “Monte Carlo” was chosen as a code name, a reference to the famous <strong>Monte Carlo Casino</strong> in Monaco. Legend has it that the inspiration came from Ulam’s uncle, a notorious gambler. After all, the heart of the method is chance itself: generating random numbers to explore spaces of possibility too vast to traverse systematically.</p>



<p>From those early nuclear experiments of the 1940s, the Monte Carlo method has spread to every field of science and engineering. Today it is one of the most widely used computational tools in the world, from particle physics to finance, from cinematic rendering to drug discovery. Let’s see how it works.</p>



<p><!-- ============================================================ --><br><!-- SECTION 2: FUNDAMENTAL CONCEPTS (~300 words) --><br><!-- ============================================================ --></p>



<h2 class="wp-block-heading">Fundamental concepts</h2>



<p>The Monte Carlo method rests on a statistical principle we’ve encountered before: the <strong>law of large numbers</strong>. In simple terms, this law tells us that the average of a random sample approaches the population average as the sample grows. Translated into Monte Carlo language: <strong>the more simulations we run, the more accurate our result will be</strong>.</p>



<p>To run a Monte Carlo simulation, we need <strong>random numbers</strong>. In practice, computers don’t generate truly random numbers: they use deterministic algorithms that produce sequences of <strong>pseudo-random numbers</strong> with statistical properties indistinguishable from real randomness. In R, for example, the <code>runif()</code> function generates uniformly distributed numbers between 0 and 1.</p>



<p>A crucial aspect is the <strong>rate of convergence</strong>. The Monte Carlo estimation error decreases as <strong>1/√n</strong>, where n is the number of simulations. This means that to halve the error, we need to quadruple our simulations; to gain one more decimal digit of precision, we need 100 times more iterations. It’s not particularly efficient, but the beauty of the method lies in the fact that <strong>it works regardless of the problem’s complexity</strong>: whether the problem has 2 or 2,000 variables, the convergence rate remains the same.</p>



<p>In practice, we must always balance <strong>desired precision</strong> with <strong>available computational resources</strong>. Increasing the number of simulations comes at a cost in computation time. Fortunately, modern computers make this trade-off much more favorable than in the days of ENIAC.</p>



<p><!-- ============================================================ --><br><!-- SECTION 3: THE METHOD IN ACTION (~400 words) --><br><!-- ============================================================ --></p>



<h2 class="wp-block-heading">The Monte Carlo method in action</h2>



<p>Let’s see concretely how the Monte Carlo method is applied. The procedure follows four fundamental steps:</p>



<p><strong>1. Define the model.</strong> First, we identify the problem’s variables and the probability distributions that govern them. For instance, if we want to simulate an investment’s return, the model will include the expected return (mean) and volatility (standard deviation), typically assuming normally distributed returns.</p>



<p><strong>2. Generate random scenarios.</strong> Using a pseudo-random number generator, we produce thousands of possible scenarios. Each scenario represents an “alternative history”: one way things could play out.</p>



<p><strong>3. Compute the result for each scenario.</strong> For each scenario, we apply the model and obtain a result. If we’re simulating an investment, the result is the final portfolio value.</p>



<p><strong>4. Aggregate the results.</strong> Finally, we analyze the set of results: we compute the mean, the median, the percentiles. This gives us not just an estimate of the expected outcome, but an entire <strong>distribution of possibilities</strong>. And this is where Monte Carlo truly shines: it tells us not only “how much we’re likely to earn” but also “how much we could lose in the worst case.”</p>
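<p>The four steps above can be sketched as a generic routine. The function names and the toy model below are illustrative choices of ours, not code from any particular library:</p>

```python
import random
from statistics import mean, median, quantiles

def monte_carlo(generate_scenario, compute_result, n=10_000):
    """A generic Monte Carlo loop following the four steps above."""
    results = []
    for _ in range(n):
        scenario = generate_scenario()            # step 2: one random scenario
        results.append(compute_result(scenario))  # step 3: its outcome
    # step 4: aggregate into a distribution of possibilities
    return {
        "mean": mean(results),
        "median": median(results),
        "p5": quantiles(results, n=100)[4],  # 5th percentile: the bad cases
    }

# Step 1, the model: a yearly return drawn from a normal distribution
random.seed(7)
summary = monte_carlo(
    generate_scenario=lambda: random.gauss(0.08, 0.12),
    compute_result=lambda r: 10_000 * (1 + r),  # final value of 10,000 invested
)
print(summary)
```

<p>Swapping in a different model only means changing the two functions passed in; the loop and the aggregation stay the same.</p>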



<p>Let’s use a quick example to illustrate convergence. Imagine flipping a coin and trying to estimate the probability of heads. After 10 flips, we might get 7 heads (70%), an estimate far from the true 50%. After 100 flips, we’ll be closer, perhaps 53%. After 10,000 flips, our estimate will be very close to 50%. This is Monte Carlo in action: replacing a theoretical calculation with an experiment repeated thousands of times.</p>



<p>The power of the method lies in its <strong>flexibility</strong>. While analytical methods require closed-form solutions (which often don’t exist for complex problems), Monte Carlo only requires the ability to simulate the process. If we can write a program that generates one scenario, Monte Carlo gives us the distribution of outcomes.</p>



<p><!-- ============================================================ --><br><!-- SECTION 4: PRACTICAL EXAMPLES (~600 words) --><br><!-- ============================================================ --></p>



<h2 class="wp-block-heading">Practical examples: estimating π and portfolio returns</h2>



<h3 class="wp-block-heading">Example 1: estimating the value of π</h3>



<p>The most classic and pedagogically effective example of the Monte Carlo method is <strong>estimating the number π</strong>. The idea is elegant: consider a square of side 2 with a circle of radius 1 inscribed inside it. The area of the square is 4, the area of the circle is π. If we generate random points inside the square, the proportion falling inside the circle will be approximately π/4.</p>



<p>We compute this in R with 100,000 points:</p>



<pre class="wp-block-code"><code>set.seed(123)
n &lt;- 100000
x &lt;- runif(n, -1, 1)
y &lt;- runif(n, -1, 1)
inside &lt;- (x^2 + y^2) &lt;= 1
pi_estimate &lt;- 4 * sum(inside) / n
pi_estimate
# &#91;1] 3.13956</code></pre>



<p>The same in Python:</p>



<pre class="wp-block-code"><code>import random
random.seed(123)
n = 100000
inside = sum(1 for _ in range(n)
             if random.uniform(-1, 1)**2 + random.uniform(-1, 1)**2 &lt;= 1)
pi_estimate = 4 * inside / n
print(pi_estimate)
# 3.14268</code></pre>



<p>With 100,000 points we already get a reasonable estimate, though not extremely precise: we’re accurate to about two decimal places. As we mentioned, gaining another digit of precision would require roughly 100 times more points. The computer does all the heavy lifting.</p>



<h3 class="wp-block-heading">Example 2: estimating portfolio returns</h3>



<p>Let’s move to an example closer to real-world applications. Suppose we have a portfolio of three stocks with the following characteristics:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Stock</th><th>Expected Return</th><th>Standard Deviation</th><th>Portfolio Weight</th></tr></thead><tbody><tr><td>A</td><td>8%</td><td>12%</td><td>40%</td></tr><tr><td>B</td><td>10%</td><td>15%</td><td>30%</td></tr><tr><td>C</td><td>12%</td><td>18%</td><td>30%</td></tr></tbody></table></figure>



<p>We want to estimate the probability that the portfolio return exceeds 10%. We simulate in R with 10,000 scenarios:</p>



<pre class="wp-block-code"><code>set.seed(42)
sim_A &lt;- rnorm(10000, mean = 0.08, sd = 0.12)
sim_B &lt;- rnorm(10000, mean = 0.10, sd = 0.15)
sim_C &lt;- rnorm(10000, mean = 0.12, sd = 0.18)
sim_portfolio &lt;- 0.4 * sim_A + 0.3 * sim_B + 0.3 * sim_C
prob_result &lt;- mean(sim_portfolio &gt;= 0.10)
prob_result
# &#91;1] 0.4504</code></pre>



<p>The same in Python:</p>



<pre class="wp-block-code"><code>import random
random.seed(42)
n = 10000
count = 0
for _ in range(n):
    a = random.gauss(0.08, 0.12)
    b = random.gauss(0.10, 0.15)
    c = random.gauss(0.12, 0.18)
    ptf = 0.4 * a + 0.3 * b + 0.3 * c
    if ptf &gt;= 0.10:
        count += 1
print(count / n)
# 0.4479</code></pre>



<p>The result tells us there’s roughly a 45% chance of exceeding 10% return. Notice how Monte Carlo gives us not a single number, but an entire distribution: we could easily compute the median return, the worst-case 5th percentile, the probability of loss, and so on.</p>
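<p>As a sketch of that last point, here is how those extra statistics can be pulled out of the same simulation, using only Python's standard library (we bump the scenario count to 100,000 for smoother percentiles):</p>

```python
import random
from statistics import median, quantiles

random.seed(42)
n = 100_000
portfolio = []
for _ in range(n):
    a = random.gauss(0.08, 0.12)
    b = random.gauss(0.10, 0.15)
    c = random.gauss(0.12, 0.18)
    portfolio.append(0.4 * a + 0.3 * b + 0.3 * c)

print(f"median return:  {median(portfolio):.3f}")
print(f"5th percentile: {quantiles(portfolio, n=100)[4]:.3f}")   # worst plausible case
print(f"P(loss):        {sum(r < 0 for r in portfolio) / n:.3f}")
```

<p>The same array of simulated returns answers all three questions at once: no extra modelling required, just different summaries of the same distribution.</p>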



<p><!-- ============================================================ --><br><!-- SECTION 5: INTERACTIVE SIMULATOR (~200 words) --><br><!-- ============================================================ --></p>



<h2 class="wp-block-heading">Monte Carlo Simulator</h2>



<p>To make the concept even more tangible, we’ve built an <strong>interactive simulator</strong> that applies the Monte Carlo method to predict the future value of an investment. The underlying model is the <strong>Geometric Brownian Motion</strong> (GBM), the same model used in the famous Black-Scholes framework for options pricing.</p>



<p>Intuitively, an asset’s future price is computed as the current price multiplied by a random growth factor. The formula is:</p>



<p class="has-text-align-center"><strong>S(t+1) = S(t) × exp((μ − σ²/2) + σ × Z)</strong></p>



<p>where <strong>μ</strong> is the expected annual return (the “average growth”), <strong>σ</strong> is the volatility (how much the price fluctuates — our measure of uncertainty), and <strong>Z</strong> is a standard normal random number, drawn afresh at each annual step. Each simulation generates a different path: some scenarios see the portfolio grow substantially, others see it decline. The histogram shows the distribution of all possible outcomes.</p>
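<p>For the curious, the update rule above takes only a few lines to simulate. The parameters below (μ = 7%, σ = 15%, ten annual steps, an initial value of 10,000) are illustrative choices of ours, not the simulator's actual defaults:</p>

```python
import random
from math import exp
from statistics import mean, median

random.seed(0)

def final_value(s0, mu, sigma, years):
    """Apply S(t+1) = S(t) * exp((mu - sigma^2/2) + sigma * Z) once per year."""
    s = s0
    for _ in range(years):
        s *= exp((mu - sigma ** 2 / 2) + sigma * random.gauss(0, 1))
    return s

n = 50_000
finals = [final_value(10_000, mu=0.07, sigma=0.15, years=10) for _ in range(n)]

print(f"mean final value:   {mean(finals):,.0f}")    # near 10,000 * exp(0.07 * 10)
print(f"median final value: {median(finals):,.0f}")  # lower: the distribution is skewed
```

<p>Note how the mean ends up above the median: the multiplicative noise makes the distribution right-skewed, with a few very lucky paths pulling the average up.</p>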



<iframe src="https://www.gironi.it/utility/montecarlo-simulator-en/" width="100%" height="600" style="border:none;border-radius:12px;" loading="lazy" title="Monte Carlo Simulator"></iframe>



<p><!-- ============================================================ --><br><!-- SECTION 6: MODERN APPLICATIONS (~400 words) --><br><!-- ============================================================ --></p>



<h2 class="wp-block-heading">Modern applications of the Monte Carlo method</h2>



<p>From the nuclear physics of the 1940s, the Monte Carlo method has spread to domains that Ulam and von Neumann could never have imagined. Let’s look at some of the most fascinating applications.</p>



<p><strong>3D rendering and cinema.</strong> Every time we watch a Pixar film or a blockbuster with visual effects, we’re admiring Monte Carlo at work. The technique is called <strong>path tracing</strong>: to compute the color of each pixel, the software simulates millions of light rays bouncing between surfaces in the scene. Each ray follows a random path, and the average of thousands of paths produces the photorealistic image we see on screen.</p>



<p><strong>Finance and risk management.</strong> In the financial world, Monte Carlo is ubiquitous. Banks use it to calculate <strong>Value at Risk</strong> (VaR) — the maximum probable loss of a portfolio over a given time horizon. It’s the same principle as our simulator, applied to portfolios with hundreds of assets and complex correlations. Pricing exotic options that lack closed-form solutions also relies on Monte Carlo simulations.</p>



<p><strong>Drug discovery.</strong> In pharmaceutical research, Monte Carlo is used to simulate <strong>molecular docking</strong>: how a candidate molecule binds to a target protein. By simulating millions of possible spatial configurations, researchers identify the most promising compounds before synthesizing them in the lab, saving years of experimentation.</p>



<p><strong>Climate models.</strong> Models predicting climate change are inherently uncertain: they depend on emission scenarios, atmospheric feedback, ocean dynamics. Monte Carlo allows exploration of thousands of parameter combinations and generates the <strong>uncertainty bands</strong> we see in IPCC reports. Not a single prediction, but a distribution of possible futures.</p>



<p><strong>Artificial intelligence.</strong> In machine learning, a technique called <strong>Monte Carlo dropout</strong> uses simulation to estimate the uncertainty of a neural network’s predictions. And the famous <strong>AlphaGo</strong> by DeepMind, which in 2016 defeated the world Go champion, used <strong>Monte Carlo Tree Search</strong> (MCTS) to explore possible moves in a game with more configurations than atoms in the universe.</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Field</th><th>Example</th><th>What is simulated</th></tr></thead><tbody><tr><td>Cinema/3D</td><td>Path tracing (Pixar)</td><td>Light ray paths</td></tr><tr><td>Finance</td><td>Value at Risk</td><td>Market scenarios</td></tr><tr><td>Pharmaceuticals</td><td>Molecular docking</td><td>Spatial configurations</td></tr><tr><td>Climate</td><td>IPCC models</td><td>Parameter combinations</td></tr><tr><td>AI</td><td>AlphaGo (MCTS)</td><td>Possible moves</td></tr></tbody></table></figure>



<p><!-- ============================================================ --><br><!-- SECTION 7: ADVANTAGES AND LIMITATIONS (~300 words) --><br><!-- ============================================================ --></p>



<h2 class="wp-block-heading">Advantages and limitations of the Monte Carlo method</h2>



<p>Like any statistical tool, the Monte Carlo method has its strengths and limitations. Let’s examine them honestly.</p>



<p><strong>Flexibility.</strong> The greatest advantage is versatility: Monte Carlo applies to complex problems of any size and in any field, from finance to engineering, physics to biology. It doesn’t require closed-form solutions, only the ability to simulate the process.</p>



<p><strong>Accuracy.</strong> With a sufficient number of simulations, the estimate can be made arbitrarily precise. The more we run the method, the closer the result converges to the true value.</p>



<p><strong>Scalability.</strong> Unlike grid-based methods, which suffer from the “curse of dimensionality” (cost explodes with the number of variables), Monte Carlo maintains the same convergence rate regardless of the number of dimensions. This makes it the only practical tool for high-dimensional problems.</p>
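<p>To see this dimension-independence in action, the same hit-or-miss recipe used for π works unchanged in ten dimensions. Here we estimate the fraction of the cube [−1, 1]<sup>10</sup> occupied by the unit ball, whose exact volume (π<sup>5</sup>/120) is known, so the estimate can be checked:</p>

```python
import random
from math import pi

random.seed(99)
dim, n = 10, 200_000

hits = 0
for _ in range(n):
    # A random point in the 10-dimensional cube [-1, 1]^10
    point = [random.uniform(-1, 1) for _ in range(dim)]
    if sum(x * x for x in point) <= 1:  # inside the unit 10-ball?
        hits += 1

estimate = hits / n
exact = pi ** 5 / 120 / 2 ** dim  # known in closed form: ~0.00249
print(f"estimated fraction: {estimate:.5f}  exact: {exact:.5f}")
```

<p>A regular grid with just ten points per axis would already need 10<sup>10</sup> evaluations; the Monte Carlo estimate gets there with 200,000 random points.</p>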



<p>However, the method also presents <strong>significant limitations</strong>:</p>



<p><strong>Slow convergence.</strong> The 1/√n rate means that gaining one digit of precision requires 100 times more simulations. For problems demanding very high precision, this can be prohibitive.</p>



<p><strong>Computational cost.</strong> For complex problems (many variables, heavy models), each individual simulation may require significant time. Multiplied by thousands or millions of iterations, the cost becomes considerable.</p>



<p>To mitigate these limitations, <strong>variance reduction techniques</strong> have been developed over the years, enabling more precise results with fewer simulations:</p>



<ul class="wp-block-list">
<li><strong>Importance sampling</strong>: sampling from an alternative distribution that “concentrates” simulations in the most informative regions.</li>



<li><strong>Control variates</strong>: using a correlated variable with known expected value to reduce the estimate’s variance.</li>



<li><strong>Stratified sampling</strong>: dividing the space into homogeneous subgroups and sampling from each.</li>



<li><strong>Antithetic variates</strong>: exploiting pairs of negatively correlated random numbers to reduce variance.</li>
</ul>
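<p>Of these, antithetic variates is the easiest to demonstrate: for every uniform draw U we also use 1 − U, and since the two resulting estimates are negatively correlated, their average fluctuates less. A sketch on a deliberately simple toy integral, E[U<sup>2</sup>] = 1/3:</p>

```python
import random
from statistics import mean, pvariance

random.seed(5)

def plain(n):
    """Standard estimate of E[U^2] = 1/3 from n independent uniform draws."""
    return mean(random.random() ** 2 for _ in range(n))

def antithetic(n):
    """Same budget of n evaluations, taken as negatively correlated pairs (U, 1 - U)."""
    total = 0.0
    for _ in range(n // 2):
        u = random.random()
        total += (u ** 2 + (1 - u) ** 2) / 2
    return total / (n // 2)

# Run each estimator 2,000 times and compare how much the estimates scatter
plain_runs = [plain(100) for _ in range(2_000)]
anti_runs = [antithetic(100) for _ in range(2_000)]
print(f"variance of plain estimator:      {pvariance(plain_runs):.2e}")
print(f"variance of antithetic estimator: {pvariance(anti_runs):.2e}")
```

<p>With the same budget of 100 evaluations per estimate, the antithetic version's variance comes out several times smaller: more precision for free.</p>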



<p><!-- ============================================================ --><br><!-- CLOSING --><br><!-- ============================================================ --></p>



<p>The Monte Carlo method represents one of the most powerful tools in computational statistics. In future articles, we’ll explore how some of these techniques — particularly the <strong>bootstrap</strong>, a close relative of Monte Carlo — apply to concrete problems in statistical inference.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><!-- ============================================================ --><br><!-- FURTHER READING --><br><!-- ============================================================ --></p>



<h3 class="wp-block-heading">Further reading</h3>



<p>For a deeper dive into the Monte Carlo method and its applications in finance, <a href="https://www.amazon.com/dp/1441915753?tag=consulenzeinf-21" target="_blank" rel="nofollow noopener sponsored"><em>Monte Carlo Methods in Financial Engineering</em></a> by Paul Glasserman is the most comprehensive reference: it covers theory and practice with detailed examples in derivative pricing and risk management.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.gironi.it/blog/en/the-monte-carlo-method-explained-simply-with-real-world-applications/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>A/B Test Sample Size Calculator</title>
		<link>https://www.gironi.it/blog/en/ab-test-sample-size-calculator/</link>
		
		<dc:creator><![CDATA[paolo]]></dc:creator>
		<pubDate>Fri, 06 Mar 2026 08:07:28 +0000</pubDate>
				<category><![CDATA[statistics]]></category>
		<guid isPermaLink="false">https://www.gironi.it/blog/?p=3495</guid>

					<description><![CDATA[One of the most common questions when planning an A/B test is: how many users do I need to get a reliable result? The answer is not a magic number: it depends on the size of the effect we want to detect, the baseline conversion rate, and the level of statistical certainty we require. Calculating &#8230; <a href="https://www.gironi.it/blog/en/ab-test-sample-size-calculator/" class="more-link">Continue reading<span class="screen-reader-text"> "A/B Test Sample Size Calculator"</span></a>]]></description>
										<content:encoded><![CDATA[<p>One of the most common questions when planning an <strong>A/B test</strong> is: <em>how many users do I need to get a reliable result?</em> The answer is not a magic number: it depends on the size of the effect we want to detect, the baseline conversion rate, and the level of statistical certainty we require.</p>
<p>Calculating the <strong>sample size</strong> in advance is essential to avoid two classic mistakes: stopping the test too early and declaring a winner that does not exist, or letting it run too long, wasting traffic and time. In other words, it is about finding the right balance between resources and rigour.</p>
<p>If you have read the article on <a href="https://www.gironi.it/blog/en/guide-to-statistical-tests-for-a-b-analysis/">A/B Testing</a>, you will recall that <strong>power analysis</strong> is the statistical method that lets us determine this threshold. And if you have studied <a href="https://www.gironi.it/blog/en/confidence-intervals-what-they-are-how-to-calculate-them-and-what-they-do-not-mean/">confidence intervals</a>, you already know that significance level and test power are not abstract concepts but operational levers that directly affect sample size.</p>
<p><span id="more-3495"></span></p>
<p>The calculator below automates this process: simply enter your test parameters to instantly get the number of observations needed per variant and, if you know your daily traffic, an estimate of the test duration in days.</p>
<div style="border: 1px solid #ccc;padding: 1.2em 1.5em;margin: 1.5em 0;border-radius: 6px">
<h3 style="margin-top: 0">What We&#8217;ll Cover</h3>
<ul>
<li><a href="#calculator">The calculator</a></li>
<li><a href="#formula">The formula: how the calculation works</a></li>
<li><a href="#how-to-use">How to use the calculator</a></li>
<li><a href="#further-reading">Further reading</a></li>
</ul>
</div>
<hr />
<h2 id="calculator">The calculator</h2>
<p>Enter the parameters of your A/B test and the calculator will instantly return the required sample size.</p>
<style>
.ss-calc{max-width:620px;margin:2em auto;padding:1.5em 2em;background:#f8f8f8;border:1px solid #ddd;border-radius:8px;font-family:inherit}
.ss-calc h3{margin:0 0 1em;color:#333;font-size:1.2em}
.ss-calc label{display:block;margin:0.8em 0 0.3em;font-weight:600;color:#333;font-size:0.95em}
.ss-calc .ss-hint{font-size:0.82em;color:#777;margin:0.15em 0 0}
.ss-calc input[type=number],.ss-calc select{width:100%;padding:8px 10px;border:1px solid #ccc;border-radius:4px;font-size:1em;box-sizing:border-box;background:#fff}
.ss-calc input[type=number]:focus,.ss-calc select:focus{outline:none;border-color:#0073aa;box-shadow:0 0 0 2px rgba(0,115,170,0.15)}
.ss-calc .ss-row{display:flex;gap:1.2em}
.ss-calc .ss-col{flex:1}
.ss-calc .ss-result{margin-top:1.5em;padding:1.2em;background:#fff;border:2px solid #2ecc71;border-radius:6px;text-align:center}
.ss-calc .ss-result .ss-big{font-size:2em;font-weight:700;color:#2ecc71;display:block;margin:0.2em 0}
.ss-calc .ss-result .ss-label{font-size:0.85em;color:#666}
.ss-calc .ss-result .ss-total{font-size:1.1em;color:#333;margin-top:0.5em}
.ss-calc .ss-result .ss-days{font-size:1em;color:#0073aa;margin-top:0.4em;font-weight:600}
.ss-calc .ss-warn{color:#e74c3c;font-size:0.85em;margin-top:0.5em;display:none}
@media(max-width:520px){.ss-calc .ss-row{flex-direction:column;gap:0}.ss-calc{padding:1em 1.2em}}
</style>
<div class="ss-calc" id="ssCalcEn">
<h3>Sample Size Calculator</h3>
<p><label for="ssBaseEn">Baseline conversion rate (%)</label><br />
<input type="number" id="ssBaseEn" value="5" min="0.1" max="100" step="0.1"></p>
<p class="ss-hint">The current conversion rate of the control variant</p>
<p><label for="ssMdeEn">Minimum detectable effect &mdash; MDE (% relative)</label><br />
<input type="number" id="ssMdeEn" value="20" min="1" max="100" step="1"></p>
<p class="ss-hint">The smallest relative improvement we consider meaningful (e.g. 20% = from 5% to 6%)</p>
<div class="ss-row">
<div class="ss-col">
<label for="ssAlphaEn">Significance level (&alpha;)</label><br />
<select id="ssAlphaEn"><option value="0.01">0.01 (99%)</option><option value="0.05" selected>0.05 (95%)</option><option value="0.10">0.10 (90%)</option></select>
</div>
<div class="ss-col">
<label for="ssPowerEn">Power (1&minus;&beta;)</label><br />
<select id="ssPowerEn"><option value="0.80" selected>0.80</option><option value="0.85">0.85</option><option value="0.90">0.90</option><option value="0.95">0.95</option></select>
</div>
</div>
<p><label for="ssTrafficEn">Daily traffic <span style="font-weight:400;color:#999">(optional)</span></label><br />
<input type="number" id="ssTrafficEn" value="" min="1" step="1" placeholder="e.g. 1000"></p>
<p class="ss-hint">Total daily visitors to estimate test duration</p>
<div class="ss-result" id="ssResultEn">
<span class="ss-label">Sample size per variant</span><br />
<span class="ss-big" id="ssNEn">&mdash;</span>
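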
<div class="ss-total" id="ssTotalEn"></div>
<div class="ss-days" id="ssDaysEn"></div>
</div>
<div class="ss-warn" id="ssWarnEn"></div>
</div>
<p><script>
(function(){
  function qnorm(p){
    if(p<=0||p>=1)return NaN;
    if(p<0.5)return -qnorm(1-p);
    var t=Math.sqrt(-2*Math.log(1-p));
    var c0=2.515517,c1=0.802853,c2=0.010328;
    var d1=1.432788,d2=0.189269,d3=0.001308;
    return t-(c0+c1*t+c2*t*t)/(1+d1*t+d2*t*t+d3*t*t*t);
  }
  function calcSS(){
    var base=parseFloat(document.getElementById('ssBaseEn').value);
    var mde=parseFloat(document.getElementById('ssMdeEn').value);
    var alpha=parseFloat(document.getElementById('ssAlphaEn').value);
    var power=parseFloat(document.getElementById('ssPowerEn').value);
    var traffic=document.getElementById('ssTrafficEn').value;
    var warn=document.getElementById('ssWarnEn');
    warn.style.display='none';
    if(isNaN(base)||isNaN(mde)||base<=0||base>100||mde<=0||mde>100){
      document.getElementById('ssNEn').innerHTML='&mdash;';
      document.getElementById('ssTotalEn').textContent='';
      document.getElementById('ssDaysEn').textContent='';
      return;
    }
    var p1=base/100;
    var p2=p1*(1+mde/100);
    if(p2>1){
      warn.textContent='Warning: with these values the variant conversion rate would exceed 100%.';
      warn.style.display='block';
      document.getElementById('ssNEn').innerHTML='&mdash;';
      document.getElementById('ssTotalEn').textContent='';
      document.getElementById('ssDaysEn').textContent='';
      return;
    }
    var za=qnorm(1-alpha/2);
    var zb=qnorm(power);
    var diff=p1-p2;
    var n=Math.ceil((Math.pow(za+zb,2)*(p1*(1-p1)+p2*(1-p2)))/(diff*diff));
    document.getElementById('ssNEn').textContent=n.toLocaleString('en-US');
    document.getElementById('ssTotalEn').textContent='Total (2 variants): '+(n*2).toLocaleString('en-US')+' observations';
    if(traffic && parseInt(traffic)>0){
      var days=Math.ceil((n*2)/parseInt(traffic));
      document.getElementById('ssDaysEn').textContent='Estimated duration: about '+days+' days';
    }else{
      document.getElementById('ssDaysEn').textContent='';
    }
  }
  ['ssBaseEn','ssMdeEn','ssAlphaEn','ssPowerEn','ssTrafficEn'].forEach(function(id){
    document.getElementById(id).addEventListener('input',calcSS);
    document.getElementById(id).addEventListener('change',calcSS);
  });
  calcSS();
})();
</script></p>
<hr />
<h2 id="formula">The formula: how the calculation works</h2>
<p>The calculator uses the standard formula for comparing two proportions with a <strong>two-tailed z-test</strong>. Let us walk through it step by step.</p>
<p>We start with the parameters we enter:</p>
<ul>
<li><strong>p<sub>1</sub></strong>: the baseline conversion rate (control), expressed as a proportion. If our CR is 5%, then p<sub>1</sub> = 0.05.</li>
<li><strong>p<sub>2</sub></strong>: the expected conversion rate for the variant. If the minimum detectable effect (MDE) is 20% relative, then p<sub>2</sub> = p<sub>1</sub> &times; (1 + MDE/100) = 0.05 &times; 1.20 = 0.06.</li>
<li><strong>&alpha;</strong>: the significance level, i.e. the probability of declaring an effect when there is none (Type I error). With &alpha; = 0.05 we work at 95% confidence.</li>
<li><strong>1 &minus; &beta;</strong>: the power of the test, i.e. the probability of detecting an effect when it actually exists. With power 0.80, we have an 80% chance of catching the effect.</li>
</ul>
<p>The formula is:</p>
<p>\( n = \frac{\left[z_{\alpha/2} + z_{\beta}\right]^2 \cdot \left[p_1(1-p_1) + p_2(1-p_2)\right]}{(p_1 - p_2)^2} \)</p>
<p>Where z<sub>&alpha;/2</sub> and z<sub>&beta;</sub> are the <strong>quantiles of the standard normal distribution</strong>. For the most common values:</p>
<ul>
<li>&alpha; = 0.05 &rarr; z<sub>&alpha;/2</sub> = 1.96</li>
<li>&alpha; = 0.01 &rarr; z<sub>&alpha;/2</sub> = 2.576</li>
<li>&beta; = 0.20 (power 0.80) &rarr; z<sub>&beta;</sub> = 0.842</li>
<li>&beta; = 0.10 (power 0.90) &rarr; z<sub>&beta;</sub> = 1.282</li>
</ul>
<p><strong>Worked example.</strong> Suppose we have a baseline conversion rate of 3% and we want to detect a 20% relative increase (i.e. going from 3% to 3.6%), with &alpha; = 0.05 and power = 0.80:</p>
<ul>
<li>p<sub>1</sub> = 0.03, p<sub>2</sub> = 0.036</li>
<li>z<sub>&alpha;/2</sub> = 1.95996, z<sub>&beta;</sub> = 0.84162 (we keep the unrounded quantiles here, so the result matches the calculator)</li>
<li>Numerator: (1.95996 + 0.84162)<sup>2</sup> &times; [0.03 &times; 0.97 + 0.036 &times; 0.964] = 7.849 &times; 0.0638 = 0.5008</li>
<li>Denominator: (0.03 &minus; 0.036)<sup>2</sup> = 0.000036</li>
<li>n = 0.5008 / 0.000036 &asymp; <strong>13,911 per variant</strong></li>
</ul>
<p>So to detect a 20% relative effect on a 3% CR, we need roughly <strong>13,900 observations per variant</strong> (nearly 28,000 in total). These numbers are worth reflecting on: if our site gets 500 visitors a day, the test will take about 56 days. This is one of the reasons why, in practice, most A/B tests on medium-traffic sites take weeks, not days.</p>
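<p>The same formula is easy to reproduce in Python, using the standard library's <code>statistics.NormalDist</code> for the normal quantiles. This is a sketch equivalent to the calculator's JavaScript, not the code actually powering it:</p>

```python
from math import ceil
from statistics import NormalDist

def sample_size(base_rate, mde_relative, alpha=0.05, power=0.80):
    """Required n per variant for a two-tailed z-test on two proportions."""
    p1 = base_rate
    p2 = p1 * (1 + mde_relative)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # e.g. 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # e.g. 0.842 for power = 0.80
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# The worked example: 3% baseline, 20% relative MDE, alpha 0.05, power 0.80
print(sample_size(0.03, 0.20))  # -> 13911 per variant
```

<p>Halving the MDE roughly quadruples the required sample, which is the same 1/√n trade-off that governs every Monte Carlo estimate.</p>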
<hr />
<h2 id="how-to-use">How to use the calculator</h2>
<p><strong>How to choose the MDE.</strong> The minimum detectable effect is the trickiest parameter. Rather than asking &#8220;how much would we like the metric to improve&#8221;, we should ask: <em>what is the smallest improvement that would justify the effort of implementing the change?</em> An MDE of 5% relative requires enormous samples; an MDE of 50% is easy to detect but rarely realistic. The 10&ndash;30% range is a good starting point for most conversion rate tests.</p>
<p>An important detail: the MDE in the calculator is <strong>relative</strong>, not absolute. An MDE of 20% on a baseline CR of 5% means we are looking to detect a shift from 5% to 6% (one absolute percentage point, but 20% of the starting value).</p>
<p><strong>How to estimate daily traffic.</strong> The traffic to enter is that of the pages involved in the test, not the total site traffic. If the test is on the checkout page and it receives 300 visits per day, the correct value is 300. You can get this figure from your analytics tool (GA4, Matomo, or similar) by averaging the last 30 days to smooth out daily fluctuations.</p>
<hr />
<h3 id="further-reading">You might also like</h3>
<ul>
<li><a href="https://www.gironi.it/blog/en/guide-to-statistical-tests-for-a-b-analysis/">A/B Testing: A Guide to Statistical Tests for A/B Analysis</a></li>
<li><a href="https://www.gironi.it/blog/en/confidence-intervals-what-they-are-how-to-calculate-them-and-what-they-do-not-mean/">Confidence Intervals</a></li>
<li><a href="https://www.gironi.it/blog/en/hypothesis-testing-a-step-by-step-guide/">Hypothesis Testing</a></li>
</ul>
<hr />
<h3>Further reading</h3>
<p>The most comprehensive reference on the rigorous design of online experiments is: <a href="https://www.amazon.com/dp/1108724264" rel="nofollow sponsored noopener" target="_blank"><em>Trustworthy Online Controlled Experiments</em></a> by Ron Kohavi, Diane Tang and Ya Xu. It covers sample size, power analysis and much more, drawing on decades of practical experience at Microsoft and Google.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Understanding the Basics of Machine Learning: A Beginner&#8217;s Guide</title>
		<link>https://www.gironi.it/blog/en/understanding-the-basics-of-machine-learning-a-beginners-guide/</link>
					<comments>https://www.gironi.it/blog/en/understanding-the-basics-of-machine-learning-a-beginners-guide/#respond</comments>
		
		<dc:creator><![CDATA[autore-articoli]]></dc:creator>
		<pubDate>Sun, 01 Mar 2026 19:47:47 +0000</pubDate>
				<category><![CDATA[statistics]]></category>
		<guid isPermaLink="false">https://www.gironi.it/blog/understanding-the-basics-of-machine-learning-a-beginners-guide/</guid>

					<description><![CDATA[Introduction Machine Learning is changing the way we see the world around us. From weather prediction to medical diagnosis, from content recommendations on streaming platforms to financial fraud detection, Machine Learning is increasingly present in our daily lives. But what exactly is it, and how does it work? In this post, we will explore the &#8230; <a href="https://www.gironi.it/blog/en/understanding-the-basics-of-machine-learning-a-beginners-guide/" class="more-link">Continue reading<span class="screen-reader-text"> "Understanding the Basics of Machine Learning: A Beginner&#8217;s Guide"</span></a>]]></description>
										<content:encoded><![CDATA[<h2>Introduction</h2>
<p>Machine Learning is changing the way we see the world around us. From weather prediction to medical diagnosis, from content recommendations on streaming platforms to financial fraud detection, Machine Learning is increasingly present in our daily lives.</p>
<p>But what exactly is it, and how does it work? In this post, we will <strong>explore the fundamental concepts of Machine Learning and see how it can be used to solve real-world problems</strong>. We will also look at how to get started with Machine Learning, what resources are available, and how to use this technology to improve both work and everyday life.</p>
<p><span id="more-3484"></span></p>
<p style="background-color:#f0f0f0;padding:1em;border-radius:4px"><strong><em>Caveat</em></strong>: This article is a simple introduction to a vast subject. It is written for anyone who wants to understand the basic concepts of Machine Learning, without requiring advanced technical or mathematical knowledge. At the end of the post, we provide a set of useful resources for anyone who wishes to explore the topic further and continue what is a truly fascinating journey.</p>
<div style="border: 1px solid #ccc;padding: 1.2em 1.5em;margin: 1.5em 0;border-radius: 6px">
<h3 style="margin-top: 0">What We&#8217;ll Cover</h3>
<ul>
<li><a href="#what-is-ml">What Is Machine Learning</a></li>
<li><a href="#types">Supervised vs Unsupervised Learning</a></li>
<li><a href="#supervised-algorithms">Main Supervised Learning Algorithms</a></li>
<li><a href="#unsupervised-algorithms">Main Unsupervised Learning Algorithms</a></li>
<li><a href="#ml-process">The Machine Learning Process</a></li>
<li><a href="#getting-started">Getting Started: Tutorials and Resources</a></li>
<li><a href="#jupyter-colab">Jupyter Lab and Google Colab</a></li>
<li><a href="#further-reading">Further Reading</a></li>
</ul>
</div>
<h2 id="what-is-ml">What Is Machine Learning</h2>
<p>Machine Learning is a technology that allows machines to &#8220;learn&#8221; from data and improve their performance without being explicitly programmed. In other words, Machine Learning enables machines to &#8220;learn&#8221; from experience, just as humans do.</p>
<p>There are two main types of Machine Learning: <strong>supervised Machine Learning</strong> and <strong>unsupervised Machine Learning</strong>.</p>
<p>In supervised Machine Learning, the model is &#8220;trained&#8221; on a dataset that includes examples of both inputs and desired outputs. The model then uses these examples to make predictions on new data. In unsupervised Machine Learning, the model must &#8220;discover&#8221; on its own the structures and relationships within the data, without being guided by pre-defined examples.</p>
<p>Machine Learning is used across a wide range of applications, from weather prediction to medical diagnosis, from content recommendations to financial fraud detection. In general, the goal of Machine Learning is to automate decisions and predictions based on data, improving the efficiency and accuracy of the process.</p>
<h2 id="types">Types of Machine Learning: Supervised and Unsupervised</h2>
<p>As we have already seen, Machine Learning can be divided into two main categories: supervised Machine Learning and unsupervised Machine Learning.</p>
<p><strong>Supervised Machine Learning is the most common type of automated learning</strong> and is based on a <strong>set of already labeled data</strong>. In other words, the model is &#8220;trained&#8221; on a dataset that includes examples of both inputs and desired outputs. The model then uses these examples to learn to make inferences on new data. For example, a spam classifier could be trained on a set of emails labeled as &#8220;spam&#8221; or &#8220;not spam,&#8221; and then used to classify new incoming emails.</p>
<p><strong>Unsupervised Machine Learning</strong>, on the other hand, <strong>is based on a set of unlabeled data</strong>. In other words, the model must &#8220;learn&#8221; on its own to discover structures and relationships within the data. A typical example of this type of learning is clustering, where data is divided into groups (<em>clusters</em>) based on their similarities.</p>
<p style="background-color:#f0f0f0;padding:1em;border-radius:4px">In general, we can say that supervised Machine Learning uses labeled data to make predictions or classifications, while unsupervised Machine Learning uses unlabeled data to make discoveries or identify relationships within the data.</p>
<h3 id="supervised-algorithms">Main Supervised Learning Algorithms</h3>
<p>The main supervised Machine Learning algorithms are:</p>
<ul>
<li><strong>Linear Regression</strong>: used for <strong>quantitative predictions</strong> on a continuous variable. For example, predicting the price of a house based on its square footage.
<p>We have written dedicated posts on this topic, which may be very helpful for a proper understanding:<br /><strong><a href="https://www.gironi.it/blog/en/correlation-and-regression-analysis/" target="_blank" rel="noreferrer noopener">Correlation and Regression Analysis</a><br /><a href="https://www.gironi.it/blog/en/multiple-regression-analysis/" target="_blank" rel="noreferrer noopener">Multiple Regression Analysis</a></strong></li>
<li><strong>Logistic Regression</strong>: used for <strong>categorical variable predictions</strong>, i.e., when the output is one class among two or more possibilities. For example, predicting whether a patient has a certain disease or not.</li>
<li><strong>Decision Trees</strong>: used for both classification and regression. They consist of a decision graph where each node represents a decision and each branch represents an outcome.</li>
<li><strong>Random Forest</strong>: an ensemble of decision trees, each trained on a random subset of the data; the individual predictions are aggregated by majority vote (classification) or by averaging (regression).</li>
<li><strong>Gradient Boosting</strong>: an algorithm that builds decision trees in succession, each new tree correcting the errors of the previous ones.</li>
<li><strong>Support Vector Machine (SVM)</strong>: used for classification; it finds the boundary that maximises the margin between classes, and with kernel functions it can also handle data that is not linearly separable.</li>
<li><strong>k-Nearest Neighbors (k-NN)</strong>: classifies a new data point by looking at the classes of its k most similar points in the training data.</li>
<li><strong>Naive Bayes</strong>: a probabilistic classifier based on Bayes&#8217; theorem, with a simplifying assumption of independence between features.</li>
</ul>
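<p>As a toy illustration of the last ideas in this list, k-NN is simple enough to implement in a few lines of plain Python. This sketch is ours, for intuition only; in practice you would use a library such as scikit-learn:</p>
<pre><code class="language-python">from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest
    training points. `train` is a list of (features, label) pairs."""
    neighbours = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy labeled dataset: two well-separated clusters in 2D
train = [((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
         ((8, 8), "b"), ((8, 9), "b"), ((9, 8), "b")]
print(knn_predict(train, (2, 2)))  # -> "a"
print(knn_predict(train, (8, 7)))  # -> "b"</code></pre>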
<h3 id="unsupervised-algorithms">Main Unsupervised Learning Algorithms</h3>
<ul>
<li><strong>Clustering</strong>: used to divide data into groups or clusters based on their similarities. The most common clustering algorithm is k-means.</li>
<li><strong>Principal Component Analysis (PCA)</strong>: used to reduce the dimensionality of data, that is, to transform a set of correlated variables into a set of uncorrelated variables.</li>
<li><strong>Density-Based Spatial Clustering (DBSCAN)</strong>: used to find clusters based on the density of data points.</li>
<li><strong>Association Rule Mining (Apriori, FP-Growth)</strong>: used to find association rules between variables.</li>
<li><strong>Anomaly Detection Algorithms (One-class SVM, Isolation Forest)</strong>: used to detect elements that deviate from the norm.</li>
<li><strong>Self-Organizing Maps (SOM)</strong>: used to visualize hidden structures in the data.</li>
<li><strong>Structure Detection Algorithms (Spectral Clustering, Hierarchical Clustering)</strong>: used to uncover cluster structure that simple distance-based methods miss; hierarchical clustering additionally produces a tree (dendrogram) of nested clusters.</li>
</ul>
<p>These are some of the main unsupervised Machine Learning algorithms, but there are many others. As with supervised Machine Learning, the choice of algorithm depends on the characteristics of the specific problem and the nature of the data.</p>
<p style="background-color:#f0f0f0;padding:1em 1.2em;border-radius:4px;font-size:1.1em"><strong>In practice, choosing the right algorithm for a specific solution is a critical decision that can determine the success or complete failure of a data analysis effort.</strong></p>
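<p>To make the clustering idea concrete, here is a deliberately minimal k-means sketch in plain Python. This is our own illustrative code with naive initialisation; real implementations, such as scikit-learn&#8217;s <code>KMeans</code>, handle initialisation and convergence far more carefully:</p>
<pre><code class="language-python">from math import dist
from statistics import mean

def kmeans(points, k, iterations=20):
    """Minimal k-means on 2D points: assign each point to its
    nearest centroid, move each centroid to the mean of its
    cluster, and repeat."""
    centroids = list(points[:k])  # naive init: the first k points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            (mean(x for x, _ in c), mean(y for _, y in c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, k=2)
print(clusters)  # the three small points and the three large points</code></pre>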
<h2 id="ml-process">The Main Phases of the Machine Learning Process</h2>
<ol>
<li><strong>Data Collection</strong>: The first phase consists of gathering the data needed for the problem to be solved. This data must be cleaned, formatted, and prepared for processing.</li>
<li><strong>Data Analysis</strong>: Once the data has been collected, it is important to explore it in order to better understand the problem and identify any interesting relationships or characteristics.</li>
<li><strong>Model Selection</strong>: The next phase consists of choosing the most appropriate Machine Learning model for the problem at hand. There are many available algorithms, including decision trees, neural networks, and support vector machines (SVM).</li>
<li><strong>Model Training</strong>: Once the model has been selected, it must be &#8220;trained&#8221; using the training data. This process allows the model to &#8220;learn&#8221; from the data and become capable of making predictions on new data.</li>
<li><strong>Model Evaluation</strong>: Once trained, the model must be evaluated on a test dataset to verify its accuracy.</li>
<li><strong>Model Deployment</strong>: If the model has shown good performance, it can be used to solve the problem in question and deployed to a production environment.</li>
<li><strong>Monitoring and Maintenance</strong>: The model must be monitored to ensure it continues to function correctly and, if necessary, updated or replaced if performance declines.</li>
</ol>
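<p>Phases 4 and 5 hinge on one discipline: never evaluate the model on the data it was trained on. A minimal sketch of the split-and-score step in plain Python (the helper names are ours, for illustration):</p>
<pre><code class="language-python">import random

def train_test_split(data, test_fraction=0.25, seed=42):
    """Shuffle a dataset and split it into training and test sets."""
    rng = random.Random(seed)  # fixed seed: reproducible split
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

data = list(range(100))
train, test = train_test_split(data)
print(len(train), len(test))  # -> 75 25</code></pre>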
<h2 id="getting-started">Getting Started with Machine Learning: Tutorials and Resources</h2>
<p>Machine Learning is a rapidly evolving field, and there are many resources available for those who want to get started. Any list is necessarily incomplete and subject to personal preferences, but here are some good starting points:</p>
<p><strong>Tutorials:</strong> There are numerous tutorials available online that cover the basics of Machine Learning. For example, the scikit-learn data science website has a tutorial section that explains how to use the library to build some of the most common models.<br /><a href="https://scikit-learn.org/stable/tutorial/index.html" target="_blank" rel="noreferrer noopener">https://scikit-learn.org/stable/tutorial/index.html</a></p>
<p><strong>Books:</strong> There are many books on the subject, but some of the classics in the field include:<br />&#8220;<em>Introduction to Machine Learning</em>&#8221; by Alpaydin: <a href="https://www.amazon.com/Introduction-Machine-Learning-Adaptive-Computation/dp/0262028182" target="_blank" rel="noreferrer noopener">https://www.amazon.com/Introduction-Machine-Learning-Adaptive-Computation/dp/0262028182</a><br />&#8220;<em>Python Machine Learning</em>&#8221; by Raschka and Mirjalili: <a href="https://www.packtpub.com/data/python-machine-learning-third-edition" target="_blank" rel="noreferrer noopener">https://www.packtpub.com/data/python-machine-learning-third-edition</a></p>
<p><strong>Online Courses:</strong> There are many online courses that cover the basics of Machine Learning, such as the excellent course by Andrew Ng on Coursera:<br /><a href="https://www.coursera.org/learn/machine-learning" target="_blank" rel="noopener">https://www.coursera.org/learn/machine-learning</a><br />or the Machine Learning course by fast.ai:<br /><a href="https://www.fast.ai/" target="_blank" rel="noreferrer noopener">https://www.fast.ai/</a></p>
<p><strong>Tools:</strong> There are many tools and libraries that can be used to explore data and build models. Some of the most popular include:</p>
<p><strong>scikit-learn</strong>: a Machine Learning library for Python<br /><a href="https://scikit-learn.org/stable/" target="_blank" rel="noopener">https://scikit-learn.org/stable/</a><br /><strong>TensorFlow</strong>: a Machine Learning library developed by Google<br /><a href="https://www.tensorflow.org/" target="_blank" rel="noopener">https://www.tensorflow.org/</a><br /><strong>Keras</strong>: a high-level interface for building neural networks in TensorFlow<br /><a href="https://keras.io/" target="_blank" rel="noopener">https://keras.io/</a><br /><strong>PyTorch</strong>: an open-source Machine Learning library developed by Facebook<br /><a href="https://pytorch.org/" target="_blank" rel="noreferrer noopener">https://pytorch.org/</a></p>
<p>In general, we recommend starting with tutorials and online courses to become familiar with the basic concepts, and then continuing with books and tools to deepen understanding and develop practical skills. To become a good data scientist, it is also important to work with real data and not just tutorials or exercises. Seeking out Machine Learning projects or competitions can help build concrete experience.</p>
<h2 id="jupyter-colab">Experimenting with Code: Jupyter Lab and Google Colab</h2>
<p>Jupyter Lab and Google Colab are both <strong>free and powerful tools for data exploration</strong>, learning, and testing Machine Learning code.</p>
<p>How can we use both tools to create development environments and share our work with others?</p>
<p><strong>Jupyter Lab</strong> is the new interface for <strong>Jupyter Notebook</strong> that provides an integrated development environment for working with notebooks. <strong>It is an interactive development environment that allows us to write, run, and document Python and R code within a web browser</strong>. It is particularly useful for data analysis and for learning Machine Learning.</p>
<p>To get started, Jupyter Lab needs to be installed on a local machine. This can be done easily using <strong>Anaconda</strong>, a Python distribution that includes Jupyter Lab and many other data science libraries. Once installed, Jupyter Lab can be launched from the command line and a new notebook opened to write and run code. Jupyter Lab is available at: <a href="https://jupyter.org/" target="_blank" rel="noopener">https://jupyter.org/</a></p>
<p>It is also possible to test the environment directly in the browser with JupyterLite:</p>
<figure><img decoding="async" src="https://www.gironi.it/blog/wp-content/uploads/2023/01/jubyterLite-1024x459.png" alt="JupyterLite: try the Jupyter environment in your browser" /><figcaption>JupyterLite: try the Jupyter environment in your browser</figcaption></figure>
<p><strong>Google Colab</strong> is a <strong>cloud-based</strong> development environment that allows us to write and run Python and R code within a web browser <strong>without any installation</strong>. It is a very convenient option, because Colab can be accessed from any device with an internet connection and work can be shared with others simply by providing a link. It also allows the use of a GPU or TPU to make computations more powerful. Google Colab is available at: <a href="https://colab.research.google.com/" target="_blank" rel="noopener">https://colab.research.google.com/</a></p>
<figure><img decoding="async" src="https://www.gironi.it/blog/wp-content/uploads/2023/01/google-colab-1024x506.png" alt="Google Colab: test and share code in the cloud" /><figcaption>Google Colab: test and share code in the cloud</figcaption></figure>
<p>Both tools allow us to create a sequence of cells containing code and text. <strong>Code can be executed within cells and the results displayed directly in the notebook</strong>. This makes Jupyter Lab and Google Colab ideal for exploring data, learning Machine Learning, and sharing and documenting work.</p>
<hr />
<h3>You might also like</h3>
<ul>
<li><a href="https://www.gironi.it/blog/en/correlation-and-regression-analysis/">Correlation and Regression Analysis</a></li>
<li><a href="https://www.gironi.it/blog/en/logistic-regression/">Logistic Regression</a></li>
<li><a href="https://www.gironi.it/blog/en/how-to-use-decision-trees/">How to Use Decision Trees</a></li>
</ul>
<hr />
<h3 id="further-reading">Further Reading</h3>
<p>For a solid introduction to the statistical foundations of machine learning—including regression, model selection, and prediction—<a href="https://www.amazon.it/dp/8891906190?tag=consulenzeinf-21" rel="nofollow sponsored noopener" target="_blank"><em>Introduzione all&#8217;econometria</em></a> by Stock and Watson provides the quantitative framework that underpins many ML techniques. For a hands-on guide to online experimentation and A/B testing—essential skills for deploying ML models in production—<a href="https://www.amazon.it/dp/1108724264?tag=consulenzeinf-21" rel="nofollow sponsored noopener" target="_blank"><em>Trustworthy Online Controlled Experiments</em></a> by Kohavi, Tang and Xu is the definitive reference.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.gironi.it/blog/en/understanding-the-basics-of-machine-learning-a-beginners-guide/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>The Gini Index: What It Is, Why It Matters, and How to Compute It in R</title>
		<link>https://www.gironi.it/blog/en/the-gini-index-what-it-is-why-it-matters-and-how-to-compute-it-in-r/</link>
					<comments>https://www.gironi.it/blog/en/the-gini-index-what-it-is-why-it-matters-and-how-to-compute-it-in-r/#respond</comments>
		
		<dc:creator><![CDATA[autore-articoli]]></dc:creator>
		<pubDate>Sun, 01 Mar 2026 19:47:45 +0000</pubDate>
				<category><![CDATA[statistics]]></category>
		<guid isPermaLink="false">https://www.gironi.it/blog/the-gini-index-what-it-is-why-it-matters-and-how-to-compute-it-in-r/</guid>

					<description><![CDATA[The Gini coefficient is a measure of the degree of inequality in a distribution, and is commonly used to measure income distribution. These few words alone are enough to grasp the extraordinary importance of this index for economic and political studies, and why it is worth getting to know it a little more closely. What &#8230; <a href="https://www.gironi.it/blog/en/the-gini-index-what-it-is-why-it-matters-and-how-to-compute-it-in-r/" class="more-link">Continue reading<span class="screen-reader-text"> "The Gini Index: What It Is, Why It Matters, and How to Compute It in R"</span></a>]]></description>
										<content:encoded><![CDATA[<p>The Gini coefficient is a measure of <strong>the degree of inequality in a distribution</strong>, and is commonly used to <strong>measure income distribution</strong>.</p>
<p>These few words alone are enough to grasp the extraordinary importance of this index for economic and political studies, and why it is worth getting to know it a little more closely.</p>
<p><span id="more-3483"></span></p>
<div style="border: 1px solid #ccc;padding: 1.2em 1.5em;margin: 1.5em 0;border-radius: 6px">
<h3 style="margin-top: 0">What We&#8217;ll Cover</h3>
<ul>
<li><a href="#lorenz-curve">The Lorenz Curve</a></li>
<li><a href="#example">An Example</a></li>
<li><a href="#definition">The Definition of the Concentration Index R</a></li>
<li><a href="#r-code">Computing R&#8230; in R!</a></li>
<li><a href="#python">What If I Don&#8217;t Use R?</a></li>
<li><a href="#world-data">Gini Index Values Around the World</a></li>
<li><a href="#further-reading">Further Reading</a></li>
</ul>
</div>
<hr />
<div style="border: 1px solid silver; padding: 8px; background-color: #f8f8f8; font-size: small;">A preliminary note:<br /><strong>Income is a transferable variable</strong>.<br />A quantitative variable is said to be transferable when the overall increase in the phenomenon recorded across a given population can be redistributed among the statistical units without changing its total amount.</div>
<p>The index is one of the greatest achievements of <a href="https://en.wikipedia.org/wiki/Corrado_Gini" target="_blank" rel="noopener noreferrer">Corrado Gini</a>, one of the foremost Italian statisticians (who was, unfortunately, personally connected to the fascist regime: he inspired Mussolini&#8217;s famous &#8220;Ascension Day&#8221; speech of 1927 on the issues of birth rates and <a href="https://en.wikipedia.org/wiki/Eugenics" target="_blank" rel="noopener noreferrer">eugenics</a>).</p>
<figure class="aligncenter size-large"><img loading="lazy" decoding="async" width="249" height="326" src="https://www.gironi.it/blog/wp-content/uploads/2020/11/Corrado_Gini-1.jpg" alt="Corrado Gini - the Gini index" /></figure>
<p>It was in 1912 that Gini published his article &#8220;<em><strong>Variabilità e mutabilità</strong></em>&#8221; (Variability and Mutability), in which he expanded on the work of <a href="https://en.wikipedia.org/wiki/Max_O._Lorenz" target="_blank" rel="noopener noreferrer">Max Otto Lorenz</a>, who as early as 1905 had introduced the famous curves (now known as &#8220;Lorenz curves&#8221;) describing the percentages of wealth held by increasing percentages of the population.</p>
<h2 id="lorenz-curve">The Lorenz Curve</h2>
<p>Lorenz introduced a highly effective graphical representation, placing on the horizontal axis the points P<sub>i</sub> (that is, the cumulative fraction of the first <em>i</em> income earners: P<sub>i</sub> = i / n) and on the vertical axis the corresponding values Q<sub>i</sub> (the cumulative fraction of income held by the first <em>i</em> income earners). Connecting these points produces the <strong>concentration curve</strong>, known as the <a href="https://en.wikipedia.org/wiki/Lorenz_curve" target="_blank" rel="noopener noreferrer"><strong>Lorenz curve</strong></a>.</p>
<figure class="aligncenter"><img loading="lazy" decoding="async" width="609" height="346" src="https://www.gironi.it/blog/wp-content/uploads/2018/05/Lorenz-curve1.png" alt="Lorenz curve" /></figure>
<p>The difference between P<sub>i</sub> and Q<sub>i</sub> measures, in proportion, the share of total income that the first <em>i</em> individuals lack in order to reach a state of equal distribution.<br />The larger this difference, the more the remaining <em>n &minus; i</em> individuals concentrate a significant portion of the total amount on themselves.</p>
<p>The measure of income inequality is the arithmetic mean of the normalised differences (that is, of the quantities (P<sub>i</sub> &minus; Q<sub>i</sub>) / P<sub>i</sub>, for i = 1, 2, 3, &hellip;, n &minus; 1).</p>
<p>Gini thus managed to develop, in his 1912 work and then in 1914, &#8220;his&#8221; coefficient, which <strong>measures the ratio of the area between the Lorenz curve and the 45-degree line to the whole area under the 45-degree line</strong>.</p>
<p>In practice, it indicates how much the corresponding Lorenz curve deviates from complete equality in the distribution of wealth.</p>
<p>In one sentence: the ratio of the area of concentration to its maximum (which is 0.5) coincides exactly with R.</p>
<h2 id="example">An Example</h2>
<p>Let us build the Lorenz curve: the vertical axis shows the income percentages of households, while the horizontal axis shows the percentages of households.<br />If 30% of households earned 30% of the income, 40% of households earned 40% of the income, and so on, we would have a perfectly equal distribution &mdash; that is, a straight line at 45 degrees.</p>
<figure class="aligncenter"><img loading="lazy" decoding="async" width="644" height="372" src="https://www.gironi.it/blog/wp-content/uploads/2018/05/gini-lorenz.png" alt="Lorenz curve and Gini index graph" /></figure>
<p>The Lorenz curve instead represents the actual distribution of income: the deviation of the Lorenz curve from the line of perfect equality (that is, from the 45-degree line) constitutes the measure of inequality in income distribution.</p>
<p>The ratio of the area between the line of perfect equality and the Lorenz curve (that is, the shaded area in the figure) to the area of triangle 0AB is the Gini coefficient.</p>
<h2 id="definition">The Definition of the Concentration Index R</h2>
<p>R can be defined independently of the Lorenz curve: it equals the <strong>normalised</strong> simple mean difference divided by its maximum, that is:</p>
<p>\(<br />
R = \frac{\text{mean absolute difference}}{2 \times \text{mean of values}} \\<br />
\)</p>
<p>R is therefore an index expressed as a number between the theoretical values 0 and 1 &mdash; theoretical because they correspond, respectively, to the case of perfect equity in wealth distribution (everyone has the same income) and the case of maximum inequality (a single unit holds all the income). It is a &#8220;pure&#8221; value that allows comparison between different countries or territorial areas, proving <strong>extraordinarily useful in the field of socio-economic analysis</strong>.</p>
<h2 id="r-code">Computing R&#8230; in R!</h2>
<p>Countless R libraries contain a function for calculating the Gini index (the most widely used package is probably &#8220;<em>ineq</em>&#8221;, easily found with a search on CRAN), which is not included among R&#8217;s base functions.</p>
<p>However, since the calculation itself is not particularly complex, we find it useful to present a version of the function below.</p>
<p><strong>1 &ndash; We start by computing the mean absolute difference</strong></p>
<pre><code class="language-r">Delta &lt;- function(variable) {
  n &lt;- length(variable)
  avg &lt;- mean(variable)
  sorted_variable &lt;- sort(variable)
  (4 * sum((1:n) * sorted_variable) / n - 2 * avg * (n + 1)) / (n - 1)
}</code></pre>
<p><strong>2 &ndash; Now obtaining the Gini concentration ratio is just one line!</strong></p>
<pre><code class="language-r">gini &lt;- Delta(variable) / (2 * mean(variable))</code></pre>
<h2 id="python">What If I Don&#8217;t Use R?</h2>
<p>Fair point. R is a fantastic tool, but not everyone uses it. An index as important as Gini can be useful to many people who do not deal with statistics every day and are not familiar with R. The most universal and widespread programming language, even among non-programmers, is Python. Naturally, as with R, there are many possible implementations of the Gini coefficient, but in this case too, doing it ourselves is simple and instructive.</p>
<p>The solution we liked best comes from a post on <a href="https://planspace.org/2013/06/21/how-to-calculate-gini-coefficient-from-raw-data-in-python/" target="_blank" rel="noopener noreferrer">planspace.org</a> &mdash; here is the function, 8 lines in all:</p>
<pre><code class="language-python">def gini(list_of_values):
    sorted_list = sorted(list_of_values)
    height, area = 0, 0
    for value in sorted_list:
        height += value
        area += height - value / 2.
    fair_area = height * len(list_of_values) / 2.
    return (fair_area - area) / fair_area</code></pre>
<p>First, the function sorts the list of values in ascending order. Then, a <code>for</code> loop accumulates the running total of the values (<code>height</code>) and the area under the Lorenz curve, approximated as a sum of unit-width trapezoids: at each step the strip under the curve contributes <code>height - value / 2.</code></p>
<p><code>fair_area</code> is the area that would lie under the curve if the distribution were perfectly equal: half of the final cumulative total multiplied by the number of values.</p>
<p>Finally, the Gini index is computed as the difference between the fair area and the actual area under the Lorenz curve, divided by the fair area.</p>
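<p>Two quick sanity checks confirm the behaviour at the extremes (the function is repeated here so the snippet runs on its own):</p>
<pre><code class="language-python">def gini(list_of_values):
    sorted_list = sorted(list_of_values)
    height, area = 0, 0
    for value in sorted_list:
        height += value
        area += height - value / 2.
    fair_area = height * len(list_of_values) / 2.
    return (fair_area - area) / fair_area

print(gini([100, 100, 100, 100]))  # perfect equality -> 0.0
print(gini([0, 0, 0, 400]))        # one unit holds everything -> 0.75
print(gini([1, 2, 3, 4]))          # -> 0.25</code></pre>
<p>Note that for a finite sample of n units this implementation tops out at (n &minus; 1)/n rather than exactly 1, which is why the maximally unequal four-element list returns 0.75 instead of 1.</p>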
<h2 id="world-data">Gini Index Values Around the World</h2>
<ul>
<li>For a general overview, we can <a href="http://www.oecd.org/social/income-distribution-database.htm" target="_blank" rel="noopener noreferrer">visit the website</a> of the Organisation for Economic Co-operation and Development (OECD).</li>
<li>A comparison of values across European countries is provided by <a href="http://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=ilc_di12" target="_blank" rel="noopener noreferrer">Eurostat</a>.</li>
<li>On the <a href="http://dati.istat.it/Index.aspx?QueryId=4836" target="_blank" rel="noopener noreferrer">ISTAT website</a> it is possible to compare Gini index data across the various Italian regions.</li>
</ul>
<hr />
<h3>You might also like</h3>
<ul>
<li><a href="https://www.gironi.it/blog/en/descriptive-statistics-measures-of-position/">Descriptive Statistics: Measures of Position</a></li>
<li><a href="https://www.gironi.it/blog/en/descriptive-statistics-measures-of-variability-or-dispersion/">Descriptive Statistics: Measures of Variability</a></li>
<li><a href="https://www.gironi.it/blog/en/the-data-the-4-scales-of-measurement/">The Data: The 4 Scales of Measurement</a></li>
</ul>
<hr />
<h3 id="further-reading">Further Reading</h3>
<p>For a brilliant, accessible exploration of statistical thinking—including how inequality measures like the Gini coefficient help us understand the world—<a href="https://www.amazon.it/dp/8806246623?tag=consulenzeinf-21" rel="nofollow sponsored noopener" target="_blank"><em>The Art of Statistics</em></a> by David Spiegelhalter offers a masterful blend of rigour and clarity.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.gironi.it/blog/en/the-gini-index-what-it-is-why-it-matters-and-how-to-compute-it-in-r/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Contingency Tables and Conditional Probability</title>
		<link>https://www.gironi.it/blog/en/contingency-tables-and-conditional-probability/</link>
					<comments>https://www.gironi.it/blog/en/contingency-tables-and-conditional-probability/#respond</comments>
		
		<dc:creator><![CDATA[autore-articoli]]></dc:creator>
		<pubDate>Sun, 01 Mar 2026 19:47:43 +0000</pubDate>
				<category><![CDATA[statistics]]></category>
		<guid isPermaLink="false">https://www.gironi.it/blog/contingency-tables-and-conditional-probability/</guid>

					<description><![CDATA[Contingency tables are used to evaluate the interaction between two categorical variables (qualitative). They are also called two-way tables or cross-tabulations. Searching for relationships between two categorical variables is a very common goal for researchers. Think, for example, of the classic question that marketers ask: who is more likely to buy certain product categories, young &#8230; <a href="https://www.gironi.it/blog/en/contingency-tables-and-conditional-probability/" class="more-link">Continue reading<span class="screen-reader-text"> "Contingency Tables and Conditional Probability"</span></a>]]></description>
										<content:encoded><![CDATA[<p><strong>Contingency tables</strong> are used to evaluate the <strong>interaction between two categorical variables</strong> (qualitative). They are also called two-way tables or cross-tabulations.</p>
<p>Searching for relationships between two categorical variables is a very common goal for researchers. Think, for example, of the classic question that marketers ask: who is more likely to buy certain product categories, young or old people, men or women&#8230;</p>
<p><span id="more-3482"></span></p>
<div style="border: 1px solid #ccc;padding: 1.2em 1.5em;margin: 1.5em 0;border-radius: 6px">
<h3 style="margin-top: 0">What We&#8217;ll Cover</h3>
<ul>
<li><a href="#two-way-tables">Two-Way Tables and Marginal Distributions</a></li>
<li><a href="#conditional-probability">Conditional Probability</a></li>
<li><a href="#dependence-independence">Dependence and Independence</a></li>
<li><a href="#further-reading">Further Reading</a></li>
</ul>
</div>
<hr />
<h2 id="two-way-tables">Two-Way Tables and Marginal Distributions</h2>
<p>A <strong>two-way table</strong> is a table with rows and columns that helps organize data from categorical variables:</p>
<ul>
<li><strong>Rows</strong> represent the possible categories for one qualitative variable, for example males and females.</li>
<li><strong>Columns</strong> represent the possible categories for a second qualitative variable, for example whether someone likes pizza or not&#8230;</li>
</ul>
<p>A <strong>marginal distribution</strong> shows how many total responses there are for each category of the variable. The marginal distribution of a variable can be determined by looking at the &#8220;Total&#8221; column (or row).</p>
<p>Let&#8217;s look at an example.</p>
<p><em>Note: I couldn&#8217;t think of anything particularly clever, so I created a table (with fictitious data, of course) of rare silliness, imagining that the two categorical variables concern education level and favorite sci-fi series&#8230;</em></p>
<p>We build the table in R:</p>
<pre><code class="language-r">scifi_fans <- matrix(c(44, 38, 26, 53, 35, 30, 58, 22, 29), ncol = 3, byrow = TRUE)
rownames(scifi_fans) <- c("degree", "diploma", "lower education")
colnames(scifi_fans) <- c("star trek", "star wars", "doctor who")
scifi_fans <- as.table(scifi_fans)
scifi_fans</code></pre>
<p>and we get something like this:</p>
<pre>                 star trek   star wars   doctor who
degree               44          38          26
diploma              53          35          30
lower education      58          22          29</pre>
<p><img decoding="async" src="https://www.gironi.it/blog/wp-content/uploads/2023/03/26e3bb37-2a8f-4f6c-9e5d-fddf6a1bb60f-1024x1024.jpeg" alt="Fantasy image for the sci-fi dataset used to discuss contingency tables and conditional probability" loading="lazy"/></p>
<p>Remember? A <strong>marginal distribution</strong> shows how many total responses there are for each category of the variable (at the margins, precisely, where the Total column or row is...).</p>
<p>We can compute row totals in R with:</p>
<pre><code class="language-r">margin.table(scifi_fans, 1)</code></pre>
<p>and column totals with:</p>
<pre><code class="language-r">margin.table(scifi_fans, 2)</code></pre>
<p>We can also find the "grand total" with:</p>
<pre><code class="language-r">margin.table(scifi_fans)</code></pre>
<p>Here is the table with totals:</p>
<pre>              star trek   star wars   doctor who   <strong>TOTAL</strong>
degree            44          38          26        <strong>108</strong>
diploma           53          35          30        <strong>118</strong>
lower ed.         58          22          29        <strong>109</strong>
<strong>TOTAL            155          95          85        335</strong></pre>
<p>So the marginal totals by education level are 108 for degree holders, 118 for diploma holders, 109 for lower education.</p>
<p>Likewise, the marginal totals by sci-fi series type are 155 for Star Trek, 95 for Star Wars, 85 for Doctor Who.</p>
<p>The grand total must be the same in both directions, in this case 335.</p>
<p>We could also have displayed a complete table with totals using just a few lines of R code:</p>
<pre><code class="language-r">scifi_fans <- matrix(c(44, 38, 26, 53, 35, 30, 58, 22, 29), ncol = 3, byrow = TRUE)

row_names <- c("degree", "diploma", "lower education")
col_names <- c("star trek", "star wars", "doctor who")
dimnames(scifi_fans) <- list(row_names, col_names)

# Compute column totals using apply
col_totals <- apply(scifi_fans, 2, sum)
# Add row with column totals using rbind
scifi_fans2 <- rbind(scifi_fans, col_totals)
# Compute row totals
row_totals <- apply(scifi_fans2, 1, sum)
# Add column with row totals
cont_table <- cbind(scifi_fans2, row_totals)

# Print the table
cont_table</code></pre>
<p>We can then ask ourselves (and answer): what percentage of degree holders has a soft spot for Doctor Who?<br />Elementary, Watson (oh wait, that was a different series...):</p>
<p><strong>26/108 = 0.24 = 24% of degree holders prefer Doctor Who</strong></p>
<p>And how many Star Wars fans hold a diploma?</p>
<p><strong>35/95 = 0.37 = 37% of Star Wars fans are diploma holders</strong></p>
<p>In R, we can directly obtain row proportions with the function:</p>
<pre><code class="language-r">prop.table(scifi_fans, 1)</code></pre>
<p>and the result will be:</p>
<pre>                 star trek    star wars    doctor who
degree           0.4074074    0.3518519    0.2407407
diploma          0.4491525    0.2966102    0.2542373
lower ed.        0.5321101    0.2018349    0.2660550</pre>
<p>(as we can see, the row totals add up to 1, or 100%)</p>
<p>or column proportions with:</p>
<pre><code class="language-r">prop.table(scifi_fans, 2)</code></pre>
<p>and the result will be:</p>
<pre>                 star trek    star wars    doctor who
degree           0.2838710    0.4000000    0.3058824
diploma          0.3419355    0.3684211    0.3529412
lower ed.        0.3741935    0.2315789    0.3411765</pre>
<p>(as we can see, the column totals add up to 1, or 100%)</p>
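<p>For readers who prefer Python, the same marginal totals and proportions can be computed with numpy (a sketch using the same fictitious data as the R example above):</p>

```python
import numpy as np

# Same fictitious data as the R example above
scifi_fans = np.array([[44, 38, 26],
                       [53, 35, 30],
                       [58, 22, 29]])

row_totals = scifi_fans.sum(axis=1)   # like margin.table(scifi_fans, 1)
col_totals = scifi_fans.sum(axis=0)   # like margin.table(scifi_fans, 2)
grand_total = scifi_fans.sum()        # like margin.table(scifi_fans)

row_props = scifi_fans / row_totals[:, None]  # like prop.table(scifi_fans, 1)
col_props = scifi_fans / col_totals           # like prop.table(scifi_fans, 2)

print(row_totals, col_totals, grand_total)    # [108 118 109] [155 95 85] 335
```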
<p>As always, there is more than one way to get the result. We can also install the "gmodels" package and use the CrossTable function (we'll leave it to R's built-in help to show all the command options...):</p>
<pre><code class="language-r">install.packages("gmodels")
library(gmodels)
scifi_fans <- matrix(c(44, 38, 26, 53, 35, 30, 58, 22, 29), ncol = 3, byrow = TRUE)
rownames(scifi_fans) <- c("degree", "diploma", "lower education")
colnames(scifi_fans) <- c("star trek", "star wars", "doctor who")

CrossTable(scifi_fans, prop.r = FALSE, prop.c = FALSE, prop.t = FALSE, prop.chisq = FALSE)</code></pre>
<p>So what is all this good for? The answer is: for example, to compute <strong>conditional probability</strong>.</p>
<hr />
<h2 id="conditional-probability">Conditional Probability</h2>
<p>Before we see what it is and why it is an extremely useful concept in everyday life, we need a few preliminary definitions about <a href="https://www.gironi.it/blog/en/first-steps-into-the-world-of-probability/">probability</a>.</p>
<p>An experiment is the process of making a measurement or an observation.<br />An event is an outcome, or set of outcomes, of an experiment.</p>
<p><strong>Key definition: <em><a href="https://www.gironi.it/blog/en/first-steps-into-the-world-of-probability/">the probability of an event is the ratio of the number of favorable cases to the number of possible cases</a></em></strong></p>
<p>\( P(A) = \frac {\text{number of favorable cases}}{\text{number of possible cases}}\\ \)</p>
<p>Let us also recall that:</p>
<ul>
<li>The probability that two events both occur can never be greater than the probability that each event occurs separately.</li>
<li>If two possible events, A and B, are independent, then the probability that both occur is the product of their individual probabilities.</li>
<li>If an event has a number of mutually exclusive possible outcomes (A, B, C, etc.), then the probability that A or B occurs equals the sum of the individual probabilities of A and B, and the sum of the probabilities of all possible outcomes (A, B, C, etc.) equals 1, i.e. 100%.</li>
</ul>
<p>The <strong>conditional probability</strong> of an event A with respect to an event B is the probability that A occurs, given that B has occurred.</p>
<p>The formula is:</p>
<p>\( P(A|B) = \frac {P(A \text{ and } B)}{P(B)}\\ \)</p>
<p>If a probability is based on <strong>one variable</strong> it is a <strong>marginal probability</strong>; if on <strong>two or more variables</strong> it is called a <strong>joint probability</strong>.</p>
<ul>
<li>The <strong>probability of an event</strong> P(A) is: \( \frac {\text{marginal total of A}}{\text{grand total}}\\ \)</li>
<li>The <strong>joint probability of two events</strong> P(A and B) is: \( \frac {\text{count in the cell } (A \text{ and } B)}{\text{grand total}}\\ \)</li>
<li>The <strong>conditional probability</strong> of outcome A given the occurrence of condition B is: \( \frac {P(A \text{ and } B)}{P(B)}\\ \)</li>
</ul>
<p>In other words:</p>
<p>A <strong>joint probability</strong> is the probability that someone selected from the entire group has two particular characteristics at the same time. That is, both characteristics occur jointly. We find a joint probability by taking the value of the cell at the intersection of A and B and dividing by the grand total.</p>
<p>To find a <strong>conditional probability</strong>, we take the value of the cell at the intersection of A and B and divide it by the marginal total of B, i.e. the variable expressing the event that has occurred.</p>
<hr />
<p>It's time for a second example. We take the data from:<br /><em>Ellis GJ and Stone LH. 1979. Marijuana Use in College: An Evaluation of a Modeling Explanation. Youth and Society 10:323-334.</em></p>
<p>The study asks whether a college student is more likely to smoke marijuana if their parents had used drugs in the past. Here is the table:</p>
<pre>                    parents    parents     <strong>Total</strong>
                      use      no use
student uses          125        94         <strong>219</strong>
student does not use   85       141         <strong>226</strong>
<strong>Total                 210       235         445</strong></pre>
<p>Let's apply our knowledge to answer these questions:</p>
<ol>
<li><strong><em>If the parents used soft drugs in the past, what is the probability that their child does the same in college?</em></strong></li>
</ol>
<p>This is a case of conditional probability.<br />We recall \( P(A|B) = \frac {P(A \text{ and } B)}{P(B)}\\ \), therefore</p>
<p>P(<em>student uses given that parents used</em>) = 125 / 210 = 0.59 = 59%</p>
<p>2. <strong><em>A student is selected at random and does not use marijuana. What is the probability that their parents used it?</em></strong></p>
<p>Here again we face a question that asks for a conditional probability. Therefore:</p>
<p>P(<em>parents used given that student does not use</em>) = 85 / 226 = 0.376 = 37.6%</p>
<p>3. <strong><em>What is the probability of selecting a student who does not use marijuana and whose parents used it in the past?</em></strong></p>
<p>In this case we need to find a joint probability, so:</p>
<p>we take the count in the cell and divide it by the grand total: \( \frac {85}{445} = 0.19\\ \).</p>
<p>The probability is approximately 19%.</p>
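<p>The three answers can be double-checked with a few lines of Python, reading the counts straight from the table:</p>

```python
# Counts from the Ellis & Stone (1979) table
uses_parents_use, uses_parents_no = 125, 94
no_parents_use, no_parents_no = 85, 141

total_parents_use = uses_parents_use + no_parents_use   # 210
total_student_no = no_parents_use + no_parents_no       # 226
grand_total = (uses_parents_use + uses_parents_no
               + no_parents_use + no_parents_no)        # 445

p1 = uses_parents_use / total_parents_use  # P(student uses | parents used)
p2 = no_parents_use / total_student_no     # P(parents used | student does not use)
p3 = no_parents_use / grand_total          # joint probability

print(round(p1, 2), round(p2, 3), round(p3, 2))  # 0.6 0.376 0.19
```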
<h2 id="dependence-independence">Dependence and Independence</h2>
<p>If the outcomes of A and B influence each other, we say that <strong>the two variables are in a relationship of dependence</strong>.<br />Otherwise, we say the two variables are independent.</p>
<p>More rigorously: we can state that event B is independent of event A if:</p>
<p>P(B|A) = P(B)</p>
<p>or</p>
<p>P(A|B) = P(A)</p>
<p>If this is not the case, the events are dependent on each other.</p>
<p>Therefore:</p>
<ul>
<li>P(A and B) = P(A) P(B) if and only if A and B are independent events.</li>
<li>P(A | B) = P(A) and P(B | A) = P(B) if and only if A and B are independent events.</li>
</ul>
<h3>Let's examine the independence of categorical variables...</h3>
<p>Let's explain this better with an example.</p>
<p>Let A be the event that people enjoy cycling.<br />B expresses whether they enjoy roast lamb. (Makes perfect sense, right?)</p>
<p>We build our contingency table:</p>
<pre>                  Likes cycling   Doesn't like cycling   <strong>Total</strong>
Likes roast lamb       95                36               <strong>131</strong>
No roast lamb          15                19                <strong>34</strong>
---------------------------------------------------------------
<strong>Total                 110                55               165</strong></pre>
<p>Let's remember what it means for two events to be independent. It means this:<br />P(A | B) = P(A)</p>
<p>But in our case we see that<br />P(A) = 66.7%<br />because 110/165 = 0.67</p>
<p>P(A | B) = 72.5%<br />because 95/131 = 0.725</p>
<p>We recall that \( P(A|B) = \frac {P(A \text{ and } B)}{P(B)}\\ \), therefore \( \frac {95}{131} = 0.725\\ \).</p>
<p>From the result it is clear that \( P(A) \neq P(A|B) \) -- the two events are NOT independent (therefore they are dependent).</p>
<p>After all, everyone knows that there is a clear dependence between loving cycling and loving roast lamb!</p>
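<p>The independence check itself takes only a couple of lines of Python, using the counts from the table:</p>

```python
# Counts from the cycling / roast lamb table
both = 95            # likes cycling AND likes roast lamb
likes_lamb = 131     # marginal total for roast lamb (B)
likes_cycling = 110  # marginal total for cycling (A)
total = 165

p_a = likes_cycling / total       # P(A)   ~ 0.667
p_a_given_b = both / likes_lamb   # P(A|B) ~ 0.725

# Independence would require P(A|B) == P(A)
print(abs(p_a - p_a_given_b) < 0.01)   # False: the events are dependent
```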
<hr />
<h3>You might also like</h3>
<ul>
<li><a href="https://www.gironi.it/blog/en/first-steps-into-the-world-of-probability/">First Steps into the World of Probability</a></li>
<li><a href="https://www.gironi.it/blog/en/the-chi-square-test/">The Chi-Square Test</a></li>
<li><a href="https://www.gironi.it/blog/en/bayesian-statistics-how-to-learn-from-data-one-step-at-a-time/">Bayesian Statistics</a></li>
</ul>
<hr />
<h3 id="further-reading">Further Reading</h3>
<p>For a comprehensive treatment of contingency tables, conditional probability, and the full machinery of categorical data analysis, <a href="https://www.amazon.it/dp/8891910651?tag=consulenzeinf-21" rel="nofollow sponsored noopener" target="_blank"><em>Statistica</em></a> by Newbold, Carlson and Thorne provides a rigorous yet accessible framework for applying these concepts in real-world settings.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.gironi.it/blog/en/contingency-tables-and-conditional-probability/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>The Poisson Distribution</title>
		<link>https://www.gironi.it/blog/en/the-poisson-distribution/</link>
					<comments>https://www.gironi.it/blog/en/the-poisson-distribution/#respond</comments>
		
		<dc:creator><![CDATA[autore-articoli]]></dc:creator>
		<pubDate>Sun, 01 Mar 2026 19:47:37 +0000</pubDate>
				<category><![CDATA[statistics]]></category>
		<guid isPermaLink="false">https://www.gironi.it/blog/the-poisson-distribution/</guid>

					<description><![CDATA[The Poisson distribution is a discrete probability distribution that describes the number of events occurring in a fixed interval of time or area. The Poisson distribution is useful for measuring how many events can occur within a given time horizon, such as the number of customers entering a shop in the next hour, or the &#8230; <a href="https://www.gironi.it/blog/en/the-poisson-distribution/" class="more-link">Continue reading<span class="screen-reader-text"> "The Poisson Distribution"</span></a>]]></description>
										<content:encoded><![CDATA[<p>The Poisson distribution is a <strong>discrete probability distribution</strong> that describes the number of events occurring in a fixed interval of time or area.</p>
<p>The Poisson distribution is useful for <strong>measuring how many events can occur within a given time horizon</strong>, such as the number of customers entering a shop in the next hour, or the number of pageviews on a website in the next minute, and so on.</p>
<figure class="aligncenter"><img fetchpriority="high" decoding="async" src="https://www.gironi.it/blog/wp-content/uploads/2020/11/Simeon_Poisson.jpg" alt="The Poisson Distribution: Siméon-Denis Poisson" width="400" height="469" /><figcaption><strong><a href="https://en.wikipedia.org/wiki/Sim%C3%A9on-Denis_Poisson" target="_blank" rel="noreferrer noopener">Siméon-Denis Poisson</a></strong></figcaption></figure>
<p><span id="more-3481"></span></p>
<div style="border: 1px solid #ccc;padding: 1.2em 1.5em;margin: 1.5em 0;border-radius: 6px">
<h3 style="margin-top: 0">What We&#8217;ll Cover</h3>
<ul>
<li><a href="#lambda">Lambda: The Average Rate of Events</a></li>
<li><a href="#poisson-binomial">Poisson and Binomial: A Side Note</a></li>
<li><a href="#practical-example">A Practical Example</a></li>
<li><a href="#seo-application">The Poisson Distribution Applied to SEO</a></li>
<li><a href="#alternative-models">Alternative Models for Web Traffic Analysis</a></li>
<li><a href="#clicks-example">Using Poisson for Website Click Estimates</a></li>
<li><a href="#further-reading">Further Reading</a></li>
</ul>
</div>
<hr />
<h2 id="lambda">Lambda: The Average Rate of Events</h2>
<p>An important element: <strong>each time interval is assumed to be independent of all others.</strong></p>
<p>We need to know the <strong>average number of events or the rate at which they occur within the time interval</strong>. We represent this value with the Greek letter <strong>lambda</strong>:</p>
<p>\( X \sim Po(\lambda) \\ \\ \)</p>
<p>To calculate the probability that there are r occurrences in a specific interval:</p>
<p>\( P (X=r) = \frac{e^{-\lambda} \lambda^{r}}{r!} \\ \\ \)</p>
<p>For example, if:</p>
<p>\( X \sim Po(2) \\ \\ r=3 \)</p>
<p>we get:</p>
<p>\( P (X=3) = \frac{e^{-2} \cdot 2^{3}}{3!} =\frac{e^{-2} \cdot 8}{6} = e^{-2} \cdot 1.333 = 0.180 \\ \\ \)</p>
<p>That is, 18%.</p>
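<p>One line of Python confirms the arithmetic:</p>

```python
import math

lam, r = 2, 3
# Poisson PMF computed from the formula above
p = math.exp(-lam) * lam**r / math.factorial(r)
print(round(p, 3))  # 0.18
```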
<hr />
<h2 id="poisson-binomial">Poisson and Binomial: A Side Note</h2>
<p>If</p>
<p>\( X \sim Po(\lambda_X) \\ Y \sim Po(\lambda_Y) \\ \\ \)</p>
<p>then</p>
<p>\( X + Y \sim Po(\lambda_X + \lambda_Y) \\ \\ \)</p>
<p>If</p>
<p>\( X \sim Bin(n,p) \\ \\ \)</p>
<p>and n is large and p is small, then we can approximate the <a href="https://www.gironi.it/blog/en/probability-distributions-discrete-distributions-and-the-binomial/">binomial</a> with the Poisson:</p>
<p>\( X \sim Po(n \cdot p) \\ \\ \)</p>
<h3>Differences Between the Poisson and Binomial Distributions</h3>
<p>The Poisson and binomial distributions are both discrete probability distributions, and both are often used to model counts of rare events. The main difference between the two lies in how the number of trials is treated.</p>
<p><strong>The binomial distribution is biparametric</strong>, meaning it is characterised by two parameters n and p, where n represents the number of trials and p the probability of success in each trial.</p>
<p>In contrast, <strong>the Poisson distribution is uniparametric</strong>, meaning it is characterised by a single parameter &lambda; representing the average number of events per interval.</p>
<p>Furthermore, the binomial distribution is used when the number of trials is finite and the number of successes cannot exceed n, whereas the Poisson distribution is used when the number of trials is essentially infinite.</p>
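<p>The approximation is easy to see numerically. A small Python sketch, with arbitrarily chosen n = 1000 and p = 0.002, so that np = 2:</p>

```python
import math

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def pois_pmf(k, lam):
    """Poisson probability of k events with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

n, p = 1000, 0.002   # n large, p small, lambda = n * p = 2
for k in range(5):
    print(k, round(binom_pmf(k, n, p), 4), round(pois_pmf(k, n * p), 4))
# The two columns agree to roughly three decimal places
```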
<hr />
<h2 id="practical-example">A Practical Example</h2>
<p>A vending machine malfunctions on average 3.4 times per week. What is the probability that the machine will <strong>not</strong> break down next week?</p>
<p>\( P (X=0) = \frac{e^{- \lambda} \cdot \lambda ^{r}}{r!} \\ \\ = \frac{e^{-3.4} \cdot 3.4 ^{0}}{0!} = \\ \frac{e^{-3.4} \cdot 1}{1} = 0.033 \\ \)</p>
<p>We notice that the probability is very low indeed &mdash; just 3.3%.</p>
<p>Note: X=0 because we are looking at the probability that the machine does <strong>not</strong> break down.</p>
<figure class="aligncenter"><img decoding="async" src="https://www.gironi.it/blog/wp-content/uploads/2023/03/DALL·E-2023-03-17-16.45.37-Una-distributore-automatico-di-bevande-in-stile-pop-art-300x300.png" alt="A vending machine: an example to explain the Poisson distribution" width="300" height="300" /><figcaption><em>a battered vending machine&hellip;</em></figcaption></figure>
<p>In R we would use the command:</p>
<pre><code class="language-r">dpois(0, 3.4)</code></pre>
<p>Now let us calculate the probability that the vending machine breaks down exactly 3 times during the next week.</p>
<p>\( P (X=3) = \frac{e^{-3.4} \cdot 3.4 ^{3}}{3!} = \frac{e^{-3.4} \cdot 39.304}{6} = 0.216 \\ \)</p>
<p>The probability is 21.6%.</p>
<p>Moving on to a third question: what are the expected value and the variance of the vending machine malfunctions?</p>
<p>\( E(X) = \lambda = 3.4 \\ Var(X) = \lambda = 3.4 \\ \)</p>
<p style="background-color:#f0f0f0;padding:1em;">As we can see, within the Poisson distribution lambda represents not only the mean but also the <strong>variance</strong>. This is known as the <strong>mean-variance equality property of the Poisson distribution</strong>.</p>
<p>Therefore, if lambda is large, the variance is also large and the distribution is more spread out in absolute terms (although the spread relative to the mean shrinks, since the coefficient of variation is \( 1/\sqrt{\lambda} \)); if lambda is small, both the mean and the variance are small.</p>
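<p>The mean-variance equality is easy to verify by simulation (a sketch using numpy's random generator; the seed is arbitrary):</p>

```python
import numpy as np

rng = np.random.default_rng(42)
draws = rng.poisson(lam=3.4, size=200_000)

# Sample mean and sample variance should both be close to lambda = 3.4
print(round(draws.mean(), 2), round(draws.var(), 2))
```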
<hr />
<h2 id="seo-application">The Poisson Distribution Applied to SEO</h2>
<p>There are several aspects that make the Poisson distribution potentially interesting for website traffic analysis. It is a <strong>simple and well-understood statistical model</strong> that can be readily applied to website traffic data &mdash; for example, to estimate the <strong>average rate of requests or visits per unit of time</strong> and to predict the <strong>probability of observing a certain number of requests or visits in the future</strong>.</p>
<p>However, we should keep in mind that there are also many limitations to using the Poisson distribution for SEO-oriented web traffic analysis.</p>
<p>First, the Poisson distribution <strong>assumes that events occur independently and at a constant rate, which may not always hold for website traffic</strong>. For example, website traffic might exhibit <strong>spikes</strong> or <strong>bursty patterns</strong> that the Poisson distribution cannot capture.</p>
<p>Second, the <strong>Poisson distribution is a memoryless process</strong>, meaning it does not account for any history of past events. This can be a limitation when analysing website traffic data that display <strong>trends or seasonality</strong>.</p>
<p>Third, the Poisson distribution assumes that <strong>events are discrete and countable</strong>, which may not always be appropriate for modelling continuous variables such as response time or page load time. Finally, the Poisson distribution is a simple model that may not capture all the complexities of real-world website traffic.</p>
<p>There are several alternative models for website traffic analysis that can be used when the Poisson distribution is not appropriate.</p>
<hr />
<h2 id="alternative-models">Alternative Models for Web Traffic Analysis</h2>
<p>One alternative is the <strong>Negative Binomial distribution</strong>, which can handle overdispersion and capture the spikes or bursty patterns often seen in website traffic data.</p>
<p>Another alternative is the <strong>Lognormal distribution</strong>, which can be used to model continuous variables such as response time or page load time.</p>
<p>The <strong>Exponential distribution</strong> can also be used to model the time intervals between requests or visits to a website.</p>
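<p>The last point deserves a quick illustration: if the gaps between visits are exponentially distributed with rate \( \lambda \), the number of visits per unit of time is Poisson-distributed with mean \( \lambda \). A simulation sketch (the rate and seed are arbitrary):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
rate = 10  # average visits per hour

# Exponential gaps between consecutive visits (in hours)
gaps = rng.exponential(scale=1 / rate, size=100_000)
arrivals = np.cumsum(gaps)

# Count the visits falling in each whole hour
n_hours = int(arrivals[-1])
counts, _ = np.histogram(arrivals, bins=np.arange(n_hours + 1))

# Mean and variance of the hourly counts are both close to the rate
print(round(counts.mean(), 1), round(counts.var(), 1))
```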
<hr />
<h2 id="clicks-example">Using Poisson for Website Click Estimates</h2>
<p>Suppose we have a website that receives on average 10 clicks per hour and we want to estimate the probability of getting a certain number of clicks in one hour using the Poisson distribution. We can use R to carry out the following steps:</p>
<ol>
<li>We start by loading the ggplot2 library and defining the average number of clicks per hour (our lambda):</li>
</ol>
<pre><code class="language-r">library(ggplot2)

# Average number of clicks per hour
lam <- 10</code></pre>
<p>2. We now compute the probability mass function (PMF) of the Poisson distribution for each possible number of clicks using the dpois() function. For example, to calculate the probability of getting exactly 15 clicks:</p>
<pre><code class="language-r">clicks <- 15
prob <- dpois(clicks, lam)
cat(paste("The probability of getting", clicks, "clicks per hour is", prob, "\n"))</code></pre>
<p>The output is:</p>
<pre>The probability of getting 15 clicks per hour is 0.0347180696306841</pre>
<p>3. We compute the PMF of the Poisson distribution for a range of possible click counts using dpois() and display the results in a chart. For example, to calculate the probability of getting from 0 to 30 clicks:</p>
<pre><code class="language-r">x <- 0:30
pmf <- dpois(x, lam)</code></pre>
<p>4. We now plot the probability for each possible number of clicks:</p>
<pre><code class="language-r">ggplot(data.frame(x=x, pmf=pmf), aes(x, pmf)) +
  geom_bar(stat="identity") +
  xlab("Number of clicks") +
  ylab("Probability") +
  ggtitle(paste("PMF of the Poisson distribution with lambda =", lam))</code></pre>
<figure class="aligncenter"><img decoding="async" src="https://www.gironi.it/blog/wp-content/uploads/2023/03/PMF-Poisson.png" alt="PMF of the Poisson distribution" /></figure>
<p>5. We compute the CDF of the Poisson distribution and plot it:</p>
<pre><code class="language-r"># Compute the CDF of the Poisson distribution
cdf <- ppois(x, lam)

# Plot the CDF
ggplot(data.frame(x=x, cdf=cdf), aes(x, cdf)) +
  geom_step() +
  xlab("Number of clicks") +
  ylab("Cumulative probability") +
  ggtitle(paste("CDF of the Poisson distribution with lambda =", lam))</code></pre>
<figure class="aligncenter"><img decoding="async" src="https://www.gironi.it/blog/wp-content/uploads/2023/03/CDF-Poisson.png" alt="CDF of the Poisson distribution" /></figure>
<p>6. We calculate the smallest number of clicks for which the cumulative probability reaches 90%:</p>
<pre><code class="language-r">q <- qpois(0.9, lam)
cat(paste("With 90% probability, the number of clicks per hour is at most", q, "\n"))</code></pre>
<p>The output is:</p>
<pre>With 90% probability, the number of clicks per hour is at most 14</pre>
<p>For convenience, here is the equivalent Python script:</p>
<pre><code class="language-python">import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import poisson

# Define the average number of clicks per hour
lam = 10

clicks = 15
prob = poisson.pmf(clicks, lam)
print(f"The probability of getting {clicks} clicks per hour is {prob}")

x = np.arange(0, 31)
pmf = poisson.pmf(x, lam)

plt.bar(x, pmf)
plt.xlabel('Number of clicks')
plt.ylabel('Probability')
plt.title(f'PMF of the Poisson distribution with lambda = {lam}')
plt.show()

cdf = poisson.cdf(x, lam)

plt.step(x, cdf)
plt.xlabel('Number of clicks')
plt.ylabel('Cumulative probability')
plt.title(f'CDF of the Poisson distribution with lambda = {lam}')
plt.show()

q = poisson.ppf(0.9, lam)
print(f"With 90% probability, the number of clicks per hour is at most {int(q)}")</code></pre>
<hr />
<h3>You might also like</h3>
<ul>
<li><a href="https://www.gironi.it/blog/en/probability-distributions-discrete-distributions-and-the-binomial/">Probability Distributions: Discrete Distributions and the Binomial</a></li>
<li><a href="https://www.gironi.it/blog/en/anomaly-detection-how-to-identify-outliers-in-data/">Anomaly Detection: How to Identify Outliers in Data</a></li>
<li><a href="https://www.gironi.it/blog/en/the-negative-binomial-distribution/">The Negative Binomial Distribution</a></li>
</ul>
<hr />
<h3 id="further-reading">Further Reading</h3>
<p>For an accessible yet thorough introduction to probability distributions—including the Poisson—<a href="https://www.amazon.it/dp/8867319396?tag=consulenzeinf-21" rel="nofollow sponsored noopener" target="_blank"><em>Finalmente ho capito la statistica</em></a> by Maurizio De Pra covers these topics in a clear and approachable style, ideal for building solid intuition before moving on to more advanced topics.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.gironi.it/blog/en/the-poisson-distribution/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>The Geometric Distribution</title>
		<link>https://www.gironi.it/blog/en/the-geometric-distribution/</link>
					<comments>https://www.gironi.it/blog/en/the-geometric-distribution/#respond</comments>
		
		<dc:creator><![CDATA[autore-articoli]]></dc:creator>
		<pubDate>Sun, 01 Mar 2026 19:47:35 +0000</pubDate>
				<category><![CDATA[statistics]]></category>
		<guid isPermaLink="false">https://www.gironi.it/blog/the-geometric-distribution/</guid>

					<description><![CDATA[After looking at the most famous discrete distribution, the Binomial, as well as the Poisson distribution and the Beta distribution, it is time to take a look at the geometric distribution. What We&#8217;ll Cover How Many Trials Until the First Success? Worked Examples Computing in R Further Reading How Many Trials Until the First Success? &#8230; <a href="https://www.gironi.it/blog/en/the-geometric-distribution/" class="more-link">Continue reading<span class="screen-reader-text"> "The Geometric Distribution"</span></a>]]></description>
										<content:encoded><![CDATA[<p>After looking at the most famous discrete distribution, the <a href="https://www.gironi.it/blog/en/probability-distributions-discrete-distributions-and-the-binomial/" target="_blank" rel="noreferrer noopener">Binomial</a>, as well as the <a href="https://www.gironi.it/blog/la-distribuzione-di-poisson/" target="_blank" rel="noreferrer noopener">Poisson distribution</a> and the <a href="https://www.gironi.it/blog/en/the-beta-distribution-explained-simply/" target="_blank" rel="noreferrer noopener">Beta distribution</a>, it is time to take a look at the <em><strong>geometric distribution</strong></em>.</p>
<p><span id="more-3480"></span></p>
<div style="border: 1px solid #ccc;padding: 1.2em 1.5em;margin: 1.5em 0;border-radius: 6px">
<h3 style="margin-top: 0">What We&#8217;ll Cover</h3>
<ul>
<li><a href="#how-many-trials">How Many Trials Until the First Success?</a></li>
<li><a href="#examples">Worked Examples</a></li>
<li><a href="#r-code">Computing in R</a></li>
<li><a href="#further-reading">Further Reading</a></li>
</ul>
</div>
<h2 id="how-many-trials"><strong>How Many Trials Until the First Success?</strong></h2>
<p>We use the geometric distribution when we perform independent trials, each of which can result in either success or failure, and <strong>we want to know how many trials are needed to obtain the first success</strong>.</p>
<p>In symbols:</p>
<p>\( X \sim Geo(p) \\ \\ \)</p>
<ul>
<li>\(X\) is the number of trials needed to obtain the first success.</li>
<li>\(r\) is a given number of trials (the trial on which the first success occurs).</li>
<li>\(p\) is the probability of success on each trial.</li>
<li>We also define, as is natural: \( q = 1 - p \)</li>
</ul>
<p>Here is where it gets interesting. We have:</p>
<p>\( \\ P(X=r) = p \times q ^ {r-1} \\ \)</p>
<p><strong>P(X=r) therefore denotes the probability that the first success occurs on trial number r.</strong><br />Let us continue our reasoning:</p>
<p>\( P(X > r) = q ^ {r} \)</p>
<p><strong>This allows us to calculate the probability that more than r trials are needed before the first success</strong>, as well as:</p>
<p>\( P(X \leq r) = 1 &#8211; q ^ {r} \\ \)</p>
<p>which helps us find the probability that r trials or fewer are needed to achieve the first success. The expected value is:</p>
<p>\( E(X) = \frac{1}{p} \\ \)</p>
<p>The <strong>variance</strong> is:</p>
<p>\( Var(X) = \frac{q}{p^{2}} \)</p>
<h2 id="examples">Worked Examples</h2>
<p>We know that the probability of an ice skater completing a course without incident is 0.4. Therefore:</p>
<p>\( X \sim Geo(0.4) \\ \)</p>
<p>X is the number of attempts our skater must make in order to complete a course without any incident.</p>
<p>We are now ready to apply our new knowledge.</p>
<figure class="aligncenter"><img decoding="async" src="https://www.gironi.it/blog/wp-content/uploads/2023/04/Firefly_anice-skater-glides-on-the-rink-ring.-The-ice-is-covered-in-numbers-representing-probabilities._art_42785-1024x745.jpg" alt="Artistic representation of the ice skater example for the geometric distribution" loading="lazy" /></figure>
<p>Let us calculate the expected number of attempts before achieving a success:</p>
<p>\( E(X) = \frac{1}{p} \\ \)<br />
therefore<br />
\( \frac{1}{0.4} = 2.5 \)</p>
<p>The variance in the number of attempts is quickly calculated:</p>
<p>\( Var(X) = \frac{q}{p^{2}} \\ \)<br />
that is<br />
\( \frac{0.6}{0.4^{2}} = \frac{0.6}{0.16} = 3.75 \\ \)</p>
<p>The probability of succeeding on the second attempt, after having failed the first:</p>
<p>\( P(X=2) = P \times q = 0.4 \times 0.6 = 0.24 \\ \)<br />
that is, 24%</p>
<p>The probability of succeeding in 4 attempts or fewer? Easy!</p>
<p>\( P(X \leq 4) = 1-q^{4} = 1 - 0.6^{4} = 1 - 0.1296 \\ \)</p>
<p>That is 0.8704, or 87%.</p>
<p>The probability of needing more than 4 attempts? A simple calculation:</p>
<p>\( P(X > 4) = q^{4} = 0.6^{4} \\ \)</p>
<p>That is 0.1296, or about 13%.</p>
<hr />
<h2 id="r-code">Computing in R</h2>
<p>Now that we have the formulas well in mind, we can let our laziness take over and use R to do the heavy lifting.</p>
<p>With P(X=2) and P=0.4:</p>
<pre><code class="language-r">dgeom(1, 0.4)</code></pre>
<p>where 1 is the number of failures before the first success.</p>
<p>P(X&lt;=4) and P=0.4:</p>
<pre><code class="language-r">pgeom(3, 0.4)</code></pre>
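<p>For the last formula, P(X &gt; 4), we do not even need to take a complement: <code>pgeom()</code> accepts a <code>lower.tail</code> argument (remember that R counts failures, so &#8220;more than 4 trials&#8221; means &#8220;more than 3 failures&#8221;):</p>
<pre><code class="language-r">pgeom(3, 0.4, lower.tail = FALSE)
# [1] 0.1296</code></pre>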
<p>Simple, quick, and fun!</p>
<hr />
<h3>You might also like</h3>
<ul>
<li><a href="https://www.gironi.it/blog/en/probability-distributions-discrete-distributions-and-the-binomial/">Probability Distributions: Discrete Distributions and the Binomial</a></li>
<li><a href="https://www.gironi.it/blog/en/the-negative-binomial-distribution/">The Negative Binomial Distribution</a></li>
<li><a href="https://www.gironi.it/blog/en/the-normal-distribution/">The Normal Distribution</a></li>
</ul>
<hr />
<h3 id="further-reading">Further Reading</h3>
<p>For an accessible yet thorough introduction to discrete probability distributions—including the geometric—<a href="https://www.amazon.it/dp/8867319396?tag=consulenzeinf-21" rel="nofollow sponsored noopener" target="_blank"><em>Finalmente ho capito la statistica</em></a> by Maurizio De Pra covers these topics in a clear and approachable style, ideal for building solid intuition.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.gironi.it/blog/en/the-geometric-distribution/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>A Brief (Personal) Manifesto for SEO</title>
		<link>https://www.gironi.it/blog/en/a-brief-personal-manifesto-for-seo/</link>
					<comments>https://www.gironi.it/blog/en/a-brief-personal-manifesto-for-seo/#respond</comments>
		
		<dc:creator><![CDATA[autore-articoli]]></dc:creator>
		<pubDate>Sun, 01 Mar 2026 19:47:33 +0000</pubDate>
				<category><![CDATA[statistics]]></category>
		<guid isPermaLink="false">https://www.gironi.it/blog/a-brief-personal-manifesto-for-seo/</guid>

					<description><![CDATA[The need I feel—the fruit of many years working in this field—is to affirm the decisive importance of basic scientific rigour in analysing traffic data, so that we can calibrate our SEO interventions with accuracy, and not merely &#8220;by gut feeling&#8221; (even though feelings do matter!). The tools available to the SEO professional are countless, &#8230; <a href="https://www.gironi.it/blog/en/a-brief-personal-manifesto-for-seo/" class="more-link">Continue reading<span class="screen-reader-text"> "A Brief (Personal) Manifesto for SEO"</span></a>]]></description>
										<content:encoded><![CDATA[<p>The need I feel—the fruit of many years working in this field—is to affirm the decisive importance of basic scientific rigour in analysing traffic data, so that we can calibrate our SEO interventions with accuracy, and not merely &#8220;by gut feeling&#8221; (even though feelings do matter!).</p>
<p>The tools available to the SEO professional are countless, and yet it is undeniable that a sense of disappointment lingers within us. Too often we deal with data of apparent strategic importance that turn out, when put to the test, to be fallacious or imprecise—mere red herrings.</p>
<p><span id="more-3479"></span></p>
<p>Raise your hand if you have never felt a sense of frustration using any—truly any—competitive analysis tool, to give a concrete example.</p>
<p>Wittgenstein stated that whereof one cannot speak, thereof one must be silent.</p>
<p>Too often we draw conclusions based on utterly unreliable data. The trap of eye-catching graphs, one-click conclusions, and special effects is always with us. The tool that serves up, ready to use, exactly the analysis we need simply does not exist.</p>
<p>Too often SEO is treated by its practitioners not so much as an art (that would be useful and is even necessary, though not sufficient) but as a sort of bag of tricks for apprentice storytellers, or as a stage for conjuring acts based on evanescent numbers. A sum of a few SEO tools and report templates good for every occasion. Or sometimes shamanistic practices combined into miracle recipes.</p>
<p>Back to statistics. SEO is obviously not just statistics, but it rests on numbers. Basic statistics, to begin with. The ABCs of descriptive statistics.</p>
<p>Let us examine what the only solid data we have tell us—the data that come from our own sites. Not sampled data, not data derived from presumed traffic based on average positions in search engine indices, without clinging to fanciful correlations presented as certainties on the basis of r values close to 0…</p>
<p>Our data. Let us look at what our data tell us.</p>
<p>Sometimes it seems that reasoning in &#8220;simple&#8221; terms (median and five-number summary, boxplots and histograms, hypothesis testing and moving averages) is reductive. Other times I meet professionals who consider even a basic grounding in statistics superfluous, ignoring the fact that statistics has a fundamental, irreplaceable virtue: it makes us sceptical, and exceedingly doubtful.</p>
<p>We know little, in truth. Little can be understood by looking at numbers generated mostly by what an algorithm—unknown in composition and in how its variables are weighted—compresses into positions in an index. That little, however, must be reasonably supported by data and by common sense.</p>
<p>The rest is art, precisely. Intuition, experience, and more. Only by tracing the boundaries between what is reasonably (statistically) expressed by the data in our possession and what our experience suggests can the insight, the decisive experiment, emerge.</p>
<p>Only in this way, I believe, can we provide a service of genuine value.</p>
<p>SEO is an intrinsically treacherous subject. Approaching it requires experience, a certain dose of courage, and a necessary sense of one&#8217;s own limits.</p>
<hr />
<h3>You might also like</h3>
<ul>
<li><a href="https://www.gironi.it/blog/en/the-data-the-4-scales-of-measurement/">The Data: The 4 Scales of Measurement</a></li>
<li><a href="https://www.gironi.it/blog/en/descriptive-statistics-measures-of-position/">Descriptive Statistics: Measures of Position</a></li>
<li><a href="https://www.gironi.it/blog/en/guide-to-statistical-tests-for-a-b-analysis/">Guide to Statistical Tests for A/B Analysis</a></li>
</ul>
<hr />
<h3>Further Reading</h3>
<p>For a brilliant, accessible exploration of why statistical thinking matters—and how it protects us from misleading data—<a href="https://www.amazon.it/dp/8806246623?tag=consulenzeinf-21" rel="nofollow sponsored noopener" target="_blank"><em>The Art of Statistics</em></a> by David Spiegelhalter shows how to reason clearly with numbers in a world full of uncertainty.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.gironi.it/blog/en/a-brief-personal-manifesto-for-seo/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Descriptive Statistics: Measures of Variability (or Dispersion)</title>
		<link>https://www.gironi.it/blog/en/descriptive-statistics-measures-of-variability-or-dispersion/</link>
					<comments>https://www.gironi.it/blog/en/descriptive-statistics-measures-of-variability-or-dispersion/#respond</comments>
		
		<dc:creator><![CDATA[autore-articoli]]></dc:creator>
		<pubDate>Sun, 01 Mar 2026 19:32:03 +0000</pubDate>
				<category><![CDATA[statistics]]></category>
		<guid isPermaLink="false">https://www.gironi.it/blog/?p=3468</guid>

					<description><![CDATA[Measures of variability are used to describe the degree of dispersion of observations around a central tendency index. In other words, measures of variability allow us to assess how data are spread around a central value, which may be represented, for example, by the mean or the median. They provide valuable information about the distribution &#8230; <a href="https://www.gironi.it/blog/en/descriptive-statistics-measures-of-variability-or-dispersion/" class="more-link">Continue reading<span class="screen-reader-text"> "Descriptive Statistics: Measures of Variability (or Dispersion)"</span></a>]]></description>
										<content:encoded><![CDATA[<p>Measures of variability are used to <strong>describe the degree of dispersion of observations around a central tendency index</strong>.</p>
<p>In other words, measures of variability allow us to <strong>assess how data are spread around a central value</strong>, which may be represented, for example, by the <a href="https://www.gironi.it/blog/en/descriptive-statistics-measures-of-position-and-central-tendency/#the-arithmetic-mean" target="_blank" rel="noreferrer noopener">mean</a> or the <a href="https://www.gironi.it/blog/en/descriptive-statistics-measures-of-position-and-central-tendency/#the-median" target="_blank" rel="noreferrer noopener">median</a>. They <strong>provide valuable information about the distribution of data</strong>, enabling a better understanding of the phenomenon under observation.</p>
<p>The techniques for measuring the variability of datasets are numerous. Among them, the most widely known (and most commonly used) are:</p>
<ul>
<li>the <a href="#range">range</a></li>
<li>the <a href="#mean-deviation">mean deviation</a> and the <a href="#variance">variance</a></li>
<li>the <a href="#standard-deviation">standard deviation</a></li>
<li>the <a href="#cv">coefficient of variation</a></li>
</ul>
<p>We will also visualise the concepts of central tendency and dispersion by revisiting <a href="#skewness">skewness</a> and introducing the concept of <a href="#kurtosis">kurtosis</a>.</p>
<p><span id="more-3468"></span></p>
<hr />
<div style="border: 1px solid #ccc;padding: 1.2em 1.5em;margin: 1.5em 0;border-radius: 6px">
<h3 style="margin-top: 0">What We&#8217;ll Cover</h3>
<ul>
<li><a href="#range">The Range</a></li>
<li><a href="#mean-deviation">The Mean Deviation</a></li>
<li><a href="#variance">Variance</a></li>
<li><a href="#standard-deviation">The Standard Deviation</a></li>
<li><a href="#cv">The Coefficient of Variation</a></li>
<li><a href="#skewness">The Shape of a Distribution</a></li>
<li><a href="#kurtosis">Kurtosis</a></li>
<li><a href="#further-reading">Further Reading</a></li>
</ul>
</div>
<hr />
<h2 id="range">The Range</h2>
<p>The range is the <strong>difference between the maximum and the minimum value</strong> of ungrouped data in a frequency distribution.</p>
<p>It is a very quick calculation, which in R can be computed as follows:</p>
<pre><code class="language-r">max(var) - min(var)</code></pre>
<p>The maximum and minimum can also be displayed with:</p>
<pre><code class="language-r">range(var)</code></pre>
<p>and they appear as the first and last terms in:</p>
<pre><code class="language-r">fivenum(var)</code></pre>
<p>For grouped data, the range is defined as the difference between the upper boundary of the highest class and the lower boundary of the lowest class.</p>
<p>A <strong>trimmed range</strong> is a range computed after removing a certain percentage of extreme values from both ends of the distribution (for instance, the range of the <em>middle 80 per cent</em> of the data).</p>
<hr />
<h2 id="mean-deviation">The Mean Deviation</h2>
<p>The mean deviation is a measure of variability based on the difference between each data point and the mean. If we calculated the average by summing the positive and negative differences between each value and the arithmetic mean, <strong>the result would always be zero</strong>. For this reason, we <strong>sum the absolute values of the differences</strong>:</p>
<p>\(<br />
MD = \frac{\Sigma|X - \mu|}{N} \\<br />
\)</p>
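<p>Though rarely used in practice, the mean deviation is a one-liner in R. A minimal sketch, using a small illustrative vector:</p>
<pre><code class="language-r">x <- c(24, 17, 21, 23, 15, 30)
mean(abs(x - mean(x)))
# [1] 4</code></pre>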
<p>Those &#8220;absolute values&#8221; are awkward to work with mathematically, which is why the mean deviation is not widely used. There is another way to eliminate negative values, and so we introduce the important concept of&#8230;</p>
<h2 id="variance">Variance</h2>
<p>Variance is analogous to the mean deviation, since it is based on the <strong>differences between each data point and the mean</strong>, but these differences are <strong>squared before being summed</strong>. Variance is denoted by the lowercase sigma squared symbol, and the formula is:</p>
<p>\(<br />
\sigma^{2}=\frac{\Sigma(X - \mu)^{2}}{N} \\ \\<br />
\)</p>
<p>R has the <code>var()</code> function for computing variance, but it uses <em>(n-1)</em> in the denominator. To obtain the variance with <em>N</em> in the denominator, we can define a custom function:</p>
<pre><code class="language-r">pop_var <- function(x) { var(x) * (1 - 1/length(x)) }</code></pre>
<p>In general, <strong>interpreting the value of a variance is difficult</strong> because <strong>the units it is expressed in are not the same as those of the original observations</strong>.</p>
<p>For this reason, the <em>standard deviation</em> was introduced.</p>
<h2 id="standard-deviation">The Standard Deviation: The Most Widely Used Measure of Variability</h2>
<p>The standard deviation is simply the <strong>square root of the variance</strong>:</p>
<p>\(<br />
\sigma = \sqrt{\frac{\Sigma(X - \mu)^{2}}{N}} \\ \\<br />
\)</p>
<p>The standard deviation is of fundamental utility in statistics, particularly (as we shall see) in conjunction with the <a href="https://www.gironi.it/blog/en/the-normal-distribution/">normal distribution</a>.</p>
<p>For grouped data, we assume that the midpoint of each class represents all the measurements within that class. The variance formula then becomes:</p>
<p>\(<br />
\sigma^{2}=\frac{\Sigma f(X - \mu)^{2}}{N} \\ \\<br />
\)</p>
<p>and the standard deviation:</p>
<p>\(<br />
\sigma = \sqrt{\frac{\Sigma f(X - \mu)^{2}}{N}} \\ \\<br />
\)</p>
<p>In R, the function for computing the standard deviation is <code>sd()</code>. However, R uses <em>(n-1)</em> in the denominator. So, if we want the population standard deviation (with <em>n</em> in the denominator), we can define a dedicated function:</p>
<pre><code class="language-r">pop_sd <- function(x) { sqrt(sum((x - mean(x))^2) / length(x)) }</code></pre>
<h2 id="cv">The Coefficient of Variation</h2>
<p>The coefficient of variation indicates the relative magnitude of the standard deviation with respect to the distribution mean. It is extremely useful for comparing phenomena expressed in different units of measurement, since <strong>the CV is a "pure" number, independent of the unit of measurement</strong>:</p>
<p>\(<br />
CV = \frac{\sigma}{\mu} \\<br />
\)</p>
<p>As is often the case in R, there is a ready-made function: we can use <code>cv()</code>, defined in an external library, <a href="https://cran.r-project.org/web/packages/labstatR/index.html" target="_blank" rel="noreferrer noopener">labstatR</a>. Its usage is straightforward:</p>
<pre><code class="language-r">library(labstatR)
data <- c(24, 17, 21, 23, 15, 30, 24, 21, 24, 19,
          25, 28, 22, 20, 14, 19, 26, 29, 23, 25,
          24, 18, 27, 21)
cv(data)
# [1] 0.1817708</code></pre>
<p>We can also calculate the value quite simply without resorting to external libraries:</p>
<pre><code class="language-r">data <- c(24, 17, 21, 23, 15, 30, 24, 21, 24, 19,
          25, 28, 22, 20, 14, 19, 26, 29, 23, 25,
          24, 18, 27, 21)
pop_sd <- function(x) { sqrt(sum((x - mean(x))^2) / length(x)) }
cv_data <- pop_sd(data) / mean(data)
cv_data
# [1] 0.1817708</code></pre>
<h2 id="skewness">The Shape of a Distribution</h2>
<p>Frequency distributions can take on the most varied shapes. Among all of them, the one by far most important in statistics is the <strong>normal distribution</strong>, also known as the <strong>bell curve</strong> or <strong>Gaussian distribution</strong>.</p>
<p>In a <strong><a href="https://www.gironi.it/blog/en/the-normal-distribution/">normal</a></strong> distribution, data are arranged <strong>symmetrically around the mean</strong>. In a very straightforward way, to describe the shape of the distribution we simply <strong>compare the mean with the median</strong>: if they are <strong>equal, the distribution is symmetric</strong>. If the mean is greater than the median, we have <strong>positive skewness</strong> (with a longer "tail" on the right); if the mean is less than the median, the skewness is <strong>negative</strong> (with the longer "tail" on the left).</p>
<p>The best-known formula for calculating the skewness of a distribution is <strong>Pearson's coefficient of skewness</strong>:</p>
<p>\(<br />
Skewness = \frac{3(\mu - med)}{\sigma} \\ \\<br />
\)</p>
<p>A perfectly symmetric distribution has a skewness value of 0. A right-skewed (positive) distribution has a positive value, while a left-skewed distribution has a negative value.</p>
<p>Skewness values typically fall between -3 and 3, and the fact that the standard deviation appears in the denominator makes the <strong>value independent of the unit of measurement</strong>.</p>
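<p>Pearson&#8217;s coefficient translates directly into R. Here is a sketch, reusing the <code>pop_sd</code> helper defined earlier and the same example data:</p>
<pre><code class="language-r">data <- c(24, 17, 21, 23, 15, 30, 24, 21, 24, 19,
          25, 28, 22, 20, 14, 19, 26, 29, 23, 25,
          24, 18, 27, 21)
pop_sd <- function(x) { sqrt(sum((x - mean(x))^2) / length(x)) }
3 * (mean(data) - median(data)) / pop_sd(data)
# negative: the mean is below the median</code></pre>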
<p>How do we calculate the skewness index in R? The simplest way is to use a library that provides the functions we need "ready to go":</p>
<pre><code class="language-r">library(moments)
data <- c(24, 17, 21, 23, 15, 30, 24, 21, 24, 19,
          25, 28, 22, 20, 14, 19, 26, 29, 23, 25,
          24, 18, 27, 21)
skewness(data)
# [1] -0.1918578</code></pre>
<p>The <code>moments</code> library serves us well. Let us see, however, how to calculate the index without relying on a library. It is very simple. The first step is to remember that <strong>R uses <em>n-1</em> in the denominator of the variance</strong>. We are reasoning about a population, however, with <em>n</em> in the denominator. So, let us define a function that gives us the value we need:</p>
<pre><code class="language-r">pop_var <- function(x) { var(x) * (1 - 1/length(x)) }</code></pre>
<p>At this point we can calculate the skewness index:</p>
<pre><code class="language-r">data <- c(24, 17, 21, 23, 15, 30, 24, 21, 24, 19,
          25, 28, 22, 20, 14, 19, 26, 29, 23, 25,
          24, 18, 27, 21)
pop_var <- function(x) { var(x) * (1 - 1/length(x)) }
z <- (data - mean(data)) / sqrt(pop_var(data))
skew <- mean(z^3)
skew
# [1] -0.1918578</code></pre>
<h2 id="kurtosis">Kurtosis</h2>
<p><strong>Kurtosis is the degree of peakedness of a distribution curve</strong>, relative to the normal distribution.</p>
<p>We have three cases:</p>
<ol>
<li>a <strong>tall</strong> curve, called <em><strong>leptokurtic</strong></em>, which is highly concentrated around its mean</li>
<li>a <strong>normal</strong> curve, called <em><strong>mesokurtic</strong></em></li>
<li>a <strong>low and flat</strong> curve, called <em><strong>platykurtic</strong></em>, with little concentration around its mean</li>
</ol>
<figure><img decoding="async" src="https://www.gironi.it/blog/wp-content/uploads/2018/10/curtosi.jpg" alt="Measures of variability: leptokurtic, mesokurtic, and platykurtic curves" /></figure>
<p>Kurtosis can be measured by dividing the fourth moment by the standard deviation raised to the fourth power. Sounds difficult? It is easier done than said. Here is the formula:</p>
<p>\(<br />
Kurtosis = \frac{\Sigma f(X - \mu)^{4}}{N\sigma^{4}} \\ \\<br />
\)</p>
<p>The kurtosis of a mesokurtic curve has a value of 3. Naturally, a kurtosis coefficient less than 3 indicates a platykurtic curve, while a value greater than 3 indicates a leptokurtic curve.</p>
<p>As with the skewness index, the <code>moments</code> library provides a convenient ready-made function:</p>
<pre><code class="language-r">library(moments)
data <- c(24, 17, 21, 23, 15, 30, 24, 21, 24, 19,
          25, 28, 22, 20, 14, 19, 26, 29, 23, 25,
          24, 18, 27, 21)
kurtosis(data)
# [1] 2.480035</code></pre>
<p>But we are not averse to computing it ourselves:</p>
<pre><code class="language-r">data <- c(24, 17, 21, 23, 15, 30, 24, 21, 24, 19,
          25, 28, 22, 20, 14, 19, 26, 29, 23, 25,
          24, 18, 27, 21)
pop_var <- function(x) { var(x) * (1 - 1/length(x)) }
z <- (data - mean(data)) / sqrt(pop_var(data))
kurt <- mean(z^4)
kurt
# [1] 2.480035</code></pre>
<p>We have seen that measures of central tendency alone are not enough: two datasets can have the same mean yet behave in profoundly different ways. Variability tells us <em>how much</em> the data fluctuate, and the shape of the distribution tells us <em>how</em> they are arranged. In the next article, we will take the first steps into the world of probability—the essential bridge between descriptive statistics and the inferential tools that let us draw conclusions from data.</p>
<hr />
<h3>You might also like</h3>
<ul>
<li><a href="https://www.gironi.it/blog/en/descriptive-statistics-measures-of-position-and-central-tendency/">Descriptive Statistics: Measures of Position and Central Tendency</a></li>
<li><a href="https://www.gironi.it/blog/en/the-normal-distribution/">The Normal Distribution</a></li>
<li><a href="https://www.gironi.it/blog/en/anomaly-detection-how-to-identify-outliers-in-your-data/">Anomaly Detection: How to Identify Outliers in Your Data</a></li>
</ul>
<hr />
<h3 id="further-reading">Further Reading</h3>
<p>If you want a gentle yet rigorous introduction to the concepts behind descriptive statistics—including variability, distributions, and the art of making sense of data—<a href="https://www.amazon.it/dp/8806246623?tag=consulenzeinf-21" rel="nofollow sponsored noopener" target="_blank"><em>The Art of Statistics</em></a> by David Spiegelhalter is an excellent starting point. Spiegelhalter has the rare ability to explain statistical concepts without dumbing them down, and his treatment of uncertainty and dispersion is particularly illuminating.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.gironi.it/blog/en/descriptive-statistics-measures-of-variability-or-dispersion/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Probability Distributions: Discrete Distributions and the Binomial</title>
		<link>https://www.gironi.it/blog/en/probability-distributions-discrete-distributions-and-the-binomial/</link>
					<comments>https://www.gironi.it/blog/en/probability-distributions-discrete-distributions-and-the-binomial/#respond</comments>
		
		<dc:creator><![CDATA[autore-articoli]]></dc:creator>
		<pubDate>Sun, 01 Mar 2026 19:31:56 +0000</pubDate>
				<category><![CDATA[statistics]]></category>
		<guid isPermaLink="false">https://www.gironi.it/blog/?p=3469</guid>

					<description><![CDATA[A random variable (also called a stochastic variable) is a variable that can take on different values depending on some random phenomenon. In many statistics textbooks it is simply abbreviated as r.v. It is a numerical value. When probability values are assigned to all the possible numerical values of a random variable x, the result &#8230; <a href="https://www.gironi.it/blog/en/probability-distributions-discrete-distributions-and-the-binomial/" class="more-link">Continue reading<span class="screen-reader-text"> "Probability Distributions: Discrete Distributions and the Binomial"</span></a>]]></description>
										<content:encoded><![CDATA[<p>A <strong>random variable</strong> (also called a stochastic variable) is a variable that can take on different values depending on some random phenomenon. In many statistics textbooks it is simply abbreviated as r.v. It is a numerical value.</p>
<p>When probability values are assigned to all the possible numerical values of a random variable x, the result is a <strong>probability distribution</strong>.</p>
<p style="background-color:#f0f0f0;padding:1em;">In even simpler terms: a random variable is a variable whose values are each associated with a probability of being observed. The set of all possible values of a random variable and their associated probabilities is called a <strong>probability distribution</strong>. The <strong>sum of all probabilities is 1</strong>.</p>
<p><span id="more-3469"></span></p>
<div style="border: 1px solid #ccc;padding: 1.2em 1.5em;margin: 1.5em 0;border-radius: 6px">
<h3 style="margin-top: 0">What We&#8217;ll Cover</h3>
<ul>
<li><a href="#discrete-vs-continuous">Discrete and Continuous Variables</a></li>
<li><a href="#bernoulli">The Bernoulli Random Variable</a></li>
<li><a href="#binomial">The Binomial Distribution</a></li>
<li><a href="#mean-variance">Mean, Expected Value, and Variance</a></li>
<li><a href="#probability-density-example">An Example: Computing the Probability Density</a></li>
<li><a href="#other-distributions">Other Discrete Distributions</a></li>
<li><a href="#further-reading">Further Reading</a></li>
</ul>
</div>
<hr />
<h2 id="discrete-vs-continuous">Discrete and Continuous Variables</h2>
<p>There are two main types of random variables: <strong>discrete</strong> and <strong>continuous</strong>.</p>
<ul>
<li>A <strong>discrete r.v.</strong> can take on a discrete (<strong>finite</strong> or countable) <strong>set of real numbers</strong>. That is, we could list all possible values in a table together with their respective probabilities. An example is the outcome of rolling a die: there are 6 possible outcomes, each with a probability of 1/6 (and the sum of all probabilities, of course, equals 1).</li>
<li>A <strong>continuous r.v.</strong> can take on <strong>all values within a real interval</strong>—that is, an infinite number of values within any given interval. The probability that X falls within a given interval is represented by the <strong>area under the probability distribution</strong>. In the case of a continuous random variable, probabilities are represented by means of a <strong>probability density function</strong>. The total area under the curve (i.e. the total probability) equals 1.</li>
</ul>
<p>Depending on the case, we deal with various types of distributions. These are the most common:</p>
<table>
<thead>
<tr>
<th>Discrete distributions</th>
<th>Continuous distributions</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<ul>
<li>Binomial</li>
<li>Poisson</li>
<li>Geometric</li>
</ul>
</td>
<td>
<ul>
<li><a href="https://www.gironi.it/blog/en/the-normal-distribution/">Normal</a></li>
<li>Uniform</li>
<li>Student&#8217;s t</li>
</ul>
</td>
</tr>
</tbody>
</table>
<hr />
<h2 id="bernoulli">Event Yes or Event No? The Bernoulli Random Variable</h2>
<p>Consider a trial in which we are only interested in verifying whether a certain event has occurred or not. The random variable generated by such a trial will take the value 1 if the event has occurred, 0 otherwise. This r.v. is called a <strong>Bernoulli random variable</strong>.</p>
<p>Any dichotomous trial can be represented by a Bernoulli random variable.</p>
<figure class="aligncenter"><a href="https://en.wikipedia.org/wiki/Jacob_Bernoulli" target="_blank" rel="noopener"><img decoding="async" src="https://www.gironi.it/blog/wp-content/uploads/2018/09/Jakob_Bernoulli-268x300.jpg" alt="Jakob Bernoulli - the binomial distribution" /></a><figcaption>This is Mr. Jakob Bernoulli. The details are on Wikipedia for those interested&#8230;</figcaption></figure>
<p>A bit of notation. We denote a Bernoulli r.v. as follows:</p>
<p>\( x \sim Bernoulli(\pi) \\ \)</p>
<p>Its mean is:</p>
<p>\( E(x)=\pi \\ \)</p>
<p>And its variance is:</p>
<p>\( V(x)=\pi(1-\pi) \\ \)</p>
<p><strong>All trials that produce only 2 possible outcomes generate Bernoulli random variables</strong> (for example, tossing a coin). Starting from this simple assumption, it is a very short step to the Binomial Distribution.</p>
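<p>A Bernoulli r.v. is easy to experiment with in R: <code>rbinom()</code> with <code>size = 1</code> draws Bernoulli trials. A quick sketch, with &#960; = 0.3 chosen purely for illustration:</p>
<pre><code class="language-r">set.seed(42)
x <- rbinom(10000, size = 1, prob = 0.3)
mean(x)                  # close to pi = 0.3
mean(x) * (1 - mean(x))  # close to pi * (1 - pi) = 0.21</code></pre>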
<hr />
<h2 id="binomial">The Binomial Distribution</h2>
<p>Rather than dwelling on the conceptual aspects—important as they are, and for which I refer to specialised texts—what I want to do here is show in practice, and as clearly as possible, what we are talking about. Let us start with a definition and then look at the characteristics and a few practical examples.</p>
<p><strong>The Binomial random variable can be understood as a sum of Bernoulli random variables.</strong></p>
<p>What does this mean? Simply that if we repeat the success–failure dichotomy of the Bernoulli random variable <em>n</em> times under the same conditions, the result will be a sequence of <em>n</em> independent sub-trials, each of which can be associated with a Bernoulli random variable.</p>
<p>What are <strong>the characteristics of the binomial distribution</strong>?</p>
<ul>
<li>There is a <strong>fixed number of trials</strong> (<em>n</em>).</li>
<li>Each trial has two possible outcomes: <strong>success</strong> or <strong>failure</strong>.</li>
<li>The <strong>probability of success</strong> (<em>p</em>) is <strong>the same</strong> for every trial.</li>
<li>The outcome of one trial does not affect any other (the trials are <strong>independent</strong>).</li>
</ul>
<p>If even one of these characteristics is absent, the binomial distribution does not apply.</p>
<p style="background-color:#f0f0f0;padding:1em;"><strong>From a practical standpoint, the binomial distribution allows us to calculate the probability of obtaining <em>r</em> successes in <em>n</em> independent trials.</strong></p>
<p>The probability of a certain number of successes, <em>r</em>, depends on <em>r</em> itself, on the number of trials <em>n</em>, and on the individual probability, which we denote by <em>p</em>.</p>
<p>The probability of <em>r</em> successes in <em>n</em> trials is given by:</p>
<p>\( \frac{n!}{r!(n-r)!} \times p^r (1-p)^{n-r} \\ \)</p>
<p>Looks difficult? It really is not (and in practice it turns out to be useful and even fun!).</p>
<div style="border:1px dotted silver; padding:8px;">
NOTE: The part<br />
\(<br />
\frac{n!}{r!(n-r)!} \\<br />
\)<br />
is called the <strong>binomial coefficient</strong>, and is found in textbooks written as:<br />
\(<br />
{n\choose k} \\<br />
\)
</div>
<p>First, let us recall that the symbol ! in mathematics denotes the <em>factorial</em>. As you will certainly remember, the factorial of 3, i.e. 3!, is: 3 &times; 2 &times; 1 = 6; the factorial of 4, i.e. 4!, is: 4 &times; 3 &times; 2 &times; 1 = 24; and so on (it will not escape notice that the factorial grows very, very quickly as the number increases&#8230;).</p>
<blockquote>
<p><strong>The factorial of a natural number is the product of all the positive integers from that number down to 1.</strong></p>
</blockquote>
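<p>Both the factorial and the binomial coefficient are available directly in base R, so there is no need to compute them by hand:</p>
<pre><code class="language-r">factorial(4)  # 4 * 3 * 2 * 1 = 24
choose(7, 3)  # 7! / (3! * 4!) = 35</code></pre>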
<p>With that said, let us first see how to find the mean—the centre of our distribution—and the variance. This way, we will have everything we need for a few practical examples.</p>
<hr />
<h2 id="mean-variance">Mean, Expected Value, and Variance of a Binomial Distribution</h2>
<p>Let <em>x</em> be our binomial random variable. We can write our problem as follows:</p>
<p>\( x \sim Binomial(n, p) \\ \)</p>
<p>The mean is:</p>
<p>\( E(x) = n \times p \\ \)</p>
<p>The variance is:</p>
<p>\( Var(x) = n \times p \times (1 - p) \\ \)</p>
<p>At this point, an example is in order. Let us calculate the variance of a distribution with size <em>n</em> = 10 and individual probability <em>p</em> = 0.5 (i.e. 50%). For instance, this could represent ten coin tosses.</p>
<p>\( x \sim Binomial(10, 0.5) \\ \)</p>
<p>So the variance will be:</p>
<p>\( Var(x) = 10 \times 0.5 \times (1 - 0.5) = 2.5 \\ \)</p>
<p>And the mean, naturally, will be:</p>
<p>\( E(x) = 10 \times 0.5 = 5 \\ \)</p>
<p><em>Side note: it is intuitive that if p = 1 - p = 0.5, the probability distribution will be symmetric. If p &lt; 0.5, it will be right-skewed, and if p &gt; 0.5, it will be left-skewed.</em></p>
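<p>Again, a quick simulation confirms the closed-form results for our coin-toss example (the number of replications is an arbitrary choice):</p>
<pre><code class="language-r">set.seed(123)
draws <- rbinom(100000, size = 10, prob = 0.5)

mean(draws)  # should be close to n * p = 5
var(draws)   # should be close to n * p * (1 - p) = 2.5</code></pre>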
<hr />
<h2 id="probability-density-example">An Example: Computing the Probability Density</h2>
<p>Let us now introduce the concept of <strong>probability density</strong> (for a discrete variable this is, strictly speaking, the <em>probability mass function</em>, although R keeps the &#8220;density&#8221; terminology), which is what we will use most often in real-world applications. This is when, for example, we want to know the probability that exactly two out of ten coin tosses come up heads.</p>
<p>To explain this more clearly, let us take a problem from a textbook:</p>
<p><em>If I cross a black mouse with a white one, there is a 3/4 probability that the offspring will be black and 1/4 that it will be white. What is the probability that out of 7 offspring, exactly 3 are white?</em></p>
<p>Let us write down the data straight away:</p>
<ul>
<li><em>n</em> = 7</li>
<li><em>r</em> = 3</li>
<li><em>p</em> = 1/4, i.e. 0.25</li>
</ul>
<p>And now? Shall we do the calculations by hand? Why not:</p>
<p>\( \frac{n!}{r!(n-r)!} \times p^r (1-p)^{n-r} \\ \)</p>
<p>therefore:</p>
<p>\( \frac{7!}{3!4!} \times 0.25^{3} \times 0.75^{4} = 35 \times 0.0049438 = 0.173 \\ \)</p>
<p>That is, 17.3%.</p>
<p>Doing calculations by hand is fun, but we are lazy and have R at our disposal. In R, the probability density is computed by a simple function:</p>
<p><strong>dbinom()</strong></p>
<p>The problem is therefore solved with the simple instruction:</p>
<pre><code class="language-r">dbinom(3, 7, 0.25)
# [1] 0.1730347</code></pre>
<p>which gives us 0.1730347, so the answer is once again 17.3% (after rounding).</p>
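<p>Since <strong>dbinom()</strong> is vectorised, we can also compute the whole probability mass function in one call and check that the probabilities over all possible outcomes sum to 1:</p>
<pre><code class="language-r">r <- 0:7
probs <- dbinom(r, size = 7, prob = 0.25)
round(probs, 4)

# The probabilities over all possible outcomes must sum to 1
sum(probs)</code></pre>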
<hr />
<h2 id="other-distributions">Other Discrete Distributions</h2>
<p>There are equally interesting questions that call upon other discrete distributions:</p>
<ul>
<li>How many trials should we expect before obtaining a success? This is where the <a href="https://www.gironi.it/blog/la-distribuzione-geometrica/">geometric distribution</a> enters the scene.</li>
<li>How many times can we expect an event to occur (or not) in a given time interval? That calls for the <a href="https://www.gironi.it/blog/la-distribuzione-di-poisson/">Poisson distribution</a>.</li>
<li>Are we sampling from a population without replacement? Then we use the <a href="https://www.gironi.it/blog/en/the-hypergeometric-distribution/">hypergeometric distribution</a>.</li>
</ul>
<p>As we can see, this is a vast and fascinating topic, which we will explore (lightly) across several articles. In the next one, we will look at another important distribution: the <a href="https://www.gironi.it/blog/en/the-beta-distribution-explained-simply/">beta distribution</a>, which plays a central role in Bayesian statistics.</p>
<hr />
<h3>You might also like</h3>
<ul>
<li><a href="https://www.gironi.it/blog/en/first-steps-into-the-world-of-probability/">First Steps into the World of Probability</a></li>
<li><a href="https://www.gironi.it/blog/en/the-hypergeometric-distribution/">The Hypergeometric Distribution</a></li>
<li><a href="https://www.gironi.it/blog/en/the-negative-binomial-distribution/">The Negative Binomial Distribution</a></li>
</ul>
<hr />
<h3 id="further-reading">Further Reading</h3>
<p>For an accessible yet thorough introduction to probability distributions and the reasoning behind them, <a href="https://www.amazon.it/dp/8867319396?tag=consulenzeinf-21" rel="nofollow sponsored noopener" target="_blank"><em>Finalmente ho capito la statistica</em></a> by Maurizio De Pra covers discrete distributions—including the binomial—in a clear and approachable style, ideal for building solid intuition before moving on to more advanced topics.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.gironi.it/blog/en/probability-distributions-discrete-distributions-and-the-binomial/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
