<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>regola empirica &#8211; paologironi blog</title>
	<atom:link href="https://www.gironi.it/blog/en/tag/regola-empirica/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.gironi.it/blog</link>
	<description>Scattered notes on (retro) computing, data analysis, statistics, SEO, and things that change</description>
	<lastBuildDate>Wed, 20 Nov 2024 15:22:46 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	
	<item>
		<title>The Normal Distribution</title>
		<link>https://www.gironi.it/blog/en/the-normal-distribution/</link>
					<comments>https://www.gironi.it/blog/en/the-normal-distribution/#respond</comments>
		
		<dc:creator><![CDATA[paolo]]></dc:creator>
		<pubDate>Mon, 29 Oct 2018 15:16:00 +0000</pubDate>
				<category><![CDATA[statistics]]></category>
		<category><![CDATA[chebyshev]]></category>
		<category><![CDATA[gaussiana]]></category>
		<category><![CDATA[regola empirica]]></category>
		<category><![CDATA[standardizzata]]></category>
		<category><![CDATA[z score]]></category>
		<guid isPermaLink="false">https://www.gironi.it/blog/?p=3321</guid>

					<description><![CDATA[The concept of the normal distribution is one of the key elements in the field of statistical research. Very often, the data we collect shows typical characteristics, so typical that the resulting distribution is simply called&#8230; &#8220;normal&#8221;. In this post, we will look at the characteristics of this distribution, as well as touch on some &#8230; <a href="https://www.gironi.it/blog/en/the-normal-distribution/" class="more-link">Continue reading<span class="screen-reader-text"> "The Normal Distribution"</span></a>]]></description>
										<content:encoded><![CDATA[ <p>The concept of the normal distribution is one of the key elements in the field of statistical research. Very often, the data we collect shows typical characteristics, so typical that the resulting distribution is simply called&#8230; &#8220;normal&#8221;. In this post, we will look at the characteristics of this distribution, as well as touch on some other concepts of notable importance such as:</p>   <ul class="wp-block-list"> <li><a href="https://www.gironi.it/blog/la-distribuzione-normale#regolaempirica">the empirical rule</a></li>   <li><a href="https://www.gironi.it/blog/la-distribuzione-normale#zscore">the standardized variable</a> &#8211; The concept of Z score</li>   <li><a href="https://www.gironi.it/blog/la-distribuzione-normale#chebishev">Chebyshev&#8217;s inequality</a></li> </ul>   <hr class="wp-block-separator has-css-opacity"/>   <span id="more-3321"></span>  				<div class="wp-block-uagb-table-of-contents uagb-toc__align-left uagb-toc__columns-1  uagb-block-30b1fca4      "
					data-scroll= "1"
					data-offset= "30"
					style=""
				>
				<div class="uagb-toc__wrap">
						<div class="uagb-toc__title">
							What we will discuss						</div>
																						<div class="uagb-toc__list-wrap ">
						<ol class="uagb-toc__list"><li class="uagb-toc__list"><a href="#visualizing-the-normality-of-our-data" class="uagb-toc-link__trigger">Visualizing the &quot;normality&quot; of our data</a></li><li class="uagb-toc__list"><a href="#transforming-the-data" class="uagb-toc-link__trigger">Transforming the data</a></li><li class="uagb-toc__list"><a href="#the-empirical-rule" class="uagb-toc-link__trigger">The empirical rule</a></li><li class="uagb-toc__list"><a href="#standardizing-is-useful-and-beautiful-the-z-score" class="uagb-toc-link__trigger">Standardizing is useful (and beautiful&#8230;). The Z score.</a><ul class="uagb-toc__list"><li class="uagb-toc__list"><a href="#lets-do-a-quick-example" class="uagb-toc-link__trigger">Let&#039;s do a quick example</a></li></ul></li><li class="uagb-toc__list"><a href="#and-now-the-fun-part-lets-do-some-practical-examples" class="uagb-toc-link__trigger">And now the fun part: let&#039;s do some practical examples!</a></li><li class="uagb-toc__list"><a href="#chebyshevs-inequality" class="uagb-toc-link__trigger">Chebyshev&#039;s inequality</a></li></ol>					</div>
									</div>
				</div>
			  <hr class="wp-block-separator has-css-opacity"/>   <p>In previous posts, we have seen examples of probability distributions for discrete variables: for example, the <a href="https://www.gironi.it/blog/distribuzioni-di-probabilita-distribuzioni-discrete-la-binomiale/">Binomial</a>, the <a href="https://www.gironi.it/blog/la-distribuzione-geometrica/">Geometric</a>, the <a href="https://www.gironi.it/blog/la-distribuzione-di-poisson/">Poisson</a> distribution&#8230;</p>   <p>The <strong>normal</strong> distribution is a <strong>continuous probability distribution</strong>; in fact, it is the most famous and most used of the continuous probability distributions. We recall that a continuous variable can take an infinite number of values within any given interval.</p>   <p>The normal distribution has a <strong>bell</strong> shape, is also called <strong>Gaussian</strong> &#8211; after the famous mathematician who made a fundamental contribution to this field &#8211; and is <strong>symmetrical about its mean</strong>. It extends indefinitely in both directions, but most of the area &#8211; that is, the probability &#8211; is collected around the mean. 
The curve appears to change shape at two points, which we call <strong>inflection points</strong>, and which coincide with a <strong>distance of one standard deviation above and below the mean</strong>.</p>   <p>With a couple of lines of R, I generate the characteristic shape of this distribution:</p>   <figure class="wp-block-image is-resized"><img fetchpriority="high" decoding="async" src="https://www.gironi.it/blog/wp-content/uploads/2018/09/image.png" alt="plot of the Gaussian or normal curve" class="wp-image-918" width="647" height="413" srcset="https://www.gironi.it/blog/wp-content/uploads/2018/09/image.png 863w, https://www.gironi.it/blog/wp-content/uploads/2018/09/image-300x192.png 300w" sizes="(max-width: 709px) 85vw, (max-width: 909px) 67vw, (max-width: 984px) 61vw, (max-width: 1362px) 45vw, 600px" /></figure>   <h2 class="wp-block-heading">Visualizing the &#8220;normality&#8221; of our data</h2>   <p>R offers several tools to evaluate the deviation of a distribution from a theoretical normal distribution.</p>   <p>One of these is the <strong>qqnorm()</strong> function, which plots the distribution against the theoretical normal quantiles (qq = quantile-quantile):</p>   <pre class="wp-block-preformatted">qqnorm(variable)
qqline(variable)</pre>   <p>I verify this with an example by generating a normal distribution:</p>   <pre class="wp-block-preformatted">x &lt;- rnorm(100, 5, 10)
qqnorm(x)
qqline(x)</pre>   <p>The result is this, and as you can see, we have visual confirmation of the substantial normality of the distribution:</p>   <figure class="wp-block-image"><img decoding="async" width="679" height="432" src="https://www.gironi.it/blog/wp-content/uploads/2018/11/qq-plot.png" alt="qq plot" class="wp-image-1114" srcset="https://www.gironi.it/blog/wp-content/uploads/2018/11/qq-plot.png 679w, https://www.gironi.it/blog/wp-content/uploads/2018/11/qq-plot-300x191.png 300w" sizes="(max-width: 709px) 85vw, (max-width: 909px) 67vw, (max-width: 
984px) 61vw, (max-width: 1362px) 45vw, 600px" /></figure>   <h2 class="wp-block-heading">Transforming the data</h2>   <p>When the <strong>asymmetry of a distribution depends on the fact that a variable extends over several orders of magnitude</strong>, we have an easy way to make our distribution symmetrical and similar to a normal distribution: <strong>transform the variable into its logarithm</strong>:</p>   <pre class="wp-block-preformatted">qqnorm(log10(variable))
qqline(log10(variable))</pre>   <p>But how do I calculate the central tendency in this case? If I use something like mean(log10(variable)), I no longer have the unit of measurement&#8230; To recover it, I can use the <strong>antilogarithm</strong>, i.e., calculate: 10<sup>result</sup>. However, it must always be kept in mind that <strong>this is the <a href="https://www.gironi.it/blog/statistica-descrittiva-misure-di-posizione/#la-media-geometrica" target="_blank" rel="noreferrer noopener">geometric mean</a></strong>.</p>   <hr class="wp-block-separator has-css-opacity"/>   <p>Good: we have our dataset and we have verified that the distribution is reasonably similar to a normal distribution. It&#8217;s time to find some practical applications to put our new knowledge to use!</p>   <h2 class="wp-block-heading" id="regolaempirica">The empirical rule</h2>   <p>The empirical rule is one of the pillars of statistics. 
Without going into too much theoretical detail, the gist is this: <strong>the percentages of data from a normal distribution within 1, 2, and 3 standard deviations from the mean are approximately 68%, 95%, and 99.7%.</strong> It is a rule of such importance and common use that it is worth restating with greater emphasis&#8230;</p>   <figure class="wp-block-pullquote is-style-default has-light-gray-background-color has-background" style="font-style:normal;font-weight:600"><blockquote><p>THE EMPIRICAL RULE<br>The percentages of data from a normal distribution within <br>1, 2, and 3 standard deviations from the mean <br>are approximately <br>68%, 95%, and 99.7%.</p></blockquote></figure>   <p></p>   <h2 class="wp-block-heading" id="zscore">Standardizing is useful (and beautiful&#8230;). The Z score.</h2>   <p>The <strong>standardized normal distribution</strong> is a normal distribution with a <strong>zero mean</strong> and a <strong>standard deviation</strong> of <strong>one</strong>. <br><br><em>* In the blog, I use the Italian terms &#8220;deviazione standard&#8221; and &#8220;scarto quadratico medio&#8221; interchangeably&#8230; since they express the same concept and are both commonly used.</em></p>   <p>That is, with:</p>  
\(
\mu = 0,\ \sigma = 1
\)

  <p>Any normal distribution can be converted into a standardized normal distribution by setting the mean to zero and expressing the deviations from the mean in units of standard deviation, what the English speakers very effectively call <em>Z-score</em>.</p>   <p class="has-light-gray-background-color has-background">A Z-score measures the distance between a data point and the mean, using standard deviations. Therefore, a Z-score can be positive (the observation is above the mean) or negative (below the mean). A Z-score of -1, for example, will indicate that our observation falls one standard deviation below the mean. Obviously, a Z-score equal to 0 is equivalent to the mean.</p>   <p><strong>The Z-score is a &#8220;pure&#8221; value, so it provides us with a &#8220;measure&#8221; of extraordinary effectiveness. In practice, it is an index that allows me to compare values between different distributions (as long as they are &#8220;normal,&#8221; of course), using a &#8220;standard&#8221; measure.</strong> <br><br>The calculation, as we have seen, is almost trivial: simply <strong>divide the deviation by the standard deviation</strong>:</p>  
\(
Z = \frac{\text{deviation}}{\text{standard deviation}}
\)
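<p>The mechanics can be sketched in a few lines of Python, using only the standard library (the sample values below are made up for illustration):</p>

```python
from statistics import mean, pstdev

# A small made-up sample (any unit of measurement will do)
x = [61, 72, 58, 65, 70, 64, 68, 59, 66, 67]

mu = mean(x)
sigma = pstdev(x)  # population standard deviation

# Standardize: divide each deviation from the mean by the standard deviation
z = [(v - mu) / sigma for v in x]

# By construction, the standardized values have mean 0 and standard deviation 1
print(mean(z), pstdev(z))
```

<p>On this standardized scale, each value is its own Z-score.</p>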

  <p>Under these conditions, we know that approximately 68% of the area under the standardized normal curve is within 1 standard deviation from the mean, 95% within two, and 99.7% within three.<br>That is:</p>  
\(
68.26\% \ \text{within} \ \mu \pm \sigma \\
95.44\% \ \text{within} \ \mu \pm 2\sigma \\
99.74\% \ \text{within} \ \mu \pm 3\sigma
\)
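<p>These percentages can be checked without tables, with a quick Python sketch using only the standard library: for a standard normal, P(|Z| &lt; k) = erf(k/&#8730;2).</p>

```python
from math import erf, sqrt

# For a standard normal, P(|Z| < k) = erf(k / sqrt(2))
for k in (1, 2, 3):
    p = erf(k / sqrt(2))
    print(f"within {k} standard deviation(s): {p:.2%}")
```

<p>The exact values are 68.27%, 95.45%, and 99.73%; the table-based figures above differ only in the last digit because of rounding.</p>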

  <p>To find the probabilities &#8211; that is, the areas &#8211; for problems involving the normal distribution, the value X is converted into the corresponding Z-score:</p>  
\(
Z = \frac{X - \mu}{\sigma}
\)

  <p>Then the value of Z is looked up in the table to find the corresponding probability under the curve.</p>   <p>Does it seem difficult? It&#8217;s actually very easy, and a lot of fun. And with R, or with the TI-83, it&#8217;s really a breeze!<br> </p>   <p class="has-light-gray-background-color has-background">The importance of the Z-score lies also (and especially) in its <strong>extreme practical utility</strong>: in fact, it allows us to meaningfully compare observations drawn from populations with different means and standard deviations, using a common scale. This is why the process is called <em>standardization</em>: it allows <strong>comparing observations between variables that have different distributions</strong>. Using the table (or the calculator, or the computer) we can quickly calculate probabilities and percentiles, and identify any extreme values (<em>outliers</em>). </p>   <p>Since sigma is positive, Z will be positive if X&gt;mu and negative if X&lt;mu. The value of Z represents the number of standard deviations the value lies above or below the mean.</p>   <h5 class="wp-block-heading">Let&#8217;s do a quick example</h5>   <p>I have observations of some phenomenon that have a mean value of 65:</p>  
\(
\mu = 65
\)

  <p>The standard deviation is 10:</p>  
\(
\sigma = 10
\)

  <p>And I observe a value of 81:</p>  
\(
X = 81
\)

  <p>The value of the Z-score is calculated in a moment:</p>  
\(
Z = \frac{X - \mu}{\sigma} = \frac{81 - 65}{10} = \frac{16}{10} = 1.6
\)

  <p>The observed value, on a standard scale, falls 1.6 standard deviations above the mean. To find out, therefore, what percentage of observations fall below the observed value, I just need to consult the table:</p>   <div class="wp-block-uagb-image uagb-block-b287af8d wp-block-uagb-image--layout-default wp-block-uagb-image--effect-static wp-block-uagb-image--align-none"><figure class="wp-block-uagb-image__figure"><img decoding="async" srcset="https://www.gironi.it/blog/wp-content/uploads/2023/02/tabella-z.png " src="https://www.gironi.it/blog/wp-content/uploads/2023/02/tabella-z.png" alt="Z score table" class="uag-image-2697" width="" height="" title="" loading="lazy"/><figcaption class="uagb-image-caption"><em>The Z score table in action&#8230;</em></figcaption></figure></div>   <p>Looking up my Z value of 1.6 in the table (row 1.6, column 0.00), I find the value 0.9452, which is equivalent to saying that 94.52% of the observed values are less than 81.</p>   <p>Obviously, I could have obtained the value in R without using the table, simply with:</p>   <pre class="wp-block-preformatted">pnorm(1.6)</pre>   <p>For those using Python:</p>   <pre class="wp-block-preformatted">from scipy.stats import norm

p = norm.cdf(1.6)
print(p)</pre>   <h2 class="wp-block-heading">And now the fun part: let&#8217;s do some practical examples!</h2>   <p><strong>Example 1</strong></p>   <p>What is the probability of an event with a Z-score &lt; 2.47?</p>   <p>I take the <a href="http://www.matapp.unimib.it/~fcaraven/did0607/tavola_normale.pdf" target="_blank" rel="noreferrer noopener">table</a> and see that Z = 2.47 corresponds to 0.9932.</p>   <p>Therefore, 99.32% of the values lie below 2.47 standard deviations above the mean.<br><br>Representing the situation graphically, what I am asked to do is find the area of the gray surface, that is, the area under the curve to the left of the point with abscissa Z=2.47:</p>   <figure class="wp-block-image"><img decoding="async" width="679" height="432" 
src="https://www.gironi.it/blog/wp-content/uploads/2018/10/normale-1.png" alt="normale 1" class="wp-image-1091" srcset="https://www.gironi.it/blog/wp-content/uploads/2018/10/normale-1.png 679w, https://www.gironi.it/blog/wp-content/uploads/2018/10/normale-1-300x191.png 300w" sizes="(max-width: 709px) 85vw, (max-width: 909px) 67vw, (max-width: 984px) 61vw, (max-width: 1362px) 45vw, 600px" /></figure>   <p>In R, the calculation is very simple. All I have to do is type:</p>   <pre class="wp-block-preformatted">pnorm(2.47)</pre>   <p>The <strong>pnorm() function allows us to obtain the cumulative probability curve of the normal distribution</strong>. In other words, it allows us to calculate the relative area (remembering that the total area is 1) under the curve, from the given value of Z to +infinity or -infinity.<br><br>By default, R uses the lower tail, i.e., it finds the area from -infinity to Z.<br>To compute the area between Z and +infinity, I just need to set lower.tail=FALSE.<br><br></p>   <p><strong>Example 2</strong></p>   <p>What is the probability of a Z-score value &gt; 1.53 ?</p>   <p>From the table, I find the value 0.937, so I deduce that 93.7% of the values are below a Z-score of 1.53.<br>So, to find out how many are above: 100-93.7 = 6.3%</p>   <p>In R, all I have to do is type:</p>   <pre class="wp-block-preformatted">1 - pnorm(1.53)</pre>   <p><strong>Example 3</strong></p>   <p>What is the probability of &#8220;drawing&#8221; a random value of less than 3.65, given a normal distribution with a mean = 5 and a standard deviation = 2.2 ?</p>   <p>I immediately find the Z-score for the value 3.65:</p>  
\(
Z = \frac{3.65 - 5}{2.2} = \frac{-1.35}{2.2} \simeq -0.61
\)
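<p>For those using Python, the same probability can also be computed without the table, using only the standard library (the normal_cdf helper below is my own, built from the identity &#934;(x) = &#189;(1 + erf(x/&#8730;2)); it is not part of the post&#8217;s original code):</p>

```python
from math import erf, sqrt

def normal_cdf(x, mu=0.0, sigma=1.0):
    # P(X < x) for a normal distribution with mean mu and standard deviation sigma
    z = (x - mu) / sigma
    return 0.5 * (1 + erf(z / sqrt(2)))

# Example 3: mean 5, standard deviation 2.2, value 3.65
print(normal_cdf(3.65, mu=5, sigma=2.2))  # ~0.2697
```

<p>The small difference from the table value 0.2709 is again due to rounding Z to -0.61.</p>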

  <p>I look for this value in the table: 0.2709. Therefore, there is a 27.09% probability that a value less than 3.65 will &#8220;come out&#8221; of a random selection with a mean of 5 and a standard deviation of 2.2.</p>   <p>If I wanted to use a scientific calculator, with the TI-83 I would just have to type:</p>   <pre class="wp-block-preformatted">normalcdf(-1e99,3.65,5,2.2)</pre>   <p>While with a Casio fx I would just have to follow these steps:</p>   <pre class="wp-block-preformatted">MENU
STAT
DIST
NORM
Ncd
Data: Variable
Lower: -100
Upper: 3.65
sigma: 2.2
mu: 5
EXECUTE</pre>   <p>The result is obviously slightly different from that obtained from the table, because in the table I rounded the value of the division (3.65-5)/2.2 to -0.61, omitting the remaining decimal part&#8230;</p>   <p><strong>Example 4: finding probabilities between 2 Z-scores</strong></p>   <p>This is the most fun case of all. Actually, you just find the two probabilities and subtract&#8230;</p>   <p>What is the probability associated with a value between Z=1.2 and Z=2.31?</p>   <p>I think of my normal curve: first I find the area to the left of Z<sub>2</sub>. Then I find the area to the left of Z<sub>1</sub>. Then I subtract the two values to find the area between the two, which is the probability sought.</p>   <p>Or I use R and simply write:</p>   <pre class="wp-block-preformatted">pnorm(2.31) - pnorm(1.2)</pre>   <p>and the result, in this case 10.46%, is found in a moment!</p>   <p>Wait, but what if I wanted to calculate the value of Z starting from a cumulative probability? Just use the inverse function of pnorm(), which in R is qnorm(). 
For example, to find the value of Z with an area of 0.5, I type:</p>   <pre class="wp-block-preformatted">qnorm(0.5)</pre>   <p>and I will get the result, which is clearly 0 (the mean of a standardized normal has a value of 0, and the mean divides the normal into two equal areas&#8230;).</p>   <p>For those using Python, the code is:</p>   <pre class="wp-block-preformatted">from scipy.stats import norm

q = norm.ppf(0.5)
print(q)</pre>   <h2 class="wp-block-heading" id="chebishev">Chebyshev&#8217;s inequality</h2>   <p>The most important characteristic of Chebyshev&#8217;s inequality is that <strong>it applies to any probability distribution whose mean and standard deviation are known</strong>.</p>   <p>When dealing with a distribution of unknown type, or one that is certainly not normal, Chebyshev&#8217;s inequality comes to our aid, stating that:<br><br>given a positive real value k, the probability that the random variable X takes a value between:</p>  
\( \mu - k\sigma \ \text{and} \ \mu + k\sigma \)

  <p>is at least:</p>  
\( 1 - \frac{1}{k^{2}} \)

  <p>In other words: suppose we know the mean and standard deviation of a set of data, which do not follow a normal distribution. We can say that for every value k &gt;0 at least a fraction (1-1/k<sup>2</sup>) of the data falls within the interval between:</p>  
\( \mu - k\sigma \ \text{and} \ \mu + k\sigma \)

  <p>As always, an example is useful to clarify everything. I take an example dataset&#8230; the average salaries paid by U.S. baseball teams in 2016:</p>   <style type="text/css"> table.tableizer-table { font-size: 12px; border: 1px solid #CCC; font-family: Arial, Helvetica, sans-serif; } .tableizer-table td { padding: 4px; margin: 3px; border: 1px solid #CCC; } .tableizer-table th { background-color: #104E8B; color: #FFF; font-weight: bold; } </style> <table class="tableizer-table"> <thead><tr class="tableizer-firstrow"><th>Team</th><th>Salary ($M)</th></tr></thead><tbody> <tr><td>Arizona Diamondbacks</td><td>91.995583</td></tr> <tr><td>Atlanta Braves</td><td>77.073541</td></tr> <tr><td>Baltimore Orioles</td><td>141.741213</td></tr> <tr><td>Boston Red Sox</td><td>198.328678</td></tr> <tr><td>Chicago Cubs</td><td>163.805667</td></tr> <tr><td>Chicago White Sox</td><td>113.911667</td></tr> <tr><td>Cincinnati Reds</td><td>80.905951</td></tr> <tr><td>Cleveland Indians</td><td>92.652499</td></tr> <tr><td>Colorado Rockies</td><td>103.603571</td></tr> <tr><td>Detroit Tigers</td><td>192.3075</td></tr> <tr><td>Houston Astros</td><td>89.0625</td></tr> <tr><td>Kansas City Royals</td><td>136.564175</td></tr> <tr><td>Los Angeles Angels</td><td>160.98619</td></tr> <tr><td>Los Angeles Dodgers</td><td>248.321662</td></tr> <tr><td>Miami Marlins</td><td>64.02</td></tr> <tr><td>Milwaukee Brewers</td><td>51.2</td></tr> <tr><td>Minnesota Twins</td><td>99.8125</td></tr> <tr><td>New York Mets</td><td>128.413458</td></tr> <tr><td>New York Yankees</td><td>221.574999</td></tr> <tr><td>Oakland Athletics</td><td>80.613332</td></tr> <tr><td>Philadelphia Phillies</td><td>91.616668</td></tr> <tr><td>Pittsburgh Pirates</td><td>95.840999</td></tr> <tr><td>San Diego Padres</td><td>94.12</td></tr> <tr><td>San Francisco Giants</td><td>166.744443</td></tr> <tr><td>Seattle Mariners</td><td>139.804258</td></tr> <tr><td>St. Louis Cardinals</td><td>143.514</td></tr> <tr><td>Tampa Bay Rays</td><td>60.065366</td></tr> <tr><td>Texas Rangers</td><td>158.68022</td></tr> <tr><td>Toronto Blue Jays</td><td>131.905327</td></tr> <tr><td>Washington Nationals</td><td>142.501785</td></tr> </tbody></table>   <p>The mean is: 125.3896<br>The standard deviation is: 48.64039</p>   <p>Chebyshev&#8217;s inequality</p>  
\( 1 - \frac{1}{k^{2}} \)

  <p>tells us that, for k = 1.5, at least 55.56% of the data in this case falls in the interval:</p>  
\(
(\mu - 1.5\sigma,\ \mu + 1.5\sigma) = (52.42902,\ 198.3502)
\)
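<p>As a cross-check, here is a short Python sketch over the same salary data. Remember that Chebyshev only guarantees a lower bound, so the observed fraction can be (and here is) much higher:</p>

```python
from statistics import mean, stdev

# 2016 MLB team payrolls in $M (the dataset above)
salaries = [
    91.995583, 77.073541, 141.741213, 198.328678, 163.805667,
    113.911667, 80.905951, 92.652499, 103.603571, 192.3075,
    89.0625, 136.564175, 160.98619, 248.321662, 64.02,
    51.2, 99.8125, 128.413458, 221.574999, 80.613332,
    91.616668, 95.840999, 94.12, 166.744443, 139.804258,
    143.514, 60.065366, 158.68022, 131.905327, 142.501785,
]

mu, sigma = mean(salaries), stdev(salaries)
k = 1.5

# Chebyshev: at least 1 - 1/k^2 of the data lies within k standard deviations of the mean
bound = 1 - 1 / k ** 2
lo, hi = mu - k * sigma, mu + k * sigma
frac = sum(lo < s < hi for s in salaries) / len(salaries)

print(f"guaranteed: {bound:.2%}, observed: {frac:.2%}")
```

<p>Note the sketch uses the sample standard deviation (stdev), as R&#8217;s sd() does; with the population version the interval shifts slightly, but the bound still holds.</p>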

  <div style="height:50px" aria-hidden="true" class="wp-block-spacer"></div> ]]></content:encoded>
					
					<wfw:commentRss>https://www.gironi.it/blog/en/the-normal-distribution/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
