<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>anova &#8211; paologironi blog</title>
	<atom:link href="https://www.gironi.it/blog/en/tag/anova/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.gironi.it/blog</link>
	<description>Scattered notes on (retro) computing, data analysis, statistics, SEO, and things that change</description>
	<lastBuildDate>Sun, 01 Mar 2026 19:05:21 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	
	<item>
		<title>Analysis of Variance, ANOVA. Explained simply</title>
		<link>https://www.gironi.it/blog/en/analysis-of-variance-anova-explained-simply/</link>
					<comments>https://www.gironi.it/blog/en/analysis-of-variance-anova-explained-simply/#respond</comments>
		
		<dc:creator><![CDATA[paolo]]></dc:creator>
		<pubDate>Sun, 03 Oct 2021 13:53:00 +0000</pubDate>
				<category><![CDATA[statistics]]></category>
		<category><![CDATA[Analysis of Variance]]></category>
		<category><![CDATA[anova]]></category>
		<category><![CDATA[parametric test]]></category>
		<guid isPermaLink="false">https://www.gironi.it/blog/?p=3303</guid>

					<description><![CDATA[Analysis of Variance (ANOVA) is a parametric test that evaluates the differences between the means of two or more data groups. It is a statistical hypothesis test that is widely used in scientific research and allows to determine if the means of at least two populations are different. As a minimum prerequisite, a continuous dependent &#8230; <a href="https://www.gironi.it/blog/en/analysis-of-variance-anova-explained-simply/" class="more-link">Continue reading<span class="screen-reader-text"> "Analysis of Variance, ANOVA. Explained simply"</span></a>]]></description>
										<content:encoded><![CDATA[
<p>Analysis of Variance (ANOVA) is a <a href="/blog/en/statistical-parametric-and-non-parametric-tests/" data-type="post" data-id="2306">parametric test</a> that evaluates the differences between the means of two or more data groups. <br>It is a statistical hypothesis test that is widely used in scientific research and allows us to determine whether the means of at least two populations are different. <br>As a minimum prerequisite, a <strong>continuous dependent variable</strong> and a <strong>categorical independent variable</strong> that divides the data into comparison groups are required.</p>



<span id="more-3303"></span>



<p>The term &#8220;analysis of variance&#8221; comes from the way the analysis uses variances to determine if the means are different.</p>



<p>ANOVA works by comparing the variance of the means between the groups (called <em>between</em> variance) with the variance within the individual groups (or <em>within</em> variance).</p>



<p>Analysis of Variance was developed by the great statistician <a href="https://en.wikipedia.org/wiki/Ronald_Fisher" target="_blank" rel="noreferrer noopener">Ronald Fisher</a> (we could say he is one of the Gods in the Olympus of statistics).<br>It is no coincidence that ANOVA is based on the F distribution: the &#8220;F&#8221; was chosen in Fisher&#8217;s honour.</p>


<div class="wp-block-image is-style-rounded">
<figure class="aligncenter size-large"><img fetchpriority="high" decoding="async" width="291" height="408" src="https://www.gironi.it/blog/wp-content/uploads/2024/11/image.jpeg" alt="image" class="wp-image-3304" srcset="https://www.gironi.it/blog/wp-content/uploads/2024/11/image.jpeg 291w, https://www.gironi.it/blog/wp-content/uploads/2024/11/image-214x300.jpeg 214w" sizes="(max-width: 291px) 85vw, 291px" /><figcaption class="wp-element-caption">Young Ronald Fisher <br>(from Wikipedia)</figcaption></figure>
</div>

				<div class="wp-block-uagb-table-of-contents uagb-toc__align-left uagb-toc__columns-1  uagb-block-9d46178e      "
					data-scroll= "1"
					data-offset= "30"
					style=""
				>
				<div class="uagb-toc__wrap">
						<div class="uagb-toc__title">
							What we will talk about<br>						</div>
																						<div class="uagb-toc__list-wrap ">
						<ol class="uagb-toc__list"><li class="uagb-toc__list"><a href="#anova-a-parametric-test" class="uagb-toc-link__trigger">ANOVA: a parametric test</a><li class="uagb-toc__list"><a href="#why-anova-and-not-a-series-of-t-tests" class="uagb-toc-link__trigger">Why ANOVA and not a series of t-tests?</a><li class="uagb-toc__list"><a href="#the-simplest-case-one-way-anova" class="uagb-toc-link__trigger">The simplest case: One-way ANOVA</a><li class="uagb-toc__list"><a href="#the-classic-and-a-bit-tedious-way-to-perform-an-anova-test-the-anova-table" class="uagb-toc-link__trigger">The &quot;classic&quot; (and a bit tedious) way to perform an ANOVA test: the ANOVA table</a><li class="uagb-toc__list"><a href="#what-an-effort-its-time-to-harness-the-power-of-r" class="uagb-toc-link__trigger">What an effort&#8230; It&#039;s time to harness the power of R</a><li class="uagb-toc__list"><a href="#you-might-also-like" class="uagb-toc-link__trigger">You might also like</a></ol>					</div>
									</div>
				</div>
			


<h2 class="wp-block-heading">ANOVA: a parametric test</h2>



<p>ANOVA is a <a href="/blog/en/statistical-parametric-and-non-parametric-tests/" data-type="post" data-id="2306">parametric test</a>. It therefore requires that several assumptions be met:</p>



<ul class="wp-block-list">
<li>Normality: the data in each group follow a normal distribution.</li>

<li>Homogeneity of variances: the groups have approximately equal variances.</li>

<li>The residuals satisfy the usual least-squares assumptions (mean zero, constant variance, normally distributed).</li>

<li>There is at least one categorical independent variable (factor).</li>

<li>The dependent variable is continuous.</li>

<li>The observations are independent.</li>
</ul>



<h2 class="wp-block-heading">Why ANOVA and not a series of t-tests?</h2>



<p>A legitimate question is: why should I use ANOVA when I could simply compare each group with every other one?<br>The answer is not just the tedium of performing a large number of tests (for 4 groups, for example, I would need to perform 6 different t-tests, one per pair). The bigger problem is that the probability of committing a <a href="https://www.gironi.it/blog/il-test-delle-ipotesi/#errori-di-i-e-ii-tipo" target="_blank" rel="noreferrer noopener">Type I error</a> grows rapidly with the number of comparisons. We know that if we choose a typical alpha of 0.05, we set the probability of incurring a Type I error in a single test at 5%.<br>If we call n the number of t-tests to be performed, the overall probability of committing at least one Type I error is:</p>



\(
1-(1-\alpha)^n \\
\)



<p>in our example this means:</p>



\(
1-(1-0.05)^6 = \\
1-0.735 = \\
0.265 \\
\)



<p>That is, a Type I error probability of 26.5%! Clearly unacceptable&#8230; <strong>When we want to compare the means of 3 or more groups, ANOVA is certainly preferable to a series of t-tests</strong>.</p>
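<p>This inflation is easy to verify numerically; a minimal sketch in R of the calculation above:</p>

```r
# Familywise Type I error rate when running all pairwise t-tests at alpha = 0.05
alpha <- 0.05
n <- choose(4, 2)                  # 4 groups -> 6 pairwise comparisons
familywise <- 1 - (1 - alpha)^n
round(familywise, 3)               # 0.265
```

<p>With 10 groups (45 pairwise tests) the same formula gives a familywise error rate of about 90%, which is why pairwise t-testing breaks down so quickly.</p>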



<h2 class="wp-block-heading">The simplest case: One-way ANOVA</h2>



<p>The simplest type of ANOVA test is one-way ANOVA. This method is a generalization of <a href="https://www.gironi.it/blog/la-distribuzione-t-e-il-test-delle-ipotesi/" data-type="post" data-id="1131">t-tests</a> capable of evaluating the difference between more than two group means.<br>The data is organized into various groups based on a single categorical variable (called the <em>factor</em> variable).</p>



<p>As we said, ANOVA is a <a href="https://www.gironi.it/blog/il-test-delle-ipotesi/" data-type="post" data-id="1190">hypothesis test</a>. In this case, we have a <strong>null hypothesis</strong> H<sub>0</sub>:<br><em>the means between the different groups are equal</em><br>and an <strong>alternative hypothesis</strong> H<sub>a</sub>:<br><em>at least one mean is different</em>.</p>



<p class="has-light-gray-background-color has-background">ATTENTION: ANOVA tells us IF a mean is different, not WHICH group has a different mean. For that, we will need an additional step, the <em>post hoc</em> test, which we will see in due course.</p>



<h2 class="wp-block-heading">The &#8220;classic&#8221; (and a bit tedious) way to perform an ANOVA test: the ANOVA table</h2>



<p>It is true that computing the result of an ANOVA test the &#8220;classic&#8221; way can provide important theoretical insight, but it is also true that anyone who uses this type of test in everyday work rarely, if ever, picks up paper and pencil to fill out an ANOVA table&#8230; The convenience of R functions that do all the &#8220;hard work&#8221; with a click is truly priceless. However, a step-by-step example will provide us with an important introduction.</p>



<p>The steps we will take can be schematized as follows:</p>



<ul class="wp-block-list">
<li>We will calculate the common variance, called the within-sample variance <em>S</em><sup>2</sup><em><sub>within</sub></em>, or residual variance.</li>

<li>We will calculate the variance between the sample means, that is:<br>the mean of each group;<br>the variance between the sample means (<em>S</em><sup>2</sup><em><sub>between</sub></em>).</li>

<li>And then we will derive the F statistic as the ratio <em>S</em><sup>2</sup><em><sub>between</sub></em> / <em>S</em><sup>2</sup><em><sub>within</sub></em>.</li>
</ul>
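<p>The three steps above can be sketched directly in R, using the example data introduced in the table that follows (the variable names are mine):</p>

```r
# The hypothetical "objectives by device" values from the worked example
desktop <- c(39, 67, 78, 59, 42, 51)
mobile  <- c(45, 54, 64, 52, 46, 35)
tablet  <- c(30, 45, 22, 39, 38, 41)
groups  <- list(desktop, mobile, tablet)

# Step 1 - within-group sum of squares: squared deviations from each group mean
ss_within <- sum(sapply(groups, function(g) sum((g - mean(g))^2)))

# Step 2 - between-group sum of squares: group size times the squared deviation
# of each group mean from the grand mean
grand_mean <- mean(unlist(groups))
ss_between <- sum(sapply(groups, function(g) length(g) * (mean(g) - grand_mean)^2))

# Step 3 - the F statistic as the ratio of the two mean squares
df_between <- length(groups) - 1                        # K - 1 = 2
df_within  <- length(unlist(groups)) - length(groups)   # N - K = 15
F_value <- (ss_between / df_between) / (ss_within / df_within)
round(F_value, 2)                                       # 4.86
```

<p>Working in full precision (no intermediate rounding), this reproduces the F value that the hand calculation below arrives at.</p>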



<p>Since SEO is one of the fields I follow with the greatest interest, I hypothesize an example (obviously devoid of real value) that deals with the analysis of website traffic data.</p>



<p>My independent variable with multiple factors is the type of device used by the visitors: desktop, mobile, tablet.</p>



<p>My dependent variable will be the objectives achieved on the site.</p>



<p>Suppose we follow the monthly data for 6 months and obtain these measurements:</p>



<figure class="wp-block-table is-style-regular"><table><tbody><tr><td>Desktop</td><td>Mobile</td><td>Tablet</td></tr><tr><td>39</td><td>45</td><td>30</td></tr><tr><td>67</td><td>54</td><td>45</td></tr><tr><td>78</td><td>64</td><td>22</td></tr><tr><td>59</td><td>52</td><td>39</td></tr><tr><td>42</td><td>46</td><td>38</td></tr><tr><td>51</td><td>35</td><td>41</td></tr></tbody></table><figcaption class="wp-element-caption">Objectives by device</figcaption></figure>



<p>I&#8217;ll calculate the mean value for the Desktop group:<br>(39+67+78+59+42+51)/6 = 56</p>



<p>I calculate the mean for Mobile:<br>(45+54+64+52+46+35)/6 = 49.3</p>



<p>And the one for Tablets:<br>(30+45+22+39+38+41)/6 =  35.83</p>



<p>Let&#8217;s move on to calculating the sums of squares:</p>



<figure class="wp-block-table"><table><tbody><tr><td class="has-text-align-left" data-align="left"><strong>Desktop</strong></td><td><strong>Mobile</strong></td><td><strong>Tablet</strong></td></tr><tr><td class="has-text-align-left" data-align="left">(39-56)<sup>2</sup> = 289</td><td>(45-49.3)<sup>2</sup> = 18.49</td><td>(30-35.83)<sup>2</sup> = 33.99</td></tr><tr><td class="has-text-align-left" data-align="left">(67-56)<sup>2</sup> = 121</td><td>(54-49.3)<sup>2</sup> = 22.09</td><td>(45-35.83)<sup>2</sup> = 84.09</td></tr><tr><td class="has-text-align-left" data-align="left">(78-56)<sup>2</sup> = 484</td><td>(64-49.3)<sup>2</sup> = 216.09</td><td>(22-35.83)<sup>2</sup> = 191.27</td></tr><tr><td class="has-text-align-left" data-align="left">(59-56)<sup>2</sup> = 9</td><td>(52-49.3)<sup>2</sup> = 7.29</td><td>(39-35.83)<sup>2</sup> = 10.05</td></tr><tr><td class="has-text-align-left" data-align="left">(42-56)<sup>2</sup> = 196</td><td>(46-49.3)<sup>2</sup> = 10.89</td><td>(38-35.83)<sup>2</sup> = 4.71</td></tr><tr><td class="has-text-align-left" data-align="left">(51-56)<sup>2</sup> = 25</td><td>(35-49.3)<sup>2</sup> = 204.49</td><td>(41-35.83)<sup>2</sup> = 26.73</td></tr><tr><td class="has-text-align-left" data-align="left">Total</td><td>Total</td><td>Total</td></tr><tr><td class="has-text-align-left" data-align="left"><strong>1124</strong></td><td><strong>479.34</strong></td><td><strong>350.84</strong></td></tr></tbody></table></figure>



<p>We are ready to derive SS<sub>e</sub>, the sum of squared errors:<br>SS<sub>e</sub> = 1124 + 479.34 + 350.84 = <strong>1954.18</strong></p>



<p>We calculate the Grand Mean of all observations by summing the values of the desktop, mobile, and tablet groups and dividing by the number of observations:<br>(336+296+215)/18 = 847/18 &#8776; 47<br>(we round to 47 to keep the hand calculation manageable; this rounding explains the tiny differences from R&#8217;s output later).</p>



<p>Let&#8217;s proceed with the calculation using a table:</p>



<figure class="wp-block-table is-style-regular"><table><tbody><tr><td></td><td>A &#8211; Observations</td><td>B &#8211; Grand Mean</td><td>C &#8211; Group Mean</td><td>D = (B-C)<sup>2</sup></td><td>A &#215; D</td></tr><tr><td>Desktop</td><td>6</td><td>47</td><td>56</td><td>81</td><td>486</td></tr><tr><td>Mobile</td><td>6</td><td>47</td><td>49.3</td><td>5.29</td><td>31.74</td></tr><tr><td>Tablet</td><td>6</td><td>47</td><td>35.83</td><td>124.77</td><td>748.62</td></tr></tbody></table></figure>



<p>And we find the Sum of Squares between:<br><br>SS<sub>b</sub> = 486 + 31.74 + 748.62 = <strong>1266.36</strong></p>



<p>Just a bit more effort, and now it gets interesting!<br><br>The between degrees of freedom <em>df<sub>1</sub></em> are equal to K &#8211; 1, where K is the number of groups, thus: <br>3 &#8211; 1 = 2<br>The within degrees of freedom <em>df<sub>2</sub></em> are equal to N &#8211; K, where N is the total number of observations, thus:<br>18 &#8211; 3 = 15</p>



<p>Let&#8217;s find the Mean Square Error, MS<sub>e</sub>:</p>



\(
MS_e=\frac{SS_e}{df_2} \\
\frac{1954.18}{15} = 130.3 \\ \\
\)



<p>And the Mean Square between:</p>



\(
MS_b=\frac{SS_b}{df_1} \\
\frac{1266.36}{2}=633.18 \\ \\
\)



<p>The moment has arrived: we can finally determine our F value!</p>



\(
F=\frac{MS_b}{MS_e} \\
\frac{633.18}{130.3}=4.86 \\
\)



<p><strong>I&#8217;ve finally found the value I was looking for, F=4.86.</strong><br>Now, I just need to consult an <a href="https://www.unirc.it/documentazione/materiale_didattico/600_2011_294_11517.pdf" target="_blank" rel="noreferrer noopener">F distribution table</a> and find the critical value corresponding to the intersection of the df<sub>2</sub>/df<sub>1</sub> values. <br>That value is 3.68.</p>
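<p>Incidentally, if no printed table is at hand, the same critical value can be obtained in R with the base <code>qf</code> quantile function:</p>

```r
# Critical F value for alpha = 0.05 with df1 = 2 (between) and df2 = 15 (within):
# the 95th percentile of the F distribution
round(qf(0.95, df1 = 2, df2 = 15), 2)   # 3.68
```
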



<p><strong>My F value of 4.86 falls into the rejection zone for the null hypothesis H<sub>0</sub>. <br>My test, with an alpha value of 0.05, indicates that the means of the three groups are not equal.</strong></p>



<h2 class="wp-block-heading">What an effort&#8230; It&#8217;s time to harness the power of R</h2>



<p>The example values are available in <a href="https://www.gironi.it/blog/wp-content/uploads/2021/10/anova-ex1.csv">this csv file</a>.</p>



<p>Assuming our csv file is in the home directory, I can create an R script in Rstudio and load my very simple dataset:</p>



<pre class="wp-block-preformatted">obiettivianova &lt;- read.csv("~/anova-ex1.csv")</pre>



<p>A graphical look at the values for the three groups:</p>



<pre class="wp-block-preformatted">boxplot(obiettivianova$objectives ~ obiettivianova$device, main="Boxplot objectives by device", xlab="Device", ylab="Objectives")</pre>



<figure class="wp-block-image size-full"><img decoding="async" width="855" height="540" src="https://www.gironi.it/blog/wp-content/uploads/2021/10/boxplot-ex1.png" alt="boxplot" class="wp-image-2376" srcset="https://www.gironi.it/blog/wp-content/uploads/2021/10/boxplot-ex1.png 855w, https://www.gironi.it/blog/wp-content/uploads/2021/10/boxplot-ex1-300x189.png 300w" sizes="(max-width: 709px) 85vw, (max-width: 909px) 67vw, (max-width: 1362px) 62vw, 840px" /></figure>



<p>The boxplot already seems to suggest something, but we proceed analytically.</p>



<p>Let&#8217;s take a look at the means:</p>



<pre class="wp-block-preformatted">aggregate(objectives ~ device, obiettivianova, mean)</pre>



<pre class="wp-block-preformatted">   device objectives
1 desktop  56.00000
2  mobile  49.33333
3  tablet  35.83333</pre>



<p>and proceed with our test:</p>



<pre class="wp-block-preformatted">my_model &lt;- aov(obiettivianova$objectives ~ obiettivianova$device)

summary(my_model)</pre>



<p>The output we obtain is as follows:</p>



<pre class="wp-block-preformatted has-small-font-size">                      Df Sum Sq Mean Sq F value Pr(&gt;F)  
obiettivianova$device  2   1267   633.4   4.862 0.0236 *
Residuals             15   1954   130.3                 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1</pre>



<p>The power of R is clear here. In just moments we have a wealth of useful information: the F value is 4.862, the degrees of freedom are 2 and 15, and so on.<br>There&#8217;s no need to consult the F distribution table (or use the corresponding R command), because the p-value is already present and indicates rejection of the null hypothesis at the 5% level (p = 0.0236 &lt; 0.05).</p>
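<p>For the curious, the p-value in the summary is simply the upper-tail probability of the F distribution, which we can reproduce with the base <code>pf</code> function:</p>

```r
# Upper-tail probability of F = 4.862 with df1 = 2 and df2 = 15:
# this matches the Pr(>F) column in summary(my_model)
round(pf(4.862, df1 = 2, df2 = 15, lower.tail = FALSE), 4)   # 0.0236
```
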



<p>ANOVA tells us that the means are not all equal. It&#8217;s time for a post-hoc test to evaluate where the &#8220;anomaly&#8221; lies:</p>



<pre class="wp-block-preformatted">TukeyHSD(my_model)</pre>



<p>The Tukey HSD Test is one of the most useful post hoc tests for cases like this. The output we get is:</p>



<pre class="wp-block-preformatted has-small-font-size">                     diff       lwr       upr     p adj
mobile-desktop  -6.666667 -23.78357 10.450234 0.5810821
tablet-desktop -20.166667 -37.28357 -3.049766 0.0204197
tablet-mobile  -13.500000 -30.61690  3.616900 0.1348303
</pre>



<p>As you can see, while for the mobile-desktop and tablet-mobile comparisons we cannot reject the null hypothesis, the same cannot be said for tablet-desktop, where the difference between the means is statistically significant (p adj = 0.020).</p>
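<p>This reading is easy to confirm programmatically. A self-contained sketch that rebuilds the model from the example values (the variable names are mine) and checks the adjusted p-values:</p>

```r
# Rebuild the example dataset and model, then inspect the Tukey HSD results
objectives <- c(39, 67, 78, 59, 42, 51,   # desktop
                45, 54, 64, 52, 46, 35,   # mobile
                30, 45, 22, 39, 38, 41)   # tablet
device <- factor(rep(c("desktop", "mobile", "tablet"), each = 6))

tk <- TukeyHSD(aov(objectives ~ device))

# A comparison is significant when its interval excludes zero,
# i.e. when its adjusted p-value is below 0.05
tk$device[, "p adj"] < 0.05   # TRUE only for tablet-desktop
```

<p>Equivalently, <code>plot(tk)</code> draws the three confidence intervals, and only the tablet-desktop interval stays entirely below zero.</p>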



<h2 class="wp-block-heading">You might also like</h2>



<ul class="wp-block-list">
<li><a href="/blog/en/the-chi-square-test-goodness-of-fit-and-test-of-independence/">The Chi-Square Test</a></li>



<li><a href="/blog/en/statistical-parametric-and-non-parametric-tests/">Statistical Parametric and Non-Parametric Tests</a></li>



<li><a href="/blog/en/correlation-and-regression-analysis-linear-regression/">Correlation and Regression Analysis</a></li>
</ul>
]]></content:encoded>
					
					<wfw:commentRss>https://www.gironi.it/blog/en/analysis-of-variance-anova-explained-simply/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
