<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>correlazione &#8211; paologironi blog</title>
	<atom:link href="https://www.gironi.it/blog/en/tag/correlazione/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.gironi.it/blog</link>
	<description>Scattered notes on (retro) computing, data analysis, statistics, SEO, and things that change</description>
	<lastBuildDate>Fri, 13 Dec 2024 10:14:07 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	
	<item>
		<title>Correlation and Regression Analysis &#8211; Linear Regression</title>
		<link>https://www.gironi.it/blog/en/correlation-and-regression-analysis-linear-regression/</link>
					<comments>https://www.gironi.it/blog/en/correlation-and-regression-analysis-linear-regression/#respond</comments>
		
		<dc:creator><![CDATA[paolo]]></dc:creator>
		<pubDate>Tue, 25 Aug 2020 10:10:00 +0000</pubDate>
				<category><![CDATA[statistics]]></category>
		<category><![CDATA[coefficiente di determinazione]]></category>
		<category><![CDATA[correlazione]]></category>
		<category><![CDATA[grafico dispersione]]></category>
		<category><![CDATA[r di pearson]]></category>
		<category><![CDATA[regressione]]></category>
		<category><![CDATA[residui]]></category>
		<category><![CDATA[scatterplot]]></category>
		<guid isPermaLink="false">https://www.gironi.it/blog/?p=3340</guid>

					<description><![CDATA[In previous posts, we have examined concepts such as the mean and standard deviation, which are capable of describing a single variable. These statistics are of great importance; however, in daily practice, it is often necessary to investigate the relationships between two or more variables. This is where new key concepts emerge: correlation and regression &#8230; <a href="https://www.gironi.it/blog/en/correlation-and-regression-analysis-linear-regression/" class="more-link">Continue reading<span class="screen-reader-text"> "Correlation and Regression Analysis &#8211; Linear Regression"</span></a>]]></description>
										<content:encoded><![CDATA[
<p>In previous posts, we have examined concepts such as the mean and standard deviation, which are capable of describing a single variable. These statistics are of great importance; however, in daily practice, it is often necessary to <strong>investigate the relationships between two or more variables</strong>. This is where new key concepts emerge: <strong>correlation</strong> and <strong>regression analysis</strong>.</p>



<p>Correlation and regression analysis are tools widely used during the analysis of our datasets.<br>They involve <strong>estimating the relationship between a dependent variable and one or more independent variables</strong>.</p>



<span id="more-3340"></span>


				<div class="wp-block-uagb-table-of-contents uagb-toc__align-left uagb-toc__columns-1  uagb-block-ec276fe0      "
					data-scroll= "1"
					data-offset= "30"
					style=""
				>
				<div class="uagb-toc__wrap">
						<div class="uagb-toc__title">
							What We&#8217;ll Cover<br>						</div>
																						<div class="uagb-toc__list-wrap ">
						<ol class="uagb-toc__list"><li class="uagb-toc__list"><a href="#simple-linear-regression" class="uagb-toc-link__trigger">Simple Linear Regression</a><li class="uagb-toc__list"><a href="#pearsons-correlation-coefficient-r" class="uagb-toc-link__trigger">Pearson&#039;s Correlation Coefficient r</a><li class="uagb-toc__list"><a href="#the-coefficient-of-determination-r2" class="uagb-toc-link__trigger">The Coefficient of Determination r2</a><li class="uagb-toc__list"><a href="#finding-the-regression-equation" class="uagb-toc-link__trigger">Finding the Regression Equation</a><li class="uagb-toc__list"><a href="#outliers-and-influential-points" class="uagb-toc-link__trigger">Outliers and Influential Points</a><li class="uagb-toc__list"><a href="#model-assumptions" class="uagb-toc-link__trigger">Model Assumptions</a><li class="uagb-toc__list"><a href="#residual-analysis" class="uagb-toc-link__trigger">Residual Analysis</a><li class="uagb-toc__list"><a href="#regression-analysis-practical-difficulties" class="uagb-toc-link__trigger">Regression Analysis: Practical Difficulties</a><li class="uagb-toc__list"><a href="#other-types-of-correlation-coefficients" class="uagb-toc-link__trigger">Other Types of Correlation Coefficients</a><ul class="uagb-toc__list"><li class="uagb-toc__list"><a href="#the-point-biserial-correlation-coefficient" class="uagb-toc-link__trigger">The Point-Biserial Correlation Coefficient</a><li class="uagb-toc__list"><li class="uagb-toc__list"><a href="#the-phi-coefficient" class="uagb-toc-link__trigger">The Phi Coefficient</a><li class="uagb-toc__list"><li class="uagb-toc__list"><a href="#spearmans-rank-correlation-coefficient-rho-and-a-note-on-kendalls-tau" class="uagb-toc-link__trigger">Spearman&#039;s Rank Correlation Coefficient Rho (and a Note on Kendall&#039;s Tau)</a></ul></ol>					</div>
									</div>
				</div>
			


<h2 class="wp-block-heading">Simple Linear Regression</h2>



<p>We can consider <strong>regression</strong> as a method suitable for finding a mathematical relationship that expresses a link between a variable y (the <em><strong>dependent variable</strong></em> or <em><strong>response variable</strong></em>) and a variable x (the <em><strong>independent variable</strong></em> or <em><strong>predictor variable</strong></em>).</p>



<p><strong>The first useful step to qualitatively investigate the possible dependence between two variables x and y is always to draw a graph, called a scatter diagram (<em>scatterplot</em>).</strong></p>



<p>We place the data relating to one of the two variables on the abscissa, the data relating to the other variable on the ordinate, and we represent the individual observations with points. Remember that scatter plots compare two variables.</p>



<p>If there is a simple relationship between the two variables, the diagram should show it!</p>



<p>Let&#8217;s use some example values and let R draw our scatter diagram.</p>



<p>I put some imaginary data in a CSV file relating to a hypothetical correlation between the ambient temperature recorded in my city and the sales of ice cream. I name the file <em>gelati.csv</em> and save it as plain text in any folder on my filesystem (in my example in <em>/tmp/gelati.csv</em>). The file will have this content:</p>



<pre class="wp-block-preformatted">temperatura,gelati
25,58
30,70
29,61
26,53
25,48
28,66
24,47
22,47
20,40
18,29
22,33</pre>



<p>Now I open R Studio and load my dataset:</p>



<pre class="wp-block-preformatted">gelati &lt;- read.csv("/tmp/gelati.csv")</pre>



<p>Then I plot the scatterplot to see if the figure is compatible with a linear regression hypothesis:</p>



<pre class="wp-block-preformatted">plot(gelati)</pre>



<figure class="wp-block-image size-large is-resized"><img fetchpriority="high" decoding="async" width="855" height="540" src="https://www.gironi.it/blog/wp-content/uploads/2020/08/diagramma-a-dispersione.png" alt="Scatterplot" class="wp-image-1810" srcset="https://www.gironi.it/blog/wp-content/uploads/2020/08/diagramma-a-dispersione.png 855w, https://www.gironi.it/blog/wp-content/uploads/2020/08/diagramma-a-dispersione-300x189.png 300w" sizes="(max-width: 709px) 85vw, (max-width: 909px) 67vw, (max-width: 1362px) 62vw, 840px" /><figcaption class="wp-element-caption">The scatter diagram representing the variables temperature and number of ice creams sold.</figcaption></figure>



<p>The diagram shows an evident linear correlation, with an upward trend.</p>



<h2 class="wp-block-heading">Pearson&#8217;s Correlation Coefficient r</h2>



<p>There are various types of correlation coefficients in statistics. <strong>If the relationship to be investigated is between two <a href="https://www.gironi.it/blog/i-dati-scale-di-misura/" class="rank-math-link">variables of interval or ratio type</a> (i.e., quantitative, numerical variables), the best known and most used is certainly Pearson&#8217;s correlation coefficient</strong>, generally indicated with the letter <em>r</em>.</p>



<p>I use the R language (not to be confused with the coefficient <em>r</em>) for a practical example with Pearson&#8217;s correlation coefficient.<br>The function to use is cor():</p>



<pre class="wp-block-preformatted">cor(gelati$temperatura,gelati$gelati)

[1] 0.9320279</pre>



<p>As you can see, the correlation in this example is very strong.</p>



<p><em><strong>r</strong></em> is a <strong>standardized value ranging from +1 to -1</strong>. The further it is from zero and closer to 1 (or -1 in the case of negative correlation), the stronger the correlation.</p>



<p class="has-light-gray-background-color has-background">If r is positive, the correlation is positive, meaning y increases as x increases.<br>If r is negative, y decreases as x increases.</p>



<p>But when can I say that a correlation is strong or very strong, and when is it moderate or even null? The answer is&#8230; it depends 🙂<br>There is no standard answer. Very arbitrarily, we can say that a correlation with absolute value below 0.2 is generally considered very weak, between 0.2 and 0.5 moderate, and between 0.5 and 0.8 quite strong. Correlations above 0.8 (or below -0.8) are very strong, and in practice quite rare.</p>



<p><strong><em>CAUTION 1</em></strong>: <strong>The most important thing is that the evidence of a relationship between two variables does not necessarily imply the presence of a cause-effect relationship between the two variables. This is a point of utmost importance, which must always be kept in mind. </strong>Both variables that in my study show a very strong correlation may actually depend on a third variable, or many other variables, which constitute the real cause. Finding and calculating the correlation between two variables is relatively simple; finding and especially proving a cause-and-effect relationship is extremely complex!</p>



<p><strong><em>CAUTION 2:</em></strong> Another point I would like to emphasize is that <strong>Pearson&#8217;s correlation coefficient r measures the linear correlation</strong> between two variables. This means that two variables can show an apparent null correlation (<em>r</em> around 0) and yet be correlated, for example showing a <strong>curvilinear correlation</strong>. A classic textbook example concerns the correlation between stress level and performance in an exam. In fact, a slight stress contributes to improving performance, but beyond a certain threshold it is completely harmful, leading to a decline in the result. In this case, the analysis in terms of r and linear correlation would lead to discarding an existing correlation.</p>



<div class="wp-block-group bordo has-light-gray-background-color has-background"><div class="wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow">
<p><b>And now some math (but just a little)</b></p>



<p>Pearson&#8217;s correlation coefficient for a population, given two variables X and Y, is indicated by the Greek letter rho and is given by:</p>



<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-flow wp-block-group-is-layout-flow">
\(
\\
\rho_{X,Y}=\frac{COV(X,Y)}{\sigma_X \sigma_Y}
\\
\)



<p>where:</p>
<ul>
<li>COV indicates the covariance</li>
<li>&sigma;<sub>X</sub> is the <a href="https://www.gironi.it/blog/statistica-descrittiva-misure-di-dispersione-o-variabilita/#sqm">standard deviation</a> of X</li>
<li>&sigma;<sub>Y</sub> is the <a href="https://www.gironi.it/blog/statistica-descrittiva-misure-di-dispersione-o-variabilita/#sqm">standard deviation</a> of Y</li>
</ul>



<p>To calculate the population covariance (which, we recall, is a <em>non-standardized</em> measure of the direction and strength of the relationship between the elements of two populations):</p>



\(
\sigma_{XY}=\frac{\sum\limits_{i=1}^n (X_i-\mu_x)(Y_i-\mu_y)}{n}
\
\)



<p>where:</p>



<ul>
<li>μ<sub>x</sub> is the population mean for x</li>
<li>μ<sub>y</sub> is the population mean for y</li>
<li>n is the number of elements in both variables</li>
<li>i is the index ranging from 1 to n</li>
<li>X<sub>i</sub> is a single element of population x</li><li>Y<sub>i</sub> is a single element of population y</li></ul>



<p><strong>Note</strong>: to estimate these quantities from a sample, just use <em>n-1</em> in the denominator instead of <em>n</em>. By default, R&#8217;s functions use the sample versions, so the value of <em>Pearson&#8217;s r</em> will always be calculated with <em>n-1</em> denominators (the factor cancels out in the ratio, so <em>r</em> comes out the same either way).</p>



<p><strong>I take out the calculator &#8211; or paper and pen &#8211; and do some calculations&#8230;</strong></p>



<p>This is my table, which I completed by computing the various values:</p>



<table border=1 style="font-size:11px!important">
<tr> <th> </th> <th> temperatura </th> <th> gelati </th> <th> x<sub>i</sub>-X </th> <th> y<sub>i</sub>-Y </th> <th> (x<sub>i</sub>-X)(y<sub>i</sub>-Y)</th> </tr>
<tr> <td align="right"> 1 </td> <td align="right"> 25 </td> <td align="right"> 58 </td> <td align="right"> 0.55 </td> <td align="right"> 7.82 </td> <td align="right"> 4.26 </td> </tr>
<tr> <td align="right"> 2 </td> <td align="right"> 30 </td> <td align="right"> 70 </td> <td align="right"> 5.55 </td> <td align="right"> 19.82 </td> <td align="right"> 109.90 </td> </tr>
<tr> <td align="right"> 3 </td> <td align="right"> 29 </td> <td align="right"> 61 </td> <td align="right"> 4.55 </td> <td align="right"> 10.82 </td> <td align="right"> 49.17 </td> </tr>
<tr> <td align="right"> 4 </td> <td align="right"> 26 </td> <td align="right"> 53 </td> <td align="right"> 1.55 </td> <td align="right"> 2.82 </td> <td align="right"> 4.36 </td> </tr>
<tr> <td align="right"> 5 </td> <td align="right"> 25 </td> <td align="right"> 48 </td> <td align="right"> 0.55 </td> <td align="right"> -2.18 </td> <td align="right"> -1.19 </td> </tr>
<tr> <td align="right"> 6 </td> <td align="right"> 28 </td> <td align="right"> 66 </td> <td align="right"> 3.55 </td> <td align="right"> 15.82 </td> <td align="right"> 56.08 </td> </tr>
<tr> <td align="right"> 7 </td> <td align="right"> 24 </td> <td align="right"> 47 </td> <td align="right"> -0.45 </td> <td align="right"> -3.18 </td> <td align="right"> 1.45 </td> </tr>
<tr> <td align="right"> 8 </td> <td align="right"> 22 </td> <td align="right"> 47 </td> <td align="right"> -2.45 </td> <td align="right"> -3.18 </td> <td align="right"> 7.81 </td> </tr>
<tr> <td align="right"> 9 </td> <td align="right"> 20 </td> <td align="right"> 40 </td> <td align="right"> -4.45 </td> <td align="right"> -10.18 </td> <td align="right"> 45.36 </td> </tr>
<tr> <td align="right"> 10 </td> <td align="right"> 18 </td> <td align="right"> 29 </td> <td align="right"> -6.45 </td> <td align="right"> -21.18 </td> <td align="right"> 136.72 </td> </tr>
<tr> <td align="right"> 11 </td> <td align="right"> 22 </td> <td align="right"> 33 </td> <td align="right"> -2.45 </td> <td align="right"> -17.18 </td> <td align="right"> 42.17 </td> </tr>
</table>



<p>The sum of the values in the last column is 456.0909.</p>



<p>So I can calculate the covariance by dividing by n-1 = 10, therefore:<br>456.0909/10 = 45.60909</p>



<p>Calculating the sample standard deviations for X and Y, I find the values:</p>



<p>S<sub>x</sub> = 3.751363<br>S<sub>y</sub> = 13.04468</p>



<p>So S<sub>x</sub> * S<sub>y</sub> = 48.9353299</p>



<p>Last step. <strong>I can calculate r</strong>:</p>



<p>45.60909 / 48.9353299 = 0.9320278435</p>



<p>which, as you can see, perfectly coincides with the value provided by R&#8217;s cor() function.</p>
</div></div>
</div></div>
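<p>The hand calculation above can also be replicated in R in one line, using the built-in <em>cov()</em> and <em>sd()</em> functions (which, as noted, use <em>n-1</em> denominators by default):</p>

<pre class="wp-block-preformatted">cov(gelati$temperatura, gelati$gelati) / (sd(gelati$temperatura) * sd(gelati$gelati))

[1] 0.9320279</pre>

<p>The result coincides with the value returned by cor().</p>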



<h2 class="wp-block-heading">The Coefficient of Determination r<sup>2</sup></h2>



<p>By squaring <em><strong>r</strong></em>, we obtain the <strong>coefficient of determination</strong>.</p>



<p>In our case, the coefficient of determination r<sup>2</sup> will be:</p>



<pre class="wp-block-preformatted">R-squared = 0.86868</pre>



<p>But what does this number mean?</p>



<p>r<sup>2</sup> tells us to what extent our regression equation reproduces the variance of the data.<br>In other words, how much of the variation in the response variable is explained by the predictor variable. The more accurate the regression equation, the more the value of r<sup>2</sup> tends to 1.</p>



<p>The <em>cor()</em> function in R can compute the correlation with several different methods (Pearson, Kendall, Spearman), as you can easily see from the function&#8217;s syntax:</p>



<pre class="wp-block-preformatted">cor(x, y = NULL, use = "everything", method = c("pearson", "kendall", "spearman"))</pre>



<h2 class="wp-block-heading">Finding the Regression Equation</h2>



<p>Our goal is to obtain the regression equation, and since it is a linear regression, the typical form will be:</p>



<pre class="wp-block-preformatted">y = mx + b</pre>



<p>m indicates the slope of the line in the graph, and is called the <strong>regression coefficient</strong>.</p>



<p>b is the point where the line intersects the y-axis, and is called the <strong>intercept</strong>.</p>



<p>Always remember that the linear regression line is the line that best fits the data provided. Ideally, we would like to minimize the distances of all data points from the regression line. These distances are called <strong>errors</strong> and are also known as <strong>residuals</strong>. A good line will have small residuals.</p>



<p>We fit the regression line to the data points in a scatter plot using the <strong>least squares method</strong>.</p>



<p>The calculations are not difficult (I will not report them here), but the procedure can be tedious. For this reason, all advanced scientific calculators and many spreadsheets and programs provide procedures that simplify our lives.<br>Using R, the process is even easier.</p>



<p>I proceed by calculating the parameters of the regression line:</p>



<pre class="wp-block-preformatted"># calculate the parameters of the regression line
lm(gelati$gelati ~ gelati$temperatura)

Call: lm(formula = gelati$gelati ~ gelati$temperatura)
Coefficients:
(Intercept) gelati$temperatura
-29.074 3.241</pre>



<p>So my line will have the equation:</p>



<pre class="wp-block-preformatted"><strong>y = 3.241x - 29.074</strong></pre>



<p>It&#8217;s time to plot the scatter diagram again, overlapping the regression line I just found:</p>



<pre class="wp-block-preformatted"># draw the scatterplot
plot(gelati$temperatura,gelati$gelati, main="Scatterplot and Regression Line",xlab="temperature", ylab="gelati")

# and draw the regression line in red
abline(lm(gelati$gelati ~ gelati$temperatura),col="red",lwd=2)</pre>



<figure class="wp-block-image size-large is-resized"><img decoding="async" width="855" height="540" src="https://www.gironi.it/blog/wp-content/uploads/2020/08/regressione-lineare.png" alt="Linear Regression: simple regression line" class="wp-image-1816" srcset="https://www.gironi.it/blog/wp-content/uploads/2020/08/regressione-lineare.png 855w, https://www.gironi.it/blog/wp-content/uploads/2020/08/regressione-lineare-300x189.png 300w" sizes="(max-width: 709px) 85vw, (max-width: 909px) 67vw, (max-width: 1362px) 62vw, 840px" /><figcaption class="wp-element-caption">The regression line superimposed on the scatter diagram.</figcaption></figure>



<h2 class="wp-block-heading">Outliers and Influential Points</h2>



<p>An outlier is an extreme observation that does not fit the general correlation or regression pattern. In practice, in our graph we will see that such outliers, if any, will be very far from the regression line in the y direction.<br>The inclusion of an outlier can affect the slope and y-intercept of the regression line.</p>



<p>When examining a scatter plot and calculating the regression equation, it is worth considering whether or not anomalous observations should be included. They could in fact be errors in the data – and then they should be excluded – but also “real” values, and in this case they are information of the utmost importance for the analyst.</p>



<p><strong>But when can we speak of outliers?</strong> There is no fixed rule when trying to decide whether or not to include an outlier in the regression analysis. This decision depends on the sample size, how extreme the outlier is, and the normality of the distribution.<br><br>For univariate data, an <strong>empirical rule based on the <a href="https://www.gironi.it/blog/statistica-descrittiva-misure-di-posizione/" class="rank-math-link">interquartile range IQR</a></strong> can be used to determine whether or not a point is an outlier.<br><br>We proceed in this way:</p>



<ul class="wp-block-list">
<li>Calculate the interquartile range for our data.</li>



<li>Multiply the interquartile range (IQR) by 1.5.</li>



<li>Add 1.5 x (IQR) to the third quartile.<br>Any number greater than the value found is a suspected extreme datum.</li>



<li>Subtract 1.5 x (IQR) from the first quartile. Any number lower is a suspected extreme datum.</li>
</ul>
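<p>As a sketch, the same empirical rule can be applied in R to the sales column (the variable names are those of our example):</p>

<pre class="wp-block-preformatted"># 1.5 * IQR rule for suspected outliers
q &lt;- quantile(gelati$gelati, c(0.25, 0.75))
soglia &lt;- 1.5 * IQR(gelati$gelati)

# observations outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
gelati$gelati[gelati$gelati &lt; q[1] - soglia | gelati$gelati &gt; q[2] + soglia]</pre>

<p>For this small dataset the rule flags no suspected outliers.</p>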



<hr class="wp-block-separator has-css-opacity"/>



<p>An <strong>influential point</strong> is a point that, if removed, produces a significant change in the model estimate. <strong>An influential point may or may not be an outlier</strong>.</p>



<p>The influence.measures() command provides a whole series of useful influence measures: dfbeta, dffit, covratio, Cook&#8217;s distance and leverage points of all observations:</p>



<pre class="wp-block-preformatted"># measure influential points
influence.measures(lm(gelati$gelati ~ gelati$temperatura))</pre>



<h2 class="wp-block-heading">Model Assumptions</h2>



<p>For the linear regression model to be effectively usable, some assumptions must be met:</p>



<ul class="wp-block-list">
<li><strong>Normal distribution of errors</strong>: the errors must have, for each value of X, a normal distribution.</li>



<li><strong>Homoscedasticity</strong>: the variability of the errors is constant for each value of X.</li>



<li><strong>Independence of errors</strong>: the errors must be independent for each value of X (it is especially important for observations over time, in which it must be verified that there is no autocorrelation).</li>
</ul>



<p class="has-dark-gray-color has-light-gray-background-color has-text-color has-background">Therefore, specific tests of the model must be carried out, and <strong>all must give a positive result for the estimation model to be considered correct</strong>. <br>If even one of the tests gives a negative result (<em>non-normality of the residuals</em>, <em>heteroscedasticity</em>, <em>serial correlation</em>) the estimation method through least squares is not good.</p>



<h2 class="wp-block-heading">Residual Analysis</h2>



<p>The <strong>residual</strong> is equal to the difference between the observed value and the predicted value of Y.</p>



<ul class="wp-block-list">
<li>To estimate the goodness of fit of the regression line to the data, a graphical analysis is appropriate using a scatter plot of the residuals (on the ordinate) and the values of X (on the abscissa).<br> </li>



<li>To verify the assumptions, I can evaluate the <strong>graph of the residuals with respect to X</strong>: this allows us to establish whether the variability of the errors varies according to the values of X, confirming or not the <strong>assumption of homoscedasticity</strong>.<br> </li>



<li>To verify linearity, plot the residuals, on the ordinate, with respect to the predicted values, on the abscissa. The points should be distributed symmetrically around a horizontal line with an intercept equal to zero. Different trends indicate the presence of non-linearity.</li>
</ul>



<p></p>



<pre class="wp-block-preformatted"># look at the distribution of residuals.
# must be balanced above and below the zero line.
lmgelati &lt;- lm(gelati$gelati ~ gelati$temperatura)
plot (lmgelati$residual ~ lmgelati$fitted, ylab="Residuals",
xlab="Fitted")
abline(h=0)</pre>



<figure class="wp-block-image size-large is-resized"><img decoding="async" width="855" height="540" src="https://www.gironi.it/blog/wp-content/uploads/2020/08/linearita-residui.png" alt="graph of residuals linearity" class="wp-image-1830" srcset="https://www.gironi.it/blog/wp-content/uploads/2020/08/linearita-residui.png 855w, https://www.gironi.it/blog/wp-content/uploads/2020/08/linearita-residui-300x189.png 300w" sizes="(max-width: 709px) 85vw, (max-width: 909px) 67vw, (max-width: 1362px) 62vw, 840px" /><figcaption class="wp-element-caption">The graph of residuals with respect to predicted values.</figcaption></figure>



<p>R&#8217;s <em>lmtest</em> package provides us with the Breusch-Pagan test to verify the homoscedasticity of the residuals:</p>



<pre class="wp-block-preformatted"># check homoscedasticity of residuals
# using the Breusch-Pagan test
library(lmtest)
testbp &lt;- bptest(gelati ~ temperatura, data=gelati)
testbp</pre>



<p>As for the <strong>normality of the residuals</strong>, the <strong>frequency histogram</strong> allows us to verify or not the condition.</p>
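<p>A histogram like the one in the figure can be drawn with a single instruction (the titles here are my own choice):</p>

<pre class="wp-block-preformatted"># histogram of the residuals to eyeball normality
hist(residuals(lm(gelati$gelati ~ gelati$temperatura)),
     main="Histogram of residuals", xlab="Residuals")</pre>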



<figure class="wp-block-image size-large is-resized"><img loading="lazy" decoding="async" width="855" height="540" src="https://www.gironi.it/blog/wp-content/uploads/2020/08/residui-normalita.png" alt="Histogram of residuals" class="wp-image-1820" srcset="https://www.gironi.it/blog/wp-content/uploads/2020/08/residui-normalita.png 855w, https://www.gironi.it/blog/wp-content/uploads/2020/08/residui-normalita-300x189.png 300w" sizes="auto, (max-width: 709px) 85vw, (max-width: 909px) 67vw, (max-width: 1362px) 62vw, 840px" /><figcaption class="wp-element-caption">The verification of normality of residuals in our example, using a histogram.</figcaption></figure>



<p>We can also verify the normality of the residuals numerically, using a Shapiro-Wilk test:</p>



<pre class="wp-block-preformatted"># check the normality of the distribution of errors
# with a Shapiro-Wilcox test
residui &lt;- residuals(lm(gelati$gelati ~ gelati$temperatura))
shapiro &lt;- shapiro.test(residui)
shapiro</pre>



<p>Let&#8217;s verify that the mean of the errors is not significantly different from zero. To do this we can use a Student&#8217;s t-test:</p>



<pre class="wp-block-preformatted">residui &lt;- residuals(lm(gelati$gelati ~ gelati$temperatura))
t.test(residui)</pre>



<p>The <strong>graph of the residuals with respect to time</strong> (and the use of the <em>Durbin-Watson</em> statistic) allows us to highlight the existence or not of <strong>autocorrelation</strong>.</p>



<pre class="wp-block-preformatted"># Durbin Watson test for autocorrelation
dwtest(gelati$temperatura~gelati$gelati)</pre>



<h2 class="wp-block-heading">Regression Analysis: Practical Difficulties</h2>



<p>Simple regression analysis is a widely used model, but a very, very insidious one. The general tendency, in fact, is to apply this type of analysis without much awareness or rigour &#8211; as in the simplified example I proposed 🙂<br>The assumptions underlying the model are rather stringent, and very often ignored&#8230;</p>



<p>Frequently, the analysis is carried out without properly checking these assumptions, or the simple linear regression model is chosen where more appropriate alternative models exist.</p>



<p>Another very common error is <strong>extrapolation</strong>: that is, <strong>using the regression equation to estimate values outside the range of the observed data</strong>. This is a big no-no.</p>



<p>The advice is always to start the analysis by looking very carefully at the scatter diagram and to carefully verify the hypotheses underlying the regression model before using the results.</p>



<h2 class="wp-block-heading">Other Types of Correlation Coefficients</h2>



<p>Pearson&#8217;s correlation coefficient is certainly the best known, studied and used, but as we have seen it applies in cases where both variables are of quantitative type, measured through an interval or ratio scale. There are other methods that allow us to obtain the measure of correlation between variables of different types. All share the characteristic of being conceptually very similar to Pearson&#8217;s r coefficient.</p>



<h3 class="wp-block-heading">The Point-Biserial Correlation Coefficient</h3>



<p>Let&#8217;s take the case of an analysis in which one of the variables is of quantitative type (measured on an interval or ratio scale) and the second is a <strong>categorical variable with two levels</strong> (or dichotomous variable). In this case, the <strong>point-biserial correlation coefficient</strong> comes to our aid. I will not delve into the concept here, as it is in fact a &#8220;special&#8221; version of Pearson&#8217;s coefficient, leaving the reader to deepen when required by the analysis.</p>
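<p>In practice, the point-biserial coefficient is obtained simply by coding the dichotomous variable as 0/1 and computing Pearson&#8217;s r. A minimal sketch, with an invented dichotomous variable added to our example:</p>

<pre class="wp-block-preformatted"># hypothetical dichotomous variable: 1 = weekend, 0 = weekday
weekend &lt;- c(0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0)

# point-biserial correlation = Pearson's r on the 0/1 coding
cor(weekend, gelati$gelati)</pre>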



<h3 class="wp-block-heading">The Phi Coefficient</h3>



<p>If we need to know if <strong>two dichotomous variables are correlated</strong>, we could then resort to the <strong>phi coefficient</strong>, another &#8220;special&#8221; case of Pearson&#8217;s r coefficient. Many readers will certainly know that two dichotomous variables can also be analyzed using a <a href="https://www.gironi.it/blog/il-test-del-chi-quadrato-bonta-di-adattamento-e-test-di-indipendenza/" class="rank-math-link">chi-squared test</a>.</p>



<h3 class="wp-block-heading">Spearman&#8217;s Rank Correlation Coefficient Rho (and a Note on Kendall&#8217;s Tau)</h3>



<p>
Sometimes the data are reported in terms of ranks. Ranks are a form of ordinal data, and since the coefficients seen so far do not handle this type of data, we need to introduce Spearman&#8217;s rho coefficient.<br>
Spearman&#8217;s correlation follows a simple and ingenious approach: it converts each data set into ranks and then calculates the correlation between the ranks. It is a non-parametric measure of correlation: the only assumption required is that the variables are sortable, and possibly continuous.
</p>
<p>
Here is the formula for Spearman&#8217;s coefficient, where d<sub>i</sub> is the difference between the two ranks assigned to the i-th observation:
</p>
\(
\\
r_s=1-\frac{6\sum{d}_i^2}{N(N^2-1)}
\\ \\
\)
<p>
r<sub>s</sub> can also assume values between –1.00 and +1.00, with the same meanings seen for r.
</p>
<p>
The coefficient r<sub>s</sub> has a notable drawback: it can overestimate the correlation between X and Y if, for at least one of the two variables, there are many tied ranks.
For this reason, to measure the correlation between two ordinal variables, another statistic is often used: <b>Kendall&#8217;s <i>tau</i> coefficient</b>.
</p>
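<p>Both rank-based coefficients are available directly from R&#8217;s <em>cor()</em> function via the <em>method</em> argument:</p>

<pre class="wp-block-preformatted"># Spearman's rho
cor(gelati$temperatura, gelati$gelati, method = "spearman")

# Kendall's tau
cor(gelati$temperatura, gelati$gelati, method = "kendall")</pre>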
]]></content:encoded>
					
					<wfw:commentRss>https://www.gironi.it/blog/en/correlation-and-regression-analysis-linear-regression/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
