  {"id":3280,"date":"2023-01-12T13:07:22","date_gmt":"2023-01-12T12:07:22","guid":{"rendered":"https:\/\/www.gironi.it\/blog\/?p=3280"},"modified":"2024-12-08T12:04:36","modified_gmt":"2024-12-08T11:04:36","slug":"logistic-regression-predicting-the-outcome-of-an-event","status":"publish","type":"post","link":"https:\/\/www.gironi.it\/blog\/en\/logistic-regression-predicting-the-outcome-of-an-event\/","title":{"rendered":"Logistic Regression: Predicting the Outcome of an Event"},"content":{"rendered":"\n<p>Logistic regression is a statistical model used to predict the probability of an event based on a set of independent variables. It&#8217;s particularly useful when you want to classify an event as belonging or not to a specific category (for example, whether a customer will buy a product or not, or whether a patient will develop a disease or not).<\/p>\n\n\n\n<p>It is a <strong><em>Supervised Machine Learning<\/em><\/strong> algorithm that can be used to model the probability of a specific class or event. <strong>It works best when the classes are approximately linearly separable<\/strong> &#8211; that is, when a line or plane in the space of the predictors separates the classes reasonably well &#8211; <strong>and the outcome is binary or dichotomous<\/strong>. 
This means that logistic regression is typically used for <strong><em>binary classification<\/em><\/strong> problems (Yes\/No, Correct\/Incorrect, True\/False, etc.).<\/p>\n\n\n\n<p>In this post, I will demonstrate how to perform binomial logistic regression to create a classification model, in order to predict binary responses from a given set of predictors.<\/p>\n\n\n\n<!--more-->\n\n\n\t\t\t\t<div class=\"wp-block-uagb-table-of-contents uagb-toc__align-left uagb-toc__columns-1  uagb-block-8db1afac      \"\n\t\t\t\t\tdata-scroll= \"1\"\n\t\t\t\t\tdata-offset= \"30\"\n\t\t\t\t\tstyle=\"\"\n\t\t\t\t>\n\t\t\t\t<div class=\"uagb-toc__wrap\">\n\t\t\t\t\t\t<div class=\"uagb-toc__title\">\n\t\t\t\t\t\t\tWhat we&#8217;ll discuss\t\t\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<div class=\"uagb-toc__list-wrap \">\n\t\t\t\t\t\t<ol class=\"uagb-toc__list\"><li class=\"uagb-toc__list\"><a href=\"#how-logistic-regression-works-and-steps-to-build-it\" class=\"uagb-toc-link__trigger\">How logistic regression works and steps to build it<\/a><li class=\"uagb-toc__list\"><a href=\"#an-example-in-r-calculating-the-probability-of-survival-on-the-titanic\" class=\"uagb-toc-link__trigger\">An example in R: calculating the probability of survival on the Titanic<\/a><li class=\"uagb-toc__list\"><a href=\"#a-bit-of-math-the-logit-equation\" class=\"uagb-toc-link__trigger\">A bit of math: the logit equation<\/a><li class=\"uagb-toc__list\"><a href=\"#lets-sum-up\" class=\"uagb-toc-link__trigger\">Let&#039;s sum up<\/a><\/ol>\t\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\n\n\n<h2 class=\"wp-block-heading\">How logistic regression works and steps to build it<\/h2>\n\n\n\n<p>Logistic regression is a <strong>statistical modeling technique<\/strong> used to <strong>predict the probability of a binary event<\/strong> (e.g., yes\/no, true\/false) <strong>based on a set of independent variables<\/strong>.<\/p>\n\n\n\n<p>Unlike linear regression, which is used 
to predict continuous values, logistic regression uses the logistic function to &#8220;model&#8221; the probability of the observed event.<\/p>\n\n\n\n<p>The <strong>logistic function<\/strong>, also known as the <strong><em>sigmoid<\/em><\/strong>, turns a weighted sum of the independent variables into the probability of an event. <br>The logistic function <strong>produces a value between 0 and 1<\/strong>, which can be interpreted as a probability. <br>After the model has been trained, it can be used to make predictions on new data, providing an estimate of the probability of an event.<\/p>\n\n\n\n<div class=\"wp-block-uagb-image aligncenter uagb-block-c2155627 wp-block-uagb-image--layout-default wp-block-uagb-image--effect-static wp-block-uagb-image--align-center\"><figure class=\"wp-block-uagb-image__figure\"><img decoding=\"async\" srcset=\"https:\/\/www.gironi.it\/blog\/wp-content\/uploads\/2023\/01\/sigmoide.png \" sizes=\"auto, (max-width: 480px) 150px\" src=\"https:\/\/www.gironi.it\/blog\/wp-content\/uploads\/2023\/01\/sigmoide.png\" alt=\"sigmoid graph\" class=\"uag-image-2632\" width=\"630\" height=\"401\" title=\"\" loading=\"lazy\" role=\"img\"\/><figcaption class=\"uagb-image-caption\">The sigmoid function maps any real-valued input to a value between 0 and 1, so the predicted value can always be read as a probability<\/figcaption><\/figure><\/div>\n\n\n\n<p>The steps to build a logistic regression are as follows:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Select and collect data<\/strong>: gather the data you want to use to predict the binary event and select the independent variables you believe are relevant to your analysis.<\/li>\n\n\n\n<li><strong>Clean and prepare the data<\/strong>: check the data for any missing or erroneous values and ensure that the data is properly formatted for analysis.<\/li>\n\n\n\n<li><strong>Build the model<\/strong>: split the data into training and test sets, then fit the logistic model on the training data. 
The logistic function is an S-shaped function that returns values between 0 and 1, which can be interpreted as probabilities.<\/li>\n\n\n\n<li><strong>Evaluate the model<\/strong>: Use the test data to evaluate the accuracy of the model. There are various metrics that can be used for evaluation, such as accuracy, precision, and recall.<\/li>\n\n\n\n<li><strong>Interpret the results<\/strong>: analyze the model coefficients to understand the relative importance of the independent variables and to better understand how the values of the variables affect the probability of the event.<\/li>\n\n\n\n<li><strong>Use the model to make predictions<\/strong>: use the model to make predictions on new data based on the provided independent variables.<\/li>\n<\/ol>\n\n\n\n<p>These are obviously the general steps for building a logistic regression. However, in some situations, it may be necessary to perform additional operations or adjustments, such as using regularization methods to avoid overfitting problems, or using cross-validation to have a more reliable estimate of the model&#8217;s accuracy.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">An example in R: calculating the probability of survival on the Titanic<\/h2>\n\n\n\n<p>To make a very simplified practical example, I downloaded one of the most well-known and used datasets, the one related to the Titanic passengers, which contains information about the passengers of the famous Titanic shipwreck, including age, gender, social class, and whether the passengers survived the accident or not.<br><br>I got it from <a href=\"https:\/\/raw.githubusercontent.com\/datasciencedojo\/datasets\/master\/titanic.csv\" target=\"_blank\" rel=\"noopener\">this address<\/a> and saved it locally as <em>titanic.csv<\/em><\/p>\n\n\n\n<p>Note: the Titanic dataset is also available in the Kaggle data library (<a href=\"https:\/\/www.kaggle.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">kaggle.com<\/a>) and in the UCI Machine Learning 
dataset collection (<a href=\"http:\/\/archive.ics.uci.edu\/ml\/datasets.php\" target=\"_blank\" rel=\"noreferrer noopener\">archive.ics.uci.edu\/ml\/datasets.php<\/a>).<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"1024\" src=\"https:\/\/www.gironi.it\/blog\/wp-content\/uploads\/2023\/03\/eb2df3e4-514b-423a-86be-9eb08458d104-1024x1024.jpeg\" alt=\"\" class=\"wp-image-2920\" srcset=\"https:\/\/www.gironi.it\/blog\/wp-content\/uploads\/2023\/03\/eb2df3e4-514b-423a-86be-9eb08458d104.jpeg 1024w, https:\/\/www.gironi.it\/blog\/wp-content\/uploads\/2023\/03\/eb2df3e4-514b-423a-86be-9eb08458d104-300x300.jpeg 300w, https:\/\/www.gironi.it\/blog\/wp-content\/uploads\/2023\/03\/eb2df3e4-514b-423a-86be-9eb08458d104-150x150.jpeg 150w\" sizes=\"auto, (max-width: 709px) 85vw, (max-width: 909px) 67vw, (max-width: 1362px) 62vw, 840px\" \/><\/figure>\n<\/div>\n\n\n<p>I don&#8217;t need to clean the data in this case because I&#8217;m using a &#8220;safe&#8221; and extensively tested dataset. 
<br>Obviously, in a &#8220;real&#8221; use case, the data would need to be carefully examined, studied, and &#8220;processed&#8221; in the preliminary phase&#8230;<\/p>\n\n\n\n<p>Here&#8217;s some example code in R:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># Load libraries\nlibrary(ggplot2)\nlibrary(caret)\n# Load data into the titanic data frame\n# Replace the path with the one on your PC\ntitanic &lt;- read.csv(\"\/mypath\/titanic.csv\")\n# Display the first 10 rows\nhead(titanic, 10)\n# Convert categorical fields to factors (glm builds the dummy variables itself)\ntitanic$Sex &lt;- as.factor(titanic$Sex)\ntitanic$Survived &lt;- as.factor(titanic$Survived)\n# Fit a logistic regression model\n# (rows with missing values, e.g. in Age, are dropped by glm by default)\nmodel &lt;- glm(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked, data = titanic, family = binomial(link = \"logit\"))\n# Show the model summary\nsummary(model)\n# Predict the probability of survival for an example case\nexample &lt;- data.frame(Pclass = 3, Sex = \"male\", Age = 32, SibSp = 0, Parch = 0, Fare = 8.05, Embarked = \"S\")\npredict(model, newdata = example, type = \"response\")\n# Graphically display the proportion of survivors by class\nggplot(titanic, aes(x = Pclass, fill = factor(Survived))) + \n  geom_bar(position = \"fill\") +\n  labs(x = \"Class\", y = \"Probability of survival\") +\n  scale_fill_discrete(name = \"Survived\", labels = c(\"No\", \"Yes\"))<\/pre>\n\n\n\n<p>In this case, we notice that a 32-year-old man in third class would have had about an 8.5% chance of survival.<br>The plot then shows how the proportion of survivors varies with the class of travel.<\/p>\n\n\n\n<div class=\"wp-block-uagb-image uagb-block-6e99849a wp-block-uagb-image--layout-default wp-block-uagb-image--effect-static wp-block-uagb-image--align-none\"><figure class=\"wp-block-uagb-image__figure\"><img decoding=\"async\" srcset=\"https:\/\/www.gironi.it\/blog\/wp-content\/uploads\/2023\/01\/sopravvivenza-classe.png \" sizes=\"auto, (max-width: 480px) 150px\" 
src=\"https:\/\/www.gironi.it\/blog\/wp-content\/uploads\/2023\/01\/sopravvivenza-classe.png\" alt=\"graph representing the probability of survival in the Titanic shipwreck based on the Class of travel\" class=\"uag-image-2642\" width=\"862\" height=\"549\" title=\"\" loading=\"lazy\" role=\"img\"\/><\/figure><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">A bit of math: the logit equation<\/h2>\n\n\n\n<p>The logit equation is the mathematical equation used in logistic regression to describe the relationship between the dependent variable (which can only take binary values) and one or more independent variables (also called predictors or covariates).<br><br>In general, the form of the logit equation is as follows:<\/p>\n\n\n\n\\(\n\\operatorname{logit}(p) = \\ln\\left(\\frac{p}{1-p}\\right) = b_0 + b_1x_1 + b_2x_2 + \\ldots + b_nx_n \\\\ \\\\\n\\)\n\n\n\n<p>where:<br><br><strong>p<\/strong> is the probability that the dependent variable takes the value &#8220;1&#8221;<br><strong>logit(p)<\/strong> is called the log-odds<br><strong>b_0, b_1, b_2, \u2026, b_n<\/strong> are the <strong>model coefficients<\/strong> (also called weights or parameters)<br><strong>x_1, x_2, \u2026, x_n<\/strong> are the <strong>independent variables<\/strong> (predictors or covariates)<br><br>Solving the equation for p gives back the sigmoid: \\(p = \\frac{1}{1 + e^{-(b_0 + b_1x_1 + \\ldots + b_nx_n)}}\\)<\/p>\n\n\n\n<p class=\"has-light-gray-background-color has-background\">In summary, the logit equation describes how the probability of an event (e.g., a binary response) depends on the values of the independent variables, through the model weights.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Let&#8217;s sum up<\/h2>\n\n\n\n<p><strong>Logistic regression is a powerful statistical model that can help predict the outcome of an event based on a set of independent variables.<\/strong> It&#8217;s easy to use and interpret, and can be used in many fields, from medicine to finance.<\/p>\n\n\n\n<p><strong>It represents an effective tool for solving binary classification problems<\/strong> because it allows modeling 
the relationship between the binary dependent variable and one or more independent variables.<\/p>\n\n\n\n<p>It allows you to:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Model the relationship between a binary dependent variable and one or more independent variables.<\/li>\n\n\n\n<li>Predict the probability that the dependent variable will take a specific value (e.g., 1 or 0) based on the values of the independent variables.<\/li>\n\n\n\n<li>Use these probability predictions to classify new cases based on a predefined threshold (for example, if the predicted probability of class 1 is greater than 0.5, the case is classified as 1, otherwise as 0).<\/li>\n\n\n\n<li>Interpret the model weights (coefficients) to understand which independent variables are most important for classification.<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Logistic regression is a statistical model used to predict the probability of an event based on a set of independent variables. It&#8217;s particularly useful when you want to classify an event as belonging or not to a specific category (for example, whether a customer will buy a product or not, or whether a patient will &hellip; <a href=\"https:\/\/www.gironi.it\/blog\/en\/logistic-regression-predicting-the-outcome-of-an-event\/\" class=\"more-link\">Read more<span class=\"screen-reader-text\"> &#8220;Logistic Regression: Predicting the Outcome of an 
Event&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_uag_custom_page_level_css":"","footnotes":""},"categories":[161,299],"tags":[1205],"class_list":["post-3280","post","type-post","status-publish","format-standard","hentry","category-statistics","category-ai","tag-logistic-regression"],"lang":"en","translations":{"en":3280,"it":2631},"uagb_featured_image_src":{"full":false,"thumbnail":false,"medium":false,"medium_large":false,"large":false,"1536x1536":false,"2048x2048":false,"post-thumbnail":false},"uagb_author_info":{"display_name":"paolo","author_link":"https:\/\/www.gironi.it\/blog\/author\/paolo\/"},"uagb_comment_info":20,"uagb_excerpt":"Logistic regression is a statistical model used to predict the probability of an event based on a set of independent variables. It&#8217;s particularly useful when you want to classify an event as belonging or not to a specific category (for example, whether a customer will buy a product or not, or whether a patient 
will&hellip;","_links":{"self":[{"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/posts\/3280","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/comments?post=3280"}],"version-history":[{"count":3,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/posts\/3280\/revisions"}],"predecessor-version":[{"id":3337,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/posts\/3280\/revisions\/3337"}],"wp:attachment":[{"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/media?parent=3280"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/categories?post=3280"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/tags?post=3280"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}