{"id":3260,"date":"2024-01-11T15:50:29","date_gmt":"2024-01-11T14:50:29","guid":{"rendered":"https:\/\/www.gironi.it\/blog\/?p=3260"},"modified":"2024-09-23T15:53:50","modified_gmt":"2024-09-23T14:53:50","slug":"how-to-use-decision-trees-to-classify-data","status":"publish","type":"post","link":"https:\/\/www.gironi.it\/blog\/en\/how-to-use-decision-trees-to-classify-data\/","title":{"rendered":"How to Use Decision Trees to Classify Data"},"content":{"rendered":"\n<p>Decision Trees are a type of machine learning algorithm that uses a tree structure to divide data based on logical rules and predict the class of new data. They are easy to interpret and adaptable to different types of data, but can also suffer from problems such as <em>overfitting<\/em>, complexity, and imbalance.<br>Let&#8217;s understand a bit more about them and walk through a simple example of their use in R.<\/p>\n\n\n\n<!--more-->\n\n\n\t\t\t\t<div class=\"wp-block-uagb-table-of-contents uagb-toc__align-left uagb-toc__columns-1  uagb-block-5fb2a094      \"\n\t\t\t\t\tdata-scroll= \"1\"\n\t\t\t\t\tdata-offset= \"30\"\n\t\t\t\t\tstyle=\"\"\n\t\t\t\t>\n\t\t\t\t<div class=\"uagb-toc__wrap\">\n\t\t\t\t\t\t<div class=\"uagb-toc__title\">\n\t\t\t\t\t\t\tTable Of Contents\t\t\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<div class=\"uagb-toc__list-wrap \">\n\t\t\t\t\t\t<ol class=\"uagb-toc__list\"><li class=\"uagb-toc__list\"><a href=\"#decision-trees-a-powerful-classification-tool\" class=\"uagb-toc-link__trigger\">Decision Trees: A Powerful Classification Tool<\/a><ul class=\"uagb-toc__list\"><li class=\"uagb-toc__list\"><a href=\"#a-simple-example-of-a-decision-tree-in-r\" class=\"uagb-toc-link__trigger\">A Simple Example of a Decision Tree in R<\/a><\/li><li class=\"uagb-toc__list\"><a href=\"#how-to-evaluate-the-accuracy-of-a-decision-tree\" class=\"uagb-toc-link__trigger\">How to Evaluate the Accuracy of a Decision Tree<\/a><\/li><li 
class=\"uagb-toc__list\"><a href=\"#what-is-overfitting\" class=\"uagb-toc-link__trigger\">What is Overfitting?<\/a><\/li><li class=\"uagb-toc__list\"><a href=\"#brief-overview-of-other-classification-algorithms\" class=\"uagb-toc-link__trigger\">Brief Overview of Other Classification Algorithms<\/a><\/li><\/ul><\/li><\/ol>\t\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\n\n\n<h1 class=\"wp-block-heading\">Decision Trees: A Powerful Classification Tool<\/h1>\n\n\n\n<p>Imagine you&#8217;re a doctor and you need to diagnose a patient&#8217;s illness based on some symptoms. How would you decide which disease the patient has? You could use your experience, your intuition, or consult manuals. Or you could use an algorithm that guides you step by step to choose the most likely diagnosis, based on the data you have available. This algorithm is called a <strong>Decision Tree<\/strong>.<\/p>\n\n\n\n<p class=\"has-light-gray-background-color has-background\">A Decision Tree is a graphical structure that represents a series of logical rules for classifying objects or situations. <br>Each <strong>node<\/strong> of the tree represents a question or condition, which divides the data into two or more homogeneous subgroups. <br>Each <strong>branch<\/strong> represents a possible answer or action, which connects one node to another node or to a leaf. <br>The <strong>initial node is called the root<\/strong>, and it&#8217;s the starting point of the tree. <br>The <strong>final nodes are called leaves<\/strong>, and they are the <strong>end points of the tree<\/strong>. 
<br><strong>Each leaf corresponds to a class<\/strong>, that is, a category to which the object or situation to be classified belongs.<\/p>\n\n\n\n<p>Decision Trees are widely used in scientific, technological, medical, economic, and social fields because they have several advantages:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>They are <strong>easy to interpret<\/strong> and communicate, even to non-experts.<\/li>\n\n\n\n<li>They are <strong>flexible<\/strong> and can adapt to different types of data, both numerical and categorical.<\/li>\n\n\n\n<li>They are <strong>robust<\/strong> and can handle incomplete, noisy, or inconsistent data.<\/li>\n\n\n\n<li>They are <strong>efficient<\/strong> and require little time and memory to be built and applied.<\/li>\n<\/ul>\n\n\n\n<p>However, Decision Trees also have some disadvantages:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>They can be <strong>unstable<\/strong>, that is, sensitive to small variations in the initial data, and can therefore produce very different trees.<\/li>\n\n\n\n<li>They can be <strong>complex<\/strong>, that is, have many nodes and branches, and thus lose clarity and accuracy.<\/li>\n\n\n\n<li>They can be <strong>imbalanced<\/strong>, that is, favor some classes over others, and therefore be unrepresentative of reality.<\/li>\n<\/ul>\n\n\n\n<p>To overcome these problems, there are various techniques for <strong>optimization<\/strong> and <strong>validation<\/strong> of Decision Trees, which make it possible to improve their performance and assess their reliability.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">A Simple Example of a Decision Tree in R<\/h2>\n\n\n\n<p>To better understand how Decision Trees work, let&#8217;s look at a practical example in the R language.<\/p>\n\n\n\n<p>For our example, we&#8217;ll use the <code>iris<\/code> dataset, which contains the sepal lengths and widths and the petal lengths and widths of 150 iris flowers, belonging to three different species: setosa, versicolor, and 
virginica. Our goal is to build a Decision Tree that allows us to predict the species of an iris flower, using its measurements as explanatory variables.<\/p>\n\n\n\n<p>First, let&#8217;s load the <code>iris<\/code> dataset and the <code>rpart<\/code> library, which allows us to create Decision Trees in R.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># Load the iris dataset\ndata(iris)\n# Load the rpart library\nlibrary(rpart)\n# Set the seed for reproducibility\nset.seed(123)\n# Randomly extract 80% of the rows from the dataset\ntrain_index &lt;- sample(1:nrow(iris), 0.8 * nrow(iris))\n# Create the training dataset\ntrain_data &lt;- iris[train_index, ]\n# Create the test dataset\ntest_data &lt;- iris[-train_index, ]\n<\/pre>\n\n\n\n<p>Now we&#8217;re ready to build our Decision Tree, using the <code>rpart<\/code> function. This function requires some parameters:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The <strong>formula<\/strong>, which specifies the variable to be classified (in this case, <code>Species<\/code>) and the explanatory variables (in this case, all the others).<\/li>\n\n\n\n<li>The <strong>dataset<\/strong>, which contains the data to be used to build the Decision Tree (in this case, <code>train_data<\/code>).<\/li>\n\n\n\n<li>The <strong>method<\/strong>, which specifies the type of classification to be used (in this case, <code>class<\/code>, which indicates a categorical classification).<\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-preformatted\"># Build the Decision Tree\ntree &lt;- rpart(formula = Species ~ ., data = train_data, method = \"class\")\n<\/pre>\n\n\n\n<p>To visualize our Decision Tree, we use the <code>plot<\/code> function, which allows us to draw the graphical structure of the tree, and the <code>text<\/code> function, which allows us to add labels to the nodes and branches.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># Visualize the Decision Tree\nplot(tree, uniform = TRUE, branch = 0.8)\ntext(tree, all = TRUE, 
use.n = TRUE)\n<\/pre>\n\n\n\n<p>The result is as follows:<\/p>\n\n\n\n<div class=\"wp-block-uagb-image uagb-block-0ef4be0f wp-block-uagb-image--layout-default wp-block-uagb-image--effect-static wp-block-uagb-image--align-none\"><figure class=\"wp-block-uagb-image__figure\"><img decoding=\"async\" srcset=\"https:\/\/www.gironi.it\/blog\/wp-content\/uploads\/2024\/01\/decision-tree-1-935x1024.png ,https:\/\/www.gironi.it\/blog\/wp-content\/uploads\/2024\/01\/decision-tree-1.png 780w, https:\/\/www.gironi.it\/blog\/wp-content\/uploads\/2024\/01\/decision-tree-1.png 360w\" sizes=\"auto, (max-width: 480px) 150px\" src=\"https:\/\/www.gironi.it\/blog\/wp-content\/uploads\/2024\/01\/decision-tree-1-935x1024.png\" alt=\"Decision Tree built on the iris training data\" class=\"uag-image-3074\" width=\"856\" height=\"574\" title=\"\" loading=\"lazy\" role=\"img\"><\/figure><\/div>\n\n\n\n<p>How can we interpret this simple Decision Tree? Let&#8217;s start from the root, which is the node at the top. This node tells us that the most important variable for classifying an iris flower is the <strong>petal length<\/strong> (<code>Petal.Length<\/code>). If the petal length is less than 2.45 cm, the flower is of the <strong>setosa<\/strong> species. Otherwise, we check whether the petal length is less than or equal to 4.75 cm: if it is, the flower is of the <strong>versicolor<\/strong> species; if it is greater than 4.75 cm, it is of the <strong>virginica<\/strong> species.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">How to Evaluate the Accuracy of a Decision Tree<\/h2>\n\n\n\n<p>To evaluate the accuracy of a Decision Tree, we need to compare the classes predicted by the tree with the actual classes of the test data. 
To do this, we use the <code>predict<\/code> function, which allows us to apply the Decision Tree to the test data and obtain the predicted classes.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># Apply the Decision Tree to the test data\npred_class &lt;- predict(tree, newdata = test_data, type = \"class\")\n<\/pre>\n\n\n\n<p>Then, we use the <code>table<\/code> function, which allows us to create a contingency table between the predicted classes and the actual classes.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># Create the contingency table\ntable(pred_class, test_data$Species)\n<\/pre>\n\n\n\n<p>The result is as follows:<\/p>\n\n\n\n<table>\n  <tr>\n    <td><\/td>\n    <td>setosa<\/td>\n    <td>versicolor<\/td>\n    <td>virginica<\/td>\n  <\/tr>\n  <tr>\n    <td>setosa<\/td>\n    <td>10<\/td>\n    <td>0<\/td>\n    <td>0<\/td>\n  <\/tr>\n  <tr>\n    <td>versicolor<\/td>\n    <td>0<\/td>\n    <td>13<\/td>\n    <td>0<\/td>\n  <\/tr>\n  <tr>\n    <td>virginica<\/td>\n    <td>0<\/td>\n    <td>2<\/td>\n    <td>5<\/td>\n  <\/tr>\n<\/table>\n\n\n\n<p>This table, with the predicted classes in the rows and the actual classes in the columns, shows how many times the Decision Tree classified the species of an iris flower correctly or incorrectly. For example, the cell in the top left tells us that the Decision Tree correctly predicted that 10 flowers were of the setosa species. The cell at the bottom center tells us that the Decision Tree wrongly predicted that 2 flowers were of the virginica species, when in fact they were of the versicolor species.<\/p>\n\n\n\n<p>To calculate the accuracy of a Decision Tree, we divide the number of correct predictions by the total number of predictions. In this case, the accuracy is:<\/p>\n\n\n\n\\(\n\\frac{10 + 13 + 5}{10 + 13 + 5 + 2} = \\frac{28}{30} \\approx 0.93\n\\)\n\n\n\n<p>This means that our Decision Tree correctly predicted the species of an iris flower in 93% of cases. 
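<\/p>\n\n\n\n<p>The same figure can be computed directly in R as the proportion of matching predictions, rather than summing the cells of the table by hand. A minimal check, reusing the <code>pred_class<\/code> and <code>test_data<\/code> objects created above:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># Accuracy: proportion of predictions that match the actual species\naccuracy &lt;- mean(pred_class == test_data$Species)\n# For the split above this equals 28\/30, about 0.93\naccuracy\n<\/pre>\n\n\n\n<p>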
This is a good result, but it could be improved with some optimization techniques, such as <strong>pruning<\/strong> or <strong>variable selection<\/strong>.<\/p>\n\n\n\n<p>Pruning is a technique that reduces the complexity of a Decision Tree by eliminating nodes or branches that do not contribute significantly to accuracy. This can prevent the problem of <strong>overfitting<\/strong>, which occurs when the Decision Tree adapts too closely to the training data and loses the ability to generalize to the test data.<\/p>\n\n\n\n<p>Variable selection is a technique that chooses the most relevant variables for classification, eliminating those that are irrelevant or redundant. This can improve the accuracy and clarity of the Decision Tree by reducing the number of questions or conditions to consider.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What is Overfitting?<\/h2>\n\n\n\n<p><strong><em>Overfitting<\/em><\/strong> is a problem that occurs <strong>when a machine learning model adapts too much to the training data, and fails to generalize well to new data<\/strong>. This means that the model memorizes the specific characteristics and noise of the training data, but fails to capture the general trend of the data. As a result, the model has high accuracy on the training data, but low accuracy on the test or validation data. Overfitting can be caused by excessive model complexity, insufficient training data, or overly long training.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Brief Overview of Other Classification Algorithms<\/h2>\n\n\n\n<p>There are countless other classification algorithms, such as <strong>logistic regression<\/strong>, <strong>k-nearest neighbors<\/strong>, <strong>support vector machines<\/strong>, and <strong>neural networks<\/strong>. These algorithms are based on principles different from Decision Trees, such as probability functions, distances, margins, or non-linear transformations of the data. 
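<\/p>\n\n\n\n<p>To close the loop on the optimization techniques mentioned earlier, pruning and variable selection can be sketched with tools from the <code>rpart<\/code> package itself: <code>printcp<\/code> prints the complexity-parameter table computed during fitting, <code>prune<\/code> cuts the tree back to a chosen complexity, and <code>variable.importance<\/code> ranks the explanatory variables. A minimal sketch, reusing the <code>tree<\/code> object built above:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># Inspect the complexity-parameter table produced during fitting\nprintcp(tree)\n# Pick the cp value with the lowest cross-validated error\nbest_cp &lt;- tree$cptable[which.min(tree$cptable[, \"xerror\"]), \"CP\"]\n# Prune the tree back to that complexity\npruned_tree &lt;- prune(tree, cp = best_cp)\n# A ranking of the variables, useful to guide variable selection\ntree$variable.importance\n<\/pre>\n\n\n\n<p>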
<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Decision Trees are a type of machine learning algorithm that uses a tree structure to divide data based on logical rules and predict the class of new data. They are easy to interpret and adaptable to different types of data, but can also suffer from problems such as overfitting, complexity, and imbalance. Let&#8217;s understand a bit &hellip; <a href=\"https:\/\/www.gironi.it\/blog\/en\/how-to-use-decision-trees-to-classify-data\/\" class=\"more-link\">Read more<span class=\"screen-reader-text\"> &#8220;How to Use Decision Trees to Classify Data&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_uag_custom_page_level_css":"","footnotes":""},"categories":[161],"tags":[1189],"class_list":["post-3260","post","type-post","status-publish","format-standard","hentry","category-statistics","tag-decision-trees"],"lang":"en","translations":{"en":3260,"it":3065},"uagb_featured_image_src":{"full":false,"thumbnail":false,"medium":false,"medium_large":false,"large":false,"1536x1536":false,"2048x2048":false,"post-thumbnail":false},"uagb_author_info":{"display_name":"paolo","author_link":"https:\/\/www.gironi.it\/blog\/author\/paolo\/"},"uagb_comment_info":4,"uagb_excerpt":"Decision Trees are a type of machine learning algorithm that uses a tree structure to divide data based on logical rules and predict the class of new data. 
They are easy to interpret and adaptable to different types of data, but can also suffer from problems such as overfitting, complexity, and imbalance.Let&#8217;s understand a bit&hellip;","_links":{"self":[{"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/posts\/3260","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/comments?post=3260"}],"version-history":[{"count":1,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/posts\/3260\/revisions"}],"predecessor-version":[{"id":3261,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/posts\/3260\/revisions\/3261"}],"wp:attachment":[{"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/media?parent=3260"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/categories?post=3260"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.gironi.it\/blog\/wp-json\/wp\/v2\/tags?post=3260"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}