Classification and Regression Trees (CART) is one of the most widely used algorithms in machine learning, since it is the building block of gradient boosting. This means that the most popular packages, like XGBoost and LightGBM, use CART to build their trees.

Decision tree is a generic term, and decision trees can be implemented in many ways – don't get the terms mixed up: we mean the same thing whether we say classification tree or decision tree. But a decision tree is not necessarily a classification tree; it could also be a regression tree.

We will be exploring Gini impurity, which measures the quality of a split. Gini impurity is a classification metric that tells us how to create internal nodes and leaf nodes. It is related to, but distinct from, the Gini index and information gain (impurity improvement).

Terminology of Trees

Let's get the definitions rolling right away. We have root nodes, internal nodes and leaf nodes – each illustrated in the picture below:

• Root node: The very first node in a tree.
• Internal node: Nodes which are connected to deeper nodes.
• Leaf nodes: Nodes that are not connected to deeper nodes, but have a parent node connected to them.

Throughout this article, we will color code the root and internal nodes as blue, while the leaf nodes are mint green.

Some general terminology for trees is the concept of parents and children. The nodes above a certain node are called parent nodes, while the nodes below are called child nodes.

Gini Impurity For Leaf Nodes

When calculating the gini impurity for a single leaf node, we can have multiple classes $C$ – e.g. the classes "YES" and "NO". The following are the steps:

1. Calculate the probabilities of all classes, given the samples in leaf $k$
2. Square the calculated probabilities, i.e. raise them to the power of 2
3. Sum all the squared probabilities into a single number
4. Subtract that sum from $1$

The probabilities are easy to find. If the class is "YES", then we simply count the number of observations labelled "YES" in $Leaf_k$ and divide it by the total number of observations. We will see exactly how this is done later on.

$$Gini(Leaf_k) = 1 - \sum_{class=1}^{C \, classes} (probability(class|Leaf_k))^2$$
$$\Updownarrow$$
$$Gini(L_k) = 1 - \sum_{c=1}^{C} \left( p(c|L_k) \right )^2$$

We have expanded the terms a bit in the top equation, for better readability, while the bottom equation is the one we will continue to use.

The $p(c|L_k)$ reads:

What is the probability that a class $c$ is true, given the observations in the k'th leaf (called $Leaf_k$)?
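As a sketch in Python, the leaf calculation might look like the following (the class counts are hypothetical – a leaf holding 138 "YES" and 16 "NO" observations):

```python
def gini_leaf(class_counts):
    """Gini impurity of a single leaf: 1 minus the sum of squared class probabilities."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

# A hypothetical leaf with 138 "YES" and 16 "NO" observations:
print(round(gini_leaf([138, 16]), 3))  # 0.186
```

Note that a perfectly pure leaf (all observations in one class) gives an impurity of 0, while an even 50/50 split gives the maximum of 0.5 for two classes.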

If you are not comfortable with the thought of probabilities yet, I have written a free post explaining some of the basic probability theory. Visit the post and come back afterwards, if you feel like you need a refresher on probability theory.

Gini Impurity For Internal Nodes

The next thing we need to explain is how gini impurity is calculated for an internal node (and the root node, more on this later). The following are the steps:

1. Loop over each leaf node, from $k=1$ all the way to $K$ leaf nodes.
2. Find the gini impurity of the $k$'th leaf (with the first equation that was just explained).
3. Count the number of observations in the $k$'th leaf.
4. Divide by the total number of observations in all the leaf nodes.
5. After looping over all leaves, sum the weighted result of each leaf for the final gini impurity.

The equation itself is not hard to handle, but there are several things to keep track of – most importantly, which leaf holds which observations?

$$Gini(Internal_j) = \sum_{k'th \, leaf=1}^{K \, leaf \, nodes} \left(\frac{count(Leaf_k)}{count(Leaf_1,...,Leaf_K)}\right) Gini(Leaf_k)$$
$$\Updownarrow$$
$$Gini(I_j) = \sum_{k=1}^{K} \left(\frac{count(L_k)}{count(L_1,...,L_K)}\right) Gini(L_k)$$

We have again expanded the definitions in the upper equation and simplified the terms in the lower equation. The definitions are mathematically the same, but the sum/count might be confusing. All you have to think about when calculating that is, how many observations can we currently consider in the internal node?
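In code, the weighted sum might look like the following sketch, where each leaf is represented by its list of per-class counts (the counts below are hypothetical):

```python
def gini_leaf(class_counts):
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

def gini_internal(leaves):
    """Weighted average of leaf impurities; each leaf is a list of per-class counts."""
    total = sum(sum(leaf) for leaf in leaves)
    return sum(sum(leaf) / total * gini_leaf(leaf) for leaf in leaves)

# Two hypothetical leaves: one with 3 YES / 1 NO, one that is pure (0 YES / 4 NO)
print(round(gini_internal([[3, 1], [0, 4]]), 4))  # 0.1875
```

The denominator is simply the number of observations currently reaching the internal node, which is why each leaf's impurity is weighted by its share of those observations.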

Example of Building a Classification Tree

First and foremost, we need a dataset to even start thinking about building a classification tree. Let's define this dataset as below with 5 examples.

We want to predict whether a person has the flu. We are given a dataset with three features – does the person have a fever, a cough and/or a headache – plus the label we want to predict: does the person have the flu? We use $1 = \text{YES}$ and $0 = \text{NO}$, with each observation indexed in the left-most column.

FEVER COUGHING HEADACHE FLU
0 1 0 0 1
1 0 1 1 1
2 0 1 1 0
3 0 1 0 0
4 1 1 0 1
... ... ... ... ...
267 0 1 0 1

As you can see from the table, we have $N=268$ observations and $M=3$ features. Note that $N$ and $M$ are standard notation: $N$ is the number of observations, while $M$ is the number of features. We also assume that this dataset has no missing values.

Finding The Root Node

To build a decision tree, we have to start off by finding the best feature for the root node – the feature which best separates the observations, measured by impurity.

We iteratively go feature by feature, to find the impurity for each feature. We quite literally test which feature separates the data the best.

Fever

We start with fever – and we ask: how well does the fever feature separate our 268 observations into NO flu and YES flu? After having separated by YES/NO fever, we look at how many observations in each leaf have YES/NO for flu.

This split looks good, but let's actually measure just how good it is, by using our equations from earlier. We start off with the first equation.

Step 1: Calculating the gini impurities for each leaf node.

We can treat the root node just like an internal node when calculating the impurity. To expand the equation, we know our two classes YES (1) and NO (0). The big Greek sigma $\Sigma$ works like a foreach loop, where we loop over each class, from $c=1$ until $C$ classes.

$$Gini(L_1) = 1 - \sum_{c=1}^{C} ( p(c|L_1) )^2 = 1-p(1|L_1)^2 - p(0|L_1)^2$$

But wait, how do we measure the probabilities? We could ask: given the observations in Leaf 1, how likely is a person to have the flu? The way we actually measure this is by putting the count of observations with that class in the numerator and the total number of observations in the leaf in the denominator.

$$p(1|L_1) = \frac{138}{154},p(0|L_1) = \frac{16}{154}$$

In these instances, we can think of $p(c|L_k)$ as the number of observations with class $c$ divided by the number of observations in leaf $k$: $\frac{count(c)}{count(L_k)}$.

$$Gini(L_1) = 1 - \left(\frac{138}{154} \right)^2 - \left(\frac{16}{154} \right)^2 = 0.186$$

The same approach is used for Leaf 2:

$$Gini(L_2) = 1 - \left(\frac{18}{114} \right)^2 - \left(\frac{96}{114} \right)^2 = 0.266$$

Now that we have the gini impurity of each leaf, we can calculate the gini impurity for the fever feature by the second equation.

Step 2: Calculating the gini impurities for the fever feature.

We know the gini impurity of $L_1$ and $L_2$; we just calculated them above in step 1. The second step is expanding the summation. We realize that we only have two leaf nodes, so we just have to expand it to those two leaves.

Then we count the number of observations in each leaf node and replace $count(L_k)$ by the actual count in the leaf. And then we replace $Gini(L_k)$ by the gini impurity that we just calculated above.

\begin{align} Gini(I_1) &= \sum_{k=1}^{K} \left(\frac { count(L_k) } { count(L_1,...,L_K) } \right)Gini(L_k) \\ & = \left(\frac { count(L_1) } { count(L_1 + L_2) } \right)Gini(L_1)\,+ \left(\frac { count(L_2) } { count(L_1 + L_2) } \right)Gini(L_2) \\ & = \left(\frac{154}{268} \right)0.186 + \left(\frac{114}{268} \right)0.266 \\ & = 0.22 \end{align}

We end up with a gini impurity of $0.22$. Next, we will calculate the gini impurity for the other features – at the end, we choose the root to be the feature with the lowest gini impurity, since we could consider that feature the most pure.
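We can double-check this arithmetic with a few lines of Python, using the counts from the fever split above:

```python
def gini_leaf(yes, no):
    total = yes + no
    return 1 - (yes / total) ** 2 - (no / total) ** 2

g1 = gini_leaf(138, 16)  # left leaf, fever = 1
g2 = gini_leaf(18, 96)   # right leaf, fever = 0
fever_gini = 154 / 268 * g1 + 114 / 268 * g2
print(round(g1, 3), round(g2, 3), round(fever_gini, 2))  # 0.186 0.266 0.22
```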

Now that I have shown you how to calculate the gini impurity for the first feature, we will quickly go through the gini impurity for the coughing and headache features, since the type of data is the same.

Splitting by the coughing feature, the leaf nodes ended up like in the following image.

Given the information from this decision tree, we can calculate the gini impurity of the two leaf nodes.

$$Gini(L_1) = 1 - \left(\frac{60}{132} \right)^2 - \left(\frac{72}{132} \right)^2 = 0.496$$

$$Gini(L_2) = 1 - \left(\frac{35}{136} \right)^2 - \left(\frac{101}{136} \right)^2 = 0.382$$

It turns out that these leaf gini impurity scores are way higher than for the fever feature, which means the weighted gini impurity for the coughing feature is also going to be higher.

\begin{align} Gini(I_1) & = \sum_{k=1}^{K} \left(\frac { count(L_k) } { count(L_1,...,L_K) } \right)Gini(L_k) \\ & = \left(\frac { count(L_1) } { count(L_1 + L_2) } \right)Gini(L_1)\,+ \left(\frac { count(L_2) } { count(L_1 + L_2) } \right)Gini(L_2) \\ & = \left(\frac{132}{268} \right)0.496 + \left(\frac{136}{268} \right)0.382 \\ & = 0.438 \end{align}

Using these gini impurities from the leaf nodes to calculate the gini impurity for the coughing feature, we get a gini impurity feature of $0.438$, which is comparatively much worse than the one for the fever feature.

Next up is the headache feature. Splitting by headache, the leaf nodes end up as in the following image, and we can calculate the gini impurity of the two leaf nodes.

$$Gini(L_1) = 1 - \left(\frac{84}{148} \right)^2 - \left(\frac{64}{148} \right)^2 = 0.491$$

$$Gini(L_2) = 1 - \left(\frac{26}{120} \right)^2 - \left(\frac{94}{120} \right)^2 = 0.339$$

And yet again, it turns out that these leaf gini impurity scores are way higher than for the fever feature, so the weighted gini impurity for the headache feature will be higher too.

\begin{align} Gini(I_1) & = \sum_{k=1}^{K} \left(\frac { count(L_k) } { count(L_1,...,L_K) } \right)Gini(L_k) \\ & = \left(\frac { count(L_1) } { count(L_1 + L_2) } \right)Gini(L_1)\,+ \left(\frac { count(L_2) } { count(L_1 + L_2) } \right)Gini(L_2) \\ & = \left(\frac{148}{268} \right)0.491 + \left(\frac{120}{268} \right)0.339 \\ & = 0.423 \end{align}

The final gini impurity is $0.423$ for the headache feature. Since the gini impurity for the fever feature is the lowest, the fever feature now becomes the root.
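Picking the root is then just a matter of taking the feature with the smallest of the three impurities we computed:

```python
# Weighted gini impurities computed above for each candidate root feature
impurities = {"fever": 0.22, "coughing": 0.438, "headache": 0.423}
root_feature = min(impurities, key=impurities.get)
print(root_feature)  # fever
```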

Dealing With Discrete/Continuous Values

What if our data instead had numeric values, like the age of a person – 10, 15, etc.? How would we calculate the gini impurity? We can't directly compute class probabilities, because a numeric column is not divided into classes.

FEVER COUGHING HEADACHE AGE FLU
0 1 0 0 10 1
1 0 1 1 15 1
2 0 1 1 55 0
3 0 1 0 31 0
4 1 1 0 22 1
... ... ... ... ... ...
267 0 1 0 21 1

Well actually, we can find the probability and calculate the gini impurity – we just have to make some thresholds for a split. Let me show you how.

The following is the way to calculate the gini impurity for a feature:

1. Sort the feature in ascending (or descending) order.
2. For each pair of adjacent rows, find the average of the two values. We call each of these averages a threshold.
3. Use each threshold to split the observations and calculate the gini impurity of the split.
4. Keep the threshold with the lowest gini impurity, and use that impurity when comparing against the other features.

If we have the first 5 rows of the age feature as an array `[10, 15, 55, 31, 22]`, we can sort it such that the array becomes `[10, 15, 22, 31, 55]`. Next, we can find the averages between the adjacent data points.

So the technique is to iterate over each row, then find the average of the value of the current row plus the value of the next row. This is easily achieved by storing an index of the current row and looking for the next value in the array by looking up index+1.

• First threshold: $(10+15)/2 = 12.5$
• Second threshold: $(15+22)/2 = 18.5$
• Third threshold: $(22+31)/2 = 26.5$
• Fourth threshold: $(31+55)/2 = 43$
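A short sketch of the threshold computation in Python:

```python
def candidate_thresholds(values):
    """Midpoints between adjacent values of the sorted feature."""
    ordered = sorted(values)
    return [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]

print(candidate_thresholds([10, 15, 55, 31, 22]))  # [12.5, 18.5, 26.5, 43.0]
```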

Now we have four thresholds `[12.5, 18.5, 26.5, 43]`. We find a gini impurity by asking: what is the probability that a given age is LESS than a given threshold?

$$p(\text{age} < \text{threshold} \mid \text{threshold})$$

And the reverse: what is the probability that a given age is GREATER than a given threshold?

$$p(\text{age} > \text{threshold} \mid \text{threshold})$$

Great, this fits perfectly into our gini impurity calculations, since we just split the observations into two leaf nodes for each threshold. We construct 4 potential splits in a decision tree, and then we evaluate how well the split went. Note that we went from using 5 rows of data to make thresholds, to using the whole dataset to calculate the gini impurity. In practice, you would like to make thresholds for all the adjacent rows possible, but for this example, we will stick with a low number of thresholds for clarity.

Given these four different splits, all we have to do is calculate the gini impurity, just like we did with the other features. For the sake of clarity, I'm leaving the final gini impurities for each of these thresholds below, since we have already done this exercise for the three other features.

Gini Impurity of Threshold at 12.5:

$$Gini(L_1)=1-\left(\frac{118}{223}\right)^2 -\left(\frac{105}{223}\right)^2=0.4983$$

$$Gini(L_2)=1-\left(\frac{21}{45}\right)^2 -\left(\frac{24}{45}\right)^2=0.49777$$

$$Gini(I_1) = \left(\frac{223}{268}\right)0.4983 + \left(\frac{45}{268}\right)0.49777 = 0.4982$$

Gini Impurity of Threshold at 18.5:

$$Gini(L_1)=1-\left(\frac{91}{172}\right)^2 -\left(\frac{81}{172}\right)^2=0.4983$$

$$Gini(L_2)=1-\left(\frac{49}{96}\right)^2 -\left(\frac{47}{96}\right)^2=0.4997$$

$$Gini(I_1) = \left(\frac{172}{268}\right)0.4983 + \left(\frac{96}{268}\right)0.4997 = 0.4988$$

Gini Impurity of Threshold at 26.5:

$$Gini(L_1)=1-\left(\frac{51}{145}\right)^2 -\left(\frac{94}{145}\right)^2=0.456$$

$$Gini(L_2)=1-\left(\frac{59}{123}\right)^2 -\left(\frac{64}{123}\right)^2=0.4992$$

$$Gini(I_1) = \left(\frac{145}{268}\right)0.456 + \left(\frac{123}{268}\right)0.4992 = 0.4758$$

Gini Impurity of Threshold at 43:

$$Gini(L_1)=1-\left(\frac{32}{83}\right)^2 -\left(\frac{51}{83}\right)^2=0.4738$$

$$Gini(L_2)=1-\left(\frac{105}{185}\right)^2 -\left(\frac{80}{185}\right)^2=0.4909$$

$$Gini(I_1) = \left(\frac{83}{268}\right)0.4738 + \left(\frac{185}{268}\right)0.4909 = 0.4856$$

Since the gini impurity for the threshold at 26.5 is the lowest, we split at this threshold – it is the optimal split compared to the rest.
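As a sanity check, we can evaluate all four thresholds in Python, using the per-leaf counts listed above, and confirm that 26.5 wins:

```python
def gini_leaf(yes, no):
    total = yes + no
    return 1 - (yes / total) ** 2 - (no / total) ** 2

def weighted_gini(left, right):
    """left/right are the (yes, no) counts below/above the threshold."""
    n = sum(left) + sum(right)
    return sum(left) / n * gini_leaf(*left) + sum(right) / n * gini_leaf(*right)

# (yes, no) counts on each side of every threshold, taken from the text above
splits = {
    12.5: ((118, 105), (21, 24)),
    18.5: ((91, 81), (49, 47)),
    26.5: ((51, 94), (59, 64)),
    43.0: ((32, 51), (105, 80)),
}

best = min(splits, key=lambda t: weighted_gini(*splits[t]))
print(best)  # 26.5
```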

Dealing with Categorical Data

Suppose we add another feature to the table called eye color. This feature contains a string describing the color of a person's eyes (obviously). The important part is that it is a string, since we cannot calculate the gini impurity directly from a string value.

FEVER COUGHING HEADACHE EYECOLOR FLU
0 1 0 0 Blue 1
1 0 1 1 Blue 1
2 0 1 1 Brown 0
3 0 1 0 Brown 0
4 1 1 0 Green 1
... ... ... ... ... ...
267 0 1 0 Brown 1

What is the solution? The way modern packages like scikit-learn or XGBoost build their decision trees is by only allowing three data types: booleans, floats and integers. This means they reject datetimes and anything else that is not one of those three data types. You can also encode datetime observations, but that is a topic for another day.

This is exactly the approach we should follow, and we should reject any data that has another data type than booleans, floats and integers. This means we need to transform our string into one of the three data types, such that the algorithm will accept it – a popular choice is one-hot encoding.

Let's say we find that there exist three categorical values for the eye color feature: blue, green and brown. The main idea is that we turn each categorical value into its own feature with a boolean value.

FEVER COUGHING HEADACHE EYECOLOR_Blue EYECOLOR_Green EYECOLOR_Brown FLU
0 1 0 0 1 0 0 1
1 0 1 1 1 0 0 1
2 0 1 1 0 0 1 0
3 0 1 0 0 0 1 0
4 1 1 0 0 1 0 1
... ... ... ... ... ... ... ...
267 0 1 0 0 0 1 1

You see how we get three new features: EYECOLOR_Blue, EYECOLOR_Green and EYECOLOR_Brown. Each gets a boolean value (1 or 0), and at this point we can treat them like all the other features and calculate the gini impurity, just as I have shown you earlier.
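This transformation is easy to sketch in plain Python (in practice you might reach for a library helper such as pandas' `get_dummies` or scikit-learn's `OneHotEncoder` instead):

```python
def one_hot(values, categories):
    """Expand a categorical column into one 0/1 column per category."""
    return {
        f"EYECOLOR_{cat}": [1 if v == cat else 0 for v in values]
        for cat in categories
    }

# The first five eye color observations from the table
encoded = one_hot(["Blue", "Blue", "Brown", "Brown", "Green"],
                  ["Blue", "Green", "Brown"])
print(encoded["EYECOLOR_Blue"])  # [1, 1, 0, 0, 0]
```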

Building The Full Decision Tree

Now that we have inspected how we calculate the root node, we will go through calculating the child nodes in the tree until we are done. To recap, we found that fever was the best feature at the root node, and it splits the observations such that 154 observations go to the left, and 114 observations go to the right in the decision tree.

We choose to expand the left side first, then the right side second – the next step is calculating the gini impurity for the remaining features: coughing and headache. We always explore the left side of things first, then go back and explore the right side. It's really the same process as before, as you will see in a moment.

Building The Left Side

We try to split the 154 observations by the coughing feature first. The procedure looks exactly the same as before – the main point is that we keep trying to split by a new feature the deeper we get into the tree. This is what it looks like when we explore the best features to split by on the left side of the tree. Notice how the left green leaf, in the image above, becomes a new internal node in blue.

Note that the fever feature cannot be used to split the observations anymore, as we consider that feature used. A feature can only be reused if none of a node's parents have used it yet – and since fever sits at the root, it is a parent of every node below it.

It looks like the coughing feature ends up splitting the observations quite nicely, when looking at the ratio between the persons who had the flu and persons who did not have the flu. This looks to be a promising feature.

As you have seen a couple times now, we calculate the gini impurity for each leaf, and then the gini impurity for the internal node coughing.

$$Gini(L_1)=1-\left(\frac{102}{124}\right)^2 -\left(\frac{22}{124}\right)^2=0.2919$$

$$Gini(L_2)=1-\left(\frac{8}{30}\right)^2 -\left(\frac{22}{30}\right)^2=0.3911$$

$$Gini(I_1) = \left(\frac{124}{154}\right)0.2919 + \left(\frac{30}{154}\right)0.3911 = 0.3112$$

There we have it, the gini impurity for the coughing feature is $0.3112$.

The next feature is headache. The split looks like the following, when splitting the observations based on whether a person had a headache.

At first glance, the headache feature does not split the observations as well. If you look at the ratio between $FLU=1$ and $FLU=0$, we see that the ratio from coughing is better.

It looks like headache won't make it, but we won't know until we have calculated the gini impurity. Same method as before:

$$Gini(L_1)=1-\left(\frac{55}{80}\right)^2 -\left(\frac{25}{80}\right)^2=0.4297$$

$$Gini(L_2)=1-\left(\frac{32}{74}\right)^2 -\left(\frac{42}{74}\right)^2=0.4909$$

$$Gini(I_1) = \left(\frac{80}{154}\right)0.4297 + \left(\frac{74}{154}\right)0.4909 = 0.4591$$

And it turns out the gini impurity for the headache feature is $0.4591$, which is much higher than the gini impurity for the coughing feature, which was $0.3112$. So we split by the coughing feature.

Last Two Leaf Nodes

Now we just need to see if headache splits the two leaf nodes better than what's already there. We start with the left side again.

Now we are matching a gini impurity of a leaf against a potential new internal node. Recall the gini impurity for the left leaf, where $coughing = 1$, was $0.2919$, and the gini impurity for the right leaf, where $coughing = 0$, was $0.3911$. These are the scores we need to beat, if we want to split by our last feature headache.

Let's calculate the gini impurity for headache on the left side, given that the above image is how headache turns out to split the observations.

$$Gini(L_1)=1-\left(\frac{97}{101}\right)^2 -\left(\frac{4}{101}\right)^2=0.076$$

$$Gini(L_2)=1-\left(\frac{5}{23}\right)^2 -\left(\frac{18}{23}\right)^2=0.3402$$

$$Gini(I_1) = \left(\frac{101}{124}\right)0.076 + \left(\frac{23}{124}\right)0.3402 = 0.125$$

We split by the headache feature, since the gini impurity $0.125$ for the headache feature is lower than the previous gini impurity $0.2919$ for the leaf where $coughing = 1$.

The last thing we need to do is go back and check the second leaf node. We need to know if the gini impurity is lower than $0.3911$.

Now that we have seen the split, it still looks promising – but let's see what the gini impurity turns out to be.

$$Gini(L_1)=1-\left(\frac{4}{16}\right)^2 -\left(\frac{12}{16}\right)^2=0.375$$

$$Gini(L_2)=1-\left(\frac{4}{14}\right)^2 -\left(\frac{10}{14}\right)^2=0.4082$$

$$Gini(I_1) = \left(\frac{16}{30}\right)0.375 + \left(\frac{14}{30}\right)0.4082 = 0.3905$$

Since $0.3905$ is less than $0.3911$, we split here as well. This concludes building the left side of the tree.

Summarization Of Tree Construction

Let's put tree construction into steps, for the sake of reference and making it as easy as possible to understand.

1. Calculate the gini impurity for each feature, as a candidate to become the next node in the classification tree.
• a. For each feature in unused_features
• b. Split the observations by the feature, and for each resulting leaf
• c. Calculate the gini impurity for that leaf
• d. Combine the leaf impurities into the weighted gini impurity for the feature
2. Compare the gini impurities of all unused features, and split the observations by the feature with the lowest gini impurity.
• a. Compare gini impurities for all unused features
• b. If one of the gini impurities is less than the gini impurity of the current leaf node, split the observations with the feature that has the lowest gini impurity.

Notice how the strategy for the root node is the same as the internal nodes when finding the best split, except for the fact that there are no previous leaf nodes, which means we skip part of step 2b.
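The steps above can be condensed into a short recursive sketch for binary 0/1 features, with a majority vote at the leaves. This is a simplification for illustration (the tiny dataset below reuses the first five rows of our flu example), not scikit-learn's implementation:

```python
def gini_leaf(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))

def weighted_gini(rows, labels, feature):
    """Weighted gini impurity of splitting on a binary feature."""
    left = [y for row, y in zip(rows, labels) if row[feature] == 1]
    right = [y for row, y in zip(rows, labels) if row[feature] == 0]
    n = len(labels)
    return len(left) / n * gini_leaf(left) + len(right) / n * gini_leaf(right)

def build_tree(rows, labels, features):
    # A pure node, or one with no unused features left, becomes a leaf (majority vote)
    if len(set(labels)) == 1 or not features:
        return max(set(labels), key=labels.count)
    best = min(features, key=lambda f: weighted_gini(rows, labels, f))
    # Only split if it actually improves on the node's own impurity (step 2b)
    if weighted_gini(rows, labels, best) >= gini_leaf(labels):
        return max(set(labels), key=labels.count)
    remaining = [f for f in features if f != best]
    left = [i for i, row in enumerate(rows) if row[best] == 1]
    right = [i for i, row in enumerate(rows) if row[best] == 0]
    return {
        "feature": best,
        "yes": build_tree([rows[i] for i in left], [labels[i] for i in left], remaining),
        "no": build_tree([rows[i] for i in right], [labels[i] for i in right], remaining),
    }

# The first five rows of the flu dataset
rows = [
    {"fever": 1, "coughing": 0, "headache": 0},
    {"fever": 0, "coughing": 1, "headache": 1},
    {"fever": 0, "coughing": 1, "headache": 1},
    {"fever": 0, "coughing": 1, "headache": 0},
    {"fever": 1, "coughing": 1, "headache": 0},
]
labels = [1, 1, 0, 0, 1]
tree = build_tree(rows, labels, ["fever", "coughing", "headache"])
print(tree["feature"])  # fever
```

Even on just these five rows, fever comes out as the best root split, matching what we found on the full dataset.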

Regularization for CARTs

Regularization is when we try to control the complexity of a machine learning model, such that we are not overfitting/underfitting to our dataset. The way regularization usually works is by adding a penalty to a score based on the parameters of a model – the penalty is large when the parameters have large values, and the penalty is small when the parameters have small values.

What are examples of regularization for classification trees?

1. Post-training pruning. This goes back in the classification tree and removes internal nodes and leaf nodes, based on calculations of a tree score. The algorithm is called minimal cost complexity pruning.
2. Max tree depth. This can limit the number of splits we can use in a tree.
3. Minimum samples required for a split. This requires a leaf node to hold at least n observations before that leaf node may be split by a feature.

There are other parameters that we can adjust, but these are the main ones.

Did you like this article? Then you will enjoy following the implementation of classification trees from scratch in Python, which is coming up next.