One of the most common algorithms in Kaggle competitions, and in machine learning in general, is the random forest. It performs well in a wide range of scenarios and is quite resistant to overfitting, which is probably a big part of why it is so popular.

Here, I will briefly explain decision trees, then give you a working function in Python.

Table of Contents

1. Decision Trees explained
2. Example of Gini Impurity
3. Shortcomings of Decision Trees
4. Random Forest explained
5. Code: GridSearchCV with Random Forest Regression

Decision Trees

Decision trees have three types of nodes: the root node, internal nodes, and leaf nodes.

We might want to do classification, e.g. answer the question 'what kind of animal is this?'. We could then begin to ask relevant questions that make it more likely that we classify the animal correctly.

This is a classification tree: we hand an animal to the decision tree and try to classify it. The question 'cold blooded?' is the root node (the first node), 'lays eggs?' is an internal node, while 'non-mammal' and 'mammal' are called leaf nodes.

At every split, a decision tree considers all features and all candidate values for each feature. It decides which question, i.e. which feature and value, to ask by choosing the one with the highest impurity gain. In other words, it chooses the locally optimal next split, which is why we call it a greedy algorithm. The process can be condensed into a few bullet points:

  • Calculate impurity at root node
  • Calculate impurity gain for each feature and value
  • Choose the feature and value that grant the biggest impurity gain

This is repeated for each internal node and its children, until there is no further impurity gain. The impurity measure usually used is the Gini impurity.

$$ 1-\sum_{c=1}^{C}p(c|v)^2 $$

where $p(c|v)$ is the proportion of observations in node $v$ that belong to class $c$, and $C$ is the number of classes.
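
To make the formula concrete, here is a minimal sketch in Python; the helper name gini_impurity and the use of collections.Counter are my own choices for illustration. It computes the Gini impurity of a node from the class labels of the observations in it.

from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a node, given the class labels of its observations."""
    n = len(labels)
    counts = Counter(labels)
    # 1 minus the sum over classes of p(c|v)^2
    return 1 - sum((count / n) ** 2 for count in counts.values())

# The class split used in the example below: 5 mammals and 10 non-mammals
print(gini_impurity(['mammal'] * 5 + ['non-mammal'] * 10))  # 0.444... = 4/9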

If you have read my post introducing probability, this will be a piece of cake once we go through an example. I strongly suggest that you read it before reading further in this post.

Example of Gini Impurity

Let us consider a case where we are trying to classify 15 animals as mammal or non-mammal. We have classified them according to the image below.

  • We have 15 animals at the root node, i.e. $r=15$ observations
  • 5 animals at one internal node $v_1$, with 1 mammal and 4 non-mammals, giving impurity $I(v_1)$
  • 10 animals at another internal node $v_2$, with 4 mammals and 6 non-mammals, giving impurity $I(v_2)$

We have two classes, mammal and non-mammal, which we will denote as $c_1$ and $c_2$.

The impurity at the root node will then be

$$ I(r)=1-\left(\frac{c_1}{r}\right)^2-\left(\frac{c_2}{r}\right)^2 =1-\left(\frac{5}{15}\right)^2-\left(\frac{10}{15}\right)^2=\frac{4}{9} $$

because out of the $r=15$ animals at the root, $c_1=5$ are mammals ($1+4$) and $c_2=10$ are non-mammals ($4+6$).

The impurity at $I(v_1)$ will be

$$ I(v_1)=1-\left(\frac{c_1}{v_1}\right)^2-\left(\frac{c_2}{v_1}\right)^2 =1-\left(\frac{1}{5}\right)^2-\left(\frac{4}{5}\right)^2=\frac{8}{25} $$

The impurity at $I(v_2)$ will be

$$ I(v_2)=1-\left(\frac{c_1}{v_2}\right)^2-\left(\frac{c_2}{v_2}\right)^2 =1-\left(\frac{4}{10}\right)^2-\left(\frac{6}{10}\right)^2=\frac{12}{25} $$

Finally, the impurity gain will be

$$ \Delta=I(r)-\frac{5}{15}I(v_1)-\frac{10}{15}I(v_2) = \frac{4}{9}-\frac{5}{15}\times\frac{8}{25}-\frac{10}{15}\times\frac{12}{25} \approx 0.0178 $$
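
As a quick sanity check, the same numbers can be reproduced with a few lines of plain Python arithmetic; nothing is assumed beyond the counts above.

I_r = 1 - (5/15)**2 - (10/15)**2      # 0.444... = 4/9
I_v1 = 1 - (1/5)**2 - (4/5)**2        # 0.32 = 8/25
I_v2 = 1 - (4/10)**2 - (6/10)**2      # 0.48 = 12/25

# Impurity gain: root impurity minus the weighted impurities of the children
gain = I_r - (5/15) * I_v1 - (10/15) * I_v2
print(round(gain, 4))                  # 0.0178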

Shortcomings of Decision Trees

Why don't we just use decision trees everywhere in machine learning? Here are some of their shortcomings. As mentioned, at each step in the decision tree we calculate the impurity gain, but we only pick what gives the biggest impurity gain locally. The global picture, i.e. what leads to the best overall tree, is not taken into account. This means the next question we ask might give a big impurity gain, while in the overall picture asking a different question would have led to a better tree.

Another shortcoming shows up as we keep asking questions. Every time we ask a question, the data is split further, leaving fewer observations to base the remaining questions on. By the 5th or 6th question we may be asking questions supported by very few observations. You could then argue that we shouldn't ask any more questions, because we have little to no evidence for what the next best question might be.

Some of the problems with decision trees are addressed by the following (both are sketched in code below):

  • Pruning, i.e. limiting the number of questions we ask.
  • Stopping early, when only X observations are left at a node.
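
In scikit-learn's decision trees these two ideas roughly correspond to the max_depth and min_samples_split parameters. Here is a minimal sketch; the iris dataset and the specific values are only stand-ins for your own data and tuning.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# max_depth limits how many questions can be asked in a row (pruning),
# min_samples_split stops us from splitting nodes with too few observations left
tree = DecisionTreeClassifier(max_depth=4, min_samples_split=20)
tree.fit(X, y)
print(tree.get_depth())  # never deeper than max_depth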

Random Forest

Random forest is built from decision trees: it grows trees much like a single decision tree does, just based on slightly different rules. Random forest also implements pruning, i.e. setting a limit on how many questions we ask. The algorithm belongs to a family of methods called 'bagging', which refers to splitting your data into random subsamples and training $M_y$ regression models, one per subsample, when facing a regression problem.

The important part to remember when splitting your entire dataset is that we draw the subsamples at random with replacement, which means the same observation from our original dataset can occur multiple times within a subsample and across subsamples. It also means that some observations might be omitted, that is, never drawn into a given subsample.

For a random forest, we train $M_y$ trees, one per subsample, then average the predictions from the trees to get the final prediction.

Bagging: split the original dataset into $D_r$ random subsamples. For each subsample, train a decision tree. Average the results from all the trees.
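
The subsampling itself is just sampling with replacement. Here is a small NumPy sketch of one bootstrap subsample; the toy array of indices is purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)  # pretend these are the indices of 10 observations

# One bootstrap subsample: same size as the data, drawn with replacement
subsample = rng.choice(data, size=len(data), replace=True)
print(subsample)                   # some indices occur more than once ...
print(set(data) - set(subsample))  # ... while others are typically left out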

It's very common to use the RandomForestRegressor from sklearn, so that is where I'll start explaining. We have already improved upon decision trees by using random forests, as explained above, but there is even more upside to random forests.

When building a random forest, we can implement pruning by setting max_depth. A common choice is to set that parameter somewhere between 3 and 7.

Another parameter is n_estimators, which is the number of trees we grow in the random forest. Higher is not necessarily better, since at some point you will hit a plateau as you increase the number of trees. I suggest you limit it to around 1000.

The last parameter I want to introduce is max_features, which is fairly self-explanatory: it sets the number of features to consider when asking the next question, i.e. at the next internal node. Note that some would argue you won't need this parameter, because the bagging strategy illustrated in the picture above already mitigates the problem with the number of features.
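
Putting these three parameters together, a bare-bones random forest regressor could look something like the sketch below; the diabetes dataset and the specific parameter values are just placeholders for your own data and tuning.

from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True)

# max_depth prunes each tree, n_estimators is the number of trees,
# max_features is how many features each split is allowed to consider
rf = RandomForestRegressor(max_depth=5, n_estimators=200, max_features=4)
rf.fit(X, y)
print(rf.predict(X[:3]))  # predictions averaged over all the trees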

GridSearchCV with Random Forest Regression

One way to find the optimal number of estimators is by using GridSearchCV, also from sklearn. You give it an estimator and a param_grid, and define the scoring along with the number of cross-validation folds. Don't worry too much about the last part; cv=10 is a reasonable default.

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

def random_forest_prediction(X_train_data, X_test_data, y_output):
    # Search over max_depth, n_estimators and max_features with 10-fold CV,
    # scoring each combination by (negative) mean squared error
    gs = GridSearchCV(
        estimator=RandomForestRegressor(),
        param_grid={
            'max_depth': [3, None],
            'n_estimators': (10, 30, 50, 100, 200, 400, 600, 800, 1000),
            'max_features': (2, 4, 6)  # assumes the data has at least 6 features
        },
        cv=10, n_jobs=-1, scoring='neg_mean_squared_error'
    )
    model = gs.fit(X_train_data, y_output)

    # Predict on the test data with the best estimator found by the grid search
    pred = model.predict(X_test_data)

    # best_score_ is negative MSE; do sqrt(-model.best_score_) for RMSE
    score = -model.best_score_

    # Return the predictions and the cross-validated MSE of the best model
    return pred, score

pred, score = random_forest_prediction(train_data, test_data, output_variable)

What I have done here is let GridSearchCV find the combination of max_depth, n_estimators and max_features that gives the lowest error, out of the lists I gave for each parameter. GridSearchCV, along with other search methods, is a very common, standard approach that you will see me use a lot in future posts. n_estimators searches for the best number of trees, from 10 to 1000, and max_features for the best number of features to consider at each split, while max_depth limits how deep each tree can grow. Note that you can simply leave out max_features, in which case all features are considered; it is included here mainly for illustration.

This function assumes that you have already split your data into training and test data, as well as an output variable. The output variable must be excluded from the training data, so that you don't train on your output variable.
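
If you have not split your data yet, one common way to do it is with train_test_split from sklearn; in the sketch below, the California housing data is only a stand-in for your own dataset.

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

# California housing is only a stand-in for your own house-price data
# (it is downloaded the first time you call this)
X, y = fetch_california_housing(return_X_y=True)

# Hold out 20% of the observations as the test set
train_data, test_data, output_variable, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# train_data, test_data and output_variable now match the arguments
# expected by random_forest_prediction above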

Here are some of the predictions I got for house prices using this algorithm: [127251.54702308 154956.43728829 182879.14145627 ... 151510.37138696  121843.09356488 215477.78201348].