This article was first published by IBM Developer at developer.ibm.com, but authored by Casper Hansen. Here is the Direct link.

Sometimes you discover small tips and tricks to improve your code and make life easier for yourself, e.g. better maintainability, efficiency etc. — well this is one of those improvements to your machine learning, except it’s essential and takes an extra thought to implement.

The goal is to introduce you, the developer, to stacking in machine learning. Using your own models, you will learn how to apply stacking to your own datasets. Follow this article and get better results — it’s that simple.

You should at least be familiar with overfitting, underfitting and cross-validation to read this article.

What is Model Stacking?

Stacking is the process of using different Machine Learning models one after another, where you add the predictions from each model to make a new feature.

There are generally two different variants for stacking, variant A and B. For this article, we are focusing on variant A, since this seems to get the better results of the two variants, because models more easily overfit to training data in variant B. This is likely also the reason why practitioners use variant A, although it does not eliminate overfitting.

Please note that there is no correct way of implementing model stacking (David H. Wolpert), because model stacking only describes the process of combining many models with a final generalized model. There exists ways to implement model stacking, some of which has been proven to work well in practice. This is why we explore variant A.

Explaining The Model Stacking Process

Model stacking should always be accompanied by cross validation, to reduce overfitting models to training data. This is common practice – read more here.

Model stacking will seem like a simple technique to improving your results, when you understand what happens inside the algorithm. Though, there is many components interacting, and keeping track of all of them can be quite difficult, especially when first learning this concept. For you to fully understand the algorithm, I created a step-by-step image and description, such that it is easier to understand.

For starters, when doing model stacking with cross-validation, we require three parameters: a Training dataset, Holdout dataset (validation dataset) and a list of models called models_to_train.

Figure 1: Model stacking diagram

The most essential part here, is that each model's predictions becomes a slice of a new feature, such that each model gets to predict a slice of the training data for this new feature.

Now, let's put figure 1 into text, to actually explain what goes on! Later on, you will encounter a real example in Python.

  1. Gather models with optimized hyperparameters into a models_to_train array.
  2. Split the original dataset into a Training and Holdout dataset.
  3. Let Training go onwards into the upcoming loop, and save Holdout until the last part in the upcoming loop.
  4. Make a for loop with KFold Cross-Validation where k=5
    • In each iteration, split the Training dataset into another training and testing dataset. Call them X_train, y_train, X_test and y_test. The white parts in figure 1 represents X_test and y_test, while the green represents X_train and y_train.
    • Set the current model equal to models_to_train[k-1].
    • Train the current model on X_train and y_train.
    • Make predictions on the test dataset X_test and call them y_test_pred. Extend an array full_y_pred with the predictions y_test_pred.
    • Make predictions on the Holdout dataset Holdout and call them holdout_pred.
  5. Average the holdout_pred arrays into a full_holdout_pred array.
  6. Add full_y_pred as a new feature in Training and add full_holdout_pred as a new feature in Holdout
  7. Return the datasets Training and Holdout with the new features for use in the next layer.

Notice that we use one model to predict each part of the total training dataset; we predict one fold at a time, with a different model each time.

Great, we have some stacked machine learning models, but how can we use them for anything? The next step is taking the new training and holdout dataset and use them as input in the next layer.

Introducing Layers Of Model Stacks

Stacking machine learning models is done in layers, and there can be arbitrarily many layers, dependent on exactly how many models you have trained, along with the best combination of these models. For an instance, the first layer could boil down to learning some feature that has great predictive power, while the next layer might be doing something else, like reducing noise.

We put models stacks in layers, and usually with a different purpose. At the end, we get a final dataset, that we feed into a last model. The last model is called a meta-learner (e.g. meta-regressor or meta-classifier), and its purpose is to generalize all the features from each layer into the final predictions.

Below is an example, where each level represents a new layer in our total pipeline, when applying model stacking (there is a switch between layer and level, but their meaning is the same). The example given is only with two levels, where level 2 is the final model that gives the final predictions, which is the output.

Figure 2: Layers of model stacks (in levels)

If you wanted to add more layers, we could just add another level after level 1 and make the meta-learner level 3. If we were to increase the numbers of layers, level 2 would be another round of model stacking of M models, just as in level 1. The meta-learner is always present in the last layer, hence why we insert a new level before the meta-learner.

Figure 3: Adding a layer to figure 2

To summarize, we produce new features by different models stack, which is combined in a new dataset to make final predictions at the end. People sometimes combine other methods in these model stacks, such as feature selection or normalization, but we will not explore these other methods here.

What is a Stack of Models?

Quite literally, you define a stack of models for each layer, and you run the first model from the first layer, all the way down to the last model in the last layer, while making predictions between levels. These stacks of models could be any type of model, e.g. we could have a stack with XGBoost, Neural Networks and Linear Regression.

The image below helps realizing how models are stacked in layers, even though it is not as specific as the one I provided earlier.

Why Should You Use Stacking?

Throughout this article, the promise of better results have been mentioned a couple of times. Let's justify how and why we get better results when using model stacking.

1) Model stacking is used amongst competition winners and practitioners – and the reason for why is simple. Your meta-learner generalizes better than a single model, i.e. it makes better predictions on unseen data, than just a single model. It's that simple.

2) A good and older paper on this topic is Stacked Generalization by David H. Wolpert (1992). Wolpert argues that model stacking deduces the bias in a model on a particular dataset, such that we later on can correct for such bias in a meta-learner.

3) Before model stacking became popular and more widely used, a technique called Bayes Model Averaging (BMA) existed. In such a model, weights were the posterior probabilities of models – but instead, empirical evidence by Bertrand Clarke has shown that model stacking with cross validation out-performs BMA, while being robust, because model stacking is less sensitive to changes in a dataset.

I think the argument you should pay most attention to, is the first point – everyone uses model stacking to get better results. But if you are looking for an explanation of why it works, I would read the paper by Wolpert.

Step By Step: Code For Stacking in Python

Now for what most developers would consider the fun part. The part where we apply what we just learned from reading about what model stacking is, and how exactly it improves the predictive power. GitHub repository with full code.

The first thing to do, in a Machine Learning project, is finding a dataset. For this illustrative example, we are going to be using a toy dataset (one that requires no data cleaning or transforming).

Importing and Preparing Data

Before we can import a dataset, we need to import some of the basic packages for handling datasets.

import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
import warnings

The next step is to load the dataset into a Pandas dataframe, such that we can index and inspect the dataset easily. We simply provide the load_boston function from sklearn, and we input it into the function as seen below. This will return a dataframe containing all the features and samples in the Boston Housing Dataset.

def dataset_to_df(load):
    # Load the input data into the dataframe
    df = pd.DataFrame(load.data, columns=load.feature_names)
    
    # Add the output data into the dataframe
    df['label'] = pd.Series(load.target)
    
    # Return the dataframe
    return df

df = dataset_to_df(load_boston())

After loading the dataset, we typically want to inspect it and learn what the features means. This task is omitted here, but I will leave 5 samples of data, along with the meaning of each feature in the dataset means here. You can also find a description of all these features here.

Inspecting the first five rows in our dataset, we can see some values. We don't know what they mean yet.

CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT label
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2

1-13 is our input and 14 is the output which we are trying to predict as precisely as possible, given an unseen test dataset.

  1. CRIM: per capita crime rate by town
  2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
  3. INDUS: proportion of non-retail business acres per town.
  4. CHAS: Charles River dummy variable (1 if tract bounds river; 0 otherwise)
  5. NOX: nitric oxides concentration (parts per 10 million)
  6. RM: average number of rooms per dwelling
  7. AGE: proportion of owner-occupied units built prior to 1940
  8. DIS: weighted distances to five Boston employment centres
  9. RAD: index of accessibility to radial highways
  10. TAX: full-value property-tax rate per \$10,000
  11. PTRATIO: pupil-teacher ratio by town
  12. B: $1000(Bk-0.63)^2$ where Bk is the proportion of blacks by town
  13. LSTAT: % lower status of the population
  14. MEDV (label): Median value of owner-occupied homes in $1000's

Grid Search Before Stacking

Before we start doing model stacking, we want to optimize the hyperparameters for each algorithm for a baseline comparison. This is commonly done by a searching algorithm, for an example, scikit-learn offers a grid search and random search algorithm.

If we cannot exceed the baseline predictions from the baseline models, then we should simply not bother trying model stacking. Sometimes though, model stacking cannot improve results by that much, either because of a low number of samples, or not complex enough data.

Before we start stacking, we split the dataset df into four variables X_train, X_test, y_train and y_test. Any variable named X is typically our input data, while y is what we want to predict (the output data).

from sklearn.model_selection import train_test_split

# Getting the output variable
y = df['label']

# Getting the input variables
X = df.drop(['label'], axis=1)

# Diving our input and output into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
                                    X, y, 
                                    test_size=0.33, 
                                    random_state=42
                                   )

I made a function for doing hyperparameter tuning. This function gives us exactly what we want; the best model, the predictions and the score of the best model on the test dataset. It uses imports from the scikit-learn package (sklearn), which makes it easier to run the grid search algorithm.

As parameters to the function, we can just pass in our data, a model and different hyperparameters for tuning.

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score

def algorithm_pipeline(X_train_data, X_test_data, y_train_data, y_test_data, 
                       model, param_grid, cv=10, scoring_fit='neg_mean_squared_error',
                       scoring_test=r2_score, do_probabilities = False):
    gs = GridSearchCV(
        estimator=model,
        param_grid=param_grid, 
        cv=cv, 
        n_jobs=-1, 
        scoring=scoring_fit,
        verbose=2
    )
    fitted_model = gs.fit(X_train_data, y_train_data)
    best_model = fitted_model.best_estimator_
    
    if do_probabilities:
      pred = fitted_model.predict_proba(X_test_data)
    else:
      pred = fitted_model.predict(X_test_data)
    
    score = scoring_test(y_test_data, pred)
    
    return [best_model, pred, score]

Next, we define the hyperparameters for tuning, along with the models. We use Random Forest, LightGBM and XGBoost in the following code, since they usually perform the best. First, we import and instantiate the classes for the models, then we define some parameters to input into the grid search function.

from sklearn.ensemble import RandomForestRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

# Defining our estimator, the algorithm to optimize
models_to_train = [XGBRegressor(), LGBMRegressor(), RandomForestRegressor()]

# Defining the hyperparameters to optimize
grid_parameters = [
    { # XGBoost
        'n_estimators': [400, 700, 1000],
        'colsample_bytree': [0.7, 0.8],
        'max_depth': [15,20,25],
        'reg_alpha': [1.1, 1.2, 1.3],
        'reg_lambda': [1.1, 1.2, 1.3],
        'subsample': [0.7, 0.8, 0.9]
    },
    { # LightGBM
        'n_estimators': [400, 700, 1000],
        'learning_rate': [0.12],
        'colsample_bytree': [0.7, 0.8],
        'max_depth': [4],
        'num_leaves': [10, 20],
        'reg_alpha': [1.1, 1.2],
        'reg_lambda': [1.1, 1.2],
        'min_split_gain': [0.3, 0.4],
        'subsample': [0.8, 0.9],
        'subsample_freq': [10, 20]
    }, 
    { # Random Forest
        'max_depth':[3, 5, 10, 13], 
        'n_estimators':[100, 200, 400, 600, 900],
        'max_features':[2, 4, 6, 8, 10]
    }
]

Now that we have defined all the models and hyperparameters for tuning, we can simply input our data into the function defined earlier.

models_preds_scores = []

for i, model in enumerate(models_to_train):
    params = grid_parameters[i]
    
    result = algorithm_pipeline(X_train, X_test, y_train, y_test, 
                                 model, params, cv=5)
    models_preds_scores.append(result)

Inspecting the results, we finally get to see which models performed the best, by running the following lines of code.

for result in models_preds_scores:
    print('Model: {0}, Score: {1}'.format(type(result[0]).__name__, result[2]))

As we can see from the results, XGBoost clearly outperformed the two other algorithms, therefore we will use XGBoost as the meta-learner.

Model: XGBRegressor, Score: 0.8954
Model: LGBMRegressor, Score: 0.8679
Model: RandomForestRegressor, Score: 0.8697

We have a score to beat; the XGBRegressor score of $0.8954$. The point of stacking, is that we can improve results. Let me show you how to make it happen.

Stacking Models

After doing some research on existing packages, I found pystacknet and mlxtend. For this article, I went with mlxtend, because it integrates nicely with scikit-learn, which defines a great API that XGBoost, LightGBM and other packages follow.

We start with importing the classes from the packages that we need to run this code. We assume no hyperparameters for the stacking – this means that we use the predefined hyperparameters for each model.

from mlxtend.regressor import StackingCVRegressor
from sklearn.linear_model import Ridge, Lasso
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

xgb = XGBRegressor()
lgbm = LGBMRegressor()
rf = RandomForestRegressor()
ridge = Ridge()
lasso = Lasso()
svr = SVR(kernel='linear')

Now that we have defined all our models, we can begin improving our results by stacking some models! As we can see here, we defined two levels, where the first level has 6 models, and the second level has the meta-learner.

Note: XGBoost/mlxtend has a bug when trying to mix the packages. That is why we edit the names of the columns of the X_test to $f0$, $f1$, etc. – XGBoost internally renames the features of X_train, and if we don't provide the exact same names, it will not run. We also use shuffle=False and random_state=42 for the sake of reproducing the same results displayed in this article.

stack = StackingCVRegressor(regressors=(ridge, lasso, svr, rf, lgbm, xgb),
                            meta_regressor=xgb, cv=12,
                            use_features_in_secondary=True,
                            store_train_meta_features=True,
                            shuffle=False,
                            random_state=42)

stack.fit(X_train, y_train)

X_test.columns = ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12']
pred = stack.predict(X_test)
score = r2_score(y_test, pred)

Finally, after solving some small tasks throughout the code, we get to a result.

0.9071

About 0.01 better in the $r^2$ score – which makes sense. The dataset used here is highly linear, which means it's really hard to capture any patterns that the baseline algorithms couldn't capture already. But it still demonstrates the effect of stacking, which is better results – even on highly linear datasets.

The possibilities for optimizing results for more complex datasets are much greater, and if you want to compete on Kaggle, I suggest picking up a version of model stacking. This is a rather simple case; on Kaggle, you will see stacking of 100+ models in a single layer. Gilberto and Stanislav on Kaggle optimized 33 models from scratch and combined them in a model stack to win the first place award of \$5000. This was some years back, today there are prices of up to \$1.000.000 for detecting Deepfakes.