Want an almost unbiased estimate of the true error of an algorithm? This is where you are going to find it. I will explain the what, why, when and how of nested cross-validation. Specifically, the concept will be explained with K-Fold cross-validation.

GitHub package: I released an open-source package for nested cross-validation that works with Scikit-Learn, TensorFlow (with Keras), XGBoost, LightGBM and others.

Table of Contents

1. What Is Cross-Validation?
2. What Is Nested Cross-Validation?
3. When To Use Nested Cross-Validation?
4. Other findings for Nested Cross-Validation
5. What To Do After Nested Cross-Validation?
6. Code for Nested Cross-Validation

What Is Cross-Validation?

Firstly, a short explanation of cross-validation. K-Fold cross-validation is when you split your dataset into K partitions, with 5 or 10 partitions being the usual recommendation. You split the dataset by making K random, non-overlapping sets of observation indexes and then using each set in turn as the testing dataset. The share of the full dataset that becomes the testing dataset is $1/K$, while the training dataset is the remaining $(K-1)/K$. For each partition, a model is fitted on the training dataset and evaluated on the testing dataset of that split.

Below is an example of K-Fold cross-validation with $K=5$. The full dataset is split into a training and a testing dataset five times, each time with a different fifth held out for testing, and a model is trained on the training part of each split.
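To make the split concrete, here is a minimal sketch in plain Python. The function name k_fold_indices and the 20-observation dataset are my own choices for illustration. With 20 observations and $K=5$, every fold trains on 16 observations and tests on 4.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for K-Fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, then cut into K folds
    # Fold sizes differ by at most one when n_samples is not divisible by k.
    fold_sizes = [n_samples // k + (1 if f < n_samples % k else 0) for f in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

splits = list(k_fold_indices(n_samples=20, k=5))
for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"fold {fold}: train={len(train_idx)} test={len(test_idx)}")
```

Every observation lands in exactly one testing fold, which is what lets each of the K models be evaluated on data it was not trained on.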

The idea is that you use cross-validation with a search algorithm, to which you supply a hyperparameter grid (hyperparameters are the parameters selected before training a model). In combination with Random Search or Grid Search, you then fit a model for each candidate hyperparameter set in each cross-validation fold. Here is an example grid for a random forest model:

param_grid = {
    'max_depth': [3, None],
    'n_estimators': [10, 30, 50, 100, 200, 400, 600, 800, 1000],
    'max_features': [2, 4, 6]
}

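To get a feel for what a grid search does with such a grid, the sketch below expands it into every hyperparameter set using only the standard library. The helper expand_grid is a name I made up for illustration. The grid holds 2 x 9 x 3 = 54 hyperparameter sets, so a full grid search with 5 folds fits 270 models.

```python
from itertools import product

param_grid = {
    'max_depth': [3, None],
    'n_estimators': [10, 30, 50, 100, 200, 400, 600, 800, 1000],
    'max_features': [2, 4, 6],
}

def expand_grid(grid):
    """Return every hyperparameter set in the grid as a dict."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]

candidates = expand_grid(param_grid)
n_folds = 5
print(len(candidates))            # number of hyperparameter sets
print(len(candidates) * n_folds)  # model fits for a full grid search
```

Random search would only draw a fixed number of these candidates instead of trying all of them.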
What Is Nested Cross-Validation?

First, why should you care? Nested cross-validation is an extension of the above that fixes one of the problems with normal cross-validation. In normal cross-validation you only have a training and a testing set, and you use those same sets both to find the best hyperparameters and to estimate the error. That causes two problems:

  1. This may cause information leakage
  2. You estimate the error of a model on the same data that you used to find the best hyperparameters. This may cause significant bias.

You would not want to estimate the error of your model on the same set of training and testing data that you used to find the best hyperparameters. The resulting error estimate is optimistically biased, and it has been shown that the bias can be large [1].
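A small simulation can illustrate where this bias comes from. This is only a sketch under strong assumptions of my own: every hyperparameter set is equally useless (true accuracy 0.5), and the numbers of configurations and observations are hypothetical. Any difference between the scores is pure noise, yet the best score seen during the search looks better than chance.

```python
import random

rng = random.Random(0)

def noisy_accuracy(n_samples, true_accuracy=0.5):
    """Accuracy measured on n_samples when each prediction is right with probability true_accuracy."""
    hits = sum(rng.random() < true_accuracy for _ in range(n_samples))
    return hits / n_samples

n_configs = 50    # hypothetical number of hyperparameter sets tried
n_samples = 100   # hypothetical size of the data used for scoring

# Score every configuration on the same data and keep the best one.
search_scores = [noisy_accuracy(n_samples) for _ in range(n_configs)]
best_search_score = max(search_scores)

# Score the selected configuration again on fresh, held-out data.
held_out_score = noisy_accuracy(n_samples)

print(f"best score seen during the search: {best_search_score:.2f}")
print(f"score of that model on held-out data: {held_out_score:.2f}")
```

Reporting the best search score as the error of the model is the mistake that nested cross-validation is designed to avoid.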

Why nested cross-validation? Because bias and variance are tied to model selection, I would suggest that it is possibly one of the best approaches for estimating the true error, giving an estimate that is almost unbiased and has low variance.

As the image below suggests, we have two loops. The inner loop is basically normal cross-validation with a search function, e.g. random search or grid search. The outer loop, however, only supplies the inner loop with its training dataset, and the test dataset of the outer loop is held back.

For this reason, the hyperparameter search never sees the outer test data, which prevents this kind of information leakage. We also get a low, almost absent bias, as the papers suggest (papers explained further below).

The above image is an informal view of it, but I also compiled a more abstract view of nested cross-validation, describing the process as an algorithm.

Require means that something needs to be provided, and the rest of the parameters in this algorithm should be fairly obvious. Keep in mind that RandomSample($P_{sets}$) is a function that takes a random set from the hyperparameter grid.

Please leave a comment at the bottom of the post if you need more explaining. If it's hard to grasp, try to distinguish between i and j from the for-loops – they are very important to keep track of when reading this.
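If code reads easier than pseudocode, the two loops can be sketched with Scikit-Learn as below. This is one possible reading of the procedure, not the exact algorithm from the figure: the synthetic data, the Ridge model, the grid and the fold counts are all assumptions for illustration, and ParameterSampler stands in for RandomSample($P_{sets}$). The index i follows the outer folds and j follows the inner folds.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, ParameterSampler

# Synthetic data and a small grid, both chosen only for illustration.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
param_grid = {'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}

outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)
inner_cv = KFold(n_splits=3, shuffle=True, random_state=1)

outer_scores = []
for i, (train_idx, test_idx) in enumerate(outer_cv.split(X)):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]

    # Stand-in for RandomSample(P_sets): draw hyperparameter sets from the grid.
    sampled_sets = list(ParameterSampler(param_grid, n_iter=4, random_state=i))

    best_params, best_inner_mse = None, np.inf
    for params in sampled_sets:
        inner_mse = []
        # The inner loop only ever sees the outer training data.
        for j, (fit_idx, val_idx) in enumerate(inner_cv.split(X_train)):
            model = Ridge(**params).fit(X_train[fit_idx], y_train[fit_idx])
            preds = model.predict(X_train[val_idx])
            inner_mse.append(mean_squared_error(y_train[val_idx], preds))
        if np.mean(inner_mse) < best_inner_mse:
            best_params, best_inner_mse = params, np.mean(inner_mse)

    # Refit on the whole outer training set, then score once on the held-back test set.
    final_model = Ridge(**best_params).fit(X_train, y_train)
    outer_mse = mean_squared_error(y_test, final_model.predict(X_test))
    outer_scores.append(outer_mse)
    print(f"outer fold {i}: best params {best_params}, test MSE {outer_mse:.1f}")

print(f"estimated generalization MSE: {np.mean(outer_scores):.1f}")
```

Note that the best hyperparameters may differ from one outer fold to the next. The mean of the outer scores estimates the error of the whole procedure (search plus fit), not of one fixed hyperparameter set.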

When To Use Nested Cross-Validation?

There is no denying that nested cross-validation is computationally expensive for larger datasets. With anything more than a few thousand observations, you might find it too expensive to run.
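A quick count of model fits shows where the cost comes from. The helper count_fits is a hypothetical name, and the count is a rough sketch that ignores everything except fitting. With the configuration used later in this post (5 outer folds, 5 inner folds, 30 randomized search iterations), a normal cross-validated search needs 150 fits, while nested cross-validation needs 755 per algorithm.

```python
def count_fits(n_param_sets, inner_folds, outer_folds=None):
    """Number of model fits for a plain CV search (outer_folds=None) or nested CV."""
    search_fits = n_param_sets * inner_folds
    if outer_folds is None:
        return search_fits
    # Each outer fold runs a full inner search, then refits the winner once.
    return outer_folds * (search_fits + 1)

plain = count_fits(n_param_sets=30, inner_folds=5)
nested = count_fits(n_param_sets=30, inner_folds=5, outer_folds=5)
print(plain, nested)
```

Each of those fits also gets slower as the dataset grows, which is why the cost adds up quickly.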

Nested cross-validation has its purpose, especially in fields where data is limited. In biology, for example, there is usually not a lot of data to go with machine learning projects.

To summarize, use nested cross-validation:

  1. Where the dataset is small (a few thousand observations or less)
  2. When it is not too computationally expensive. Ask yourself if you find it feasible, given what type of computing power you have access to.

If you meet those two criteria, you should use nested cross-validation to get an almost unbiased estimate of the true error, and therefore to compare the performance of different algorithms.

Note: If your dataset is considered large, normal cross-validation would already produce an almost unbiased estimate of the true error for your algorithm, which is why the use case for nested cross-validation is a relatively small dataset.

Other findings for Nested Cross-Validation

Although the case for nested cross-validation is clear, a paper has been released [3] that details when you should not use nested cross-validation.

In particular, let me quote the two claims:

Nested CV procedures are probably not needed when selecting from among random forest, SVM with Gaussian kernel, and gradient boosting machines...

The words that caught my eye are "probably not needed". As it stands, if your dataset is relatively small and you find it computationally feasible, you should still go for testing it with nested cross-validation.

Nested CV procedures are probably not needed when selecting from among any set of classifier algorithms (provided they have only a limited number of hyper-parameters that must be tuned...).

Another "probably not needed" caught my eye. But this one is a more understandable claim, and one could see why this is the case even without the evidence. If you have a limited number (not specified) of hyperparameters, you don't have that much to optimize, so nested cross-validation could just be a waste of computational time for your project.

Though I don't exactly find it easy to think of an algorithm that has only a few hyperparameters to tune. Maybe nested cross-validation would be overzealous when testing something like K-Nearest Neighbors?

What To Do After Nested Cross-Validation?

This one is a big question, and surely has caused many misconceptions. So let me just make it clear:

If your results are stable (i.e. the same hyperparameter sets give roughly the same estimate of the error), then proceed with normal cross-validation.

To really pinpoint the procedure, let's put it into steps:

  1. Input all the algorithms, which you intend to estimate the error of, into a nested cross-validation procedure.
    a.   Continue if results are stable. Redo/rethink each unstable algorithm.
  2. Select the model with the lowest generalization error, i.e. take the mean of the outer scores for each algorithm.
  3. Run that algorithm in a normal cross-validation with grid search or random search, starting from the full hyperparameter grid rather than from any of the optimized hyperparameters.
  4. Compare the score from nested cross-validation to the one from normal cross-validation.

At this point, you have tested whether bias is introduced into the procedure of estimating the error of your model. If the results are very different, take it as an indication that normal cross-validation introduces bias into the estimate.
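The bookkeeping for these steps can be sketched in a few lines. All the scores below are hypothetical numbers I made up, and the stability rule (spread of the outer scores below 10% of their mean) is an arbitrary assumption of mine, so adjust it to your own definition of stable.

```python
from statistics import mean, stdev

# Hypothetical outer-fold RMSE values, one list per algorithm (lower is better).
outer_scores = {
    'RandomForestRegressor': [4.1, 3.9, 4.3, 4.0, 4.2],
    'XGBRegressor': [3.6, 3.8, 3.5, 3.7, 3.9],
    'LGBMRegressor': [3.7, 5.9, 3.2, 4.8, 3.4],
}

def summarize(scores):
    """Mean and spread of the outer scores for one algorithm."""
    return {'mean': mean(scores), 'std': stdev(scores)}

summary = {name: summarize(scores) for name, scores in outer_scores.items()}

# Step 1a, with an assumed rule: keep algorithms whose spread is below 10% of the mean.
stable = {name: s for name, s in summary.items() if s['std'] < 0.1 * s['mean']}

# Step 2: select the algorithm with the lowest mean outer score.
best_name = min(stable, key=lambda name: stable[name]['mean'])
nested_score = stable[best_name]['mean']

# Step 4: compare with the score from a normal cross-validated search (hypothetical value).
normal_cv_score = 3.4
gap = nested_score - normal_cv_score
print(best_name, round(nested_score, 2), round(gap, 2))
```

In this made-up example the third algorithm is set aside as unstable, and the remaining gap between the nested and the normal score is what you would inspect for signs of bias.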

Code for Nested Cross-Validation

You reached the fun part! Let me show you just how to use the GitHub package that I released specifically for this.

Firstly, you can install it with the pip command pip install nested-cv. Let's break down the documentation. This is going to be a regression example.
For classification, you need to modify the cv_options described in the package documentation, e.g. set metric to a classification metric and metric_score_indicator_lower to False.

In this next part, we simply make an array with different models to run in the nested cross-validation algorithm. An example used here is Random Forest, XGBoost and LightGBM:

models_to_run = [RandomForestRegressor(), xgb.XGBRegressor(), lgb.LGBMRegressor()]

Next, we would need to include the hyperparameter grid for each of the algorithms. The way we do this is by an array that contains different dictionaries. Note that the index for each algorithm in the array models_to_run should be the same index in the models_param_grid:

import numpy as np  # needed for np.linspace below

models_param_grid = [
    {  # 1st param grid, corresponding to RandomForestRegressor
        'max_depth': [3, None],
        'n_estimators': [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
        'max_features': [50, 100, 150, 200]
    },
    {  # 2nd param grid, corresponding to XGBRegressor
        'learning_rate': [0.05],
        'colsample_bytree': np.linspace(0.3, 0.5),
        'n_estimators': [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
        'reg_alpha': (1, 1.2),
        'reg_lambda': (1, 1.2, 1.4)
    },
    {  # 3rd param grid, corresponding to LGBMRegressor
        'learning_rate': [0.05],
        'n_estimators': [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
        'reg_alpha': (1, 1.2),
        'reg_lambda': (1, 1.2, 1.4)
    }
]

Now we have two arrays, with the algorithms and the corresponding hyperparameter grids. We loop over them and pass each model, together with its hyperparameter grid and a few other configuration options, to the nested cross-validation procedure:

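for i, model in enumerate(models_to_run):
    nested_CV_search = NestedCV(model=model, params_grid=models_param_grid[i],
                                outer_kfolds=5, inner_kfolds=5,
                                cv_options={'sqrt_of_score': True, 'randomized_search_iter': 30})
    # Assumption: the fit call and the outer_scores attribute are not in the snippet
    # as it survives here; check the package documentation for the exact names.
    nested_CV_search.fit(X=X, y=y)
    model_param_grid = nested_CV_search.best_params
    print(np.mean(nested_CV_search.outer_scores))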

Now we get the three generalization errors printed for us, and at this point we can look at the scores and select the best model. Then, if the results are stable, we continue with the steps shown earlier in What To Do After Nested Cross-Validation.
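If you want to try the same idea without installing anything beyond Scikit-Learn, the sketch below is a rough equivalent and not the package itself. It wraps a RandomizedSearchCV (the inner loop) in cross_val_score (the outer loop). The synthetic data, the small grids, the fold counts and the swap of XGBoost and LightGBM for GradientBoostingRegressor are all assumptions made to keep it self-contained and fast.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import KFold, RandomizedSearchCV, cross_val_score

# Synthetic regression data, only for illustration.
X, y = make_regression(n_samples=150, n_features=8, noise=5.0, random_state=0)

models_to_run = [RandomForestRegressor(random_state=0),
                 GradientBoostingRegressor(random_state=0)]
models_param_grid = [
    {'max_depth': [3, None], 'n_estimators': [20, 50]},
    {'learning_rate': [0.05, 0.1], 'n_estimators': [20, 50]},
]

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=3, shuffle=True, random_state=1)

results = {}
for model, grid in zip(models_to_run, models_param_grid):
    # Inner loop: randomized search over the grid on each outer training set.
    search = RandomizedSearchCV(model, grid, n_iter=3, cv=inner_cv,
                                scoring='neg_mean_squared_error', random_state=0)
    # Outer loop: score the tuned model on data the search never saw.
    neg_mse = cross_val_score(search, X, y, cv=outer_cv,
                              scoring='neg_mean_squared_error')
    results[type(model).__name__] = float(np.sqrt(-neg_mse).mean())

for name, rmse in results.items():
    print(f"{name}: RMSE {rmse:.2f}")
```

The printed value per algorithm is the mean of the outer-fold RMSE scores, which plays the same role as the generalization error discussed above.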

Sources and further reading:
  1. Probably the most recognized article on bias in cross-validation predictions.
    Cawley and Talbot: On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation.
  2. General reading on cross-validation, and nested cross-validation on page 45.
    Sebastian Raschka: Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning.
  3. Indications of when you probably do not need nested cross-validation.
    Wainer and Cawley: Nested cross-validation when selecting classifiers is overzealous for most practical applications.