Grid Search with Cross-Validation (GridSearchCV) is a brute-force approach to finding the best hyperparameters for a specific dataset and model. Why not automate it to the extent we can? Stick around until the end for a RandomizedSearchCV implementation in addition to the GridSearchCV one.

This may be a trivial task for some, but it is an important one – hence it is worth showing how you can run a hyperparameter search with all the popular packages.

In this post, I'm going to go over code for both classification and regression, alternating between Keras, XGBoost, LightGBM and Scikit-Learn.

There is a GitHub repository available with a Colab button, where you can instantly run the same code I used in this post.

Table of Contents (Click To Scroll)

  1. What Is GridSearchCV?

  2. Setup: Prepared Dataset

  3. Running GridSearchCV (Keras, sklearn, XGBoost and LightGBM)

  4. Running Nested Cross-Validation with Grid Search

  5. Running RandomizedSearchCV

  6. Further Readings (Books and References)

What Is GridSearchCV?

In one line: cross-validation is the process of splitting the same dataset into K partitions, and for each combination of hyperparameters in the grid, we fit and score the algorithm on every split – a brute-force search that tries every combination.

Note that I'm referring to K-Fold cross-validation (CV), even though there are other methods of doing CV.

In an iterative manner, we rotate which subset of the full dataset is used for testing and which for training. We usually split the full dataset so that each testing fold contains 10% ($K=10$) or 20% ($K=5$) of the full dataset.

[Figure: K-Fold Cross-Validation with $K=5$ – we slide over the dataset, switching which part is used for training and which for testing]

Grid Search: from this picture of cross-validation, what we do in grid search is the following; for each possible combination of hyperparameters, we fit and score the model separately on every fold, so the total number of fits is the number of combinations times $K$.
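
To make the scale of this concrete, here is a minimal sketch (not from the original notebook) that counts how many models GridSearchCV would fit for a small grid with $K=5$, using scikit-learn's ParameterGrid and KFold helpers:

from sklearn.model_selection import KFold, ParameterGrid

param_grid = {
    'n_estimators': [400, 700, 1000],
    'max_depth': [15, 20, 25]
}

# Every combination in the grid (3 x 3 = 9 candidates)
candidates = list(ParameterGrid(param_grid))

# K-Fold cross-validation with K=5 gives 5 train/test splits
kfold = KFold(n_splits=5)

# GridSearchCV fits one model per (candidate, fold) pair,
# plus one final refit of the best candidate on the full training set
total_fits = len(candidates) * kfold.get_n_splits() + 1
print(total_fits)  # 46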

Setup: Prepared Dataset

We need a prepared dataset to be able to run a grid search over all the different parameters we want to try. I'm assuming you have already prepared your own dataset; if not, I will show a short version of preparing the ones used here and then get right to running grid search.

In this post, I'm going to be running models on three different datasets: MNIST, Boston House Prices and Breast Cancer. The sole purpose is to jump right past preparing the datasets and into running them with GridSearchCV, so we will keep the preparation to a minimum.

For the MNIST dataset, we reshape the images, normalize them by dividing by the maximum pixel value (255) and one-hot encode our output classes:

from keras.datasets import mnist
from keras.utils import to_categorical

# LOAD DATA
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# PREPROCESSING
def preprocess_mnist(x_train, y_train, x_test, y_test):
    # Reshaping all images into 28x28x1 arrays (single grayscale channel)
    x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
    x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)
    input_shape = (28, 28, 1)
    
    # Float values for division
    x_train = x_train.astype('float32')
    x_test = x_test.astype('float32')
    
    # Scaling the pixel values to [0, 1] by dividing by the max pixel value
    x_train /= 255
    x_test /= 255
    
    # One-hot encoding the y values
    y_train = to_categorical(y_train)
    y_test = to_categorical(y_test)
    
    return x_train, y_train, x_test, y_test, input_shape

X_train, y_train, X_test, y_test, input_shape = preprocess_mnist(x_train, y_train, x_test, y_test)

For the Boston house prices dataset, we do even less preprocessing: we simply load it from scikit-learn and split it into training and testing sets.

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

# Load Boston house prices dataset
boston = load_boston()
X = boston.data
y = boston.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

For the last dataset, breast cancer, we don't do any preprocessing except for splitting the data into train and test sets:

from sklearn.datasets import load_breast_cancer

# Load breast cancer dataset
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Running GridSearchCV

The next step is to actually run grid search with cross-validation. How does it work? I made a function that is pretty easy to pick up and use:

- You pass in your training and testing splits as X_train_data, X_test_data, y_train_data, y_test_data.
- You pass in your model, from whichever library it may be – Keras, sklearn, XGBoost or LightGBM.
- With param_grid, you specify which parameters to 'brute-force' your way through to find the best hyperparameters.
- With scoring_fit, you specify which scoring from sklearn.metrics to use when fitting the model.
- Lastly, you can set other options, such as how many K-partitions to use with cv.

We use n_jobs=-1 as a standard, since that means we use all available CPU cores to train our model.

from sklearn.model_selection import GridSearchCV

def algorithm_pipeline(X_train_data, X_test_data, y_train_data, y_test_data, 
                       model, param_grid, cv=10, scoring_fit='neg_mean_squared_error',
                       do_probabilities=False):
    # Exhaustive search over the parameter grid with K-fold cross-validation
    gs = GridSearchCV(
        estimator=model,
        param_grid=param_grid, 
        cv=cv, 
        n_jobs=-1, 
        scoring=scoring_fit,
        verbose=2
    )
    fitted_model = gs.fit(X_train_data, y_train_data)
    
    # Predict on the held-out test set with the best estimator found
    if do_probabilities:
        pred = fitted_model.predict_proba(X_test_data)
    else:
        pred = fitted_model.predict(X_test_data)
    
    return fitted_model, pred

Note that we could swap out GridSearchCV for RandomizedSearchCV, if you want to use that instead – more on that towards the end.

Keras

First, we define the neural network architecture, and since the MNIST dataset consists of images, we define it as a convolutional neural network (CNN).

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

# Readying neural network model
def build_cnn(activation='relu',
              dropout_rate=0.2,
              optimizer='Adam'):
    model = Sequential()
    
    model.add(Conv2D(32, kernel_size=(3, 3),
              activation=activation,
              input_shape=input_shape))
    model.add(Conv2D(64, (3, 3), activation=activation))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(dropout_rate))
    model.add(Flatten())
    model.add(Dense(128, activation=activation))
    model.add(Dropout(dropout_rate))
    model.add(Dense(10, activation='softmax'))
    
    model.compile(
        loss='categorical_crossentropy', 
        optimizer=optimizer, 
        metrics=['accuracy']
    )
    
    return model

Next, we just define the parameters and the model to input into the algorithm_pipeline; we run classification on this dataset, since we are trying to predict which class a given image belongs to. Note that I commented out some of the parameters, because it would take a long time to train, but you can always fiddle around with which parameters you want. You could even add pool_size or kernel_size – see the sketch after the results below for one way to expose them.

param_grid = {
              'epochs':[1,2,3],
              'batch_size':[128]
              #'epochs' :              [100,150,200],
              #'batch_size' :          [32, 128],
              #'optimizer' :           ['Adam', 'Nadam'],
              #'dropout_rate' :        [0.2, 0.3],
              #'activation' :          ['relu', 'elu']
             }

from keras.wrappers.scikit_learn import KerasClassifier

model = KerasClassifier(build_fn=build_cnn, verbose=0)

model, pred = algorithm_pipeline(X_train, X_test, y_train, y_test, model, 
                                        param_grid, cv=5, scoring_fit='neg_log_loss')

print(model.best_score_)
print(model.best_params_)

From this GridSearchCV, we get the best score and best parameters to be:

-0.04399333562212302
{'batch_size': 128, 'epochs': 3}
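
As an aside, here is a hypothetical sketch (not part of the original notebook) of how build_cnn could expose kernel_size and pool_size, so that they can be searched over like any other parameter:

# Hypothetical variant of build_cnn exposing kernel_size and pool_size
def build_cnn_tunable(activation='relu',
                      dropout_rate=0.2,
                      optimizer='Adam',
                      kernel_size=(3, 3),
                      pool_size=(2, 2)):
    model = Sequential()
    model.add(Conv2D(32, kernel_size=kernel_size,
              activation=activation,
              input_shape=input_shape))
    model.add(Conv2D(64, kernel_size, activation=activation))
    model.add(MaxPooling2D(pool_size=pool_size))
    model.add(Dropout(dropout_rate))
    model.add(Flatten())
    model.add(Dense(128, activation=activation))
    model.add(Dropout(dropout_rate))
    model.add(Dense(10, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer=optimizer,
                  metrics=['accuracy'])
    return model

# The extra arguments can then be searched over like any other parameter
param_grid = {
    'epochs': [1],
    'batch_size': [128],
    'kernel_size': [(3, 3), (5, 5)],
    'pool_size': [(2, 2), (3, 3)]
}
model = KerasClassifier(build_fn=build_cnn_tunable, verbose=0)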

Fixing a scoring bug with Keras

You might wonder why 'neg_log_loss' was used as the scoring method above. I came across this issue when trying to use accuracy for a Keras model in GridSearchCV.

The solution, if you want to use something other than negative log loss, is to remove some of the preprocessing of the MNIST dataset; that is, REMOVE the part where we one-hot encode the output variables:

# Categorical y values – REMOVE these two lines from preprocess_mnist
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
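
With the labels kept as integers, the same pipeline call can then be scored with accuracy. Here is a minimal sketch of the call (assuming the KerasClassifier wrapper handles the one-hot encoding internally for the categorical cross-entropy loss, which is my understanding of the old keras.wrappers.scikit_learn wrapper):

# y_train / y_test are now integer class labels (to_categorical was skipped)
model = KerasClassifier(build_fn=build_cnn, verbose=0)
param_grid = {'epochs': [1, 2, 3], 'batch_size': [128]}

model, pred = algorithm_pipeline(X_train, X_test, y_train, y_test, model,
                                 param_grid, cv=5, scoring_fit='accuracy')

print(model.best_score_)
print(model.best_params_)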

Surely we would be able to run with other scoring methods, right? Yes, that was actually the case (see the notebook). This was the best score and best parameters:

0.9858
{'batch_size': 128, 'epochs': 3}

XGBoost

Next, we define the parameters for the Boston house prices dataset. Here the task is regression, for which I chose XGBoost. I also chose to evaluate by Root Mean Squared Error (RMSE), taking the square root of the negated mean squared error returned by the scorer.

import numpy as np
import xgboost as xgb

model = xgb.XGBRegressor()
param_grid = {
    'n_estimators': [400, 700, 1000],
    'colsample_bytree': [0.7, 0.8],
    'max_depth': [15,20,25],
    'reg_alpha': [1.1, 1.2, 1.3],
    'reg_lambda': [1.1, 1.2, 1.3],
    'subsample': [0.7, 0.8, 0.9]
}

model, pred = algorithm_pipeline(X_train, X_test, y_train, y_test, model, 
                                 param_grid, cv=5)

# Root Mean Squared Error
print(np.sqrt(-model.best_score_))
print(model.best_params_)

The best score and parameters for the house prices dataset found by GridSearchCV were:

3.4849014783892733
{'colsample_bytree': 0.8, 'max_depth': 20, 'n_estimators': 400, 'reg_alpha': 1.2, 'reg_lambda': 1.3, 'subsample': 0.8}

LightGBM

The next task was LightGBM for classifying breast cancer. The metric chosen was accuracy.

import lightgbm as lgb

model = lgb.LGBMClassifier()
param_grid = {
    'n_estimators': [400, 700, 1000],
    'colsample_bytree': [0.7, 0.8],
    'max_depth': [15,20,25],
    'num_leaves': [50, 100, 200],
    'reg_alpha': [1.1, 1.2, 1.3],
    'reg_lambda': [1.1, 1.2, 1.3],
    'min_split_gain': [0.3, 0.4],
    'subsample': [0.7, 0.8, 0.9],
    'subsample_freq': [20]
}

model, pred = algorithm_pipeline(X_train, X_test, y_train, y_test, model, 
                                 param_grid, cv=5, scoring_fit='accuracy')

print(model.best_score_)
print(model.best_params_)

The best parameters and best score from the GridSearchCV on the breast cancer dataset with LightGBM were:

0.9736263736263736
{'colsample_bytree': 0.7, 'max_depth': 15, 'min_split_gain': 0.3, 'n_estimators': 400, 'num_leaves': 50, 'reg_alpha': 1.3, 'reg_lambda': 1.1, 'subsample': 0.7, 'subsample_freq': 20}

Sklearn

Just to show that you indeed can run GridSearchCV with one of sklearn's own estimators, I tried the RandomForestClassifier on the same dataset as LightGBM.

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
param_grid = {
    'n_estimators': [400, 700, 1000],
    'max_depth': [15,20,25],
    'max_leaf_nodes': [50, 100, 200]
}

model, pred = algorithm_pipeline(X_train, X_test, y_train, y_test, model, 
                                 param_grid, cv=5, scoring_fit='accuracy')

print(model.best_score_)
print(model.best_params_)

And indeed the score was worse than from LightGBM, as expected:

0.9648351648351648
{'max_depth': 25, 'max_leaf_nodes': 50, 'n_estimators': 1000}

Interested in running a grid search whose score estimate is unbiased? I welcome you to nested cross-validation, where you get a good bias-variance trade-off and, in theory, as unbiased an estimate of the true error as possible.

I would encourage you to check out this repository over at GitHub.

I made a post about it:

Nested Cross-Validation Python Code
Code for nested cross-validation in machine learning – unbiased estimation of the true error, plus what nested cross-validation is and why and when to use it.

I embedded the examples below, and you can install the package with a pip command:
pip install nested-cv
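
To give a feel for the idea in code, here is a rough sketch of nested cross-validation using plain scikit-learn (this is the general pattern, not the nested-cv package's API): the inner loop tunes hyperparameters with GridSearchCV, while the outer loop estimates the generalization error of the whole tuning procedure.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Inner loop: GridSearchCV tunes hyperparameters on each outer training fold
inner_search = GridSearchCV(
    estimator=RandomForestClassifier(),
    param_grid={'n_estimators': [100, 400], 'max_depth': [15, 25]},
    cv=5,
    scoring='accuracy'
)

# Outer loop: scores the tuned model on folds it was never tuned on
outer_scores = cross_val_score(inner_search, X, y, cv=5, scoring='accuracy')
print(outer_scores.mean())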

RandomizedSearchCV

We don't have to restrict ourselves to GridSearchCV – why not implement RandomizedSearchCV too, if that is preferable to you? This is implemented at the bottom of the notebook available here.

We can specify another parameter for the pipeline, search_mode, which lets us choose which search algorithm to use. We also introduce a parameter called n_iterations, since RandomizedSearchCV needs a number of iterations, while GridSearchCV does not.

We can set defaults for both of those parameters, and indeed that is what I have done: search_mode = 'GridSearchCV' and n_iterations = 0 are the defaults, hence we default to GridSearchCV, where the number of iterations is not used.

Here is the code; notice that we just use a simple if-statement to pick which search class to use:

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

def search_pipeline(X_train_data, X_test_data, y_train_data, y_test_data, 
                    model, param_grid, cv=10, scoring_fit='neg_mean_squared_error',
                    do_probabilities=False, search_mode='GridSearchCV', n_iterations=0):
    fitted_model = None
    
    if search_mode == 'GridSearchCV':
        # Exhaustive search over every combination in param_grid
        gs = GridSearchCV(
            estimator=model,
            param_grid=param_grid, 
            cv=cv, 
            n_jobs=-1, 
            scoring=scoring_fit,
            verbose=2
        )
        fitted_model = gs.fit(X_train_data, y_train_data)

    elif search_mode == 'RandomizedSearchCV':
        # Randomly samples n_iterations combinations from param_grid
        rs = RandomizedSearchCV(
            estimator=model,
            param_distributions=param_grid, 
            cv=cv,
            n_iter=n_iterations,
            n_jobs=-1, 
            scoring=scoring_fit,
            verbose=2
        )
        fitted_model = rs.fit(X_train_data, y_train_data)
    
    if fitted_model is not None:
        if do_probabilities:
            pred = fitted_model.predict_proba(X_test_data)
        else:
            pred = fitted_model.predict(X_test_data)
        
        return fitted_model, pred
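
For completeness, a call to the pipeline in randomized mode might look like the sketch below; this exact call is not shown in the post, so the n_iterations value is just a placeholder, and the grid is the RandomForest one from earlier.

model = RandomForestClassifier()
param_grid = {
    'n_estimators': [400, 700, 1000],
    'max_depth': [15, 20, 25],
    'max_leaf_nodes': [50, 100, 200]
}

model, pred = search_pipeline(X_train, X_test, y_train, y_test, model, 
                              param_grid, cv=5, scoring_fit='accuracy',
                              search_mode='RandomizedSearchCV', n_iterations=10)

print(model.best_score_)
print(model.best_params_)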

Running this for the breast cancer dataset, it produces the results below, which are almost the same as the GridSearchCV result (which got a score of 0.9648):

0.9626373626373627
{'n_estimators': 1000, 'max_leaf_nodes': 100, 'max_depth': 25}

Further Readings

The most recommended books (Amazon referral links) are the following, in order. The first one is particularly good for practicing ML in Python, as it covers much of scikit-learn and TensorFlow.

Books from Amazon

1. Hands-On Machine Learning – the best practical book!

2. A condensed book with all the material needed to get started; a great reference!

3. An academic and theory-oriented book for deep learning

4. A look at Machine Learning through probability theory – recommended if you have a mathematics background

Class references from this post

I recommend reading the documentation for each model you are going to use with this GridSearchCV pipeline – it will help you resolve complications when migrating to other algorithms. In particular, here is the documentation for the algorithms I used in this post: