Chapter 8: Cross-validation

Grid search

Josiah Wang

Sometimes your model may contain many hyperparameters that are not learnt from the training data, for example the K and the distance metric in a K-nearest neighbour classifier.

You might want to automatically figure out which values of these hyperparameters are optimal for your dataset. This is known as hyperparameter tuning.

One way to do this is to test all combinations of hyperparameters exhaustively and pick the combination that gives the best performance. This is called a grid search.
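To make the idea concrete, here is a minimal sketch of an exhaustive search written by hand. It assumes scikit-learn's built-in iris dataset as a stand-in for your own data, uses a deliberately small set of candidate values, and scores each combination by its mean 5-fold cross-validation accuracy. These choices are illustrative assumptions only.

```python
from itertools import product

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in dataset (an assumption; substitute your own training data)
x, y = load_iris(return_X_y=True)

# Candidate values for each hyperparameter (kept small for illustration)
k_values = [1, 3, 5, 7, 9]
metrics = ["manhattan", "euclidean", "chebyshev"]

best_score, best_params = -np.inf, None

# Try every (K, metric) combination and remember the best one
for k, metric in product(k_values, metrics):
    model = KNeighborsClassifier(n_neighbors=k, metric=metric)
    # Mean accuracy over 5 cross-validation folds
    score = cross_val_score(model, x, y, cv=5).mean()
    if score > best_score:
        best_score = score
        best_params = {"n_neighbors": k, "metric": metric}

print(best_params, best_score)
```

Every extra hyperparameter adds another level to this loop, which is why the bookkeeping quickly becomes tedious to write by hand.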

Let’s say we want to search for the best K (number of neighbours) and distance metric to use for our K-nearest neighbour classifier. We will test from 1 to 50 neighbours, and try three distance metrics.
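As a quick check on how large this search is, the sketch below counts the combinations with scikit-learn's ParameterGrid helper. The number of folds (n_folds) is an assumed value used only to estimate how many models would be fitted under cross-validation.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# 50 values of K and 3 distance metrics
grid = {"n_neighbors": np.arange(1, 51),
        "metric": ["manhattan", "euclidean", "chebyshev"]}

# ParameterGrid enumerates every combination in the grid
combinations = list(ParameterGrid(grid))
n_folds = 10  # assumed number of cross-validation folds

print(len(combinations))            # 150 combinations
print(len(combinations) * n_folds)  # 1500 model fits during the search
```

The cost grows multiplicatively with each hyperparameter you add, so it is worth keeping the grid as small as the problem allows.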

We will use scikit-learn’s GridSearchCV class to do this easily, rather than trying to write nested loops and tracking all the scores ourselves!

GridSearchCV will automatically set the n_neighbors and metric parameters of KNeighborsClassifier when performing the grid search.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# ... assume data is already loaded

# the values of the different hyperparameters to search
grid = {"n_neighbors": np.arange(1, 51),
        "metric": ["manhattan", "euclidean", "chebyshev"]
       }

# run the grid search, scoring each combination with 10-fold cross-validation
classifier = GridSearchCV(KNeighborsClassifier(), cv=10, param_grid=grid)
classifier.fit(x_train, y_train)

# check which hyperparameters gave the best score
print(classifier.best_params_)  ## {'metric': 'chebyshev', 'n_neighbors': 5}
print(classifier.best_score_)   ## 0.9666666666666666

# predict using the model with the set of hyperparameters that gave the best score
predictions = classifier.best_estimator_.predict(x_test)
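Beyond the single best combination, the fitted GridSearchCV object also records the score of every combination in its cv_results_ attribute. The sketch below is a self-contained illustration of that; the iris dataset, the train/test split and the reduced grid are assumptions chosen so it runs quickly. It also checks that, with the default refit=True, calling predict on the search object gives the same predictions as calling it on best_estimator_.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data and split (assumptions; use your own x_train, y_train, x_test)
x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0, stratify=y)

# A reduced grid so the example runs quickly
grid = {"n_neighbors": [1, 3, 5, 7, 9],
        "metric": ["manhattan", "euclidean"]}

search = GridSearchCV(KNeighborsClassifier(), param_grid=grid, cv=5)
search.fit(x_train, y_train)

# cv_results_ holds one entry per hyperparameter combination
results = search.cv_results_
for params, mean, std in zip(results["params"],
                             results["mean_test_score"],
                             results["std_test_score"]):
    print(params, round(mean, 3), round(std, 3))

# With refit=True (the default), the search object predicts with the best model
same = np.array_equal(search.predict(x_test),
                      search.best_estimator_.predict(x_test))
print(same)
```

Looking at the full table of scores can be useful when several combinations perform almost equally well, since the reported best one may then be only marginally better than its neighbours.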

If you pass a Pipeline to GridSearchCV, you refer to a parameter of a specific pipeline step by the step name, followed by two underscores (__), followed by the parameter name. For example, classifier__n_neighbors refers to the n_neighbors parameter of the step named classifier, as in the example below.

from sklearn.preprocessing import StandardScaler

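# standardise the features, then classify with K-nearest neighbours
pipeline = Pipeline([("scaler", StandardScaler()),
                     ("classifier", KNeighborsClassifier())])

# prefix each hyperparameter with the name of its pipeline step and "__"
grid = {"classifier__n_neighbors": np.arange(1, 51),
        "classifier__metric": ["manhattan", "euclidean", "chebyshev"]
       }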

classifier = GridSearchCV(pipeline, cv=10, param_grid=grid)
classifier.fit(x_train, y_train)

print(classifier.best_params_)  ## {'classifier__metric': 'euclidean', 'classifier__n_neighbors': 11}
print(classifier.best_score_)   ## 0.9583333333333333

predictions = classifier.best_estimator_.predict(x_test)
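If you are unsure which names a pipeline accepts, one option is to list them with get_params(). The sketch below rebuilds the same two-step pipeline and prints the parameter names that belong to the classifier step; it also shows that set_params() understands the same step__parameter naming, which is the mechanism a grid search relies on. The specific values set here are arbitrary examples.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([("scaler", StandardScaler()),
                     ("classifier", KNeighborsClassifier())])

# All parameter names the pipeline accepts, including nested step parameters
param_names = sorted(pipeline.get_params().keys())

# Keep only those belonging to the "classifier" step
knn_params = [name for name in param_names if name.startswith("classifier__")]
print(knn_params)

# The same naming works with set_params (values here are arbitrary examples)
pipeline.set_params(classifier__n_neighbors=11, classifier__metric="euclidean")
print(pipeline.named_steps["classifier"].n_neighbors)  # 11
```

Any name in this list can be used as a key in the param_grid dictionary, so parameters of the scaler step (prefixed with scaler__) can be searched over in the same way.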