Chapter 8: Cross-validation

Pipeline

Josiah Wang

Sometimes you may need to apply a sequence of transformations to your dataset before fitting a classifier.

For example, you might have to apply the same preprocessing to both the training and test datasets. Doing this separately before training and predicting is a bit tedious.
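Without a pipeline, a typical manual workflow looks something like the sketch below (assuming x_train, x_test and y_train have already been loaded): fit the transformer on the training data only, then reuse the fitted transformer on the test data.

from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Fit the scaler on the training data only ...
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)

# ... and reuse the already-fitted scaler on the test data
x_test_scaled = scaler.transform(x_test)

classifier = SVC()
classifier.fit(x_train_scaled, y_train)
predictions = classifier.predict(x_test_scaled)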

Or you might have to perform cross-validation and grid search (discussed in the next pages), where you would need to repeat the transformations over and over again or construct multiple models with different hyperparameters.

To make this easier, scikit-learn provides a class called Pipeline that groups together a sequence of transforms followed by a final estimator (e.g. your model).

As an example, let’s say we first need to standardise our input and then scale the values to the range [0, 1], before fitting a Support Vector Machine classifier. Rather than doing these steps manually, we put them together into a Pipeline.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# ... assume you have already loaded and split your dataset

pipeline = Pipeline([("standardiser", StandardScaler()), # Step 1: standardise data
                     ("minmaxscaler", MinMaxScaler()),   # Step 2: scale numbers to [0,1]
                     ("classifier", SVC())               # Step 3: classifier
                    ])

pipeline.fit(x_train, y_train)
predictions = pipeline.predict(x_test)
print(accuracy_score(y_test, predictions))
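Because the pipeline behaves like a single estimator, it can also be passed directly to utilities such as cross_val_score (covered in the next pages). A rough sketch, reusing the pipeline defined above:

from sklearn.model_selection import cross_val_score

# The whole pipeline is refitted on each fold, so the scalers only ever
# see that fold's training portion (no manual bookkeeping needed).
scores = cross_val_score(pipeline, x_train, y_train, cv=5)
print(scores.mean())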

Note that all steps in the pipeline except the last must be transformers, i.e. they must implement both the .fit() and .transform() methods. The final estimator only needs to implement the .fit() method.
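If you want to drop your own preprocessing step into a Pipeline, one common pattern is to write a small class following this convention. A minimal, purely illustrative sketch (the class name and the clipping behaviour are made up for this example):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ClipValues(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that clips feature values into a fixed range."""

    def __init__(self, low=-3.0, high=3.0):
        self.low = low
        self.high = high

    def fit(self, X, y=None):
        return self  # nothing to learn here, so fit just returns self

    def transform(self, X):
        return np.clip(X, self.low, self.high)

# It can then be used as an intermediate pipeline step, e.g.
# Pipeline([("clipper", ClipValues()), ("classifier", SVC())])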