Chapter 2: Classification pipeline

Classification

Josiah Wang

For this module, I will only discuss scikit-learn with regard to the classification task.

As I mentioned in the Introduction to Machine Learning course, classification is one of the most common machine learning tasks.

It is also often tackled in a supervised setting, and we will assume this setting for this module.

As a quick reminder, an ML model (hypothesis) takes as input a feature vector X and outputs a predicted label \hat{y}.

In classification, \hat{y} is a categorical or discrete variable (e.g. “yes”, “no”, “cat”, “car”, “class 1”).
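To make the idea of a hypothesis concrete, here is a minimal sketch in plain Python. The function name `predict_label`, the two features, and the hand-written rule are all made up for illustration; a real model would learn its rule from data rather than have it hard-coded.

```python
def predict_label(x):
    """Toy hypothesis: map a feature vector x to a categorical label y_hat.

    Assumption: x = [weight_kg, n_wheels]. Both features are hypothetical.
    """
    weight_kg, n_wheels = x
    # A hand-written decision rule standing in for a learned model
    if n_wheels > 0 or weight_kg > 100:
        return "car"
    return "cat"


y_hat = predict_label([4.2, 0])      # a light object with no wheels
print(y_hat)                         # cat
print(predict_label([1200.0, 4]))    # car
```

Note that the output is one of a fixed set of discrete labels, not a continuous number.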

At training time, a supervised learning model takes as input a sequence of N feature vectors \mathbf{X} = \{X^{(i)}\}_{i=1}^{N} and the correct (gold standard/ground truth) labels \mathbf{y} = \{y^{(i)}\}_{i=1}^{N} for each of these N samples.

The algorithm will then try to fit the model to \mathbf{X} and \mathbf{y}.
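In scikit-learn, this fitting step could look roughly like the sketch below. The choice of `KNeighborsClassifier` with `n_neighbors=1` is only an assumption for illustration (any scikit-learn classifier exposes the same `fit` method), and the tiny dataset is made up.

```python
from sklearn.neighbors import KNeighborsClassifier

# Made-up training set: N = 4 feature vectors with 2 features each
X_train = [
    [1.0, 1.0],
    [1.2, 0.8],
    [4.0, 4.2],
    [3.8, 4.0],
]
# Gold standard labels, one per training sample
y_train = ["cat", "cat", "car", "car"]

# Illustrative model choice: a 1-nearest-neighbour classifier
clf = KNeighborsClassifier(n_neighbors=1)

# Fit the model to X and y
clf.fit(X_train, y_train)

# The fitted model remembers which labels it has seen
print(clf.classes_)
```

After `fit` returns, `clf` holds the trained model and is ready to predict labels for new feature vectors.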

At test time, you take some previously unseen data \mathbf{X^{test}} and predict the output labels \mathbf{\hat{y}^{test}} for it.

You then evaluate your model by comparing \mathbf{\hat{y}^{test}} against the gold standard/ground truth labels \mathbf{y^{test}}.
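As a rough sketch of what this comparison can look like, the snippet below computes accuracy (the fraction of test labels predicted correctly) in plain Python. Accuracy is just one possible metric and is my assumption here; the labels are made up, and the helper `accuracy` is hypothetical. scikit-learn provides `sklearn.metrics.accuracy_score`, which computes the same quantity.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold standard labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up gold standard labels and model predictions for 4 test samples
y_test = ["cat", "car", "cat", "car"]
y_pred_test = ["cat", "car", "car", "car"]

test_accuracy = accuracy(y_test, y_pred_test)
print(test_accuracy)  # 3 of 4 correct, so 0.75
```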

Here is a diagram of the pipeline from the Introduction to Machine Learning course. You have set this as your wallpaper or have it framed up in your room, haven’t you? 🥺

Supervised learning pipeline