Introduction to Scikit-learn
Chapter 2: Classification pipeline
Classification
For this module, I will discuss scikit-learn only with regard to the classification task.
As I mentioned in the Introduction to Machine Learning course, classification is one of the most common machine learning tasks.
It is also often tackled in a supervised setting, which is what we will assume for this module.
As a quick reminder, an ML model (hypothesis) takes as input a feature vector X and outputs a predicted label \hat{y}.
In classification, \hat{y} is a categorical or discrete variable (e.g. “yes”, “no”, “cat”, “car”, “class 1”).
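To make the notation concrete, here is a tiny sketch (the feature values and class name are made up for illustration) of what a single feature vector and a categorical label look like in code:

```python
import numpy as np

# A single feature vector X: one sample described by numeric features
# (the four values here are hypothetical measurements).
X = np.array([5.1, 3.5, 1.4, 0.2])

# The predicted label \hat{y} is categorical/discrete, e.g. a class name.
# scikit-learn classifiers accept and return string labels like this directly.
y_hat = "cat"
```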
At training time, a supervised learning model takes as input a sequence of N feature vectors \mathbf{X} = \{X^{(i)}\}_{i=1}^{N} and the correct (gold standard/ground truth) labels \mathbf{y} = \{y^{(i)}\}_{i=1}^{N} for each of these N samples.
The algorithm will then try to fit the model to \mathbf{X} and \mathbf{y}.
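In scikit-learn, this fitting step is a single fit call. The sketch below uses LogisticRegression and a tiny synthetic dataset purely as placeholders; the course has not committed to a particular classifier or dataset here:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training set: N = 6 samples, each described by 2 features
# (both the data and the choice of classifier are placeholders for illustration).
X_train = np.array([[0.0, 1.0],
                    [1.0, 1.5],
                    [2.0, 0.5],
                    [5.0, 5.5],
                    [6.0, 5.0],
                    [7.0, 6.5]])

# Gold standard / ground truth labels for those N samples (categorical).
y_train = np.array(["no", "no", "no", "yes", "yes", "yes"])

# Fitting the model to X and y is a single call in scikit-learn.
clf = LogisticRegression()
clf.fit(X_train, y_train)
```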
At test time, you will take some previously unseen data \mathbf{X^{test}} and predict the output labels \mathbf{\hat{y}^{test}} for these.
You then evaluate your model by comparing \mathbf{\hat{y}^{test}} against the gold standard/ground truth labels \mathbf{y^{test}}.
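Staying with the toy data and placeholder classifier from the sketch above (the test samples are also made up), test-time prediction and evaluation map onto predict and a metric such as accuracy_score:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Same toy training data as in the sketch above (made up for illustration).
X_train = np.array([[0.0, 1.0], [1.0, 1.5], [2.0, 0.5],
                    [5.0, 5.5], [6.0, 5.0], [7.0, 6.5]])
y_train = np.array(["no", "no", "no", "yes", "yes", "yes"])
clf = LogisticRegression().fit(X_train, y_train)

# Previously unseen test samples and their gold standard labels (also made up).
X_test = np.array([[1.5, 1.0], [6.5, 6.0]])
y_test = np.array(["no", "yes"])

# Predict labels \hat{y}^{test} for the unseen data ...
y_pred = clf.predict(X_test)

# ... and evaluate by comparing the predictions against the ground truth y^{test}.
print(accuracy_score(y_test, y_pred))  # fraction of correctly predicted labels
```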
Here is a diagram of the pipeline from the Introduction to Machine Learning course. You have set this as your wallpaper or have it framed up in your room, haven’t you? 🥺