Introduction to Scikit-learn
Chapter 2: Classification pipeline
Classification
For this module, I will discuss scikit-learn only with regard to the classification task.
As I mentioned in the Introduction to Machine Learning course, classification is one of the most common machine learning tasks.
It is also often tackled in a supervised setting, which is what we will assume for this module.
As a quick reminder, an ML model (hypothesis) takes as input a feature vector X and outputs a predicted label \hat{y}.
In classification, \hat{y} is a categorical or discrete variable (e.g. “yes”, “no”, “cat”, “car”, “class 1”).
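To make the notation concrete, here is a tiny sketch (the feature values and class name are made up for illustration) of what a single feature vector and a categorical label look like in code:

```python
import numpy as np

# A single feature vector X: one sample described by numeric features
# (the four values here are hypothetical measurements).
X = np.array([5.1, 3.5, 1.4, 0.2])

# The predicted label \hat{y} is categorical/discrete, e.g. a class name.
# scikit-learn classifiers accept and return string labels like this directly.
y_hat = "cat"
```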
At training time, a supervised learning model takes as input a sequence of N feature vectors \mathbf{X} = \{X^{(i)}\}_{i=1}^{N} and the correct (gold standard/ground truth) labels \mathbf{y} = \{y^{(i)}\}_{i=1}^{N} for each of these N samples.
The algorithm will then try to fit the model to \mathbf{X} and \mathbf{y}.
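In scikit-learn, this fitting step is a single fit call. The sketch below uses LogisticRegression and a tiny synthetic dataset purely as placeholders; the course has not committed to a particular classifier or dataset here:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training set: N = 6 samples, each described by 2 features
# (both the data and the choice of classifier are placeholders for illustration).
X_train = np.array([[0.0, 1.0],
                    [1.0, 1.5],
                    [2.0, 0.5],
                    [5.0, 5.5],
                    [6.0, 5.0],
                    [7.0, 6.5]])

# Gold standard / ground truth labels for those N samples (categorical).
y_train = np.array(["no", "no", "no", "yes", "yes", "yes"])

# Fitting the model to X and y is a single call in scikit-learn.
clf = LogisticRegression()
clf.fit(X_train, y_train)
```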
At test time, you will take some previously unseen data \mathbf{X^{test}} and predict the output labels \mathbf{\hat{y}^{test}} for these.
You then evaluate your model by comparing \mathbf{\hat{y}^{test}} against the gold standard/ground truth labels \mathbf{y^{test}}.
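Staying with the toy data and placeholder classifier from the sketch above (the test samples are also made up), test-time prediction and evaluation map onto predict and a metric such as accuracy_score:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Same toy training data as in the sketch above (made up for illustration).
X_train = np.array([[0.0, 1.0], [1.0, 1.5], [2.0, 0.5],
                    [5.0, 5.5], [6.0, 5.0], [7.0, 6.5]])
y_train = np.array(["no", "no", "no", "yes", "yes", "yes"])
clf = LogisticRegression().fit(X_train, y_train)

# Previously unseen test samples and their gold standard labels (also made up).
X_test = np.array([[1.5, 1.0], [6.5, 6.0]])
y_test = np.array(["no", "yes"])

# Predict labels \hat{y}^{test} for the unseen data ...
y_pred = clf.predict(X_test)

# ... and evaluate by comparing the predictions against the ground truth y^{test}.
print(accuracy_score(y_test, y_pred))  # fraction of correctly predicted labels
```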
Here is a diagram of the pipeline from the Introduction to Machine Learning course. You have set this as your wallpaper or have it framed up in your room, haven’t you? 🥺