Introduction to Scikit-learn
Chapter 3: Understanding your data
Class distribution?
Last question!
Question 4: How are the categories distributed?
It is also a good idea to check how the categories are distributed, e.g. whether the proportion of instances is biased towards one class over another.
You may have noticed that the elements in y can be 0 (representing “setosa”), 1 (“versicolor”), or 2 (“virginica”). So let us use some of your NumPy prowess to count how many times each of these occurs in y.
>>> import numpy as np
>>> (unique_labels, counts) = np.unique(y, return_counts=True)
>>> print(unique_labels)
[0 1 2]
>>> print(counts)
[50 50 50]
You should find that the categories are evenly distributed (50 samples per category).
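To judge whether a dataset is biased towards one class, it can help to turn the raw counts into proportions. The following is only a minimal sketch: it builds a stand-in y with NumPy that mimics the labels in this lesson (an assumption, so that the snippet runs on its own), and the name proportions is introduced here purely for illustration.

```python
import numpy as np

# Stand-in for the iris labels used in this lesson (assumption: 50 samples
# each of classes 0, 1 and 2, in sorted order).
y = np.repeat([0, 1, 2], 50)

unique_labels, counts = np.unique(y, return_counts=True)

# Fraction of the dataset that belongs to each class.
proportions = counts / counts.sum()

for label, count, proportion in zip(unique_labels, counts, proportions):
    print(f"class {label}: {count} samples ({proportion:.1%})")
```

With your own y from the lesson, each class should come out at about one third of the data. A dataset where one proportion is much larger than the others would be the biased case mentioned earlier.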
If you have printed y, you may also have noticed that this dataset is already sorted by category (the first 50 samples are category 0, the next 50 are category 1, and the last 50 are category 2). Most datasets are randomly shuffled, though!
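If you ever need to shuffle a dataset that is sorted by category like this one, one possible approach is to draw a single random ordering of the sample indices and apply it to both the features and the labels. This is a sketch, not the lesson's own method: X and y below are stand-in arrays (an assumption, so that the snippet runs on its own), and the seed value is arbitrary.

```python
import numpy as np

# Stand-ins for the iris features and labels (assumption: 150 samples with
# 4 features each, labels sorted by class).
X = np.arange(150 * 4, dtype=float).reshape(150, 4)
y = np.repeat([0, 1, 2], 50)

# Draw one random ordering of the sample indices. The seed is arbitrary and
# only makes the result repeatable.
rng = np.random.default_rng(0)
order = rng.permutation(len(y))

# Index both arrays with the same ordering so each row keeps its label.
X_shuffled = X[order]
y_shuffled = y[order]

# Shuffling changes the order of the samples, not the class counts.
print(np.unique(y_shuffled, return_counts=True)[1])
```

Using one shared index array matters here: shuffling X and y separately would break the pairing between each sample and its label.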
Examining raw data
Usually, I would also encourage you to examine the raw data itself (e.g. images of the flowers). This is so that you can gain insights into what might be useful for distinguishing between the categories (colour? size?). Unfortunately, we are given only the pre-processed features, not the raw data. So we can skip this step for our lesson!