Introduction to Scikit-learn > Class distribution? | Python Programming (70053 Autumn Term 2021/2022) | Department of Computing

Introduction to Scikit-learn

Chapter 3: Understanding your data

Class distribution?

Josiah Wang

Last question!

Question 4: How are the categories distributed?

It is also a good idea to check how the categories are distributed, e.g. whether the proportion of instances is biased towards one class over another.

You may have noticed that each element in y is one of 0 (representing “setosa”), 1 (“versicolor”), or 2 (“virginica”). So let us use some of your NumPy prowess to count how many times each of these occurs in y.

>>> import numpy as np
>>> (unique_labels, counts) = np.unique(y, return_counts=True)
>>> print(unique_labels)
[0 1 2]
>>> print(counts)
[50 50 50]

You should find that the categories are evenly distributed (50 samples per category).
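If you would rather see proportions than raw counts, here is a minimal sketch. To keep it runnable on its own, it rebuilds a stand-in y with np.repeat (an assumption that mirrors the 50/50/50 layout described in this lesson); in your own session you would simply use the y you already loaded.

```python
import numpy as np

# Stand-in for the iris labels so the snippet runs on its own:
# 50 zeros, then 50 ones, then 50 twos (the same layout as y in this lesson).
y = np.repeat([0, 1, 2], 50)

unique_labels, counts = np.unique(y, return_counts=True)

# Fraction of the dataset that belongs to each class.
proportions = counts / counts.sum()

for label, count, proportion in zip(unique_labels, counts, proportions):
    print(f"class {label}: {count} samples ({proportion:.1%})")
```

Proportions are handy when you compare datasets of different sizes, where raw counts alone are harder to read at a glance.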

If you have printed y, you may also have noticed that this dataset is already sorted by category (the first 50 are category 0, the next 50 are category 1, and the last 50 are category 2). Most datasets are randomly shuffled, though!
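The ordering matters if you ever slice the data naively: taking the first 100 rows of a sorted dataset would leave out category 2 entirely. One possible way to shuffle the features and labels together is sketched below. The arrays X and y here are stand-ins built for illustration (an assumption, not the real iris values), and the seed 0 is an arbitrary choice.

```python
import numpy as np

# Stand-in data so the snippet runs on its own (not the real iris values):
# 150 rows of 4 features, with labels sorted by category as in this lesson.
X = np.arange(150 * 4).reshape(150, 4)
y = np.repeat([0, 1, 2], 50)

# A seeded generator makes the shuffle reproducible; the seed is arbitrary.
rng = np.random.default_rng(0)

# Draw one random ordering of the row indices and apply it to both arrays,
# so that every row of X stays paired with its own label.
permutation = rng.permutation(len(y))
X_shuffled = X[permutation]
y_shuffled = y[permutation]

# The class counts are unchanged; only the order differs.
print(np.unique(y_shuffled, return_counts=True))
```

Indexing both arrays with the same permutation is the important part: shuffling X and y separately would break the link between each sample and its label.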

Examining raw data

Usually, I would also encourage you to examine the raw data itself (e.g. images of the flowers), so that you can gain insights into what might help distinguish between the categories (colour? size?). Unfortunately, we are not given the raw data here, only pre-processed features, so we can skip this step for this lesson!