Understanding your data

I will use the Iris dataset in our discussion. This is a classic dataset from 1936, often used for teaching machine learning techniques.

Conveniently, scikit-learn provides a function to access this dataset without having to download it separately. So let us just load the dataset!

from sklearn.datasets import load_iris
dataset = load_iris()

The first step, before you even start designing any machine learning system, is to examine and understand your data. This is very important, yet often overlooked even by seasoned researchers.

Let us do that now!

First, for our own convenience, let us name the feature vectors as x, and the target labels as y.

x = dataset.data
y = dataset.target
print(x)
print(y)
print(type(x))
print(type(y))

You should have noticed that x and y are both np.ndarrays.

If you want to examine x and y side-by-side, use Python’s zip() function.

for row in zip(x, y):
    print(row)
## (array([5.1, 3.5, 1.4, 0.2]), 0)
## (array([4.9, 3. , 1.4, 0.2]), 0)
## (array([4.7, 3.2, 1.3, 0.2]), 0)
## ...

We usually keep x and y separate in machine learning implementations because it makes NumPy array computations easier.
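For instance, keeping all the feature vectors together in one x array lets you compute per-feature statistics in a single vectorized call. A quick sketch:

```python
import numpy as np
from sklearn.datasets import load_iris

dataset = load_iris()
x = dataset.data

# Mean and standard deviation of each of the 4 features,
# computed over all 150 rows at once (axis=0 collapses the rows)
print(x.mean(axis=0))
print(x.std(axis=0))
```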

Extra: if you would like to use pandas and your version of scikit-learn is >= 0.23, you can pass as_frame=True to get the features as a pandas DataFrame and the labels as a pandas Series.

df = load_iris(as_frame=True)
print(df.data)    # a pandas DataFrame
print(df.target)  # a pandas Series
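When loaded with as_frame=True, the returned object also exposes a .frame attribute that combines the features and a target column in a single DataFrame, which is handy for quick inspection. A small sketch:

```python
from sklearn.datasets import load_iris

dataset = load_iris(as_frame=True)
full = dataset.frame  # features plus a 'target' column in one DataFrame

print(full.head())      # first few rows
print(full.describe())  # summary statistics per column
```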

Question 1: How many instances/samples are there?

We assume that x and y are of the same length, so you can check either.

print(len(x))
print(len(y))

To enforce that they are the same length, we can use Python’s assert keyword. This is a simple way of testing your code. assert expression will result in an AssertionError if expression is False.

For example, you will get an error if you have assert 3==5. Otherwise, nothing will happen.

# Make sure that x and y are of the same length. Throws an AssertionError otherwise.
assert len(x) == len(y)

Question 2: How many features does each instance have?

It will also be useful to know how many features/attributes our dataset has. Since x is a NumPy array, let us use its .shape attribute to find out!

print(x.shape)  ## (150, 4)
print(y.shape)  ## (150,)

This actually answers both the first and second questions. Scikit-learn models expect the input x to be of size \(N \times K\), for \(N\) instances and \(K\) features, so here \(N = 150\) and \(K = 4\).

We also know that y has 150 labels (one label per instance). Sounds right!
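Since .shape is just a tuple, you can also unpack it directly into named variables, which keeps later code readable. A small convenience sketch:

```python
from sklearn.datasets import load_iris

x = load_iris().data

# Unpack the (N, K) shape tuple into two named variables
n_samples, n_features = x.shape
print(n_samples)   # 150
print(n_features)  # 4
```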

Question 3: How many (and what) classes does the dataset have?

The next thing to find out is - how many classes does this dataset have? And what are these classes?

Luckily, scikit-learn also has that covered!

classes = dataset.target_names
print(classes)
print(len(classes))

If done correctly, you should see that the Iris dataset comprises three classes: “setosa”, “versicolor”, and “virginica”.
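Since the entries of y are integer indices into target_names, you can translate the whole label array into class names at once using NumPy's "fancy indexing". A quick sketch:

```python
from sklearn.datasets import load_iris

dataset = load_iris()
y = dataset.target
classes = dataset.target_names

# Index the array of names with the array of integer labels:
# every 0 becomes 'setosa', every 1 'versicolor', every 2 'virginica'
print(classes[y][:5])  # first five labels, as strings
```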

[Image: Iris setosa. CC BY-SA 3.0, Link]
[Image: Iris versicolor. By D. Gordon E. Robertson - Own work, CC BY-SA 3.0, Link]
[Image: Iris virginica. By Eric Hunt - Own work, CC BY-SA 4.0, Link]

Question 4: How are the classes distributed?

It is also a good idea to check how the classes are distributed, e.g. whether the distribution of instances is biased towards one class over another.

You may have noticed that the elements in y can be 0 (representing “setosa”), 1 (representing “versicolor”), or 2 (“virginica”). So let us use some of your NumPy prowess to count how many times each of these occurs in y.

import numpy as np

(unique_labels, counts) = np.unique(y, return_counts=True)
print(unique_labels)
print(counts)

You should figure out that the classes are evenly distributed (50 samples per class).
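To make the counts more readable, you can pair each count with its class name using zip(), just as we did for x and y earlier. A quick sketch:

```python
import numpy as np
from sklearn.datasets import load_iris

dataset = load_iris()
y = dataset.target

(unique_labels, counts) = np.unique(y, return_counts=True)

# Look up the name of each unique label, then print it next to its count
for name, count in zip(dataset.target_names[unique_labels], counts):
    print(name, count)
```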

If you have printed y, you may also have noticed that this dataset has already been sorted by class (the first 50 instances are class 0, the next 50 are class 1, and the last 50 are class 2). Most datasets, though, come randomly shuffled!
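If you ever need to shuffle a sorted dataset yourself, be careful to shuffle x and y with the same permutation so that each instance keeps its label. A minimal sketch using NumPy (scikit-learn's sklearn.utils.shuffle does the same in one call):

```python
import numpy as np
from sklearn.datasets import load_iris

dataset = load_iris()
x, y = dataset.data, dataset.target

rng = np.random.default_rng(seed=0)  # fixed seed for reproducibility
perm = rng.permutation(len(x))       # a random permutation of the row indices

# Apply the SAME permutation to both arrays so rows and labels stay paired
x_shuffled = x[perm]
y_shuffled = y[perm]
print(y_shuffled[:10])
```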

Usually, I would also encourage you to examine the raw data itself (e.g. images of the flowers), so that you can gain insights into what might help distinguish between the classes (colour? size?). Unfortunately, we are not given the raw images here, only pre-processed features, so we can skip this step for our module!