Chapter 4: Understanding your features

Plotting histograms

Josiah Wang

While the statistics you computed earlier (min, max, median, etc.) are useful, you can often gain more insight by visualising the feature values themselves.

Suppose we want to inspect the values of petal length (since it seems like a good feature) separately for each of the three classes. We can plot a histogram of the petal length distribution for each class.

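import matplotlib.pyplot as plt

# One panel per class, side by side
fig, ax = plt.subplots(1, 3)

# x[y == c, 2] selects petal length (column 2) for the samples of class c
ax[0].hist(x[y == 0, 2], color='r')
ax[0].set(title=classes[0])

ax[1].hist(x[y == 1, 2], color='b')
ax[1].set(title=classes[1])

ax[2].hist(x[y == 2, 2], color='g')
ax[2].set(title=classes[2])

plt.show()
plt.close()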

Distribution of petal length by class
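One thing to keep in mind when comparing the three panels: each call to hist() chooses its own bins from the data it is given, and plt.subplots() does not share the x-axis by default, so the panels are not drawn on the same scale. A minimal sketch of one way to get comparable bins is to compute a single set of bin edges over all samples. The arrays below are hypothetical toy values standing in for x[:, 2] and y, not the real dataset.

```python
import numpy as np

# Hypothetical toy values standing in for x[:, 2] (petal length) and y (class labels)
petal_length = np.array([1.4, 1.5, 1.3, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9])
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

# One set of 10 equal-width bins spanning all samples, shared by every class
edges = np.histogram_bin_edges(petal_length, bins=10)

# Number of samples of each class that fall into each shared bin
counts = {
    int(c): np.histogram(petal_length[labels == c], bins=edges)[0]
    for c in np.unique(labels)
}
```

To use this in the plot, you could pass bins=edges to each ax[i].hist(...) call and create the figure with plt.subplots(1, 3, sharex=True), so that all three panels cover the same range.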

You can see that "setosa" is clearly distinguishable from the other two classes by petal length alone. "versicolor" and "virginica", however, overlap where the petal length is roughly between 4.5 and 5.1 cm, so petal length on its own cannot reliably separate these two classes in that range.
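If you would rather check the overlap numerically than read it off the plot, here is a minimal sketch. The helper names class_ranges and count_in_overlap are hypothetical (they are not part of NumPy or of this chapter), and the arrays are toy values standing in for x[:, 2] and y.

```python
import numpy as np

def class_ranges(values, labels):
    """Return {class: (min, max)} of `values` for each distinct label."""
    return {
        int(c): (float(values[labels == c].min()), float(values[labels == c].max()))
        for c in np.unique(labels)
    }

def count_in_overlap(values, labels, a, b):
    """Count samples of classes a and b lying in the value range both classes share."""
    ranges = class_ranges(values, labels)
    lo = max(ranges[a][0], ranges[b][0])
    hi = min(ranges[a][1], ranges[b][1])
    if lo > hi:
        return 0  # the two ranges do not overlap at all
    mask = np.isin(labels, [a, b]) & (values >= lo) & (values <= hi)
    return int(mask.sum())

# Hypothetical toy values, not the real dataset
petal_length = np.array([1.3, 1.4, 1.5, 4.0, 4.7, 5.1, 4.5, 5.6, 6.0])
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

ranges = class_ranges(petal_length, labels)
n_ambiguous = count_in_overlap(petal_length, labels, 1, 2)
n_separable = count_in_overlap(petal_length, labels, 0, 1)
```

With the chapter's arrays, the equivalent call would be count_in_overlap(x[:, 2], y, 1, 2). The larger the count relative to the size of the two classes, the less you can rely on this feature alone to tell them apart.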

Using pandas to generate histograms

You can also plot histograms with the DataFrame’s .hist() method in pandas. It can plot histograms for multiple columns in one go.

import pandas as pd

df = pd.DataFrame(x)
df.columns = feature_names

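# Let pandas create the grid of subplots itself, one per column.
# Passing a single Axes via ax= for several columns makes pandas
# clear the figure and emit a warning.
df.hist(figsize=(8, 8))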

plt.show()

Histograms for all four features

Of course, you can always select only a subset of columns/rows to visualise.

# Rows of class 0 only, restricted to the two petal features
df.loc[y == 0, feature_names[2:4]].hist(figsize=(8, 8))
plt.show()

Histograms for subset of rows and features
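To make the selection step itself explicit, here is a small sketch that picks rows by class and columns by name with .loc, without any plotting. The four rows are toy data standing in for x and y, and the column names are assumed to match the feature_names used earlier in the chapter.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for x, y and feature_names
feature_names = ["sepal length (cm)", "sepal width (cm)",
                 "petal length (cm)", "petal width (cm)"]
x = np.array([[5.1, 3.5, 1.4, 0.2],
              [4.9, 3.0, 1.4, 0.2],
              [7.0, 3.2, 4.7, 1.4],
              [6.3, 3.3, 6.0, 2.5]])
y = np.array([0, 0, 1, 2])

df = pd.DataFrame(x, columns=feature_names)

# Boolean mask picks the rows of class 0; the list of names picks the petal columns
subset = df.loc[y == 0, feature_names[2:4]]
```

Calling subset.hist() would then draw one histogram per selected column, so you can inspect the selection first and plot it only once it contains what you expect.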