This is an archived version of the course. Please find the latest version of the course on the main webpage.

Chapter 5: Preprocessing your features

Pre-processing your data

Josiah Wang

OK, I think you have had enough of examining your data and features!

Now, let us get back to exploring scikit-learn itself.

It is sometimes necessary to do some pre-processing of data before running your training algorithm.

This is where scikit-learn starts to make your life easy! The sklearn.preprocessing package provides a bunch of utilities to modify your feature vectors into a more suitable representation.

For example, in the Introduction to Machine Learning module, I mentioned that standardisation is often applied to rescale each feature to have zero mean and unit variance. This brings the values of the different features into roughly the same range. You can use the scale() function in sklearn.preprocessing for that!

import numpy as np
from sklearn import preprocessing

x = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])

x_scaled = preprocessing.scale(x)

print(x_scaled)
## [[ 0.         -1.22474487  1.33630621]
##  [ 1.22474487  0.         -0.26726124]
##  [-1.22474487  1.22474487 -1.06904497]]
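
If you are curious what scale() is doing, here is a minimal sketch that repeats the same standardisation by hand with NumPy and compares the two results. It assumes the default behaviour of scale(), which works column by column (one column per feature) and uses the population standard deviation. The name x_manual is just for illustration.

```python
import numpy as np
from sklearn import preprocessing

x = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])

# Standardise each column (feature) by hand:
# subtract the column mean, then divide by the column standard deviation
x_manual = (x - x.mean(axis=0)) / x.std(axis=0)

x_scaled = preprocessing.scale(x)

# The manual version should match scikit-learn's result
print(np.allclose(x_manual, x_scaled))
## True

# Each column now has (approximately) zero mean and unit variance
print(np.allclose(x_scaled.mean(axis=0), 0.0))
## True
print(np.allclose(x_scaled.std(axis=0), 1.0))
## True
```

Note that the statistics are computed per column, not over the whole array, so each feature is rescaled independently of the others.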

Or suppose you want to rescale each feature so that its values lie in the range 0 to 1. There is a scikit-learn class called MinMaxScaler for that!

min_max_scaler = preprocessing.MinMaxScaler()
x_minmax = min_max_scaler.fit_transform(x)
print(x_minmax)
## [[0.5        0.         1.        ]
##  [1.         0.5        0.33333333]
##  [0.         1.         0.        ]]
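
Because MinMaxScaler is a class rather than a plain function, the fitted object remembers the minimum and maximum of each feature, so you can apply the same scaling to other data later with transform(). The sketch below first reproduces the scaling by hand, then applies the fitted scaler to a new sample. The sample x_new is made up purely for illustration, and x_manual is just an illustrative name.

```python
import numpy as np
from sklearn import preprocessing

x = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])

min_max_scaler = preprocessing.MinMaxScaler()
x_minmax = min_max_scaler.fit_transform(x)

# The same rescaling by hand: (value - column min) / (column max - column min)
x_manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))
print(np.allclose(x_manual, x_minmax))
## True

# Reuse the already fitted scaler on a new (made-up) sample.
# transform() uses the min and max learnt from x; it does not refit.
x_new = np.array([[-3., -1.,  4.]])
x_new_minmax = min_max_scaler.transform(x_new)
print(x_new_minmax)
```

Since the new sample contains values outside the range seen in x, some of its scaled values fall outside 0 to 1. That is expected: the scaler only guarantees the 0 to 1 range for the data it was fitted on.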

See the official documentation for other preprocessing functions or classes.