Pre-processing your data

OK, now let us get back to exploring scikit-learn itself.

It is sometimes necessary to do some pre-processing of data before running your training algorithm.

The sklearn.preprocessing package provides a bunch of utilities to modify your feature vectors into a more suitable representation.

For example, in the Introduction to Machine Learning course, I mentioned that standardisation is often applied to rescale the data to have zero mean and unit variance. This brings the values of different features into roughly the same range.

import numpy as np
from sklearn import preprocessing

x = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])

x_scaled = preprocessing.scale(x)

print(x_scaled)
## [[ 0.         -1.22474487  1.33630621]
##  [ 1.22474487  0.         -0.26726124]
##  [-1.22474487  1.22474487 -1.06904497]]
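You can check for yourself that each feature (column) of the scaled array really does have zero mean and unit variance:

```python
import numpy as np
from sklearn import preprocessing

x = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])
x_scaled = preprocessing.scale(x)

# Each column should now have mean 0 and standard deviation 1
# (preprocessing.scale standardises column-wise by default).
print(x_scaled.mean(axis=0))  ## [0. 0. 0.]
print(x_scaled.std(axis=0))   ## [1. 1. 1.]
```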

Or suppose you want to scale your values so that they are in the range of 0 to 1. There is a scikit-learn class for that!

min_max_scaler = preprocessing.MinMaxScaler()
x_minmax = min_max_scaler.fit_transform(x)
print(x_minmax)
## [[0.5        0.         1.        ]
##  [1.         0.5        0.33333333]
##  [0.         1.         0.        ]]
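Because MinMaxScaler is a class rather than a plain function, it remembers the per-feature minimum and maximum it saw during fitting. This means you can apply exactly the same scaling to new data with transform. As a small sketch (the new sample x_new is made up for illustration):

```python
import numpy as np
from sklearn import preprocessing

x = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])

min_max_scaler = preprocessing.MinMaxScaler()
min_max_scaler.fit(x)  # learns each feature's min and max from x

# New data is rescaled using the ranges learned from x, so values
# outside those ranges can fall outside [0, 1].
x_new = np.array([[3., 0., 1.]])
print(min_max_scaler.transform(x_new))
## [[1.5        0.5        0.66666667]]
```

This fit/transform pattern is the usual way to avoid "leaking" statistics from the test set: fit the scaler on training data only, then transform both training and test data with it.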

See the official documentation for other preprocessing functions or classes.

Splitting your dataset

While some datasets provide pre-split training and test sets, others do not.

The Iris dataset, for example, has not been pre-split, so you will have to split it yourself. Remember, you need a disjoint dataset split for testing!

The good news is: there is a scikit-learn function to help you do just that!

Let us split the Iris dataset such that we have 80% for training and 20% for testing.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

dataset = load_iris()
x = dataset.data
y = dataset.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=42)
print(len(x_train), len(y_train))  ## 120 120
print(len(x_test), len(y_test))    ## 30 30

The keyword argument test_size specifies the proportion of samples to reserve as the test set. random_state allows you to define a seed number for reproducibility.
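For classification datasets like Iris, you often also want each class to be represented in the same proportions in both splits. train_test_split supports this through its stratify keyword argument:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

dataset = load_iris()
x = dataset.data
y = dataset.target

# stratify=y keeps the class proportions identical in both splits.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.20, random_state=42, stratify=y)

# Iris has 50 samples per class, so a stratified 80/20 split gives
# exactly 40 training and 10 test samples per class.
print(np.bincount(y_train))  ## [40 40 40]
print(np.bincount(y_test))   ## [10 10 10]
```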