This is an archived version of the course. Please find the latest version of the course on the main webpage.

Chapter 5: Preprocessing your features

Splitting your dataset

Josiah Wang

While some datasets provide pre-split training and test datasets, others do not.

The Iris dataset, for example, has not been pre-split, so you will have to split it yourself. Remember that the test set must be disjoint from the training set!

The good news is: there is a scikit-learn function to help you do just that!

Let us split the Iris dataset such that we have 80% for training and 20% for testing.

>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> dataset = load_iris()
>>> x = dataset.data
>>> y = dataset.target
>>> x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20,
...                                                     random_state=42)
>>> print(len(x_train), len(y_train))
120 120
>>> print(len(x_test), len(y_test))
30 30
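
As a sanity check on the split above, here is a minimal sketch of one way to confirm that the training and test sets are disjoint. It assumes you pass an extra array of row indices (the hypothetical name indices below is ours, not part of the course code) to train_test_split, which splits every array it is given in the same way. Comparing indices rather than feature values avoids false alarms, since a dataset can contain identical rows.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

x, y = load_iris(return_X_y=True)

# Hypothetical helper array: one index per sample, split alongside x and y.
indices = np.arange(len(x))

x_train, x_test, y_train, y_test, idx_train, idx_test = train_test_split(
    x, y, indices, test_size=0.20, random_state=42)

# No sample index should appear in both the training and the test set.
overlap = set(idx_train) & set(idx_test)
print(len(overlap))                       # 0
# Together, the two sets should account for every sample exactly once.
print(len(idx_train) + len(idx_test))     # 150
```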

The keyword argument test_size specifies the proportion of samples to reserve as the test set. Here, 0.20 reserves 20% of the 150 samples (30 samples) for testing.
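
To illustrate how test_size is interpreted, the sketch below compares a float value with an integer value. This goes slightly beyond the example above: as far as we know, scikit-learn reads a float as a proportion of the dataset and an integer as an absolute number of test samples, so both calls should reserve 30 samples here.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

x, y = load_iris(return_X_y=True)

# A float test_size is read as a proportion of the 150 samples.
_, x_test_frac, _, _ = train_test_split(x, y, test_size=0.20, random_state=42)

# An integer test_size is read as an absolute number of test samples.
_, x_test_abs, _, _ = train_test_split(x, y, test_size=30, random_state=42)

print(len(x_test_frac), len(x_test_abs))  # 30 30
```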

The random_state argument sets the seed for the random shuffling that is applied before the split, so that you get the same split every time you run the code.
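
The effect of random_state can be checked directly. The following is a minimal sketch (the variable names are ours) that splits the data twice with the same seed and once with a different seed, then compares the resulting arrays. The two splits with the same seed should be identical, while a different seed will generally give a different split.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

x, y = load_iris(return_X_y=True)

# Two splits with the same seed, and one with a different seed.
split_a = train_test_split(x, y, test_size=0.20, random_state=42)
split_b = train_test_split(x, y, test_size=0.20, random_state=42)
split_c = train_test_split(x, y, test_size=0.20, random_state=0)

# Compare x_train, x_test, y_train, y_test pairwise between splits.
same_seed_matches = all(
    np.array_equal(a, b) for a, b in zip(split_a, split_b))
different_seed_matches = all(
    np.array_equal(a, c) for a, c in zip(split_a, split_c))

print(same_seed_matches)       # True: the same seed reproduces the split
print(different_seed_matches)  # expected to differ from the line above
```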