Introduction to Scikit-learn
Chapter 5: Preprocessing your features
Splitting your dataset
While some datasets come pre-split into training and test sets, others do not.
The Iris dataset, for example, is not pre-split, so you will have to split it yourself. Remember that the test set must be disjoint from the training set!
The good news is: there is a scikit-learn function to help you do just that!
Let us split the Iris dataset such that we have 80% for training and 20% for testing.
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> dataset = load_iris()
>>> x = dataset.data
>>> y = dataset.target
>>> x_train, x_test, y_train, y_test = train_test_split(
...     x, y, test_size=0.20, random_state=42)
>>> print(len(x_train), len(y_train))
120 120
>>> print(len(x_test), len(y_test))
30 30
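The lengths above show the 80/20 proportions, but not that the two parts are disjoint. One way to check this, sketched here under the assumption that splitting an array of row indices is an acceptable stand-in for splitting the data itself (the names indices, train_idx, test_idx and overlap are illustrative only), is to look for indices that land in both parts:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

dataset = load_iris()

# One integer index per sample (illustrative helper array, not part of the dataset).
indices = np.arange(len(dataset.data))

# Split the indices with the same settings as the split above.
train_idx, test_idx = train_test_split(indices, test_size=0.20, random_state=42)

# Indices that occur in both parts; an empty result means the split is disjoint.
overlap = np.intersect1d(train_idx, test_idx)
print(len(overlap))
# Every sample ends up in exactly one of the two parts.
print(len(train_idx) + len(test_idx) == len(indices))
```

This should print 0 and True: no sample is shared between the parts, and together they cover the whole dataset.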
The keyword argument test_size specifies the proportion of samples to reserve as the test set.
The random_state argument allows you to define a seed number for reproducibility.
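To make the effect of these two arguments concrete, here is a minimal sketch. The variable names with the _a, _b and _c suffixes are made up for this illustration. It splits the data twice with the same seed and compares the results, then passes test_size as an integer, which scikit-learn interprets as an absolute number of test samples rather than a proportion.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

dataset = load_iris()
x, y = dataset.data, dataset.target

# Two calls with the same seed should produce the same split.
x_train_a, x_test_a, y_train_a, y_test_a = train_test_split(
    x, y, test_size=0.20, random_state=42)
x_train_b, x_test_b, y_train_b, y_test_b = train_test_split(
    x, y, test_size=0.20, random_state=42)
same_split = (np.array_equal(x_test_a, x_test_b)
              and np.array_equal(y_test_a, y_test_b))
print(same_split)

# An integer test_size is read as a sample count instead of a proportion.
x_train_c, x_test_c, y_train_c, y_test_c = train_test_split(
    x, y, test_size=30, random_state=42)
print(len(x_train_c), len(x_test_c))
```

If random_state is left out, the split is shuffled differently on each run, so results that depend on it may not be repeatable.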