
Cross-validation

When you train your models, there is a risk of overfitting to your test dataset. For example, you might tweak your parameters until you achieve optimal performance on your test set.

Is that really a good thing? Just because your model achieves superior performance on one test set does not mean that it will perform as well on another. You might simply have hit the jackpot with your tweaking on this one test set, while the model performs miserably on others. It is better to have a model that is robust enough to generalise to new, unseen data (remember the bias-variance tradeoff!).

You will learn more about this topic in Week 4 of the Introduction to Machine Learning course when discussing Machine Learning Evaluation, where you will be presented with several ways to deal with this issue.

One of these methods is called \(K\)-fold cross-validation. To put it simply:

  • You divide your dataset into \(K\) non-overlapping subsets.
  • You will then perform \(K\) separate experiments:
    • In each experiment, you keep one of the \(K\) subsets as your test data and use the remaining \(K-1\) subsets for training.
  • You will end up with \(K\) scores, one for each of your \(K\) subsets.

In summary, you are essentially testing on \(K\) different test datasets to ensure that your model can generalise well.
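
To make this concrete, here is a minimal sketch of the \(K\)-fold loop written by hand, assuming a small toy dataset and a DecisionTreeClassifier as a stand-in for whatever model you are evaluating:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Toy data as a stand-in for your own x_train / y_train.
x = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []

for train_idx, test_idx in kfold.split(x):
    model = DecisionTreeClassifier()
    # Train on K-1 folds, then score on the held-out fold.
    model.fit(x[train_idx], y[train_idx])
    scores.append(model.score(x[test_idx], y[test_idx]))

print(scores)  # K accuracy scores, one per fold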

As you can guess, scikit-learn gives you a function to perform cross-validation without you having to implement it yourself. Imagine having to figure out how to split your dataset evenly but randomly, making sure the subsets do not overlap, and picking the correct subset inside multiple nested loops! Real fun! (I actually did this myself back when scikit-learn did not exist yet.)

Let’s say we want to perform 5-fold cross-validation with our pipeline from the previous page.

from sklearn.model_selection import cross_validate

results_dict = cross_validate(pipeline, x_train, y_train, cv=5)

print(results_dict.keys())  # dict_keys(['fit_time', 'score_time', 'test_score'])
print(results_dict["test_score"])  # [0.95833333 1.         0.875      1.         0.95833333]
print(results_dict["test_score"].mean())  # 0.9583333333333334

The first argument does not have to be a pipeline; it can also be any classifier (like our knn_classifier or dt_classifier from earlier), as shown in the sketch below.
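
For instance, a minimal sketch assuming knn_classifier is a \(k\)-NN classifier like the one built earlier in the course:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_validate

# Assumed stand-in for the knn_classifier built earlier.
knn_classifier = KNeighborsClassifier(n_neighbors=5)

results_dict = cross_validate(knn_classifier, x_train, y_train, cv=5)
print(results_dict["test_score"].mean())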

results_dict is a dictionary containing some timing statistics (fit_time and score_time) and the test score for each “fold”.
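
Two optional arguments of cross_validate are worth knowing about: scoring lets you choose the metric, and return_train_score=True adds the per-fold training scores, which helps you spot overfitting (a large gap between training and test scores). A quick sketch:

results_dict = cross_validate(pipeline, x_train, y_train, cv=5,
                              scoring="accuracy", return_train_score=True)

print(results_dict.keys())
# dict_keys(['fit_time', 'score_time', 'test_score', 'train_score'])
print(results_dict["test_score"].mean(), results_dict["test_score"].std())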