Introduction to Scikit-learn
Chapter 5: Preprocessing your features
Pre-processing your data
Ok, I think you've had enough of examining your data and features!
Now, let's get back to exploring scikit-learn itself.
It is sometimes necessary to do some pre-processing of data before running your training algorithm.
This is where scikit-learn starts to make your life easy! The sklearn.preprocessing
package provides a bunch of utilities to modify your feature vectors into a more suitable representation.
For example, in the Introduction to Machine Learning module, I mentioned that standardisation is often applied to rescale the data to have zero mean and unit variance. This puts the values across features within about the same range. You can use the scale() function in sklearn.preprocessing for that!
import numpy as np
from sklearn import preprocessing

x = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])
x_scaled = preprocessing.scale(x)
print(x_scaled)
## [[ 0. -1.22474487 1.33630621]
## [ 1.22474487 0. -0.26726124]
## [-1.22474487 1.22474487 -1.06904497]]
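A quick way to convince yourself that scale() did what it claims is to check the column means and standard deviations afterwards. A small sketch, reusing the same x as above:

```python
import numpy as np
from sklearn import preprocessing

x = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])
x_scaled = preprocessing.scale(x)

# Each column of the scaled array should now have (approximately)
# zero mean and unit variance.
print(x_scaled.mean(axis=0))  # close to [0. 0. 0.]
print(x_scaled.std(axis=0))   # close to [1. 1. 1.]
```

Note that scale() works column by column: each feature is centred and rescaled independently of the others.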
Or suppose you want to scale your values so that they are in the range 0 to 1. There is a scikit-learn class called MinMaxScaler for that!
min_max_scaler = preprocessing.MinMaxScaler()
x_minmax = min_max_scaler.fit_transform(x)
print(x_minmax)
## [[0.5 0. 1. ]
## [1. 0.5 0.33333333]
## [0. 1. 0. ]]
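One practical point worth knowing: because MinMaxScaler is a class rather than a plain function, fit_transform() remembers the per-column minimum and maximum it saw, so you can later apply the same scaling to new data with transform(). A small sketch (the new_points array here is just made up for illustration):

```python
import numpy as np
from sklearn import preprocessing

x = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])

min_max_scaler = preprocessing.MinMaxScaler()
x_minmax = min_max_scaler.fit_transform(x)  # learns per-column min and max from x

# A hypothetical new sample, scaled using the ranges learned from x above,
# not its own min and max.
new_points = np.array([[1., 0., 0.5]])
print(min_max_scaler.transform(new_points))  # [[0.5 0.5 0.5]]
```

This fit-then-transform pattern matters in practice: you fit the scaler on your training data and reuse it unchanged on test data, so both are scaled consistently.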
See the official documentation for other preprocessing functions or classes.