Introduction to Scikit-learn > Understanding your data | Python Programming (70053 Autumn Term 2021/2022) | Department of Computing

Introduction to Scikit-learn

Chapter 3: Understanding your data

Josiah Wang

Before you even start designing a machine learning system, you should first examine and understand your data. This step is very important, yet it is often overlooked, even by seasoned researchers.

Let us do that now on the Iris dataset that you loaded earlier!

First, for our own convenience, let us name the feature vectors x and the target labels y. It is also quite common to use a capital X, since the features form a matrix, but we will stick with a lowercase x.

>>> x = dataset.data
>>> y = dataset.target
>>> print(type(x))
<class 'numpy.ndarray'>
>>> print(type(y))
<class 'numpy.ndarray'>

Notice that x and y are both NumPy arrays (np.ndarray).
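
Since both are NumPy arrays, a quick way to get a first feel for the data is to look at their shapes, dtypes and class balance. The following is only a minimal sketch: it reloads the dataset with load_iris so that it runs on its own, and the names classes and counts are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

dataset = load_iris()
x = dataset.data
y = dataset.target

# One row per flower, one column per feature
print(x.shape)   # (150, 4)
print(y.shape)   # (150,)
print(x.dtype)

# Human-readable names stored alongside the arrays
print(dataset.feature_names)
print(dataset.target_names)

# How many instances of each class label are there?
classes, counts = np.unique(y, return_counts=True)
print(dict(zip(classes.tolist(), counts.tolist())))   # {0: 50, 1: 50, 2: 50}
```

The three classes are perfectly balanced here, which is worth knowing before you choose an evaluation metric.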

If you want to examine x and y side-by-side, use Python’s zip() function.

>>> for row in zip(x, y):
...     print(row)
...
(array([5.1, 3.5, 1.4, 0.2]), 0)
(array([4.9, 3. , 1.4, 0.2]), 0)
(array([4.7, 3.2, 1.3, 0.2]), 0)
## ...

We usually keep x and y separate in machine learning implementations because this makes NumPy array computations easier.
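
As one illustration of why separate arrays are convenient, y can be used to build a boolean mask that selects rows of x. The sketch below computes per-class feature means this way. The name class_means is hypothetical, and the dataset is reloaded so that the snippet is self-contained.

```python
import numpy as np
from sklearn.datasets import load_iris

dataset = load_iris()
x, y = dataset.data, dataset.target

# y == label gives a boolean array with one entry per row of x,
# so x[y == label] keeps only the rows belonging to that class.
class_means = np.array([x[y == label].mean(axis=0) for label in np.unique(y)])

# One row of means per class, one column per feature
for name, means in zip(dataset.target_names, class_means):
    print(name, means.round(2))
```

If the labels were stored as an extra column inside x, every such computation would first have to slice that column off again.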

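If you would like to use pandas and your scikit-learn version is >= 0.23, you can also load the dataset with the features as a pandas DataFrame and the targets as a pandas Series. Note that load_iris(as_frame=True) itself still returns a dictionary-like Bunch object, even though it is named df here; the DataFrame is its data attribute.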

>>> df = load_iris(as_frame=True)
>>> print(df.data)
     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                  5.1               3.5                1.4               0.2
1                  4.9               3.0                1.4               0.2
2                  4.7               3.2                1.3               0.2
3                  4.6               3.1                1.5               0.2
4                  5.0               3.6                1.4               0.2
..                 ...               ...                ...               ...
145                6.7               3.0                5.2               2.3
146                6.3               2.5                5.0               1.9
147                6.5               3.0                5.2               2.0
148                6.2               3.4                5.4               2.3
149                5.9               3.0                5.1               1.8

[150 rows x 4 columns]
>>> print(df["target"])  # df.target works too
0      0
1      0
2      0
3      0
4      0
      ..
145    2
146    2
147    2
148    2
149    2
Name: target, Length: 150, dtype: int32
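
With the pandas version, summary statistics are one call away. The sketch below assumes pandas is installed and scikit-learn is >= 0.23. It uses the frame attribute of the returned object, which combines the features and the target in a single DataFrame. The names iris, frame and per_class are chosen here for illustration.

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
frame = iris.frame   # 4 feature columns plus a "target" column

# Count, mean, std, min, quartiles and max for every column
print(frame.describe())

# Mean of each feature, grouped by class label
per_class = frame.groupby("target").mean()
print(per_class)
```

Comparing the per-class means gives an early hint about which features are likely to separate the classes well.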