CO558: Python Programming | Department of Computing

Let us now discuss using DataFrame with a practical example, using the IMDB movie dataset.

Assuming you have downloaded IMDB-Movie-Data.csv to your current directory, let us read the CSV file into a DataFrame, using the “Rank” column as the index.

import pandas as pd  # needed for pd.read_csv
df = pd.read_csv("IMDB-Movie-Data.csv", index_col="Rank")
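To see what index_col changes without needing the full file, here is a minimal sketch that reads a tiny in-memory CSV instead. The three rows and the reduced column set are a hypothetical stand-in for IMDB-Movie-Data.csv, and the names csv_text, default_df and ranked_df are chosen only for this illustration.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for IMDB-Movie-Data.csv.
csv_text = """Rank,Title,Metascore
1,Guardians of the Galaxy,76.0
2,Prometheus,65.0
3,Split,62.0
"""

# Without index_col, pandas creates a default integer index (0, 1, 2, ...).
default_df = pd.read_csv(io.StringIO(csv_text))

# With index_col="Rank", the "Rank" column becomes the row index instead.
ranked_df = pd.read_csv(io.StringIO(csv_text), index_col="Rank")

print(list(default_df.columns))  # "Rank" is still an ordinary column here
print(list(ranked_df.columns))   # "Rank" is no longer listed as a column
print(ranked_df.index.name)      # the index now carries the name "Rank"
```

Because the index column is removed from the regular columns, a row can then be looked up by its rank, for example with ranked_df.loc[2].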

You can use the DataFrame’s .head(N) method to view the first N rows (N defaults to 5).

print(df.head())
##                         Title                     Genre  ... Revenue (Millions) Metascore
## Rank                                                     ...
## 1     Guardians of the Galaxy   Action,Adventure,Sci-Fi  ...             333.13      76.0
## 2                  Prometheus  Adventure,Mystery,Sci-Fi  ...             126.46      65.0
## 3                       Split           Horror,Thriller  ...             138.12      62.0
## 4                        Sing   Animation,Comedy,Family  ...             270.32      59.0
## 5               Suicide Squad  Action,Adventure,Fantasy  ...             325.02      40.0
##
## [5 rows x 11 columns]

print(df.head(3))
##                         Title                     Genre  ... Revenue (Millions) Metascore
## Rank                                                     ...
## 1     Guardians of the Galaxy   Action,Adventure,Sci-Fi  ...             333.13      76.0
## 2                  Prometheus  Adventure,Mystery,Sci-Fi  ...             126.46      65.0
## 3                       Split           Horror,Thriller  ...             138.12      62.0
##
## [3 rows x 11 columns]

You can also use another column as the index. Let’s use “Title” this time!

df = pd.read_csv("IMDB-Movie-Data.csv", index_col="Title")
print(df.head())
##                          Rank                     Genre  ... Revenue (Millions) Metascore
## Title                                                    ...
## Guardians of the Galaxy     1   Action,Adventure,Sci-Fi  ...             333.13      76.0
## Prometheus                  2  Adventure,Mystery,Sci-Fi  ...             126.46      65.0
## Split                       3           Horror,Thriller  ...             138.12      62.0
## Sing                        4   Animation,Comedy,Family  ...             270.32      59.0
## Suicide Squad               5  Action,Adventure,Fantasy  ...             325.02      40.0
##
## [5 rows x 11 columns]

If you want to see the last N rows, use the .tail(N) method (N defaults to 5).

print(df.tail(3))
##                         Rank                  Genre  ... Revenue (Millions) Metascore
## Title                                                ...
## Step Up 2: The Streets   998    Drama,Music,Romance  ...              58.01      50.0
## Search Party             999       Adventure,Comedy  ...                NaN      22.0
## Nine Lives              1000  Comedy,Family,Fantasy  ...              19.64      11.0
## 
## [3 rows x 11 columns]
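Both .head() and .tail() return a new, smaller DataFrame rather than printing anything themselves, so the result can be stored and reused. The sketch below shows this on a hypothetical seven-row frame (toy and its "value" column are made up for the illustration), which is just large enough to make the default of 5 visible.

```python
import pandas as pd

# Hypothetical toy frame with seven rows labelled "a" to "g".
toy = pd.DataFrame({"value": range(10, 80, 10)}, index=list("abcdefg"))

first_five = toy.head()   # no argument: the first 5 rows
first_two = toy.head(2)   # the first 2 rows
last_three = toy.tail(3)  # the last 3 rows

print(len(first_five))         # number of rows returned by the default call
print(list(first_two.index))   # labels of the first two rows
print(list(last_three.index))  # labels of the last three rows
```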

Obtaining information about `DataFrame`

The .info() method will print out useful information about the DataFrame.

df.info()
## <class 'pandas.core.frame.DataFrame'>
## Index: 1000 entries, Guardians of the Galaxy to Nine Lives
## Data columns (total 11 columns):
##  #   Column              Non-Null Count  Dtype
## ---  ------              --------------  -----
##  0   Rank                1000 non-null   int64
##  1   Genre               1000 non-null   object
##  2   Description         1000 non-null   object
##  3   Director            1000 non-null   object
##  4   Actors              1000 non-null   object
##  5   Year                1000 non-null   int64
##  6   Runtime (Minutes)   1000 non-null   int64
##  7   Rating              1000 non-null   float64
##  8   Votes               1000 non-null   int64
##  9   Revenue (Millions)  872 non-null    float64
##  10  Metascore           936 non-null    float64
## dtypes: float64(3), int64(4), object(4)
## memory usage: 93.8+ KB
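Note that .info() prints its summary directly and returns None, so there is no need to wrap it in print(). If you want the non-null counts as data you can work with, the .count() method returns them as a Series. The following sketch assumes a hypothetical three-row frame named toy, built from the last three titles shown earlier, with one missing revenue value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing revenue value (NaN).
toy = pd.DataFrame(
    {
        "Title": ["Step Up 2: The Streets", "Search Party", "Nine Lives"],
        "Revenue (Millions)": [58.01, np.nan, 19.64],
    }
)

result = toy.info()  # prints the summary to standard output
print(result)        # info() itself returns None

# count() reports the number of non-null values per column as a Series.
non_null = toy.count()
print(non_null["Revenue (Millions)"])  # the NaN entry is not counted
```

In the IMDB output above, the same reading applies: "Revenue (Millions)" has 872 non-null values out of 1000 entries, so 128 movies have no recorded revenue.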

DataFrame’s .shape and .size attributes behave just like those of a NumPy array: .shape is a (rows, columns) tuple, and .size is the total number of elements.

print(df.shape)  ## (1000, 11)
print(df.size)   ## 11000
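As a quick check of that correspondence, the sketch below builds a DataFrame from a small NumPy array and compares the two attributes side by side. The names arr and frame and the column labels are hypothetical. It also shows that the index is not counted as a column, which is why the IMDB frame reports 11 columns even though the file has a “Title” field as well.

```python
import numpy as np
import pandas as pd

# A 2 x 3 NumPy array and a DataFrame built from it.
arr = np.arange(6).reshape(2, 3)
frame = pd.DataFrame(arr, columns=["a", "b", "c"])

print(arr.shape, frame.shape)  # both report (rows, columns)
print(arr.size, frame.size)    # both report the total number of elements

# size is always rows * columns.
rows, cols = frame.shape
print(rows * cols == frame.size)

# Moving a column into the index removes it from the column count.
indexed = frame.set_index("a")
print(indexed.shape)
```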

The .values attribute returns the data as a NumPy array.

print(type(df.values))  ## <class 'numpy.ndarray'>
print(df.values) ## Too many values to display here! Try it yourself!
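To try .values on something small enough to display, here is a minimal sketch using a hypothetical two-row frame named toy. It also calls .to_numpy(), which the pandas documentation recommends over .values for new code; for a plain frame like this one the two give the same array.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one integer and one float column.
toy = pd.DataFrame({"Rank": [1, 2], "Metascore": [76.0, 65.0]})

arr = toy.values
print(type(arr))  # a NumPy ndarray
print(arr.shape)  # same (rows, columns) shape as the DataFrame
print(arr.dtype)  # int and float columns are upcast to one common dtype

# to_numpy() is the method form recommended by the pandas documentation.
same = np.array_equal(arr, toy.to_numpy())
print(same)
```

A NumPy array holds a single dtype, so columns of different types are converted to a common one. For a frame that mixes numbers and strings, as the IMDB data does, expect an array of dtype object.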

