This is an archived version of the course and is no longer updated. Please find the latest version of the course on the main webpage.

DataFrame operations

Let us now work with DataFrames through a practical example, using the IMDB movie dataset.

Assuming you have downloaded IMDB-Movie-Data.csv to your current directory, let us read the CSV file into a DataFrame. We will also use the column “Rank” as the index.

import pandas as pd  # not needed again if pandas is already imported as pd

df = pd.read_csv("IMDB-Movie-Data.csv", index_col="Rank")
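If you do not have the file at hand, the effect of index_col can still be tried on a small in-memory CSV. The sketch below is only an illustration: the csv_text string and the name sample are hypothetical stand-ins that hold a few of the rows and columns shown in this section, not the real file.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for IMDB-Movie-Data.csv, holding only a
# few of the columns and rows shown in this section.
csv_text = """Rank,Title,Genre,Metascore
1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",76.0
2,Prometheus,"Adventure,Mystery,Sci-Fi",65.0
3,Split,"Horror,Thriller",62.0
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text), index_col="Rank")

# The column named in index_col becomes the index and leaves the columns.
print(sample.index.name)      ## Rank
print(list(sample.columns))   ## ['Title', 'Genre', 'Metascore']
```

Note that the Genre values contain commas, so they must be quoted in the CSV text for each to be read as a single field.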

You can use the DataFrame’s .head(N) method to view the first N rows (N defaults to 5).

print(df.head())
##                         Title                     Genre  ... Revenue (Millions) Metascore
## Rank                                                     ...
## 1     Guardians of the Galaxy   Action,Adventure,Sci-Fi  ...             333.13      76.0
## 2                  Prometheus  Adventure,Mystery,Sci-Fi  ...             126.46      65.0
## 3                       Split           Horror,Thriller  ...             138.12      62.0
## 4                        Sing   Animation,Comedy,Family  ...             270.32      59.0
## 5               Suicide Squad  Action,Adventure,Fantasy  ...             325.02      40.0
##
## [5 rows x 11 columns]

print(df.head(3))
##                         Title                     Genre  ... Revenue (Millions) Metascore
## Rank                                                     ...
## 1     Guardians of the Galaxy   Action,Adventure,Sci-Fi  ...             333.13      76.0
## 2                  Prometheus  Adventure,Mystery,Sci-Fi  ...             126.46      65.0
## 3                       Split           Horror,Thriller  ...             138.12      62.0
##
## [3 rows x 11 columns]
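A few edge cases of .head(N) are worth knowing. The sketch below uses a hypothetical five-row table (the name sample is an assumption, built from the ranks and titles shown above) to illustrate a negative N and an N larger than the table.

```python
import pandas as pd

# Hypothetical miniature of the movie table, using ranks and titles shown above.
sample = pd.DataFrame(
    {"Title": ["Guardians of the Galaxy", "Prometheus", "Split", "Sing", "Suicide Squad"]},
    index=pd.Index([1, 2, 3, 4, 5], name="Rank"),
)

first_two = sample.head(2)       # rows with Rank 1 and 2
all_but_last = sample.head(-1)   # a negative N drops the last |N| rows

print(len(first_two))                 ## 2
print(all_but_last.index.tolist())    ## [1, 2, 3, 4]

# Asking for more rows than exist simply returns every row.
print(len(sample.head(10)))           ## 5
```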

You can also use another column as the index. Let’s use “Title” this time!

df = pd.read_csv("IMDB-Movie-Data.csv", index_col="Title")
print(df.head())
##                          Rank                     Genre  ... Revenue (Millions) Metascore
## Title                                                    ...
## Guardians of the Galaxy     1   Action,Adventure,Sci-Fi  ...             333.13      76.0
## Prometheus                  2  Adventure,Mystery,Sci-Fi  ...             126.46      65.0
## Split                       3           Horror,Thriller  ...             138.12      62.0
## Sing                        4   Animation,Comedy,Family  ...             270.32      59.0
## Suicide Squad               5  Action,Adventure,Fantasy  ...             325.02      40.0
##
## [5 rows x 11 columns]
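Re-reading the file is not the only way to change the index. As a minimal sketch, assuming a DataFrame that is already indexed by Rank (the names by_rank and by_title are hypothetical), reset_index() and set_index() can switch the index in memory, after which rows can be looked up by title with .loc.

```python
import pandas as pd

# Hypothetical miniature of the movie table, indexed by Rank as before.
by_rank = pd.DataFrame(
    {
        "Title": ["Guardians of the Galaxy", "Prometheus", "Split"],
        "Metascore": [76.0, 65.0, 62.0],
    },
    index=pd.Index([1, 2, 3], name="Rank"),
)

# reset_index() turns Rank back into a regular column;
# set_index("Title") then promotes Title to the index.
by_title = by_rank.reset_index().set_index("Title")

print(list(by_title.columns))                    ## ['Rank', 'Metascore']

# With Title as the index, .loc looks rows up by movie title.
print(by_title.loc["Prometheus", "Metascore"])   ## 65.0
```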

To see the last N rows, use the .tail(N) method (N defaults to 5).

print(df.tail(3))
##                         Rank                  Genre  ... Revenue (Millions) Metascore
## Title                                                ...
## Step Up 2: The Streets   998    Drama,Music,Romance  ...              58.01      50.0
## Search Party             999       Adventure,Comedy  ...                NaN      22.0
## Nine Lives              1000  Comedy,Family,Fantasy  ...              19.64      11.0
## 
## [3 rows x 11 columns]
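.tail(N) selects the same rows as positional slicing from the end. The sketch below checks this on a hypothetical three-row table (the name sample is an assumption) built from the closing rows shown above, including the missing revenue for Search Party.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature built from the three closing rows shown above.
sample = pd.DataFrame(
    {
        "Rank": [998, 999, 1000],
        "Revenue (Millions)": [58.01, np.nan, 19.64],
    },
    index=pd.Index(
        ["Step Up 2: The Streets", "Search Party", "Nine Lives"], name="Title"
    ),
)

last_two = sample.tail(2)

# Positional slicing with .iloc selects the same rows.
print(last_two.equals(sample.iloc[-2:]))   ## True
print(last_two.index.tolist())             ## ['Search Party', 'Nine Lives']
```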

Obtaining information about a DataFrame

The .info() method prints a summary of the DataFrame: the index range, the name, non-null count, and dtype of each column, and the memory usage.

df.info()
## <class 'pandas.core.frame.DataFrame'>
## Index: 1000 entries, Guardians of the Galaxy to Nine Lives
## Data columns (total 11 columns):
##  #   Column              Non-Null Count  Dtype
## ---  ------              --------------  -----
##  0   Rank                1000 non-null   int64
##  1   Genre               1000 non-null   object
##  2   Description         1000 non-null   object
##  3   Director            1000 non-null   object
##  4   Actors              1000 non-null   object
##  5   Year                1000 non-null   int64
##  6   Runtime (Minutes)   1000 non-null   int64
##  7   Rating              1000 non-null   float64
##  8   Votes               1000 non-null   int64
##  9   Revenue (Millions)  872 non-null    float64
##  10  Metascore           936 non-null    float64
## dtypes: float64(3), int64(4), object(4)
## memory usage: 93.8+ KB
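The non-null counts above tell us that 1000 - 872 = 128 movies have no revenue figure and 1000 - 936 = 64 have no Metascore. Since .info() only prints its summary, it can be handy to compute such counts directly. A minimal sketch, assuming a hypothetical four-row table named sample that reuses values shown in this section:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature using Revenue and Metascore values shown in this section.
sample = pd.DataFrame(
    {
        "Revenue (Millions)": [333.13, 126.46, np.nan, 19.64],
        "Metascore": [76.0, 65.0, 22.0, 11.0],
    },
    index=pd.Index(
        ["Guardians of the Galaxy", "Prometheus", "Search Party", "Nine Lives"],
        name="Title",
    ),
)

# notna().sum() reproduces the "Non-Null Count" figures as a Series.
non_null = sample.notna().sum()
# isna().sum() gives the number of missing values per column.
missing = sample.isna().sum()

print(non_null["Revenue (Millions)"])   ## 3
print(missing["Revenue (Millions)"])    ## 1
print(missing["Metascore"])             ## 0
```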

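A DataFrame’s .shape and .size attributes behave just like those of a NumPy array: .shape is the tuple (number of rows, number of columns), and .size is the total number of elements.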

print(df.shape)  ## (1000, 11)
print(df.size)   ## 11000
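The relationship between the two attributes can be verified on a small example. The table below is a hypothetical miniature (the name sample is an assumption); note that the index does not count as a column.

```python
import pandas as pd

# Hypothetical miniature: 3 rows and 2 columns (the index is not a column).
sample = pd.DataFrame(
    {"Rank": [1, 2, 3], "Metascore": [76.0, 65.0, 62.0]},
    index=pd.Index(["Guardians of the Galaxy", "Prometheus", "Split"], name="Title"),
)

n_rows, n_cols = sample.shape

print(sample.shape)                     ## (3, 2)
print(sample.size == n_rows * n_cols)   ## True

# len() of a DataFrame is its number of rows.
print(len(sample))                      ## 3
```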

The .values attribute returns the data (without the index) as a NumPy array.

print(type(df.values))  ## <class 'numpy.ndarray'>
print(df.values) ## Too many values to display here! Try it yourself!
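The pandas documentation recommends the .to_numpy() method over .values for new code. Either way, a NumPy array has a single dtype, so the result depends on the mix of column types. The sketch below illustrates this on a hypothetical two-row table (the names sample, numeric, and mixed are assumptions).

```python
import pandas as pd

# Hypothetical miniature with one integer, one float, and one string column.
sample = pd.DataFrame(
    {
        "Rank": [1, 2],
        "Metascore": [76.0, 65.0],
        "Genre": ["Action,Adventure,Sci-Fi", "Adventure,Mystery,Sci-Fi"],
    },
    index=pd.Index(["Guardians of the Galaxy", "Prometheus"], name="Title"),
)

numeric = sample[["Rank", "Metascore"]].to_numpy()
mixed = sample.to_numpy()

# Integer and float columns are upcast to a common numeric type.
print(numeric.dtype)   ## float64

# Numeric and string columns together fall back to generic Python objects.
print(mixed.dtype)     ## object

# The index (Title) is not part of the array.
print(mixed.shape)     ## (2, 3)
```

Because the movie table mixes numbers and text, converting the whole DataFrame yields an object array; selecting the numeric columns first gives an array that is better suited to numerical work.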