DataFrame operations
Let us now explore DataFrame with a practical example: the IMDB movie dataset. Assuming you have downloaded IMDB-Movie-Data.csv to your current directory, let us read the CSV file as a DataFrame. We will also use the column “Rank” as the index.
import pandas as pd  # skip if pandas is already imported as pd
df = pd.read_csv("IMDB-Movie-Data.csv", index_col="Rank")
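To see what index_col does without needing the downloaded file, here is a minimal sketch that reads a tiny CSV from an in-memory string. The three rows and the names `toy_csv` and `toy` are made up for illustration and are not part of the IMDB dataset.

```python
import io

import pandas as pd

# A tiny, made-up CSV (hypothetical data, not the real IMDB file).
toy_csv = io.StringIO(
    "Rank,Title,Year\n"
    "1,Movie A,2014\n"
    "2,Movie B,2012\n"
    "3,Movie C,2016\n"
)

# index_col="Rank" turns the Rank column into the row index
# instead of keeping it as an ordinary data column.
toy = pd.read_csv(toy_csv, index_col="Rank")

print(toy.index.name)     # Rank
print(list(toy.columns))  # ['Title', 'Year']
```

Because “Rank” became the index, it no longer appears among the data columns.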
You can use the DataFrame’s .head(N) method to view the first N rows (N defaults to 5).
print(df.head())
## Title Genre ... Revenue (Millions) Metascore
## Rank ...
## 1 Guardians of the Galaxy Action,Adventure,Sci-Fi ... 333.13 76.0
## 2 Prometheus Adventure,Mystery,Sci-Fi ... 126.46 65.0
## 3 Split Horror,Thriller ... 138.12 62.0
## 4 Sing Animation,Comedy,Family ... 270.32 59.0
## 5 Suicide Squad Action,Adventure,Fantasy ... 325.02 40.0
##
## [5 rows x 11 columns]
print(df.head(3))
## Title Genre ... Revenue (Millions) Metascore
## Rank ...
## 1 Guardians of the Galaxy Action,Adventure,Sci-Fi ... 333.13 76.0
## 2 Prometheus Adventure,Mystery,Sci-Fi ... 126.46 65.0
## 3 Split Horror,Thriller ... 138.12 62.0
##
## [3 rows x 11 columns]
You can also use another column as the index. Let’s use “Title” this time!
df = pd.read_csv("IMDB-Movie-Data.csv", index_col="Title")
print(df.head())
## Rank Genre ... Revenue (Millions) Metascore
## Title ...
## Guardians of the Galaxy 1 Action,Adventure,Sci-Fi ... 333.13 76.0
## Prometheus 2 Adventure,Mystery,Sci-Fi ... 126.46 65.0
## Split 3 Horror,Thriller ... 138.12 62.0
## Sing 4 Animation,Comedy,Family ... 270.32 59.0
## Suicide Squad 5 Action,Adventure,Fantasy ... 325.02 40.0
##
## [5 rows x 11 columns]
If you want to see the last N rows, use the .tail(N) method (N defaults to 5).
print(df.tail(3))
## Rank Genre ... Revenue (Millions) Metascore
## Title ...
## Step Up 2: The Streets 998 Drama,Music,Romance ... 58.01 50.0
## Search Party 999 Adventure,Comedy ... NaN 22.0
## Nine Lives 1000 Comedy,Family,Fantasy ... 19.64 11.0
##
## [3 rows x 11 columns]
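The behaviour of .head() and .tail() is easy to check on a small DataFrame. The sketch below assumes a made-up ten-row DataFrame named `toy`, so that the selected rows are obvious at a glance.

```python
import pandas as pd

# Hypothetical ten-row DataFrame, used only for illustration.
toy = pd.DataFrame({"value": range(10)})

first_default = toy.head()   # no argument: first 5 rows
first_two = toy.head(2)      # first 2 rows
last_three = toy.tail(3)     # last 3 rows

print(len(first_default))            # 5
print(first_two["value"].tolist())   # [0, 1]
print(last_three["value"].tolist())  # [7, 8, 9]
```

Both methods return a new DataFrame, so the result can be stored in a variable or passed on to other operations.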
Obtaining information about a DataFrame
The .info() method prints a summary of the DataFrame: the index range, each column’s name, non-null count, and dtype, and the approximate memory usage.
df.info()
## <class 'pandas.core.frame.DataFrame'>
## Index: 1000 entries, Guardians of the Galaxy to Nine Lives
## Data columns (total 11 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 Rank 1000 non-null int64
## 1 Genre 1000 non-null object
## 2 Description 1000 non-null object
## 3 Director 1000 non-null object
## 4 Actors 1000 non-null object
## 5 Year 1000 non-null int64
## 6 Runtime (Minutes) 1000 non-null int64
## 7 Rating 1000 non-null float64
## 8 Votes 1000 non-null int64
## 9 Revenue (Millions) 872 non-null float64
## 10 Metascore 936 non-null float64
## dtypes: float64(3), int64(4), object(4)
## memory usage: 93.8+ KB
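The Non-Null Count column above shows that “Revenue (Millions)” and “Metascore” contain missing values (872 and 936 non-null entries out of 1000). One way to count missing values directly is to combine .isna() with .sum(). The sketch below uses a small, made-up DataFrame named `toy` rather than the IMDB data.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing Revenue entries.
toy = pd.DataFrame({
    "Rating": [8.1, 7.0, 6.2, 7.5],
    "Revenue": [333.13, np.nan, 138.12, np.nan],
})

# isna() marks each missing cell as True; sum() counts the True values per column.
missing = toy.isna().sum()
print(missing["Rating"])   # 0
print(missing["Revenue"])  # 2

# The non-null count reported by .info() is the row count minus the missing count.
print(len(toy) - missing["Revenue"])  # 2
```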
A DataFrame’s .shape and .size attributes behave just like those of a NumPy array: .shape is a (rows, columns) tuple and .size is the total number of elements.
print(df.shape) ## (1000, 11)
print(df.size) ## 11000
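The relationship between the two attributes can be checked on a small example. This sketch assumes a made-up 4 × 3 array `arr` and a DataFrame `toy` built from it, and compares the attributes of both.

```python
import numpy as np
import pandas as pd

# Hypothetical 4 x 3 array wrapped in a DataFrame.
arr = np.arange(12).reshape(4, 3)
toy = pd.DataFrame(arr, columns=["a", "b", "c"])

print(toy.shape)  # (4, 3)
print(toy.size)   # 12

# size is the number of rows times the number of columns.
n_rows, n_cols = toy.shape
print(toy.size == n_rows * n_cols)  # True

# The DataFrame reports the same shape and size as the underlying array.
print(toy.shape == arr.shape and toy.size == arr.size)  # True
```

The index is not counted as a column, which is why the IMDB DataFrame indexed by “Title” reports 11 columns.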
The .values attribute returns the data as a NumPy array.
type(df.values) ## <class 'numpy.ndarray'>
print(df.values) ## Too many values to display here! Try it yourself!
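On a small DataFrame the resulting array is easy to inspect. The sketch below uses made-up numbers and also shows .to_numpy(), the method form that the pandas documentation recommends over .values.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row DataFrame, used only for illustration.
toy = pd.DataFrame({"Year": [2014, 2012], "Votes": [1000, 2000]})

arr = toy.values        # attribute used in the text
arr2 = toy.to_numpy()   # equivalent method form

print(isinstance(arr, np.ndarray))  # True
print(arr.shape)                    # (2, 2)
print(np.array_equal(arr, arr2))    # True
```

Only the data is converted: the index and column labels are not part of the returned array.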