This is an archived version of the course and is no longer updated. Please find the latest version of the course on the main webpage.

DataFrame filtering

You can filter the data by selecting a column and applying a condition to it. The comparison returns a boolean Series with one True or False value per row, indexed the same way as the DataFrame.

# Is Ridley Scott the director in these rows?
condition_series = df["Director"] == "Ridley Scott"
print(condition_series.head())
## Title
## Guardians of the Galaxy    False
## Prometheus                  True
## Split                      False
## Sing                       False
## Suicide Squad              False
## Name: Director, dtype: bool

To return only the rows where the condition is True, pass the boolean Series to the DataFrame inside square brackets.

# Give me only rows where the director is Ridley Scott
filtered_df = df[df["Director"] == "Ridley Scott"]
print(filtered_df.head())
##                         Rank                     Genre  ... Revenue (Millions) Metascore
## Title                                                   ...
## Prometheus                 2  Adventure,Mystery,Sci-Fi  ...             126.46      65.0
## The Martian              103    Adventure,Drama,Sci-Fi  ...             228.43      80.0
## Robin Hood               388    Action,Adventure,Drama  ...             105.22      53.0
## American Gangster        471     Biography,Crime,Drama  ...             130.13      76.0
## Exodus: Gods and Kings   517    Action,Adventure,Drama  ...              65.01      52.0
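The two steps above (build a boolean Series, then index with it) can be sketched on a tiny table. This is only an illustration: the four-row DataFrame below is a hand-built stand-in for the course's movie data, and the names `is_scott` and `n_matches` are made up for this example.

```python
import pandas as pd

# Hand-built stand-in for the movie table (assumption: not the real course file)
df = pd.DataFrame(
    {
        "Director": ["James Gunn", "Ridley Scott", "M. Night Shyamalan", "Ridley Scott"],
        "Year": [2014, 2012, 2016, 2015],
    },
    index=pd.Index(
        ["Guardians of the Galaxy", "Prometheus", "Split", "The Martian"],
        name="Title",
    ),
)

# Step 1: the comparison yields a boolean Series, one value per row
is_scott = df["Director"] == "Ridley Scott"

# True counts as 1, so summing the mask gives the number of matching rows
n_matches = int(is_scott.sum())

# Step 2: indexing with the mask keeps only the rows marked True
filtered_df = df[is_scott]

print(n_matches)
print(filtered_df)
```

Storing the mask in a variable first makes it easy to inspect or reuse before filtering.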

More complicated examples (try understanding these yourself!):

# selecting movies directed by Nolan or Scott
print(df[df["Director"].isin(["Christopher Nolan", "Ridley Scott"])].head())
##                  Rank                     Genre  ... Revenue (Millions) Metascore
## Title                                            ...
## Prometheus          2  Adventure,Mystery,Sci-Fi  ...             126.46      65.0
## Interstellar       37    Adventure,Drama,Sci-Fi  ...             187.99      74.0
## The Dark Knight    55        Action,Crime,Drama  ...             533.32      82.0
## The Prestige       65      Drama,Mystery,Sci-Fi  ...              53.08      66.0
## Inception          81   Action,Adventure,Sci-Fi  ...             292.57      74.0
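As a hint for the `isin` example: it should behave like several `==` comparisons joined with `|` (element-wise "or"). The sketch below checks this on a small hand-built table (an assumption, not the course data). Each comparison needs its own parentheses, because `|` binds more tightly than `==` in Python.

```python
import pandas as pd

# Hand-built stand-in for the movie table (assumption: not the real course file)
df = pd.DataFrame(
    {"Director": ["Ridley Scott", "Christopher Nolan", "James Gunn", "Christopher Nolan"]},
    index=pd.Index(
        ["Prometheus", "Interstellar", "Guardians of the Galaxy", "Inception"],
        name="Title",
    ),
)

# Membership test against a list of allowed values
with_isin = df[df["Director"].isin(["Christopher Nolan", "Ridley Scott"])]

# The same filter written as two comparisons combined with element-wise "or"
with_or = df[(df["Director"] == "Christopher Nolan") | (df["Director"] == "Ridley Scott")]

# Both approaches select the same rows
same = with_isin.equals(with_or)
print(same)
```

`isin` tends to stay readable as the list of allowed values grows, while the `|` form gets long quickly.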

# Selecting movies released from 2008 to 2010 (inclusive) with a rating of at least 8.3
# and returning only the year and rating
# (phew! That was a mouthful!) 
# Might be a good idea to split this into multiple statements!
selection = df[((df["Year"] >= 2008) & (df["Year"] <= 2010)) &
               (df["Rating"] >= 8.3)][["Year", "Rating"]]
print(selection)
##                       Year  Rating
## Title
## The Dark Knight       2008     9.0
## Inglourious Basterds  2009     8.3
## Inception             2010     8.8
## 3 Idiots              2009     8.4
## Up                    2009     8.3
## WALL·E                2008     8.4
## Toy Story 3           2010     8.3
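One possible way to split the last selection into multiple statements, as its comment suggests, is to name each condition and then combine them with `.loc`. This is a sketch on a hand-built table: four rows reuse Year and Rating values from the output above, and the other two rows (including "Hypothetical Movie") carry made-up values so that each condition has something to exclude. The names `in_years` and `high_rating` are made up for this example.

```python
import pandas as pd

# Hand-built stand-in for the movie table (assumption: not the real course file)
df = pd.DataFrame(
    {
        "Year": [2008, 2009, 2010, 2012, 2010, 2009],
        "Rating": [9.0, 8.3, 8.8, 7.0, 8.3, 6.0],
    },
    index=pd.Index(
        [
            "The Dark Knight",
            "Inglourious Basterds",
            "Inception",
            "Prometheus",
            "Toy Story 3",
            "Hypothetical Movie",  # made-up row: year in range, rating too low
        ],
        name="Title",
    ),
)

# One named mask per condition; between() includes both endpoints by default
in_years = df["Year"].between(2008, 2010)
high_rating = df["Rating"] >= 8.3

# .loc takes the row mask and the column list in a single indexing step
selection = df.loc[in_years & high_rating, ["Year", "Rating"]]
print(selection)
```

Naming the masks lets each condition be checked separately, and `.loc` avoids chaining two sets of square brackets.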