Missing values
Your data may contain missing values, which pandas represents with null placeholders such as NaN or None.
To deal with null values, you can:
- remove the rows/columns with nulls
- replace nulls with non-null values (this is called imputation)
Use df.isnull() to return a DataFrame of booleans indicating whether each corresponding element is missing.
df.isnull().sum() returns the number of nulls in each column.
For our movie example, Revenue (Millions) has 128 missing values, while Metascore has 64 missing values.
print(df.isnull().sum())
## Rank 0
## Genre 0
## Description 0
## Director 0
## Actors 0
## Year 0
## Runtime (Minutes) 0
## Rating 0
## Votes 0
## Revenue (Millions) 128
## Metascore 64
## dtype: int64
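To make the null-counting idiom easier to try without the movie file, here is a minimal sketch on a small made-up DataFrame. The `toy` frame and its column names are hypothetical and are not part of the movie dataset. It also shows two related summaries that follow from the same boolean mask: the fraction of nulls per column and the number of rows containing at least one null.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only (not the movie dataset)
toy = pd.DataFrame(
    {
        "revenue": [333.13, np.nan, 138.12, np.nan],
        "metascore": [76.0, 65.0, np.nan, 59.0],
    },
    index=["A", "B", "C", "D"],
)

null_mask = toy.isnull()                        # booleans, same shape as toy
null_counts = null_mask.sum()                   # number of nulls per column
null_share = null_mask.mean()                   # fraction of nulls per column
rows_with_nulls = null_mask.any(axis=1).sum()   # rows with at least one null

print(null_counts)
print(null_share)
print(rows_with_nulls)
```

Summing a boolean mask works because True counts as 1 and False as 0, which is also why the mean gives the share of missing entries.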
Removing missing values
You can use df.dropna() to remove missing values from your data.
print(df.shape) ## (1000, 11)
# drop rows where there are null values
clean_df = df.dropna()
print(clean_df.shape) ## (838, 11) - we have dropped rows where Revenue or Metascore is null
# drop columns where there are null values
clean_df = df.dropna(axis=1)
print(clean_df.shape) ## (1000, 9) - dropped Revenue and Metascore columns
# Adding inplace=True will modify df directly
df.dropna(inplace=True)
print(df.shape) ## (838, 11)
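Dropping every row that has any null can discard more data than necessary. As a hedged sketch, the `subset` and `thresh` parameters of dropna() allow a narrower rule. The `toy` frame below is hypothetical data made up for illustration, not the movie dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only (not the movie dataset)
toy = pd.DataFrame(
    {
        "revenue": [333.13, np.nan, 138.12, np.nan],
        "metascore": [76.0, 65.0, np.nan, np.nan],
        "year": [2014, 2012, 2016, 2016],
    }
)

# Drop rows only when "revenue" is null; nulls in other columns are kept
has_revenue = toy.dropna(subset=["revenue"])

# Keep only rows that have at least 2 non-null values
mostly_complete = toy.dropna(thresh=2)

print(has_revenue.shape)
print(mostly_complete.shape)
```

Both calls return a new DataFrame and leave `toy` unchanged, which matches the default behaviour of dropna() without inplace=True.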
Replacing missing values
You can also replace the nulls with another value using df.fillna().
Nulls in a numeric column are commonly replaced with the mean or median value of that column.
# Get back the original data
df = pd.read_csv("IMDB-Movie-Data.csv", index_col="Title")
revenues = df["Revenue (Millions)"]
print(revenues.head())
## Title
## Guardians of the Galaxy 333.13
## Prometheus 126.46
## Split 138.12
## Sing 270.32
## Suicide Squad 325.02
## Name: Revenue (Millions), dtype: float64
# Compute mean revenue
revenues_mean = revenues.mean()
print(revenues_mean) ## 82.95637614678898
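# Replace missing revenues with the mean and assign the result back to df
# (an in-place fillna on the extracted Series may not update df itself)
df["Revenue (Millions)"] = revenues.fillna(revenues_mean)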
# Revenue has no more missing values
print(df.isnull().sum())
## Rank 0
## Genre 0
## Description 0
## Director 0
## Actors 0
## Year 0
## Runtime (Minutes) 0
## Rating 0
## Votes 0
## Revenue (Millions) 0
## Metascore 64
## dtype: int64
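The same pattern applies when imputing with the median, which is less sensitive to extreme values than the mean. The following is a minimal sketch on a hypothetical `toy` frame with made-up scores, not the movie dataset. It assigns the filled column back to the DataFrame instead of relying on inplace=True.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only (not the movie dataset)
toy = pd.DataFrame({"metascore": [76.0, np.nan, 62.0, 40.0, np.nan]})

# median() skips nulls by default, so this is the median of 40, 62 and 76
metascore_median = toy["metascore"].median()

# Fill the nulls and assign the result back to the column
toy["metascore"] = toy["metascore"].fillna(metascore_median)

print(metascore_median)
print(toy["metascore"].isnull().sum())
```

Whether the mean or the median is the better fill value depends on the column's distribution, so treat either choice as an assumption to check against your data.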