This is an archived version of the course and is no longer updated. Please find the latest version of the course on the main webpage.

Missing values

Your data may contain missing values, which pandas represents with null placeholders such as NaN or None.

To deal with null values, you can

  • remove the rows/columns with nulls
  • replace nulls with non-null values (this is called imputation)

Use df.isnull() to return a DataFrame of bools with the same shape as df, indicating whether each corresponding element is missing.
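As a minimal sketch of what the boolean result looks like, the example below uses a small made-up DataFrame (the name toy and its columns are hypothetical and are not part of the movie dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the course's movie dataset
toy = pd.DataFrame({
    "score": [7.5, np.nan, 6.1],
    "votes": [100, 250, np.nan],
})

# True marks a missing element, False marks a present one
mask = toy.isnull()
print(mask)
##    score  votes
## 0  False  False
## 1   True  False
## 2  False   True
```

Because True counts as 1 when summed, adding up each column of this mask gives the number of nulls in that column.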

df.isnull().sum() will return the number of nulls in each column.

For our movie example, Revenue (Millions) has 128 missing values, while Metascore has 64 missing values.

print(df.isnull().sum())
## Rank                    0
## Genre                   0
## Description             0
## Director                0
## Actors                  0
## Year                    0
## Runtime (Minutes)       0
## Rating                  0
## Votes                   0
## Revenue (Millions)    128
## Metascore              64
## dtype: int64

Removing missing values

You can use df.dropna() to remove missing values from your data.

print(df.shape)  ## (1000, 11)

# drop rows where there are null values
clean_df = df.dropna() 
print(clean_df.shape)  ## (838, 11) - we have dropped rows where Revenue or Metascore is null

# drop columns where there are null values
clean_df = df.dropna(axis=1)
print(clean_df.shape) ## (1000, 9) - dropped Revenue and Metascore columns

# Adding inplace=True will modify df directly
df.dropna(inplace=True)
print(df.shape)  ## (838, 11)
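Dropping every row that has a null anywhere can discard more data than you need to. As a hedged sketch, the subset and how parameters of dropna() let you restrict which columns are checked. The toy DataFrame and its column names below are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the course's movie dataset
toy = pd.DataFrame({
    "revenue": [10.0, np.nan, 30.0, np.nan],
    "metascore": [50.0, 60.0, np.nan, np.nan],
    "year": [2014, 2015, 2016, 2016],
})

# Drop a row only when "revenue" is null
by_revenue = toy.dropna(subset=["revenue"])
print(by_revenue.shape)  ## (2, 3)

# Drop a row only when both listed columns are null
all_missing = toy.dropna(how="all", subset=["revenue", "metascore"])
print(all_missing.shape)  ## (3, 3)
```

Neither call uses inplace=True, so toy itself is left unchanged.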

Replacing missing values

You can also replace nulls with another value using df.fillna().

For numeric columns, nulls are commonly replaced with the mean or median value of the column.

# Get back the original data
df = pd.read_csv("IMDB-Movie-Data.csv", index_col="Title")

revenues = df["Revenue (Millions)"]
print(revenues.head())
## Title
## Guardians of the Galaxy    333.13
## Prometheus                 126.46
## Split                      138.12
## Sing                       270.32
## Suicide Squad              325.02
## Name: Revenue (Millions), dtype: float64

# Compute mean revenue
revenues_mean = revenues.mean()
print(revenues_mean)  ## 82.95637614678898

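# Replace missing revenues with the mean, assigning the result back to df
# (fillna(inplace=True) on the extracted column may not update df in newer pandas versions)
df["Revenue (Millions)"] = revenues.fillna(revenues_mean)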

# Revenue has no more missing values
print(df.isnull().sum())
## Rank                   0
## Genre                  0
## Description            0
## Director               0
## Actors                 0
## Year                   0
## Runtime (Minutes)      0
## Rating                 0
## Votes                  0
## Revenue (Millions)     0
## Metascore             64
## dtype: int64
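The median alternative mentioned above works the same way as the mean. The sketch below is only illustrative: it uses a small made-up column (the toy DataFrame is hypothetical) rather than the real Metascore data, so the numbers are not results from the movie dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the course's movie dataset
toy = pd.DataFrame({"metascore": [40.0, np.nan, 60.0, 100.0, np.nan]})

# median() skips nulls by default, so it is computed from 40, 60 and 100
median = toy["metascore"].median()
print(median)  ## 60.0

# Assign the filled column back instead of relying on inplace=True
toy["metascore"] = toy["metascore"].fillna(median)
print(toy["metascore"].isnull().sum())  ## 0
```

The median is less sensitive to extreme values than the mean, which can make it a reasonable choice for skewed columns.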