Exercises
Ok, that is it for our guided tutorials.
Now, it’s over to you to try some exercises for using Pandas!
Many thanks to the following people for contributing to these exercises: Joe Stacey, Oana Cocarascu
Part A: Data cleaning and selection
Ex1a: Create the following Pandas DataFrame:
Name | Age | Twitter.followers | HoursPW.on.twitter |
---|---|---|---|
Annette | 23 | 23 | 2 |
Boris | 52 | 102 | 4 |
Christine | 65 | 57 | 0 |
Daniel | 45 | 453 | NA |
Eve | 25 | 753 | 3 |
Frances | 35 | Unknown | 1 |
Gabi | 21 | Unknown | 2 |
Ex1b: For the DataFrame created above, perform the following operations to see who has the most followers:
- If anyone has ‘Unknown’ number of followers, set this to be 0
- Remove any rows that contain NA
- Subtract 5 from each person’s Twitter.followers (to account for bots)
- Filter the table to only include people who have more than 50 followers
- Sort the table by the ascending number of twitter followers
Ex1c: For the updated dataframe, try also performing the following operations:
- Find the mean of each column
- Return the value from the 2nd row and the 3rd column
- Return the 3rd row from the column named ‘Name’
- Drop all columns except for ‘Name’ and ‘Twitter.followers’.
- How else could you only select these two columns?
Part B: Finding aggregate information from a Pandas DataFrame
Ex2a: Use the following commands to create a new Pandas DataFrame:
input_data = [['Europe','UK', 'London', 'Rain', 15, 5],['Europe','France', 'Paris', 'Cloudy', 22, 3], ['Europe','Spain', 'Madrid', 'Sunny', 25, 4],['Asia','Afghanistan', 'Kabul', 'Sunny', 37, 2], ['Asia','Pakistan', 'Islamabad', 'Sunny', 32, 4],['Europe','Germany', 'Berlin', 'Cloudy', 20, 5] ]
data = pd.DataFrame(input_data, columns = ['Continent', 'Country', 'Capital.city', 'Weather', 'Temperature', 'Visitor.rating'])
This should look like the table below:
Continent | Country | Capital.city | Weather | Temperature | Visitor.rating |
---|---|---|---|---|---|
Europe | UK | London | Rain | 15 | 5 |
Europe | France | Paris | Cloudy | 22 | 3 |
Europe | Spain | Madrid | Sunny | 25 | 4 |
Asia | Afghanistan | Kabul | Sunny | 37 | 2 |
Asia | Pakistan | Islamabad | Sunny | 32 | 4 |
Europe | Germany | Berlin | Cloudy | 20 | 5 |
Ex2b: Update the DataFrame based on the following:
- Create another row for a country of your choice, and merge
- Create another column with a 0 or a 1 based on if you’ve been to each capital city
Ex3: In Pandas we can easily group our data by one or more features, and then create a DataFrame with summary information about each group. We can do this using .groupby(col_name) followed by .agg() or .transform(). Use these commands to answer the following questions:
- Create a DataFrame with the average temperature for each Continent
- Create a DataFrame with the minimum, median and maximum visitor rating for each Continent
- Produce a copy of the original DataFrame, and without losing any rows add a column with the average temperature for the continent. E.g. for the first three rows this figure will be the same as all three rows relate to Europe.
Bonus question
Ex4: Pokemon data analysis
-
Download the Pokemon dataset at PokemonData
- Load the csv file into a dataframe and set the Name to be the index of the dataframe
- Print the first 20 pokemons
- Print the name of the columns
- Print Type 1 and Generation for all pokemons
- Get all the data for Pikachu
- Get the number of nulls in each column
- Use the unique function to display the unique values of column Type 1
- For each unique value of Type 1 , print Name , Type 1 , Generation , and Legendary of the first 5 pokemons
- Print all pokemons that have Water as Type 1 and Dragon as Type 2
- Print all pokemons that are either Type 2 Electric or Type 2 Ice