This is an archived version of the course and is no longer updated. Please find the latest version of the course on the main webpage.

Creating DataFrames

Now let us take a look at how to create DataFrames.

There are many different ways to do this.

Method 1: Create a DataFrame from a list or NumPy array.

import numpy as np
import pandas as pd

arr = [["UK", "London"], ["France", "Paris"], ["Italy", "Rome"]]
df = pd.DataFrame(arr, columns=["country", "capital"])
print(df)
##   country capital
## 0      UK  London
## 1  France   Paris
## 2   Italy    Rome

# A NumPy array gives the same result
arr = np.array([["UK", "London"], ["France", "Paris"], ["Italy", "Rome"]])
column_names = np.array(["country", "capital"])
df = pd.DataFrame(arr, columns=column_names)
print(df)
##   country capital
## 0      UK  London
## 1  France   Paris
## 2   Italy    Rome

As with a Series, you can also provide a custom index (row labels).

prefixes = ["+44", "+33", "+39"]
df = pd.DataFrame(arr, columns=column_names, index=prefixes)
print(df)
##     country capital
## +44      UK  London
## +33  France   Paris
## +39   Italy    Rome

You can also assign the index after you have created the DataFrame.

df = pd.DataFrame(arr, columns=column_names)
df.index = prefixes
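
Once the rows carry labels, they can be looked up by label rather than by position. The following is a minimal sketch, assuming the same country/capital data as above; the names df_a, df_b, same and france_capital are made up for illustration.

```python
import pandas as pd

arr = [["UK", "London"], ["France", "Paris"], ["Italy", "Rome"]]
prefixes = ["+44", "+33", "+39"]

# Option A: pass the index to the constructor.
df_a = pd.DataFrame(arr, columns=["country", "capital"], index=prefixes)

# Option B: assign the index after construction.
df_b = pd.DataFrame(arr, columns=["country", "capital"])
df_b.index = prefixes

# Both routes should produce the same DataFrame.
same = df_a.equals(df_b)

# Label-based lookup: the "capital" value of the row labelled "+33".
france_capital = df_a.loc["+33", "capital"]
print(same, france_capital)
```

Here .loc selects by index label, so "+33" refers to the row for France regardless of where it sits in the table.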

Method 2: Create a DataFrame from a dictionary.

data_dict = {"country": ["UK", "France", "Italy"], "capital": ["London", "Paris", "Rome"]}
df = pd.DataFrame(data_dict)
print(df)
##   country capital
## 0      UK  London
## 1  France   Paris
## 2   Italy    Rome
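
The dictionary keys become the column names. As a small sketch (the variable length_error is only there for illustration), the columns argument can be used to pick and order the columns, and lists of unequal length are rejected.

```python
import pandas as pd

data_dict = {"country": ["UK", "France", "Italy"],
             "capital": ["London", "Paris", "Rome"]}

# `columns=` selects which keys to use and in what order.
df = pd.DataFrame(data_dict, columns=["capital", "country"])
print(df)

# Every list must have the same length, otherwise pandas raises ValueError.
try:
    pd.DataFrame({"country": ["UK", "France"], "capital": ["London"]})
    length_error = False
except ValueError:
    length_error = True
print(length_error)
```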

Method 3: Create a DataFrame from a dictionary of Series.

Of course, we can also construct a DataFrame from a bunch of Series.

country_series = pd.Series(np.array(["UK", "France", "Italy"]))
capital_series = pd.Series(np.array(["London", "Paris", "Rome"]))
data_dict = {"country": country_series, "capital": capital_series}
df = pd.DataFrame(data_dict)
print(df)
##   country capital
## 0      UK  London
## 1  France   Paris
## 2   Italy    Rome

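If you provide a custom index for each Series, then the index of the resulting DataFrame will be the union of the indexes of the individual Series. Where a Series has no value for a label, the DataFrame holds NaN, which is also why the integer columns below are displayed as floats.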

data = {"one": pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"]), 
        "two": pd.Series([5, 6, 7, 8, 9], index=["a", "b", "c", "e", "f"])
       }
df = pd.DataFrame(data)
print(df)
##    one  two
## a  1.0  5.0
## b  2.0  6.0
## c  3.0  7.0
## d  4.0  NaN
## e  NaN  8.0
## f  NaN  9.0
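
To see what the alignment did, it can help to inspect the result. This is a rough sketch using the same two Series; the names missing, dtypes and complete are chosen here for illustration only.

```python
import pandas as pd

data = {
    "one": pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"]),
    "two": pd.Series([5, 6, 7, 8, 9], index=["a", "b", "c", "e", "f"]),
}
df = pd.DataFrame(data)

# Count the gaps that index alignment introduced in each column.
missing = df.isna().sum()

# NaN is a float, so the integer columns were upcast to float64.
dtypes = df.dtypes

# Keep only the rows whose label appears in both Series.
complete = df.dropna()

print(missing, dtypes, complete, sep="\n\n")
```

df.dropna() is one way to get back to the labels shared by every Series, at the cost of discarding the rows that are only partly filled.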

Method 4: Create a DataFrame from a CSV file

Assuming you have a CSV file called data.csv:

code,country,capital
+44,UK,London
+33,France,Paris
+39,Italy,Rome

Use pd.read_csv() to load a DataFrame from the file.

df = pd.read_csv("data.csv") 
print(df)
##    code country capital
## 0    44      UK  London
## 1    33  France   Paris
## 2    39   Italy    Rome

Oops, the function was too smart and interpreted the codes as integers (we lost the + signs!). No need to worry, this can be fixed by getting pd.read_csv() to read the data in as strings.

df = pd.read_csv("data.csv", dtype=str) 
print(df)
##    code country capital
## 0   +44      UK  London
## 1   +33  France   Paris
## 2   +39   Italy    Rome
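
Passing dtype=str affects every column. The dtype argument also accepts a dictionary mapping column names to types, so only the code column needs to be forced to string. A minimal sketch, assuming the same file contents as above; io.StringIO stands in for data.csv so that the example runs without a file on disk.

```python
import io

import pandas as pd

# Stand-in for data.csv (same contents as the file shown above).
csv_text = (
    "code,country,capital\n"
    "+44,UK,London\n"
    "+33,France,Paris\n"
    "+39,Italy,Rome\n"
)

# Only "code" is read as string; the other columns are inferred as usual.
df = pd.read_csv(io.StringIO(csv_text), dtype={"code": str})
codes = df["code"].tolist()
print(codes)
```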

If you want code to act as the index, tell pd.read_csv() to use column 0 as the index! Note that without dtype=str the codes are once again parsed as integers.

df = pd.read_csv("data.csv", index_col=0)
print(df)
##      country capital
## code
## 44        UK  London
## 33    France   Paris
## 39     Italy    Rome
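
The index shown above has lost its + signs because the codes were parsed as integers. One possible way to keep them, sketched here rather than prescribed, is to read the code column as a string first and then move it into the index with set_index(). Again io.StringIO stands in for data.csv, and the names labels and capital are illustrative.

```python
import io

import pandas as pd

# Stand-in for data.csv (same contents as the file shown above).
csv_text = (
    "code,country,capital\n"
    "+44,UK,London\n"
    "+33,France,Paris\n"
    "+39,Italy,Rome\n"
)

# Read "code" as a string, then use that column as the index.
df = pd.read_csv(io.StringIO(csv_text), dtype={"code": str}).set_index("code")

labels = df.index.tolist()
capital = df.loc["+39", "capital"]
print(labels, capital)
```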

Method 5: Create a DataFrame from a JSON file

Assume that you have a JSON file called data.json:

{"country": ["UK", "France", "Italy"],
 "capital":["London", "Paris", "Rome"],
 "code": ["+44", "+33", "+39"]}

You can load a DataFrame from the JSON file with pd.read_json().

df = pd.read_json("data.json") 
df.set_index("code", inplace=True) 
print(df)
##      country capital
## code
## 44        UK  London
## 33    France   Paris
## 39     Italy    Rome

df.set_index() sets the DataFrame index to an existing column.

You can also tell pd.read_json() which JSON layout to expect with the orient keyword argument. There are several possible formats (for example "records", "index" or "split"), and I will not go through them all here. Feel free to check them out in the official documentation yourself!
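
As an illustration of orient, here is a small sketch that reads record-oriented JSON, i.e. a list with one object per row. The JSON text is built in memory with json.dumps() and wrapped in io.StringIO, so no file is needed; the names records and json_text are made up for this example.

```python
import io
import json

import pandas as pd

# Record-oriented data: one dictionary per row.
records = [
    {"country": "UK", "capital": "London"},
    {"country": "France", "capital": "Paris"},
    {"country": "Italy", "capital": "Rome"},
]
json_text = json.dumps(records)

# orient="records" tells pandas to expect a list of row objects.
df = pd.read_json(io.StringIO(json_text), orient="records")
print(df)
```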