Creating DataFrames
Now let us take a look at how to create DataFrames.
There are many different ways to do this.
Method 1: Create a DataFrame from a list or NumPy array.
import numpy as np
import pandas as pd

arr = [["UK", "London"], ["France", "Paris"], ["Italy", "Rome"]]
df = pd.DataFrame(arr, columns=["country", "capital"])
print(df)
## country capital
## 0 UK London
## 1 France Paris
## 2 Italy Rome
# A NumPy array (with NumPy column names) gives the same result
arr = np.array([["UK", "London"], ["France", "Paris"], ["Italy", "Rome"]])
column_names = np.array(["country", "capital"])
df = pd.DataFrame(arr, columns=column_names)
print(df)
## country capital
## 0 UK London
## 1 France Paris
## 2 Italy Rome
As with a Series, you can also provide a custom index (row labels).
prefixes = ["+44", "+33", "+39"]
df = pd.DataFrame(arr, columns=column_names, index=prefixes)
print(df)
## country capital
## +44 UK London
## +33 France Paris
## +39 Italy Rome
You can also assign the index after creating the DataFrame.
df = pd.DataFrame(arr, columns=column_names)
df.index = prefixes
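A custom index is mostly useful for looking rows up by label. The following is a minimal sketch that reuses the data above; the variable name row is only illustrative, and .loc is the standard pandas label-based accessor.

```python
import pandas as pd

arr = [["UK", "London"], ["France", "Paris"], ["Italy", "Rome"]]
prefixes = ["+44", "+33", "+39"]

df = pd.DataFrame(arr, columns=["country", "capital"])
# The new index needs exactly one label per row.
df.index = prefixes

# Look up a row by its label rather than by its position.
row = df.loc["+33"]
print(row["capital"])
```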
Method 2: Create a DataFrame from a dictionary of lists.
data_dict = {"country": ["UK", "France", "Italy"], "capital": ["London", "Paris", "Rome"]}
df = pd.DataFrame(data_dict)
print(df)
## country capital
## 0 UK London
## 1 France Paris
## 2 Italy Rome
Method 3: Create a DataFrame from a dictionary of Series.
Of course, we can also construct a DataFrame from a bunch of Series.
country_series = pd.Series(np.array(["UK", "France", "Italy"]))
capital_series = pd.Series(np.array(["London", "Paris", "Rome"]))
data_dict = {"country": country_series, "capital": capital_series}
df = pd.DataFrame(data_dict)
print(df)
## country capital
## 0 UK London
## 1 France Paris
## 2 Italy Rome
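If you provide a custom index for each Series, the index of the resulting DataFrame is the union of the indexes of the individual Series. Labels that are missing from a Series are filled with NaN, which is why the integer values below are displayed as floats.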
data = {
    "one": pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"]),
    "two": pd.Series([5, 6, 7, 8, 9], index=["a", "b", "c", "e", "f"]),
}
df = pd.DataFrame(data)
print(df)
## one two
## a 1.0 5.0
## b 2.0 6.0
## c 3.0 7.0
## d 4.0 NaN
## e NaN 8.0
## f NaN 9.0
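If the NaN rows are not wanted, there are at least two simple options. The sketch below assumes you either know the labels you want in advance (passed through the index argument of pd.DataFrame) or are happy to drop incomplete rows afterwards with dropna(). The names shared and complete are only illustrative.

```python
import pandas as pd

data = {
    "one": pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"]),
    "two": pd.Series([5, 6, 7, 8, 9], index=["a", "b", "c", "e", "f"]),
}

# Option 1: ask only for the labels that both Series share.
shared = pd.DataFrame(data, index=["a", "b", "c"])
print(shared)

# Option 2: build the union first, then drop rows that contain NaN.
complete = pd.DataFrame(data).dropna()
print(complete)
```

Both options end up with the rows a, b and c. The first avoids creating the missing values in the first place.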
Method 4: Create a DataFrame from a CSV file.
Assume that you have a CSV file called data.csv:
code,country,capital
+44,UK,London
+33,France,Paris
+39,Italy,Rome
Use pd.read_csv() to load a DataFrame from the file.
df = pd.read_csv("data.csv")
print(df)
## code country capital
## 0 44 UK London
## 1 33 France Paris
## 2 39 Italy Rome
Oops, the function was too smart and interpreted the codes as integers (we lost the + signs!). No need to worry: this can be fixed by telling pd.read_csv() to read every column as a string.
df = pd.read_csv("data.csv", dtype=str)
print(df)
## code country capital
## 0 +44 UK London
## 1 +33 France Paris
## 2 +39 Italy Rome
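Passing dtype=str turns every column into strings, which may be more than you want if the file also holds real numbers. A possible alternative is to pass a dictionary so that only the code column is forced to str. The sketch below uses io.StringIO as a stand-in for data.csv so that it runs without a file on disk; csv_text is a name made up for this example.

```python
import io

import pandas as pd

# Stand-in for the contents of data.csv.
csv_text = (
    "code,country,capital\n"
    "+44,UK,London\n"
    "+33,France,Paris\n"
    "+39,Italy,Rome\n"
)

# Only the "code" column is forced to str; other columns are still inferred.
df = pd.read_csv(io.StringIO(csv_text), dtype={"code": str})
print(df["code"].tolist())
```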
If you want code to act as the index, tell pd.read_csv() to use column 0 as the index! Note that dtype=str is not passed here, so the codes are parsed as integers again.
df = pd.read_csv("data.csv", index_col=0)
print(df)
## country capital
## code
## 44 UK London
## 33 France Paris
## 39 Italy Rome
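To keep the + signs in the index as well, one option is to read the code column as a string first and only then promote it to the index with set_index(). This is a sketch rather than the only way to do it; io.StringIO and csv_text are stand-ins for the real data.csv.

```python
import io

import pandas as pd

# Stand-in for the contents of data.csv.
csv_text = (
    "code,country,capital\n"
    "+44,UK,London\n"
    "+33,France,Paris\n"
    "+39,Italy,Rome\n"
)

# Read "code" as text, then use that column as the index.
df = pd.read_csv(io.StringIO(csv_text), dtype={"code": str}).set_index("code")
print(df)

# Label-based lookup now works with the original string codes.
print(df.loc["+44", "capital"])
```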
Method 5: Create a DataFrame from a JSON file.
Assume that you have a JSON file called data.json:
{"country": ["UK", "France", "Italy"],
"capital":["London", "Paris", "Rome"],
"code": ["+44", "+33", "+39"]}
You can load a DataFrame from the JSON file with pd.read_json().
df = pd.read_json("data.json")
df.set_index("code", inplace=True)
print(df)
## country capital
## code
## 44 UK London
## 33 France Paris
## 39 Italy Rome
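df.set_index() uses an existing column as the DataFrame index. With inplace=True it modifies df directly instead of returning a new DataFrame. As in the CSV example, the + signs are missing from the output above because the codes were read as numbers.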
You can also specify what kind of JSON string format pd.read_json() is expecting to read with the orient keyword argument. There are many different possible formats, and I will not list them here. Feel free to check them out in the official documentation yourself!
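As a small illustration of orient, the sketch below reads JSON laid out as a list of records (one object per row) rather than one list per column. The JSON text is made up for this example and is passed through io.StringIO instead of a file.

```python
import io

import pandas as pd

# Hypothetical JSON in "records" layout: a list with one object per row.
records_json = """
[
  {"country": "UK", "capital": "London"},
  {"country": "France", "capital": "Paris"},
  {"country": "Italy", "capital": "Rome"}
]
"""

df = pd.read_json(io.StringIO(records_json), orient="records")
print(df)
```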