Creating DataFrames
Now let us take a look at how to create DataFrames.
There are many different ways to do this.
Method 1: Create a DataFrame from a list or NumPy array.
import numpy as np
import pandas as pd
arr = [["UK", "London"], ["France", "Paris"], ["Italy", "Rome"]]
df = pd.DataFrame(arr, columns=["country", "capital"])
print(df)
## country capital
## 0 UK London
## 1 France Paris
## 2 Italy Rome
# This should also give the same results
arr = np.array([["UK", "London"], ["France", "Paris"], ["Italy", "Rome"]])
column_names = np.array(["country", "capital"])
df = pd.DataFrame(arr, columns=column_names)
print(df)
## country capital
## 0 UK London
## 1 France Paris
## 2 Italy Rome
As with Series, you can also provide a custom index (axis labels).
prefixes = ["+44", "+33", "+39"]
df = pd.DataFrame(arr, columns=column_names, index=prefixes)
print(df)
## country capital
## +44 UK London
## +33 France Paris
## +39 Italy Rome
You can also assign the index after you have created the DataFrame.
df = pd.DataFrame(arr, columns=column_names)
df.index = prefixes
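One reason to set a custom index is label-based lookup. The following is a minimal sketch, reusing the same country data as above, that selects rows by their dialing prefix with df.loc. The variable names capital_of_uk and italy_row are only illustrative.

```python
import pandas as pd

arr = [["UK", "London"], ["France", "Paris"], ["Italy", "Rome"]]
df = pd.DataFrame(arr, columns=["country", "capital"])
df.index = ["+44", "+33", "+39"]  # must supply exactly one label per row

# Look up a single cell by row label and column name
capital_of_uk = df.loc["+44", "capital"]
print(capital_of_uk)

# Look up an entire row; the result is a Series keyed by column name
italy_row = df.loc["+39"]
print(italy_row["country"])
```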
Method 2: Create a DataFrame from a dictionary.
data_dict = {"country": ["UK", "France", "Italy"], "capital": ["London", "Paris", "Rome"]}
df = pd.DataFrame(data_dict)
print(df)
## country capital
## 0 UK London
## 1 France Paris
## 2 Italy Rome
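The dictionary constructor also accepts the index and columns arguments used in Method 1. As a small sketch (the prefixes list is carried over from the earlier example), columns can be used to choose the column order while index supplies the row labels:

```python
import pandas as pd

data_dict = {"country": ["UK", "France", "Italy"], "capital": ["London", "Paris", "Rome"]}
prefixes = ["+44", "+33", "+39"]

# Dictionary keys become column names; `columns` selects and orders them,
# and `index` supplies the row labels.
df = pd.DataFrame(data_dict, columns=["capital", "country"], index=prefixes)
print(df)
```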
Method 3: Create a DataFrame from a dictionary of Series.
Of course, we can also construct a DataFrame from a collection of Series.
country_series = pd.Series(np.array(["UK", "France", "Italy"]))
capital_series = pd.Series(np.array(["London", "Paris", "Rome"]))
data_dict = {"country": country_series, "capital": capital_series}
df = pd.DataFrame(data_dict)
print(df)
## country capital
## 0 UK London
## 1 France Paris
## 2 Italy Rome
If you provide a custom index for each Series, the index of the resulting DataFrame will be the union of the indexes of the individual Series. Wherever a Series has no value for a given label, the entry is filled with NaN.
data = {
    "one": pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"]),
    "two": pd.Series([5, 6, 7, 8, 9], index=["a", "b", "c", "e", "f"]),
}
df = pd.DataFrame(data)
print(df)
## one two
## a 1.0 5.0
## b 2.0 6.0
## c 3.0 7.0
## d 4.0 NaN
## e NaN 8.0
## f NaN 9.0
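Note that the integers in the output above are printed as floats. NaN is a floating-point value, so pandas upcasts a column to float64 as soon as it has to hold a missing entry. The sketch below checks this and, as one possible workaround, converts the columns to the nullable Int64 dtype. Whether that is appropriate depends on your use case, and df_int is just an illustrative name.

```python
import pandas as pd

data = {
    "one": pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"]),
    "two": pd.Series([5, 6, 7, 8, 9], index=["a", "b", "c", "e", "f"]),
}
df = pd.DataFrame(data)

# The row labels are the union of both Series indexes
print(list(df.index))

# The missing entries force both columns to float64
print(df.dtypes)

# Optional: the nullable integer dtype keeps whole numbers and marks gaps as <NA>
df_int = df.astype("Int64")
print(df_int)
```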
Method 4: Create a DataFrame from a CSV file.
Assume you have a CSV file called data.csv:
code,country,capital
+44,UK,London
+33,France,Paris
+39,Italy,Rome
Use pd.read_csv() to load a DataFrame from the file.
df = pd.read_csv("data.csv")
print(df)
## code country capital
## 0 44 UK London
## 1 33 France Paris
## 2 39 Italy Rome
Oops, the function was too smart and interpreted the code column as integers (we lost the + signs!). No need to worry: this can be fixed by telling pd.read_csv() to read the data in as strings.
df = pd.read_csv("data.csv", dtype=str)
print(df)
## code country capital
## 0 +44 UK London
## 1 +33 France Paris
## 2 +39 Italy Rome
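Passing dtype=str turns every column into strings, which is harmless here but would also affect any numeric columns in the file. pd.read_csv() also accepts a dictionary that maps column names to dtypes. The sketch below assumes a hypothetical extra column named rank (not part of the original data.csv, with placeholder values) and uses io.StringIO in place of a file on disk:

```python
from io import StringIO

import pandas as pd

# Same data as data.csv plus a hypothetical numeric column `rank`
# (placeholder values, added only to show that it stays numeric).
csv_text = """code,country,capital,rank
+44,UK,London,1
+33,France,Paris,2
+39,Italy,Rome,3
"""

# Only `code` is forced to string; the other columns are inferred as usual
df = pd.read_csv(StringIO(csv_text), dtype={"code": str})
print(df)
print(df.dtypes)
```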
If you want code to act as the index, tell pd.read_csv() to use column 0 as the index.
df = pd.read_csv("data.csv", index_col=0)
print(df)
## country capital
## code
## 44 UK London
## 33 France Paris
## 39 Italy Rome
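In the output above the + signs are missing again, because this call no longer asks for strings. One way to get both behaviours, sketched here with io.StringIO standing in for data.csv, is to read code as a string and then promote it to the index:

```python
from io import StringIO

import pandas as pd

csv_text = """code,country,capital
+44,UK,London
+33,France,Paris
+39,Italy,Rome
"""

# Read `code` as text first, then make it the index
df = pd.read_csv(StringIO(csv_text), dtype={"code": str}).set_index("code")
print(df)
```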
Method 5: Create a DataFrame from a JSON file.
Assume that you have a JSON file called data.json:
{"country": ["UK", "France", "Italy"],
 "capital": ["London", "Paris", "Rome"],
 "code": ["+44", "+33", "+39"]}
You can load a DataFrame from the JSON file with pd.read_json().
df = pd.read_json("data.json")
df.set_index("code", inplace=True)
print(df)
## country capital
## code
## 44 UK London
## 33 France Paris
## 39 Italy Rome
df.set_index() sets the DataFrame index to an existing column. With inplace=True, it modifies df directly instead of returning a new DataFrame.
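As with the CSV example, the + signs were dropped in the output above, because pd.read_json() infers column types by default. A possible fix, sketched here with io.StringIO standing in for data.json, is to pass a dtype mapping for the code column:

```python
from io import StringIO

import pandas as pd

json_text = """{"country": ["UK", "France", "Italy"],
 "capital": ["London", "Paris", "Rome"],
 "code": ["+44", "+33", "+39"]}"""

# Keep `code` as text instead of letting pandas infer a numeric type
df = pd.read_json(StringIO(json_text), dtype={"code": str})

# Here set_index() returns a new DataFrame rather than using inplace=True
df = df.set_index("code")
print(df)
```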
You can also specify which JSON string format pd.read_json() should expect with the orient keyword argument. There are several possible formats, and I will not list them all here. Feel free to check them out in the official documentation yourself!
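As one illustration of orient, the sketch below reads the same table from a list of row objects using orient="records". The JSON text is an assumed alternative layout for the country data, not the data.json shown earlier.

```python
from io import StringIO

import pandas as pd

# Assumed "records" layout: one JSON object per row
records_json = """[
  {"country": "UK", "capital": "London"},
  {"country": "France", "capital": "Paris"},
  {"country": "Italy", "capital": "Rome"}
]"""

df = pd.read_json(StringIO(records_json), orient="records")
print(df)
```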