Create a New DataFrame Using Existing DataFrame
Contents
4.4. Create a New DataFrame Using Existing DataFrame¶
This section covers some pandas methods to use an existing DataFrame to create a new DataFrame with different functionalities.
4.4.1. Fix Unnamed:0 When Reading a CSV in pandas¶
Sometimes, when reading a CSV in pandas, you will get an Unnamed:0
column.
import pandas as pd
df = pd.read_csv('data2.csv')
print(df)
Unnamed: 0 a b
0 0 1 4
1 1 2 5
2 2 3 6
To fix this, add index_col=0
to pandas.read_csv
.
df = pd.read_csv('data2.csv', index_col=0)
print(df)
a b
0 1 4
1 2 5
2 3 6
4.4.2. Read HTML Tables Using Pandas¶
If you want to quickly extract a table on a website and turn it into a Pandas DataFrame, use pd.read_html
. In the code below, I extracted the table from a Wikipedia page in one line of code.
import pandas as pd
df = pd.read_html('https://en.wikipedia.org/wiki/Poverty')
df[1]
Region | $1 per day | $1.25 per day[94] | $1.90 per day[95] | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Region | 1990 | 2002 | 2004 | 1981 | 2008 | 1981 | 1990 | 2000 | 2010 | 2015 | 2018 | |
0 | East Asia and Pacific | 15.4% | 12.3% | 9.1% | 77.2% | 14.3% | 80.2% | 60.9% | 34.8% | 10.8% | 2.1% | 1.2% |
1 | Europe and Central Asia | 3.6% | 1.3% | 1.0% | 1.9% | 0.5% | — | — | 7.3% | 2.4% | 1.5% | 1.1% |
2 | Latin America and the Caribbean | 9.6% | 9.1% | 8.6% | 11.9% | 6.5% | 13.7% | 15.5% | 12.7% | 6% | 3.7% | 3.7% |
3 | Middle East and North Africa | 2.1% | 1.7% | 1.5% | 9.6% | 2.7% | — | 6.5% | 3.5% | 2% | 4.3% | 7% |
4 | South Asia | 35.0% | 33.4% | 30.8% | 61.1% | 36% | 58% | 49.1% | — | 26% | — | — |
5 | Sub-Saharan Africa | 46.1% | 42.6% | 41.1% | 51.5% | 47.5% | — | 54.9% | 58.4% | 46.6% | 42.3% | 40.4% |
6 | World | — | — | — | 52.2% | 22.4% | 42.7% | 36.2% | 27.8% | 16% | 10.1% | — |
4.4.3. DataFrame.copy(): Make a Copy of a DataFrame¶
Have you ever tried to make a copy of a DataFrame using =
? You will not get a copy but a reference to the original DataFrame. Thus, changing the new DataFrame will also change the original DataFrame.
import pandas as pd
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
df
col1 | col2 | |
---|---|---|
0 | 1 | 4 |
1 | 2 | 5 |
2 | 3 | 6 |
df2 = df
df2['col1'] = [7, 8, 9]
df
col1 | col2 | |
---|---|---|
0 | 7 | 4 |
1 | 8 | 5 |
2 | 9 | 6 |
A better way to make a copy is to use df.copy()
. Now, changing the copy will not affect the original DataFrame.
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
# Create a copy of the original DataFrame
df3 = df.copy()
# Change the value of the copy
df3['col1'] = [7, 8, 9]
# Check if the original DataFrame has been changed
df
col1 | col2 | |
---|---|---|
0 | 1 | 4 |
1 | 2 | 5 |
2 | 3 | 6 |
4.4.4. pandas.DataFrame.agg: Aggregate over Columns or Rows Using Multiple Operations¶
If you want to aggregate over columns or rows using one or more operations, try pd.DataFrame.agg
.
from collections import Counter
import pandas as pd
def count_two(nums: list):
return Counter(nums)[2]
df = pd.DataFrame({"coll": [1, 3, 5], "col2": [2, 4, 6]})
df.agg(["sum", count_two])
coll | col2 | |
---|---|---|
sum | 9 | 12 |
count_two | 0 | 1 |
4.4.5. pandas.DataFrame.agg: Apply Different Aggregations to Different Columns¶
If you want to apply different aggregations to different columns, insert a dictionary of column and aggregation methods to the pd.DataFrame.agg
method.
import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 3, 4, 5]})
df.agg({"a": ["sum", "mean"], "b": ["min", "max"]})
a | b | |
---|---|---|
sum | 10.0 | NaN |
mean | 2.5 | NaN |
min | NaN | 2.0 |
max | NaN | 5.0 |
4.4.6. Group DataFrame’s Rows into a List Using groupby¶
It is common to use groupby
to get the statistics of rows in the same group such as count, mean, median, etc. If you want to group rows into a list instead, use lambda x: list(x)
.
import pandas as pd
df = pd.DataFrame(
{
"col1": [1, 2, 3, 4, 3],
"col2": ["a", "a", "b", "b", "c"],
"col3": ["d", "e", "f", "g", "h"],
}
)
df.groupby(["col2"]).agg({"col1": "mean", "col3": lambda x: list(x)})
col1 | col3 | |
---|---|---|
col2 | ||
a | 1.5 | [d, e] |
b | 3.5 | [f, g] |
c | 3.0 | [h] |
4.4.7. Assign Name to a Pandas Aggregation¶
By default, aggregating a column returns the name of that column.
import pandas as pd
df = pd.DataFrame({"size": ["S", "S", "M", "L"], "price": [2, 3, 4, 5]})
print(df.groupby('size').agg({'price': 'mean'}))
price
size
L 5.0
M 4.0
S 2.5
If you want to assign a new name to an aggregation, add name = (column, agg_method)
to agg
.
df.groupby('size').agg(mean_price=('price', 'mean'))
mean_price | |
---|---|
size | |
L | 5.0 |
M | 4.0 |
S | 2.5 |
4.4.8. pandas.pivot_table: Turn Your DataFrame Into a Pivot Table¶
A pivot table is useful to summarize and analyze the patterns in your data. If you want to turn your DataFrame into a pivot table, use pandas.pivot_table
.
import pandas as pd
df = pd.DataFrame(
{
"item": ["apple", "apple", "apple", "apple", "apple"],
"size": ["small", "small", "large", "large", "large"],
"location": ["Walmart", "Aldi", "Walmart", "Aldi", "Aldi"],
"price": [3, 2, 4, 3, 2.5],
}
)
df
item | size | location | price | |
---|---|---|---|---|
0 | apple | small | Walmart | 3.0 |
1 | apple | small | Aldi | 2.0 |
2 | apple | large | Walmart | 4.0 |
3 | apple | large | Aldi | 3.0 |
4 | apple | large | Aldi | 2.5 |
pivot = pd.pivot_table(
df, values="price", index=["item", "size"], columns=["location"], aggfunc="mean"
)
pivot
location | Aldi | Walmart | |
---|---|---|---|
item | size | ||
apple | large | 2.75 | 4.0 |
small | 2.00 | 3.0 |
4.4.9. DataFrame.groupby.sample: Get a Random Sample of Items from Each Category in a Column¶
If you want to get a random sample of items from each category in a column, use pandas.DataFrame.groupby.sample
.This method is useful when you want to get a subset of a DataFrame while keeping all categories in a column.
import pandas as pd
df = pd.DataFrame({"col1": ["a", "a", "b", "c", "c", "d"], "col2": [4, 5, 6, 7, 8, 9]})
df.groupby("col1").sample(n=1)
col1 | col2 | |
---|---|---|
0 | a | 4 |
2 | b | 6 |
4 | c | 8 |
5 | d | 9 |
To get 2 items from each category, use n=2
.
df = pd.DataFrame(
{
"col1": ["a", "a", "b", "b", "b", "c", "c", "d", "d"],
"col2": [4, 5, 6, 7, 8, 9, 10, 11, 12],
}
)
df.groupby("col1").sample(n=2)
col1 | col2 | |
---|---|---|
0 | a | 4 |
1 | a | 5 |
4 | b | 8 |
2 | b | 6 |
5 | c | 9 |
6 | c | 10 |
8 | d | 12 |
7 | d | 11 |
4.4.10. size: Compute the Size of Each Group¶
If you want to get the count of elements in one column, use groupby
and count
.
import pandas as pd
df = pd.DataFrame(
{"col1": ["a", "b", "b", "c", "c", "d"], "col2": ["S", "S", "M", "L", "L", "L"]}
)
df.groupby(['col1']).count()
col2 | |
---|---|
col1 | |
a | 1 |
b | 2 |
c | 2 |
d | 1 |
If you want to get the size of groups composed of 2 or more columns, use groupby
and size
instead.
df.groupby(['col1', 'col2']).size()
col1 col2
a S 1
b M 1
S 1
c L 2
d L 1
dtype: int64
4.4.11. pandas.melt: Unpivot a DataFrame¶
If you want to unpivot a DataFrame from wide to long format, use pandas.melt.
For example, you can use pandas.melt to turn multiple columns (Aldi
, Walmart
, Costco
) into values of one column (store
).
import pandas as pd
df = pd.DataFrame(
{"fruit": ["apple", "orange"], "Aldi": [1, 2], "Walmart": [3, 4], "Costco": [5, 6]}
)
df
fruit | Aldi | Walmart | Costco | |
---|---|---|---|---|
0 | apple | 1 | 3 | 5 |
1 | orange | 2 | 4 | 6 |
df.melt(id_vars=["fruit"], value_vars=["Aldi", "Walmart", "Costco"], var_name='store')
fruit | store | value | |
---|---|---|---|
0 | apple | Aldi | 1 |
1 | orange | Aldi | 2 |
2 | apple | Walmart | 3 |
3 | orange | Walmart | 4 |
4 | apple | Costco | 5 |
5 | orange | Costco | 6 |
4.4.12. pandas.crosstab: Create a Cross Tabulation¶
Cross tabulation allows you to analyze the relationship between multiple variables. To turn a pandas DataFrame into a cross tabulation, use pandas.crosstab
.
import pandas as pd
network = [
("Ben", "Khuyen"),
("Ben", "Josh"),
("Lauren", "Thinh"),
("Lauren", "Khuyen"),
("Khuyen", "Josh"),
]
# Create a dataframe of the network
friends1 = pd.DataFrame(network, columns=["person1", "person2"])
# Reverse the order of the columns
friends2 = pd.DataFrame(network, columns=["person2", "person1"])
# Create a symmetric dataframe
friends = pd.concat([friends1, friends2])
# Create a cross tabulation
print(pd.crosstab(friends.person1, friends.person2))
person2 Ben Josh Khuyen Lauren Thinh
person1
Ben 0 1 1 0 0
Josh 1 0 1 0 0
Khuyen 1 1 0 1 0
Lauren 0 0 1 0 1
Thinh 0 0 0 1 0