4.4. Transform a DataFrame#
This section covers some pandas methods to transform a DataFrame into another form using methods such as aggregation and groupby.
4.4.1. pandas.DataFrame.agg: Aggregate over Columns or Rows Using Multiple Operations#
If you want to aggregate over columns or rows using one or more operations, try pd.DataFrame.agg
.
from collections import Counter
import pandas as pd
def count_two(nums: list):
return Counter(nums)[2]
df = pd.DataFrame({"coll": [1, 3, 5], "col2": [2, 4, 6]})
df.agg(["sum", count_two])
coll | col2 | |
---|---|---|
sum | 9 | 12 |
count_two | 0 | 1 |
4.4.2. pandas.DataFrame.agg: Apply Different Aggregations to Different Columns#
If you want to apply different aggregations to different columns, insert a dictionary of column and aggregation methods to the pd.DataFrame.agg
method.
import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 3, 4, 5]})
df.agg({"a": ["sum", "mean"], "b": ["min", "max"]})
a | b | |
---|---|---|
sum | 10.0 | NaN |
mean | 2.5 | NaN |
min | NaN | 2.0 |
max | NaN | 5.0 |
4.4.3. Group DataFrame’s Rows into a List Using groupby#
It is common to use groupby
to get the statistics of rows in the same group such as count, mean, median, etc. If you want to group rows into a list instead, use lambda x: list(x)
.
import pandas as pd
df = pd.DataFrame(
{
"col1": [1, 2, 3, 4, 3],
"col2": ["a", "a", "b", "b", "c"],
"col3": ["d", "e", "f", "g", "h"],
}
)
df.groupby(["col2"]).agg({"col1": "mean", "col3": lambda x: list(x)})
col1 | col3 | |
---|---|---|
col2 | ||
a | 1.5 | [d, e] |
b | 3.5 | [f, g] |
c | 3.0 | [h] |
4.4.4. Get the N Largest Values for Each Category in a DataFrame#
If you want to get the n
largest values for each category in a pandas DataFrame, use the combination of groupby
and nlargest
.
import pandas as pd
df = pd.DataFrame({"type": ["a", "a", "a", "b", "b"], "value": [1, 2, 3, 1, 2]})
# Get n largest values for each type
(
df.groupby("type")
.apply(lambda df_: df_.nlargest(2, "value"))
.reset_index(drop=True)
)
type | value | |
---|---|---|
0 | a | 3 |
1 | a | 2 |
2 | b | 2 |
3 | b | 1 |
4.4.5. Assign Name to a Pandas Aggregation#
By default, aggregating a column returns the name of that column.
import pandas as pd
df = pd.DataFrame({"size": ["S", "S", "M", "L"], "price": [2, 3, 4, 5]})
df.groupby('size').agg({'price': 'mean'})
price | |
---|---|
size | |
L | 5.0 |
M | 4.0 |
S | 2.5 |
If you want to assign a new name to an aggregation, add name = (column, agg_method)
to agg
.
df.groupby('size').agg(mean_price=('price', 'mean'))
mean_price | |
---|---|
size | |
L | 5.0 |
M | 4.0 |
S | 2.5 |
4.4.6. pandas.pivot_table: Turn Your DataFrame Into a Pivot Table#
A pivot table is useful to summarize and analyze the patterns in your data. If you want to turn your DataFrame into a pivot table, use pandas.pivot_table
.
import pandas as pd
df = pd.DataFrame(
{
"item": ["apple", "apple", "apple", "apple", "apple"],
"size": ["small", "small", "large", "large", "large"],
"location": ["Walmart", "Aldi", "Walmart", "Aldi", "Aldi"],
"price": [3, 2, 4, 3, 2.5],
}
)
df
item | size | location | price | |
---|---|---|---|---|
0 | apple | small | Walmart | 3.0 |
1 | apple | small | Aldi | 2.0 |
2 | apple | large | Walmart | 4.0 |
3 | apple | large | Aldi | 3.0 |
4 | apple | large | Aldi | 2.5 |
pivot = pd.pivot_table(
df, values="price", index=["item", "size"], columns=["location"], aggfunc="mean"
)
pivot
location | Aldi | Walmart | |
---|---|---|---|
item | size | ||
apple | large | 2.75 | 4.0 |
small | 2.00 | 3.0 |
4.4.7. DataFrame.groupby.sample: Get a Random Sample of Items from Each Category in a Column#
If you want to get a random sample of items from each category in a column, use pandas.DataFrame.groupby.sample
.This method is useful when you want to get a subset of a DataFrame while keeping all categories in a column.
import pandas as pd
df = pd.DataFrame({"col1": ["a", "a", "b", "c", "c", "d"], "col2": [4, 5, 6, 7, 8, 9]})
df.groupby("col1").sample(n=1)
col1 | col2 | |
---|---|---|
0 | a | 4 |
2 | b | 6 |
4 | c | 8 |
5 | d | 9 |
To get 2 items from each category, use n=2
.
df = pd.DataFrame(
{
"col1": ["a", "a", "b", "b", "b", "c", "c", "d", "d"],
"col2": [4, 5, 6, 7, 8, 9, 10, 11, 12],
}
)
df.groupby("col1").sample(n=2)
col1 | col2 | |
---|---|---|
0 | a | 4 |
1 | a | 5 |
4 | b | 8 |
2 | b | 6 |
5 | c | 9 |
6 | c | 10 |
8 | d | 12 |
7 | d | 11 |
4.4.8. size: Compute the Size of Each Group#
If you want to get the count of elements in one column, use groupby
and count
.
import pandas as pd
df = pd.DataFrame(
{"col1": ["a", "b", "b", "c", "c", "d"], "col2": ["S", "S", "M", "L", "L", "L"]}
)
df.groupby(['col1']).count()
col2 | |
---|---|
col1 | |
a | 1 |
b | 2 |
c | 2 |
d | 1 |
If you want to get the size of groups composed of 2 or more columns, use groupby
and size
instead.
df.groupby(['col1', 'col2']).size()
col1 col2
a S 1
b M 1
S 1
c L 2
d L 1
dtype: int64
4.4.9. pandas.melt: Unpivot a DataFrame#
If you want to unpivot a DataFrame from wide to long format, use pandas.melt.
For example, you can use pandas.melt to turn multiple columns (Aldi
, Walmart
, Costco
) into values of one column (store
).
import pandas as pd
df = pd.DataFrame(
{"fruit": ["apple", "orange"], "Aldi": [1, 2], "Walmart": [3, 4], "Costco": [5, 6]}
)
df
fruit | Aldi | Walmart | Costco | |
---|---|---|---|---|
0 | apple | 1 | 3 | 5 |
1 | orange | 2 | 4 | 6 |
df.melt(id_vars=["fruit"], value_vars=["Aldi", "Walmart", "Costco"], var_name='store')
fruit | store | value | |
---|---|---|---|
0 | apple | Aldi | 1 |
1 | orange | Aldi | 2 |
2 | apple | Walmart | 3 |
3 | orange | Walmart | 4 |
4 | apple | Costco | 5 |
5 | orange | Costco | 6 |
4.4.10. pandas.crosstab: Create a Cross Tabulation#
Cross tabulation allows you to analyze the relationship between multiple variables. To turn a pandas DataFrame into a cross tabulation, use pandas.crosstab
.
import pandas as pd
network = [
("Ben", "Khuyen"),
("Ben", "Josh"),
("Lauren", "Thinh"),
("Lauren", "Khuyen"),
("Khuyen", "Josh"),
]
# Create a dataframe of the network
friends1 = pd.DataFrame(network, columns=["person1", "person2"])
# Reverse the order of the columns
friends2 = pd.DataFrame(network, columns=["person2", "person1"])
# Create a symmetric dataframe
friends = pd.concat([friends1, friends2])
# Create a cross tabulation
pd.crosstab(friends.person1, friends.person2)
person2 | Ben | Josh | Khuyen | Lauren | Thinh |
---|---|---|---|---|---|
person1 | |||||
Ben | 0 | 1 | 1 | 0 | 0 |
Josh | 1 | 0 | 1 | 0 | 0 |
Khuyen | 1 | 1 | 0 | 1 | 0 |
Lauren | 0 | 0 | 1 | 0 | 1 |
Thinh | 0 | 0 | 0 | 1 | 0 |
4.4.11. Stack Columns into Rows in Pandas#
If you want to stack the columns into rows, use DataFrame.stack()
.
import pandas as pd
# Create a DataFrame with two columns and a MultiIndex
df = pd.DataFrame(
{"A": [1, 2, 3], "B": [4, 5, 6]}, index=["x", "y", "z"]
)
# Original DataFrame
print("Original DataFrame:")
print(df)
Original DataFrame:
A B
x 1 4
y 2 5
z 3 6
# Stacked DataFrame
stacked_df = df.stack()
print("\nStacked DataFrame:")
print(stacked_df)
Stacked DataFrame:
x A 1
B 4
y A 2
B 5
z A 3
B 6
dtype: int64
4.4.12. Turn a pandas Series into a pandas DataFrame#
If you want to turn a pandas Series into a pandas DataFrame, use str.get_dummies()
.
import pandas as pd
s = pd.Series(["a", "b", "a,b", "a,c"])
s
0 a
1 b
2 a,b
3 a,c
dtype: object
# Split the string by ","
s.str.get_dummies(sep=",")
a | b | c | |
---|---|---|---|
0 | 1 | 0 | 0 |
1 | 0 | 1 | 0 |
2 | 1 | 1 | 0 |
3 | 1 | 0 | 1 |