4.4. Transform a DataFrame#

This section covers some pandas methods to transform a DataFrame into another form using methods such as aggregation and groupby.

4.4.1. pandas.DataFrame.agg: Aggregate over Columns or Rows Using Multiple Operations#

If you want to aggregate over columns or rows using one or more operations, try pd.DataFrame.agg.

from collections import Counter
import pandas as pd


def count_two(nums: list):
    return Counter(nums)[2]


df = pd.DataFrame({"coll": [1, 3, 5], "col2": [2, 4, 6]})
df.agg(["sum", count_two])
coll col2
sum 9 12
count_two 0 1

4.4.2. pandas.DataFrame.agg: Apply Different Aggregations to Different Columns#

If you want to apply different aggregations to different columns, insert a dictionary of column and aggregation methods to the pd.DataFrame.agg method.

import pandas as pd 

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 3, 4, 5]})

df.agg({"a": ["sum", "mean"], "b": ["min", "max"]})
a b
sum 10.0 NaN
mean 2.5 NaN
min NaN 2.0
max NaN 5.0

4.4.3. Group DataFrame’s Rows into a List Using groupby#

It is common to use groupby to get the statistics of rows in the same group such as count, mean, median, etc. If you want to group rows into a list instead, use lambda x: list(x).

import pandas as pd

df = pd.DataFrame(
    {
        "col1": [1, 2, 3, 4, 3],
        "col2": ["a", "a", "b", "b", "c"],
        "col3": ["d", "e", "f", "g", "h"],
    }
)

df.groupby(["col2"]).agg({"col1": "mean", "col3": lambda x: list(x)})
col1 col3
col2
a 1.5 [d, e]
b 3.5 [f, g]
c 3.0 [h]

4.4.4. Get the N Largest Values for Each Category in a DataFrame#

If you want to get the n largest values for each category in a pandas DataFrame, use the combination of groupby and nlargest.

import pandas as pd

df = pd.DataFrame({"type": ["a", "a", "a", "b", "b"], "value": [1, 2, 3, 1, 2]})

# Get n largest values for each type
(
    df.groupby("type")
    .apply(lambda df_: df_.nlargest(2, "value"))
    .reset_index(drop=True)
)
type value
0 a 3
1 a 2
2 b 2
3 b 1

4.4.5. Assign Name to a Pandas Aggregation#

By default, aggregating a column returns the name of that column.

import pandas as pd 

df = pd.DataFrame({"size": ["S", "S", "M", "L"], "price": [2, 3, 4, 5]})

df.groupby('size').agg({'price': 'mean'})
price
size
L 5.0
M 4.0
S 2.5

If you want to assign a new name to an aggregation, add name = (column, agg_method) to agg.

df.groupby('size').agg(mean_price=('price', 'mean'))
mean_price
size
L 5.0
M 4.0
S 2.5

4.4.6. pandas.pivot_table: Turn Your DataFrame Into a Pivot Table#

A pivot table is useful to summarize and analyze the patterns in your data. If you want to turn your DataFrame into a pivot table, use pandas.pivot_table.

import pandas as pd 

df = pd.DataFrame(
    {
        "item": ["apple", "apple", "apple", "apple", "apple"],
        "size": ["small", "small", "large", "large", "large"],
        "location": ["Walmart", "Aldi", "Walmart", "Aldi", "Aldi"],
        "price": [3, 2, 4, 3, 2.5],
    }
)

df
item size location price
0 apple small Walmart 3.0
1 apple small Aldi 2.0
2 apple large Walmart 4.0
3 apple large Aldi 3.0
4 apple large Aldi 2.5
pivot = pd.pivot_table(
    df, values="price", index=["item", "size"], columns=["location"], aggfunc="mean"
)
pivot
location Aldi Walmart
item size
apple large 2.75 4.0
small 2.00 3.0

4.4.7. DataFrame.groupby.sample: Get a Random Sample of Items from Each Category in a Column#

If you want to get a random sample of items from each category in a column, use pandas.DataFrame.groupby.sample.This method is useful when you want to get a subset of a DataFrame while keeping all categories in a column.

import pandas as pd 

df = pd.DataFrame({"col1": ["a", "a", "b", "c", "c", "d"], "col2": [4, 5, 6, 7, 8, 9]})
df.groupby("col1").sample(n=1)
col1 col2
0 a 4
2 b 6
4 c 8
5 d 9

To get 2 items from each category, use n=2.

df = pd.DataFrame(
    {
        "col1": ["a", "a", "b", "b", "b", "c", "c", "d", "d"],
        "col2": [4, 5, 6, 7, 8, 9, 10, 11, 12],
    }
)
df.groupby("col1").sample(n=2)
col1 col2
0 a 4
1 a 5
4 b 8
2 b 6
5 c 9
6 c 10
8 d 12
7 d 11

4.4.8. size: Compute the Size of Each Group#

If you want to get the count of elements in one column, use groupby and count.

import pandas as pd

df = pd.DataFrame(
    {"col1": ["a", "b", "b", "c", "c", "d"], "col2": ["S", "S", "M", "L", "L", "L"]}
)

df.groupby(['col1']).count()
col2
col1
a 1
b 2
c 2
d 1

If you want to get the size of groups composed of 2 or more columns, use groupby and size instead.

df.groupby(['col1', 'col2']).size()
col1  col2
a     S       1
b     M       1
      S       1
c     L       2
d     L       1
dtype: int64

4.4.9. pandas.melt: Unpivot a DataFrame#

If you want to unpivot a DataFrame from wide to long format, use pandas.melt.

For example, you can use pandas.melt to turn multiple columns (Aldi, Walmart, Costco) into values of one column (store).

import pandas as pd

df = pd.DataFrame(
    {"fruit": ["apple", "orange"], "Aldi": [1, 2], "Walmart": [3, 4], "Costco": [5, 6]}
)
df
fruit Aldi Walmart Costco
0 apple 1 3 5
1 orange 2 4 6
df.melt(id_vars=["fruit"], value_vars=["Aldi", "Walmart", "Costco"], var_name='store')
fruit store value
0 apple Aldi 1
1 orange Aldi 2
2 apple Walmart 3
3 orange Walmart 4
4 apple Costco 5
5 orange Costco 6

4.4.10. pandas.crosstab: Create a Cross Tabulation#

Cross tabulation allows you to analyze the relationship between multiple variables. To turn a pandas DataFrame into a cross tabulation, use pandas.crosstab.

import pandas as pd  

network = [
    ("Ben", "Khuyen"),
    ("Ben", "Josh"),
    ("Lauren", "Thinh"),
    ("Lauren", "Khuyen"),
    ("Khuyen", "Josh"),
]

# Create a dataframe of the network
friends1 = pd.DataFrame(network, columns=["person1", "person2"])

# Reverse the order of the columns
friends2 = pd.DataFrame(network, columns=["person2", "person1"])

# Create a symmetric dataframe
friends = pd.concat([friends1, friends2])

# Create a cross tabulation 
pd.crosstab(friends.person1, friends.person2)
person2 Ben Josh Khuyen Lauren Thinh
person1
Ben 0 1 1 0 0
Josh 1 0 1 0 0
Khuyen 1 1 0 1 0
Lauren 0 0 1 0 1
Thinh 0 0 0 1 0

4.4.11. Stack Columns into Rows in Pandas#

If you want to stack the columns into rows, use DataFrame.stack().

import pandas as pd

# Create a DataFrame with two columns and a MultiIndex
df = pd.DataFrame(
    {"A": [1, 2, 3], "B": [4, 5, 6]}, index=["x", "y", "z"]
)

# Original DataFrame
print("Original DataFrame:")
print(df)
Original DataFrame:
   A  B
x  1  4
y  2  5
z  3  6
# Stacked DataFrame
stacked_df = df.stack()

print("\nStacked DataFrame:")
print(stacked_df)
Stacked DataFrame:
x  A    1
   B    4
y  A    2
   B    5
z  A    3
   B    6
dtype: int64

4.4.12. Turn a pandas Series into a pandas DataFrame#

If you want to turn a pandas Series into a pandas DataFrame, use str.get_dummies().

import pandas as pd

s = pd.Series(["a", "b", "a,b", "a,c"])
s 
0      a
1      b
2    a,b
3    a,c
dtype: object
# Split the string by ","
s.str.get_dummies(sep=",")
a b c
0 1 0 0
1 0 1 0
2 1 1 0
3 1 0 1

4.4.13. Align Pandas Objects for Effective Data Manipulation#

To perform operations between two pandas objects, it’s often necessary to ensure that two pandas objects have the same row or column labels. The df.align method allows you to align two pandas objects along specified axes.

import pandas as pd

df1 = pd.DataFrame([[1, 2], [4, 5]], columns=["b", "a"])
df2 = pd.DataFrame([[4, 3, 2], [6, 5, 4]], columns=["a", "b", "c"])

print("df1:\n", df1, "\n")
print("df2:\n", df2)
df1:
    b  a
0  1  2
1  4  5 

df2:
    a  b  c
0  4  3  2
1  6  5  4
# ensure df1 and df2 have the same column labels
# by including all unique column labels from both objects
left, right = df1.align(df2, join='outer', axis=1)
print("df1:\n", left, "\n")
print("df2:\n", right)
df1:
    a  b   c
0  2  1 NaN
1  5  4 NaN 

df2:
    a  b  c
0  4  3  2
1  6  5  4