Combine Multiple DataFrames

4.6. Combine Multiple DataFrames#

4.6.1. pandas.DataFrame.combine_first: Update Null Elements Using Another DataFrame#

If you want to fill null values in one DataFrame with non-null values at the same locations from another DataFrame, use pandas.DataFrame.combine_first.

import pandas as pd

store1 = pd.DataFrame({"orange": [None, 5], "apple": [None, 1]})
store2 = pd.DataFrame({"orange": [1, 3], "apple": [4, 2]})

print(store1)

   orange  apple
0     NaN    NaN
1     5.0    1.0

print(store2)

   orange  apple
0       1      4
1       3      2

In the code below, the values at the first row of store1 are updated with the values at the first row of store2.

print(store1.combine_first(store2))

   orange  apple
0     1.0    4.0
1     5.0    1.0

4.6.2. df.merge: Merge DataFrame Based on Columns#

If you want to merge on DataFrame with another DataFrame based on the similarity between 2 columns, use df.merge.

import pandas as pd

df1 = pd.DataFrame({"key1": ["a", "a", "b", "b", "a"], "value": [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({"key2": ["a", "b", "c"], "value": [6, 7, 8]})

df1 

	key1	value
0	a	1
1	a	2
2	b	3
3	b	4
4	a	5

df2

	key2	value
0	a	6
1	b	7
2	c	8

df1.merge(df2, left_on="key1", right_on="key2")

	key1	value_x	key2	value_y
0	a	1	a	6
1	a	2	a	6
2	a	5	a	6
3	b	3	b	7
4	b	4	b	7

4.6.3. Enhancing Readability in DataFrame Merging With Custom Suffixes#

When merging two DataFrames with overlapping columns, the default behavior is to add suffixes “_x” and “_y” to the column names.

import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3], "val": ["A", "B", "C"]})
df2 = pd.DataFrame({"id": [2, 3, 4], "val": ["X", "Y", "Z"]})

merged_df = pd.merge(df1, df2, on="id")

print(merged_df)

   id val_x val_y
0   2     B     X
1   3     C     Y

To improve readability, you can specify custom suffixes.

merged_df = pd.merge(df1, df2, on="id", suffixes=("_df1", "_df2"))
print(merged_df)

   id val_df1 val_df2
0   2       B       X
1   3       C       Y

4.6.4. Include All Rows When Merging Two DataFrames#

df.merge only includes rows with matching values in both DataFrames. If you want to include all rows from both DataFrames, use how='outer'.

import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B'], 'v1': [1, 2]})
df2 = pd.DataFrame({'key': ['B', 'C'], 'v2': [3, 4]})

## Keep only rows with matching keys
pd.merge(df1, df2, on='key')

	key	v1	v2
0	B	2	3

## Keep all rows
pd.merge(df1, df2, on='key', how='outer')

	key	v1	v2
0	A	1.0	NaN
1	B	2.0	3.0
2	C	NaN	4.0

4.6.5. DataFrame.insert: Insert a Column Into a DataFrame at a Specified Location#

If you want to insert a column into a DataFrame at a specified location, use df.insert. In the code below, I insert column c at location 0.

import pandas as pd 

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df.insert(loc=0, column='c', value=[5, 6])

df

	c	a	b
0	5	1	3
1	6	2	4