4.1. Transform Columns¶
4.1.1. pandas.DataFrame.pipe: Increase the Readability of your Code when Applying Multiple Functions to a DataFrame¶
!pip install textblob
If you want to increase the readability of your code when applying multiple functions to a DataFrame, use pands.DataFrame.pipe
method.
from textblob import TextBlob
import pandas as pd
def remove_white_space(df: pd.DataFrame):
df['text'] = df['text'].apply(lambda row: row.strip())
return df
def get_sentiment(df: pd.DataFrame):
df['sentiment'] = df['text'].apply(lambda row:
TextBlob(row).sentiment[0])
return df
df = pd.DataFrame({'text': ["It is a beautiful day today ",
" This movie is terrible"]})
df = (df.pipe(remove_white_space)
.pipe(get_sentiment)
)
df
text | sentiment | |
---|---|---|
0 | It is a beautiful day today | 0.85 |
1 | This movie is terrible | -1.00 |
4.1.2. pandas.Series.map: Change Values of a Pandas Series Using a Dictionary¶
If you want to change values of a pandas Series using a dictionary, use pd.Series.map
.
import pandas as pd
s = pd.Series(["a", "b", "c"])
s.map({"a": 1, "b": 2, "c": 3})
0 1
1 2
2 3
dtype: int64
4.1.3. Assign Values to Multiple New Columns¶
If you want to assign values to multiple new columns, instead of assigning them separately, you can do everything in one line of code with df.assign
.
In the code below, I first created col3
then use col3
to create col4
. Everything is in one line of code.
import pandas as pd
df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})
df = df.assign(col3=lambda x: x.col1 * 100 + x.col2).assign(
col4=lambda x: x.col2 * x.col3
)
df
col1 | col2 | col3 | col4 | |
---|---|---|---|---|
0 | 1 | 3 | 103 | 309 |
1 | 2 | 4 | 204 | 816 |
4.1.4. pandas.DataFrame.explode: Transform Each Element in an Iterable to a Row¶
When working with pandas DataFrame
, if you want to transform each element in an iterable to a row, use explode
.
import pandas as pd
df = pd.DataFrame({"a": [[1, 2], [4, 5]], "b": [11, 13]})
df
a | b | |
---|---|---|
0 | [1, 2] | 11 |
1 | [4, 5] | 13 |
df.explode("a")
a | b | |
---|---|---|
0 | 1 | 11 |
0 | 2 | 11 |
1 | 4 | 13 |
1 | 5 | 13 |
4.1.5. Split a String into Multiple Rows¶
Sometimes, you might have a column whose values are strings representing different items such as "1, 2"
.
import pandas as pd
df = pd.DataFrame({"a": ["1,2", "4,5"], "b": [11, 13]})
df
a | b | |
---|---|---|
0 | 1,2 | 11 |
1 | 4,5 | 13 |
To turn each string into a list, use Series.str.split()
:
# Split by comma
df.a = df.a.str.split(",")
df
a | b | |
---|---|---|
0 | [1, 2] | 11 |
1 | [4, 5] | 13 |
Now you can split elements in the list into multiple rows using explode
.
df.explode('a')
a | b | |
---|---|---|
0 | 1 | 11 |
0 | 2 | 11 |
1 | 4 | 13 |
1 | 5 | 13 |
4.1.6. pandas.cut: Bin a DataFrame’s values into Discrete Intervals¶
If you want to bin your Dataframe’s values into discrete intervals, use pandas.cut
.
import pandas as pd
df = pd.DataFrame({"a": [1, 3, 7, 11, 14, 17]})
bins = [0, 5, 10, 15, 20]
df["binned"] = pd.cut(df["a"], bins=bins)
print(df)
a binned
0 1 (0, 5]
1 3 (0, 5]
2 7 (5, 10]
3 11 (10, 15]
4 14 (10, 15]
5 17 (15, 20]
4.1.7. pandas.qcut: Bin a DataFrame’s Values into Equal-Sized Intervals¶
If you want to bin a column’s values into intervals that contain roughly the same number of elements, use pandas.qcut
.
In the example below, the values of a
are separated into 3 intervals, each of which contains 2 elements.
import pandas as pd
df = pd.DataFrame({"a": [1, 3, 7, 11, 14, 17]})
df["binned"] = pd.qcut(df["a"], q=3)
df
a | binned | |
---|---|---|
0 | 1 | (0.999, 5.667] |
1 | 3 | (0.999, 5.667] |
2 | 7 | (5.667, 12.0] |
3 | 11 | (5.667, 12.0] |
4 | 14 | (12.0, 17.0] |
5 | 17 | (12.0, 17.0] |
df.binned.value_counts()
(0.999, 5.667] 2
(5.667, 12.0] 2
(12.0, 17.0] 2
Name: binned, dtype: int64
4.1.8. Forward Fill in pandas: Use the Previous Value to Fill the Current Missing Value¶
If you want to use the previous value in a column or a row to fill the current missing value in a pandas DataFrame, use df.fillna(method=’ffill’)
. ffill
stands for forward fill.
import numpy as np
import pandas as pd
df = pd.DataFrame({"a": [1, np.nan, 3], "b": [4, 5, np.nan], "c": [1, 2, 3]})
df
a | b | c | |
---|---|---|---|
0 | 1.0 | 4.0 | 1 |
1 | NaN | 5.0 | 2 |
2 | 3.0 | NaN | 3 |
df = df.fillna(method="ffill")
df
a | b | c | |
---|---|---|---|
0 | 1.0 | 4.0 | 1 |
1 | 1.0 | 5.0 | 2 |
2 | 3.0 | 5.0 | 3 |
4.1.9. pandas.clip: Exclude Outliers¶
Outliers are unusual values in your dataset, and they can distort statistical analyses.
import pandas as pd
data = {"col0": [9, -3, 0, -1, 5]}
df = pd.DataFrame(data)
df
col0 | |
---|---|
0 | 9 |
1 | -3 |
2 | 0 |
3 | -1 |
4 | 5 |
If you want to trim values that the outliers, one of the methods is to use df.clip
.
Below is how to use the 0.5-quantile as the lower threshold and .95-quantile as the upper threshold
lower = df.col0.quantile(0.05)
upper = df.col0.quantile(0.95)
df.clip(lower=lower, upper=upper)
col0 | |
---|---|
0 | 8.2 |
1 | -2.6 |
2 | 0.0 |
3 | -1.0 |
4 | 5.0 |