6.9. Tools to Speed Up Code
This section covers some tools to speed up your code.
6.9.1. Fastai’s df_shrink: Shrink DataFrame’s Memory Usage in One Line of Code
!pip install fastai
Data analysts often struggle with large datasets that consume excessive memory, making it challenging to work efficiently, especially on machines with limited resources.
The df_shrink function in fastai helps address this issue by:
Automatically reducing the memory usage of a pandas DataFrame
Downcasting numeric columns to the smallest dtype that can still hold their values
Here’s a short code example to demonstrate the utility of df_shrink:
from fastai.tabular.core import df_shrink
import pandas as pd
df = pd.DataFrame({"col1": [1, 2, 3], "col2": [1.0, 2.0, 3.0]})
df.info()  # info() prints its report directly and returns None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   col1    3 non-null      int64
 1   col2    3 non-null      float64
dtypes: float64(1), int64(1)
memory usage: 176.0 bytes
new_df = df_shrink(df)
new_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   col1    3 non-null      int8
 1   col2    3 non-null      float32
dtypes: float32(1), int8(1)
memory usage: 143.0 bytes
In this example, the memory usage of the DataFrame decreases from 176 bytes to 143 bytes.
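To make the downcasting idea concrete, here is a minimal sketch that uses only pandas. It is not fastai's implementation: the downcast_numeric helper is a hypothetical name introduced for illustration, and it relies on pd.to_numeric with the downcast argument to pick a smaller dtype for each numeric column.

```python
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper (not part of fastai): downcast numeric columns."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # Smallest integer dtype that can hold the column's values.
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # float32 is the smallest dtype pandas will downcast floats to.
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


df = pd.DataFrame({"col1": [1, 2, 3], "col2": [1.0, 2.0, 3.0]})
small = downcast_numeric(df)

# Compare the column data only; the index is excluded from both totals.
before = df.memory_usage(index=False).sum()
after = small.memory_usage(index=False).sum()
print(small.dtypes)
print(f"{before} bytes -> {after} bytes (columns only)")
```

The byte counts here exclude the index, so they are not directly comparable to the totals that info() reports.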
6.9.2. Swifter: Add One Word to Make Your Pandas Apply 23 Times Faster
!pip install swifter
To speed up pandas apply on large data, use swifter. Simply add .swifter before .apply; everything else stays the same.
In the code below, I compare the speed of pandas’ apply with the speed of swifter’s apply on the California housing dataset, which has 20,640 rows.
import timeit

import swifter  # importing swifter registers the .swifter accessor on pandas objects
from scipy.special import boxcox1p
from sklearn.datasets import fetch_california_housing

X, y = fetch_california_housing(return_X_y=True, as_frame=True)


def pandas_apply():
    X["AveRooms"].apply(lambda x: boxcox1p(x, 0.25))


def swifter_apply():
    X["AveRooms"].swifter.apply(lambda x: boxcox1p(x, 0.25))


# Run each function 100 times and compare the total run times
num_experiments = 100
pandas_time = timeit.timeit(pandas_apply, number=num_experiments)
swifter_time = timeit.timeit(swifter_apply, number=num_experiments)

pandas_vs_swifter = round(pandas_time / swifter_time, 2)
print(f"Swifter apply is {pandas_vs_swifter} times faster than Pandas apply")
Swifter apply is 16.82 times faster than Pandas apply
In the run shown above, swifter’s apply is 16.82 times faster than pandas’ apply. The ratio compares the total run time of each method over 100 repetitions, so the exact figure varies between runs and machines.
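One plausible reason for a speedup of this size is vectorization: swifter's documentation describes trying a vectorized call on the whole column before falling back to other strategies. The sketch below illustrates that idea with pandas and NumPy only, without swifter. The synthetic column and the boxcox1p_np helper (a NumPy version of the Box-Cox transform of 1 + x) are assumptions made for illustration; they are not part of the benchmark above.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for X["AveRooms"]; the values are made up for illustration.
rng = np.random.default_rng(0)
col = pd.Series(rng.uniform(1, 10, size=20_640), name="AveRooms")


def boxcox1p_np(x, lmbda):
    """Hypothetical helper: Box-Cox transform of 1 + x, valid for lmbda != 0."""
    return ((1 + x) ** lmbda - 1) / lmbda


def elementwise():
    # One Python-level function call per row.
    return col.apply(lambda x: boxcox1p_np(x, 0.25))


def vectorized():
    # One call on the whole column; NumPy loops over the values in compiled code.
    return boxcox1p_np(col, 0.25)


elementwise_time = timeit.timeit(elementwise, number=10)
vectorized_time = timeit.timeit(vectorized, number=10)
print(f"element-wise: {elementwise_time:.3f}s, vectorized: {vectorized_time:.3f}s")
```

Both functions return the same values; only the number of Python-level calls differs.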
6.9.3. pyinstrument: Readable Python Profiler
!pip install pyinstrument
Identifying performance bottlenecks in Python code can be challenging, especially with complex applications or time-consuming processes. While cProfile and profile are useful, their outputs can be lengthy and difficult to interpret, particularly when using high-level libraries like pandas.
pyinstrument helps solve this problem by:
Providing a low-overhead profiler that shows where time is being spent in Python programs
Generating easy-to-read, hierarchical output that highlights the most time-consuming parts of the code
Here’s a short code example to demonstrate the utility of pyinstrument:
%%writefile pyinstrument_example.py
from pyinstrument import Profiler
import pandas as pd
import numpy as np

df = pd.DataFrame({'nums': np.random.randint(0, 100, 10000)})

def is_even(num: pd.Series) -> pd.Series:
    # Receives the whole column and returns a boolean Series
    return num % 2 == 0

profiler = Profiler()
profiler.start()

df = df.assign(is_even=lambda df_: is_even(df_.nums))

profiler.stop()
profiler.print()
Writing pyinstrument_example.py
In your terminal, run:
$ pyinstrument pyinstrument_example.py
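You should see output like the following. The first report comes from the profiler.print() call inside the script; the second comes from the pyinstrument command itself, which profiles the whole run, including imports: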
!pyinstrument pyinstrument_example.py
  _     ._   __/__   _ _  _  _ _/_   Recorded: 09:04:59  Samples:  1
 /_//_/// /_\ / //_// / //_'/ //     Duration: 0.001     CPU time: 0.001
/   _/                      v4.0.3

Program: pyinstrument_example.py

0.001 <module>  pyinstrument_example.py:1
└─ 0.001 assign  pandas/core/frame.py:4416
      [2 frames hidden]  pandas
         0.001 apply_if_callable  pandas/core/common.py:346
         └─ 0.001 <lambda>  pyinstrument_example.py:12
            └─ 0.001 is_even  pyinstrument_example.py:6
               └─ 0.001 new_method  pandas/core/ops/common.py:54
                     [9 frames hidden]  pandas, <built-in>
                        0.001 mod  <built-in>:0

  _     ._   __/__   _ _  _  _ _/_   Recorded: 09:04:59  Samples:  225
 /_//_/// /_\ / //_// / //_'/ //     Duration: 0.265     CPU time: 1.897
/   _/                      v4.0.3

Program: pyinstrument_example.py

0.265 <module>  <string>:1
   [4 frames hidden]  <string>, runpy
      0.265 _run_code  runpy.py:64
      └─ 0.265 <module>  pyinstrument_example.py:1
         └─ 0.261 <module>  pandas/__init__.py:3
               [650 frames hidden]  pandas, pyarrow, <built-in>, textwrap...

To view this report with different options, run:
    pyinstrument --load-prev 2021-09-15T09-04-59 [options]
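For contrast with the cProfile output mentioned at the start of this section, the same workload can be profiled with the standard library. This is a minimal sketch, not part of the original example; the limit of 10 rows passed to print_stats is an arbitrary choice to keep the report short.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd

df = pd.DataFrame({"nums": np.random.randint(0, 100, 10000)})


def is_even(num: pd.Series) -> pd.Series:
    return num % 2 == 0


profiler = cProfile.Profile()
profiler.enable()
df = df.assign(is_even=lambda df_: is_even(df_.nums))
profiler.disable()

# Write the per-function statistics to a string, sorted by cumulative time.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats("cumulative")
stats.print_stats(10)  # arbitrary cutoff: only the first 10 entries
report = buffer.getvalue()
print(report)
```

cProfile reports a flat table of per-function statistics, whereas pyinstrument groups the recorded time into a call tree.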