6.9. Tools to Speed Up Code

This section covers tools for reducing memory usage, speeding up pandas operations, and profiling your code.

6.9.1. Fastai’s df_shrink: Shrink DataFrame’s Memory Usage in One Line of Code

!pip install fastai

Data analysts often struggle with large datasets that consume excessive memory, making it challenging to work efficiently, especially on machines with limited resources.

The df_shrink function in fastai helps address this issue by:

  • Automatically reducing the memory usage of a pandas DataFrame

  • Downcasting numeric columns to the smallest dtype whose range still holds their values (note that converting float64 to float32 can reduce precision)

Here’s a short code example to demonstrate the utility of df_shrink:

from fastai.tabular.core import df_shrink
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3], "col2": [1.0, 2.0, 3.0]})
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   col1    3 non-null      int64  
 1   col2    3 non-null      float64
dtypes: float64(1), int64(1)
memory usage: 176.0 bytes
None
new_df = df_shrink(df)
print(new_df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   col1    3 non-null      int8   
 1   col2    3 non-null      float32
dtypes: float32(1), int8(1)
memory usage: 143.0 bytes
None

In this example, the memory usage of the DataFrame decreases from 176 bytes to 143 bytes.
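The same idea can be sketched with plain pandas, without fastai. This is a minimal sketch and not fastai's implementation: `downcast_numeric` is a hypothetical helper, and it relies on `pd.to_numeric` with the `downcast` argument to pick the smallest integer or float dtype that fits each column.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3], "col2": [1.0, 2.0, 3.0]})


def downcast_numeric(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a copy with numeric columns downcast."""
    out = frame.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # Smallest signed integer dtype that can hold the values.
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # float32 is the smallest float dtype pandas downcasts to.
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


small = downcast_numeric(df)
print(small.dtypes)
print(df.memory_usage(deep=True).sum(), "->", small.memory_usage(deep=True).sum())
```

Values such as 1.0, 2.0, and 3.0 are exactly representable in float32, so nothing is lost here. For columns with many significant digits, it is worth comparing the values before and after shrinking.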

Link to Fastai.

6.9.2. Swifter: Add One Word to Make Your Pandas Apply 23 Times Faster

!pip install swifter

To speed up pandas apply on large data, use swifter: add .swifter before .apply and leave everything else the same.

In the code below, I compare the speed of pandas’ apply with the speed of swifter’s apply on the California housing dataset, which has 20,640 rows.

import timeit

import swifter  # importing swifter registers the .swifter accessor on pandas objects
from scipy.special import boxcox1p
from sklearn.datasets import fetch_california_housing

X, y = fetch_california_housing(return_X_y=True, as_frame=True)


def pandas_apply():
    X["AveRooms"].apply(lambda x: boxcox1p(x, 0.25))


def swifter_apply():
    X["AveRooms"].swifter.apply(lambda x: boxcox1p(x, 0.25))


num_experiments = 100
pandas_time = timeit.timeit(pandas_apply, number=num_experiments)
swifter_time = timeit.timeit(swifter_apply, number=num_experiments)

pandas_vs_swifter = round(pandas_time / swifter_time, 2)
print(f"Swifter apply is {pandas_vs_swifter} times faster than Pandas apply")
Swifter apply is 16.82 times faster than Pandas apply

In the run shown above, swifter’s apply is 16.82 times faster than pandas’ apply. The ratio compares the total run time of each method over 100 runs. The exact speedup varies between runs and machines.
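A large part of a speedup like this can come from vectorization: swifter's documentation describes trying a vectorized call on the whole column first and falling back to other strategies only if that fails. The sketch below illustrates that effect with plain pandas and NumPy, not swifter itself. The random data and the use of `log1p` as a stand-in for `boxcox1p` are assumptions made to keep the example self-contained.

```python
import math
import timeit

import numpy as np
import pandas as pd

# Assumption: random data standing in for the AveRooms column (20,640 rows).
rng = np.random.default_rng(0)
s = pd.Series(rng.uniform(1, 10, size=20_640))


def elementwise():
    # Calls the Python function once per row.
    return s.apply(math.log1p)


def vectorized():
    # One NumPy call over the whole column.
    return np.log1p(s)


# timeit.timeit returns the total time in seconds for `number` runs.
n_runs = 20
elementwise_time = timeit.timeit(elementwise, number=n_runs)
vectorized_time = timeit.timeit(vectorized, number=n_runs)

print(f"Vectorized call is {elementwise_time / vectorized_time:.2f} times faster")
```

Because both totals cover the same number of runs, dividing them gives the same ratio as dividing the average run times.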

Link to swifter.

6.9.3. pyinstrument: Readable Python Profiler

!pip install pyinstrument

Identifying performance bottlenecks in Python code can be challenging, especially with complex applications or time-consuming processes. While cProfile and profile are useful, their outputs can be lengthy and difficult to interpret, particularly when using high-level libraries like pandas.
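For comparison, here is a minimal sketch of profiling with the standard library's cProfile and pstats. The `slow_sum` function is a hypothetical workload chosen only to give the profiler something to measure.

```python
import cProfile
import io
import pstats


def slow_sum(n: int) -> int:
    """Hypothetical workload: sum of squares in a Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(200_000)
profiler.disable()

# Collect the report in a string, sorted by cumulative time, top 5 rows.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

The report is a flat table with one row per function. With pandas code, many of those rows tend to be library internals, which is where the output becomes hard to read.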

pyinstrument helps solve this problem by:

  • Providing a low-overhead profiler that shows where time is being spent in Python programs

  • Generating easy-to-read, hierarchical output that highlights the most time-consuming parts of the code

Here’s a short code example to demonstrate the utility of pyinstrument:

%%writefile pyinstrument_example.py
from pyinstrument import Profiler
import pandas as pd
import numpy as np

df = pd.DataFrame({'nums': np.random.randint(0, 100, 10000)})
def is_even(num: pd.Series) -> pd.Series:
    return num % 2 == 0

profiler = Profiler()
profiler.start()

df = df.assign(is_even=lambda df_: is_even(df_.nums))

profiler.stop()
profiler.print()
Writing pyinstrument_example.py

On your terminal, type:

$ pyinstrument pyinstrument_example.py

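… and you should see output like the one below. Two reports appear because the code is profiled twice: first by the Profiler object inside the script, which covers only the assign call, and then by the pyinstrument command, which covers the whole script, including the time spent importing pandas.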

!pyinstrument pyinstrument_example.py
  _     ._   __/__   _ _  _  _ _/_   Recorded: 09:04:59  Samples:  1
 /_//_/// /_\ / //_// / //_'/ //     Duration: 0.001     CPU time: 0.001
/   _/                      v4.0.3

Program: pyinstrument_example.py

0.001 <module>  pyinstrument_example.py:1
└─ 0.001 assign  pandas/core/frame.py:4416
      [2 frames hidden]  pandas
         0.001 apply_if_callable  pandas/core/common.py:346
         └─ 0.001 <lambda>  pyinstrument_example.py:12
            └─ 0.001 is_even  pyinstrument_example.py:6
               └─ 0.001 new_method  pandas/core/ops/common.py:54
                     [9 frames hidden]  pandas, <built-in>
                        0.001 mod  <built-in>:0



  _     ._   __/__   _ _  _  _ _/_   Recorded: 09:04:59  Samples:  225
 /_//_/// /_\ / //_// / //_'/ //     Duration: 0.265     CPU time: 1.897
/   _/                      v4.0.3

Program: pyinstrument_example.py

0.265 <module>  <string>:1
   [4 frames hidden]  <string>, runpy
      0.265 _run_code  runpy.py:64
      └─ 0.265 <module>  pyinstrument_example.py:1
         └─ 0.261 <module>  pandas/__init__.py:3
               [650 frames hidden]  pandas, pyarrow, <built-in>, textwrap...

To view this report with different options, run:
    pyinstrument --load-prev 2021-09-15T09-04-59 [options]

Link to pyinstrument.