6.14. Polars#

6.14.1. Polars: Blazing Fast DataFrame Library#

!pip install polars

If you want a data manipulation library that’s both fast and memory-efficient, try Polars. Polars provides a high-level API similar to Pandas but with better performance on large datasets.

The code below compares the performance of Polars and pandas.

import pandas as pd
import polars as pl
import numpy as np
import time

# Create two Pandas DataFrames with 1 million rows each
pandas_df1 = pd.DataFrame({
    'key': np.random.randint(0, 1000, size=1_000_000),
    'value1': np.random.rand(1_000_000)
})

pandas_df2 = pd.DataFrame({
    'key': np.random.randint(0, 1000, size=1_000_000),
    'value2': np.random.rand(1_000_000)
})

# Create two Polars DataFrames from the Pandas DataFrames
polars_df1 = pl.from_pandas(pandas_df1)
polars_df2 = pl.from_pandas(pandas_df2)

# Merge the two DataFrames on the 'key' column
start_time = time.time()
pandas_merged = pd.merge(pandas_df1, pandas_df2, on='key')
pandas_time = time.time() - start_time

start_time = time.time()
polars_merged = polars_df1.join(polars_df2, on='key')
polars_time = time.time() - start_time

print(f"Pandas time: {pandas_time:.6f} seconds")
print(f"Polars time: {polars_time:.6f} seconds")
Pandas time: 127.604390 seconds
Polars time: 41.079080 seconds
print(f"Polars is {pandas_time/polars_time:.2f} times faster than Pandas")
Polars is 3.11 times faster than Pandas

Link to polars

6.14.2. Polars: Speed Up Data Processing 12x with Lazy Execution#

!pip install polars

Polars is a lightning-fast DataFrame library that utilizes all available cores on your machine.

Polars has two APIs: an eager API and a lazy API.

The eager execution is similar to Pandas, which executes code immediately.

In contrast, the lazy execution defers computations until the collect() method is called. This approach avoids unnecessary computations, making lazy execution potentially more efficient than eager execution.

The following code benchmarks a filter operation on a DataFrame containing 10 million rows. Running Polars with lazy execution is about 12 times faster than using Pandas.

import numpy as np

# Set a random seed for reproducibility
np.random.seed(42)

# Number of rows in the dataset
num_rows = 10_000_000

# Sample data for categorical columns
categories = ["a", "b", "c", "d"]

# Generate random data for the dataset
data = {
    "Cat1": np.random.choice(categories, size=num_rows),
    "Cat2": np.random.choice(categories, size=num_rows),
    "Num1": np.random.randint(1, 100, size=num_rows),
    "Num2": np.random.randint(1000, 10000, size=num_rows),
}

Create a pandas DataFrame and filter the DataFrame.

import pandas as pd


df = pd.DataFrame(data)
df.head()
  Cat1 Cat2  Num1  Num2
0    c    a    40  7292
1    d    b    45  7849
2    a    a    93  6940
3    c    a    46  1265
4    c    a    98  2509
%timeit df[(df['Cat1'] == 'a') & (df['Cat2'] == 'b') & (df['Num1'] >= 70)]
706 ms ± 75.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Create a polars DataFrame and filter the DataFrame.

import polars as pl

pl_df = pl.DataFrame(data)
%timeit pl_df.lazy().filter((pl.col('Cat1') == 'a') & (pl.col('Cat2') == 'b') & (pl.col('Num1') >= 70)).collect()
58.1 ms ± 428 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Link to polars

6.14.3. Polars vs. Pandas for CSV Loading and Filtering#

!pip install polars
!wget -O airport-codes.csv "https://datahub.io/core/airport-codes/r/0.csv"

The read_csv method in Pandas loads the entire dataset into a DataFrame before filtering out unwanted rows.

On the other hand, the scan_csv method in Polars delays execution and optimizes the operation until the collect method is called. This approach accelerates code execution, particularly when handling large datasets.

In the code below, it is 25.5 times faster to use Polars instead of Pandas to read a subset of a CSV file containing 57k rows.

import pandas as pd
import polars as pl 
%%timeit
df = pd.read_csv("airport-codes.csv")
df[(df["type"] == "heliport") & (df["continent"] == "EU")]
143 ms ± 8.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
pl.scan_csv("airport-codes.csv").filter(
    (pl.col("type") == "heliport") & (pl.col("continent") == "EU")
).collect()
5.6 ms ± 594 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

6.14.4. Pandas vs Polars: Harnessing Parallelism for Faster Data Processing#

!pip install polars

Pandas is a single-threaded library, utilizing only a single CPU core. To achieve parallelism with Pandas, you would need to use additional libraries like Dask.

import pandas as pd
import multiprocessing as mp
import dask.dataframe as dd


df = pd.DataFrame({"A": range(1_000_000), "B": range(1_000_000)})

# Perform the groupby and sum operation in parallel 
ddf = dd.from_pandas(df, npartitions=mp.cpu_count())
result = ddf.groupby("A").sum().compute()

Polars, on the other hand, automatically leverages the available CPU cores without any additional configuration.

import polars as pl

df = pl.DataFrame({"A": range(1_000_000), "B": range(1_000_000)})

# Perform the groupby and sum operation in parallel 
result = df.group_by("A").sum()

Link to Polars.

6.14.5. Simple and Expressive Data Transformation with Polars#

!pip install polars

Compared to pandas, Polars provides a more expressive syntax for creating complex data transformation pipelines. Every expression in Polars produces a new expression, and these expressions can be piped together.

import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 6], "B": ["a", "b", "c"], "C": [True, False, True]}
)
integer_columns = df.select_dtypes("int64")
other_columns = df[["B"]]
pd.concat([integer_columns, other_columns], axis=1)
   A  B
0  1  a
1  2  b
2  6  c
import polars as pl

pl_df = pl.DataFrame(
    {"A": [1, 2, 6], "B": ["a", "b", "c"], "C": [True, False, True]}
)
pl_df.select([pl.col(pl.Int64), "B"])
shape: (3, 2)
┌─────┬─────┐
│ A   ┆ B   │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 1   ┆ a   │
│ 2   ┆ b   │
│ 6   ┆ c   │
└─────┴─────┘

6.14.6. Harness Polars and Delta Lake for Blazing Fast Performance#

!pip install polars deltalake

Polars is a Rust-based DataFrame library that is designed for high-performance data manipulation and analysis. Delta Lake is a storage format that offers a range of benefits, including ACID transactions, time travel, schema enforcement, and more. It’s designed to work seamlessly with big data processing engines like Apache Spark and can handle large amounts of data with ease.

When you combine Polars and Delta Lake, you get a powerful data processing system. Polars does the heavy lifting of processing your data, while Delta Lake keeps everything organized and up-to-date.

Imagine you have a huge dataset with millions of rows. You want to group the data by category and calculate the sum of a certain column. With Polars and Delta Lake, you can do this quickly and easily.

First, you create a sample dataset:

import pandas as pd
import numpy as np

# Create a sample dataset
num_rows = 10_000_000
data = {
    "Cat1": np.random.choice(['A', 'B', 'C'], size=num_rows),
    'Num1': np.random.randint(low=1, high=100, size=num_rows)
}

df = pd.DataFrame(data)
df.head()
  Cat1  Num1
0    A    84
1    C    63
2    B    11
3    A    73
4    B    57

Next, you save the dataset to Delta Lake:

from deltalake.writer import write_deltalake

save_path = "tmp/data"

write_deltalake(save_path, df)

Then, you can use Polars to read the data from Delta Lake and perform the grouping operation:

import polars as pl 

pl_df = pl.read_delta(save_path)

print(pl_df.group_by("Cat1").sum())
shape: (3, 2)
┌──────┬───────────┐
│ Cat1 ┆ Num1      │
│ ---  ┆ ---       │
│ str  ┆ i64       │
╞══════╪═══════════╡
│ B    ┆ 166653474 │
│ C    ┆ 166660653 │
│ A    ┆ 166597835 │
└──────┴───────────┘

Let’s say you want to append some new data to the existing dataset:

new_data = pd.DataFrame({"Cat1": ["B", "C"], "Num1": [2, 3]})

write_deltalake(save_path, new_data, mode="append")

Now, you can use Polars to read the updated data from Delta Lake:

updated_pl_df = pl.read_delta(save_path)
print(updated_pl_df.tail())
shape: (5, 2)
┌──────┬──────┐
│ Cat1 ┆ Num1 │
│ ---  ┆ ---  │
│ str  ┆ i64  │
╞══════╪══════╡
│ A    ┆ 29   │
│ A    ┆ 41   │
│ A    ┆ 49   │
│ B    ┆ 2    │
│ C    ┆ 3    │
└──────┴──────┘

But what if you want to go back to the previous version of the data? With Delta Lake, you can easily do that by specifying the version number:

previous_pl_df = pl.read_delta(save_path, version=0)
print(previous_pl_df.tail())
shape: (5, 2)
┌──────┬──────┐
│ Cat1 ┆ Num1 │
│ ---  ┆ ---  │
│ str  ┆ i64  │
╞══════╪══════╡
│ A    ┆ 90   │
│ C    ┆ 83   │
│ A    ┆ 29   │
│ A    ┆ 41   │
│ A    ┆ 49   │
└──────┴──────┘

Link to polars

Link to Delta Lake.

6.14.7. Parallel Execution of Multiple Files with Polars#

!pip install polars

If you have multiple files to process, Polars lets you construct a query plan for each file beforehand and then execute all of the plans concurrently, maximizing processing speed.

import glob

import polars as pl

# Construct a query plan for each file
queries = []
for file in glob.glob("test_data/*.csv"):
    q = pl.scan_csv(file).group_by("Cat").agg(pl.sum("Num"))
    queries.append(q)

# Execute files in parallel
dataframes = pl.collect_all(queries)
dataframes
[shape: (3, 2)
 ┌─────┬─────┐
 │ Cat ┆ Num │
 │ --- ┆ --- │
 │ str ┆ i64 │
 ╞═════╪═════╡
 │ A   ┆ 2   │
 │ C   ┆ 6   │
 │ B   ┆ 4   │
 └─────┴─────┘,
 shape: (3, 2)
 ┌─────┬─────┐
 │ Cat ┆ Num │
 │ --- ┆ --- │
 │ str ┆ i64 │
 ╞═════╪═════╡
 │ B   ┆ 5   │
 │ A   ┆ 1   │
 │ C   ┆ 1   │
 └─────┴─────┘,
 shape: (3, 2)
 ┌─────┬─────┐
 │ Cat ┆ Num │
 │ --- ┆ --- │
 │ str ┆ i64 │
 ╞═════╪═════╡
 │ C   ┆ 4   │
 │ A   ┆ 4   │
 │ B   ┆ 1   │
 └─────┴─────┘]

Link to polars

6.14.8. Polars’ Streaming Mode: A Solution for Large Data Sets#

!pip install polars

The default collect method in Polars processes your data as a single batch, which means that all the data must fit into your available memory.

If your data requires more memory than you have available, Polars can process it in batches using streaming mode. To use streaming mode, simply pass the streaming=True argument to the collect method.

import polars as pl

df = (
    pl.scan_csv("reddit.csv")
    .with_columns(pl.col("name").str.to_uppercase())
    .filter(pl.col("comment_karma") > 0)
    .collect(streaming=True)
)

Learn more about the streaming API in Polars.

6.14.9. Pandas vs Polars: Syntax Comparison for Data Scientists#

As a data scientist, you’re likely familiar with the popular data analysis libraries Pandas and Polars. Both provide powerful tools for working with tabular data, but how do their syntaxes compare?

To begin, we’ll create equivalent dataframes in both Pandas and Polars:

import pandas as pd
import polars as pl

# Create a Pandas DataFrame
data = {
    "Category": ["Electronics", "Clothing", "Electronics", "Clothing", "Electronics"],
    "Quantity": [5, 2, 3, 10, 4],
    "Price": [200, 30, 150, 20, 300],
}
pandas_df = pd.DataFrame(data)
polars_df = pl.DataFrame(data)

Key Operations Comparison:

pandas_df[["Category", "Price"]]
      Category  Price
0  Electronics    200
1     Clothing     30
2  Electronics    150
3     Clothing     20
4  Electronics    300
polars_df.select(["Category", "Price"])
shape: (5, 2)
┌─────────────┬───────┐
│ Category    ┆ Price │
│ ---         ┆ ---   │
│ str         ┆ i64   │
╞═════════════╪═══════╡
│ Electronics ┆ 200   │
│ Clothing    ┆ 30    │
│ Electronics ┆ 150   │
│ Clothing    ┆ 20    │
│ Electronics ┆ 300   │
└─────────────┴───────┘
# Filtering rows where Quantity > 3
pandas_df[pandas_df["Quantity"] > 3]
      Category  Quantity  Price
0  Electronics         5    200
3     Clothing        10     20
4  Electronics         4    300
polars_df.filter(pl.col("Quantity") > 3)
shape: (3, 3)
┌─────────────┬──────────┬───────┐
│ Category    ┆ Quantity ┆ Price │
│ ---         ┆ ---      ┆ ---   │
│ str         ┆ i64      ┆ i64   │
╞═════════════╪══════════╪═══════╡
│ Electronics ┆ 5        ┆ 200   │
│ Clothing    ┆ 10       ┆ 20    │
│ Electronics ┆ 4        ┆ 300   │
└─────────────┴──────────┴───────┘
pandas_df.groupby("Category").agg(
    {
        "Quantity": "sum", 
        "Price": "mean", 
    }
)
             Quantity       Price
Category
Clothing           12   25.000000
Electronics        12  216.666667
polars_df.group_by("Category").agg(
    [
        pl.col("Quantity").sum(),
        pl.col("Price").mean(),
    ]
)
shape: (2, 3)
┌─────────────┬──────────┬────────────┐
│ Category    ┆ Quantity ┆ Price      │
│ ---         ┆ ---      ┆ ---        │
│ str         ┆ i64      ┆ f64        │
╞═════════════╪══════════╪════════════╡
│ Clothing    ┆ 12       ┆ 25.0       │
│ Electronics ┆ 12       ┆ 216.666667 │
└─────────────┴──────────┴────────────┘

6.14.10. Faster Data Analysis with Polars: A Guide to Lazy Execution#

When processing data, the execution approach significantly impacts performance. Pandas, a popular Python data manipulation library, uses eager execution by default, processing data immediately and loading everything into memory. This works well for small to medium-sized datasets but can lead to slow computations and high memory usage with large datasets.

In contrast, Polars, a modern data processing library, offers both eager and lazy execution. In lazy mode, a query optimizer evaluates operations and determines the most efficient execution plan, which may involve reordering operations or dropping redundant calculations.

Let’s consider an example where we:

  • Group a DataFrame by 'region'

  • Calculate the sum of 'revenue' for each region

  • Filter for only 'North' and 'South' regions

With eager execution, Pandas will:

  • Execute operations immediately, loading all data into memory

  • Keep intermediate results in memory during each step

  • Execute operations in the exact order written

import numpy as np

# Generate sample data
N = 10_000_000

data = {
    "region": np.random.choice(["North", "South", "East", "West"], N),
    "revenue": np.random.uniform(100, 10000, N),
    "orders": np.random.randint(1, 100, N),
}
import pandas as pd


def analyze_sales_pandas(df):
    # Loads and processes everything in memory
    return (
        df.groupby("region")
        .agg({"revenue": "sum"})
        .loc[["North", "South"]]
    )


pd_df = pd.DataFrame(data)
%timeit analyze_sales_pandas(pd_df)
367 ms ± 47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

As shown above, the eager execution approach used by Pandas results in an execution time of approximately 367 milliseconds.

With lazy execution, Polars will:

  • Create an execution plan first, optimizing the entire chain before processing any data

  • Only process data once at .collect(), reducing memory overhead

  • Rearrange operations for optimal performance (pushing filters before groupby)

import polars as pl


def analyze_sales_polars(df):
    # Creates execution plan, no data processed yet
    result = (
        df.lazy()
        .group_by("region")
        .agg(pl.col("revenue").sum())
        .filter(pl.col("region").is_in(["North", "South"]))
        .collect()  # Only now data is processed
    )
    return result


pl_df = pl.DataFrame(data)
%timeit analyze_sales_polars(pl_df)
170 ms ± 15.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In contrast, the lazy execution approach with Polars takes approximately 170 milliseconds, about 2.2 times faster than the eager execution approach with Pandas (roughly 54% less wall-clock time).