6.14. Polars#
6.14.1. Polars: Blazing Fast DataFrame Library#
!pip install polars
If you want a data manipulation library that's both fast and memory-efficient, try Polars. Polars provides a high-level API similar to Pandas but with better performance for large datasets.
The code below compares the performance of Polars and pandas.
import pandas as pd
import polars as pl
import numpy as np
import time
# Create two Pandas DataFrames with 1 million rows each
pandas_df1 = pd.DataFrame({
    "key": np.random.randint(0, 1000, size=1_000_000),
    "value1": np.random.rand(1_000_000),
})
pandas_df2 = pd.DataFrame({
    "key": np.random.randint(0, 1000, size=1_000_000),
    "value2": np.random.rand(1_000_000),
})
# Create two Polars DataFrames from the Pandas DataFrames
polars_df1 = pl.from_pandas(pandas_df1)
polars_df2 = pl.from_pandas(pandas_df2)
# Merge the two DataFrames on the 'key' column
start_time = time.time()
pandas_merged = pd.merge(pandas_df1, pandas_df2, on='key')
pandas_time = time.time() - start_time
start_time = time.time()
polars_merged = polars_df1.join(polars_df2, on='key')
polars_time = time.time() - start_time
print(f"Pandas time: {pandas_time:.6f} seconds")
print(f"Polars time: {polars_time:.6f} seconds")
Pandas time: 127.604390 seconds
Polars time: 41.079080 seconds
print(f"Polars is {pandas_time/polars_time:.2f} times faster than Pandas")
Polars is 3.11 times faster than Pandas
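If downstream code still expects Pandas, you can convert the Polars result back with `to_pandas`. A minimal sketch (this conversion relies on pyarrow being installed):
# Convert the Polars result back to a Pandas DataFrame for downstream Pandas code
pandas_result = polars_merged.to_pandas()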
6.14.2. Polars: Speed Up Data Processing 12x with Lazy Execution#
!pip install polars
Polars is a lightning-fast DataFrame library that utilizes all available cores on your machine.
Polars has two APIs: an eager API and a lazy API.
Eager execution works like Pandas: code executes immediately. In contrast, lazy execution defers computations until the `collect()` method is called. This approach avoids unnecessary computations, making lazy execution potentially more efficient than eager execution.
The following code shows a filter operation on a DataFrame containing 10 million rows. Running Polars with lazy execution is 12 times faster than using Pandas.
import numpy as np
# Set a random seed for reproducibility
np.random.seed(42)
# Number of rows in the dataset
num_rows = 10_000_000
# Sample data for categorical columns
categories = ["a", "b", "c", "d"]
# Generate random data for the dataset
data = {
"Cat1": np.random.choice(categories, size=num_rows),
"Cat2": np.random.choice(categories, size=num_rows),
"Num1": np.random.randint(1, 100, size=num_rows),
"Num2": np.random.randint(1000, 10000, size=num_rows),
}
Create a Pandas DataFrame and filter it:
import pandas as pd
df = pd.DataFrame(data)
df.head()
|   | Cat1 | Cat2 | Num1 | Num2 |
|---|------|------|------|------|
| 0 | c    | a    | 40   | 7292 |
| 1 | d    | b    | 45   | 7849 |
| 2 | a    | a    | 93   | 6940 |
| 3 | c    | a    | 46   | 1265 |
| 4 | c    | a    | 98   | 2509 |
%timeit df[(df['Cat1'] == 'a') & (df['Cat2'] == 'b') & (df['Num1'] >= 70)]
706 ms ± 75.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Create a Polars DataFrame and filter it:
import polars as pl
pl_df = pl.DataFrame(data)
%timeit pl_df.lazy().filter((pl.col('Cat1') == 'a') & (pl.col('Cat2') == 'b') & (pl.col('Num1') >= 70)).collect()
58.1 ms ± 428 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
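For reference, the eager counterpart below runs the same filter immediately, with no query plan and no `collect` call; this is a minimal sketch of the eager API, not a benchmarked comparison:
# Eager execution: the filter runs as soon as the line executes
eager_result = pl_df.filter(
    (pl.col("Cat1") == "a") & (pl.col("Cat2") == "b") & (pl.col("Num1") >= 70)
)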
6.14.3. Polars vs. Pandas for CSV Loading and Filtering#
!pip install polars
!wget -O airport-codes.csv "https://datahub.io/core/airport-codes/r/0.csv"
The `read_csv` method in Pandas loads every row of the dataset into the DataFrame before filtering out the unwanted rows. The `scan_csv` method in Polars, on the other hand, defers execution and optimizes the query until the `collect` method is called. This approach accelerates code execution, particularly when handling large datasets.
In the code below, using Polars is 25.5 times faster than Pandas for reading and filtering a subset of a CSV file containing 57k rows.
import pandas as pd
import polars as pl
%%timeit
df = pd.read_csv("airport-codes.csv")
df[(df["type"] == "heliport") & (df["continent"] == "EU")]
143 ms ± 8.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
pl.scan_csv("airport-codes.csv").filter(
(pl.col("type") == "heliport") & (pl.col("continent") == "EU")
).collect()
5.6 ms ± 594 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
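Because `scan_csv` builds a query plan, Polars can also push column selection down into the reader so that only the referenced columns are parsed. A minimal sketch using the same file and its `type` and `continent` columns:
# Only the selected columns are read and parsed from the CSV
pl.scan_csv("airport-codes.csv").select(["type", "continent"]).collect()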
6.14.4. Pandas vs Polars: Harnessing Parallelism for Faster Data Processing#
!pip install polars
Pandas is a single-threaded library, utilizing only a single CPU core. To achieve parallelism with Pandas, you would need to use additional libraries like Dask.
import pandas as pd
import multiprocessing as mp
import dask.dataframe as dd
df = pd.DataFrame({"A": range(1_000_000), "B": range(1_000_000)})
# Perform the groupby and sum operation in parallel
ddf = dd.from_pandas(df, npartitions=mp.cpu_count())
result = ddf.groupby("A").sum().compute()
Polars, on the other hand, automatically leverages the available CPU cores without any additional configuration.
import polars as pl
df = pl.DataFrame({"A": range(1_000_000), "B": range(1_000_000)})
# Perform the groupby and sum operation in parallel
result = df.group_by("A").sum()
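In recent Polars versions, you can check how many threads the global thread pool uses with `pl.thread_pool_size()` (named `pl.threadpool_size()` in older releases), and cap it by setting the `POLARS_MAX_THREADS` environment variable before importing Polars:
# Number of worker threads Polars uses for parallel execution
print(pl.thread_pool_size())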
6.14.5. Simple and Expressive Data Transformation with Polars#
!pip install polars
Compared to pandas, Polars provides a more expressive syntax for creating complex data transformation pipelines. Every expression in Polars produces a new expression, and these expressions can be piped together.
import pandas as pd
df = pd.DataFrame(
{"A": [1, 2, 6], "B": ["a", "b", "c"], "C": [True, False, True]}
)
integer_columns = df.select_dtypes("int64")
other_columns = df[["B"]]
pd.concat([integer_columns, other_columns], axis=1)
|   | A | B |
|---|---|---|
| 0 | 1 | a |
| 1 | 2 | b |
| 2 | 6 | c |
import polars as pl
pl_df = pl.DataFrame(
{"A": [1, 2, 6], "B": ["a", "b", "c"], "C": [True, False, True]}
)
pl_df.select([pl.col(pl.Int64), "B"])
| A   | B   |
|-----|-----|
| i64 | str |
| 1   | "a" |
| 2   | "b" |
| 6   | "c" |
6.14.6. Harness Polars and Delta Lake for Blazing Fast Performance#
!pip install polars deltalake
Polars is a Rust-based DataFrame library that is designed for high-performance data manipulation and analysis. Delta Lake is a storage format that offers a range of benefits, including ACID transactions, time travel, schema enforcement, and more. It's designed to work seamlessly with big data processing engines like Apache Spark and can handle large amounts of data with ease.
When you combine Polars and Delta Lake, you get a powerful data processing system. Polars does the heavy lifting of processing your data, while Delta Lake keeps everything organized and up-to-date.
Imagine you have a huge dataset with millions of rows. You want to group the data by category and calculate the sum of a certain column. With Polars and Delta Lake, you can do this quickly and easily.
First, you create a sample dataset:
import pandas as pd
import numpy as np
# Create a sample dataset
num_rows = 10_000_000
data = {
"Cat1": np.random.choice(['A', 'B', 'C'], size=num_rows),
'Num1': np.random.randint(low=1, high=100, size=num_rows)
}
df = pd.DataFrame(data)
df.head()
|   | Cat1 | Num1 |
|---|------|------|
| 0 | A    | 84   |
| 1 | C    | 63   |
| 2 | B    | 11   |
| 3 | A    | 73   |
| 4 | B    | 57   |
Next, you save the dataset to Delta Lake:
from deltalake.writer import write_deltalake
save_path = "tmp/data"
write_deltalake(save_path, df)
Then, you can use Polars to read the data from Delta Lake and perform the grouping operation:
import polars as pl
pl_df = pl.read_delta(save_path)
print(pl_df.group_by("Cat1").sum())
shape: (3, 2)
┌──────┬───────────┐
│ Cat1 ┆ Num1      │
│ ---  ┆ ---       │
│ str  ┆ i64       │
╞══════╪═══════════╡
│ B    ┆ 166653474 │
│ C    ┆ 166660653 │
│ A    ┆ 166597835 │
└──────┴───────────┘
Let's say you want to append some new data to the existing dataset:
new_data = pd.DataFrame({"Cat1": ["B", "C"], "Num1": [2, 3]})
write_deltalake(save_path, new_data, mode="append")
Now, you can use Polars to read the updated data from Delta Lake:
updated_pl_df = pl.read_delta(save_path)
print(updated_pl_df.tail())
shape: (5, 2)
┌──────┬──────┐
│ Cat1 ┆ Num1 │
│ ---  ┆ ---  │
│ str  ┆ i64  │
╞══════╪══════╡
│ A    ┆ 29   │
│ A    ┆ 41   │
│ A    ┆ 49   │
│ B    ┆ 2    │
│ C    ┆ 3    │
└──────┴──────┘
But what if you want to go back to the previous version of the data? With Delta Lake, you can easily do that by specifying the version number:
previous_pl_df = pl.read_delta(save_path, version=0)
print(previous_pl_df.tail())
shape: (5, 2)
┌──────┬──────┐
│ Cat1 ┆ Num1 │
│ ---  ┆ ---  │
│ str  ┆ i64  │
╞══════╪══════╡
│ A    ┆ 90   │
│ C    ┆ 83   │
│ A    ┆ 29   │
│ A    ┆ 41   │
│ A    ┆ 49   │
└──────┴──────┘
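To see which versions are available before time traveling, you can inspect the table's commit history with the deltalake package; a minimal sketch:
from deltalake import DeltaTable

# Each write (the initial one and the append) appears as one commit
dt = DeltaTable(save_path)
print(dt.history())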
6.14.7. Parallel Execution of Multiple Files with Polars#
!pip install polars
If you have multiple files to process, Polars enables you to construct a query plan for each file beforehand. This allows for the efficient execution of multiple files concurrently, maximizing processing speed.
import glob
import polars as pl
# Construct a query plan for each file
queries = []
for file in glob.glob("test_data/*.csv"):
q = pl.scan_csv(file).group_by("Cat").agg(pl.sum("Num"))
queries.append(q)
# Execute files in parallel
dataframes = pl.collect_all(queries)
dataframes
[shape: (3, 2)
┌─────┬─────┐
│ Cat ┆ Num │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ A   ┆ 2   │
│ C   ┆ 6   │
│ B   ┆ 4   │
└─────┴─────┘,
shape: (3, 2)
┌─────┬─────┐
│ Cat ┆ Num │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ B   ┆ 5   │
│ A   ┆ 1   │
│ C   ┆ 1   │
└─────┴─────┘,
shape: (3, 2)
┌─────┬─────┐
│ Cat ┆ Num │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ C   ┆ 4   │
│ A   ┆ 4   │
│ B   ┆ 1   │
└─────┴─────┘]
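If the per-file results share a schema, they can be stacked into a single DataFrame afterwards; a minimal sketch:
# Combine the per-file aggregations into one DataFrame
combined = pl.concat(dataframes)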
6.14.8. Polars' Streaming Mode: A Solution for Large Data Sets#
!pip install polars
The default `collect` method in Polars processes your data as a single batch, which means that all the data must fit into your available memory.
If your data requires more memory than you have available, Polars can process it in batches using streaming mode. To use streaming mode, simply pass the `streaming=True` argument to the `collect` method.
import polars as pl
df = (
pl.scan_csv("reddit.csv")
.with_columns(pl.col("name").str.to_uppercase())
.filter(pl.col("comment_karma") > 0)
.collect(streaming=True)
)
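When even the result is too large for memory, the lazy API can stream straight to disk instead of collecting. A minimal sketch using `sink_parquet` on the same query (the output filename here is just an example):
# Stream the filtered rows to a Parquet file without materializing them in memory
(
    pl.scan_csv("reddit.csv")
    .with_columns(pl.col("name").str.to_uppercase())
    .filter(pl.col("comment_karma") > 0)
    .sink_parquet("reddit_filtered.parquet")
)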
6.14.9. Pandas vs Polars: Syntax Comparison for Data Scientists#
As a data scientist, you're likely familiar with the popular data analysis libraries Pandas and Polars. Both provide powerful tools for working with tabular data, but how do their syntaxes compare?
To begin, we'll create equivalent dataframes in both Pandas and Polars:
import pandas as pd
import polars as pl
# Create a Pandas DataFrame
data = {
"Category": ["Electronics", "Clothing", "Electronics", "Clothing", "Electronics"],
"Quantity": [5, 2, 3, 10, 4],
"Price": [200, 30, 150, 20, 300],
}
pandas_df = pd.DataFrame(data)
polars_df = pl.DataFrame(data)
Key operations comparison: selecting columns, filtering rows, and grouping with aggregation.
pandas_df[["Category", "Price"]]
|   | Category    | Price |
|---|-------------|-------|
| 0 | Electronics | 200   |
| 1 | Clothing    | 30    |
| 2 | Electronics | 150   |
| 3 | Clothing    | 20    |
| 4 | Electronics | 300   |
polars_df.select(["Category", "Price"])
| Category      | Price |
|---------------|-------|
| str           | i64   |
| "Electronics" | 200   |
| "Clothing"    | 30    |
| "Electronics" | 150   |
| "Clothing"    | 20    |
| "Electronics" | 300   |
# Filtering rows where Quantity > 3
pandas_df[pandas_df["Quantity"] > 3]
|   | Category    | Quantity | Price |
|---|-------------|----------|-------|
| 0 | Electronics | 5        | 200   |
| 3 | Clothing    | 10       | 20    |
| 4 | Electronics | 4        | 300   |
polars_df.filter(pl.col("Quantity") > 3)
| Category      | Quantity | Price |
|---------------|----------|-------|
| str           | i64      | i64   |
| "Electronics" | 5        | 200   |
| "Clothing"    | 10       | 20    |
| "Electronics" | 4        | 300   |
pandas_df.groupby("Category").agg(
{
"Quantity": "sum",
"Price": "mean",
}
)
| Category    | Quantity | Price      |
|-------------|----------|------------|
| Clothing    | 12       | 25.000000  |
| Electronics | 12       | 216.666667 |
polars_df.group_by("Category").agg(
[
pl.col("Quantity").sum(),
pl.col("Price").mean(),
]
)
| Category      | Quantity | Price      |
|---------------|----------|------------|
| str           | i64      | f64        |
| "Clothing"    | 12       | 25.0       |
| "Electronics" | 12       | 216.666667 |
6.14.10. Faster Data Analysis with Polars: A Guide to Lazy Execution#
When processing data, the execution approach significantly impacts performance. Pandas, a popular Python data manipulation library, uses eager execution by default, processing data immediately and loading everything into memory. This works well for small to medium-sized datasets but can lead to slow computations and high memory usage with large datasets.
In contrast, Polars, a modern data processing library, offers both eager and lazy execution. In lazy mode, a query optimizer evaluates operations and determines the most efficient execution plan, which may involve reordering operations or dropping redundant calculations.
Let's consider an example where we:
Group a DataFrame by "region"
Calculate the sum of "revenue"
Filter for only the "North" and "South" regions
With eager execution, Pandas will:
Execute operations immediately, loading all data into memory
Keep intermediate results in memory during each step
Execute operations in the exact order written
import numpy as np
# Generate sample data
N = 10_000_000
data = {
"region": np.random.choice(["North", "South", "East", "West"], N),
"revenue": np.random.uniform(100, 10000, N),
"orders": np.random.randint(1, 100, N),
}
import pandas as pd
def analyze_sales_pandas(df):
# Loads and processes everything in memory
return (
df.groupby("region")
.agg({"revenue": "sum"})
.loc[["North", "South"]]
)
pd_df = pd.DataFrame(data)
%timeit analyze_sales_pandas(pd_df)
367 ms ± 47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
As shown above, the eager execution approach used by Pandas results in an execution time of approximately 367 milliseconds.
With lazy execution, Polars will:
Create an execution plan first, optimizing the entire chain before processing any data
Only process data once at .collect(), reducing memory overhead
Rearrange operations for optimal performance (pushing filters before groupby)
import polars as pl
def analyze_sales_polars(df):
# Creates execution plan, no data processed yet
result = (
df.lazy()
.group_by("region")
.agg(pl.col("revenue").sum())
.filter(pl.col("region").is_in(["North", "South"]))
.collect() # Only now data is processed
)
return result
pl_df = pl.DataFrame(data)
%timeit analyze_sales_polars(pl_df)
170 ms ± 15.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In contrast, the lazy execution approach with Polars completes in approximately 170 milliseconds, cutting the runtime by about 54% (roughly 2.2x faster) compared to the eager Pandas approach.
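You can verify the optimizer's work by printing the query plan instead of collecting it; the optimized plan typically shows the region filter applied before the aggregation, even though it is written after it. A minimal sketch:
# Print the optimized query plan without executing it
plan = (
    pl_df.lazy()
    .group_by("region")
    .agg(pl.col("revenue").sum())
    .filter(pl.col("region").is_in(["North", "South"]))
    .explain()
)
print(plan)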