6.14. Polars#
6.14.1. Polars: Blazing Fast DataFrame Library#
!pip install polars
If you want a data manipulation library that's both fast and memory-efficient, try Polars. Polars provides a high-level API similar to Pandas but with better performance for large datasets.
The code below compares the performance of Polars and pandas.
import pandas as pd
import polars as pl
import numpy as np
import time
# Create two Pandas DataFrames with 1 million rows each
pandas_df1 = pd.DataFrame({
    "key": np.random.randint(0, 1000, size=1_000_000),
    "value1": np.random.rand(1_000_000),
})
pandas_df2 = pd.DataFrame({
    "key": np.random.randint(0, 1000, size=1_000_000),
    "value2": np.random.rand(1_000_000),
})
# Create two Polars DataFrames from the Pandas DataFrames
polars_df1 = pl.from_pandas(pandas_df1)
polars_df2 = pl.from_pandas(pandas_df2)
# Merge the two DataFrames on the 'key' column
start_time = time.time()
pandas_merged = pd.merge(pandas_df1, pandas_df2, on='key')
pandas_time = time.time() - start_time
start_time = time.time()
polars_merged = polars_df1.join(polars_df2, on='key')
polars_time = time.time() - start_time
print(f"Pandas time: {pandas_time:.6f} seconds")
print(f"Polars time: {polars_time:.6f} seconds")
Pandas time: 127.604390 seconds
Polars time: 41.079080 seconds
print(f"Polars is {pandas_time/polars_time:.2f} times faster than Pandas")
Polars is 3.11 times faster than Pandas
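If downstream code still expects Pandas, you can convert the Polars result back with `to_pandas`. A minimal sketch (this conversion relies on pyarrow being installed):
# Convert the Polars result back to a Pandas DataFrame for downstream Pandas code
pandas_result = polars_merged.to_pandas()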
6.14.2. Polars: Speed Up Data Processing 12x with Lazy Execution#
!pip install polars
Polars is a lightning-fast DataFrame library that utilizes all available cores on your machine.
Polars has two APIs: an eager API and a lazy API.
Eager execution works like Pandas: code executes immediately. In contrast, lazy execution defers computations until the `collect()` method is called. This approach avoids unnecessary computations, making lazy execution potentially more efficient than eager execution.
The following code shows a filter operation on a DataFrame containing 10 million rows. Running Polars with lazy execution is 12 times faster than using Pandas.
import numpy as np
# Set a random seed for reproducibility
np.random.seed(42)
# Number of rows in the dataset
num_rows = 10_000_000
# Sample data for categorical columns
categories = ["a", "b", "c", "d"]
# Generate random data for the dataset
data = {
"Cat1": np.random.choice(categories, size=num_rows),
"Cat2": np.random.choice(categories, size=num_rows),
"Num1": np.random.randint(1, 100, size=num_rows),
"Num2": np.random.randint(1000, 10000, size=num_rows),
}
Create a Pandas DataFrame and filter it:
import pandas as pd
df = pd.DataFrame(data)
df.head()
|   | Cat1 | Cat2 | Num1 | Num2 |
|---|------|------|------|------|
| 0 | c    | a    | 40   | 7292 |
| 1 | d    | b    | 45   | 7849 |
| 2 | a    | a    | 93   | 6940 |
| 3 | c    | a    | 46   | 1265 |
| 4 | c    | a    | 98   | 2509 |
%timeit df[(df['Cat1'] == 'a') & (df['Cat2'] == 'b') & (df['Num1'] >= 70)]
706 ms ± 75.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Create a Polars DataFrame and filter it:
import polars as pl
pl_df = pl.DataFrame(data)
%timeit pl_df.lazy().filter((pl.col('Cat1') == 'a') & (pl.col('Cat2') == 'b') & (pl.col('Num1') >= 70)).collect()
58.1 ms ± 428 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
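For reference, the eager counterpart below runs the same filter immediately, with no query plan and no `collect` call; this is a minimal sketch of the eager API, not a benchmarked comparison:
# Eager execution: the filter runs as soon as the line executes
eager_result = pl_df.filter(
    (pl.col("Cat1") == "a") & (pl.col("Cat2") == "b") & (pl.col("Num1") >= 70)
)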
6.14.3. Polars vs. Pandas for CSV Loading and Filtering#
!pip install polars
!wget -O airport-codes.csv "https://datahub.io/core/airport-codes/r/0.csv"
The `read_csv` method in Pandas loads every row of the dataset into the DataFrame before filtering out the unwanted rows. The `scan_csv` method in Polars, on the other hand, defers execution and optimizes the query until the `collect` method is called. This approach accelerates code execution, particularly when handling large datasets.
In the code below, using Polars is 25.5 times faster than Pandas for reading and filtering a subset of a CSV file containing 57k rows.
import pandas as pd
import polars as pl
%%timeit
df = pd.read_csv("airport-codes.csv")
df[(df["type"] == "heliport") & (df["continent"] == "EU")]
143 ms ± 8.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
pl.scan_csv("airport-codes.csv").filter(
(pl.col("type") == "heliport") & (pl.col("continent") == "EU")
).collect()
5.6 ms ± 594 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
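Because `scan_csv` builds a query plan, Polars can also push column selection down into the reader so that only the referenced columns are parsed. A minimal sketch using the same file and its `type` and `continent` columns:
# Only the selected columns are read and parsed from the CSV
pl.scan_csv("airport-codes.csv").select(["type", "continent"]).collect()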
6.14.4. Pandas vs Polars: Harnessing Parallelism for Faster Data Processing#
!pip install polars
Pandas is a single-threaded library, utilizing only a single CPU core. To achieve parallelism with Pandas, you would need to use additional libraries like Dask.
import pandas as pd
import multiprocessing as mp
import dask.dataframe as dd
df = pd.DataFrame({"A": range(1_000_000), "B": range(1_000_000)})
# Perform the groupby and sum operation in parallel
ddf = dd.from_pandas(df, npartitions=mp.cpu_count())
result = ddf.groupby("A").sum().compute()
Polars, on the other hand, automatically leverages the available CPU cores without any additional configuration.
import polars as pl
df = pl.DataFrame({"A": range(1_000_000), "B": range(1_000_000)})
# Perform the groupby and sum operation in parallel
result = df.group_by("A").sum()
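In recent Polars versions, you can check how many threads the global thread pool uses with `pl.thread_pool_size()` (named `pl.threadpool_size()` in older releases), and cap it by setting the `POLARS_MAX_THREADS` environment variable before importing Polars:
# Number of worker threads Polars uses for parallel execution
print(pl.thread_pool_size())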
6.14.5. Simple and Expressive Data Transformation with Polars#
!pip install polars
Compared to pandas, Polars provides a more expressive syntax for creating complex data transformation pipelines. Every expression in Polars produces a new expression, and these expressions can be piped together.
import pandas as pd
df = pd.DataFrame(
{"A": [1, 2, 6], "B": ["a", "b", "c"], "C": [True, False, True]}
)
integer_columns = df.select_dtypes("int64")
other_columns = df[["B"]]
pd.concat([integer_columns, other_columns], axis=1)
|   | A | B |
|---|---|---|
| 0 | 1 | a |
| 1 | 2 | b |
| 2 | 6 | c |
import polars as pl
pl_df = pl.DataFrame(
{"A": [1, 2, 6], "B": ["a", "b", "c"], "C": [True, False, True]}
)
pl_df.select([pl.col(pl.Int64), "B"])
| A   | B   |
|-----|-----|
| i64 | str |
| 1   | "a" |
| 2   | "b" |
| 6   | "c" |
6.14.6. Harness Polars and Delta Lake for Blazing Fast Performance#
!pip install polars deltalake
Polars is a Rust-based DataFrame library that is designed for high-performance data manipulation and analysis. Delta Lake is a storage format that offers a range of benefits, including ACID transactions, time travel, schema enforcement, and more. It's designed to work seamlessly with big data processing engines like Apache Spark and can handle large amounts of data with ease.
When you combine Polars and Delta Lake, you get a powerful data processing system. Polars does the heavy lifting of processing your data, while Delta Lake keeps everything organized and up-to-date.
Imagine you have a huge dataset with millions of rows. You want to group the data by category and calculate the sum of a certain column. With Polars and Delta Lake, you can do this quickly and easily.
First, you create a sample dataset:
import pandas as pd
import numpy as np
# Create a sample dataset
num_rows = 10_000_000
data = {
"Cat1": np.random.choice(['A', 'B', 'C'], size=num_rows),
'Num1': np.random.randint(low=1, high=100, size=num_rows)
}
df = pd.DataFrame(data)
df.head()
|   | Cat1 | Num1 |
|---|------|------|
| 0 | A    | 84   |
| 1 | C    | 63   |
| 2 | B    | 11   |
| 3 | A    | 73   |
| 4 | B    | 57   |
Next, you save the dataset to Delta Lake:
from deltalake.writer import write_deltalake
save_path = "tmp/data"
write_deltalake(save_path, df)
Then, you can use Polars to read the data from Delta Lake and perform the grouping operation:
import polars as pl
pl_df = pl.read_delta(save_path)
print(pl_df.group_by("Cat1").sum())
shape: (3, 2)
┌──────┬───────────┐
│ Cat1 ┆ Num1      │
│ ---  ┆ ---       │
│ str  ┆ i64       │
╞══════╪═══════════╡
│ B    ┆ 166653474 │
│ C    ┆ 166660653 │
│ A    ┆ 166597835 │
└──────┴───────────┘
Let's say you want to append some new data to the existing dataset:
new_data = pd.DataFrame({"Cat1": ["B", "C"], "Num1": [2, 3]})
write_deltalake(save_path, new_data, mode="append")
Now, you can use Polars to read the updated data from Delta Lake:
updated_pl_df = pl.read_delta(save_path)
print(updated_pl_df.tail())
shape: (5, 2)
┌──────┬──────┐
│ Cat1 ┆ Num1 │
│ ---  ┆ ---  │
│ str  ┆ i64  │
╞══════╪══════╡
│ A    ┆ 29   │
│ A    ┆ 41   │
│ A    ┆ 49   │
│ B    ┆ 2    │
│ C    ┆ 3    │
└──────┴──────┘
But what if you want to go back to the previous version of the data? With Delta Lake, you can easily do that by specifying the version number:
previous_pl_df = pl.read_delta(save_path, version=0)
print(previous_pl_df.tail())
shape: (5, 2)
┌──────┬──────┐
│ Cat1 ┆ Num1 │
│ ---  ┆ ---  │
│ str  ┆ i64  │
╞══════╪══════╡
│ A    ┆ 90   │
│ C    ┆ 83   │
│ A    ┆ 29   │
│ A    ┆ 41   │
│ A    ┆ 49   │
└──────┴──────┘
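To see which versions are available before time traveling, you can inspect the table's commit history with the deltalake package; a minimal sketch:
from deltalake import DeltaTable

# Each write (the initial one and the append) appears as one commit
dt = DeltaTable(save_path)
print(dt.history())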
6.14.7. Parallel Execution of Multiple Files with Polars#
!pip install polars
If you have multiple files to process, Polars enables you to construct a query plan for each file beforehand. This allows for the efficient execution of multiple files concurrently, maximizing processing speed.
import glob
import polars as pl
# Construct a query plan for each file
queries = []
for file in glob.glob("test_data/*.csv"):
q = pl.scan_csv(file).group_by("Cat").agg(pl.sum("Num"))
queries.append(q)
# Execute files in parallel
dataframes = pl.collect_all(queries)
dataframes
[shape: (3, 2)
┌─────┬─────┐
│ Cat ┆ Num │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ A   ┆ 2   │
│ C   ┆ 6   │
│ B   ┆ 4   │
└─────┴─────┘,
shape: (3, 2)
┌─────┬─────┐
│ Cat ┆ Num │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ B   ┆ 5   │
│ A   ┆ 1   │
│ C   ┆ 1   │
└─────┴─────┘,
shape: (3, 2)
┌─────┬─────┐
│ Cat ┆ Num │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ C   ┆ 4   │
│ A   ┆ 4   │
│ B   ┆ 1   │
└─────┴─────┘]
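If the per-file results share a schema, they can be stacked into a single DataFrame afterwards; a minimal sketch:
# Combine the per-file aggregations into one DataFrame
combined = pl.concat(dataframes)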
6.14.8. Polars' Streaming Mode: A Solution for Large Data Sets#
!pip install polars
The default `collect` method in Polars processes your data as a single batch, which means that all the data must fit into your available memory.
If your data requires more memory than you have available, Polars can process it in batches using streaming mode. To use streaming mode, simply pass the `streaming=True` argument to the `collect` method.
import polars as pl
df = (
pl.scan_csv("reddit.csv")
.with_columns(pl.col("name").str.to_uppercase())
.filter(pl.col("comment_karma") > 0)
.collect(streaming=True)
)
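When even the result is too large for memory, the lazy API can stream straight to disk instead of collecting. A minimal sketch using `sink_parquet` on the same query (the output filename here is just an example):
# Stream the filtered rows to a Parquet file without materializing them in memory
(
    pl.scan_csv("reddit.csv")
    .with_columns(pl.col("name").str.to_uppercase())
    .filter(pl.col("comment_karma") > 0)
    .sink_parquet("reddit_filtered.parquet")
)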
6.14.9. Pandas vs Polars: Syntax Comparison for Data Scientists#
As a data scientist, you're likely familiar with the popular data analysis libraries Pandas and Polars. Both provide powerful tools for working with tabular data, but how do their syntaxes compare?
To begin, we'll create equivalent dataframes in both Pandas and Polars:
import pandas as pd
import polars as pl
# Create a Pandas DataFrame
data = {
"Category": ["Electronics", "Clothing", "Electronics", "Clothing", "Electronics"],
"Quantity": [5, 2, 3, 10, 4],
"Price": [200, 30, 150, 20, 300],
}
pandas_df = pd.DataFrame(data)
polars_df = pl.DataFrame(data)
Key operations comparison: selecting columns, filtering rows, and grouping with aggregation.
pandas_df[["Category", "Price"]]
|   | Category    | Price |
|---|-------------|-------|
| 0 | Electronics | 200   |
| 1 | Clothing    | 30    |
| 2 | Electronics | 150   |
| 3 | Clothing    | 20    |
| 4 | Electronics | 300   |
polars_df.select(["Category", "Price"])
| Category      | Price |
|---------------|-------|
| str           | i64   |
| "Electronics" | 200   |
| "Clothing"    | 30    |
| "Electronics" | 150   |
| "Clothing"    | 20    |
| "Electronics" | 300   |
# Filtering rows where Quantity > 3
pandas_df[pandas_df["Quantity"] > 3]
|   | Category    | Quantity | Price |
|---|-------------|----------|-------|
| 0 | Electronics | 5        | 200   |
| 3 | Clothing    | 10       | 20    |
| 4 | Electronics | 4        | 300   |
polars_df.filter(pl.col("Quantity") > 3)
| Category      | Quantity | Price |
|---------------|----------|-------|
| str           | i64      | i64   |
| "Electronics" | 5        | 200   |
| "Clothing"    | 10       | 20    |
| "Electronics" | 4        | 300   |
pandas_df.groupby("Category").agg(
{
"Quantity": "sum",
"Price": "mean",
}
)
| Category    | Quantity | Price      |
|-------------|----------|------------|
| Clothing    | 12       | 25.000000  |
| Electronics | 12       | 216.666667 |
polars_df.group_by("Category").agg(
[
pl.col("Quantity").sum(),
pl.col("Price").mean(),
]
)
| Category      | Quantity | Price      |
|---------------|----------|------------|
| str           | i64      | f64        |
| "Clothing"    | 12       | 25.0       |
| "Electronics" | 12       | 216.666667 |
6.14.10. Faster Data Analysis with Polars: A Guide to Lazy Execution#
When processing data, the execution approach significantly impacts performance. Pandas, a popular Python data manipulation library, uses eager execution by default, processing data immediately and loading everything into memory. This works well for small to medium-sized datasets but can lead to slow computations and high memory usage with large datasets.
In contrast, Polars, a modern data processing library, offers both eager and lazy execution. In lazy mode, a query optimizer evaluates operations and determines the most efficient execution plan, which may involve reordering operations or dropping redundant calculations.
Let's consider an example where we:
Group a DataFrame by "region"
Calculate the sum of "revenue"
Filter for only the "North" and "South" regions
With eager execution, Pandas will:
Execute operations immediately, loading all data into memory
Keep intermediate results in memory during each step
Execute operations in the exact order written
import numpy as np
# Generate sample data
N = 10_000_000
data = {
"region": np.random.choice(["North", "South", "East", "West"], N),
"revenue": np.random.uniform(100, 10000, N),
"orders": np.random.randint(1, 100, N),
}
import pandas as pd
def analyze_sales_pandas(df):
# Loads and processes everything in memory
return (
df.groupby("region")
.agg({"revenue": "sum"})
.loc[["North", "South"]]
)
pd_df = pd.DataFrame(data)
%timeit analyze_sales_pandas(pd_df)
367 ms ± 47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
As shown above, the eager execution approach used by Pandas results in an execution time of approximately 367 milliseconds.
With lazy execution, Polars will:
Create an execution plan first, optimizing the entire chain before processing any data
Only process data once at .collect(), reducing memory overhead
Rearrange operations for optimal performance (pushing filters before groupby)
import polars as pl
def analyze_sales_polars(df):
# Creates execution plan, no data processed yet
result = (
df.lazy()
.group_by("region")
.agg(pl.col("revenue").sum())
.filter(pl.col("region").is_in(["North", "South"]))
.collect() # Only now data is processed
)
return result
pl_df = pl.DataFrame(data)
%timeit analyze_sales_polars(pl_df)
170 ms ± 15.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In contrast, the lazy execution approach with Polars completes in approximately 170 milliseconds, cutting the runtime by about 54% (roughly 2.2x faster) compared to the eager Pandas approach.
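You can verify the optimizer's work by printing the query plan instead of collecting it; the optimized plan typically shows the region filter applied before the aggregation, even though it is written after it. A minimal sketch:
# Print the optimized query plan without executing it
plan = (
    pl_df.lazy()
    .group_by("region")
    .agg(pl.col("revenue").sum())
    .filter(pl.col("region").is_in(["North", "South"]))
    .explain()
)
print(plan)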