6.17. DuckDB#
6.17.1. SQL Analytics with DuckDB#
!pip install psycopg2 duckdb
Running SQL on pandas DataFrames typically requires setting up and maintaining a separate database server, which adds complexity to analytical workflows:
import pandas as pd
# Create sample data
df = pd.DataFrame(
{"customer_id": range(1000), "revenue": range(1000), "segment": ["A", "B"] * 500}
)
import psycopg2
from sqlalchemy import create_engine
# Create a connection to PostgreSQL
engine = create_engine("postgresql://postgres:postgres@localhost:5432/postgres")
# Write the DataFrame to a PostgreSQL table
df.to_sql("customers", engine, if_exists="replace", index=False)
# Execute SQL query against the PostgreSQL database
with engine.connect() as conn:
    result = pd.read_sql(
        """
        SELECT
            segment,
            AVG(revenue) as avg_revenue,
            COUNT(*) as customer_count
        FROM customers
        GROUP BY segment
        ORDER BY avg_revenue DESC
        """,
        conn,
    )
print("Segment Analysis:")
print(result)
Segment Analysis:
segment avg_revenue customer_count
0 B 500.0 500
1 A 499.0 500
DuckDB simplifies this by providing direct SQL operations on DataFrames without server setup:
import duckdb
# Direct SQL operations on DataFrame - no server needed
result = duckdb.sql(
"""
SELECT
segment,
AVG(revenue) as avg_revenue,
COUNT(*) as customer_count
FROM df
GROUP BY segment
ORDER BY avg_revenue DESC
"""
).df()
print("Segment Analysis:")
print(result)
Segment Analysis:
segment avg_revenue customer_count
0 B 500.0 500
1 A 499.0 500
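Beyond in-memory queries, DuckDB can also persist data without a server: opening a database file with duckdb.connect stores tables on disk so results survive the Python session. A minimal sketch (the file name analytics.duckdb is illustrative):
import duckdb
import pandas as pd

df = pd.DataFrame(
    {"customer_id": range(1000), "revenue": range(1000), "segment": ["A", "B"] * 500}
)

# Open (or create) a persistent DuckDB database file instead of an in-memory one
con = duckdb.connect("analytics.duckdb")

# Materialize the DataFrame as a table inside the database file
con.execute("CREATE OR REPLACE TABLE customers AS SELECT * FROM df")

# Query the persisted table and get the result back as a pandas DataFrame
print(
    con.sql("SELECT segment, AVG(revenue) AS avg_revenue FROM customers GROUP BY segment").df()
)

con.close()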
6.17.2. Efficient SQL Operations with DuckDB on Pandas DataFrames#
!pip install --quiet duckdb
!wget -q https://github.com/cwida/duckdb-data/releases/download/v1.0/lineitemsf1.snappy.parquet
Using SQL with pandas lets data scientists combine SQL's powerful querying capabilities with the data manipulation functionality of pandas.
In the code below, aggregating the data with DuckDB is roughly six times faster than the equivalent pandas groupby.
import duckdb
import pandas as pd
df = pd.read_parquet("lineitemsf1.snappy.parquet")
%%timeit
df.groupby("l_returnflag").agg(
Sum=("l_extendedprice", "sum"),
Min=("l_extendedprice", "min"),
Max=("l_extendedprice", "max"),
Avg=("l_extendedprice", "mean"),
)
226 ms ± 4.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
duckdb.query(
"""
SELECT
l_returnflag,
SUM(l_extendedprice),
MIN(l_extendedprice),
MAX(l_extendedprice),
AVG(l_extendedprice)
FROM df
GROUP BY
l_returnflag
"""
).to_df()
37 ms ± 2.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
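Because .to_df() hands the result back as a pandas DataFrame, you can keep the fast SQL aggregation and continue the rest of the workflow in pandas. A small sketch, assuming the df loaded from the Parquet file above, that aliases the aggregates so the columns match the pandas version:
import duckdb

agg = duckdb.query(
    """
    SELECT
        l_returnflag,
        SUM(l_extendedprice) AS Sum,
        MIN(l_extendedprice) AS Min,
        MAX(l_extendedprice) AS Max,
        AVG(l_extendedprice) AS Avg
    FROM df
    GROUP BY l_returnflag
    """
).to_df()

# Continue in pandas as usual
print(agg.set_index("l_returnflag"))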
6.17.3. Efficiently Handle Large Datasets with DuckDB and PyArrow#
!pip install deltalake duckdb
!wget -q https://github.com/cwida/duckdb-data/releases/download/v1.0/lineitemsf1.snappy.parquet
DuckDB leverages various optimizations for query execution, while PyArrow efficiently handles in-memory data processing and storage. Combining DuckDB and PyArrow allows you to efficiently process datasets larger than memory on a single machine.
In the code below, we convert a Delta Lake table with over 6 million rows to a pandas DataFrame and a PyArrow dataset, which are then used by DuckDB.
Running DuckDB on the PyArrow dataset is roughly 2,900 times faster than running DuckDB on the pandas DataFrame.
import duckdb
import pandas as pd
from deltalake.writer import write_deltalake
df = pd.read_parquet("lineitemsf1.snappy.parquet")
write_deltalake("delta_lake", df)
from deltalake import DeltaTable
table = DeltaTable("delta_lake")
%%timeit
quack = duckdb.df(table.to_pandas())
quack.filter("l_quantity > 50")
2.77 s ± 108 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
quack = duckdb.arrow(table.to_pyarrow_dataset())
quack.filter("l_quantity > 50")
954 µs ± 32.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
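Note that filter on a DuckDB relation only builds a lazy query plan, which is part of why the timing above is measured in microseconds; the rows are not read until you materialize the result with a call such as .df() or .arrow(). A minimal sketch, assuming the table object from above:
# Build the lazy relation on top of the PyArrow dataset
quack = duckdb.arrow(table.to_pyarrow_dataset())
filtered = quack.filter("l_quantity > 50")

# Materialize the filtered rows into pandas; this is where the scan actually runs
filtered_df = filtered.df()
print(len(filtered_df))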
6.17.4. Simplify CSV Data Management with DuckDB#
!pip install duckdb
Traditional database systems require a predefined table schema and a separate data-loading step when working with CSV data.
To demonstrate this, let's create a CSV file called customers.csv:
import pandas as pd
# Create a sample dataframe
data = {
"name": ["Alice", "Bob", "Charlie", "David", "Eve"],
"age": [25, 32, 45, 19, 38],
"city": ["New York", "London", "Paris", "Berlin", "Tokyo"],
}
df = pd.DataFrame(data)
# Save the dataframe as a CSV file
df.to_csv("customers.csv", index=False)
To load this CSV file into Postgres, you need to run the following statements:
-- Create the table
CREATE TABLE customers (
    name VARCHAR(100),
    age INT,
    city VARCHAR(100)
);
-- Load data from CSV
COPY customers
FROM 'customers.csv'
DELIMITER ','
CSV HEADER;
In contrast, DuckDB allows for direct reading of CSV files from disk, eliminating the need for explicit table creation and data loading.
import duckdb
duckdb.sql("SELECT * FROM 'customers.csv'")
┌─────────┬───────┬──────────┐
│  name   │  age  │   city   │
│ varchar │ int64 │ varchar  │
├─────────┼───────┼──────────┤
│ Alice   │    25 │ New York │
│ Bob     │    32 │ London   │
│ Charlie │    45 │ Paris    │
│ David   │    19 │ Berlin   │
│ Eve     │    38 │ Tokyo    │
└─────────┴───────┴──────────┘
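DuckDB can also write query results straight back to disk with a COPY ... TO statement, so a quick filter-and-export needs no intermediate DataFrame. A minimal sketch (the output file name is illustrative):
import duckdb

# Filter the CSV and write the matching rows directly to a new CSV file
duckdb.sql(
    """
    COPY (SELECT * FROM 'customers.csv' WHERE age > 30)
    TO 'adult_customers.csv' (HEADER, DELIMITER ',')
    """
)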
6.17.5. Automate CSV Parsing with DuckDB's read_csv#
!pip install duckdb
Reading a CSV file with pandas without specifying parameters such as the delimiter or header can lead to incorrect parsing, especially when the file uses a custom delimiter or unusual formatting.
import pandas as pd
# Example CSV content with a custom delimiter
csv_content = """FlightDate|UniqueCarrier|OriginCityName|DestCityName
1988-01-01|AA|New York, NY|Los Angeles, CA
1988-01-02|AA|New York, NY|Los Angeles, CA
1988-01-03|AA|New York, NY|Los Angeles, CA
"""
## Writing the CSV content to a file
with open("example.csv", "w") as f:
    f.write(csv_content)
## Reading the CSV file with pandas without specifying the delimiter
df = pd.read_csv("example.csv")
print(df)
FlightDate|UniqueCarrier|OriginCityName|DestCityName
1988-01-01|AA|New York NY|Los Angeles CA
1988-01-02|AA|New York NY|Los Angeles CA
1988-01-03|AA|New York NY|Los Angeles CA
The pandas code above does not specify the delimiter, so the default comma delimiter (,) is assumed and the pipe-separated rows are not split into columns correctly.
DuckDB's read_csv function addresses this by automatically detecting the structure of the CSV file, including the delimiter, header, and column types.
import duckdb
## Use DuckDB to automatically detect and read the CSV structure
result = duckdb.query("SELECT * FROM read_csv('example.csv')").to_df()
print(result)
FlightDate UniqueCarrier OriginCityName DestCityName
0 1988-01-01 AA New York, NY Los Angeles, CA
1 1988-01-02 AA New York, NY Los Angeles, CA
2 1988-01-03 AA New York, NY Los Angeles, CA
In this example:

- read_csv automatically detects the pipe (|) delimiter, identifies the column headers, and infers the correct data types for each column.
- duckdb.query runs the SQL query to read the file and returns the result as a DataFrame via .to_df().

The output shows that DuckDB automatically detected the correct delimiter (|) and parsed the data into the right columns.
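If the automatic detection ever guesses wrong, read_csv also accepts explicit options, so you can pin down the format yourself. A small sketch that spells out the delimiter and header instead of relying on the sniffer:
import duckdb

# Override auto-detection with explicit read_csv options
result = duckdb.query(
    "SELECT * FROM read_csv('example.csv', delim='|', header=true)"
).to_df()
print(result)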
6.17.6. Multiple CSV Files Processing with DuckDB#
!pip install duckdb
Processing multiple CSV files is a common task in data analysis and engineering workflows. Traditionally, handling multiple CSV files requires writing loops or list comprehensions, which can be verbose and error-prone.
To demonstrate, let's assume we have two CSV files in the data directory:
from pathlib import Path
import pandas as pd
# Create example dataframe for first file
df1 = pd.DataFrame(
{
"Date": ["2023-01-01", "2023-01-02", "2023-01-03"],
"Product": ["Laptop", "Phone", "Tablet"],
"Sales": [1200, 800, 600],
}
)
# Create example dataframe for second file
df2 = pd.DataFrame(
{
"Date": ["2023-02-01", "2023-02-02", "2023-02-03"],
"Product": ["Laptop", "Monitor", "Mouse"],
"Sales": [1500, 400, 50],
}
)
Path("data").mkdir()
df1.to_csv("data/sales_jan.csv", index=False)
df2.to_csv("data/sales_feb.csv", index=False)
import glob

import pandas as pd
## Read all CSV files and concatenate them
csv_files = glob.glob("data/*.csv")
dfs = []
for file in csv_files:
    df = pd.read_csv(file)
    dfs.append(df)
## Concatenate all dataframes
combined_df = pd.concat(dfs, ignore_index=True)
## Perform grouping and aggregation
result = (
combined_df.groupby("Date")["Sales"]
.sum()
.reset_index()
.rename(columns={"Sales": "total_sales"})
.sort_values("Date")
)
result
|   | Date | total_sales |
|---|---|---|
| 0 | 2023-01-01 | 1200 |
| 1 | 2023-01-02 | 800 |
| 2 | 2023-01-03 | 600 |
| 3 | 2023-02-01 | 1500 |
| 4 | 2023-02-02 | 400 |
| 5 | 2023-02-03 | 50 |
DuckDB simplifies reading multiple CSV files with a single line of code that efficiently processes the files:
import duckdb
## Read and analyze all CSV files at once
result = duckdb.sql(
"""
SELECT
date,
SUM(sales) as total_sales
FROM 'data/*.csv'
GROUP BY date
ORDER BY date
"""
).df()
result
|   | Date | total_sales |
|---|---|---|
| 0 | 2023-01-01 | 1200.0 |
| 1 | 2023-01-02 | 800.0 |
| 2 | 2023-01-03 | 600.0 |
| 3 | 2023-02-01 | 1500.0 |
| 4 | 2023-02-02 | 400.0 |
| 5 | 2023-02-03 | 50.0 |
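When it matters which file each row came from, read_csv's filename option adds the source path as an extra column that you can group or filter on. A sketch assuming the same data directory:
import duckdb

# Track which CSV file each row originated from
per_file = duckdb.sql(
    """
    SELECT
        filename,
        SUM(sales) AS total_sales
    FROM read_csv('data/*.csv', filename = true)
    GROUP BY filename
    ORDER BY filename
    """
).df()
per_file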
6.17.7. DuckDB: Query Pandas DataFrames Faster with Columnar Storage#
!pip install duckdb
When analyzing data with operations like GROUP BY, SUM, or AVG on specific columns, row-based storage is inefficient: entire rows must be read even when only a few columns are needed, wasting I/O and memory.
Example using SQLite (row-based):
import sqlite3
import pandas as pd
customer = pd.DataFrame(
{"id": [1, 2, 3], "name": ["Alex", "Ben", "Chase"], "age": [25, 30, 35]}
)
## Load data to SQLite and query
conn = sqlite3.connect(":memory:")
customer.to_sql("customer", conn, index=False)
## Must read all columns internally even though we only need 'age'
query = "SELECT age FROM customer"
result = pd.read_sql(query, conn)
DuckDB uses columnar storage, allowing you to efficiently read and process only the columns needed for your analysis. This improves both query speed and memory usage:
import duckdb
import pandas as pd
customer = pd.DataFrame(
{"id": [1, 2, 3], "name": ["Alex", "Ben", "Chase"], "age": [25, 30, 35]}
)
query = "SELECT age FROM customer"
result = duckdb.sql(query).df()
result
|   | age |
|---|---|
| 0 | 25 |
| 1 | 30 |
| 2 | 35 |
In this example, DuckDB only needs to access the "age" column in memory, while SQLite must read all columns ("id", "name", "age") internally even though only "age" is selected. DuckDB also provides a simpler workflow by querying pandas DataFrames directly.
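To check that only the requested column is scanned, you can ask DuckDB for its query plan with EXPLAIN; the projection listed under the DataFrame scan should contain just age (the exact plan text varies by DuckDB version):
import duckdb
import pandas as pd

customer = pd.DataFrame(
    {"id": [1, 2, 3], "name": ["Alex", "Ben", "Chase"], "age": [25, 30, 35]}
)

# Inspect the plan; the scan over the DataFrame projects only the 'age' column
print(duckdb.sql("EXPLAIN SELECT age FROM customer").df())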
6.17.8. Query Nested Parquet Files Easily Using DuckDB#
!pip install duckdb pandas pyarrow
Analyzing large, nested Parquet files often requires pre-processing to flatten the data or writing complex scripts to extract nested fields, which can be time-consuming and error-prone.
import pandas as pd
# Create a nested dataset and save it as a Parquet file
data = {
"id": [1, 2],
"details": [{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}],
}
## Convert to a DataFrame
df = pd.DataFrame(data)
# Save as a nested Parquet file
df.to_parquet("nested_data.parquet")
print(df)
id details
0 1 {'name': 'Alice', 'age': 25}
1 2 {'name': 'Bob', 'age': 30}
## Read the DataFrame from Parquet file
df = pd.read_parquet("nested_data.parquet")
# Create a new DataFrame with the flattened structure
flat_df = pd.DataFrame(
{
"id": df["id"],
"name": [detail["name"] for detail in df["details"]],
"age": [detail["age"] for detail in df["details"]],
}
)
print(flat_df)
id name age
0 1 Alice 25
1 2 Bob 30
DuckDB allows querying nested Parquet files directly using SQL without needing to flatten or preprocess the data.
To load and query nested Parquet files with DuckDB:
import duckdb
## Query the nested Parquet file directly
query_result = duckdb.query(
"""
SELECT
id,
details.name AS name,
details.age AS age
FROM read_parquet('nested_data.parquet')
"""
).to_df()
print(query_result)
id name age
0 1 Alice 25
1 2 Bob 30
In this example:

- read_parquet('nested_data.parquet') reads the nested Parquet file.
- SQL syntax allows you to access nested fields using dot notation (e.g., details.name).
The output is a flattened representation of the nested data, directly queried from the Parquet file without additional preprocessing.
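Dot notation covers struct columns like details; when a column holds a list of structs, DuckDB's UNNEST can expand it into one row per element. A self-contained sketch with illustrative inline data (not part of the dataset above):
import duckdb

# Hypothetical list-of-struct data built inline; UNNEST expands each order into its own row,
# and recursive := true also spreads the struct fields into separate columns
result = duckdb.query(
    """
    SELECT
        id,
        UNNEST(items, recursive := true)
    FROM (VALUES
        (1, [{'item': 'pen', 'qty': 2}, {'item': 'book', 'qty': 1}]),
        (2, [{'item': 'mug', 'qty': 3}])
    ) AS t(id, items)
    """
).to_df()
print(result)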