6.17. DuckDB#
6.17.1. SQL Analytics with DuckDB#
!pip install psycopg2 duckdb
Running SQL on pandas DataFrames typically requires setting up and maintaining a separate database server, which adds complexity to analytical workflows:
import pandas as pd
# Create sample data
df = pd.DataFrame(
{"customer_id": range(1000), "revenue": range(1000), "segment": ["A", "B"] * 500}
)
import psycopg2
from sqlalchemy import create_engine
# Create a connection to PostgreSQL
engine = create_engine("postgresql://postgres:postgres@localhost:5432/postgres")
# Write the DataFrame to a PostgreSQL table
df.to_sql("customers", engine, if_exists="replace", index=False)
# Execute SQL query against the PostgreSQL database
with engine.connect() as conn:
    result = pd.read_sql(
        """
        SELECT
            segment,
            AVG(revenue) as avg_revenue,
            COUNT(*) as customer_count
        FROM customers
        GROUP BY segment
        ORDER BY avg_revenue DESC
        """,
        conn,
    )
print("Segment Analysis:")
print(result)
Segment Analysis:
segment avg_revenue customer_count
0 B 500.0 500
1 A 499.0 500
DuckDB simplifies this by providing direct SQL operations on DataFrames without server setup:
import duckdb
# Direct SQL operations on DataFrame - no server needed
result = duckdb.sql(
"""
SELECT
segment,
AVG(revenue) as avg_revenue,
COUNT(*) as customer_count
FROM df
GROUP BY segment
ORDER BY avg_revenue DESC
"""
).df()
print("Segment Analysis:")
print(result)
Segment Analysis:
segment avg_revenue customer_count
0 B 500.0 500
1 A 499.0 500
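Beyond in-memory queries, DuckDB can also persist data without a server: opening a database file with duckdb.connect stores tables on disk so results survive the Python session. A minimal sketch (the file name analytics.duckdb is illustrative):
import duckdb
import pandas as pd

df = pd.DataFrame(
    {"customer_id": range(1000), "revenue": range(1000), "segment": ["A", "B"] * 500}
)

# Open (or create) a persistent DuckDB database file instead of an in-memory one
con = duckdb.connect("analytics.duckdb")

# Materialize the DataFrame as a table inside the database file
con.execute("CREATE OR REPLACE TABLE customers AS SELECT * FROM df")

# Query the persisted table and get the result back as a pandas DataFrame
print(
    con.sql("SELECT segment, AVG(revenue) AS avg_revenue FROM customers GROUP BY segment").df()
)

con.close()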
6.17.2. Efficient SQL Operations with DuckDB on Pandas DataFrames#
!pip install --quiet duckdb
!wget -q https://github.com/cwida/duckdb-data/releases/download/v1.0/lineitemsf1.snappy.parquet
Using SQL with pandas lets data scientists combine SQL's powerful querying capabilities with the data manipulation functionality of pandas.
In the code below, aggregating the data with DuckDB is roughly six times faster than the equivalent pandas groupby.
import duckdb
import pandas as pd
df = pd.read_parquet("lineitemsf1.snappy.parquet")
%%timeit
df.groupby("l_returnflag").agg(
Sum=("l_extendedprice", "sum"),
Min=("l_extendedprice", "min"),
Max=("l_extendedprice", "max"),
Avg=("l_extendedprice", "mean"),
)
226 ms ± 4.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
duckdb.query(
"""
SELECT
l_returnflag,
SUM(l_extendedprice),
MIN(l_extendedprice),
MAX(l_extendedprice),
AVG(l_extendedprice)
FROM df
GROUP BY
l_returnflag
"""
).to_df()
37 ms ± 2.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
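Because .to_df() hands the result back as a pandas DataFrame, you can keep the fast SQL aggregation and continue the rest of the workflow in pandas. A small sketch, assuming the df loaded from the Parquet file above, that aliases the aggregates so the columns match the pandas version:
import duckdb

agg = duckdb.query(
    """
    SELECT
        l_returnflag,
        SUM(l_extendedprice) AS Sum,
        MIN(l_extendedprice) AS Min,
        MAX(l_extendedprice) AS Max,
        AVG(l_extendedprice) AS Avg
    FROM df
    GROUP BY l_returnflag
    """
).to_df()

# Continue in pandas as usual
print(agg.set_index("l_returnflag"))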
6.17.3. Efficiently Handle Large Datasets with DuckDB and PyArrow#
!pip install deltalake duckdb
!wget -q https://github.com/cwida/duckdb-data/releases/download/v1.0/lineitemsf1.snappy.parquet
DuckDB leverages various optimizations for query execution, while PyArrow efficiently handles in-memory data processing and storage. Combining DuckDB and PyArrow allows you to efficiently process datasets larger than memory on a single machine.
In the code below, we convert a Delta Lake table with over 6 million rows to a pandas DataFrame and a PyArrow dataset, which are then used by DuckDB.
Running DuckDB on the PyArrow dataset is roughly 2,900 times faster than running DuckDB on the pandas DataFrame.
import duckdb
import pandas as pd
from deltalake.writer import write_deltalake
df = pd.read_parquet("lineitemsf1.snappy.parquet")
write_deltalake("delta_lake", df)
from deltalake import DeltaTable
table = DeltaTable("delta_lake")
%%timeit
quack = duckdb.df(table.to_pandas())
quack.filter("l_quantity > 50")
2.77 s ± 108 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
quack = duckdb.arrow(table.to_pyarrow_dataset())
quack.filter("l_quantity > 50")
954 µs ± 32.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
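Note that filter on a DuckDB relation only builds a lazy query plan, which is part of why the timing above is measured in microseconds; the rows are not read until you materialize the result with a call such as .df() or .arrow(). A minimal sketch, assuming the table object from above:
# Build the lazy relation on top of the PyArrow dataset
quack = duckdb.arrow(table.to_pyarrow_dataset())
filtered = quack.filter("l_quantity > 50")

# Materialize the filtered rows into pandas; this is where the scan actually runs
filtered_df = filtered.df()
print(len(filtered_df))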
6.17.4. Simplify CSV Data Management with DuckDB#
!pip install duckdb
Traditional database systems require a predefined table schema and a separate data-loading step when working with CSV data.
To demonstrate this, let's create a CSV file called customers.csv:
import pandas as pd
# Create a sample dataframe
data = {
"name": ["Alice", "Bob", "Charlie", "David", "Eve"],
"age": [25, 32, 45, 19, 38],
"city": ["New York", "London", "Paris", "Berlin", "Tokyo"],
}
df = pd.DataFrame(data)
# Save the dataframe as a CSV file
df.to_csv("customers.csv", index=False)
To load this CSV file into Postgres, you need to run the following statements:
-- Create the table
CREATE TABLE customers (
    name VARCHAR(100),
    age INT,
    city VARCHAR(100)
);
-- Load data from CSV
COPY customers
FROM 'customers.csv'
DELIMITER ','
CSV HEADER;
In contrast, DuckDB allows for direct reading of CSV files from disk, eliminating the need for explicit table creation and data loading.
import duckdb
duckdb.sql("SELECT * FROM 'customers.csv'")
┌─────────┬───────┬──────────┐
│  name   │  age  │   city   │
│ varchar │ int64 │ varchar  │
├─────────┼───────┼──────────┤
│ Alice   │    25 │ New York │
│ Bob     │    32 │ London   │
│ Charlie │    45 │ Paris    │
│ David   │    19 │ Berlin   │
│ Eve     │    38 │ Tokyo    │
└─────────┴───────┴──────────┘
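DuckDB can also write query results straight back to disk with a COPY ... TO statement, so a quick filter-and-export needs no intermediate DataFrame. A minimal sketch (the output file name is illustrative):
import duckdb

# Filter the CSV and write the matching rows directly to a new CSV file
duckdb.sql(
    """
    COPY (SELECT * FROM 'customers.csv' WHERE age > 30)
    TO 'adult_customers.csv' (HEADER, DELIMITER ',')
    """
)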
6.17.5. Automate CSV Parsing with DuckDB's read_csv#
!pip install duckdb
Reading a CSV file with pandas without specifying parameters such as the delimiter or header can lead to incorrect parsing, especially when the file uses a custom delimiter or unusual formatting.
import pandas as pd
# Example CSV content with a custom delimiter
csv_content = """FlightDate|UniqueCarrier|OriginCityName|DestCityName
1988-01-01|AA|New York, NY|Los Angeles, CA
1988-01-02|AA|New York, NY|Los Angeles, CA
1988-01-03|AA|New York, NY|Los Angeles, CA
"""
## Writing the CSV content to a file
with open("example.csv", "w") as f:
    f.write(csv_content)
## Reading the CSV file with pandas without specifying the delimiter
df = pd.read_csv("example.csv")
print(df)
FlightDate|UniqueCarrier|OriginCityName|DestCityName
1988-01-01|AA|New York NY|Los Angeles CA
1988-01-02|AA|New York NY|Los Angeles CA
1988-01-03|AA|New York NY|Los Angeles CA
The pandas code above does not specify the delimiter, so the default comma delimiter (,) is assumed and the pipe-separated rows are not split into columns correctly.
DuckDB's read_csv function addresses this by automatically detecting the structure of the CSV file, including the delimiter, header, and column types.
import duckdb
## Use DuckDB to automatically detect and read the CSV structure
result = duckdb.query("SELECT * FROM read_csv('example.csv')").to_df()
print(result)
FlightDate UniqueCarrier OriginCityName DestCityName
0 1988-01-01 AA New York, NY Los Angeles, CA
1 1988-01-02 AA New York, NY Los Angeles, CA
2 1988-01-03 AA New York, NY Los Angeles, CA
In this example:

- read_csv automatically detects the pipe (|) delimiter, identifies the column headers, and infers the correct data types for each column.
- duckdb.query runs the SQL query to read the file and returns the result as a DataFrame via .to_df().

The output shows that DuckDB automatically detected the correct delimiter (|) and parsed the data into the right columns.
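If the automatic detection ever guesses wrong, read_csv also accepts explicit options, so you can pin down the format yourself. A small sketch that spells out the delimiter and header instead of relying on the sniffer:
import duckdb

# Override auto-detection with explicit read_csv options
result = duckdb.query(
    "SELECT * FROM read_csv('example.csv', delim='|', header=true)"
).to_df()
print(result)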
6.17.6. Multiple CSV Files Processing with DuckDB#
!pip install duckdb
Processing multiple CSV files is a common task in data analysis and engineering workflows. Traditionally, handling multiple CSV files requires writing loops or list comprehensions, which can be verbose and error-prone.
To demonstrate, let's assume we have two CSV files in the data directory:
from pathlib import Path
import pandas as pd
# Create example dataframe for first file
df1 = pd.DataFrame(
{
"Date": ["2023-01-01", "2023-01-02", "2023-01-03"],
"Product": ["Laptop", "Phone", "Tablet"],
"Sales": [1200, 800, 600],
}
)
# Create example dataframe for second file
df2 = pd.DataFrame(
{
"Date": ["2023-02-01", "2023-02-02", "2023-02-03"],
"Product": ["Laptop", "Monitor", "Mouse"],
"Sales": [1500, 400, 50],
}
)
Path("data").mkdir()
df1.to_csv("data/sales_jan.csv", index=False)
df2.to_csv("data/sales_feb.csv", index=False)
import glob

import pandas as pd
## Read all CSV files and concatenate them
csv_files = glob.glob("data/*.csv")
dfs = []
for file in csv_files:
    df = pd.read_csv(file)
    dfs.append(df)
## Concatenate all dataframes
combined_df = pd.concat(dfs, ignore_index=True)
## Perform grouping and aggregation
result = (
combined_df.groupby("Date")["Sales"]
.sum()
.reset_index()
.rename(columns={"Sales": "total_sales"})
.sort_values("Date")
)
result
|   | Date | total_sales |
|---|---|---|
| 0 | 2023-01-01 | 1200 |
| 1 | 2023-01-02 | 800 |
| 2 | 2023-01-03 | 600 |
| 3 | 2023-02-01 | 1500 |
| 4 | 2023-02-02 | 400 |
| 5 | 2023-02-03 | 50 |
DuckDB simplifies reading multiple CSV files with a single line of code that efficiently processes the files:
import duckdb
## Read and analyze all CSV files at once
result = duckdb.sql(
"""
SELECT
date,
SUM(sales) as total_sales
FROM 'data/*.csv'
GROUP BY date
ORDER BY date
"""
).df()
result
|   | Date | total_sales |
|---|---|---|
| 0 | 2023-01-01 | 1200.0 |
| 1 | 2023-01-02 | 800.0 |
| 2 | 2023-01-03 | 600.0 |
| 3 | 2023-02-01 | 1500.0 |
| 4 | 2023-02-02 | 400.0 |
| 5 | 2023-02-03 | 50.0 |
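When it matters which file each row came from, read_csv's filename option adds the source path as an extra column that you can group or filter on. A sketch assuming the same data directory:
import duckdb

# Track which CSV file each row originated from
per_file = duckdb.sql(
    """
    SELECT
        filename,
        SUM(sales) AS total_sales
    FROM read_csv('data/*.csv', filename = true)
    GROUP BY filename
    ORDER BY filename
    """
).df()
per_file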
6.17.7. DuckDB: Query Pandas DataFrames Faster with Columnar Storage#
!pip install duckdb
When analyzing data with operations like GROUP BY, SUM, or AVG on specific columns, row-based storage is inefficient: entire rows must be read even when only a few columns are needed, wasting I/O and memory.
Example using SQLite (row-based):
import sqlite3
import pandas as pd
customer = pd.DataFrame(
{"id": [1, 2, 3], "name": ["Alex", "Ben", "Chase"], "age": [25, 30, 35]}
)
## Load data to SQLite and query
conn = sqlite3.connect(":memory:")
customer.to_sql("customer", conn, index=False)
## Must read all columns internally even though we only need 'age'
query = "SELECT age FROM customer"
result = pd.read_sql(query, conn)
DuckDB uses columnar storage, allowing you to efficiently read and process only the columns needed for your analysis. This improves both query speed and memory usage:
import duckdb
import pandas as pd
customer = pd.DataFrame(
{"id": [1, 2, 3], "name": ["Alex", "Ben", "Chase"], "age": [25, 30, 35]}
)
query = "SELECT age FROM customer"
result = duckdb.sql(query).df()
result
|   | age |
|---|---|
| 0 | 25 |
| 1 | 30 |
| 2 | 35 |
In this example, DuckDB only needs to access the "age" column in memory, while SQLite must read all columns ("id", "name", "age") internally even though only "age" is selected. DuckDB also provides a simpler workflow by querying pandas DataFrames directly.
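To check that only the requested column is scanned, you can ask DuckDB for its query plan with EXPLAIN; the projection listed under the DataFrame scan should contain just age (the exact plan text varies by DuckDB version):
import duckdb
import pandas as pd

customer = pd.DataFrame(
    {"id": [1, 2, 3], "name": ["Alex", "Ben", "Chase"], "age": [25, 30, 35]}
)

# Inspect the plan; the scan over the DataFrame projects only the 'age' column
print(duckdb.sql("EXPLAIN SELECT age FROM customer").df())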
6.17.8. Query Nested Parquet Files Easily Using DuckDB#
!pip install duckdb pandas pyarrow
Analyzing large, nested Parquet files often requires pre-processing to flatten the data or writing complex scripts to extract nested fields, which can be time-consuming and error-prone.
import pandas as pd
# Create a nested dataset and save it as a Parquet file
data = {
"id": [1, 2],
"details": [{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}],
}
## Convert to a DataFrame
df = pd.DataFrame(data)
# Save as a nested Parquet file
df.to_parquet("nested_data.parquet")
print(df)
id details
0 1 {'name': 'Alice', 'age': 25}
1 2 {'name': 'Bob', 'age': 30}
## Read the DataFrame from Parquet file
df = pd.read_parquet("nested_data.parquet")
# Create a new DataFrame with the flattened structure
flat_df = pd.DataFrame(
{
"id": df["id"],
"name": [detail["name"] for detail in df["details"]],
"age": [detail["age"] for detail in df["details"]],
}
)
print(flat_df)
id name age
0 1 Alice 25
1 2 Bob 30
DuckDB allows querying nested Parquet files directly using SQL without needing to flatten or preprocess the data.
To load and query nested Parquet files with DuckDB:
import duckdb
## Query the nested Parquet file directly
query_result = duckdb.query(
"""
SELECT
id,
details.name AS name,
details.age AS age
FROM read_parquet('nested_data.parquet')
"""
).to_df()
print(query_result)
id name age
0 1 Alice 25
1 2 Bob 30
In this example:

- read_parquet('nested_data.parquet') reads the nested Parquet file.
- SQL syntax allows you to access nested fields using dot notation (e.g., details.name).
The output is a flattened representation of the nested data, directly queried from the Parquet file without additional preprocessing.
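Dot notation covers struct columns like details; when a column holds a list of structs, DuckDB's UNNEST can expand it into one row per element. A self-contained sketch with illustrative inline data (not part of the dataset above):
import duckdb

# Hypothetical list-of-struct data built inline; UNNEST expands each order into its own row,
# and recursive := true also spreads the struct fields into separate columns
result = duckdb.query(
    """
    SELECT
        id,
        UNNEST(items, recursive := true)
    FROM (VALUES
        (1, [{'item': 'pen', 'qty': 2}, {'item': 'book', 'qty': 1}]),
        (2, [{'item': 'mug', 'qty': 3}])
    ) AS t(id, items)
    """
).to_df()
print(result)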