6.14. SQL Libraries

6.14.1. Create Dynamic SQL Statements with Python string Template

If you want to create dynamic SQL statements with Python variables, use the Template class from Python's string module.

Template supports $-based substitutions: each $name placeholder in the text is replaced with the value supplied for name.

%%writefile query.sql
SELECT
    *
FROM
    my_table
WHERE
    start_date > '$start_date'
LIMIT
    $limit;
Writing query.sql
import pathlib
from string import Template

# Read the query from the file
query = pathlib.Path("query.sql").read_text()

# Substitute the placeholders with the values
t = Template(query)
substitutions = {"limit": 10, "start_date": "2021-01-01"}
print(t.substitute(substitutions))
SELECT
    *
FROM
    my_table
WHERE
    start_date > '2021-01-01'
LIMIT
    10;
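
Template performs plain text substitution and does not quote or escape values, so it is best kept for trusted inputs. The sketch below shows one way to split the work, assuming an in-memory SQLite database and a hypothetical my_table with id and start_date columns (neither is part of the example above): Template fills in the table name, while the values are passed to the driver as bound parameters.

```python
import sqlite3
from string import Template

# Hypothetical in-memory table, used only for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER, start_date TEXT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [(1, "2020-06-01"), (2, "2021-03-15"), (3, "2022-01-10")],
)

# Template fills in the trusted table name; :start_date and :limit remain
# named placeholders that sqlite3 binds at execution time
query = Template(
    "SELECT id FROM $table WHERE start_date > :start_date ORDER BY id LIMIT :limit"
).substitute(table="my_table")

rows = conn.execute(query, {"start_date": "2021-01-01", "limit": 10}).fetchall()
print(rows)
conn.close()
```

With this split, the database driver handles quoting of the values, so the query text never contains user-supplied data.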

6.14.2. Read Data From a SQL Table

Loading SQL tables into DataFrames allows you to analyze and preprocess the data using the rich functionality of pandas.

To read a SQL table into a pandas DataFrame, pass a SQL query and a SQLAlchemy Engine (or a connection obtained from it) to the pandas.read_sql function.

import pandas as pd
import sqlalchemy

# Create a SQLAlchemy engine
engine = sqlalchemy.create_engine(
    "postgresql://username:password@host:port/database_name"
)


# Read a SQL table into a DataFrame
df = pd.read_sql("SELECT * FROM table_name", engine)
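
To try the same call without a running PostgreSQL server, here is a small sketch that assumes an in-memory SQLite database and a hypothetical users table. pandas.read_sql also accepts a sqlite3 connection, and its params argument binds values instead of formatting them into the query string.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory table, used only for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("John", 25), ("Jane", 30), ("Mike", 35)],
)
conn.commit()

# The ? placeholder is filled from params by the sqlite3 driver
df = pd.read_sql(
    "SELECT username, age FROM users WHERE age > ? ORDER BY age",
    conn,
    params=(26,),
)
print(df)
conn.close()
```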

6.14.3. FugueSQL: Use SQL to Work with Pandas, Spark, and Dask DataFrames

!pip install fugue 

Do you like to use both Python and SQL to manipulate data? FugueSQL is an interface that allows users to use SQL to work with Pandas, Spark, and Dask DataFrames.

import pandas as pd
from fugue_sql import fsql

input_df = pd.DataFrame({"price": [2, 1, 3], "fruit": ["apple", "banana", "orange"]})

query = """
SELECT price, fruit FROM input_df
WHERE price > 1
PRINT
"""

fsql(query).run()
PandasDataFrame
price:long|fruit:str
----------+---------
2         |apple    
3         |orange   
Total count: 2
DataFrames()

Link to fugue.

6.14.4. SQLModel: Simplify SQL Database Interactions in Python

!pip install sqlmodel

Python code that interacts with a SQL database through raw SQL strings is often verbose and hard to read.

import sqlite3

# Connect to the database
conn = sqlite3.connect('users.db')

# Create a cursor object
cursor = conn.cursor()

# Define the SQL statement for creating the table
create_table_sql = '''
    CREATE TABLE IF NOT EXISTS membership (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        username TEXT,
        age INTEGER,
        active INTEGER
    )
'''

# Execute the SQL statement to create the table
cursor.execute(create_table_sql)

# Define the SQL statement for inserting rows
insert_rows_sql = '''
    INSERT INTO membership (username, age, active)
    VALUES (?, ?, ?)
'''

# Define the rows to be inserted
rows = [
    ('John', 25, 1),
    ('Jane', 30, 0),
    ('Mike', 35, 1)
]

# Execute the SQL statement for each row
for row in rows:
    cursor.execute(insert_rows_sql, row)

# Commit the changes to the database
conn.commit()

# Close the cursor and the database connection
cursor.close()
conn.close()

With SQLModel, you define tables as Pydantic-like classes with Python type annotations, which makes the code more intuitive to write and easier to understand.

from typing import Optional

from sqlmodel import Field, Session, SQLModel, create_engine


class Membership(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    username: str 
    age: int 
    active: int
    
# Pass values that already match the annotated types: SQLModel table models
# (table=True) do not validate or coerce field values when instantiated
user1 = Membership(username="John", age=25, active=1)
user2 = Membership(username="Jane", age=30, active=0)
user3 = Membership(username="Mike", age=35, active=1)


engine = create_engine("sqlite:///users.db")


SQLModel.metadata.create_all(engine)

with Session(engine) as session:
    session.add(user1)
    session.add(user2)
    session.add(user3)
    session.commit()

Link to SQLModel.

6.14.5. SQLFluff: A Linter and Auto-Formatter for Your SQL Code

!pip install sqlfluff

Linting helps ensure that code follows consistent style conventions, making it easier to understand and maintain. With SQLFluff, you can automatically lint your SQL code and correct most linting errors, freeing you up to focus on more important tasks.

SQLFluff supports many SQL dialects, including ANSI, MySQL, PostgreSQL, BigQuery, Databricks, Oracle, and Teradata.

In the code below, we use SQLFluff to lint and fix the SQL code in the file sqlfluff_example.sql.

%%writefile sqlfluff_example.sql
SELECT a+b  AS foo,
c AS bar from my_table
$ sqlfluff lint sqlfluff_example.sql --dialect postgres
!sqlfluff lint sqlfluff_example.sql --dialect postgres
== [sqlfluff_example.sql] FAIL                            
L:   1 | P:   1 | LT09 | Select targets should be on a new line unless there is
                       | only one select target.
                       | [layout.select_targets]
L:   1 | P:   1 | ST06 | Select wildcards then simple targets before calculations
                       | and aggregates. [structure.column_order]
L:   1 | P:   7 | LT02 | Expected line break and indent of 4 spaces before 'a'.
                       | [layout.indent]
L:   1 | P:   9 | LT01 | Expected single whitespace between naked identifier and
                       | binary operator '+'. [layout.spacing]
L:   1 | P:  10 | LT01 | Expected single whitespace between binary operator '+'
                       | and naked identifier. [layout.spacing]
L:   1 | P:  11 | LT01 | Expected only single space before 'AS' keyword. Found ' 
                       | '. [layout.spacing]
L:   2 | P:   1 | LT02 | Expected indent of 4 spaces.
                       | [layout.indent]
L:   2 | P:   9 | LT02 | Expected line break and no indent before 'from'.
                       | [layout.indent]
L:   2 | P:  10 | CP01 | Keywords must be consistently upper case.
                       | [capitalisation.keywords]
All Finished 📜 🎉!

$ sqlfluff fix sqlfluff_example.sql --dialect postgres
%cat sqlfluff_example.sql
SELECT
    c AS bar,
    a + b AS foo
FROM my_table

Link to SQLFluff.

6.14.6. PostgresML: Integrate Machine Learning with PostgreSQL

If you want to seamlessly integrate machine learning models into your PostgreSQL database, use PostgresML.

Sentiment Analysis:

SELECT pgml.transform(
    task   => 'text-classification',
    inputs => ARRAY[
        'I love how amazingly simple ML has become!', 
        'I hate doing mundane and thankless tasks. ☹️'
    ]
) AS positivity;

Output:

                    positivity
------------------------------------------------------
[
    {"label": "POSITIVE", "score": 0.9995759129524232}, 
    {"label": "NEGATIVE", "score": 0.9903519749641418}
]

Training a classification model

Training:

SELECT * FROM pgml.train(
    'My Classification Project',
    task => 'classification',
    relation_name => 'pgml.digits',
    y_column_name => 'target',
    algorithm => 'xgboost',
    hyperparams => '{
        "n_estimators": 25
    }'
);

Inference:

SELECT 
    target,
    pgml.predict('My Classification Project', image) AS prediction
FROM pgml.digits
LIMIT 5;

Link to PostgresML.

6.14.7. Efficient SQL Operations with DuckDB on Pandas DataFrames

!pip install --quiet duckdb
!wget -q https://github.com/cwida/duckdb-data/releases/download/v1.0/lineitemsf1.snappy.parquet

Using SQL with pandas lets you combine SQL's querying capabilities with the data manipulation functionality of pandas.

However, traditional database systems often require you to manage a separate DBMS server, which adds complexity to the workflow.

With DuckDB, you can efficiently run SQL operations on pandas DataFrames without the need to manage a separate DBMS server.

import pandas as pd
import duckdb

mydf = pd.DataFrame({'a' : [1, 2, 3]})
print(duckdb.query("SELECT SUM(a) FROM mydf").to_df())

In the code below, aggregating data using DuckDB is roughly 6 times faster than aggregating with pandas (37 ms versus 226 ms in this run).

import pandas as pd
import duckdb

df = pd.read_parquet("lineitemsf1.snappy.parquet")
%%timeit
df.groupby('l_returnflag').agg(
  Sum=('l_extendedprice', 'sum'),
  Min=('l_extendedprice', 'min'),
  Max=('l_extendedprice', 'max'),
  Avg=('l_extendedprice', 'mean')
)
226 ms ± 4.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
duckdb.query("""
SELECT
      l_returnflag,
      SUM(l_extendedprice),
      MIN(l_extendedprice),
      MAX(l_extendedprice),
      AVG(l_extendedprice)
FROM df
GROUP BY
        l_returnflag
""").to_df()
37 ms ± 2.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Link to DuckDB.

6.14.8. Efficiently Handle Large Datasets with DuckDB and PyArrow

!pip install deltalake duckdb 
!wget -q https://github.com/cwida/duckdb-data/releases/download/v1.0/lineitemsf1.snappy.parquet

DuckDB leverages various optimizations for query execution, while PyArrow efficiently handles in-memory data processing and storage. Combining DuckDB and PyArrow allows you to efficiently process datasets larger than memory on a single machine.

In the code below, we convert a Delta Lake table with over 6 million rows to a pandas DataFrame and a PyArrow dataset, which are then used by DuckDB.

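In this run, DuckDB on the PyArrow dataset is roughly 2,900 times faster than DuckDB on pandas (954 µs versus 2.77 s). Much of the gap comes from table.to_pandas() loading the whole table into memory on every run, while the PyArrow dataset is only referenced lazily.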

import pandas as pd
import duckdb
from deltalake.writer import write_deltalake

df = pd.read_parquet("lineitemsf1.snappy.parquet")
write_deltalake("delta_lake", df)
from deltalake import DeltaTable

table = DeltaTable("delta_lake")
%%timeit
quack = duckdb.df(table.to_pandas())
quack.filter("l_quantity > 50")
2.77 s ± 108 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
quack = duckdb.arrow(table.to_pyarrow_dataset())
quack.filter("l_quantity > 50")
954 µs ± 32.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Link to DuckDB.

6.14.9. sqlparse: Extract Components From a SQL Statement in Python

!pip install sqlparse

If you want to extract specific components of a SQL statement for downstream Python tasks, use sqlparse.

In the code below, we use sqlparse to extract tables and columns from the SQL statement.

import sqlparse

sql_query = """
SELECT e.employee_name, d.department_name
FROM employees e
JOIN departments d ON e.department_id = d.id
"""
parsed = sqlparse.parse(sql_query)[0]
parsed.tokens
[<Newline ' ' at 0x10A13E4C0>,
 <DML 'SELECT' at 0x10A1AE040>,
 <Whitespace ' ' at 0x10A1AE100>,
 <IdentifierList 'e.empl...' at 0x10A198E40>,
 <Newline ' ' at 0x10A1AE580>,
 <Keyword 'FROM' at 0x10A1AE5E0>,
 <Whitespace ' ' at 0x10A1AE640>,
 <Identifier 'employ...' at 0x10A198C10>,
 <Newline ' ' at 0x10A1AE7C0>,
 <Keyword 'JOIN' at 0x10A1AE820>,
 <Whitespace ' ' at 0x10A1AE880>,
 <Identifier 'depart...' at 0x10A198CF0>,
 <Whitespace ' ' at 0x10A1AEA00>,
 <Keyword 'ON' at 0x10A1AEA60>,
 <Whitespace ' ' at 0x10A1AEAC0>,
 <Comparison 'e.depa...' at 0x10A198DD0>,
 <Newline ' ' at 0x10A1AEE80>]
# Initialize both lists so the print statements work even when nothing matches
tables = []
columns = []
for token in parsed.tokens:
    if isinstance(token, sqlparse.sql.IdentifierList):
        columns = [identifier.get_real_name() for identifier in token.get_identifiers()]
    elif isinstance(token, sqlparse.sql.Identifier):
        table = token.get_real_name()
        tables.append(table)

print(f'Tables: {tables}')
print(f'Columns: {columns}')
Tables: ['employees', 'departments']
Columns: ['employee_name', 'department_name']

Link to sqlparse.