4.2. Validate Your pandas DataFrame with Pandera#
4.2.1. Motivation#
In the previous section, I showed how to use Great Expectations to validate your data.
Even though Great Expectations provides a lot of useful utilities, creating a validation suite with it can be complicated. For a small data science project, Great Expectations can be overkill.
That is why in this section, we will learn about Pandera, a simple Python library for validating a pandas DataFrame.
To install Pandera, type:
pip install pandera
4.2.2. Introduction#
To learn how Pandera works, let’s start with creating a simple dataset:
import pandas as pd
fruits = pd.DataFrame(
    {
        "name": ["apple", "banana", "apple", "orange"],
        "store": ["Aldi", "Walmart", "Walmart", "Aldi"],
        "price": [2, 1, 3, 4],
    }
)
fruits
Imagine this scenario. Your manager told you that there can only be certain fruits and stores in the dataset, and the price must be less than 4.
available_fruits = ["apple", "banana", "orange"]
nearby_stores = ["Aldi", "Walmart"]
Checking these conditions manually costs too much time, especially when the dataset is large. Is there a way to automate this process?
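To see why the manual route gets tedious, a hand-rolled version of these checks might look like the following sketch (the `validate_manually` helper is my own, not part of any library):

```python
import pandas as pd

available_fruits = ["apple", "banana", "orange"]
nearby_stores = ["Aldi", "Walmart"]

fruits = pd.DataFrame(
    {
        "name": ["apple", "banana", "apple", "orange"],
        "store": ["Aldi", "Walmart", "Walmart", "Aldi"],
        "price": [2, 1, 3, 4],
    }
)


def validate_manually(df: pd.DataFrame) -> list:
    """Collect a message for every rule the data violates."""
    errors = []
    if not df["name"].isin(available_fruits).all():
        errors.append("unknown fruit name")
    if not df["store"].isin(nearby_stores).all():
        errors.append("unknown store")
    if not (df["price"] < 4).all():
        errors.append("price must be less than 4")
    return errors


errors = validate_manually(fruits)  # the value 4 violates the price rule
```

Every new rule means another handwritten `if` branch, and nothing here reports which rows failed.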
That is when Pandera comes in handy. Specifically, we:

- Create multiple tests for the entire dataset using `DataFrameSchema`
- Create multiple tests for each column using `Column`
- Specify the type of test using `Check`
import pandera as pa
from pandera import Column, Check
schema = pa.DataFrameSchema(
    {
        "name": Column(str, Check.isin(available_fruits)),
        "store": Column(str, Check.isin(nearby_stores)),
        "price": Column(int, Check.less_than(4)),
    }
)
schema.validate(fruits)
SchemaError: <Schema Column(name=price, type=DataType(int64))> failed element-wise validator 0:
<Check less_than: less_than(4)>
failure cases:
   index  failure_case
0      3             4
In the code above:

- `"name": Column(str, Check.isin(available_fruits))` checks that the column `name` is of type string and that all of its values are inside the specified list.
- `"price": Column(int, Check.less_than(4))` checks that all values in the column `price` are of type `int` and are less than 4.

Since not all values in the column `price` are less than 4, the test fails.

Find other built-in `Check` methods here.
4.2.3. Custom Checks#
We can also create custom checks using `lambda`. In the code below, `Check(lambda price: sum(price) < 20)` checks if the sum of the column `price` is less than 20.
schema = pa.DataFrameSchema(
    {
        "name": Column(str, Check.isin(available_fruits)),
        "store": Column(str, Check.isin(nearby_stores)),
        "price": Column(
            int, [Check.less_than(5), Check(lambda price: sum(price) < 20)]
        ),
    }
)
schema.validate(fruits)
4.2.4. Schema Model#
When our tests are complicated, a class-based model can look much cleaner than a dictionary. Luckily, Pandera also allows us to define tests with a dataclass-like schema model instead of a dictionary.
from pandera.typing import Series
class Schema(pa.SchemaModel):
    name: Series[str] = pa.Field(isin=available_fruits)
    store: Series[str] = pa.Field(isin=nearby_stores)
    price: Series[int] = pa.Field(le=5)

    @pa.check("price")
    def price_sum_lt_20(cls, price: Series[int]) -> bool:
        return sum(price) < 20
Schema.validate(fruits)
4.2.5. Validation Decorator#
4.2.5.1. Check Input#
Now that we know how to create tests for our data, how do we use them to test the input of a function? A straightforward approach is to call `schema.validate(input)` inside the function.
fruits = pd.DataFrame(
    {
        "name": ["apple", "banana", "apple", "orange"],
        "store": ["Aldi", "Walmart", "Walmart", "Aldi"],
        "price": [2, 1, 3, 4],
    }
)

schema = pa.DataFrameSchema(
    {
        "name": Column(str, Check.isin(available_fruits)),
        "store": Column(str, Check.isin(nearby_stores)),
        "price": Column(int, Check.less_than(5)),
    }
)
def get_total_price(fruits: pd.DataFrame, schema: pa.DataFrameSchema):
    validated = schema.validate(fruits)
    return validated["price"].sum()
get_total_price(fruits, schema)
However, this approach makes it difficult to test our function. Since `get_total_price` takes both `fruits` and `schema` as arguments, we need to include both of them inside the test:
def test_get_total_price():
    fruits = pd.DataFrame(
        {"name": ["apple", "banana"], "store": ["Aldi", "Walmart"], "price": [1, 2]}
    )
    # Need to include schema in the unit test
    schema = pa.DataFrameSchema(
        {
            "name": Column(str, Check.isin(available_fruits)),
            "store": Column(str, Check.isin(nearby_stores)),
            "price": Column(int, Check.less_than(5)),
        }
    )
    assert get_total_price(fruits, schema) == 3
`test_get_total_price` tests both the data and the function. Because a unit test should test only one thing, including data validation inside a function is not ideal.

Pandera provides a solution for this with the `check_input` decorator. The argument of this decorator is used to validate the input of the function.
from pandera import check_input
@check_input(schema)
def get_total_price(fruits: pd.DataFrame):
    return fruits.price.sum()
get_total_price(fruits)
If the input is not valid, Pandera will raise an error before the input is processed by your function:
fruits = pd.DataFrame(
    {
        "name": ["apple", "banana", "apple", "orange"],
        "store": ["Aldi", "Walmart", "Walmart", "Aldi"],
        "price": ["2", "1", "3", "4"],
    }
)
@check_input(schema)
def get_total_price(fruits: pd.DataFrame):
    return fruits.price.sum()
get_total_price(fruits)
SchemaError: error in check_input decorator of function 'get_total_price': expected series 'price' to have type int64, got object
Validating data before processing is valuable: it prevents us from wasting a significant amount of time on computations that run over invalid data.
4.2.5.2. Check Output#
We can also use Pandera's `check_output` decorator to check the output of a function:
from pandera import check_output
fruits_nearby = pd.DataFrame(
    {
        "name": ["apple", "banana", "apple", "orange"],
        "store": ["Aldi", "Walmart", "Walmart", "Aldi"],
        "price": [2, 1, 3, 4],
    }
)

fruits_faraway = pd.DataFrame(
    {
        "name": ["apple", "banana", "apple", "orange"],
        "store": ["Whole Foods", "Whole Foods", "Schnucks", "Schnucks"],
        "price": [3, 2, 4, 5],
    }
)

out_schema = pa.DataFrameSchema(
    {"store": Column(str, Check.isin(["Aldi", "Walmart", "Whole Foods", "Schnucks"]))}
)
@check_output(out_schema)
def combine_fruits(fruits_nearby: pd.DataFrame, fruits_faraway: pd.DataFrame):
    fruits = pd.concat([fruits_nearby, fruits_faraway])
    return fruits
combine_fruits(fruits_nearby, fruits_faraway)
4.2.5.3. Check Both Inputs and Outputs#
Now you might wonder: is there a way to check both inputs and outputs? We can do that using the `check_io` decorator:
from pandera import check_io
in_schema = pa.DataFrameSchema({"store": Column(str)})

out_schema = pa.DataFrameSchema(
    {"store": Column(str, Check.isin(["Aldi", "Walmart", "Whole Foods", "Schnucks"]))}
)
@check_io(fruits_nearby=in_schema, fruits_faraway=in_schema, out=out_schema)
def combine_fruits(fruits_nearby: pd.DataFrame, fruits_faraway: pd.DataFrame):
    fruits = pd.concat([fruits_nearby, fruits_faraway])
    return fruits
combine_fruits(fruits_nearby, fruits_faraway)
4.2.6. Other Arguments for Column Validation#
4.2.6.1. Deal with Null Values#
By default, Pandera will raise an error if there are null values in a column we are testing. If null values are acceptable, add `nullable=True` to our `Column` class:
import numpy as np
fruits = pd.DataFrame(
    {
        "name": ["apple", "banana", "apple", "orange"],
        "store": ["Aldi", "Walmart", "Walmart", np.nan],
        "price": [2, 1, 3, 4],
    }
)
schema = pa.DataFrameSchema(
    {
        "name": Column(str, Check.isin(available_fruits)),
        "store": Column(str, Check.isin(nearby_stores), nullable=True),
        "price": Column(int, Check.less_than(5)),
    }
)
schema.validate(fruits)
4.2.6.2. Deal with Duplicates#
By default, duplicates are acceptable. To raise an error when there are duplicates, use `allow_duplicates=False`:
schema = pa.DataFrameSchema(
    {
        "name": Column(str, Check.isin(available_fruits)),
        "store": Column(
            str, Check.isin(nearby_stores), nullable=True, allow_duplicates=False
        ),
        "price": Column(int, Check.less_than(5)),
    }
)
schema.validate(fruits)
SchemaError: series 'store' contains duplicate values: {2: 'Walmart'}
4.2.6.3. Convert Data Types#
`coerce=True` converts a column to the data type declared in the schema instead of raising an error when the types differ.

In the code below, the data type of `price` is changed from integer to string.
fruits = pd.DataFrame(
    {
        "name": ["apple", "banana", "apple", "orange"],
        "store": ["Aldi", "Walmart", "Walmart", "Aldi"],
        "price": [2, 1, 3, 4],
    }
)

schema = pa.DataFrameSchema({"price": Column(str, coerce=True)})
validated = schema.validate(fruits)
validated.dtypes
validated.dtypes
name object
store object
price object
dtype: object
4.2.6.4. Match Patterns#
What if we want to validate all columns that start with the word `store`?
favorite_stores = ["Aldi", "Walmart", "Whole Foods", "Schnucks"]
fruits = pd.DataFrame(
    {
        "name": ["apple", "banana", "apple", "orange"],
        "store_nearby": ["Aldi", "Walmart", "Walmart", "Aldi"],
        "store_far": ["Whole Foods", "Schnucks", "Whole Foods", "Schnucks"],
    }
)
Pandera allows us to apply the same checks to multiple columns whose names share a pattern by adding `regex=True`:
schema = pa.DataFrameSchema(
    {
        "name": Column(str, Check.isin(available_fruits)),
        "store_+": Column(str, Check.isin(favorite_stores), regex=True),
    }
)
schema.validate(fruits)
4.2.7. Export and Load From a YAML File#
4.2.8. Export to YAML#
Using a YAML file is a neat way to show your tests to colleagues who don't know Python. We can keep a record of all validations in a YAML file using `schema.to_yaml()`:
from pathlib import Path
# Get the schema as a YAML string
yaml_schema = schema.to_yaml()

# Save it to a file
f = Path("schema.yml")
f.write_text(yaml_schema)
`schema.yml` should look like this:
schema_type: dataframe
version: 0.7.0
columns:
  name:
    dtype: str
    nullable: false
    checks:
      isin:
      - apple
      - banana
      - orange
    allow_duplicates: true
    coerce: false
    required: true
    regex: false
  store:
    dtype: str
    nullable: true
    checks:
      isin:
      - Aldi
      - Walmart
    allow_duplicates: false
    coerce: false
    required: true
    regex: false
  price:
    dtype: int64
    nullable: false
    checks:
      less_than: 5
    allow_duplicates: true
    coerce: false
    required: true
    regex: false
checks: null
index: null
coerce: false
strict: false
4.2.9. Load from YAML#
To load from a YAML file, simply use `pa.io.from_yaml(yaml_schema)`:
with f.open() as file:
    yaml_schema = file.read()

schema = pa.io.from_yaml(yaml_schema)