6.1. Configure your Data Science Projects with Hydra

6.1.1. Introduction

Hydra is a simple tool to manage complex configurations in Python. To install Hydra, type:

pip install hydra-core

The video below shows some simple features of Hydra.

Imagine your YAML configuration file looks like this:

process:
  keep_columns:
      - Income
      - Recency
      - NumWebVisitsMonth
      - Complain
      - age
      - total_purchases
      - enrollment_years
      - family_size

  remove_outliers_threshold:
    age: 90
    Income: 600000

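To access the list under process.keep_columns in the configuration file, simply add the @hydra.main decorator to the function that uses the configuration. config_path is the directory that holds the configuration files, relative to the script, and config_name is the name of the main file without the .yaml extension: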

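import hydra
from omegaconf import DictConfig


# config_path is relative to this file; config_name omits the .yaml extension.
@hydra.main(config_path="../config", config_name="main")
def process_data(config: DictConfig):
    # Hydra loads config/main.yaml and passes it in as a DictConfig.
    print(config.process.keep_columns)


if __name__ == "__main__":
    process_data()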

Output:

['Income', 'Recency', 'NumWebVisitsMonth', 'Complain', 'age', 'total_purchases', 'enrollment_years', 'family_size']

6.1.2. Group Configuration Files

Imagine the structure of your config directory looks like this:

config
├── main.yaml
└── process
    ├── process_1.yaml
    ├── process_2.yaml
    ├── process_3.yaml
    └── process_4.yaml

Each file has different values for the same parameters. You can set the parameters in the file process_2.yaml as default by adding the following to main.yaml:

defaults:
  - process: process_2
  - _self_

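Now the parameters in main.yaml are merged with the parameters in process_2.yaml, which are placed under the process key. Because _self_ comes last in the defaults list, a value set in main.yaml takes precedence if both files define the same key.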

Running the file print_config.py:

python print_config.py

should print:

# From process_2.yaml
process:
  keep_columns:
  - Income
  - Recency
  - NumWebVisitsMonth
  - Complain
  - age
  - total_purchases
  - enrollment_years
  - family_size
  remove_outliers_threshold:
    age: 90
    Income: 600000
  family_size:
    Married: 2
    Together: 2
    Absurd: 1
    Widow: 1
    YOLO: 1
    Divorced: 1
    Single: 1
    Alone: 1

# From main.yaml
raw_data:
  path: data/raw/marketing_campaign.csv
intermediate:
  dir: data/intermediate
  name: scale_features.csv
  path: ${intermediate.dir}/${intermediate.name}
flow: all
image:
  kmeans: image/elbow.png
  clusters: image/cluster.png

6.1.3. Override Default Parameters

Video: https://www.youtube.com/embed/t9hwWxBnU0o?start=167

You can also override the default parameters on the command line. For example, to replace process_2 with process_1, run the following:

python print_config.py process=process_1

The output should be the combination of all parameters in main.yaml and in process_1.yaml:

# From process_1.yaml
process:
  keep_columns:
  - Income
  - Recency
  - NumWebVisitsMonth
  - AcceptedCmp3
  - AcceptedCmp4
  - AcceptedCmp5
  - AcceptedCmp1
  - AcceptedCmp2
  - Complain
  - Response
  - age
  - total_purchases
  - enrollment_years
  - family_size
  remove_outliers_threshold:
    age: 90
    Income: 600000
  family_size:
    Married: 2
    Together: 2
    Absurd: 1
    Widow: 1
    YOLO: 1
    Divorced: 1
    Single: 1
    Alone: 1
    
# From main.yaml
raw_data:
  path: data/raw/marketing_campaign.csv
intermediate:
  dir: data/intermediate
  name: scale_features.csv
  path: ${intermediate.dir}/${intermediate.name}
flow: all
image:
  kmeans: image/elbow.png
  clusters: image/cluster.png
