6.2. Feature Engineer

This section covers libraries that simplify common feature engineering tasks.

6.2.1. Drop Correlated Features

!pip install feature_engine 

To remove correlated variables from a DataFrame, use DropCorrelatedFeatures from feature_engine.selection.

import pandas as pd
from sklearn.datasets import make_classification
from feature_engine.selection import DropCorrelatedFeatures

# make dataframe with some correlated variables
X, y = make_classification(
    n_samples=1000,
    n_features=6,
    n_redundant=3,
    n_clusters_per_class=1,
    class_sep=2,
    random_state=0,
)

# transform arrays into a pandas DataFrame
colnames = ["var_" + str(i) for i in range(6)]
X = pd.DataFrame(X, columns=colnames)
X.columns
Index(['var_0', 'var_1', 'var_2', 'var_3', 'var_4', 'var_5'], dtype='object')
X[["var_0", "var_1", "var_2"]].corr()
          var_0     var_1     var_2
var_0  1.000000  0.938936  0.874845
var_1  0.938936  1.000000  0.654745
var_2  0.874845  0.654745  1.000000

Drop the variables with a correlation above 0.8.

tr = DropCorrelatedFeatures(variables=None, method="pearson", threshold=0.8)

Xt = tr.fit_transform(X)

tr.correlated_feature_sets_
[{'var_0', 'var_1', 'var_2'}]
Xt.columns
Index(['var_0', 'var_3', 'var_4', 'var_5'], dtype='object')

Link to feature-engine.
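
To build intuition for what happens when correlated features are dropped, here is a minimal sketch in plain Python. It is a simplified greedy procedure, not feature_engine's actual implementation: the helper names pearson and drop_correlated and the toy columns are hypothetical and exist only for illustration.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def drop_correlated(columns, threshold=0.8):
    """Greedy sketch: keep a column only if its absolute correlation
    with every column kept so far is at most `threshold`."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy data (not the make_classification output above)
a = list(range(10))
toy = {
    "a": a,
    "b": [2 * x + 1 for x in a],                      # exact linear copy of "a"
    "c": [1 if i % 2 == 0 else -1 for i in range(10)],  # weakly related to "a"
}

kept = drop_correlated(toy, threshold=0.8)
print(kept)
```

Because columns are visited in order, the first member of each correlated group survives. This matches the output above, where var_0 is kept and var_1 and var_2 are removed.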

6.2.2. Similarity Encoding for Dirty Categories Using dirty_cat

!pip install dirty-cat

To capture the similarities among dirty categories when encoding categorical variables, use dirty_cat’s SimilarityEncoder.

To understand how SimilarityEncoder works, let’s start with the employee_salaries dataset.

from dirty_cat.datasets import fetch_employee_salaries
from dirty_cat import SimilarityEncoder

X = fetch_employee_salaries().X
X.head(10)
  | gender | department | department_name | division | assignment_category | employee_position_title | underfilled_job_title | date_first_hired | year_first_hired
0 | F | POL | Department of Police | MSB Information Mgmt and Tech Division Records... | Fulltime-Regular | Office Services Coordinator | NaN | 09/22/1986 | 1986
1 | M | POL | Department of Police | ISB Major Crimes Division Fugitive Section | Fulltime-Regular | Master Police Officer | NaN | 09/12/1988 | 1988
2 | F | HHS | Department of Health and Human Services | Adult Protective and Case Management Services | Fulltime-Regular | Social Worker IV | NaN | 11/19/1989 | 1989
3 | M | COR | Correction and Rehabilitation | PRRS Facility and Security | Fulltime-Regular | Resident Supervisor II | NaN | 05/05/2014 | 2014
4 | M | HCA | Department of Housing and Community Affairs | Affordable Housing Programs | Fulltime-Regular | Planning Specialist III | NaN | 03/05/2007 | 2007
5 | M | POL | Department of Police | PSB 6th District Special Assignment Team | Fulltime-Regular | Police Officer III | NaN | 07/16/2007 | 2007
6 | F | FRS | Fire and Rescue Services | EMS Billing | Fulltime-Regular | Accountant/Auditor II | NaN | 06/27/2016 | 2016
7 | M | HHS | Department of Health and Human Services | Head Start | Fulltime-Regular | Administrative Specialist II | NaN | 11/17/2014 | 2014
8 | M | FRS | Fire and Rescue Services | Recruit Training | Fulltime-Regular | Firefighter/Rescuer III | Firefighter/Rescuer I (Recruit) | 12/12/2016 | 2016
9 | F | POL | Department of Police | FSB Traffic Division Automated Traffic Enforce... | Fulltime-Regular | Police Aide | NaN | 02/05/2007 | 2007
dirty_column = "employee_position_title"
X_dirty = X[dirty_column].values
X_dirty[:7]
array(['Office Services Coordinator', 'Master Police Officer',
       'Social Worker IV', 'Resident Supervisor II',
       'Planning Specialist III', 'Police Officer III',
       'Accountant/Auditor II'], dtype=object)

We can see that titles such as ‘Master Police Officer’ and ‘Police Officer III’ are similar. We can use SimilarityEncoder to encode these categories while capturing their similarities.

enc = SimilarityEncoder(similarity="ngram")
X_enc = enc.fit_transform(X_dirty[:10].reshape(-1, 1))
X_enc
array([[0.05882353, 0.03125   , 0.02739726, 0.19008264, 1.        ,
        0.01351351, 0.05555556, 0.20535714, 0.08088235, 0.032     ],
       [0.008     , 0.02083333, 0.056     , 1.        , 0.19008264,
        0.02325581, 0.23076923, 0.56      , 0.01574803, 0.02777778],
       [0.03738318, 0.07317073, 0.05405405, 0.02777778, 0.032     ,
        0.0733945 , 0.        , 0.0625    , 0.06542056, 1.        ],
       [0.11206897, 0.07142857, 0.09756098, 0.01574803, 0.08088235,
        0.07142857, 0.03125   , 0.08108108, 1.        , 0.06542056],
       [0.04761905, 0.3539823 , 0.06976744, 0.02325581, 0.01351351,
        1.        , 0.02      , 0.09821429, 0.07142857, 0.0733945 ],
       [0.0733945 , 0.05343511, 0.14953271, 0.56      , 0.20535714,
        0.09821429, 0.26086957, 1.        , 0.08108108, 0.0625    ],
       [1.        , 0.05      , 0.06451613, 0.008     , 0.05882353,
        0.04761905, 0.01052632, 0.0733945 , 0.11206897, 0.03738318],
       [0.05      , 1.        , 0.03378378, 0.02083333, 0.03125   ,
        0.3539823 , 0.02631579, 0.05343511, 0.07142857, 0.07317073],
       [0.06451613, 0.03378378, 1.        , 0.056     , 0.02739726,
        0.06976744, 0.        , 0.14953271, 0.09756098, 0.05405405],
       [0.01052632, 0.02631579, 0.        , 0.23076923, 0.05555556,
        0.02      , 1.        , 0.26086957, 0.03125   , 0.        ]])
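
To get a feel for what an n-gram similarity measures, here is a minimal sketch in plain Python. It computes a Jaccard similarity over sets of character 3-grams, which is a simplification and not dirty_cat's exact formula, so the numbers will not match the matrix above. The helper names char_ngrams and ngram_similarity are hypothetical.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams in `text`."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings.

    Simplified illustration only; the library's own n-gram
    similarity is assumed to differ in its details.
    """
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        # Both strings are shorter than n: fall back to exact match
        return 1.0 if a == b else 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


titles = ["Master Police Officer", "Police Officer III", "Social Worker IV"]
for other in titles:
    score = ngram_similarity(titles[0], other)
    print(f"{titles[0]!r} vs {other!r}: {score:.2f}")
```

Titles that share long substrings, such as ‘Police Officer’, have many 3-grams in common and therefore score higher than unrelated titles.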

Cool! Let’s create a heatmap to visualize the pairwise similarity between the encoded titles.

import seaborn as sns
import numpy as np
from sklearn.preprocessing import normalize
from IPython.core.pylabtools import figsize

def plot_similarity(labels, features):
  
    normalized_features = normalize(features)
    
    # Cosine similarity matrix: inner product of the L2-normalized rows
    corr = np.inner(normalized_features, normalized_features)
    
    # Plot
    figsize(10, 10)
    sns.set(font_scale=1.2)
    g = sns.heatmap(corr, xticklabels=labels, yticklabels=labels, vmin=0,
        vmax=1, cmap="YlOrRd", annot=True, annot_kws={"size": 10})
        
    g.set_xticklabels(labels, rotation=90)
    g.set_title("Similarity")


def encode_and_plot(labels):
  
    enc = SimilarityEncoder(similarity="ngram") # Encode
    X_enc = enc.fit_transform(labels.reshape(-1, 1))
    
    plot_similarity(labels, X_enc) # Plot


encode_and_plot(X_dirty[:10])
[Figure: heatmap titled "Similarity" showing the pairwise similarity between the first 10 employee position titles]

As we can see from the matrix above,

  • The similarity between the same strings such as ‘Office Services Coordinator’ and ‘Office Services Coordinator’ is 1

  • The similarity between somewhat similar strings such as ‘Office Services Coordinator’ and ‘Master Police Officer’ is 0.41

  • The similarity between two very different strings such as ‘Social Worker IV’ and ‘Police Aide’ is 0.028
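
The heatmap values come from normalizing each encoded row to unit length and taking inner products, which yields cosine similarities. The sketch below reproduces that step in plain Python on a small made-up matrix. The helper names l2_normalize and cosine_similarity_matrix and the encoded values are hypothetical.

```python
import math


def l2_normalize(row):
    """Scale a vector to unit Euclidean length (zero vectors are left as-is)."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm else list(row)


def cosine_similarity_matrix(rows):
    """Inner products between all pairs of L2-normalized rows."""
    unit = [l2_normalize(r) for r in rows]
    return [[sum(x * y for x, y in zip(u, v)) for v in unit] for u in unit]


# Hypothetical encoded rows, not real SimilarityEncoder output
encoded = [
    [1.0, 0.0],  # same direction as the next row
    [2.0, 0.0],
    [0.0, 3.0],  # orthogonal to the first two rows
]

sim = cosine_similarity_matrix(encoded)
for row in sim:
    print([round(v, 3) for v in row])
```

Rows pointing in the same direction score 1 regardless of their length, and orthogonal rows score 0, which is why the diagonal of the heatmap is always 1.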

Link to dirty-cat.

Link to my full article about dirty-cat.