6.2. Feature Engineer

This section covers some libraries for feature engineering.

6.2.1. Split Data in a Stratified Fashion in scikit-learn

Normally, after using scikit-learn’s train_test_split, the proportion of values in the sample will be different from the proportion of values in the entire dataset.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np

X, y = load_iris(return_X_y=True)
np.bincount(y)
array([50, 50, 50])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Get count of each class in the train set

np.bincount(y_train)
array([37, 34, 41])
# Get count of each class in the test set

np.bincount(y_test)
array([13, 16,  9])

If you want to keep the proportion of classes in the sample the same as the proportion of classes in the entire dataset, add stratify=y.

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)
np.bincount(y_train)
array([37, 37, 38])
np.bincount(y_test)
array([13, 13, 12])

6.2.2. Drop Correlated Features

!pip install feature_engine 

If you want to remove the correlated variables from a dataframe, use feature_engine.DropCorrelatedFeatures.

import pandas as pd
from sklearn.datasets import make_classification
from feature_engine.selection import DropCorrelatedFeatures

# make dataframe with some correlated variables
X, y = make_classification(
        n_samples=1000,
        n_features=6,
        n_redundant=3,
        n_clusters_per_class=1,
        class_sep=2,
        random_state=0,
    )

# trabsform arrays into pandas df and series
colnames = ["var_" + str(i) for i in range(6)]
X = pd.DataFrame(X, columns=colnames)
X.columns
Index(['var_0', 'var_1', 'var_2', 'var_3', 'var_4', 'var_5'], dtype='object')
X[["var_0", "var_1", "var_2"]].corr()
var_0 var_1 var_2
var_0 1.000000 0.938936 0.874845
var_1 0.938936 1.000000 0.654745
var_2 0.874845 0.654745 1.000000

Drop the variables with a correlation above 0.8.

tr = DropCorrelatedFeatures(variables=None, method="pearson", threshold=0.8)

Xt = tr.fit_transform(X)

tr.correlated_feature_sets_
[{'var_0', 'var_1', 'var_2'}]
Xt.columns
Index(['var_0', 'var_3', 'var_4', 'var_5'], dtype='object')

Link to feature-engine.

6.2.3. Encode Rare Labels with Feature-engine

When dealing with features with high cardinality, you might want to mark the rare categories as “Other”. Feature-engine’s RareLabelEncoder makes it easy for you to do so.

from sklearn.datasets import fetch_openml
from feature_engine.encoding import RareLabelEncoder

data = fetch_openml('dating_profile')['data']
data.head(10)
body_type diet drinks drugs education essay0 essay1 essay2 essay3 essay4 ... location offspring orientation pets religion sex sign smokes speaks status
0 a little extra strictly anything socially never working on college/university about me:<br />\n<br />\ni would love to think... currently working as an international agent fo... making people laugh.<br />\nranting about a go... the way i look. i am a six foot half asian, ha... books:<br />\nabsurdistan, the republic, of mi... ... south san francisco, california doesn&rsquo;t have kids, but might want them straight likes dogs and likes cats agnosticism and very serious about it m gemini sometimes english single
1 average mostly other often sometimes working on space camp i am a chef: this is what that means.<br />\n1... dedicating everyday to being an unbelievable b... being silly. having ridiculous amonts of fun w... None i am die hard christopher moore fan. i don't r... ... oakland, california doesn&rsquo;t have kids, but might want them straight likes dogs and likes cats agnosticism but not too serious about it m cancer no english (fluently), spanish (poorly), french (... single
2 thin anything socially None graduated from masters program i'm not ashamed of much, but writing public te... i make nerdy software for musicians, artists, ... improvising in different contexts. alternating... my large jaw and large glasses are the physica... okay this is where the cultural matrix gets so... ... san francisco, california None straight has cats None m pisces but it doesn&rsquo;t matter no english, french, c++ available
3 thin vegetarian socially None working on college/university i work in a library and go to school. . . reading things written by old dead people playing synthesizers and organizing books acco... socially awkward but i do my best bataille, celine, beckett. . .<br />\nlynch, j... ... berkeley, california doesn&rsquo;t want kids straight likes cats None m pisces no english, german (poorly) single
4 athletic None socially never graduated from college/university hey how's it going? currently vague on the pro... work work work work + play creating imagery to look at:<br />\nhttp://bag... i smile a lot and my inquisitive nature music: bands, rappers, musicians<br />\nat the... ... san francisco, california None straight likes dogs and likes cats None m aquarius no english single
5 average mostly anything socially None graduated from college/university i'm an australian living in san francisco, but... building awesome stuff. figuring out what's im... imagining random shit. laughing at aforementio... i have a big smile. i also get asked if i'm we... books: to kill a mockingbird, lord of the ring... ... san francisco, california doesn&rsquo;t have kids, but might want them straight likes cats atheism m taurus no english (fluently), chinese (okay) single
6 fit strictly anything socially never graduated from college/university life is about the little things. i love to lau... digging up buried treasure frolicking<br />\nwitty banter<br />\nusing my... i am the last unicorn i like books. ones with pictures. reading them... ... san francisco, california None straight likes dogs and likes cats None f virgo None english single
7 average mostly anything socially never graduated from college/university None writing. meeting new people, spending time wit... remembering people's birthdays, sending cards,... i'm rather approachable (a byproduct of being ... i like: alphabetized lists, aquariums, autobio... ... san francisco, california doesn&rsquo;t have kids, but wants them straight likes dogs and likes cats christianity f sagittarius no english, spanish (okay) single
8 None strictly anything socially None graduated from college/university None oh goodness. at the moment i have 4 jobs, so i... None i'm freakishly blonde and have the same name a... i am always willing to try new foods and am no... ... belvedere tiburon, california doesn&rsquo;t have kids straight likes dogs and likes cats christianity but not too serious about it f gemini but it doesn&rsquo;t matter when drinking english single
9 athletic mostly anything not at all never working on two-year college my names jake.<br />\ni'm a creative guy and i... i have an apartment. i like to explore and che... i'm good at finding creative solutions to prob... i'm short i like some tv. i love summer heights high and... ... san mateo, california None straight likes dogs and likes cats atheism and laughing about it m cancer but it doesn&rsquo;t matter no english (fluently) single

10 rows × 30 columns

processed = data.dropna(subset=['education'])

In the code below,

  • tol species the minimum frequency below which a category is considered rare.

  • replace_with species the value to be used to replace rare categories.

  • variables specify the list of categorical variables that will be encoded.

encoder = RareLabelEncoder(tol=0.05, variables=["education"], replace_with="Other")
encoded = encoder.fit_transform(processed)

Now the rare categories in the column education are replaced with “Other”.

encoded['education'].sample(10)
46107                                Other
45677       graduated from masters program
57928    graduated from college/university
53127        working on college/university
33300                                Other
33648       graduated from masters program
59701                                Other
57013       graduated from masters program
46428    graduated from college/university
57123    graduated from college/university
Name: education, dtype: object

Link to feature-engine.

6.2.4. Encode Categorical Data Using Frequency

!pip install feature-engine

Sometimes, count or frequency can be useful features for your model. If you want to replace categories by either the count or the percentage of observations per category, use feature_engine’s CountFrequencyEncoder.

import seaborn as sns
from feature_engine.encoding import CountFrequencyEncoder
from sklearn.model_selection import train_test_split

data = sns.load_dataset("diamonds")

X_train, X_test, y_train, y_test = train_test_split(data, data["price"], random_state=0)
X_train
carat cut color clarity depth table price x y z
441 0.89 Premium H SI2 60.2 59.0 2815 6.26 6.23 3.76
50332 0.70 Very Good D SI1 64.0 53.0 2242 5.57 5.61 3.58
35652 0.31 Ideal G VVS2 62.7 57.0 907 4.33 4.31 2.71
9439 0.90 Very Good H VS1 62.3 59.0 4592 6.12 6.17 3.83
15824 1.01 Good F VS2 60.6 62.0 6332 6.52 6.49 3.94
... ... ... ... ... ... ... ... ... ... ...
45891 0.52 Premium F VS2 60.7 59.0 1720 5.18 5.14 3.13
52416 0.70 Good D SI1 63.6 60.0 2512 5.59 5.51 3.51
42613 0.32 Premium I VS1 61.3 58.0 505 4.35 4.39 2.68
43567 0.41 Ideal G IF 61.0 57.0 1431 4.81 4.79 2.93
2732 0.91 Ideal F SI2 61.1 55.0 3246 6.24 6.19 3.80

40455 rows × 10 columns

In the code below, I encode color and clarity.

# initiate an encoder
encoder = CountFrequencyEncoder(
    encoding_method="frequency", variables=["color", "clarity"]
)

# fit the encoder
encoder.fit(X_train)

# process the data
p_train = encoder.transform(X_train)
p_test = encoder.transform(X_test)
p_test
carat cut color clarity depth table price x y z
10176 1.10 Ideal 0.152762 0.170436 62.0 55.0 4733 6.61 6.65 4.11
16083 1.29 Ideal 0.152762 0.242022 62.6 56.0 6424 6.96 6.93 4.35
13420 1.20 Premium 0.100531 0.242022 61.1 58.0 5510 6.88 6.80 4.18
20407 1.50 Ideal 0.179409 0.242022 60.9 56.0 8770 7.43 7.36 4.50
8909 0.90 Very Good 0.179409 0.227314 61.7 57.0 4493 6.17 6.21 3.82
... ... ... ... ... ... ... ... ... ... ...
52283 0.59 Very Good 0.182005 0.094401 61.7 59.0 2494 5.37 5.36 3.31
10789 1.00 Fair 0.152762 0.227314 64.8 62.0 4861 6.22 6.13 4.00
1190 0.70 Very Good 0.179409 0.094401 63.2 58.0 2932 5.66 5.60 3.56
3583 0.59 Ideal 0.182005 0.067384 60.7 57.0 3422 5.41 5.45 3.29
40845 0.46 Premium 0.182005 0.227314 61.5 60.0 1173 4.95 4.91 3.03

13485 rows × 10 columns

Link to feature-engine.

6.2.5. Return a DataFrame When Using a scikit-learn’s Transformer

!pip install feature_engine 

Applying a scikit-learn’s transformer on your DataFrame will return a NumPy array.

import pandas as pd 
from sklearn.preprocessing import StandardScaler
from feature_engine.wrappers import SklearnTransformerWrapper
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
StandardScaler().fit_transform(df)
array([[-1.22474487, -1.22474487],
       [ 0.        ,  0.        ],
       [ 1.22474487,  1.22474487]])

If you want to return a pandas DataFrame instead, use feature-engine’s SklearnTransformerWrapper along with your scikit-learn’s tranformer.

scaler = SklearnTransformerWrapper(transformer=StandardScaler())
scaler.fit_transform(df)
a b
0 -1.224745 -1.224745
1 0.000000 0.000000
2 1.224745 1.224745

Link to feature-engine.

6.2.6. Similarity Encoding for Dirty Categories Using dirty_cat

!pip install dirty-cat

To capture the similarities among dirty categories when encoding categorical variables, use dirty_cat’s SimilarityEncoder .

To understand how SimilarityEncoder works, let’s start with the employee_salaries dataset.

from dirty_cat.datasets import fetch_employee_salaries
from dirty_cat import SimilarityEncoder

X = fetch_employee_salaries().X
X.head(10)
gender department department_name division assignment_category employee_position_title underfilled_job_title date_first_hired year_first_hired
0 F POL Department of Police MSB Information Mgmt and Tech Division Records... Fulltime-Regular Office Services Coordinator NaN 09/22/1986 1986
1 M POL Department of Police ISB Major Crimes Division Fugitive Section Fulltime-Regular Master Police Officer NaN 09/12/1988 1988
2 F HHS Department of Health and Human Services Adult Protective and Case Management Services Fulltime-Regular Social Worker IV NaN 11/19/1989 1989
3 M COR Correction and Rehabilitation PRRS Facility and Security Fulltime-Regular Resident Supervisor II NaN 05/05/2014 2014
4 M HCA Department of Housing and Community Affairs Affordable Housing Programs Fulltime-Regular Planning Specialist III NaN 03/05/2007 2007
5 M POL Department of Police PSB 6th District Special Assignment Team Fulltime-Regular Police Officer III NaN 07/16/2007 2007
6 F FRS Fire and Rescue Services EMS Billing Fulltime-Regular Accountant/Auditor II NaN 06/27/2016 2016
7 M HHS Department of Health and Human Services Head Start Fulltime-Regular Administrative Specialist II NaN 11/17/2014 2014
8 M FRS Fire and Rescue Services Recruit Training Fulltime-Regular Firefighter/Rescuer III Firefighter/Rescuer I (Recruit) 12/12/2016 2016
9 F POL Department of Police FSB Traffic Division Automated Traffic Enforce... Fulltime-Regular Police Aide NaN 02/05/2007 2007
dirty_column = "employee_position_title"
X_dirty = df[dirty_column].values
X_dirty[:7]
array(['Office Services Coordinator', 'Master Police Officer',
       'Social Worker IV', 'Resident Supervisor II',
       'Planning Specialist III', 'Police Officer III',
       'Accountant/Auditor II'], dtype=object)

We can see that titles such as ‘Master Police Officer’ and ‘Police Officer III’ are similar. We can use SimilaryEncoder to encode these categories while capturing their similarities.

enc = SimilarityEncoder(similarity="ngram")
X_enc = enc.fit_transform(X_dirty[:10].reshape(-1, 1))
X_enc
array([[0.05882353, 0.03125   , 0.02739726, 0.19008264, 1.        ,
        0.01351351, 0.05555556, 0.20535714, 0.08088235, 0.032     ],
       [0.008     , 0.02083333, 0.056     , 1.        , 0.19008264,
        0.02325581, 0.23076923, 0.56      , 0.01574803, 0.02777778],
       [0.03738318, 0.07317073, 0.05405405, 0.02777778, 0.032     ,
        0.0733945 , 0.        , 0.0625    , 0.06542056, 1.        ],
       [0.11206897, 0.07142857, 0.09756098, 0.01574803, 0.08088235,
        0.07142857, 0.03125   , 0.08108108, 1.        , 0.06542056],
       [0.04761905, 0.3539823 , 0.06976744, 0.02325581, 0.01351351,
        1.        , 0.02      , 0.09821429, 0.07142857, 0.0733945 ],
       [0.0733945 , 0.05343511, 0.14953271, 0.56      , 0.20535714,
        0.09821429, 0.26086957, 1.        , 0.08108108, 0.0625    ],
       [1.        , 0.05      , 0.06451613, 0.008     , 0.05882353,
        0.04761905, 0.01052632, 0.0733945 , 0.11206897, 0.03738318],
       [0.05      , 1.        , 0.03378378, 0.02083333, 0.03125   ,
        0.3539823 , 0.02631579, 0.05343511, 0.07142857, 0.07317073],
       [0.06451613, 0.03378378, 1.        , 0.056     , 0.02739726,
        0.06976744, 0.        , 0.14953271, 0.09756098, 0.05405405],
       [0.01052632, 0.02631579, 0.        , 0.23076923, 0.05555556,
        0.02      , 1.        , 0.26086957, 0.03125   , 0.        ]])

Cool! Let’s create a heatmap to understand the correlation between the encoded features.

import seaborn as sns
import numpy as np
from sklearn.preprocessing import normalize
from IPython.core.pylabtools import figsize

def plot_similarity(labels, features):
  
    normalized_features = normalize(features)
    
    # Create correction matrix
    corr = np.inner(normalized_features, normalized_features)
    
    # Plot
    figsize(10, 10)
    sns.set(font_scale=1.2)
    g = sns.heatmap(corr, xticklabels=labels, yticklabels=labels, vmin=0,
        vmax=1, cmap="YlOrRd", annot=True, annot_kws={"size": 10})
        
    g.set_xticklabels(labels, rotation=90)
    g.set_title("Similarity")


def encode_and_plot(labels):
  
    enc = SimilarityEncoder(similarity="ngram") # Encode
    X_enc = enc.fit_transform(labels.reshape(-1, 1))
    
    plot_similarity(labels, X_enc) # Plot
encode_and_plot(X_dirty[:10])
../_images/feature_engineer_57_0.png

As we can see from the matrix above,

  • The similarity between the same strings such as ‘Office Services Coordinator’ and ‘Office Services Coordinator’ is 1

  • The similarity between somewhat similar strings such as ‘Office Services Coordinator’ and ‘Master Police Officer’ is 0.41

  • The similarity between two very different strings such as ‘Social Worker IV’ and ‘Polic Aide’ is 0.028

Link to dirty-cat.

Link to my full article about dirty-cat.

6.2.7. Snorkel — Programmatically Build Training Data in Python

!pip install snorkel

Imagine you try to determine whether a job posting is fake or not. You come up with some assumptions about a fake job posting, such as:

  • If a job posting has few to no descriptions about the requirements, it is likely to be fake.

  • If a job posting does not include any company profile or logo, it is likely to be fake.

  • If the job posting requires some sort of education or experience, it is likely to be real.

import pandas as pd 
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)


train_df = pd.read_pickle(
    "https://github.com/khuyentran1401/Data-science/blob/master/feature_engineering/snorkel_example/train_fake_jobs.pkl?raw=true"
)
train_df.head(5)
job_id title location department salary_range company_profile description requirements benefits telecommuting has_company_logo has_questions employment_type required_experience required_education industry function fraudulent
12276 12277 Big Data Analyst GB, WSM, London Product Ops NaN Founded in 2010 by a team from Google’s London... Qubit: Cutting Edge Big Data EngineeringQubit ... What you'll need: A background in consulting, ... Plenty of perks:As well as the opportunity to ... 0 1 1 Full-time Associate Bachelor's Degree Internet Product Management 0
14680 14681 Instructional Advocate US, GA, Savannah NaN NaN We are an after-school program committed to as... 21st Century Community Learning Centers is an ... Bachelor's Degree or an Associate's Degree; or... NaN 0 1 0 Part-time NaN NaN NaN Education 0
16518 16519 Software Developer US, FL, Gainesville NaN NaN 352 Inc. is a full-service digital agency crea... We partner with great clients to build smart s... 3-5 years of great c# work Experience in mvc o... What You’ll GetFreedom: We trust you to do you... 0 1 0 Full-time Mid-Senior level NaN Computer Software Information Technology 0
15478 15479 Internship in India IN, , Bangalore NaN NaN London is a fast paced city of culture, divers... As specialists in delivering high quality and ... 0 1 0 NaN NaN NaN NaN NaN 0
16348 16349 Web Developer Backend Microservices (m/f) DE, BE, 10969 Engineering NaN airfy prägt sicheres und einfach zu bedienende... Design and develop a microservice platform for... Senior level experience with web backends, esp... Flat hierarchies and a productive work environ... 0 1 0 Full-time Associate Bachelor's Degree Internet Engineering 0

How do you test which of these features are the most accurate in predicting fraud?

That is when Snorkel comes in handy. Snorkel is an open-source Python library for programmatically building training datasets without manual labeling.

To learn how Snorkel works, start with giving a meaningful name to each value:

from snorkel.labeling import labeling_function, PandasLFApplier, LFAnalysis

FAKE = 1
REAL = 0
ABSTAIN = -1

We assume that:

  • Fake companies don’t have company profiles or logos

  • Fake companies are found in a lot of fake job postings

  • Real job postings often requires a certain level of experience and education

Let’s test those assumptions using Snorkel’s labeling_function decorator. The labeling_function decorator allows us to quickly label instances in a dataset using functions.

@labeling_function()
def no_company_profile(x: pd.Series):
    return FAKE if x.company_profile == "" else ABSTAIN


@labeling_function()
def no_company_logo(x: pd.Series):
    return FAKE if x.has_company_logo == 0 else ABSTAIN


@labeling_function()
def required_experience(x: pd.Series):
    return REAL if x.required_experience else ABSTAIN


@labeling_function()
def required_education(x: pd.Series):
    return REAL if x.required_education else ABSTAIN

ABSTAIN or -1 tells Snorkel not to make any conclusion about the instance that doesn’t satisfy the condition.

Next, we will use each of these labeling functions to label our training dataset:

lfs = [
    no_company_profile,
    no_company_logo,
    required_experience,
    required_education,
]

applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=train_df)
100%|██████████████████████████████████████████████| 13410/13410 [00:02<00:00, 5849.25it/s]

Now that we have created the labels using each labeling function, we can use LFAnalysis to determine the accuracy of these labels.

LFAnalysis(L=L_train, lfs=lfs).lf_summary(Y=train_df.fraudulent.values)
j Polarity Coverage Overlaps Conflicts Correct Incorrect Emp. Acc.
no_company_profile 0 [1] 0.186204 0.186204 0.186204 459 2038 0.183821
no_company_logo 1 [1] 0.205742 0.205742 0.205742 459 2300 0.166365
required_experience 2 [0] 1.000000 1.000000 0.244295 12741 669 0.950112
required_education 3 [0] 1.000000 1.000000 0.244295 12741 669 0.950112

Details of the statistics in the table above:

  • Polarity: The set of unique labels this LF outputs (excluding abstains)

  • Coverage: The fraction of the dataset that is labeled

  • Overlaps: The fraction of the dataset where this LF and at least one other LF agree

  • Conflicts: The fraction of the dataset where this LF and at least one other LF disagree

  • Correct: The number of data points this LF labels correctly

  • Incorrect: The number of data points this LF labels incorrectly

  • Empirical Accuracy: The empirical accuracy of this LF

Link to Snorkel.

My full article about Snorkel.