Feature Extraction¶

distfit: Find The Best Theoretical Distribution For Your Data in Python¶

!pip install distfit

If you want to find the best theoretical distribution for your data in Python, try distfit.

import numpy as np
from distfit import distfit

X = np.random.normal(0, 3, 1000)

# Initialize model
dist = distfit()

# Find best theoretical distribution for empirical data X
distribution = dist.fit_transform(X)
dist.plot()

[distfit] >fit..
[distfit] >transform..
[distfit] >[norm      ] [0.00 sec] [RSS: 0.0037316] [loc=-0.018 scale=2.999]
[distfit] >[expon     ] [0.00 sec] [RSS: 0.1588997] [loc=-14.019 scale=14.001]
[distfit] >[dweibull  ] [0.00 sec] [RSS: 0.0079433] [loc=-0.012 scale=2.529]
[distfit] >[t         ] [0.02 sec] [RSS: 0.0036884] [loc=-0.012 scale=2.873]
[distfit] >[genextreme] [0.07 sec] [RSS: 0.0049831] [loc=-1.132 scale=3.037]
[distfit] >[gamma     ] [0.04 sec] [RSS: 0.0038504] [loc=-101.098 scale=0.089]
[distfit] >[lognorm   ] [0.09 sec] [RSS: 0.0037897] [loc=-237.099 scale=237.056]
[distfit] >[uniform   ] [0.00 sec] [RSS: 0.1145382] [loc=-14.019 scale=24.469]
[distfit] >[loggamma  ] [0.04 sec] [RSS: 0.0036960] [loc=-239.858 scale=44.472]
[distfit] >Compute confidence interval [parametric]
[distfit] >plot..

(<Figure size 1000x800 with 1 Axes>,
 <AxesSubplot:title={'center':'\nt\ndf=24.44, loc=-0.01, scale=2.87'}, xlabel='Values', ylabel='Frequency'>)

Besides finding the best theoretical distribution, distfit is also useful for detecting outliers: new data points that deviate significantly from the fitted distribution can be marked as outliers.
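The outlier idea can be sketched without distfit itself. The example below is a minimal illustration using only the standard library: it assumes a normal distribution (rather than choosing among candidates as distfit does), fits it to the sample, and flags new points that fall outside the central 95% of the fitted distribution. The 95% threshold and the `is_outlier` helper are assumptions made for this sketch, not distfit's API.

```python
import random
from statistics import NormalDist

random.seed(0)

# Empirical sample, comparable to np.random.normal(0, 3, 1000) above
X = [random.gauss(0, 3) for _ in range(1000)]

# Fit a normal distribution by estimating mean and standard deviation
fitted = NormalDist.from_samples(X)

# Two-sided 95% bounds of the fitted distribution (assumed threshold)
lower, upper = fitted.inv_cdf(0.025), fitted.inv_cdf(0.975)


def is_outlier(value):
    """Return True when value lies outside the fitted 95% interval."""
    return value < lower or value > upper


new_points = [-0.5, 2.0, 9.5, -12.0]
flags = [is_outlier(v) for v in new_points]
print(list(zip(new_points, flags)))
```

Points near the center of the fitted distribution are kept, while the two far-away values are flagged.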

Link to distfit.

datefinder: Automatically Find Dates and Time in a Python String¶

!pip install datefinder

If you want to automatically find dates and times in different formats in a Python string, try datefinder.

from datefinder import find_dates

text = """We have one meeting on May 17th,
2021 at 9:00am and another meeting on 5/18/2021
at 10:00. I hope you can attend one of the
meetings."""

matches = find_dates(text)

for match in matches:
    print("Date and time:", match)
    print("Only day:", match.day)

Date and time: 2021-05-17 09:00:00
Only day: 17
Date and time: 2021-05-18 10:00:00
Only day: 18

Link to datefinder.

pytrends: Get the Trend of a Keyword on Google Search Over Time¶

!pip install pytrends

If you want to get the trend of a keyword on Google Search over time, try pytrends.

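In the code below, I use pytrends to get the interest in the keyword “data science” on Google Search. No timeframe is passed to build_payload, so its default window of the past five years applies (2016 to 2021 when this was run); pass the timeframe argument to request a different range.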

from pytrends.request import TrendReq

pytrends = TrendReq(hl="en-US", tz=360)
pytrends.build_payload(kw_list=["data science"])

df = pytrends.interest_over_time()
df["data science"].plot(figsize=(20, 7))

<AxesSubplot:xlabel='date'>

Link to pytrends

Fastai’s add_datepart: Add Relevant DateTime Features in One Line of Code¶

!pip install fastai

When working with time series, features such as the year, month, week, day of the week, day of the year, and whether a date falls at the end of the year can help predict future events. Is there a way to get all of those features in one line of code?

Fastai’s add_datepart function allows you to do exactly that.

import pandas as pd
from fastai.tabular.core import add_datepart
from datetime import datetime

df = pd.DataFrame(
    {
        "date": [
            datetime(2020, 2, 5),
            datetime(2020, 2, 6),
            datetime(2020, 2, 7),
            datetime(2020, 2, 8),
        ],
        "val": [1, 2, 3, 4],
    }
)

df

	date	val
0	2020-02-05	1
1	2020-02-06	2
2	2020-02-07	3
3	2020-02-08	4

df = add_datepart(df, "date")
df.columns

Index(['val', 'Year', 'Month', 'Week', 'Day', 'Dayofweek', 'Dayofyear',
       'Is_month_end', 'Is_month_start', 'Is_quarter_end', 'Is_quarter_start',
       'Is_year_end', 'Is_year_start', 'Elapsed'],
      dtype='object')
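To make the meaning of these columns concrete, here is a rough standard-library sketch that derives a few of them for a single date. The `date_features` helper is hypothetical and is not how fastai computes them; in particular, treating Week as the ISO week number is an assumption.

```python
import calendar
from datetime import datetime


def date_features(d: datetime) -> dict:
    """Derive a few calendar features from one datetime (illustrative only)."""
    last_day = calendar.monthrange(d.year, d.month)[1]  # days in this month
    return {
        "Year": d.year,
        "Month": d.month,
        "Week": d.isocalendar()[1],  # ISO week number (assumed definition)
        "Day": d.day,
        "Dayofweek": d.weekday(),  # Monday=0 ... Sunday=6
        "Dayofyear": d.timetuple().tm_yday,
        "Is_month_end": d.day == last_day,
        "Is_month_start": d.day == 1,
        "Is_year_end": (d.month, d.day) == (12, 31),
        "Is_year_start": (d.month, d.day) == (1, 1),
    }


features = date_features(datetime(2020, 2, 5))
print(features)
```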

Link to Fastai’s methods to work with tabular data

Geopy: Extract Location Based on Python String¶

!pip install geopy

If you work with location data, you might want to visualize it on a map. Geopy makes it easy to look up the coordinates of addresses across the globe from a Python string.

from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="find_location")
location = geolocator.geocode("30 North Circle Drive, Edwardsville, IL")

After defining the app name and passing in the location, use location.address to extract the full address of the location.

location.address

'30, Circle Drive, Edwardsville, Madison County, Illinois, 62025, United States'

To extract the latitude and longitude, use location.latitude and location.longitude.

location.latitude, location.longitude

(38.80371599362934, -89.93842706888563)

Link to Geopy

Maya: Convert a String to Datetime Automatically¶

!pip install maya

If you want to convert a string to a datetime, the common way is to use strptime(date_string, format). But it is quite inconvenient to specify the structure of your datetime string, such as ‘%Y-%m-%d %H:%M:%S’.
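For comparison, this is roughly what the strptime approach looks like with the standard library. The sample strings are made up for illustration; the point is that the format must match the string exactly, and a different layout needs a different format string.

```python
from datetime import datetime

date_string = "2016-12-16 18:23:45"

# The format string must describe the layout of date_string exactly
parsed = datetime.strptime(date_string, "%Y-%m-%d %H:%M:%S")
print(parsed)

# A string with a different layout fails with the same format
try:
    datetime.strptime("16/12/2016 18:23", "%Y-%m-%d %H:%M:%S")
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("Format mismatch:", err)
```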

There is a tool called maya that converts the string to datetime automatically. You just need to pass the string to maya.parse, and maya will figure out its structure.

import maya

# Automatically parse datetime string
string = "2016-12-16 18:23:45.423992+00:00"
maya.parse(string).datetime()

datetime.datetime(2016, 12, 16, 18, 23, 45, 423992, tzinfo=<UTC>)

Better yet, if you want to convert the string to a different time zone (for example, CST), pass the time zone to maya’s datetime method through the to_timezone argument.

maya.parse(string).datetime(to_timezone="US/Central")

datetime.datetime(2016, 12, 16, 12, 23, 45, 423992, tzinfo=<DstTzInfo 'US/Central' CST-1 day, 18:00:00 STD>)

Check out the maya documentation for more ways to manipulate date strings.

Extract Holidays From a Date Column¶

!pip install holidays

You have a date column, and you think holidays might affect the target of your data. Is there an easy way to extract the holidays from the dates? Yes, this is where the holidays package comes in handy.

The holidays package provides a dictionary of holidays for different countries. The code below confirms whether 2014-07-04 is a US holiday and extracts the name of the holiday.

from datetime import date
import holidays

us_holidays = holidays.UnitedStates()

"2014-07-04" in us_holidays

True

The great thing about this package is that you can write the date in a variety of common formats and the package is still able to detect which date you are talking about.

us_holidays.get("2014-7-4")

'Independence Day'

us_holidays.get("2014/7/4")

'Independence Day'

You can also add your own holidays if you think the library is missing some. Try the holidays package if you need this kind of lookup.
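As a rough sketch of how such a lookup becomes features for a date column, the example below uses a plain dict as a hypothetical stand-in for holidays.UnitedStates(), adds a made-up custom holiday, and builds a flag and a name for each date. With the real package, us_holidays would take the place of `holiday_lookup`, using the same `in` and `.get` lookups shown above; the stand-in dict and the "Company Day Off" entry are assumptions for illustration only.

```python
from datetime import date

# Hypothetical stand-in for holidays.UnitedStates(): date -> holiday name
holiday_lookup = {
    date(2014, 7, 4): "Independence Day",
    date(2014, 12, 25): "Christmas Day",
}

# Add a custom, made-up holiday that a library would not know about
holiday_lookup[date(2014, 7, 5)] = "Company Day Off"

date_column = [date(2014, 7, 3), date(2014, 7, 4), date(2014, 7, 5)]

# One feature row per date: a flag plus the holiday name (None if no holiday)
features = [
    {
        "date": d,
        "is_holiday": d in holiday_lookup,
        "holiday_name": holiday_lookup.get(d),
    }
    for d in date_column
]

for row in features:
    print(row)
```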

fastai’s cont_cat_split: Get a DataFrame’s Continuous and Categorical Variables Based on Their Cardinality¶

!pip install fastai

To split a DataFrame’s columns into continuous and categorical variables based on their cardinality, use fastai’s cont_cat_split function.

If a column consists of integers but its cardinality is no larger than the max_card parameter (20 by default), it is treated as a categorical variable.

import pandas as pd
from fastai.tabular.core import cont_cat_split

df = pd.DataFrame(
    {
        "col1": [1, 2, 3, 4, 5],
        "col2": ["a", "b", "c", "d", "e"],
        "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
)

cont_names, cat_names = cont_cat_split(df)
print("Continuous columns:", cont_names)
print("Categorical columns:", cat_names)

Continuous columns: ['col3']
Categorical columns: ['col1', 'col2']

cont_names, cat_names = cont_cat_split(df, max_card=3)
print("Continuous columns:", cont_names)
print("Categorical columns:", cat_names)

Continuous columns: ['col1', 'col3']
Categorical columns: ['col2']
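The rule behind these results can be sketched in plain Python. This is an illustration under stated assumptions, not fastai's implementation: `split_by_cardinality` is a hypothetical helper that works on a dict of lists instead of a DataFrame, and it assumes that float columns are always continuous while integer columns are continuous only when they have more than max_card distinct values.

```python
def split_by_cardinality(columns, max_card=20):
    """Split column names into (continuous, categorical) lists."""
    cont_names, cat_names = [], []
    for name, values in columns.items():
        is_float = all(isinstance(v, float) for v in values)
        # bool is a subclass of int, so exclude it explicitly
        is_int = all(isinstance(v, int) and not isinstance(v, bool) for v in values)
        # Assumed rule: floats, or integers with many distinct values, are continuous
        if is_float or (is_int and len(set(values)) > max_card):
            cont_names.append(name)
        else:
            cat_names.append(name)
    return cont_names, cat_names


data = {
    "col1": [1, 2, 3, 4, 5],
    "col2": ["a", "b", "c", "d", "e"],
    "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
}

default_split = split_by_cardinality(data)
strict_split = split_by_cardinality(data, max_card=3)
print(default_split)
print(strict_split)
```

Lowering max_card to 3 moves col1 to the continuous group because it has five distinct integer values.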

Link to the documentation.

Effective Python for Data Scientists