Feature Extraction
distfit: Find The Best Theoretical Distribution For Your Data in Python
!pip install distfit
If you want to find the best theoretical distribution for your data in Python, try distfit.
import numpy as np
from distfit import distfit
X = np.random.normal(0, 3, 1000)
# Initialize model
dist = distfit()
# Find best theoretical distribution for empirical data X
distribution = dist.fit_transform(X)
dist.plot()
[distfit] >fit..
[distfit] >transform..
[distfit] >[norm ] [0.00 sec] [RSS: 0.0037316] [loc=-0.018 scale=2.999]
[distfit] >[expon ] [0.00 sec] [RSS: 0.1588997] [loc=-14.019 scale=14.001]
[distfit] >[dweibull ] [0.00 sec] [RSS: 0.0079433] [loc=-0.012 scale=2.529]
[distfit] >[t ] [0.02 sec] [RSS: 0.0036884] [loc=-0.012 scale=2.873]
[distfit] >[genextreme] [0.07 sec] [RSS: 0.0049831] [loc=-1.132 scale=3.037]
[distfit] >[gamma ] [0.04 sec] [RSS: 0.0038504] [loc=-101.098 scale=0.089]
[distfit] >[lognorm ] [0.09 sec] [RSS: 0.0037897] [loc=-237.099 scale=237.056]
[distfit] >[uniform ] [0.00 sec] [RSS: 0.1145382] [loc=-14.019 scale=24.469]
[distfit] >[loggamma ] [0.04 sec] [RSS: 0.0036960] [loc=-239.858 scale=44.472]
[distfit] >Compute confidence interval [parametric]
[distfit] >plot..
(<Figure size 1000x800 with 1 Axes>,
<AxesSubplot:title={'center':'\nt\ndf=24.44, loc=-0.01, scale=2.87'}, xlabel='Values', ylabel='Frequency'>)
Besides finding the best theoretical distribution, distfit is also useful for detecting outliers. New data points that deviate significantly from the fitted distribution can then be marked as outliers.
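The outlier idea can be illustrated without distfit itself. The sketch below is not distfit's API: it assumes the data is roughly normal, fits a normal distribution with the standard library's statistics.NormalDist, and flags new points that fall outside a central 95% interval. The alpha value and the is_outlier helper are hypothetical choices made for illustration.

```python
import random
from statistics import NormalDist

# Simulate the same kind of data as above: 1000 draws from N(0, 3).
random.seed(0)
sample = [random.gauss(0, 3) for _ in range(1000)]

# Fit a normal distribution by estimating mean and standard deviation.
fitted = NormalDist.from_samples(sample)

# Assumption: anything outside the central 95% of the fitted
# distribution counts as an outlier.
alpha = 0.05
lower = fitted.inv_cdf(alpha / 2)
upper = fitted.inv_cdf(1 - alpha / 2)


def is_outlier(value):
    """Return True when value lies outside the fitted 95% interval."""
    return value < lower or value > upper


new_points = [0.5, -2.0, 12.0, -15.0]
flags = {point: is_outlier(point) for point in new_points}
print(f"interval: [{lower:.2f}, {upper:.2f}]")
print(flags)
```

Points close to the center of the fitted distribution are kept, while points far in either tail are flagged.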
datefinder: Automatically Find Dates and Time in a Python String
!pip install datefinder
If you want to automatically find date and time with different formats in a Python string, try datefinder.
from datefinder import find_dates
text = """We have one meeting on May 17th,
2021 at 9:00am and another meeting on 5/18/2021
at 10:00. I hope you can attend one of the
meetings."""
matches = find_dates(text)
for match in matches:
    print("Date and time:", match)
    print("Only day:", match.day)
Date and time: 2021-05-17 09:00:00
Only day: 17
Date and time: 2021-05-18 10:00:00
Only day: 18
pytrends: Get the Trend of a Keyword on Google Search Over Time
!pip install pytrends
If you want to get the trend of a keyword on Google Search over time, try pytrends.
In the code below, I use pytrends to get the interest of the keyword “data science” on Google Search from 2016 to 2021.
from pytrends.request import TrendReq
pytrends = TrendReq(hl="en-US", tz=360)
pytrends.build_payload(kw_list=["data science"])
df = pytrends.interest_over_time()
df["data science"].plot(figsize=(20, 7))
<AxesSubplot:xlabel='date'>
Fastai’s add_datepart: Add Relevant DateTime Features in One Line of Code
!pip install fastai
When working with time series, other features such as year, month, week, day of the week, day of the year, whether it is the end of the year or not, can be really helpful to predict future events. Is there a way that you can get all of those features in one line of code?
Fastai’s add_datepart method allows you to do exactly that.
import pandas as pd
from fastai.tabular.core import add_datepart
from datetime import datetime
df = pd.DataFrame(
    {
        "date": [
            datetime(2020, 2, 5),
            datetime(2020, 2, 6),
            datetime(2020, 2, 7),
            datetime(2020, 2, 8),
        ],
        "val": [1, 2, 3, 4],
    }
)
df
| | date | val |
|---|---|---|
| 0 | 2020-02-05 | 1 |
| 1 | 2020-02-06 | 2 |
| 2 | 2020-02-07 | 3 |
| 3 | 2020-02-08 | 4 |
df = add_datepart(df, "date")
df.columns
Index(['val', 'Year', 'Month', 'Week', 'Day', 'Dayofweek', 'Dayofyear',
'Is_month_end', 'Is_month_start', 'Is_quarter_end', 'Is_quarter_start',
'Is_year_end', 'Is_year_start', 'Elapsed'],
dtype='object')
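To make the new columns less opaque, here is a rough standard-library sketch of how they could be derived for a single date. It is an illustration under stated assumptions, not fastai's implementation: the date_features helper is hypothetical, Week is taken to be the ISO week number, Dayofweek counts Monday as 0, and Elapsed is computed as seconds since the Unix epoch with the date treated as UTC.

```python
import calendar
from datetime import datetime, timezone


def date_features(moment):
    """Return calendar features for one naive datetime (hypothetical helper)."""
    last_day = calendar.monthrange(moment.year, moment.month)[1]
    is_month_end = moment.day == last_day
    is_month_start = moment.day == 1
    return {
        "Year": moment.year,
        "Month": moment.month,
        "Week": moment.isocalendar()[1],  # assumption: ISO week number
        "Day": moment.day,
        "Dayofweek": moment.weekday(),  # Monday is 0
        "Dayofyear": moment.timetuple().tm_yday,
        "Is_month_end": is_month_end,
        "Is_month_start": is_month_start,
        "Is_quarter_end": is_month_end and moment.month % 3 == 0,
        "Is_quarter_start": is_month_start and moment.month % 3 == 1,
        "Is_year_end": is_month_end and moment.month == 12,
        "Is_year_start": is_month_start and moment.month == 1,
        # Assumption: seconds since the Unix epoch, date treated as UTC.
        "Elapsed": int(moment.replace(tzinfo=timezone.utc).timestamp()),
    }


features = date_features(datetime(2020, 2, 5))
for name, value in features.items():
    print(name, value)
```

Writing the features out by hand like this is mostly useful for checking what each column means; for a whole DataFrame the one-line call above is far more convenient.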
Geopy: Extract Location Based on Python String
!pip install geopy
If you work with location data, you might want to visualize it on a map. Geopy makes it easy to look up the coordinates of addresses across the globe from a Python string.
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="find_location")
location = geolocator.geocode("30 North Circle Drive, Edwardsville, IL")
After defining the app name and passing in the location, all you need to do to extract information about the location is use location.address.
location.address
'30, Circle Drive, Edwardsville, Madison County, Illinois, 62025, United States'
To extract the latitude and longitude, use location.latitude and location.longitude.
location.latitude, location.longitude
(38.80371599362934, -89.93842706888563)
Maya: Convert the string to datetime automatically
!pip install maya
If you want to convert a string to a datetime, the common way is to use strptime(date_string, format). But it is quite inconvenient to specify the structure of your datetime string, such as ‘%Y-%m-%d %H:%M:%S’.
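For comparison, this is roughly what the manual approach looks like with the standard library. The format string has to match the input exactly, including the fractional seconds (%f) and the UTC offset (%z). The second input string is a made-up example of a layout the format does not cover.

```python
from datetime import datetime, timezone

date_string = "2016-12-16 18:23:45.423992+00:00"
date_format = "%Y-%m-%d %H:%M:%S.%f%z"

# Parsing works only because date_format mirrors the string piece by piece.
parsed = datetime.strptime(date_string, date_format)
print(parsed)
print("Is UTC:", parsed.tzinfo == timezone.utc)

# A string in any other layout raises ValueError with this format.
try:
    datetime.strptime("16/12/2016 18:23", date_format)
except ValueError as error:
    print("Could not parse:", error)
```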
There is a tool that helps you convert the string to datetime automatically called maya. You just need to parse the string and maya will figure out the structure of your string.
import maya
# Automatically parse datetime string
string = "2016-12-16 18:23:45.423992+00:00"
maya.parse(string).datetime()
datetime.datetime(2016, 12, 16, 18, 23, 45, 423992, tzinfo=<UTC>)
Better yet, if you want to convert the string to a different time zone (for example, CST), you can pass the time zone to maya’s datetime method through the to_timezone argument.
maya.parse(string).datetime(to_timezone="US/Central")
datetime.datetime(2016, 12, 16, 12, 23, 45, 423992, tzinfo=<DstTzInfo 'US/Central' CST-1 day, 18:00:00 STD>)
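The same conversion can be sketched with the standard library to show what it does to the timestamp. This is not maya's implementation: it models CST as a fixed UTC-6 offset, whereas a real time zone such as US/Central also switches to daylight saving time during part of the year.

```python
from datetime import datetime, timedelta, timezone

utc_time = datetime(2016, 12, 16, 18, 23, 45, 423992, tzinfo=timezone.utc)

# Assumption: CST modeled as a fixed UTC-6 offset (no daylight saving rules).
cst = timezone(timedelta(hours=-6), name="CST")

# astimezone keeps the same instant and only changes the wall-clock reading.
central_time = utc_time.astimezone(cst)
print(central_time)
print("Same instant:", central_time == utc_time)
```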
Check out the maya documentation for more ways of manipulating your date strings.
Extract holiday from date column
!pip install holidays
You have a date column and you think the holidays might affect the target of your data. Is there an easy way to extract the holidays from the date? Yes, that is where the holidays package comes in handy.
The holidays package provides a dictionary of holidays for different countries. The code below checks whether 2014-07-04 is a US holiday and then extracts the name of the holiday.
from datetime import date
import holidays
us_holidays = holidays.UnitedStates()
"2014-07-04" in us_holidays
True
The great thing about this package is that you can write the date in whatever way you want and the package is still able to detect which date you are talking about.
us_holidays.get("2014-7-4")
'Independence Day'
us_holidays.get("2014/7/4")
'Independence Day'
You can also add your own holidays if you think the library is missing some. Try this out if you are looking for something similar.
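Once membership and name lookups work, turning them into features for a date column is a short loop. The sketch below uses a hypothetical HOLIDAYS dictionary keyed by datetime.date as a stand-in for the us_holidays object, so it runs without the package; it relies only on the `in` and `.get` lookups shown above. The holiday_features helper and the sample dates are made up for illustration.

```python
from datetime import date

# Hypothetical stand-in for us_holidays: a plain dict of date -> holiday name.
HOLIDAYS = {
    date(2014, 7, 4): "Independence Day",
    date(2014, 12, 25): "Christmas Day",
}


def holiday_features(dates, holiday_lookup=HOLIDAYS):
    """Build one row of holiday features per date (hypothetical helper)."""
    rows = []
    for day in dates:
        rows.append(
            {
                "date": day,
                "is_holiday": day in holiday_lookup,
                "holiday_name": holiday_lookup.get(day),  # None on ordinary days
            }
        )
    return rows


date_column = [date(2014, 7, 3), date(2014, 7, 4), date(2014, 12, 25)]
rows = holiday_features(date_column)
for row in rows:
    print(row)
```

The resulting is_holiday flag and holiday_name value can then be added as two new columns next to the original dates.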
fastai’s cont_cat_split: Get a DataFrame’s Continuous and Categorical Variables Based on Their Cardinality
!pip install fastai
To get a DataFrame’s continuous and categorical variables based on their cardinality, use fastai’s cont_cat_split method.
If a column consists of integers but its cardinality (the number of unique values) does not exceed the max_card parameter, it is treated as a categorical variable.
import pandas as pd
from fastai.tabular.core import cont_cat_split
df = pd.DataFrame(
    {
        "col1": [1, 2, 3, 4, 5],
        "col2": ["a", "b", "c", "d", "e"],
        "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
)
cont_names, cat_names = cont_cat_split(df)
print("Continuous columns:", cont_names)
print("Categorical columns:", cat_names)
Continuous columns: ['col3']
Categorical columns: ['col1', 'col2']
cont_names, cat_names = cont_cat_split(df, max_card=3)
print("Continuous columns:", cont_names)
print("Categorical columns:", cat_names)
Continuous columns: ['col1', 'col3']
Categorical columns: ['col2']
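The rule can be sketched in plain Python to make the two results above easier to follow. This is a simplified illustration, not fastai's implementation: the split_by_cardinality helper is hypothetical, columns are plain lists rather than DataFrame columns, and the default max_card of 20 is an assumption.

```python
def split_by_cardinality(columns, max_card=20):
    """Split column names into continuous and categorical (hypothetical helper).

    Floats are treated as continuous. Integers are continuous only when
    they have more than max_card unique values. Everything else is
    treated as categorical.
    """
    cont_names, cat_names = [], []
    for name, values in columns.items():
        all_float = all(isinstance(v, float) for v in values)
        all_int = all(
            isinstance(v, int) and not isinstance(v, bool) for v in values
        )
        high_cardinality = len(set(values)) > max_card
        if all_float or (all_int and high_cardinality):
            cont_names.append(name)
        else:
            cat_names.append(name)
    return cont_names, cat_names


columns = {
    "col1": [1, 2, 3, 4, 5],
    "col2": ["a", "b", "c", "d", "e"],
    "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
}

default_split = split_by_cardinality(columns)
strict_split = split_by_cardinality(columns, max_card=3)
print("Default max_card:", default_split)
print("max_card=3:", strict_split)
```

With five unique values, col1 stays categorical under the assumed default threshold and becomes continuous once max_card drops to 3, which matches the outputs shown above.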