6.1. Feature Extraction#
6.1.1. distfit: Find The Best Theoretical Distribution For Your Data#
!pip install distfit
To identify the theoretical distribution that best fits your data in Python, try distfit. It fits and compares multiple candidate distributions and reports the best match for your dataset.
import numpy as np
from distfit import distfit
X = np.random.normal(0, 3, 1000)
# Initialize model
dist = distfit()
# Find best theoretical distribution for empirical data X
distribution = dist.fit_transform(X)
dist.plot()
[distfit] >fit..
[distfit] >transform..
[distfit] >[norm ] [0.00 sec] [RSS: 0.0037316] [loc=-0.018 scale=2.999]
[distfit] >[expon ] [0.00 sec] [RSS: 0.1588997] [loc=-14.019 scale=14.001]
[distfit] >[dweibull ] [0.00 sec] [RSS: 0.0079433] [loc=-0.012 scale=2.529]
[distfit] >[t ] [0.02 sec] [RSS: 0.0036884] [loc=-0.012 scale=2.873]
[distfit] >[genextreme] [0.07 sec] [RSS: 0.0049831] [loc=-1.132 scale=3.037]
[distfit] >[gamma ] [0.04 sec] [RSS: 0.0038504] [loc=-101.098 scale=0.089]
[distfit] >[lognorm ] [0.09 sec] [RSS: 0.0037897] [loc=-237.099 scale=237.056]
[distfit] >[uniform ] [0.00 sec] [RSS: 0.1145382] [loc=-14.019 scale=24.469]
[distfit] >[loggamma ] [0.04 sec] [RSS: 0.0036960] [loc=-239.858 scale=44.472]
[distfit] >Compute confidence interval [parametric]
[distfit] >plot..
[Histogram of X with the best-fit distribution overlaid: t, df=24.44, loc=-0.01, scale=2.87]
Beyond finding the optimal distribution, distfit can also help identify outliers based on their deviation from the fitted distribution.
6.1.2. Geopy: Extract Location Based on Python String#
!pip install geopy
Geopy simplifies extracting geospatial information from location strings. With just a few lines of code, you can obtain the coordinates of addresses around the globe.
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="find_location")
location = geolocator.geocode("30 North Circle Drive")
To get detailed information about the location:
location.address
'30, North Circle Drive, East Longmeadow, Hampden County, Massachusetts, 01028, United States'
You can also extract latitude and longitude:
location.latitude, location.longitude
(35.8796631, -79.0770546)
6.1.3. fastai’s cont_cat_split: Separate Continuous and Categorical Variables#
!pip install fastai
Fastai’s cont_cat_split method automatically separates continuous and categorical columns in a DataFrame based on their cardinality.
import pandas as pd
from fastai.tabular.core import cont_cat_split
df = pd.DataFrame(
    {
        "col1": [1, 2, 3, 4, 5],
        "col2": ["a", "b", "c", "d", "e"],
        "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
)
cont_names, cat_names = cont_cat_split(df)
print("Continuous columns:", cont_names)
print("Categorical columns:", cat_names)
Continuous columns: ['col3']
Categorical columns: ['col1', 'col2']
cont_names, cat_names = cont_cat_split(df, max_card=3)
print("Continuous columns:", cont_names)
print("Categorical columns:", cat_names)
Continuous columns: ['col1', 'col3']
Categorical columns: ['col2']
6.1.4. Patsy: Build Features with Arbitrary Python Code#
!pip install patsy
Patsy lets you quickly build model features using an intuitive formula syntax, which is ideal for experimentation.
from sklearn.datasets import load_wine
import pandas as pd
df = load_wine(as_frame=True)
data = pd.concat([df['data'], df['target']], axis=1)
data.head(10)
|   | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.80 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 | 0 |
| 1 | 13.20 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.40 | 1050.0 | 0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.80 | 3.24 | 0.30 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 | 0 |
| 3 | 14.37 | 1.95 | 2.50 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.80 | 0.86 | 3.45 | 1480.0 | 0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.80 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 | 0 |
| 5 | 14.20 | 1.76 | 2.45 | 15.2 | 112.0 | 3.27 | 3.39 | 0.34 | 1.97 | 6.75 | 1.05 | 2.85 | 1450.0 | 0 |
| 6 | 14.39 | 1.87 | 2.45 | 14.6 | 96.0 | 2.50 | 2.52 | 0.30 | 1.98 | 5.25 | 1.02 | 3.58 | 1290.0 | 0 |
| 7 | 14.06 | 2.15 | 2.61 | 17.6 | 121.0 | 2.60 | 2.51 | 0.31 | 1.25 | 5.05 | 1.06 | 3.58 | 1295.0 | 0 |
| 8 | 14.83 | 1.64 | 2.17 | 14.0 | 97.0 | 2.80 | 2.98 | 0.29 | 1.98 | 5.20 | 1.08 | 2.85 | 1045.0 | 0 |
| 9 | 13.86 | 1.35 | 2.27 | 16.0 | 98.0 | 2.98 | 3.15 | 0.22 | 1.85 | 7.22 | 1.01 | 3.55 | 1045.0 | 0 |
from patsy import dmatrices
y, X = dmatrices('target ~ alcohol + flavanoids + proline', data=data)
X
DesignMatrix with shape (178, 4)
Intercept alcohol flavanoids proline
1 14.23 3.06 1065
1 13.20 2.76 1050
1 13.16 3.24 1185
1 14.37 3.49 1480
1 13.24 2.69 735
1 14.20 3.39 1450
1 14.39 2.52 1290
1 14.06 2.51 1295
1 14.83 2.98 1045
1 13.86 3.15 1045
1 14.10 3.32 1510
1 14.12 2.43 1280
1 13.75 2.76 1320
1 14.75 3.69 1150
1 14.38 3.64 1547
1 13.63 2.91 1310
1 14.30 3.14 1280
1 13.83 3.40 1130
1 14.19 3.93 1680
1 13.64 3.03 845
1 14.06 3.17 780
1 12.93 2.41 770
1 13.71 2.88 1035
1 12.85 2.37 1015
1 13.50 2.61 845
1 13.05 2.68 830
1 13.39 2.94 1195
1 13.30 2.19 1285
1 13.87 2.97 915
1 14.02 2.33 1035
[148 rows omitted]
Terms:
'Intercept' (column 0)
'alcohol' (column 1)
'flavanoids' (column 2)
'proline' (column 3)
(to view full data, use np.asarray(this_obj))
These features can be directly used with machine learning models:
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X, y)
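The “arbitrary Python code” in the section title refers to embedding Python expressions directly in the formula, such as np.log transforms or I() for inline arithmetic. A small sketch with a toy frame (the values are hypothetical stand-ins for the wine data):

```python
import numpy as np
import pandas as pd
from patsy import dmatrices

# Toy data standing in for the wine dataset (hypothetical values)
data = pd.DataFrame(
    {
        "target": [0, 0, 1, 1],
        "alcohol": [14.2, 13.2, 12.4, 12.8],
        "proline": [1065.0, 1050.0, 520.0, 680.0],
    }
)

# Formulas can call Python functions and wrap arithmetic in I()
y, X = dmatrices("target ~ np.log(proline) + I(alcohol ** 2)", data=data)
print(X.design_info.column_names)
```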
6.1.5. yarl: Create and Extract Elements from a URL Using Python#
!pip install yarl
yarl makes URL parsing and construction easy. You can extract elements such as the host, path, and query from a URL, or build new URLs.
from yarl import URL
url = URL('https://github.com/search?q=data+science')
url
URL('https://github.com/search?q=data+science')
print(url.host)
github.com
print(url.path)
/search
print(url.query_string)
q=data science
You can also build new URLs:
# Create a URL
url = URL.build(
    scheme="https",
    host="github.com",
    path="/search",
    query={"p": 2, "q": "data science"},
)
print(url)
https://github.com/search?p=2&q=data+science
# Replace the query
print(url.with_query({"q": "python"}))
https://github.com/search?q=python
# Replace the path
new_path = url.with_path("khuyentran1401/Data-science")
print(new_path)
https://github.com/khuyentran1401/Data-science
# Update the fragment
print(new_path.with_fragment("contents"))
https://github.com/khuyentran1401/Data-science#contents
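Beyond the raw query string, yarl exposes the query parameters as a multidict via url.query, so individual values can be read without manual parsing. A short sketch:

```python
from yarl import URL

url = URL("https://github.com/search?q=data+science&p=2")

# url.query is a multidict of decoded parameters
print(url.query["q"])      # value of q, with + decoded to a space
print(url.query.get("p"))  # values come back as strings
```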
6.1.6. Pigeon: Quickly Annotate Your Data on Jupyter Notebook#
!pip install pigeon-jupyter
For fast data annotation within Jupyter Notebooks, use Pigeon. This tool lets you label data interactively by selecting from predefined options.
from pigeon import annotate
annotations = annotate(
    ["The service is terrible", "I will definitely come here again"],
    options=["positive", "negative"],
)
annotations
[('The service is terrible', 'negative'),
('I will definitely come here again', 'positive')]
After labeling all your data, you can retrieve the examples along with their labels by calling annotations.
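Since annotations is a plain list of (example, label) tuples, it converts straight into a pandas DataFrame for downstream use. A small sketch using the labels shown above:

```python
import pandas as pd

# The list of (example, label) tuples returned by pigeon's annotate
annotations = [
    ("The service is terrible", "negative"),
    ("I will definitely come here again", "positive"),
]

# One row per labeled example
df = pd.DataFrame(annotations, columns=["text", "label"])
print(df)
```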
6.1.7. probablepeople: Parse Unstructured Names Into Structured Components#
!pip install probablepeople
probablepeople helps you parse unstructured names into structured components such as first names, surnames, and company names.
import probablepeople as pp
pp.parse("Mr. Owen Harris II")
[('Mr.', 'PrefixMarital'),
('Owen', 'GivenName'),
('Harris', 'Surname'),
('II', 'SuffixGenerational')]
pp.parse("Kate & John Cumings")
[('Kate', 'GivenName'),
('&', 'And'),
('John', 'GivenName'),
('Cumings', 'Surname')]
pp.parse("Prefect Technologies, Inc")
[('Prefect', 'CorporationName'),
('Technologies,', 'CorporationName'),
('Inc', 'CorporationLegalType')]
6.1.8. Supercharge PDF Text Extraction in Python with pypdf#
!pip install -U pypdf
PDF text is laid out for attractive on-screen display rather than for structured data extraction, which makes extracting text from PDFs challenging.
Beyond simple text extraction, pypdf understands fonts, encodings, and typical character spacing, which improves the accuracy of the extracted text.
from pypdf import PdfReader
reader = PdfReader("example.pdf")
page = reader.pages[0]
text = page.extract_text()