6.1. Feature Extraction#

6.1.1. distfit: Find The Best Theoretical Distribution For Your Data#

!pip install distfit

If you’re looking to identify the best theoretical distribution for your data in Python, try distfit. It fits multiple candidate distributions, compares them by goodness of fit (RSS), and reports the closest match for your dataset.

import numpy as np
from distfit import distfit

X = np.random.normal(0, 3, 1000)

# Initialize model
dist = distfit()

# Find best theoretical distribution for empirical data X
distribution = dist.fit_transform(X)
dist.plot()
[distfit] >fit..
[distfit] >transform..
[distfit] >[norm      ] [0.00 sec] [RSS: 0.0037316] [loc=-0.018 scale=2.999]
[distfit] >[expon     ] [0.00 sec] [RSS: 0.1588997] [loc=-14.019 scale=14.001]
[distfit] >[dweibull  ] [0.00 sec] [RSS: 0.0079433] [loc=-0.012 scale=2.529]
[distfit] >[t         ] [0.02 sec] [RSS: 0.0036884] [loc=-0.012 scale=2.873]
[distfit] >[genextreme] [0.07 sec] [RSS: 0.0049831] [loc=-1.132 scale=3.037]
[distfit] >[gamma     ] [0.04 sec] [RSS: 0.0038504] [loc=-101.098 scale=0.089]
[distfit] >[lognorm   ] [0.09 sec] [RSS: 0.0037897] [loc=-237.099 scale=237.056]
[distfit] >[uniform   ] [0.00 sec] [RSS: 0.1145382] [loc=-14.019 scale=24.469]
[distfit] >[loggamma  ] [0.04 sec] [RSS: 0.0036960] [loc=-239.858 scale=44.472]
[distfit] >Compute confidence interval [parametric]
[distfit] >plot..
[Plot: histogram of X overlaid with the fitted PDF and confidence interval; best fit: t (df=24.44, loc=-0.01, scale=2.87); x-axis "Values", y-axis "Frequency"]

Beyond finding the optimal distribution, distfit can also help identify outliers based on deviation from the fitted distribution.
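
For example, dist.predict scores new samples against the fitted distribution and flags values that fall outside the confidence interval (a minimal sketch; the sample values are illustrative):

# Values outside the confidence bounds are labeled 'down' or 'up'
results = dist.predict([-10, 0, 10])
print(results["y_pred"])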

Link to distfit.

6.1.2. Geopy: Extract Location Based on Python String#

!pip install geopy

Geopy simplifies the process of extracting geospatial information from location strings. With just a few lines of code, you can obtain the coordinates of addresses globally.

from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="find_location")
location = geolocator.geocode("30 North Circle Drive")

To get detailed information about the location:

location.address
'30, North Circle Drive, East Longmeadow, Hampden County, Massachusetts, 01028, United States'

You can also extract latitude and longitude:

location.latitude, location.longitude
(35.8796631, -79.0770546)
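
Nominatim can also go the other way, turning coordinates back into an address via reverse geocoding (a quick sketch; the coordinates are illustrative):

# Reverse geocode a (latitude, longitude) pair
location = geolocator.reverse((48.8584, 2.2945))
print(location.address)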

Link to Geopy

6.1.3. fastai’s cont_cat_split: Separate Continuous and Categorical Variables#

!pip install fastai

Fastai’s cont_cat_split method automatically separates the continuous and categorical columns of a DataFrame based on their cardinality: columns with at most max_card unique values (20 by default) are treated as categorical.

import pandas as pd
from fastai.tabular.core import cont_cat_split

df = pd.DataFrame(
    {
        "col1": [1, 2, 3, 4, 5],
        "col2": ["a", "b", "c", "d", "e"],
        "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
)

cont_names, cat_names = cont_cat_split(df)
print("Continuous columns:", cont_names)
print("Categorical columns:", cat_names)
Continuous columns: ['col3']
Categorical columns: ['col1', 'col2']
cont_names, cat_names = cont_cat_split(df, max_card=3)
print("Continuous columns:", cont_names)
print("Categorical columns:", cat_names)
Continuous columns: ['col1', 'col3']
Categorical columns: ['col2']
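
If your DataFrame contains the target column, pass it as dep_var so it is excluded from both lists (a small sketch treating col3 as the target):

# The dependent variable is left out of both lists
cont_names, cat_names = cont_cat_split(df, dep_var="col3")
print("Continuous columns:", cont_names)
print("Categorical columns:", cat_names)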

Link to the documentation.

6.1.4. Patsy: Build Features with Arbitrary Python Code#

!pip install patsy

Patsy lets you quickly create features for your model using an intuitive syntax, ideal for experimentation.

from sklearn.datasets import load_wine
import pandas as pd 
df = load_wine(as_frame=True)
data = pd.concat([df['data'], df['target']], axis=1)
data.head(10)
|   | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline | target |
|---|---------|------------|-----|-------------------|-----------|---------------|------------|----------------------|-----------------|-----------------|-----|------------------------------|---------|--------|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.80 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 | 0 |
| 1 | 13.20 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.40 | 1050.0 | 0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.80 | 3.24 | 0.30 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 | 0 |
| 3 | 14.37 | 1.95 | 2.50 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.80 | 0.86 | 3.45 | 1480.0 | 0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.80 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 | 0 |
| 5 | 14.20 | 1.76 | 2.45 | 15.2 | 112.0 | 3.27 | 3.39 | 0.34 | 1.97 | 6.75 | 1.05 | 2.85 | 1450.0 | 0 |
| 6 | 14.39 | 1.87 | 2.45 | 14.6 | 96.0 | 2.50 | 2.52 | 0.30 | 1.98 | 5.25 | 1.02 | 3.58 | 1290.0 | 0 |
| 7 | 14.06 | 2.15 | 2.61 | 17.6 | 121.0 | 2.60 | 2.51 | 0.31 | 1.25 | 5.05 | 1.06 | 3.58 | 1295.0 | 0 |
| 8 | 14.83 | 1.64 | 2.17 | 14.0 | 97.0 | 2.80 | 2.98 | 0.29 | 1.98 | 5.20 | 1.08 | 2.85 | 1045.0 | 0 |
| 9 | 13.86 | 1.35 | 2.27 | 16.0 | 98.0 | 2.98 | 3.15 | 0.22 | 1.85 | 7.22 | 1.01 | 3.55 | 1045.0 | 0 |
from patsy import dmatrices

y, X = dmatrices('target ~ alcohol + flavanoids + proline', data=data)
X
DesignMatrix with shape (178, 4)
  Intercept  alcohol  flavanoids  proline
          1    14.23        3.06     1065
          1    13.20        2.76     1050
          1    13.16        3.24     1185
          1    14.37        3.49     1480
          1    13.24        2.69      735
          1    14.20        3.39     1450
          1    14.39        2.52     1290
          1    14.06        2.51     1295
          1    14.83        2.98     1045
          1    13.86        3.15     1045
          1    14.10        3.32     1510
          1    14.12        2.43     1280
          1    13.75        2.76     1320
          1    14.75        3.69     1150
          1    14.38        3.64     1547
          1    13.63        2.91     1310
          1    14.30        3.14     1280
          1    13.83        3.40     1130
          1    14.19        3.93     1680
          1    13.64        3.03      845
          1    14.06        3.17      780
          1    12.93        2.41      770
          1    13.71        2.88     1035
          1    12.85        2.37     1015
          1    13.50        2.61      845
          1    13.05        2.68      830
          1    13.39        2.94     1195
          1    13.30        2.19     1285
          1    13.87        2.97      915
          1    14.02        2.33     1035
  [148 rows omitted]
  Terms:
    'Intercept' (column 0)
    'alcohol' (column 1)
    'flavanoids' (column 2)
    'proline' (column 3)
  (to view full data, use np.asarray(this_obj))
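
True to the section title, the formula can embed arbitrary Python: any function visible in the calling scope can transform a column, and Patsy ships built-ins such as standardize and center. A sketch, assuming numpy is imported as np:

import numpy as np

# np.log is ordinary Python code, standardize() is a Patsy built-in,
# and ':' creates an interaction term
y, X = dmatrices(
    "target ~ np.log(proline) + standardize(alcohol) + alcohol:flavanoids",
    data=data,
)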

These features can be directly used with machine learning models:

from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(X, y)

Link to Patsy.

6.1.5. yarl: Create and Extract Elements from a URL Using Python#

!pip install yarl

yarl makes URL parsing and creation easy. You can extract elements like host, path, and query from a URL or construct new URLs.

from yarl import URL 

url = URL('https://github.com/search?q=data+science')
url 
URL('https://github.com/search?q=data+science')
print(url.host) 
github.com
print(url.path) 
/search
print(url.query_string) 
q=data science
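
Individual query parameters are also accessible through url.query, which behaves like a read-only multidict (a small sketch):

print(url.query["q"])
data science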

You can also build new URLs:

# Create a URL

url = URL.build(
    scheme="https",
    host="github.com",
    path="/search",
    query={"p": 2, "q": "data science"},
)

print(url)
https://github.com/search?p=2&q=data+science
# Replace the query

print(url.with_query({"q": "python"}))
https://github.com/search?q=python
# Replace the path

new_path = url.with_path("khuyentran1401/Data-science")
print(new_path)
https://github.com/khuyentran1401/Data-science
# Update the fragment

print(new_path.with_fragment("contents"))
https://github.com/khuyentran1401/Data-science#contents

Link to yarl.

6.1.6. Pigeon: Quickly Annotate Your Data on Jupyter Notebook#

!pip install pigeon-jupyter

For fast data annotation within Jupyter Notebooks, use Pigeon. This tool allows you to label data interactively by selecting from predefined options.

from pigeon import annotate


annotations = annotate(
    ["The service is terrible", "I will definitely come here again"],
    options=["positive", "negative"],
)
annotations
[('The service is terrible', 'negative'),
 ('I will definitely come here again', 'positive')]

After labeling all your data, you can retrieve the examples along with their labels from the annotations variable.
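
Pigeon can also render richer examples: passing a display_fn lets you annotate images instead of plain strings (a sketch with hypothetical file names):

from pigeon import annotate
from IPython.display import display, Image

# display_fn controls how each example is shown in the notebook
annotations = annotate(
    ["cat1.png", "dog1.png"],
    options=["cat", "dog"],
    display_fn=lambda filename: display(Image(filename)),
)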

Link to Pigeon

6.1.7. probablepeople: Parse Unstructured Names Into Structured Components#

!pip install probablepeople  

probablepeople helps you parse unstructured names into structured components like first names, surnames, and company names.

import probablepeople as pp

pp.parse("Mr. Owen Harris II")
[('Mr.', 'PrefixMarital'),
 ('Owen', 'GivenName'),
 ('Harris', 'Surname'),
 ('II', 'SuffixGenerational')]
pp.parse("Kate & John Cumings")
[('Kate', 'GivenName'),
 ('&', 'And'),
 ('John', 'GivenName'),
 ('Cumings', 'Surname')]
pp.parse("Prefect Technologies, Inc")
[('Prefect', 'CorporationName'),
 ('Technologies,', 'CorporationName'),
 ('Inc', 'CorporationLegalType')]
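
If you prefer a single consolidated result per string, the companion pp.tag function merges the parsed tokens into an ordered dict and also reports whether the string looks like a person or a corporation (a short sketch):

# Returns (OrderedDict of components, detected type)
pp.tag("Mr. Owen Harris II")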

Link to probablepeople.

6.1.8. Supercharge PDF Text Extraction in Python with pypdf#

!pip install -U pypdf

The PDF format is designed for beautiful on-screen display, not for structured data extraction, which makes extracting text from PDFs challenging.

Beyond naive text extraction, pypdf is aware of fonts, encodings, and typical character spacing, which improves the accuracy of the extracted text.

from pypdf import PdfReader

reader = PdfReader("example.pdf")
page = reader.pages[0]
text = page.extract_text()
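
Newer pypdf releases can also approximate the page's visual layout during extraction, which helps with tables and multi-column pages (a sketch assuming a recent pypdf version):

# 'layout' mode preserves horizontal positioning with whitespace
text = page.extract_text(extraction_mode="layout")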

Link to pypdf.