4.7. Manipulate a DataFrame Using Data Types

4.7.1. select_dtypes: Return a Subset of a DataFrame Including/Excluding Columns Based on Their dtype

You might want to apply different kinds of processing to categorical and numerical features. Instead of manually choosing categorical features or numerical features, you can automatically get them by using df.select_dtypes('data_type').

You can either keep certain data types with the include parameter or drop them with the exclude parameter. The example below uses include to keep only the numerical columns.

import pandas as pd 
df = pd.DataFrame({"col1": ["a", "b", "c"], "col2": [1, 2, 3], "col3": [0.1, 0.2, 0.3]})

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   col1    3 non-null      object 
 1   col2    3 non-null      int64  
 2   col3    3 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 200.0+ bytes
df.select_dtypes(include=["int64", "float64"])
   col2  col3
0     1   0.1
1     2   0.2
2     3   0.3
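
The exclude parameter works the same way in the opposite direction. The short sketch below reuses the DataFrame from the example above and assumes you want to split it into numerical and non-numerical columns; the variable names non_numeric and numeric are only illustrative. The "number" selector matches every numeric dtype, so you do not have to list int64 and float64 separately.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["a", "b", "c"], "col2": [1, 2, 3], "col3": [0.1, 0.2, 0.3]})

# Drop every numeric column and keep the rest
non_numeric = df.select_dtypes(exclude="number")
print(non_numeric.columns.tolist())  # ['col1']

# Keep only the numeric columns, whatever their exact width
numeric = df.select_dtypes(include="number")
print(numeric.columns.tolist())  # ['col2', 'col3']
```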

4.7.2. Reduce pandas.DataFrame’s Memory

If you want to reduce the memory usage of your pandas DataFrame, start by changing the data type of a column. If a categorical variable has low cardinality, change its data type to category, as shown below.

from sklearn.datasets import load_iris
import pandas as pd 

X, y = load_iris(as_frame=True, return_X_y=True)
df = pd.concat([X, pd.DataFrame(y, columns=["target"])], axis=1)
df.memory_usage()
Index                 128
sepal length (cm)    1200
sepal width (cm)     1200
petal length (cm)    1200
petal width (cm)     1200
target               1200
dtype: int64
df["target"] = df["target"].astype("category")
df.memory_usage()
Index                 128
sepal length (cm)    1200
sepal width (cm)     1200
petal length (cm)    1200
petal width (cm)     1200
target                282
dtype: int64

The memory used by the target column is now less than a quarter of what it was (282 bytes instead of 1200)!
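
The same idea applies to string columns with few distinct values. The sketch below uses made-up data (a Series of three repeated color names, not part of the iris example) and compares memory before and after the conversion. It passes deep=True so that pandas also counts the memory of the string values themselves; the exact byte counts depend on your pandas version, so none are shown here.

```python
import pandas as pd

# Hypothetical low-cardinality column: 3,000 rows, only three distinct strings
colors = pd.Series(["red", "green", "blue"] * 1000)

# deep=True includes the memory taken by the string values themselves
as_strings = colors.memory_usage(deep=True)
as_category = colors.astype("category").memory_usage(deep=True)

print(f"strings:  {as_strings} bytes")
print(f"category: {as_category} bytes")
```

The category version stores each distinct string once and keeps a small integer code per row, which is why it shrinks as long as the number of distinct values stays low.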

4.7.3. pandas.Categorical: Turn a List of Strings into a Categorical Variable

If you want to create a categorical variable, use pandas.Categorical. A categorical variable takes on a limited number of possible values, and those values can be ordered. In the code below, I use pd.Categorical to turn a list of strings into an ordered categorical variable.

import pandas as pd 

size = pd.Categorical(['M', 'S', 'M', 'L'], ordered=True, categories=['S', 'M', 'L'])
size
['M', 'S', 'M', 'L']
Categories (3, object): ['S' < 'M' < 'L']

Note that the parameters categories = ['S', 'M', 'L'] and ordered=True tell pandas that 'S' < 'M' < 'L'. This means we can get the smallest value in the list:

size.min()
'S'

Or sort the DataFrame by the column that contains categorical variables:

df = pd.DataFrame({'size': size, 'val': [5, 4, 3, 6]})

df.sort_values(by='size')
  size  val
1    S    4
0    M    5
2    M    3
3    L    6
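
Because the categories are ordered, comparisons against a category value also work. The sketch below rebuilds the same DataFrame and, as an illustration, filters the rows whose size is at least 'M' and sorts from largest to smallest; the names at_least_m and largest_first are only illustrative.

```python
import pandas as pd

size = pd.Categorical(['M', 'S', 'M', 'L'], ordered=True, categories=['S', 'M', 'L'])
df = pd.DataFrame({'size': size, 'val': [5, 4, 3, 6]})

# Comparison follows the category order 'S' < 'M' < 'L', not alphabetical order
at_least_m = df[df['size'] >= 'M']
print(at_least_m)

# Sort from the largest size to the smallest
largest_first = df.sort_values(by='size', ascending=False)
print(largest_first)
```

Note that the value on the right-hand side of the comparison must be one of the declared categories.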