4.8. Manipulate a DataFrame Using Data Types#

4.8.1. select_dtypes: Return a Subset of a DataFrame Including/Excluding Columns Based on Their dtype#

You might want to apply different kinds of processing to categorical and numerical features. Instead of manually picking out categorical or numerical columns, you can select them automatically with df.select_dtypes().

In the example below, the include parameter selects columns with the listed data types; the exclude parameter works the same way but drops them instead.

import pandas as pd 
df = pd.DataFrame({"col1": ["a", "b", "c"], "col2": [1, 2, 3], "col3": [0.1, 0.2, 0.3]})

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   col1    3 non-null      object 
 1   col2    3 non-null      int64  
 2   col3    3 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 200.0+ bytes
df.select_dtypes(include=["int64", "float64"])
   col2  col3
0     1   0.1
1     2   0.2
2     3   0.3
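
To do the opposite selection, pass the data types you want to drop to exclude. A quick sketch, which should leave only the object column col1 in this example:

# Drop the numeric columns and keep everything else
df.select_dtypes(exclude=["int64", "float64"])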

4.8.2. Reduce pandas.DataFrame’s Memory#

If you want to reduce the memory footprint of your pandas DataFrame, start by changing the data type of a column. If a categorical variable has low cardinality, change its data type to category as shown below.

from sklearn.datasets import load_iris
import pandas as pd 

X, y = load_iris(as_frame=True, return_X_y=True)
df = pd.concat([X, pd.DataFrame(y, columns=["target"])], axis=1)
df.memory_usage()
Index                 128
sepal length (cm)    1200
sepal width (cm)     1200
petal length (cm)    1200
petal width (cm)     1200
target               1200
dtype: int64
df["target"] = df["target"].astype("category")
df.memory_usage()
Index                 128
sepal length (cm)    1200
sepal width (cm)     1200
petal length (cm)    1200
petal width (cm)     1200
target                282
dtype: int64

The memory usage of the target column is now less than a quarter of what it was!
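
If a DataFrame has several text columns, the same idea can be applied to every object column whose cardinality is low. A minimal sketch (the helper name and the 0.5 uniqueness threshold are arbitrary choices, not pandas defaults):

def to_category(df, max_unique_ratio=0.5):
    """Convert low-cardinality object columns to the category dtype."""
    out = df.copy()
    for col in out.select_dtypes(include="object").columns:
        if out[col].nunique() / len(out) < max_unique_ratio:
            out[col] = out[col].astype("category")
    return out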

4.8.3. pandas.Categorical: Turn a List of Strings into a Categorical Variable#

If you want to create a categorical variable, use pandas.Categorical. A categorical variable takes on a limited number of possible values and can be ordered. In the code below, I use pd.Categorical to create a categorical array with an explicit ordering.

import pandas as pd 

size = pd.Categorical(['M', 'S', 'M', 'L'], ordered=True, categories=['S', 'M', 'L'])
size
['M', 'S', 'M', 'L']
Categories (3, object): ['S' < 'M' < 'L']

Note that the parameters categories=['S', 'M', 'L'] and ordered=True tell pandas that 'S' < 'M' < 'L'. This means we can get the smallest value in the array:

size.min()
'S'

Or sort the DataFrame by the column that contains categorical variables:

df = pd.DataFrame({'size': size, 'val': [5, 4, 3, 6]})

df.sort_values(by='size')
  size  val
1    S    4
0    M    5
2    M    3
3    L    6
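
Because the categories are ordered, comparison operators also work on the column, which makes it easy to filter by size. A quick sketch:

# Keep only rows whose size is larger than 'S'
df[df["size"] > "S"]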

4.8.4. Optimizing Memory Usage in a pandas DataFrame with infer_objects#

When a pandas DataFrame column contains mixed data types, it is stored with a more general dtype (such as object), which results in inefficient memory usage and slower computation times.

df.infer_objects() can infer the true data types of columns in a DataFrame, which can help optimize memory usage in your code.

In the following code, the column "col1" still has an object data type even though it only contains integer values after the first row is removed.

By using the df.infer_objects() method, "col1" is converted to an int64 data type, which saves approximately 27 MB of memory.

import pandas as pd
from random import randint 

random_numbers = [randint(0, 100) for _ in range(1000000)]
df = pd.DataFrame({"col1": ['a', *random_numbers]})

# Remove the first row
df = df.iloc[1:]

print(df.dtypes)
print(df.memory_usage(deep=True))
col1    object
dtype: object
Index         132
col1     35960884
dtype: int64
inferred_df = df.infer_objects()
print(inferred_df.dtypes)
print(inferred_df.memory_usage(deep=True))
col1    int64
dtype: object
Index        132
col1     8000000
dtype: int64
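
infer_objects is also handy after operations that silently upcast columns to object, such as transposing a DataFrame with mixed column types and transposing it back. A small sketch under that assumption:

# Transposing a mixed-type frame and transposing it back leaves object columns
mixed = pd.DataFrame({"num": [1, 2], "txt": ["a", "b"]})
round_trip = mixed.T.T
print(round_trip.dtypes)                  # both columns are object
print(round_trip.infer_objects().dtypes)  # num is inferred back to int64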

4.8.5. Say Goodbye to Data Type Conversion in pandas 2.0#

!pip install pandas==2.0.0

Previously in pandas, introducing a missing value into an integer Series converted its data type to float, which could result in a loss of precision for the original data.

import pandas as pd

s1 = pd.Series([0, 1, 2, 3])
print(f"Data type without None: {s1.dtypes}")

s1.iloc[0] = None
print(f"Data type with None: {s1.dtypes}")
Data type without None: int64
Data type with None: float64

With the integration of Apache Arrow in pandas 2.0, this issue is solved.

s2 = pd.Series([0, 1, 2, 3], dtype='int64[pyarrow]')
print(f"Data type without None: {s2.dtypes}")

s2.iloc[0] = None
print(f"Data type with None: {s2.dtypes}")
Data type without None: int64[pyarrow]
Data type with None: int64[pyarrow]
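
If upgrading to pandas 2.0 is not an option, the nullable extension dtype Int64 (note the capital I), available in earlier pandas versions, behaves similarly: missing values are stored as pd.NA and the integer dtype is preserved. A quick sketch:

s3 = pd.Series([0, 1, 2, 3], dtype="Int64")
s3.iloc[0] = None  # stored as pd.NA; the dtype stays Int64
print(f"Data type with None: {s3.dtypes}")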

4.8.6. Efficient String Data Handling in pandas 2.0 with PyArrow Arrays#

!pip install 'pandas==2.2' pyarrow

As of pandas 2.0, data in pandas can be stored in PyArrow arrays in addition to NumPy arrays. PyArrow arrays provide a wide range of data types compared to NumPy.

One significant advantage of PyArrow arrays is their string data type, which offers better speed and memory efficiency than storing strings with the object dtype.

import pandas as pd
import numpy as np

data_size = 1_000_000
np.random.seed(42)
data = np.random.choice(["John", "Alice", "Michael"], size=data_size)
s_numpy = pd.Series(data)
s_pyarrow = pd.Series(data, dtype="string[pyarrow]")
print(f"Datatype of Series with Numpy backend: {s_numpy.dtype}")
print(f"Datatype of Series with PyArrow backend: {s_pyarrow.dtype}")
Datatype of Series with Numpy backend: object
Datatype of Series with PyArrow backend: string
numpy_memory = s_numpy.memory_usage(deep=True)
pyarrow_memory = s_pyarrow.memory_usage(deep=True)

print(f"Memory usage for Numpy backend: {numpy_memory / (1024 ** 2):.2f} MB.")
print(f"Memory usage for PyArrow backend: {pyarrow_memory / (1024 ** 2):.2f} MB.")
print(f"PyArrow backend consumes approximately {numpy_memory / pyarrow_memory:.2f} times less memory than Numpy backend.")
Memory usage for Numpy backend: 59.45 MB.
Memory usage for PyArrow backend: 12.72 MB.
PyArrow backend consumes approximately 4.68 times less memory than Numpy backend.
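
To move an entire DataFrame to Arrow-backed dtypes in one step, pandas 2.0 also accepts dtype_backend="pyarrow" in convert_dtypes (and in readers such as read_csv). A minimal sketch reusing the data array from above:

df = pd.DataFrame({"name": data, "value": np.arange(data_size)})
df_arrow = df.convert_dtypes(dtype_backend="pyarrow")
print(df_arrow.dtypes)  # columns become Arrow-backed dtypes such as int64[pyarrow]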