4.8. Manipulate a DataFrame Using Data Types#
4.8.1. select_dtypes: Return a Subset of a DataFrame Including/Excluding Columns Based on Their dtype#
You might want to apply different kinds of processing to categorical and numerical features. Instead of manually choosing categorical or numerical features, you can automatically select them with df.select_dtypes('data_type').

In the example below, the include parameter keeps only the numerical columns; the exclude parameter works the same way in reverse.
import pandas as pd
df = pd.DataFrame({"col1": ["a", "b", "c"], "col2": [1, 2, 3], "col3": [0.1, 0.2, 0.3]})
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   col1    3 non-null      object 
 1   col2    3 non-null      int64  
 2   col3    3 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 200.0+ bytes
df.select_dtypes(include=["int64", "float64"])
|   | col2 | col3 |
|---|------|------|
| 0 | 1    | 0.1  |
| 1 | 2    | 0.2  |
| 2 | 3    | 0.3  |
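Conversely, a minimal sketch of the exclude parameter, which drops the object (string) column from the same DataFrame and keeps the rest:

# Drop the object (string) column, keeping only the numerical ones
df.select_dtypes(exclude=["object"])

|   | col2 | col3 |
|---|------|------|
| 0 | 1    | 0.1  |
| 1 | 2    | 0.2  |
| 2 | 3    | 0.3  |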
4.8.2. Reduce pandas.DataFrame’s Memory#
If you want to reduce the memory usage of your pandas DataFrame, start by changing a column's data type. If a categorical variable has low cardinality, change its data type to category, as shown below.
from sklearn.datasets import load_iris
import pandas as pd
X, y = load_iris(as_frame=True, return_X_y=True)
df = pd.concat([X, pd.DataFrame(y, columns=["target"])], axis=1)
df.memory_usage()
Index 128
sepal length (cm) 1200
sepal width (cm) 1200
petal length (cm) 1200
petal width (cm) 1200
target 1200
dtype: int64
df["target"] = df["target"].astype("category")
df.memory_usage()
Index 128
sepal length (cm) 1200
sepal width (cm) 1200
petal length (cm) 1200
petal width (cm) 1200
target 282
dtype: int64
The memory usage of the target column is now reduced to less than a quarter of what it was!
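Downcasting numerical columns works the same way. Here is a minimal sketch, assuming float32 precision is acceptable for the iris measurements:

# Downcast the float64 measurement columns to float32, halving their memory
float_cols = df.select_dtypes(include=["float64"]).columns
df[float_cols] = df[float_cols].astype("float32")
df.memory_usage()

Each measurement column should now report 600 bytes instead of 1200, since float32 uses 4 bytes per value rather than 8.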
4.8.3. pandas.Categorical: Turn a List of Strings into a Categorical Variable#
If you want to create a categorical variable, use pandas.Categorical. This variable takes on a limited number of possible values and can be ordered. In the code below, I use pd.Categorical to create a list of ordered categories.
import pandas as pd
size = pd.Categorical(['M', 'S', 'M', 'L'], ordered=True, categories=['S', 'M', 'L'])
size
['M', 'S', 'M', 'L']
Categories (3, object): ['S' < 'M' < 'L']
Note that the parameters categories=['S', 'M', 'L'] and ordered=True tell pandas that 'S' < 'M' < 'L'. This means we can get the smallest value in the list:
size.min()
'S'
Or sort the DataFrame by the column that contains categorical variables:
df = pd.DataFrame({'size': size, 'val': [5, 4, 3, 6]})
df.sort_values(by='size')
|   | size | val |
|---|------|-----|
| 1 | S    | 4   |
| 0 | M    | 5   |
| 2 | M    | 3   |
| 3 | L    | 6   |
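Since the categories are ordered, comparison operators also work on the column; a quick sketch that filters out the smallest size:

# Ordered categories support comparisons, so rows can be filtered by size
df[df["size"] > "S"]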
4.8.4. Optimizing Memory Usage in a pandas DataFrame with infer_objects#
pandas DataFrames that contain columns of mixed data types are stored in a more general format (such as “object”), resulting in inefficient memory usage and slower computation times.
df.infer_objects() can infer the true data types of columns in a DataFrame, which helps optimize memory usage in your code.

In the following code, the column "col1" still has an "object" data type even though it contains only integer values after the first row is removed.

By using the df.infer_objects() method, "col1" is converted to an "int64" data type, which saves approximately 27 MB of memory.
import pandas as pd
from random import randint
random_numbers = [randint(0, 100) for _ in range(1000000)]
df = pd.DataFrame({"col1": ['a', *random_numbers]})
# Remove the first row
df = df.iloc[1:]
print(df.dtypes)
print(df.memory_usage(deep=True))
col1 object
dtype: object
Index 132
col1 35960884
dtype: int64
inferred_df = df.infer_objects()
print(inferred_df.dtypes)
print(inferred_df.memory_usage(deep=True))
col1 int64
dtype: object
Index 132
col1 8000000
dtype: int64
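Note that infer_objects only works when the underlying values already have the right Python type; it will not parse strings that merely look numeric. For that case, a sketch using pd.to_numeric instead:

# infer_objects leaves string values such as "1" untouched; pd.to_numeric parses them
s = pd.Series(["1", "2", "3"])
print(s.infer_objects().dtype)  # still object
print(pd.to_numeric(s).dtype)   # int64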
4.8.5. Say Goodbye to Data Type Conversion in pandas 2.0#
!pip install pandas==2.0.0
Previously in pandas, if an integer Series contained missing values, its data type was converted to float, resulting in a potential loss of precision for the original data.
import pandas as pd
s1 = pd.Series([0, 1, 2, 3])
print(f"Data type without None: {s1.dtypes}")
s1.iloc[0] = None
print(f"Data type with None: {s1.dtypes}")
Data type without None: int64
Data type with None: float64
With the integration of Apache Arrow in pandas 2.0, this issue is solved.
s2 = pd.Series([0, 1, 2, 3], dtype='int64[pyarrow]')
print(f"Data type without None: {s2.dtypes}")
s2.iloc[0] = None
print(f"Data type with None: {s2.dtypes}")
Data type without None: int64[pyarrow]
Data type with None: int64[pyarrow]
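A quick check (sketch) shows the missing entry is stored as a typed NA rather than a float NaN:

# The missing entry is pandas' typed NA, and the remaining values stay integers
print(s2.iloc[0])
print(s2.iloc[1])

<NA>
1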
4.8.6. Efficient String Data Handling in pandas 2.0 with PyArrow Arrays#
!pip install 'pandas==2.2' pyarrow
As of pandas 2.0, data in pandas can be stored in PyArrow arrays in addition to NumPy arrays. PyArrow arrays provide a wider range of data types than NumPy.

One significant advantage of PyArrow arrays is their string data type, which offers better speed and memory efficiency than storing strings with the object dtype.
import pandas as pd
import numpy as np
data_size = 1_000_000
np.random.seed(42)
data = np.random.choice(["John", "Alice", "Michael"], size=data_size)
s_numpy = pd.Series(data)
s_pyarrow = pd.Series(data, dtype="string[pyarrow]")
print(f"Datatype of Series with Numpy backend: {s_numpy.dtype}")
print(f"Datatype of Series with PyArrow backend: {s_pyarrow.dtype}")
Datatype of Series with Numpy backend: object
Datatype of Series with PyArrow backend: string
numpy_memory = s_numpy.memory_usage(deep=True)
pyarrow_memory = s_pyarrow.memory_usage(deep=True)
print(f"Memory usage for Numpy backend: {numpy_memory / (1024 ** 2):.2f} MB.")
print(f"Memory usage for PyArrow backend: {pyarrow_memory / (1024 ** 2):.2f} MB.")
print(f"PyArrow backend consumes approximately {numpy_memory / pyarrow_memory:.2f} times less memory than Numpy backend.")
Memory usage for Numpy backend: 59.45 MB.
Memory usage for PyArrow backend: 12.72 MB.
PyArrow backend consumes approximately 4.68 times less memory than Numpy backend.
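The familiar .str accessor works unchanged on the PyArrow-backed Series; a minimal usage sketch:

# String methods work the same regardless of the backend
s_pyarrow.str.upper().head(3)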