4.7. Filter Rows or Columns#
4.7.1. Pandas.Series.isin: Filter Rows Only If Column Contains Values From Another List#
When working with a pandas Dataframe, if you want to select the values that are in another list, the fastest way is to use isin
.
In the example below, 2
is filtered out because 3
is not in the list.
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df
a | b | |
---|---|---|
0 | 1 | 4 |
1 | 2 | 5 |
2 | 3 | 6 |
l = [1, 2, 6, 7]
df.a.isin(l)
0 True
1 True
2 False
Name: a, dtype: bool
df = df[df.a.isin(l)]
df
a | b | |
---|---|---|
0 | 1 | 4 |
1 | 2 | 5 |
4.7.2. df.query: Query Columns Using Boolean Expression#
It can be lengthy to filter columns of a pandas DataFrame using brackets.
import pandas as pd
df = pd.DataFrame(
{"fruit": ["apple", "orange", "grape", "grape"], "price": [4, 5, 6, 7]}
)
print(df[(df.price > 4) & (df.fruit == "grape")])
fruit price
2 grape 6
3 grape 7
To shorten the filtering statements, use df.query
instead.
df.query("price > 4 & fruit == 'grape'")
fruit | price | |
---|---|---|
2 | grape | 6 |
3 | grape | 7 |
4.7.3. transform: Filter a pandas DataFrame by Value Counts#
To filter a pandas DataFrame based on the occurrences of categories, you might attempt to use df.groupby
and df.count
.
import pandas as pd
df = pd.DataFrame({"type": ["A", "A", "O", "B", "O", "A"], "value": [5, 3, 2, 1, 4, 2]})
df
type | value | |
---|---|---|
0 | A | 5 |
1 | A | 3 |
2 | O | 2 |
3 | B | 1 |
4 | O | 4 |
5 | A | 2 |
df.groupby("type")["type"].count()
type
A 3
B 1
O 2
Name: type, dtype: int64
However, since the Series returned by the count
method is shorter than the original DataFrame, you will get an error when filtering.
df.loc[df.groupby("type")["type"].count() > 1]
---------------------------------------------------------------------------
IndexingError Traceback (most recent call last)
/tmp/ipykernel_791962/4076731999.py in <module>
----> 1 df.loc[df.groupby("type")["type"].count() > 1]
~/book/venv/lib/python3.8/site-packages/pandas/core/indexing.py in __getitem__(self, key)
929
930 maybe_callable = com.apply_if_callable(key, self.obj)
--> 931 return self._getitem_axis(maybe_callable, axis=axis)
932
933 def _is_scalar_access(self, key: tuple):
~/book/venv/lib/python3.8/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
1142 return self._get_slice_axis(key, axis=axis)
1143 elif com.is_bool_indexer(key):
-> 1144 return self._getbool_axis(key, axis=axis)
1145 elif is_list_like_indexer(key):
1146
~/book/venv/lib/python3.8/site-packages/pandas/core/indexing.py in _getbool_axis(self, key, axis)
946 # caller is responsible for ensuring non-None axis
947 labels = self.obj._get_axis(axis)
--> 948 key = check_bool_indexer(labels, key)
949 inds = key.nonzero()[0]
950 return self.obj._take_with_is_copy(inds, axis=axis)
~/book/venv/lib/python3.8/site-packages/pandas/core/indexing.py in check_bool_indexer(index, key)
2386 mask = isna(result._values)
2387 if mask.any():
-> 2388 raise IndexingError(
2389 "Unalignable boolean Series provided as "
2390 "indexer (index of the boolean Series and of "
IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
Instead of using count
, use transform
. This method will return the Series of value counts with the same length as the original DataFrame.
df.groupby("type")["type"].transform("size")
0 3
1 3
2 2
3 1
4 2
5 3
Name: type, dtype: int64
Now you can filter without encountering any error.
df.loc[df.groupby("type")["type"].transform("size") > 1]
type | value | |
---|---|---|
0 | A | 5 |
1 | A | 3 |
2 | O | 2 |
4 | O | 4 |
5 | A | 2 |
4.7.4. df.filter: Filter Columns Based on a Subset of Their Names#
If you want to filter columns of a pandas DataFrame based on characters in their names, use DataFrame.filter
. In the example below, we only choose the columns that contain the word “cat”.
import pandas as pd
df = pd.DataFrame({"cat1": ["a", "b"], "cat2": ["b", "c"], "num1": [1, 2]})
df
cat1 | cat2 | num1 | |
---|---|---|---|
0 | a | b | 1 |
1 | b | c | 2 |
df.filter(like='cat', axis=1)
cat1 | cat2 | |
---|---|---|
0 | a | b |
1 | b | c |
4.7.5. Filter a pandas DataFrame Based on Index’s Name#
If you want to filter a pandas DataFrame based on the index’s name, you can use either filter
or loc
.
import pandas as pd
import numpy as np
values = np.array([[1, 2], [3, 4], [5, 6]])
df = pd.DataFrame(
values,
index=["user1", "user2", "user3"],
columns=["col1", "col2"]
)
df
col1 | col2 | |
---|---|---|
user1 | 1 | 2 |
user2 | 3 | 4 |
user3 | 5 | 6 |
df.filter(items=['user1', 'user3'], axis=0)
col1 | col2 | |
---|---|---|
user1 | 1 | 2 |
user3 | 5 | 6 |
df.loc[['user1', 'user3'], :]
col1 | col2 | |
---|---|---|
user1 | 1 | 2 |
user3 | 5 | 6 |
4.7.6. all: Select Rows with All NaN Values#
DataFrame.all
is useful when you want to evaluate whether all values of a row or a column are True
. If you want to get the rows whose all values are NaN, use both isna
and all(axis=1)
.
import pandas as pd
df = pd.DataFrame({'a': [1, 2, float('nan')], 'b': [1, float('nan'), float('nan')]})
is_all_nan = df.isna().all(axis=1)
is_all_nan
0 False
1 False
2 True
dtype: bool
df.loc[is_all_nan, :]
a b
2 NaN NaN
4.7.7. pandas.clip: Exclude Outliers#
Outliers are unusual values in your dataset, and they can distort statistical analyses.
import pandas as pd
data = {"col0": [9, -3, 0, -1, 5]}
df = pd.DataFrame(data)
df
col0 | |
---|---|
0 | 9 |
1 | -3 |
2 | 0 |
3 | -1 |
4 | 5 |
If you want to trim values that the outliers, one of the methods is to use df.clip
.
Below is how to use the 0.5-quantile as the lower threshold and .95-quantile as the upper threshold
lower = df.col0.quantile(0.05)
upper = df.col0.quantile(0.95)
df.clip(lower=lower, upper=upper)
col0 | |
---|---|
0 | 8.2 |
1 | -2.6 |
2 | 0.0 |
3 | -1.0 |
4 | 5.0 |