Filter Rows or Columns

4.7. Filter Rows or Columns

4.7.1. pandas.Series.isin: Filter Rows Only If Column Contains Values From Another List

When working with a pandas DataFrame, if you want to keep only the rows whose column values appear in a given list, a concise way is to use isin.

In the example below, the row at index 2 is filtered out because its value in column a, 3, is not in the list.

import pandas as pd 

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df

	a	b
0	1	4
1	2	5
2	3	6

l = [1, 2, 6, 7]
df.a.isin(l)

0     True
1     True
2    False
Name: a, dtype: bool

df = df[df.a.isin(l)]
df

	a	b
0	1	4
1	2	5
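The same mask can be inverted with ~ to keep the rows whose values are not in the list. Here is a minimal sketch that reuses the example data; the names mask and excluded are illustrative and not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
l = [1, 2, 6, 7]

# Boolean mask: True where column a holds a value from the list
mask = df.a.isin(l)

# ~ negates the mask, keeping only rows whose value is NOT in the list
excluded = df[~mask]
print(excluded)
```

Storing the mask in a variable makes it easy to reuse for both the matching and the non-matching rows.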

4.7.2. df.query: Query Columns Using Boolean Expression

Filtering the rows of a pandas DataFrame with brackets can get lengthy.

import pandas as pd

df = pd.DataFrame(
    {"fruit": ["apple", "orange", "grape", "grape"], "price": [4, 5, 6, 7]}
)

print(df[(df.price > 4) & (df.fruit == "grape")])

   fruit  price
2  grape      6
3  grape      7

To shorten the filtering statements, use df.query instead.

df.query("price > 4 & fruit == 'grape'")

	fruit	price
2	grape	6
3	grape	7
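Inside a query string, a Python variable from the surrounding scope can be referenced by prefixing it with @. The sketch below assumes two hypothetical variables, min_price and wanted, that are not part of the original example.

```python
import pandas as pd

df = pd.DataFrame(
    {"fruit": ["apple", "orange", "grape", "grape"], "price": [4, 5, 6, 7]}
)

min_price = 4        # hypothetical threshold, for illustration only
wanted = ["grape"]   # hypothetical list of fruits to keep

# @name pulls the value of a local variable into the query expression
result = df.query("price > @min_price and fruit in @wanted")
print(result)
```

This keeps the thresholds out of the string, so the same expression can be reused with different values.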

4.7.3. transform: Filter a pandas DataFrame by Value Counts

To filter a pandas DataFrame based on the occurrences of categories, you might attempt to use df.groupby and df.count.

import pandas as pd

df = pd.DataFrame({"type": ["A", "A", "O", "B", "O", "A"], "value": [5, 3, 2, 1, 4, 2]})
df

	type	value
0	A	5
1	A	3
2	O	2
3	B	1
4	O	4
5	A	2

df.groupby("type")["type"].count()

type
A    3
B    1
O    2
Name: type, dtype: int64

However, the Series returned by count is indexed by the group labels (A, B, O) rather than by the row labels of the original DataFrame, so using it as a boolean mask raises an error.

df.loc[df.groupby("type")["type"].count() > 1]

---------------------------------------------------------------------------
IndexingError                             Traceback (most recent call last)
/tmp/ipykernel_791962/4076731999.py in <module>
----> 1 df.loc[df.groupby("type")["type"].count() > 1]

~/book/venv/lib/python3.8/site-packages/pandas/core/indexing.py in __getitem__(self, key)
    929 
    930             maybe_callable = com.apply_if_callable(key, self.obj)
--> 931             return self._getitem_axis(maybe_callable, axis=axis)
    932 
    933     def _is_scalar_access(self, key: tuple):

~/book/venv/lib/python3.8/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   1142             return self._get_slice_axis(key, axis=axis)
   1143         elif com.is_bool_indexer(key):
-> 1144             return self._getbool_axis(key, axis=axis)
   1145         elif is_list_like_indexer(key):
   1146 

~/book/venv/lib/python3.8/site-packages/pandas/core/indexing.py in _getbool_axis(self, key, axis)
    946         # caller is responsible for ensuring non-None axis
    947         labels = self.obj._get_axis(axis)
--> 948         key = check_bool_indexer(labels, key)
    949         inds = key.nonzero()[0]
    950         return self.obj._take_with_is_copy(inds, axis=axis)

~/book/venv/lib/python3.8/site-packages/pandas/core/indexing.py in check_bool_indexer(index, key)
   2386         mask = isna(result._values)
   2387         if mask.any():
-> 2388             raise IndexingError(
   2389                 "Unalignable boolean Series provided as "
   2390                 "indexer (index of the boolean Series and of "

IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).

Instead of using count, use transform("size"). It returns a Series of group sizes aligned with the index of the original DataFrame, with one value per row.

df.groupby("type")["type"].transform("size")

0    3
1    3
2    2
3    1
4    2
5    3
Name: type, dtype: int64

Now you can filter without encountering any error.

df.loc[df.groupby("type")["type"].transform("size") > 1]

	type	value
0	A	5
1	A	3
2	O	2
4	O	4
5	A	2
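As an alternative sketch, the same rows can be selected with GroupBy.filter or by mapping each row's category to its count from value_counts. The names filtered, counts, and mapped are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"type": ["A", "A", "O", "B", "O", "A"], "value": [5, 3, 2, 1, 4, 2]})

# Keep every row that belongs to a group with more than one member
filtered = df.groupby("type").filter(lambda group: len(group) > 1)

# Same idea: look up each row's category in the per-category counts
counts = df["type"].value_counts()
mapped = df[df["type"].map(counts) > 1]

print(filtered)
```

The lambda passed to GroupBy.filter runs once per group, so on data with many groups the transform or map version may be the faster choice.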

4.7.4. df.filter: Filter Columns Based on a Subset of Their Names

If you want to select columns of a pandas DataFrame based on a substring of their names, use DataFrame.filter. In the example below, we keep only the columns whose names contain “cat”.

import pandas as pd

df = pd.DataFrame({"cat1": ["a", "b"], "cat2": ["b", "c"], "num1": [1, 2]})
df 

	cat1	cat2	num1
0	a	b	1
1	b	c	2

df.filter(like='cat', axis=1)

	cat1	cat2
0	a	b
1	b	c
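DataFrame.filter also accepts regex and items in place of like. The short sketch below uses the same data; the patterns are chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"cat1": ["a", "b"], "cat2": ["b", "c"], "num1": [1, 2]})

# regex is matched with re.search, so anchor the pattern when needed
ends_with_1 = df.filter(regex=r"1$", axis=1)

# items selects columns by exact name
exact = df.filter(items=["cat2", "num1"], axis=1)

print(list(ends_with_1.columns))
print(list(exact.columns))
```

Only one of items, like, and regex can be passed in a single call.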

4.7.5. Filter a pandas DataFrame Based on Index Labels

If you want to filter a pandas DataFrame based on its index labels, you can use either filter or loc.

import pandas as pd
import numpy as np

values = np.array([[1, 2], [3, 4], [5, 6]])
df = pd.DataFrame(
    values, 
    index=["user1", "user2", "user3"], 
    columns=["col1", "col2"]
)
df

	col1	col2
user1	1	2
user2	3	4
user3	5	6

df.filter(items=['user1', 'user3'], axis=0)

	col1	col2
user1	1	2
user3	5	6

df.loc[['user1', 'user3'], :]

	col1	col2
user1	1	2
user3	5	6
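The two approaches differ when a requested label is missing from the index: filter skips it, while loc raises a KeyError. The sketch below assumes a hypothetical label "user9" that does not exist in the index.

```python
import pandas as pd
import numpy as np

values = np.array([[1, 2], [3, 4], [5, 6]])
df = pd.DataFrame(values, index=["user1", "user2", "user3"], columns=["col1", "col2"])

labels = ["user1", "user9"]  # "user9" is a hypothetical label absent from the index

# filter silently ignores labels that are not in the index
kept = df.filter(items=labels, axis=0)

# loc raises KeyError when any requested label is missing
try:
    df.loc[labels, :]
    loc_raised = False
except KeyError:
    loc_raised = True

# A boolean mask from index.isin also tolerates missing labels
masked = df[df.index.isin(labels)]

print(kept)
print(loc_raised)
```

Use loc when a missing label should be treated as an error, and filter or index.isin when it should simply be skipped.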

4.7.6. all: Select Rows with All NaN Values

DataFrame.all is useful when you want to check whether all values in a row or a column are True. To get the rows in which every value is NaN, combine isna with all(axis=1).

import pandas as pd 

df = pd.DataFrame({'a': [1, 2, float('nan')], 'b': [1, float('nan'), float('nan')]})
is_all_nan = df.isna().all(axis=1)
is_all_nan 

0    False
1    False
2     True
dtype: bool

df.loc[is_all_nan, :]

    a   b
2 NaN NaN
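Two closely related operations are dropping the all-NaN rows and selecting rows with at least one NaN. A minimal sketch on the same data follows; the names without_empty and has_any_nan are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, float('nan')], 'b': [1, float('nan'), float('nan')]})

# Drop only the rows in which every value is NaN
without_empty = df.dropna(how="all")

# Select rows that contain at least one NaN
has_any_nan = df[df.isna().any(axis=1)]

print(without_empty)
print(has_any_nan)
```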

4.7.7. DataFrame.clip: Cap Outliers

Outliers are unusual values in your dataset, and they can distort statistical analyses.

import pandas as pd 

data = {"col0": [9, -3, 0, -1, 5]}
df = pd.DataFrame(data)
df

	col0
0	9
1	-3
2	0
3	-1
4	5

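If you want to cap outlying values at a threshold, one option is df.clip. Values below the lower bound are replaced by the lower bound, and values above the upper bound are replaced by the upper bound.

Below is how to use the 0.05-quantile as the lower threshold and the 0.95-quantile as the upper threshold: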

lower = df.col0.quantile(0.05)
upper = df.col0.quantile(0.95)

df.clip(lower=lower, upper=upper)

	col0
0	8.2
1	-2.6
2	0.0
3	-1.0
4	5.0
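Note that clip keeps every row and only replaces the extreme values. To drop the outlying rows instead, one possible sketch is to filter with Series.between on the same thresholds; the name inside is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"col0": [9, -3, 0, -1, 5]})

lower = df.col0.quantile(0.05)
upper = df.col0.quantile(0.95)

# between is inclusive on both ends by default
inside = df[df.col0.between(lower, upper)]
print(inside)
```

Here the rows holding 9 and -3 are removed, because they fall outside the two quantile thresholds.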