Combining categories using masks

When working with categorical data, it often helps to simplify the analysis by combining or dropping less significant categories, typically those that appear only rarely in the data.

Consolidating these rare categories into a single “Other” category streamlines exploratory data analysis (EDA).

Take a Series containing the number of votes recorded for each candidate:

Liliana    1067
John        998
William     494
Emilie      196
Pattie        6
Neil          3
Bob           2
Demi          1
David         1
Hester        1
Name: Vote, dtype: int64

Emilie and every candidate below her received fewer than 200 votes, so we can treat them as insignificant and combine them into a single “Other” category.

Code

# election_data is the DataFrame
# get the vote counts for each candidate
votes = election_data['Vote'].value_counts()
# names of the candidates with fewer than 200 votes
other = list(votes[votes < 200].index)

# True wherever a cell holds one of those names
mask = election_data.isin(other)
election_data[mask] = 'Other'
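
To try the snippet above end to end, here is a minimal sketch. It assumes election_data holds one row per ballot in a single 'Vote' column; the toy data is rebuilt from the counts listed earlier and is not the author's actual dataset.

```python
import pandas as pd

# Hypothetical data: one row per ballot, rebuilt from the counts listed above.
counts = {
    "Liliana": 1067, "John": 998, "William": 494, "Emilie": 196, "Pattie": 6,
    "Neil": 3, "Bob": 2, "Demi": 1, "David": 1, "Hester": 1,
}
election_data = pd.DataFrame(
    {"Vote": [name for name, n in counts.items() for _ in range(n)]}
)

# Candidates with fewer than 200 votes (the threshold is a judgment call).
votes = election_data["Vote"].value_counts()
other = votes[votes < 200].index.tolist()

# Boolean DataFrame: True wherever a cell holds a rare candidate's name.
mask = election_data.isin(other)
election_data[mask] = "Other"

combined = election_data["Vote"].value_counts()
print(combined)
```

Because the mask is a boolean DataFrame with the same shape as election_data, the assignment overwrites only the cells where it is True.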

You can also stack the DataFrame into a single Series, find the values that appear in other, replace them, and unstack to get the final result.

stack = election_data.stack()
stack[stack.isin(other)] = 'Other'
election_data = stack.unstack()
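
The stack/unstack route is easiest to see on a frame with more than one column. The sketch below uses a hypothetical ballots frame with two ranked choices per voter; the column names and the contents of other are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical ballot with two ranked choices per voter.
ballots = pd.DataFrame({
    "first":  ["Liliana", "John", "Bob", "Liliana", "Neil"],
    "second": ["John", "Liliana", "Liliana", "Demi", "John"],
})
other = ["Bob", "Neil", "Demi"]  # assumed to be the rare candidates

# stack() flattens the frame into one Series indexed by (row, column).
stacked = ballots.stack()
stacked[stacked.isin(other)] = "Other"

# unstack() restores the original row/column layout.
ballots = stacked.unstack()
print(ballots)
```

Stacking turns the problem into a plain Series assignment, at the cost of reshaping the data twice.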

Or, you could take the usual DataFrame.replace route:

election_data = election_data.replace(other, 'Other')

The value_counts() then becomes:

Liliana    1067
John        998
William     494
Other       210
Name: Vote, dtype: int64
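
As a sanity check, the three approaches can be run side by side to confirm they produce the same counts. This is a sketch under the same assumption as before: a single 'Vote' column with one row per ballot, rebuilt here from the counts in this post.

```python
import pandas as pd

# Hypothetical data rebuilt from the vote counts shown in this post.
counts = {
    "Liliana": 1067, "John": 998, "William": 494, "Emilie": 196, "Pattie": 6,
    "Neil": 3, "Bob": 2, "Demi": 1, "David": 1, "Hester": 1,
}
original = pd.DataFrame(
    {"Vote": [name for name, n in counts.items() for _ in range(n)]}
)
votes = original["Vote"].value_counts()
other = votes[votes < 200].index.tolist()

# Approach 1: boolean mask assignment on a copy.
by_mask = original.copy()
by_mask[by_mask.isin(other)] = "Other"

# Approach 2: stack, replace within the Series, unstack.
stacked = original.stack()
stacked[stacked.isin(other)] = "Other"
by_stack = stacked.unstack()

# Approach 3: DataFrame.replace with a list of values.
by_replace = original.replace(other, "Other")

expected = {"Liliana": 1067, "John": 998, "William": 494, "Other": 210}
for name, frame in [("mask", by_mask), ("stack", by_stack), ("replace", by_replace)]:
    assert frame["Vote"].value_counts().to_dict() == expected, name
```

For a one-off cleanup, replace is the shortest of the three; the mask is handy when you want to reuse the same boolean selection elsewhere.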

This post is licensed under CC BY 4.0 by the author.