Post

Pandas DataFrame of value_counts() for data exploration.

The Problem

We have data organised for analysis in a DataFrame.

  1. The first step is to figure out the quantitative and categorical columns.
  2. Among the categorical columns, we’re always interested to know what possible values each column could have.
    • We can figure this out for each column through value_counts().
  3. Applying value_counts to each data.column (Series) is a repetitive task that takes up way too space and time.

Solution

An organised visual representation of each column’s unique values in one place.

Make a DataFrame of value_counts() and print it for review.

Code

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
def value_counts_df(
        data: pd.DataFrame,
        cols: list,
)-> pd.DataFrame:
    """
    # Create a Dataframe of column value_counts().
    # Helps review data distributions.
    :param data: DataFrame of data.
    :param cols: List of categorical columns exempt all date columns.
    :return: The value_count() DataFrame.
    """
    df = pd.DataFrame()
    for col in cols:
        val = data[col].value_counts()
        ser = pd.Series(dtype='object')

        for index, rows in val.iteritems():
            ser = pd.concat(
                [
                    ser,
                    pd.Series(f'{index} - {rows}')
                ]
            )
            ser = ser.reset_index(drop=True)

        df = pd.concat(
            [df, ser],
            axis=1
        )
    df.columns = cols
    return df

This goes against tidyverse rules… oops 🤭.
But it’s fine since we’re organising the data not for analysis but just for a visual review.

Output

 hotelis_canceledmealcountrymarket_segmentdistribution_channelreserved_room_typeassigned_room_typedeposit_typeagentcompanycustomer_typereservation_status
0City Hotel - 793300 - 75166BB - 92310PRT - 48590Online TA - 56477TA/TO - 97870A - 85994A - 74053No Deposit - 1046419.0 - 3196140.0 - 927Transient - 89613Check-Out - 75166
1Resort Hotel - 400601 - 44224HB - 14463GBR - 12129Offline TA/TO - 24219Direct - 14645D - 19201D - 25322Non Refund - 14587240.0 - 13922223.0 - 784Transient-Party - 25124Canceled - 43017
2NaNNaNSC - 10650FRA - 10415Groups - 19811Corporate - 6677E - 6535E - 7806Refundable - 1621.0 - 719167.0 - 267Contract - 4076No-Show - 1207
3NaNNaNUndefined - 1169ESP - 8568Direct - 12606GDS - 193F - 2897F - 3751NaN14.0 - 364045.0 - 250Group - 577NaN
4NaNNaNFB - 798DEU - 7287Corporate - 5295Undefined - 5G - 2094G - 2553NaN7.0 - 3539153.0 - 215NaNNaN
5NaNNaNNaNITA - 3766Complementary - 743NaNB - 1118C - 2375NaN6.0 - 3290174.0 - 149NaNNaN
6NaNNaNNaNIRL - 3375Aviation - 237NaNC - 932B - 2163NaN250.0 - 2870219.0 - 141NaNNaN
7NaNNaNNaNBEL - 2342Undefined - 2NaNH - 601H - 712NaN241.0 - 1721281.0 - 138NaNNaN
8NaNNaNNaNBRA - 2224NaNNaNP - 12I - 363NaN28.0 - 1666154.0 - 133NaNNaN
9NaNNaNNaNNLD - 2104NaNNaNL - 6K - 279NaN8.0 - 1514405.0 - 119NaNNaN
10NaNNaNNaNUSA - 2097NaNNaNNaNP - 12NaN3.0 - 1336233.0 - 114NaNNaN

This was created from a Kaggle Dataset - Hotel Booking Demand

Final Thoughts

This is my current solution to this problem. Any alternative suggestions and enhancements are always welcome.

Maybe you have a more elegant way to go about this part of the data exploration process?

I’m all ears!

This post is licensed under CC BY 4.0 by the author.