5.20. DataFrame Aggregations

Important

.count()
.sum()
.nunique()
.mean()
.median()
.std()
.std2()
.min()
.quantile()
.max()
.first()
.last()
lambda column:

5.20.1. SetUp

✘>>> import pandas as pd
>>>
>>> pd.set_option('display.width', 250)
>>> pd.set_option('display.max_columns', 20)
>>> pd.set_option('display.max_rows', 30)
>>>
>>>
>>> DATA = 'https://python3.info/_static/phones-en.csv'
>>>
>>> df = (pd
...       .read_csv(DATA, parse_dates=['date'])
...       .drop(columns='index')
... )

date: The date and time of the entry
duration: The duration (in seconds) for each call, the amount of data (in MB) for each data entry, and the number of texts sent (usually 1) for each sms entry
item: A description of the event occurring – can be one of call, sms, or data
month: The billing month that each entry belongs to – of form YYYY-MM
network: The mobile network that was called/texted for each entry
network_type: Whether the number being called was a mobile, international ('world'), voicemail, landline, or other ('special') number.

Source [1]

5.20.2. Single Statistic

The groupby output will have an index or multi-index on rows corresponding to your chosen grouping variables. To avoid setting this index, pass as_index=False to the groupby operation:

✘>>> df.groupby('month', as_index=False).agg({'duration': 'sum'})
     month   duration
2014-11  26639.441
2014-12  14641.870
2015-01  18223.299
2015-02  15522.299
2015-03  22750.441

5.20.3. Multiple Statistics per Group

Group the data frame by month and item and extract a number of stats from each group:

✘>>> df.groupby(['month', 'item']).agg({
...     'duration': 'sum',        # Sum duration per group
...     'network_type': 'count',  # get the count of networks
...     'date': 'first',          # get the first date per group
... })
               duration  network_type                date
month   item
2014-11 call  25547.000           107 2014-10-15 06:58:00
        data    998.441            29 2014-10-15 06:58:00
        sms      94.000            94 2014-10-16 22:18:00
2014-12 call  13561.000            79 2014-11-14 17:24:00
        data   1032.870            30 2014-11-13 06:58:00
        sms      48.000            48 2014-11-14 17:28:00
2015-01 call  17070.000            88 2014-12-15 20:03:00
        data   1067.299            31 2014-12-13 06:58:00
        sms      86.000            86 2014-12-15 19:56:00
2015-02 call  14416.000            67 2015-01-15 10:36:00
        data   1067.299            31 2015-01-13 06:58:00
        sms      39.000            39 2015-01-15 12:23:00
2015-03 call  21727.000            47 2015-12-02 20:15:00
        data    998.441            29 2015-02-13 06:58:00
        sms      25.000            25 2015-02-19 18:46:00

5.20.4. Applying multiple functions to columns in groups

Group the data frame by month and item and extract a number of stats from each group:

✘>>> df.groupby(['month', 'item']).agg({
...
...     # Find the min, max, and sum of the duration column
...     'duration': ['min', 'max', 'sum'],
...
...     # find the number of network type entries
...     'network_type': 'count',
...
...     # minimum, first, and number of unique dates
...     'date': ['min', 'first', 'nunique']
... })
             duration                       network_type                date
                  min        max        sum        count                 min               first nunique
month   item
2014-11 call    1.000   1940.000  25547.000          107 2014-01-11 15:13:00 2014-10-15 06:58:00     104
        data   34.429     34.429    998.441           29 2014-01-11 06:58:00 2014-10-15 06:58:00      29
        sms     1.000      1.000     94.000           94 2014-03-11 08:40:00 2014-10-16 22:18:00      79
2014-12 call    2.000   2120.000  13561.000           79 2014-02-12 11:40:00 2014-11-14 17:24:00      76
        data   34.429     34.429   1032.870           30 2014-01-12 06:58:00 2014-11-13 06:58:00      30
        sms     1.000      1.000     48.000           48 2014-01-12 12:51:00 2014-11-14 17:28:00      41
2015-01 call    2.000   1859.000  17070.000           88 2014-12-15 20:03:00 2014-12-15 20:03:00      84
        data   34.429     34.429   1067.299           31 2014-12-13 06:58:00 2014-12-13 06:58:00      31
        sms     1.000      1.000     86.000           86 2014-12-15 19:56:00 2014-12-15 19:56:00      58
2015-02 call    1.000   1863.000  14416.000           67 2015-01-02 13:33:00 2015-01-15 10:36:00      67
        data   34.429     34.429   1067.299           31 2015-01-02 06:58:00 2015-01-13 06:58:00      31
        sms     1.000      1.000     39.000           39 2015-01-15 12:23:00 2015-01-15 12:23:00      27
2015-03 call    2.000  10528.000  21727.000           47 2015-01-03 12:19:00 2015-12-02 20:15:00      47
        data   34.429     34.429    998.441           29 2015-01-03 06:58:00 2015-02-13 06:58:00      29
        sms     1.000      1.000     25.000           25 2015-02-03 09:19:00 2015-02-19 18:46:00      17

5.20.5. Named Aggregations

Named Aggregations:

✘>>> df[df['item'] == 'call'].groupby('month').agg(
...
...     # Get max of the duration column for each group
...     max_duration=('duration', 'max'),
...
...     # Get min of the duration column for each group
...     min_duration=('duration', 'min'),
...
...     # Get sum of the duration column for each group
...     total_duration=('duration', 'sum'),
...
...     # Apply a lambda to date column
...     num_days=('date', lambda x: (max(x) - min(x)).days)
... )
         max_duration  min_duration  total_duration  num_days
month
2014-11        1940.0           1.0         25547.0       334
2014-12        2120.0           2.0         13561.0       305
2015-01        1859.0           2.0         17070.0       350
2015-02        1863.0           1.0         14416.0       243
2015-03       10528.0           2.0         21727.0       333

✘>>> df.groupby(['month', 'item']).agg(
...     duration_count=('duration', 'count'),
...     duration_sum=('duration', 'sum'),
...     duration_min=('duration', 'min'),
...     duration_max=('duration', 'max'),
...     duration_mean=('duration', 'mean'),
...     duration_mean_round=('duration', lambda column: column.mean().astype(int)),
...     duration_median=('duration', 'median'),
...     first=('date', 'first'),
...     last=('date', 'last'),
... )
              duration_count  duration_sum  duration_min  duration_max  duration_mean  duration_mean_round  duration_median               first                last
month   item
2014-11 call             107     25547.000         1.000      1940.000     238.757009                  238           48.000 2014-10-15 06:58:00 2014-12-11 19:01:00
        data              29       998.441        34.429        34.429      34.429000                   34           34.429 2014-10-15 06:58:00 2014-12-11 06:58:00
        sms               94        94.000         1.000         1.000       1.000000                    1            1.000 2014-10-16 22:18:00 2014-11-13 22:31:00
2014-12 call              79     13561.000         2.000      2120.000     171.658228                  171           55.000 2014-11-14 17:24:00 2014-12-14 19:54:00
        data              30      1032.870        34.429        34.429      34.429000                   34           34.429 2014-11-13 06:58:00 2014-12-12 06:58:00
        sms               48        48.000         1.000         1.000       1.000000                    1            1.000 2014-11-14 17:28:00 2014-07-12 23:22:00
2015-01 call              88     17070.000         2.000      1859.000     193.977273                  193           55.500 2014-12-15 20:03:00 2015-01-14 20:47:00
        data              31      1067.299        34.429        34.429      34.429000                   34           34.429 2014-12-13 06:58:00 2015-12-01 06:58:00
        sms               86        86.000         1.000         1.000       1.000000                    1            1.000 2014-12-15 19:56:00 2015-01-14 23:36:00
2015-02 call              67     14416.000         1.000      1863.000     215.164179                  215           89.000 2015-01-15 10:36:00 2015-09-02 17:54:00
        data              31      1067.299        34.429        34.429      34.429000                   34           34.429 2015-01-13 06:58:00 2015-12-02 06:58:00
        sms               39        39.000         1.000         1.000       1.000000                    1            1.000 2015-01-15 12:23:00 2015-10-02 21:40:00
2015-03 call              47     21727.000         2.000     10528.000     462.276596                  462          107.000 2015-12-02 20:15:00 2015-04-03 12:29:00
        data              29       998.441        34.429        34.429      34.429000                   34           34.429 2015-02-13 06:58:00 2015-03-13 06:58:00
        sms               25        25.000         1.000         1.000       1.000000                    1            1.000 2015-02-19 18:46:00 2015-03-14 00:16:00

5.20.6. Renaming index

using droplevel and ravel
Dictionary groupby format is deprecated

Drop the top level (using .droplevel()) of the newly created multi-index on columns using:

✘>>> grouped = df.groupby('month').agg({'duration': ['min', 'max', 'mean']})
>>> grouped
        duration
             min      max        mean
month
2014-11      1.0   1940.0  115.823657
2014-12      1.0   2120.0   93.260318
2015-01      1.0   1859.0   88.894141
2015-02      1.0   1863.0  113.301453
2015-03      1.0  10528.0  225.251891

✘>>> grouped.columns = grouped.columns.droplevel(level=0)
>>> grouped
         min      max        mean
month
2014-11  1.0   1940.0  115.823657
2014-12  1.0   2120.0   93.260318
2015-01  1.0   1859.0   88.894141
2015-02  1.0   1863.0  113.301453
2015-03  1.0  10528.0  225.251891

✘>>> grouped.rename(columns={
...     'min': 'min_duration',
...     'max': 'max_duration',
...     'mean': 'mean_duration'
... }, inplace=True)
>>> grouped
         min_duration  max_duration  mean_duration
month
2014-11           1.0        1940.0     115.823657
2014-12           1.0        2120.0      93.260318
2015-01           1.0        1859.0      88.894141
2015-02           1.0        1863.0     113.301453
2015-03           1.0       10528.0     225.251891

Quick renaming of grouped columns from the groupby() multi-index can be achieved using the ravel() function:

✘>>> grouped = df.groupby('month').agg({
...     'duration': ['min', 'max', 'mean']
... })
>>> grouped
        duration
             min      max        mean
month
2014-11      1.0   1940.0  115.823657
2014-12      1.0   2120.0   93.260318
2015-01      1.0   1859.0   88.894141
2015-02      1.0   1863.0  113.301453
2015-03      1.0  10528.0  225.251891

Using ravel, and a string join, we can create better names for the columns:

✘>>> grouped.columns = ['_'.join(x) for x in grouped.columns]
>>> grouped
         duration_min  duration_max  duration_mean
month
2014-11           1.0        1940.0     115.823657
2014-12           1.0        2120.0      93.260318
2015-01           1.0        1859.0      88.894141
2015-02           1.0        1863.0     113.301453
2015-03           1.0       10528.0     225.251891

5.20.7. Use Case - 1

✘>>> import pandas as pd
>>>
>>>
>>> DATA = 'https://python3.info/_static/phones-pl.csv'
>>>
>>> result = (
...     pd
...     .read_csv(DATA, parse_dates=['datetime'])
...     .set_index('datetime', drop=True)
...     .drop(columns=['id'])
...     .loc['2000-01-01':'2000-03-01']
...     .query('item == "sms"')
...     .groupby(['period','item'])
...     .agg(
...         duration_count = ('duration', 'count'),
...         duration_sum = ('duration', 'sum'),
...         duration_median = ('duration', 'median'),
...         duration_mean = ('duration', 'mean'),
...         duration_std = ('duration', 'std'),
...         duration_var = ('duration', 'var'),
...         value = ('duration', lambda column: column.mean().astype(int))
...     )
... )

5.20.8. Use Case - 2

✘>>> import pandas as pd
>>>
>>>
>>> def quantile25(column):
...     return column.quantile(.25)
>>>
>>> def quantile50(column):
...     return column.quantile(.50)
>>>
>>> def quantile75(column):
...     return column.quantile(.75)
>>>
>>>
>>> DATA = 'https://python3.info/_static/phones-en.csv'
>>> df = pd.read_csv(DATA, parse_dates=['date'])
>>> df.drop(columns='index', inplace=True)
>>>
>>> result = df.groupby(['month','item']).agg(
...     duration_count=('duration', 'count'),
...     duration_sum=('duration', 'sum'),
...     duration_nunique=('duration', 'nunique'),
...
...     duration_mean=('duration', 'mean'),
...     duration_median=('duration', 'median'),
...     duration_std=('duration', 'std'),
...     duration_std2=('duration', lambda column: column.std().astype(int)),
...
...     duration_min=('duration', 'min'),
...     duration_q25=('duration', quantile25),
...     duration_q50=('duration', quantile50),
...     duration_q75=('duration', quantile75),
...     duration_max=('duration', 'max'),
...
...     when_first=('date', 'first'),
...     when_last=('date', 'last'),
... )
>>>
>>> result
              duration_count  duration_sum  duration_nunique  duration_mean  duration_median  duration_std  duration_std2  duration_min  duration_q25  duration_q50  duration_q75  duration_max          when_first           when_last
month   item
2014-11 call             107     25547.000                76     238.757009           48.000    387.128905            387         1.000         5.500        48.000       328.000      1940.000 2014-10-15 06:58:00 2014-12-11 19:01:00
        data              29       998.441                 1      34.429000           34.429      0.000000              0        34.429        34.429        34.429        34.429        34.429 2014-10-15 06:58:00 2014-12-11 06:58:00
        sms               94        94.000                 1       1.000000            1.000      0.000000              0         1.000         1.000         1.000         1.000         1.000 2014-10-16 22:18:00 2014-11-13 22:31:00
2014-12 call              79     13561.000                61     171.658228           55.000    324.731798            324         2.000        10.500        55.000       152.000      2120.000 2014-11-14 17:24:00 2014-12-14 19:54:00
        data              30      1032.870                 1      34.429000           34.429      0.000000              0        34.429        34.429        34.429        34.429        34.429 2014-11-13 06:58:00 2014-12-12 06:58:00
        sms               48        48.000                 1       1.000000            1.000      0.000000              0         1.000         1.000         1.000         1.000         1.000 2014-11-14 17:28:00 2014-07-12 23:22:00
2015-01 call              88     17070.000                70     193.977273           55.500    300.671661            300         2.000        15.500        55.500       273.500      1859.000 2014-12-15 20:03:00 2015-01-14 20:47:00
        data              31      1067.299                 1      34.429000           34.429      0.000000              0        34.429        34.429        34.429        34.429        34.429 2014-12-13 06:58:00 2015-12-01 06:58:00
        sms               86        86.000                 1       1.000000            1.000      0.000000              0         1.000         1.000         1.000         1.000         1.000 2014-12-15 19:56:00 2015-01-14 23:36:00
2015-02 call              67     14416.000                63     215.164179           89.000    329.672914            329         1.000        30.000        89.000       241.000      1863.000 2015-01-15 10:36:00 2015-09-02 17:54:00
        data              31      1067.299                 1      34.429000           34.429      0.000000              0        34.429        34.429        34.429        34.429        34.429 2015-01-13 06:58:00 2015-12-02 06:58:00
        sms               39        39.000                 1       1.000000            1.000      0.000000              0         1.000         1.000         1.000         1.000         1.000 2015-01-15 12:23:00 2015-10-02 21:40:00
2015-03 call              47     21727.000                46     462.276596          107.000   1552.192218           1552         2.000        33.500       107.000       320.000     10528.000 2015-12-02 20:15:00 2015-04-03 12:29:00
        data              29       998.441                 1      34.429000           34.429      0.000000              0        34.429        34.429        34.429        34.429        34.429 2015-02-13 06:58:00 2015-03-13 06:58:00
        sms               25        25.000                 1       1.000000            1.000      0.000000              0         1.000         1.000         1.000         1.000         1.000 2015-02-19 18:46:00 2015-03-14 00:16:00

✘>>> result.loc[('2015-01','call')]
duration_count                       88
duration_sum                    17070.0
duration_nunique                     70
duration_mean                193.977273
duration_median                    55.5
duration_std                 300.671661
duration_std2                       300
duration_min                        2.0
duration_q25                       15.5
duration_q50                       55.5
duration_q75                      273.5
duration_max                     1859.0
when_first          2014-12-15 20:03:00
when_last           2015-01-14 20:47:00
Name: (2015-01, call), dtype: object

✘>>> result.loc['2015-01']
      duration_count  duration_sum  duration_nunique  duration_mean  duration_median  duration_std  duration_std2  duration_min  duration_q25  duration_q50  duration_q75  duration_max          when_first           when_last
item
call              88     17070.000                70     193.977273           55.500    300.671661            300         2.000        15.500        55.500       273.500      1859.000 2014-12-15 20:03:00 2015-01-14 20:47:00
data              31      1067.299                 1      34.429000           34.429      0.000000              0        34.429        34.429        34.429        34.429        34.429 2014-12-13 06:58:00 2015-12-01 06:58:00
sms               86        86.000                 1       1.000000            1.000      0.000000              0         1.000         1.000         1.000         1.000         1.000 2014-12-15 19:56:00 2015-01-14 23:36:00

✘>>> result.loc['2015-01'].transpose()
item                             call                 data                  sms
duration_count                     88                   31                   86
duration_sum                  17070.0             1067.299                 86.0
duration_nunique                   70                    1                    1
duration_mean              193.977273               34.429                  1.0
duration_median                  55.5               34.429                  1.0
duration_std               300.671661                  0.0                  0.0
duration_std2                     300                    0                    0
duration_min                      2.0               34.429                  1.0
duration_q25                     15.5               34.429                  1.0
duration_q50                     55.5               34.429                  1.0
duration_q75                    273.5               34.429                  1.0
duration_max                   1859.0               34.429                  1.0
when_first        2014-12-15 20:03:00  2014-12-13 06:58:00  2014-12-15 19:56:00
when_last         2015-01-14 20:47:00  2015-12-01 06:58:00  2015-01-14 23:36:00

✘>>> sms = result.index.get_level_values('item') == 'sms'
>>> sms
array([False, False,  True, False, False,  True, False, False,  True,
       False, False,  True, False, False,  True])
>>>
>>> result[sms]
              duration_count  duration_sum  duration_nunique  duration_mean  duration_median  duration_std  duration_std2  duration_min  duration_q25  duration_q50  duration_q75  duration_max          when_first           when_last
month   item
2014-11 sms               94          94.0                 1            1.0              1.0           0.0              0           1.0           1.0           1.0           1.0           1.0 2014-10-16 22:18:00 2014-11-13 22:31:00
2014-12 sms               48          48.0                 1            1.0              1.0           0.0              0           1.0           1.0           1.0           1.0           1.0 2014-11-14 17:28:00 2014-07-12 23:22:00
2015-01 sms               86          86.0                 1            1.0              1.0           0.0              0           1.0           1.0           1.0           1.0           1.0 2014-12-15 19:56:00 2015-01-14 23:36:00
2015-02 sms               39          39.0                 1            1.0              1.0           0.0              0           1.0           1.0           1.0           1.0           1.0 2015-01-15 12:23:00 2015-10-02 21:40:00
2015-03 sms               25          25.0                 1            1.0              1.0           0.0              0           1.0           1.0           1.0           1.0           1.0 2015-02-19 18:46:00 2015-03-14 00:16:00

Cross-section:

✘>>> result.xs('sms', level='item')
         duration_count  duration_sum  duration_nunique  duration_mean  duration_median  duration_std  duration_std2  duration_min  duration_q25  duration_q50  duration_q75  duration_max          when_first           when_last
month
2014-11              94          94.0                 1            1.0              1.0           0.0              0           1.0           1.0           1.0           1.0           1.0 2014-10-16 22:18:00 2014-11-13 22:31:00
2014-12              48          48.0                 1            1.0              1.0           0.0              0           1.0           1.0           1.0           1.0           1.0 2014-11-14 17:28:00 2014-07-12 23:22:00
2015-01              86          86.0                 1            1.0              1.0           0.0              0           1.0           1.0           1.0           1.0           1.0 2014-12-15 19:56:00 2015-01-14 23:36:00
2015-02              39          39.0                 1            1.0              1.0           0.0              0           1.0           1.0           1.0           1.0           1.0 2015-01-15 12:23:00 2015-10-02 21:40:00
2015-03              25          25.0                 1            1.0              1.0           0.0              0           1.0           1.0           1.0           1.0           1.0 2015-02-19 18:46:00 2015-03-14 00:16:00

Slicer Object:

✘>>> result.loc[(slice(None), 'sms'), :]
              duration_count  duration_sum  duration_nunique  duration_mean  duration_median  duration_std  duration_std2  duration_min  duration_q25  duration_q50  duration_q75  duration_max          when_first           when_last
month   item
2014-11 sms               94          94.0                 1            1.0              1.0           0.0              0           1.0           1.0           1.0           1.0           1.0 2014-10-16 22:18:00 2014-11-13 22:31:00
2014-12 sms               48          48.0                 1            1.0              1.0           0.0              0           1.0           1.0           1.0           1.0           1.0 2014-11-14 17:28:00 2014-07-12 23:22:00
2015-01 sms               86          86.0                 1            1.0              1.0           0.0              0           1.0           1.0           1.0           1.0           1.0 2014-12-15 19:56:00 2015-01-14 23:36:00
2015-02 sms               39          39.0                 1            1.0              1.0           0.0              0           1.0           1.0           1.0           1.0           1.0 2015-01-15 12:23:00 2015-10-02 21:40:00
2015-03 sms               25          25.0                 1            1.0              1.0           0.0              0           1.0           1.0           1.0           1.0           1.0 2015-02-19 18:46:00 2015-03-14 00:16:00