5.20. DataFrame Aggregations
.count()
.sum()
.nunique()
.mean()
.median()
.std()
.std2()
.min()
.quantile()
.max()
.first()
.last()
lambda column:
5.20.1. SetUp
>>> import pandas as pd
>>>
>>> pd.set_option('display.width', 250)
>>> pd.set_option('display.max_columns', 20)
>>> pd.set_option('display.max_rows', 30)
>>>
>>>
>>> DATA = 'https://python3.info/_static/phones-en.csv'
>>>
>>> df = (pd
... .read_csv(DATA, parse_dates=['date'])
... .drop(columns='index')
... )
- date
The date and time of the entry
- duration
The duration (in seconds) for each call, the amount of data (in MB) for each data entry, and the number of texts sent (usually 1) for each sms entry
- item
A description of the event occurring – can be one of call, sms, or data
- month
The billing month that each entry belongs to – of form
YYYY-MM
- network
The mobile network that was called/texted for each entry
- network_type
Whether the number being called was a mobile, international ('world'), voicemail, landline, or other ('special') number.
Source [1]
5.20.2. Single Statistic
The groupby output will have an index or multi-index on rows corresponding to
your chosen grouping variables. To avoid setting this index, pass
as_index=False
to the groupby operation:
>>> df.groupby('month', as_index=False).agg({'duration': 'sum'})
month duration
0 2014-11 26639.441
1 2014-12 14641.870
2 2015-01 18223.299
3 2015-02 15522.299
4 2015-03 22750.441
5.20.3. Multiple Statistics per Group
Group the data frame by month and item and extract a number of stats from each group:
>>> df.groupby(['month', 'item']).agg({
... 'duration': 'sum', # Sum duration per group
... 'network_type': 'count', # get the count of networks
... 'date': 'first', # get the first date per group
... })
duration network_type date
month item
2014-11 call 25547.000 107 2014-10-15 06:58:00
data 998.441 29 2014-10-15 06:58:00
sms 94.000 94 2014-10-16 22:18:00
2014-12 call 13561.000 79 2014-11-14 17:24:00
data 1032.870 30 2014-11-13 06:58:00
sms 48.000 48 2014-11-14 17:28:00
2015-01 call 17070.000 88 2014-12-15 20:03:00
data 1067.299 31 2014-12-13 06:58:00
sms 86.000 86 2014-12-15 19:56:00
2015-02 call 14416.000 67 2015-01-15 10:36:00
data 1067.299 31 2015-01-13 06:58:00
sms 39.000 39 2015-01-15 12:23:00
2015-03 call 21727.000 47 2015-12-02 20:15:00
data 998.441 29 2015-02-13 06:58:00
sms 25.000 25 2015-02-19 18:46:00
5.20.4. Applying multiple functions to columns in groups
Group the data frame by month and item and extract a number of stats from each group:
>>> df.groupby(['month', 'item']).agg({
...
... # Find the min, max, and sum of the duration column
... 'duration': ['min', 'max', 'sum'],
...
... # find the number of network type entries
... 'network_type': 'count',
...
... # minimum, first, and number of unique dates
... 'date': ['min', 'first', 'nunique']
... })
duration network_type date
min max sum count min first nunique
month item
2014-11 call 1.000 1940.000 25547.000 107 2014-01-11 15:13:00 2014-10-15 06:58:00 104
data 34.429 34.429 998.441 29 2014-01-11 06:58:00 2014-10-15 06:58:00 29
sms 1.000 1.000 94.000 94 2014-03-11 08:40:00 2014-10-16 22:18:00 79
2014-12 call 2.000 2120.000 13561.000 79 2014-02-12 11:40:00 2014-11-14 17:24:00 76
data 34.429 34.429 1032.870 30 2014-01-12 06:58:00 2014-11-13 06:58:00 30
sms 1.000 1.000 48.000 48 2014-01-12 12:51:00 2014-11-14 17:28:00 41
2015-01 call 2.000 1859.000 17070.000 88 2014-12-15 20:03:00 2014-12-15 20:03:00 84
data 34.429 34.429 1067.299 31 2014-12-13 06:58:00 2014-12-13 06:58:00 31
sms 1.000 1.000 86.000 86 2014-12-15 19:56:00 2014-12-15 19:56:00 58
2015-02 call 1.000 1863.000 14416.000 67 2015-01-02 13:33:00 2015-01-15 10:36:00 67
data 34.429 34.429 1067.299 31 2015-01-02 06:58:00 2015-01-13 06:58:00 31
sms 1.000 1.000 39.000 39 2015-01-15 12:23:00 2015-01-15 12:23:00 27
2015-03 call 2.000 10528.000 21727.000 47 2015-01-03 12:19:00 2015-12-02 20:15:00 47
data 34.429 34.429 998.441 29 2015-01-03 06:58:00 2015-02-13 06:58:00 29
sms 1.000 1.000 25.000 25 2015-02-03 09:19:00 2015-02-19 18:46:00 17
5.20.5. Named Aggregations
Named Aggregations:
>>> df[df['item'] == 'call'].groupby('month').agg(
...
... # Get max of the duration column for each group
... max_duration=('duration', 'max'),
...
... # Get min of the duration column for each group
... min_duration=('duration', 'min'),
...
... # Get sum of the duration column for each group
... total_duration=('duration', 'sum'),
...
... # Apply a lambda to date column
... num_days=('date', lambda x: (max(x) - min(x)).days)
... )
max_duration min_duration total_duration num_days
month
2014-11 1940.0 1.0 25547.0 334
2014-12 2120.0 2.0 13561.0 305
2015-01 1859.0 2.0 17070.0 350
2015-02 1863.0 1.0 14416.0 243
2015-03 10528.0 2.0 21727.0 333
>>> df.groupby(['month', 'item']).agg(
... duration_count=('duration', 'count'),
... duration_sum=('duration', 'sum'),
... duration_min=('duration', 'min'),
... duration_max=('duration', 'max'),
... duration_mean=('duration', 'mean'),
... duration_mean_round=('duration', lambda column: column.mean().astype(int)),
... duration_median=('duration', 'median'),
... first=('date', 'first'),
... last=('date', 'last'),
... )
duration_count duration_sum duration_min duration_max duration_mean duration_mean_round duration_median first last
month item
2014-11 call 107 25547.000 1.000 1940.000 238.757009 238 48.000 2014-10-15 06:58:00 2014-12-11 19:01:00
data 29 998.441 34.429 34.429 34.429000 34 34.429 2014-10-15 06:58:00 2014-12-11 06:58:00
sms 94 94.000 1.000 1.000 1.000000 1 1.000 2014-10-16 22:18:00 2014-11-13 22:31:00
2014-12 call 79 13561.000 2.000 2120.000 171.658228 171 55.000 2014-11-14 17:24:00 2014-12-14 19:54:00
data 30 1032.870 34.429 34.429 34.429000 34 34.429 2014-11-13 06:58:00 2014-12-12 06:58:00
sms 48 48.000 1.000 1.000 1.000000 1 1.000 2014-11-14 17:28:00 2014-07-12 23:22:00
2015-01 call 88 17070.000 2.000 1859.000 193.977273 193 55.500 2014-12-15 20:03:00 2015-01-14 20:47:00
data 31 1067.299 34.429 34.429 34.429000 34 34.429 2014-12-13 06:58:00 2015-12-01 06:58:00
sms 86 86.000 1.000 1.000 1.000000 1 1.000 2014-12-15 19:56:00 2015-01-14 23:36:00
2015-02 call 67 14416.000 1.000 1863.000 215.164179 215 89.000 2015-01-15 10:36:00 2015-09-02 17:54:00
data 31 1067.299 34.429 34.429 34.429000 34 34.429 2015-01-13 06:58:00 2015-12-02 06:58:00
sms 39 39.000 1.000 1.000 1.000000 1 1.000 2015-01-15 12:23:00 2015-10-02 21:40:00
2015-03 call 47 21727.000 2.000 10528.000 462.276596 462 107.000 2015-12-02 20:15:00 2015-04-03 12:29:00
data 29 998.441 34.429 34.429 34.429000 34 34.429 2015-02-13 06:58:00 2015-03-13 06:58:00
sms 25 25.000 1.000 1.000 1.000000 1 1.000 2015-02-19 18:46:00 2015-03-14 00:16:00
5.20.6. Renaming index
using
droplevel
andravel
Dictionary
groupby
format is deprecated
Drop the top level (using .droplevel()
) of the newly created multi-index
on columns using:
>>> grouped = df.groupby('month').agg({'duration': ['min', 'max', 'mean']})
>>> grouped
duration
min max mean
month
2014-11 1.0 1940.0 115.823657
2014-12 1.0 2120.0 93.260318
2015-01 1.0 1859.0 88.894141
2015-02 1.0 1863.0 113.301453
2015-03 1.0 10528.0 225.251891
>>> grouped.columns = grouped.columns.droplevel(level=0)
>>> grouped
min max mean
month
2014-11 1.0 1940.0 115.823657
2014-12 1.0 2120.0 93.260318
2015-01 1.0 1859.0 88.894141
2015-02 1.0 1863.0 113.301453
2015-03 1.0 10528.0 225.251891
>>> grouped.rename(columns={
... 'min': 'min_duration',
... 'max': 'max_duration',
... 'mean': 'mean_duration'
... }, inplace=True)
>>> grouped
min_duration max_duration mean_duration
month
2014-11 1.0 1940.0 115.823657
2014-12 1.0 2120.0 93.260318
2015-01 1.0 1859.0 88.894141
2015-02 1.0 1863.0 113.301453
2015-03 1.0 10528.0 225.251891
Quick renaming of grouped columns from the groupby() multi-index can be achieved using the ravel() function:
>>> grouped = df.groupby('month').agg({
... 'duration': ['min', 'max', 'mean']
... })
>>> grouped
duration
min max mean
month
2014-11 1.0 1940.0 115.823657
2014-12 1.0 2120.0 93.260318
2015-01 1.0 1859.0 88.894141
2015-02 1.0 1863.0 113.301453
2015-03 1.0 10528.0 225.251891
Using ravel, and a string join, we can create better names for the columns:
>>> grouped.columns = ['_'.join(x) for x in grouped.columns]
>>> grouped
duration_min duration_max duration_mean
month
2014-11 1.0 1940.0 115.823657
2014-12 1.0 2120.0 93.260318
2015-01 1.0 1859.0 88.894141
2015-02 1.0 1863.0 113.301453
2015-03 1.0 10528.0 225.251891
5.20.7. Use Case - 1
>>> import pandas as pd
>>>
>>>
>>> DATA = 'https://python3.info/_static/phones-pl.csv'
>>>
>>> result = (
... pd
... .read_csv(DATA, parse_dates=['datetime'])
... .set_index('datetime', drop=True)
... .drop(columns=['id'])
... .loc['2000-01-01':'2000-03-01']
... .query('item == "sms"')
... .groupby(['period','item'])
... .agg(
... duration_count = ('duration', 'count'),
... duration_sum = ('duration', 'sum'),
... duration_median = ('duration', 'median'),
... duration_mean = ('duration', 'mean'),
... duration_std = ('duration', 'std'),
... duration_var = ('duration', 'var'),
... value = ('duration', lambda column: column.mean().astype(int))
... )
... )
5.20.8. Use Case - 2
>>> import pandas as pd
>>>
>>>
>>> def quantile25(column):
... return column.quantile(.25)
>>>
>>> def quantile50(column):
... return column.quantile(.50)
>>>
>>> def quantile75(column):
... return column.quantile(.75)
>>>
>>>
>>> DATA = 'https://python3.info/_static/phones-en.csv'
>>> df = pd.read_csv(DATA, parse_dates=['date'])
>>> df.drop(columns='index', inplace=True)
>>>
>>> result = df.groupby(['month','item']).agg(
... duration_count=('duration', 'count'),
... duration_sum=('duration', 'sum'),
... duration_nunique=('duration', 'nunique'),
...
... duration_mean=('duration', 'mean'),
... duration_median=('duration', 'median'),
... duration_std=('duration', 'std'),
... duration_std2=('duration', lambda column: column.std().astype(int)),
...
... duration_min=('duration', 'min'),
... duration_q25=('duration', quantile25),
... duration_q50=('duration', quantile50),
... duration_q75=('duration', quantile75),
... duration_max=('duration', 'max'),
...
... when_first=('date', 'first'),
... when_last=('date', 'last'),
... )
>>>
>>> result
duration_count duration_sum duration_nunique duration_mean duration_median duration_std duration_std2 duration_min duration_q25 duration_q50 duration_q75 duration_max when_first when_last
month item
2014-11 call 107 25547.000 76 238.757009 48.000 387.128905 387 1.000 5.500 48.000 328.000 1940.000 2014-10-15 06:58:00 2014-12-11 19:01:00
data 29 998.441 1 34.429000 34.429 0.000000 0 34.429 34.429 34.429 34.429 34.429 2014-10-15 06:58:00 2014-12-11 06:58:00
sms 94 94.000 1 1.000000 1.000 0.000000 0 1.000 1.000 1.000 1.000 1.000 2014-10-16 22:18:00 2014-11-13 22:31:00
2014-12 call 79 13561.000 61 171.658228 55.000 324.731798 324 2.000 10.500 55.000 152.000 2120.000 2014-11-14 17:24:00 2014-12-14 19:54:00
data 30 1032.870 1 34.429000 34.429 0.000000 0 34.429 34.429 34.429 34.429 34.429 2014-11-13 06:58:00 2014-12-12 06:58:00
sms 48 48.000 1 1.000000 1.000 0.000000 0 1.000 1.000 1.000 1.000 1.000 2014-11-14 17:28:00 2014-07-12 23:22:00
2015-01 call 88 17070.000 70 193.977273 55.500 300.671661 300 2.000 15.500 55.500 273.500 1859.000 2014-12-15 20:03:00 2015-01-14 20:47:00
data 31 1067.299 1 34.429000 34.429 0.000000 0 34.429 34.429 34.429 34.429 34.429 2014-12-13 06:58:00 2015-12-01 06:58:00
sms 86 86.000 1 1.000000 1.000 0.000000 0 1.000 1.000 1.000 1.000 1.000 2014-12-15 19:56:00 2015-01-14 23:36:00
2015-02 call 67 14416.000 63 215.164179 89.000 329.672914 329 1.000 30.000 89.000 241.000 1863.000 2015-01-15 10:36:00 2015-09-02 17:54:00
data 31 1067.299 1 34.429000 34.429 0.000000 0 34.429 34.429 34.429 34.429 34.429 2015-01-13 06:58:00 2015-12-02 06:58:00
sms 39 39.000 1 1.000000 1.000 0.000000 0 1.000 1.000 1.000 1.000 1.000 2015-01-15 12:23:00 2015-10-02 21:40:00
2015-03 call 47 21727.000 46 462.276596 107.000 1552.192218 1552 2.000 33.500 107.000 320.000 10528.000 2015-12-02 20:15:00 2015-04-03 12:29:00
data 29 998.441 1 34.429000 34.429 0.000000 0 34.429 34.429 34.429 34.429 34.429 2015-02-13 06:58:00 2015-03-13 06:58:00
sms 25 25.000 1 1.000000 1.000 0.000000 0 1.000 1.000 1.000 1.000 1.000 2015-02-19 18:46:00 2015-03-14 00:16:00
>>> result.loc[('2015-01','call')]
duration_count 88
duration_sum 17070.0
duration_nunique 70
duration_mean 193.977273
duration_median 55.5
duration_std 300.671661
duration_std2 300
duration_min 2.0
duration_q25 15.5
duration_q50 55.5
duration_q75 273.5
duration_max 1859.0
when_first 2014-12-15 20:03:00
when_last 2015-01-14 20:47:00
Name: (2015-01, call), dtype: object
>>> result.loc['2015-01']
duration_count duration_sum duration_nunique duration_mean duration_median duration_std duration_std2 duration_min duration_q25 duration_q50 duration_q75 duration_max when_first when_last
item
call 88 17070.000 70 193.977273 55.500 300.671661 300 2.000 15.500 55.500 273.500 1859.000 2014-12-15 20:03:00 2015-01-14 20:47:00
data 31 1067.299 1 34.429000 34.429 0.000000 0 34.429 34.429 34.429 34.429 34.429 2014-12-13 06:58:00 2015-12-01 06:58:00
sms 86 86.000 1 1.000000 1.000 0.000000 0 1.000 1.000 1.000 1.000 1.000 2014-12-15 19:56:00 2015-01-14 23:36:00
>>> result.loc['2015-01'].transpose()
item call data sms
duration_count 88 31 86
duration_sum 17070.0 1067.299 86.0
duration_nunique 70 1 1
duration_mean 193.977273 34.429 1.0
duration_median 55.5 34.429 1.0
duration_std 300.671661 0.0 0.0
duration_std2 300 0 0
duration_min 2.0 34.429 1.0
duration_q25 15.5 34.429 1.0
duration_q50 55.5 34.429 1.0
duration_q75 273.5 34.429 1.0
duration_max 1859.0 34.429 1.0
when_first 2014-12-15 20:03:00 2014-12-13 06:58:00 2014-12-15 19:56:00
when_last 2015-01-14 20:47:00 2015-12-01 06:58:00 2015-01-14 23:36:00
>>> sms = result.index.get_level_values('item') == 'sms'
>>> sms
array([False, False, True, False, False, True, False, False, True,
False, False, True, False, False, True])
>>>
>>> result[sms]
duration_count duration_sum duration_nunique duration_mean duration_median duration_std duration_std2 duration_min duration_q25 duration_q50 duration_q75 duration_max when_first when_last
month item
2014-11 sms 94 94.0 1 1.0 1.0 0.0 0 1.0 1.0 1.0 1.0 1.0 2014-10-16 22:18:00 2014-11-13 22:31:00
2014-12 sms 48 48.0 1 1.0 1.0 0.0 0 1.0 1.0 1.0 1.0 1.0 2014-11-14 17:28:00 2014-07-12 23:22:00
2015-01 sms 86 86.0 1 1.0 1.0 0.0 0 1.0 1.0 1.0 1.0 1.0 2014-12-15 19:56:00 2015-01-14 23:36:00
2015-02 sms 39 39.0 1 1.0 1.0 0.0 0 1.0 1.0 1.0 1.0 1.0 2015-01-15 12:23:00 2015-10-02 21:40:00
2015-03 sms 25 25.0 1 1.0 1.0 0.0 0 1.0 1.0 1.0 1.0 1.0 2015-02-19 18:46:00 2015-03-14 00:16:00
Cross-section:
>>> result.xs('sms', level='item')
duration_count duration_sum duration_nunique duration_mean duration_median duration_std duration_std2 duration_min duration_q25 duration_q50 duration_q75 duration_max when_first when_last
month
2014-11 94 94.0 1 1.0 1.0 0.0 0 1.0 1.0 1.0 1.0 1.0 2014-10-16 22:18:00 2014-11-13 22:31:00
2014-12 48 48.0 1 1.0 1.0 0.0 0 1.0 1.0 1.0 1.0 1.0 2014-11-14 17:28:00 2014-07-12 23:22:00
2015-01 86 86.0 1 1.0 1.0 0.0 0 1.0 1.0 1.0 1.0 1.0 2014-12-15 19:56:00 2015-01-14 23:36:00
2015-02 39 39.0 1 1.0 1.0 0.0 0 1.0 1.0 1.0 1.0 1.0 2015-01-15 12:23:00 2015-10-02 21:40:00
2015-03 25 25.0 1 1.0 1.0 0.0 0 1.0 1.0 1.0 1.0 1.0 2015-02-19 18:46:00 2015-03-14 00:16:00
Slicer Object:
>>> result.loc[(slice(None), 'sms'), :]
duration_count duration_sum duration_nunique duration_mean duration_median duration_std duration_std2 duration_min duration_q25 duration_q50 duration_q75 duration_max when_first when_last
month item
2014-11 sms 94 94.0 1 1.0 1.0 0.0 0 1.0 1.0 1.0 1.0 1.0 2014-10-16 22:18:00 2014-11-13 22:31:00
2014-12 sms 48 48.0 1 1.0 1.0 0.0 0 1.0 1.0 1.0 1.0 1.0 2014-11-14 17:28:00 2014-07-12 23:22:00
2015-01 sms 86 86.0 1 1.0 1.0 0.0 0 1.0 1.0 1.0 1.0 1.0 2014-12-15 19:56:00 2015-01-14 23:36:00
2015-02 sms 39 39.0 1 1.0 1.0 0.0 0 1.0 1.0 1.0 1.0 1.0 2015-01-15 12:23:00 2015-10-02 21:40:00
2015-03 sms 25 25.0 1 1.0 1.0 0.0 0 1.0 1.0 1.0 1.0 1.0 2015-02-19 18:46:00 2015-03-14 00:16:00