5.19. DataFrame GroupBy
.size()
.mean()
.nunique()
.sum()
.count()
.max()
.first()
5.19.1. SetUp
>>> import pandas as pd
>>>
>>>
>>> DATA = 'https://python3.info/_static/phones-en.csv'
>>>
>>> df = pd.read_csv(DATA, parse_dates=['date'])
>>> df.drop(columns='index', inplace=True)
- date
The date and time of the entry
- duration
The duration (in seconds) for each call, the amount of data (in MB) for each data entry, and the number of texts sent (usually 1) for each sms entry
- item
A description of the event occurring – can be one of call, sms, or data
- month
The billing month that each entry belongs to – of form
YYYY-MM
- network
The mobile network that was called/texted for each entry
- network_type
Whether the number being called was a mobile, international ('world'), voicemail, landline, or other ('special') number.
Source [1]
5.19.2. Grouping
Group by one item:
>>> df.groupby('item')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x...>
Group by multiple items:
>>> df.groupby(['month', 'item'])
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x...>
5.19.3. Groupby Methods
Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns
.size()
.mean()
.nunique()
.sum()
.count()
.max()
.first()
5.19.4. Size
>>> df.groupby('item').size()
item
call 388
data 150
sms 292
dtype: int64
>>> df.groupby(['month', 'item']).size()
month item
2014-11 call 107
data 29
sms 94
2014-12 call 79
data 30
sms 48
2015-01 call 88
data 31
sms 86
2015-02 call 67
data 31
sms 39
2015-03 call 47
data 29
sms 25
dtype: int64
5.19.5. Mean
>>> df.groupby('item')['duration'].mean()
item
call 237.940722
data 34.429000
sms 1.000000
Name: duration, dtype: float64
>>> df.groupby(['month', 'item'])['duration'].mean()
month item
2014-11 call 238.757009
data 34.429000
sms 1.000000
2014-12 call 171.658228
data 34.429000
sms 1.000000
2015-01 call 193.977273
data 34.429000
sms 1.000000
2015-02 call 215.164179
data 34.429000
sms 1.000000
2015-03 call 462.276596
data 34.429000
sms 1.000000
Name: duration, dtype: float64
5.19.6. Number of Uniques
>>> df.groupby('item').nunique()
date duration month network network_type
item
call 378 220 5 6 3
data 150 1 5 1 1
sms 222 1 5 6 3
>>> df.groupby(['month', 'item']).nunique()
date duration network network_type
month item
2014-11 call 104 76 6 3
data 29 1 1 1
sms 79 1 5 2
2014-12 call 76 61 6 3
data 30 1 1 1
sms 41 1 5 2
2015-01 call 84 70 6 3
data 31 1 1 1
sms 58 1 4 1
2015-02 call 67 63 6 3
data 31 1 1 1
sms 27 1 5 2
2015-03 call 47 46 6 3
data 29 1 1 1
sms 17 1 4 2
5.19.7. Sum
>>> df.groupby('item')['duration'].sum()
item
call 92321.00
data 5164.35
sms 292.00
Name: duration, dtype: float64
>>> df.groupby(['month', 'item'])['duration'].sum()
month item
2014-11 call 25547.000
data 998.441
sms 94.000
2014-12 call 13561.000
data 1032.870
sms 48.000
2015-01 call 17070.000
data 1067.299
sms 86.000
2015-02 call 14416.000
data 1067.299
sms 39.000
2015-03 call 21727.000
data 998.441
sms 25.000
Name: duration, dtype: float64
5.19.8. Count
>>> df.groupby('item').count()
date duration month network network_type
item
call 388 388 388 388 388
data 150 150 150 150 150
sms 292 292 292 292 292
>>> df.groupby(['month', 'item']).count()
date duration network network_type
month item
2014-11 call 107 107 107 107
data 29 29 29 29
sms 94 94 94 94
2014-12 call 79 79 79 79
data 30 30 30 30
sms 48 48 48 48
2015-01 call 88 88 88 88
data 31 31 31 31
sms 86 86 86 86
2015-02 call 67 67 67 67
data 31 31 31 31
sms 39 39 39 39
2015-03 call 47 47 47 47
data 29 29 29 29
sms 25 25 25 25
5.19.9. Minimum
>>> df.groupby('item').min()
date duration month network network_type
item
call 2014-01-11 15:13:00 1.000 2014-11 Meteor landline
data 2014-01-11 06:58:00 34.429 2014-11 data data
sms 2014-01-12 12:51:00 1.000 2014-11 Meteor mobile
>>> df.groupby(['month', 'item']).min()
date duration network network_type
month item
2014-11 call 2014-01-11 15:13:00 1.000 Meteor landline
data 2014-01-11 06:58:00 34.429 data data
sms 2014-03-11 08:40:00 1.000 Meteor mobile
2014-12 call 2014-02-12 11:40:00 2.000 Meteor landline
data 2014-01-12 06:58:00 34.429 data data
sms 2014-01-12 12:51:00 1.000 Meteor mobile
2015-01 call 2014-12-15 20:03:00 2.000 Meteor landline
data 2014-12-13 06:58:00 34.429 data data
sms 2014-12-15 19:56:00 1.000 Meteor mobile
2015-02 call 2015-01-02 13:33:00 1.000 Meteor landline
data 2015-01-02 06:58:00 34.429 data data
sms 2015-01-15 12:23:00 1.000 Meteor mobile
2015-03 call 2015-01-03 12:19:00 2.000 Meteor landline
data 2015-01-03 06:58:00 34.429 data data
sms 2015-02-03 09:19:00 1.000 Tesco mobile
5.19.10. Maximum
>>> df.groupby('item').max()
date duration month network network_type
item
call 2015-12-02 20:51:00 10528.000 2015-03 voicemail voicemail
data 2015-12-03 06:58:00 34.429 2015-03 data data
sms 2015-12-01 18:26:00 1.000 2015-03 world world
>>> df.groupby(['month', 'item']).max()
date duration network network_type
month item
2014-11 call 2014-12-11 19:01:00 1940.000 voicemail voicemail
data 2014-12-11 06:58:00 34.429 data data
sms 2014-12-11 19:20:00 1.000 special special
2014-12 call 2014-12-14 19:54:00 2120.000 voicemail voicemail
data 2014-12-12 06:58:00 34.429 data data
sms 2014-11-30 14:44:00 1.000 world world
2015-01 call 2015-12-01 18:23:00 1859.000 voicemail voicemail
data 2015-12-01 06:58:00 34.429 data data
sms 2015-12-01 18:26:00 1.000 Vodafone mobile
2015-02 call 2015-09-02 17:54:00 1863.000 voicemail voicemail
data 2015-12-02 06:58:00 34.429 data data
sms 2015-10-02 21:40:00 1.000 special special
2015-03 call 2015-12-02 20:51:00 10528.000 voicemail voicemail
data 2015-12-03 06:58:00 34.429 data data
sms 2015-04-03 10:30:00 1.000 world world
5.19.11. First
>>> df.groupby('item').first()
date duration month network network_type
item
call 2014-10-15 06:58:00 13.000 2014-11 Vodafone mobile
data 2014-10-15 06:58:00 34.429 2014-11 data data
sms 2014-10-16 22:18:00 1.000 2014-11 Meteor mobile
>>> df.groupby(['month', 'item']).first()
date duration network network_type
month item
2014-11 call 2014-10-15 06:58:00 13.000 Vodafone mobile
data 2014-10-15 06:58:00 34.429 data data
sms 2014-10-16 22:18:00 1.000 Meteor mobile
2014-12 call 2014-11-14 17:24:00 124.000 voicemail voicemail
data 2014-11-13 06:58:00 34.429 data data
sms 2014-11-14 17:28:00 1.000 Vodafone mobile
2015-01 call 2014-12-15 20:03:00 4.000 Three mobile
data 2014-12-13 06:58:00 34.429 data data
sms 2014-12-15 19:56:00 1.000 Three mobile
2015-02 call 2015-01-15 10:36:00 28.000 Three mobile
data 2015-01-13 06:58:00 34.429 data data
sms 2015-01-15 12:23:00 1.000 special special
2015-03 call 2015-12-02 20:15:00 69.000 landline landline
data 2015-02-13 06:58:00 34.429 data data
sms 2015-02-19 18:46:00 1.000 Vodafone mobile
5.19.12. Last
>>> df.groupby('item').last()
date duration month network network_type
item
call 2015-04-03 12:29:00 10528.000 2015-03 landline landline
data 2015-03-13 06:58:00 34.429 2015-03 data data
sms 2015-03-14 00:16:00 1.000 2015-03 world world
>>> df.groupby(['month', 'item']).last()
date duration network network_type
month item
2014-11 call 2014-12-11 19:01:00 7.000 Vodafone mobile
data 2014-12-11 06:58:00 34.429 data data
sms 2014-11-13 22:31:00 1.000 Vodafone mobile
2014-12 call 2014-12-14 19:54:00 25.000 Three mobile
data 2014-12-12 06:58:00 34.429 data data
sms 2014-07-12 23:22:00 1.000 world world
2015-01 call 2015-01-14 20:47:00 36.000 Three mobile
data 2015-12-01 06:58:00 34.429 data data
sms 2015-01-14 23:36:00 1.000 Three mobile
2015-02 call 2015-09-02 17:54:00 89.000 Three mobile
data 2015-12-02 06:58:00 34.429 data data
sms 2015-10-02 21:40:00 1.000 Vodafone mobile
2015-03 call 2015-04-03 12:29:00 10528.000 landline landline
data 2015-03-13 06:58:00 34.429 data data
sms 2015-03-14 00:16:00 1.000 world world
5.19.13. Output format
Series or DataFrame?
Produces Pandas Series:
>>> df.groupby('month')['duration'].sum()
month
2014-11 26639.441
2014-12 14641.870
2015-01 18223.299
2015-02 15522.299
2015-03 22750.441
Name: duration, dtype: float64
Produces Pandas DataFrame:
>>> df.groupby('month')[['duration']].sum()
duration
month
2014-11 26639.441
2014-12 14641.870
2015-01 18223.299
2015-02 15522.299
2015-03 22750.441
5.19.14. Use Case - 1
>>> list(df.groupby(['month']).groups.keys())
['2014-11', '2014-12', '2015-01', '2015-02', '2015-03']
>>> len(df.groupby(['month']).groups['2014-11'])
230
5.19.15. Use Case - 2
Get the first entry for each month:
>>> df.groupby('month').first()
date duration item network network_type
month
2014-11 2014-10-15 06:58:00 34.429 data data data
2014-12 2014-11-13 06:58:00 34.429 data data data
2015-01 2014-12-13 06:58:00 34.429 data data data
2015-02 2015-01-13 06:58:00 34.429 data data data
2015-03 2015-12-02 20:15:00 69.000 call landline landline
5.19.16. Use Case - 3
Get the sum of the durations per month:
>>> df.groupby('month')['duration'].sum()
month
2014-11 26639.441
2014-12 14641.870
2015-01 18223.299
2015-02 15522.299
2015-03 22750.441
Name: duration, dtype: float64
5.19.17. Use Case - 4
Get the number of dates / entries in each month:
>>> df.groupby('month')['date'].count()
month
2014-11 230
2014-12 157
2015-01 205
2015-02 137
2015-03 101
Name: date, dtype: int64
5.19.18. Use Case - 5
What is the sum of durations, for calls only, to each network:
>>> df.loc[df['item'] == 'call'].groupby('network')['duration'].sum()
network
Meteor 7200.0
Tesco 13828.0
Three 36464.0
Vodafone 14621.0
landline 18433.0
voicemail 1775.0
Name: duration, dtype: float64
5.19.19. Use Case - 6
How many calls, sms, and data entries are in each month?:
>>> df.groupby(['month', 'item'])['date'].count()
month item
2014-11 call 107
data 29
sms 94
2014-12 call 79
data 30
sms 48
2015-01 call 88
data 31
sms 86
2015-02 call 67
data 31
sms 39
2015-03 call 47
data 29
sms 25
Name: date, dtype: int64
5.19.20. Use Case - 7
How many calls, texts, and data are sent per month, split by network_type?:
>>> df.groupby(['month', 'network_type'])['date'].count()
month network_type
2014-11 data 29
landline 5
mobile 189
special 1
voicemail 6
2014-12 data 30
landline 7
mobile 108
voicemail 8
world 4
2015-01 data 31
landline 11
mobile 160
voicemail 3
2015-02 data 31
landline 8
mobile 90
special 2
voicemail 6
2015-03 data 29
landline 11
mobile 54
voicemail 4
world 3
Name: date, dtype: int64
5.19.21. Datasets
5.19.22. References
5.19.23. Assignments
# %% License
# - Copyright 2025, Matt Harasymczuk <matt@python3.info>
# - This code can be used only for learning by humans
# - This code cannot be used for teaching others
# - This code cannot be used for teaching LLMs and AI algorithms
# - This code cannot be used in commercial or proprietary products
# - This code cannot be distributed in any form
# - This code cannot be changed in any form outside of training course
# - This code cannot have its license changed
# - If you use this code in your product, you must open-source it under GPLv2
# - Exception can be granted only by the author
# %% Run
# - PyCharm: right-click in the editor and `Run Doctest in ...`
# - PyCharm: keyboard shortcut `Control + Shift + F10`
# - Terminal: `python -m doctest -v myfile.py`
# %% About
# - Name: DataFrame Groupby Phones
# - Difficulty: easy
# - Lines: 5
# - Minutes: 8
# %% English
# 1. Read data from `DATA` as `df: pd.DataFrame`
# 2. Give information about total number of all phone calls for each calendar month
# 3. Run doctests - all must succeed
# %% Polish
# 1. Wczytaj dane z `DATA` jako `df: pd.DataFrame`
# 2. Podaj informacje o łącznej liczbie wszystkich połączeń telefonicznych dla każdego miesiąca kalendarzowego
# 3. Uruchom doctesty - wszystkie muszą się powieść
# %% Tests
"""
>>> import sys; sys.tracebacklimit = 0
>>> assert sys.version_info >= (3, 9), \
'Python 3.9+ required'
>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)
>>> assert result is not Ellipsis, \
'Assign result to variable: `result`'
>>> assert type(result) is pd.Series, \
'Variable `result` must be a `pd.Series` type'
>>> result # doctest: +NORMALIZE_WHITESPACE
year month
1999 10 16309.0
11 16780.0
12 14861.0
2000 1 18705.0
2 11019.0
3 14647.0
Name: duration, dtype: float64
"""
import pandas as pd
DATA = 'https://python3.info/_static/phones-pl.csv'
# type: pd.DataFrame
result = ...
# %% License
# - Copyright 2025, Matt Harasymczuk <matt@python3.info>
# - This code can be used only for learning by humans
# - This code cannot be used for teaching others
# - This code cannot be used for teaching LLMs and AI algorithms
# - This code cannot be used in commercial or proprietary products
# - This code cannot be distributed in any form
# - This code cannot be changed in any form outside of training course
# - This code cannot have its license changed
# - If you use this code in your product, you must open-source it under GPLv2
# - Exception can be granted only by the author
# %% Run
# - PyCharm: right-click in the editor and `Run Doctest in ...`
# - PyCharm: keyboard shortcut `Control + Shift + F10`
# - Terminal: `python -m doctest -v myfile.py`
# %% About
# - Name: DataFrame Groupby FemaleTop
# - Difficulty: medium
# - Lines: 5
# - Minutes: 8
# %% English
# 1. Read data from `DATA` as `df: pd.DataFrame`
# 2. Which nationality has the most flight time of a female in space?
# 3. Sort the result in descending order
# 4. Run doctests - all must succeed
# %% Polish
# 1. Wczytaj dane z `DATA` jako `df: pd.DataFrame`
# 2. Który kraj ma największy nalot kobiet w kosmosie?
# 3. Posortuj wynik malejąco
# 4. Uruchom doctesty - wszystkie muszą się powieść
# %% Tests
"""
>>> import sys; sys.tracebacklimit = 0
>>> assert sys.version_info >= (3, 9), \
'Python 3.9+ required'
>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)
>>> assert result is not Ellipsis, \
'Assign result to variable: `result`'
>>> assert type(result) is pd.Series, \
'Variable `result` must be a `pd.DataFrame` type'
>>> result # doctest: +NORMALIZE_WHITESPACE
Nationality
American 124
Russian 6
Canadian 3
Japanese 3
Chinese 2
French 2
British 1
Italian 1
South Korean 1
Name: Flights, dtype: int64
"""
import pandas as pd
DATA = 'https://python3.info/_static/astro-gender.csv'
# type: pd.Series
result = ...
# %% License
# - Copyright 2025, Matt Harasymczuk <matt@python3.info>
# - This code can be used only for learning by humans
# - This code cannot be used for teaching others
# - This code cannot be used for teaching LLMs and AI algorithms
# - This code cannot be used in commercial or proprietary products
# - This code cannot be distributed in any form
# - This code cannot be changed in any form outside of training course
# - This code cannot have its license changed
# - If you use this code in your product, you must open-source it under GPLv2
# - Exception can be granted only by the author
# %% Run
# - PyCharm: right-click in the editor and `Run Doctest in ...`
# - PyCharm: keyboard shortcut `Control + Shift + F10`
# - Terminal: `python -m doctest -v myfile.py`
# %% About
# - Name: DataFrame Groupby AstronautTop10
# - Difficulty: medium
# - Lines: 5
# - Minutes: 13
# %% English
# 1. Read data from `DATA`
# 2. Create ranking of astronauts with most fligts
# 3. Define `result: pd.Dataframe` with top 9
# 4. Sort by `flights` (descending) and `name` (ascending)
# 5. Run doctests - all must succeed
# %% Polish
# 1. Wczytaj dane z `DATA`
# 2. Stwórz ranking astronautów z największą liczbą lotów
# 3. Zdefiniuj `result: pd.Dataframe` z top 9
# 4. Posortuj po `flights` (malejąco) i `name` (rosnąco)
# 5. Uruchom doctesty - wszystkie muszą się powieść
# %% Tests
"""
>>> import sys; sys.tracebacklimit = 0
>>> assert sys.version_info >= (3, 9), \
'Python 3.9+ required'
>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)
>>> assert result is not Ellipsis, \
'Assign result to variable: `result`'
>>> assert type(result) is pd.DataFrame, \
'Variable `result` must be a `pd.DataFrame` type'
>>> result.reset_index(drop=True)
name flights
0 Chang-Diaz, Franklin R. 7
1 Ross, Jerry L. 7
2 Brown, Curtis L., Jr. 6
3 Foale, C. Michael 6
4 Krikalev, Sergei 6
5 Malenchenko, Yuri 6
6 Musgrave, Franklin Story 6
7 Wetherbee, James D. 6
8 Young, John W. 6
"""
import pandas as pd
DATA = 'https://python3.info/_static/astro-selection.csv'
# Read data from `DATA`
# Create ranking of astronauts with most fligts
# Define `result: pd.Dataframe` with top 9
# Sort by `flights` (descending) and `name` (ascending)
# Run doctests - all must succeed
# type: pd.Series
result = ...
# FIXME: za trudne zadanie, przenieść je do case study
# %% License
# - Copyright 2025, Matt Harasymczuk <matt@python3.info>
# - This code can be used only for learning by humans
# - This code cannot be used for teaching others
# - This code cannot be used for teaching LLMs and AI algorithms
# - This code cannot be used in commercial or proprietary products
# - This code cannot be distributed in any form
# - This code cannot be changed in any form outside of training course
# - This code cannot have its license changed
# - If you use this code in your product, you must open-source it under GPLv2
# - Exception can be granted only by the author
# %% Run
# - PyCharm: right-click in the editor and `Run Doctest in ...`
# - PyCharm: keyboard shortcut `Control + Shift + F10`
# - Terminal: `python -m doctest -v myfile.py`
# %% About
# - Name: DataFrame Groupby Astro EVA
# - Difficulty: medium
# - Lines: 13
# - Minutes: 21
# %% English
# 1. Read data from `DATA` as `df: pd.DataFrame`
# 2. Create top 10 ranking of astronauts with the most time spent on EVA (ExtraVehicular Activity)
# 3. Run doctests - all must succeed
# %% Polish
# 1. Wczytaj dane z `DATA` jako `df: pd.DataFrame`
# 2. Stwórz ranking top 10 astronautów z największym czasem EVA (Spacerów kosmicznych)
# 3. Uruchom doctesty - wszystkie muszą się powieść
# %% Hints
# - Note, that file delimiter is semicolon ";" (not comma)
# - Parse CSV and replace newlines inside fields with `","`
# - Split names into separate columns for each spacewalker (first, second, third)
# - Split names into separate rows for each spacewalker (use ffill)
# - Split times into separate columns (hours, minutes)
# - `pd.Series.str.split()` with `expand=True`
# - `pd.DataFrame.melt()`
# - `pd.DataFrame.set_index()`
# - `pd.Series.astype()`
# %% Tests
"""
>>> import sys; sys.tracebacklimit = 0
>>> assert sys.version_info >= (3, 9), \
'Python 3.9+ required'
>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)
>>> assert result is not Ellipsis, \
'Assign result to variable: `result`'
>>> assert type(result) is pd.DataFrame, \
'Variable `result` must be a `pd.DataFrame` type'
>>> result # doctest: +NORMALIZE_WHITESPACE
Duration
Astronaut
Anatoliy Solovyov 3 days 06:48:00
Michael Lopez-Alegria 2 days 19:40:00
Peggy Whitson 2 days 12:21:00
Fyodor Yurchikhin 2 days 11:29:00
Jerry Ross 2 days 10:38:00
John Grunsfeld 2 days 10:30:00
Richard Mastracchio 2 days 05:04:00
Sunita Williams 2 days 02:40:00
Stephen Smith 2 days 01:48:00
Edward Fincke 2 days 00:36:00
"""
import pandas as pd
DATA = 'https://python3.info/_static/astro-eva.csv'
# type: pd.DataFrame
result = ...