5.19. DataFrame GroupBy

.size()
.mean()
.nunique()
.sum()
.count()
.max()
.first()

5.19.1. SetUp

>>> import pandas as pd
>>>
>>>
>>> DATA = 'https://python3.info/_static/phones-en.csv'
>>>
>>> df = pd.read_csv(DATA, parse_dates=['date'])
>>> df.drop(columns='index', inplace=True)

date: The date and time of the entry
duration: The duration (in seconds) for each call, the amount of data (in MB) for each data entry, and the number of texts sent (usually 1) for each sms entry
item: A description of the event occurring – can be one of call, sms, or data
month: The billing month that each entry belongs to – of form YYYY-MM
network: The mobile network that was called/texted for each entry
network_type: Whether the number being called was a mobile, international ('world'), voicemail, landline, or other ('special') number.

Source [1]

5.19.2. Grouping

Group by one item:

>>> df.groupby('item')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x...>

Group by multiple items:

>>> df.groupby(['month', 'item'])
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x...>

5.19.3. Groupby Methods

Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns
.size()
.mean()
.nunique()
.sum()
.count()
.max()
.first()

5.19.4. Size

>>> df.groupby('item').size()
item
call    388
data    150
sms     292
dtype: int64

>>> df.groupby(['month', 'item']).size()
month    item
2014-11  call    107
         data     29
         sms      94
2014-12  call     79
         data     30
         sms      48
2015-01  call     88
         data     31
         sms      86
2015-02  call     67
         data     31
         sms      39
2015-03  call     47
         data     29
         sms      25
dtype: int64

5.19.5. Mean

>>> df.groupby('item')['duration'].mean()
item
call    237.940722
data     34.429000
sms       1.000000
Name: duration, dtype: float64

>>> df.groupby(['month', 'item'])['duration'].mean()
month    item
2014-11  call    238.757009
         data     34.429000
         sms       1.000000
2014-12  call    171.658228
         data     34.429000
         sms       1.000000
2015-01  call    193.977273
         data     34.429000
         sms       1.000000
2015-02  call    215.164179
         data     34.429000
         sms       1.000000
2015-03  call    462.276596
         data     34.429000
         sms       1.000000
Name: duration, dtype: float64

5.19.6. Number of Uniques

>>> df.groupby('item').nunique()
      date  duration  month  network  network_type
item
call   378       220      5        6             3
data   150         1      5        1             1
sms    222         1      5        6             3

>>> df.groupby(['month', 'item']).nunique()
              date  duration  network  network_type
month   item
2014-11 call   104        76        6             3
        data    29         1        1             1
        sms     79         1        5             2
2014-12 call    76        61        6             3
        data    30         1        1             1
        sms     41         1        5             2
2015-01 call    84        70        6             3
        data    31         1        1             1
        sms     58         1        4             1
2015-02 call    67        63        6             3
        data    31         1        1             1
        sms     27         1        5             2
2015-03 call    47        46        6             3
        data    29         1        1             1
        sms     17         1        4             2

5.19.7. Sum

>>> df.groupby('item')['duration'].sum()
item
call    92321.00
data     5164.35
sms       292.00
Name: duration, dtype: float64

>>> df.groupby(['month', 'item'])['duration'].sum()
month    item
2014-11  call    25547.000
         data      998.441
         sms        94.000
2014-12  call    13561.000
         data     1032.870
         sms        48.000
2015-01  call    17070.000
         data     1067.299
         sms        86.000
2015-02  call    14416.000
         data     1067.299
         sms        39.000
2015-03  call    21727.000
         data      998.441
         sms        25.000
Name: duration, dtype: float64

5.19.8. Count

>>> df.groupby('item').count()
      date  duration  month  network  network_type
item
call   388       388    388      388           388
data   150       150    150      150           150
sms    292       292    292      292           292

>>> df.groupby(['month', 'item']).count()
              date  duration  network  network_type
month   item
2014-11 call   107       107      107           107
        data    29        29       29            29
        sms     94        94       94            94
2014-12 call    79        79       79            79
        data    30        30       30            30
        sms     48        48       48            48
2015-01 call    88        88       88            88
        data    31        31       31            31
        sms     86        86       86            86
2015-02 call    67        67       67            67
        data    31        31       31            31
        sms     39        39       39            39
2015-03 call    47        47       47            47
        data    29        29       29            29
        sms     25        25       25            25

5.19.9. Minimum

>>> df.groupby('item').min()
                    date  duration    month network network_type
item
call 2014-01-11 15:13:00     1.000  2014-11  Meteor     landline
data 2014-01-11 06:58:00    34.429  2014-11    data         data
sms  2014-01-12 12:51:00     1.000  2014-11  Meteor       mobile

>>> df.groupby(['month', 'item']).min()
                            date  duration network network_type
month   item
2014-11 call 2014-01-11 15:13:00     1.000  Meteor     landline
        data 2014-01-11 06:58:00    34.429    data         data
        sms  2014-03-11 08:40:00     1.000  Meteor       mobile
2014-12 call 2014-02-12 11:40:00     2.000  Meteor     landline
        data 2014-01-12 06:58:00    34.429    data         data
        sms  2014-01-12 12:51:00     1.000  Meteor       mobile
2015-01 call 2014-12-15 20:03:00     2.000  Meteor     landline
        data 2014-12-13 06:58:00    34.429    data         data
        sms  2014-12-15 19:56:00     1.000  Meteor       mobile
2015-02 call 2015-01-02 13:33:00     1.000  Meteor     landline
        data 2015-01-02 06:58:00    34.429    data         data
        sms  2015-01-15 12:23:00     1.000  Meteor       mobile
2015-03 call 2015-01-03 12:19:00     2.000  Meteor     landline
        data 2015-01-03 06:58:00    34.429    data         data
        sms  2015-02-03 09:19:00     1.000   Tesco       mobile

5.19.10. Maximum

>>> df.groupby('item').max()
                    date   duration    month    network network_type
item
call 2015-12-02 20:51:00  10528.000  2015-03  voicemail    voicemail
data 2015-12-03 06:58:00     34.429  2015-03       data         data
sms  2015-12-01 18:26:00      1.000  2015-03      world        world

>>> df.groupby(['month', 'item']).max()
                            date   duration    network network_type
month   item
2014-11 call 2014-12-11 19:01:00   1940.000  voicemail    voicemail
        data 2014-12-11 06:58:00     34.429       data         data
        sms  2014-12-11 19:20:00      1.000    special      special
2014-12 call 2014-12-14 19:54:00   2120.000  voicemail    voicemail
        data 2014-12-12 06:58:00     34.429       data         data
        sms  2014-11-30 14:44:00      1.000      world        world
2015-01 call 2015-12-01 18:23:00   1859.000  voicemail    voicemail
        data 2015-12-01 06:58:00     34.429       data         data
        sms  2015-12-01 18:26:00      1.000   Vodafone       mobile
2015-02 call 2015-09-02 17:54:00   1863.000  voicemail    voicemail
        data 2015-12-02 06:58:00     34.429       data         data
        sms  2015-10-02 21:40:00      1.000    special      special
2015-03 call 2015-12-02 20:51:00  10528.000  voicemail    voicemail
        data 2015-12-03 06:58:00     34.429       data         data
        sms  2015-04-03 10:30:00      1.000      world        world

5.19.11. First

>>> df.groupby('item').first()
                    date  duration    month   network network_type
item
call 2014-10-15 06:58:00    13.000  2014-11  Vodafone       mobile
data 2014-10-15 06:58:00    34.429  2014-11      data         data
sms  2014-10-16 22:18:00     1.000  2014-11    Meteor       mobile

>>> df.groupby(['month', 'item']).first()
                            date  duration    network network_type
month   item
2014-11 call 2014-10-15 06:58:00    13.000   Vodafone       mobile
        data 2014-10-15 06:58:00    34.429       data         data
        sms  2014-10-16 22:18:00     1.000     Meteor       mobile
2014-12 call 2014-11-14 17:24:00   124.000  voicemail    voicemail
        data 2014-11-13 06:58:00    34.429       data         data
        sms  2014-11-14 17:28:00     1.000   Vodafone       mobile
2015-01 call 2014-12-15 20:03:00     4.000      Three       mobile
        data 2014-12-13 06:58:00    34.429       data         data
        sms  2014-12-15 19:56:00     1.000      Three       mobile
2015-02 call 2015-01-15 10:36:00    28.000      Three       mobile
        data 2015-01-13 06:58:00    34.429       data         data
        sms  2015-01-15 12:23:00     1.000    special      special
2015-03 call 2015-12-02 20:15:00    69.000   landline     landline
        data 2015-02-13 06:58:00    34.429       data         data
        sms  2015-02-19 18:46:00     1.000   Vodafone       mobile

5.19.12. Last

>>> df.groupby('item').last()
                    date   duration    month   network network_type
item
call 2015-04-03 12:29:00  10528.000  2015-03  landline     landline
data 2015-03-13 06:58:00     34.429  2015-03      data         data
sms  2015-03-14 00:16:00      1.000  2015-03     world        world

>>> df.groupby(['month', 'item']).last()
                            date   duration   network network_type
month   item
2014-11 call 2014-12-11 19:01:00      7.000  Vodafone       mobile
        data 2014-12-11 06:58:00     34.429      data         data
        sms  2014-11-13 22:31:00      1.000  Vodafone       mobile
2014-12 call 2014-12-14 19:54:00     25.000     Three       mobile
        data 2014-12-12 06:58:00     34.429      data         data
        sms  2014-07-12 23:22:00      1.000     world        world
2015-01 call 2015-01-14 20:47:00     36.000     Three       mobile
        data 2015-12-01 06:58:00     34.429      data         data
        sms  2015-01-14 23:36:00      1.000     Three       mobile
2015-02 call 2015-09-02 17:54:00     89.000     Three       mobile
        data 2015-12-02 06:58:00     34.429      data         data
        sms  2015-10-02 21:40:00      1.000  Vodafone       mobile
2015-03 call 2015-04-03 12:29:00  10528.000  landline     landline
        data 2015-03-13 06:58:00     34.429      data         data
        sms  2015-03-14 00:16:00      1.000     world        world

5.19.13. Output format

Series or DataFrame?

Produces Pandas Series:

>>> df.groupby('month')['duration'].sum()
month
2014-11    26639.441
2014-12    14641.870
2015-01    18223.299
2015-02    15522.299
2015-03    22750.441
Name: duration, dtype: float64

Produces Pandas DataFrame:

>>> df.groupby('month')[['duration']].sum()
          duration
month
2014-11  26639.441
2014-12  14641.870
2015-01  18223.299
2015-02  15522.299
2015-03  22750.441

5.19.14. Use Case - 1

>>> list(df.groupby(['month']).groups.keys())
['2014-11', '2014-12', '2015-01', '2015-02', '2015-03']

>>> len(df.groupby(['month']).groups['2014-11'])
230

5.19.15. Use Case - 2

Get the first entry for each month:

>>> df.groupby('month').first()
                       date  duration  item   network network_type
month
2014-11 2014-10-15 06:58:00    34.429  data      data         data
2014-12 2014-11-13 06:58:00    34.429  data      data         data
2015-01 2014-12-13 06:58:00    34.429  data      data         data
2015-02 2015-01-13 06:58:00    34.429  data      data         data
2015-03 2015-12-02 20:15:00    69.000  call  landline     landline

5.19.16. Use Case - 3

Get the sum of the durations per month:

>>> df.groupby('month')['duration'].sum()
month
2014-11    26639.441
2014-12    14641.870
2015-01    18223.299
2015-02    15522.299
2015-03    22750.441
Name: duration, dtype: float64

5.19.17. Use Case - 4

Get the number of dates / entries in each month:

>>> df.groupby('month')['date'].count()
month
2014-11    230
2014-12    157
2015-01    205
2015-02    137
2015-03    101
Name: date, dtype: int64

5.19.18. Use Case - 5

What is the sum of durations, for calls only, to each network:

>>> df.loc[df['item'] == 'call'].groupby('network')['duration'].sum()
network
Meteor        7200.0
Tesco        13828.0
Three        36464.0
Vodafone     14621.0
landline     18433.0
voicemail     1775.0
Name: duration, dtype: float64

5.19.19. Use Case - 6

How many calls, sms, and data entries are in each month?:

>>> df.groupby(['month', 'item'])['date'].count()
month    item
2014-11  call    107
         data     29
         sms      94
2014-12  call     79
         data     30
         sms      48
2015-01  call     88
         data     31
         sms      86
2015-02  call     67
         data     31
         sms      39
2015-03  call     47
         data     29
         sms      25
Name: date, dtype: int64

5.19.20. Use Case - 7

How many calls, texts, and data are sent per month, split by network_type?:

>>> df.groupby(['month', 'network_type'])['date'].count()
month    network_type
2014-11  data             29
         landline          5
         mobile          189
         special           1
         voicemail         6
2014-12  data             30
         landline          7
         mobile          108
         voicemail         8
         world             4
2015-01  data             31
         landline         11
         mobile          160
         voicemail         3
2015-02  data             31
         landline          8
         mobile           90
         special           2
         voicemail         6
2015-03  data             29
         landline         11
         mobile           54
         voicemail         4
         world             3
Name: date, dtype: int64

5.19.21. Datasets

5.19.22. References

5.19.23. Assignments

# %% About
# - Name: DataFrame Groupby Phones
# - Difficulty: easy
# - Lines: 5
# - Minutes: 8

# %% License
# - Copyright 2025, Matt Harasymczuk <matt@python3.info>
# - This code can be used only for learning by humans
# - This code cannot be used for teaching others
# - This code cannot be used for teaching LLMs and AI algorithms
# - This code cannot be used in commercial or proprietary products
# - This code cannot be distributed in any form
# - This code cannot be changed in any form outside of training course
# - This code cannot have its license changed
# - If you use this code in your product, you must open-source it under GPLv2
# - Exception can be granted only by the author

# %% English
# 1. Read data from `DATA` as `df: pd.DataFrame`
# 2. Give information about total duration of all phone calls for each calendar month
# 3. Run doctests - all must succeed

# %% Polish
# 1. Wczytaj dane z `DATA` jako `df: pd.DataFrame`
# 2. Podaj informacje o łącznej długości wszystkich połączeń telefonicznych dla każdego miesiąca kalendarzowego
# 3. Uruchom doctesty - wszystkie muszą się powieść

# %% Doctests
"""
>>> import sys; sys.tracebacklimit = 0

>>> assert sys.version_info >= (3, 9), \
'Python has an is invalid version; expected: `3.9` or newer.'

>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)

>>> assert result is not Ellipsis, \
'Variable `result` has an invalid value; assign result of your program to it.'

>>> assert type(result) is pd.Series, \
'Variable `result` has an invalid type; expected: `pd.Series`.'

>>> result  # doctest: +NORMALIZE_WHITESPACE
year  month
1999  10       16309.0
      11       16780.0
      12       14861.0
2000  1        18705.0
      2        11019.0
      3        14647.0
Name: duration, dtype: float64
"""

# %% Run
# - PyCharm: right-click in the editor and `Run Doctest in ...`
# - PyCharm: keyboard shortcut `Control + Shift + F10`
# - Terminal: `python -m doctest -f -v myfile.py`

# %% Imports
import pandas as pd

# %% Types
result: pd.DataFrame

# %% Data
DATA = 'https://python3.info/_static/phones-pl.csv'

# %% Result
result = ...

# %% About
# - Name: DataFrame Groupby FemaleTop
# - Difficulty: medium
# - Lines: 5
# - Minutes: 8

# %% License
# - Copyright 2025, Matt Harasymczuk <matt@python3.info>
# - This code can be used only for learning by humans
# - This code cannot be used for teaching others
# - This code cannot be used for teaching LLMs and AI algorithms
# - This code cannot be used in commercial or proprietary products
# - This code cannot be distributed in any form
# - This code cannot be changed in any form outside of training course
# - This code cannot have its license changed
# - If you use this code in your product, you must open-source it under GPLv2
# - Exception can be granted only by the author

# %% English
# 1. Read data from `DATA` as `df: pd.DataFrame`
# 2. Which nationality has the most flight time of a female in space?
# 3. Sort the result in descending order
# 4. Run doctests - all must succeed

# %% Polish
# 1. Wczytaj dane z `DATA` jako `df: pd.DataFrame`
# 2. Który kraj ma największy nalot kobiet w kosmosie?
# 3. Posortuj wynik malejąco
# 4. Uruchom doctesty - wszystkie muszą się powieść

# %% Doctests
"""
>>> import sys; sys.tracebacklimit = 0

>>> assert sys.version_info >= (3, 9), \
'Python has an is invalid version; expected: `3.9` or newer.'

>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)

>>> assert result is not Ellipsis, \
'Variable `result` has an invalid value; assign result of your program to it.'

>>> assert type(result) is pd.Series, \
'Variable `result` has an invalid type; expected: `pd.Series`.'

>>> result  # doctest: +NORMALIZE_WHITESPACE
Nationality
American        124
Russian           6
Canadian          3
Japanese          3
Chinese           2
French            2
British           1
Italian           1
South Korean      1
Name: Flights, dtype: int64
"""

# %% Run
# - PyCharm: right-click in the editor and `Run Doctest in ...`
# - PyCharm: keyboard shortcut `Control + Shift + F10`
# - Terminal: `python -m doctest -f -v myfile.py`

# %% Imports
import pandas as pd

# %% Types
result: pd.Series

# %% Data
DATA = 'https://python3.info/_static/astro-gender.csv'

# %% Result
result = ...

# %% About
# - Name: DataFrame Groupby AstronautTop10
# - Difficulty: medium
# - Lines: 5
# - Minutes: 13

# %% License
# - Copyright 2025, Matt Harasymczuk <matt@python3.info>
# - This code can be used only for learning by humans
# - This code cannot be used for teaching others
# - This code cannot be used for teaching LLMs and AI algorithms
# - This code cannot be used in commercial or proprietary products
# - This code cannot be distributed in any form
# - This code cannot be changed in any form outside of training course
# - This code cannot have its license changed
# - If you use this code in your product, you must open-source it under GPLv2
# - Exception can be granted only by the author

# %% English
# 1. Read data from `DATA`
# 2. Create ranking of astronauts with most flights
# 3. Define `result: pd.Dataframe` with top 9
# 4. Sort by `flights` (descending) and `name` (ascending)
# 5. Run doctests - all must succeed

# %% Polish
# 1. Wczytaj dane z `DATA`
# 2. Stwórz ranking astronautów z największą liczbą lotów
# 3. Zdefiniuj `result: pd.Dataframe` z top 9
# 4. Posortuj po `flights` (malejąco) i `name` (rosnąco)
# 5. Uruchom doctesty - wszystkie muszą się powieść

# %% Doctests
"""
>>> import sys; sys.tracebacklimit = 0

>>> assert sys.version_info >= (3, 9), \
'Python has an is invalid version; expected: `3.9` or newer.'

>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)

>>> assert result is not Ellipsis, \
'Variable `result` has an invalid value; assign result of your program to it.'

>>> assert type(result) is pd.DataFrame, \
'Variable `result` has an invalid type; expected: `pd.DataFrame`.'

>>> result.reset_index(drop=True)
                       name  flights
0   Chang-Diaz, Franklin R.        7
1            Ross, Jerry L.        7
2     Brown, Curtis L., Jr.        6
3         Foale, C. Michael        6
4          Krikalev, Sergei        6
5         Malenchenko, Yuri        6
6  Musgrave, Franklin Story        6
7       Wetherbee, James D.        6
8            Young, John W.        6
"""

# %% Run
# - PyCharm: right-click in the editor and `Run Doctest in ...`
# - PyCharm: keyboard shortcut `Control + Shift + F10`
# - Terminal: `python -m doctest -f -v myfile.py`

# %% Imports
import pandas as pd

# %% Types
result: pd.Series

# %% Data
DATA = 'https://python3.info/_static/astro-selection.csv'

# %% Result
result = ...

# FIXME: za trudne zadanie, przenieść je do case study

# %% About
# - Name: DataFrame Groupby Astro EVA
# - Difficulty: medium
# - Lines: 13
# - Minutes: 21

# %% License
# - Copyright 2025, Matt Harasymczuk <matt@python3.info>
# - This code can be used only for learning by humans
# - This code cannot be used for teaching others
# - This code cannot be used for teaching LLMs and AI algorithms
# - This code cannot be used in commercial or proprietary products
# - This code cannot be distributed in any form
# - This code cannot be changed in any form outside of training course
# - This code cannot have its license changed
# - If you use this code in your product, you must open-source it under GPLv2
# - Exception can be granted only by the author

# %% English
# 1. Read data from `DATA` as `df: pd.DataFrame`
# 2. Create top 10 ranking of astronauts with the most time spent on EVA (ExtraVehicular Activity)
# 3. Run doctests - all must succeed

# %% Polish
# 1. Wczytaj dane z `DATA` jako `df: pd.DataFrame`
# 2. Stwórz ranking top 10 astronautów z największym czasem EVA (Spacerów kosmicznych)
# 3. Uruchom doctesty - wszystkie muszą się powieść

# %% Hints
# - Note, that file delimiter is semicolon ";" (not comma)
# - Parse CSV and replace newlines inside fields with `","`
# - Split names into separate columns for each spacewalker (first, second, third)
# - Split names into separate rows for each spacewalker (use ffill)
# - Split times into separate columns (hours, minutes)
# - `pd.Series.str.split()` with `expand=True`
# - `pd.DataFrame.melt()`
# - `pd.DataFrame.set_index()`
# - `pd.Series.astype()`

# %% Doctests
"""
>>> import sys; sys.tracebacklimit = 0

>>> assert sys.version_info >= (3, 9), \
'Python has an is invalid version; expected: `3.9` or newer.'

>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)

>>> assert result is not Ellipsis, \
'Variable `result` has an invalid value; assign result of your program to it.'

>>> assert type(result) is pd.DataFrame, \
'Variable `result` has an invalid type; expected: `pd.DataFrame`.'

>>> result  # doctest: +NORMALIZE_WHITESPACE
                             Duration
Astronaut
Anatoliy Solovyov     3 days 06:48:00
Michael Lopez-Alegria 2 days 19:40:00
Peggy Whitson         2 days 12:21:00
Fyodor Yurchikhin     2 days 11:29:00
Jerry Ross            2 days 10:38:00
John Grunsfeld        2 days 10:30:00
Richard Mastracchio   2 days 05:04:00
Sunita Williams       2 days 02:40:00
Stephen Smith         2 days 01:48:00
Edward Fincke         2 days 00:36:00
"""

# %% Run
# - PyCharm: right-click in the editor and `Run Doctest in ...`
# - PyCharm: keyboard shortcut `Control + Shift + F10`
# - Terminal: `python -m doctest -f -v myfile.py`

# %% Imports
import pandas as pd

# %% Types
result: pd.DataFrame

# %% Data
DATA = 'https://python3.info/_static/astro-eva.csv'

# %% Result
result = ...