5.17. DataFrame Mapping

  • Series.map() - Map values of Series according to an input mapping or function

  • Series.apply() - Invoke function on values of Series

  • DataFrame.apply() - Apply a function along an axis of the DataFrame

  • DataFrame.applymap() - Apply a function to a Dataframe elementwise

  • DataFrame.pipe() - Apply chainable functions that expect Series or DataFrames

  • DataFrame.where(cond, sub) - Replace values where the condition is False

  • DataFrame.mask(cond, sub) - Replace values where the condition is True

  • Series.str.split(regex, expand=True) - Split strings around given separator/delimiter

  • Series.str.extract(regex, expand=True) - Extract capture groups in the regex pat as columns in a DataFrame

  • Series.str.extractall() - Extract capture groups in the regex pat as columns in DataFrame

  • Series.str.findall(regex) - Find all occurrences of pattern or regular expression in the Series/Index

  • Series.str.fullmatch(regex) - Determine if each string entirely matches a regular expression

  • Series.dt.strftime(...) - formatted strings specified by date_format, which supports the same string format as the python standard library

  • Series.dt.date - Returns numpy array of python datetime.date objects

  • Series.dt.time - Returns numpy array of datetime.time objects

  • Series.dt.timez - Returns numpy array of datetime.time objects with timezone information

5.17.1. SetUp

>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(0)
>>>
>>>
>>> df = pd.DataFrame(
...     columns = ['Morning', 'Noon', 'Evening', 'Midnight'],
...     index = pd.date_range('1999-12-30', periods=7),
...     data = np.random.randn(7, 4))
>>>
>>> df
             Morning      Noon   Evening  Midnight
1999-12-30  1.764052  0.400157  0.978738  2.240893
1999-12-31  1.867558 -0.977278  0.950088 -0.151357
2000-01-01 -0.103219  0.410599  0.144044  1.454274
2000-01-02  0.761038  0.121675  0.443863  0.333674
2000-01-03  1.494079 -0.205158  0.313068 -0.854096
2000-01-04 -2.552990  0.653619  0.864436 -0.742165
2000-01-05  2.269755 -1.454366  0.045759 -0.187184

5.17.2. Map

  • Works only on Series

  • Argument: dict, Series, or Callable

  • Works element-wise on a Series

  • Operate on one element at time

  • When passed a dictionary/Series will map elements based on the keys in that dictionary/Series, missing values will be recorded as NaN in the output

  • Is optimised for elementwise mappings and transformation

  • Operations that involve dictionaries or Series will enable pandas to use faster code paths for better performance [5]

>>> df['Morning'].map(lambda value: round(value, 2))
1999-12-30    1.76
1999-12-31    1.87
2000-01-01   -0.10
2000-01-02    0.76
2000-01-03    1.49
2000-01-04   -2.55
2000-01-05    2.27
Freq: D, Name: Morning, dtype: float64
>>> df['Morning'].map(int)
1999-12-30    1
1999-12-31    1
2000-01-01    0
2000-01-02    0
2000-01-03    1
2000-01-04   -2
2000-01-05    2
Freq: D, Name: Morning, dtype: int64

5.17.3. Apply

  • Works on both Series and DataFrame

  • Argument: Callable

  • On Series: operate on one element at time

  • On DataFrame: elementwise but also row / column basis

  • Suited to more complex operations and aggregation

  • The behaviour and return value depends on the function

  • Returns a scalar for aggregating operations, Series otherwise. Similarly for DataFrame.apply

  • Has fastpaths when called with certain NumPy functions such as mean, sum, etc. [5]

>>> df['Morning'].apply(int)
1999-12-30    1
1999-12-31    1
2000-01-01    0
2000-01-02    0
2000-01-03    1
2000-01-04   -2
2000-01-05    2
Freq: D, Name: Morning, dtype: int64
>>> df['Morning'].apply(lambda value: round(value, 2))
1999-12-30    1.76
1999-12-31    1.87
2000-01-01   -0.10
2000-01-02    0.76
2000-01-03    1.49
2000-01-04   -2.55
2000-01-05    2.27
Freq: D, Name: Morning, dtype: float64

5.17.4. Applymap

  • Works only on DataFrame

  • Argument: Callable

  • Works element-wise on a DataFrame

  • Operate on one element at time

  • In more recent versions has been optimised for some operations

  • You will find applymap slightly faster than apply in some cases.

  • Test both and use whatever works better [5]

5.17.5. Summary

Series.map [1]:

  • Works element-wise on a Series

  • Operate on one element at time

Series.apply [2]:

  • Operate on one element at time

DataFrame.apply [3]:

  • Works on a row / column basis of a DataFrame

  • Operates on entire rows or columns at a time

DataFrame.applymap [4]:

  • Works element-wise on a DataFrame

  • Operate on one element at time

5.17.6. Differentiation

Definition:

  • map defined on Series only

  • applymap defined on Series and DataFrame

  • apply defined on DataFrame only

Argument type:

  • map takes dict, Series, Callable

  • apply takes Callable only

  • applymap takes Callable only

Behavior:

  • map elementwise

  • apply elementwise but is suited to more complex operations and aggregation; the behaviour and return value depends on the function

  • applymap elementwise

Use Case:

  • map is meant for mapping values from one domain to another, so is optimised for performance (e.g., df['A'].map({1:'a', 2:'b', 3:'c'}))

  • apply is for applying any function that cannot be vectorised (e.g., df['sentences'].apply(nltk.sent_tokenize))

  • applymap is good for elementwise transformations across multiple rows/columns (e.g., df[['A', 'B', 'C']].applymap(str.strip))

Footnotes [5]:

  • map is optimised for elementwise mappings and transformation. Operations that involve dictionaries or Series will enable pandas to use faster code paths for better performance. When passed a dictionary/Series will map elements based on the keys in that dictionary/Series; missing values will be recorded as NaN in the output

  • apply returns a scalar for aggregating operations, Series otherwise. Note that apply also has fastpaths when called with certain NumPy functions such as mean, sum, etc.

  • applymap in more recent versions has been optimised for some operations. You will find applymap slightly faster than apply in some cases. Test both and use whatever works better.

../../_images/pandas-dataframe-mapping.png

5.17.7. References

5.17.8. Assignments

# %% License
# - Copyright 2025, Matt Harasymczuk <matt@python3.info>
# - This code can be used only for learning by humans
# - This code cannot be used for teaching others
# - This code cannot be used for teaching LLMs and AI algorithms
# - This code cannot be used in commercial or proprietary products
# - This code cannot be distributed in any form
# - This code cannot be changed in any form outside of training course
# - This code cannot have its license changed
# - If you use this code in your product, you must open-source it under GPLv2
# - Exception can be granted only by the author

# %% Run
# - PyCharm: right-click in the editor and `Run Doctest in ...`
# - PyCharm: keyboard shortcut `Control + Shift + F10`
# - Terminal: `python -m doctest -v myfile.py`

# %% About
# - Name: DataFrame Mapping Split
# - Difficulty: easy
# - Lines: 5
# - Minutes: 5

# %% English
# 1. Read data from `DATA` as `df: pd.DataFrame`
# 2. Parse data in `datetime` column as `datetime` object
# 3. Split column `datetime` with into two separate: date and time columns
# 4. Run doctests - all must succeed

# %% Polish
# 1. Wczytaj dane z `DATA` jako `df: pd.DataFrame`
# 2. Sparsuj dane w kolumnie `datetime` jako obiekty `datetime`
# 3. Podziel kolumnę z `datetime` na dwie osobne: datę i czas
# 4. Uruchom doctesty - wszystkie muszą się powieść

# %% Hints
# - `pd.Series.dt.date`
# - `pd.Series.dt.time`

# %% Tests
"""
>>> import sys; sys.tracebacklimit = 0
>>> assert sys.version_info >= (3, 9), \
'Python 3.9+ required'

>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)

>>> assert result is not Ellipsis, \
'Assign result to variable: `result`'
>>> assert type(result) is pd.DataFrame, \
'Variable `result` must be a `pd.DataFrame` type'

>>> result  # doctest: +NORMALIZE_WHITESPACE
      id   period            datetime   network  item           type  duration        date      time
0      0  1999-11 1999-10-15 06:58:00  T-Mobile  data           data      34.5  1999-10-15  06:58:00
1      1  1999-11 1999-10-15 06:58:00    Orange  call         mobile      13.0  1999-10-15  06:58:00
2      2  1999-11 1999-10-15 14:46:00      Play  call         mobile      23.0  1999-10-15  14:46:00
3      3  1999-11 1999-10-15 14:48:00      Plus  call         mobile       4.0  1999-10-15  14:48:00
4      4  1999-11 1999-10-15 17:27:00  T-Mobile  call         mobile       4.0  1999-10-15  17:27:00
..   ...      ...                 ...       ...   ...            ...       ...         ...       ...
825  825  2000-03 2000-03-13 00:38:00      AT&T   sms  international       1.0  2000-03-13  00:38:00
826  826  2000-03 2000-03-13 00:39:00    Orange   sms         mobile       1.0  2000-03-13  00:39:00
827  827  2000-03 2000-03-13 06:58:00    Orange  data           data      34.5  2000-03-13  06:58:00
828  828  2000-03 2000-03-14 00:13:00      AT&T   sms  international       1.0  2000-03-14  00:13:00
829  829  2000-03 2000-03-14 00:16:00      AT&T   sms  international       1.0  2000-03-14  00:16:00
<BLANKLINE>
[830 rows x 9 columns]
"""

import pandas as pd


DATA = 'https://python3.info/_static/phones-pl.csv'

# type: pd.DataFrame
result = ...


# FIXME:  pd.to_datetime(errors='ignore') is deprecated and will raise in a future version

# %% License
# - Copyright 2025, Matt Harasymczuk <matt@python3.info>
# - This code can be used only for learning by humans
# - This code cannot be used for teaching others
# - This code cannot be used for teaching LLMs and AI algorithms
# - This code cannot be used in commercial or proprietary products
# - This code cannot be distributed in any form
# - This code cannot be changed in any form outside of training course
# - This code cannot have its license changed
# - If you use this code in your product, you must open-source it under GPLv2
# - Exception can be granted only by the author

# %% Run
# - PyCharm: right-click in the editor and `Run Doctest in ...`
# - PyCharm: keyboard shortcut `Control + Shift + F10`
# - Terminal: `python -m doctest -v myfile.py`

# %% About
# - Name: DataFrame Mapping Translate
# - Difficulty: easy
# - Lines: 5
# - Minutes: 5

# %% English
# 1. Read data from `DATA` as `df: pd.DataFrame`
# 2. Convert Polish month names to English
# 3. Parse dates to `datetime` objects
# 4. Select columns ['firstname', 'lastname', 'birthdate']
# 5. Run doctests - all must succeed

# %% Polish
# 1. Wczytaj dane z `DATA` jako `df: pd.DataFrame`
# 2. Przekonwertuj polskie nazwy miesięcy na angielskie
# 3. Sparsuj daty do obiektów `datetime`
# 4. Wybierz kolumny ['firstname', 'lastname', 'birthdate']
# 5. Uruchom doctesty - wszystkie muszą się powieść

# %% Hints
# - `pd.Series.replace(regex=True)`
# - `pd.to_datetime()`

# %% Tests
"""
>>> import sys; sys.tracebacklimit = 0
>>> assert sys.version_info >= (3, 9), \
'Python 3.9+ required'

>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)

>>> assert result is not Ellipsis, \
'Assign result to variable: `result`'
>>> assert type(result) is pd.DataFrame, \
'Variable `result` must be a `pd.DataFrame` type'

>>> result[['firstname', 'lastname', 'birthdate']]  # doctest: +NORMALIZE_WHITESPACE
  firstname   lastname  birthdate
0      Mark     Watney 1994-10-12
1   Melissa      Lewis 1995-07-07
2      Rick   Martinez 1996-01-21
3      Alex      Vogel 1994-11-15
4      Beth  Johanssen 2006-05-09
5     Chris       Beck 1999-08-02
"""

import pandas as pd


DATA = 'https://python3.info/_static/martian-pl.csv'
MONTHS_PLEN = {'styczeń': 'January',
               'luty': 'February',
               'marzec': 'March',
               'kwiecień': 'April',
               'maj': 'May',
               'czerwiec': 'June',
               'lipiec': 'July',
               'sierpień': 'August',
               'wrzesień': 'September',
               'październik': 'October',
               'listopad': 'November',
               'grudzień': 'December'}

# type: pd.DataFrame
result = ...


# %% License
# - Copyright 2025, Matt Harasymczuk <matt@python3.info>
# - This code can be used only for learning by humans
# - This code cannot be used for teaching others
# - This code cannot be used for teaching LLMs and AI algorithms
# - This code cannot be used in commercial or proprietary products
# - This code cannot be distributed in any form
# - This code cannot be changed in any form outside of training course
# - This code cannot have its license changed
# - If you use this code in your product, you must open-source it under GPLv2
# - Exception can be granted only by the author

# %% Run
# - PyCharm: right-click in the editor and `Run Doctest in ...`
# - PyCharm: keyboard shortcut `Control + Shift + F10`
# - Terminal: `python -m doctest -v myfile.py`

# %% About
# - Name: DataFrame Mapping Month
# - Difficulty: easy
# - Lines: 10
# - Minutes: 8

# %% English
# 1. Read data from `DATA` as `df: pd.DataFrame`
# 2. Add column `year` and `month` by parsing `period` column
# 3. Month name must be a string month name, not a number (i.e.: 'January', 'May')
# 4. Example: if `period` column is "2015-01", then `year`: 2015, `month`: January
# 5. Run doctests - all must succeed

# %% Polish
# 1. Wczytaj dane z `DATA` jako `df: pd.DataFrame`
# 2. Dodaj kolumnę `year` i `month` poprzez sparsowanie kolumny `period`
# 3. Nazwa miesiąca musi być ciągiem znaków, a nie liczbą (i.e. 'January', 'May')
# 4. Example: jeżeli kolumna `period` jest "2015-01", to `year`: 2015, `month`: January
# 5. Uruchom doctesty - wszystkie muszą się powieść

# %% Hints
# - `Series.str.split(expand=True)`
# - `df[ ['A', 'B'] ] = ...`

# %% Tests
"""
>>> import sys; sys.tracebacklimit = 0
>>> assert sys.version_info >= (3, 9), \
'Python 3.9+ required'

>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)

>>> assert result is not Ellipsis, \
'Assign result to variable: `result`'
>>> assert type(result) is pd.DataFrame, \
'Variable `result` must be a `pd.DataFrame` type'

>>> result  # doctest: +NORMALIZE_WHITESPACE
      id   period          datetime   network  item           type  duration  year     month
0      0  1999-11  1999-10-15 06:58  T-Mobile  data           data      34.5  1999  November
1      1  1999-11  1999-10-15 06:58    Orange  call         mobile      13.0  1999  November
2      2  1999-11  1999-10-15 14:46      Play  call         mobile      23.0  1999  November
3      3  1999-11  1999-10-15 14:48      Plus  call         mobile       4.0  1999  November
4      4  1999-11  1999-10-15 17:27  T-Mobile  call         mobile       4.0  1999  November
..   ...      ...               ...       ...   ...            ...       ...   ...       ...
825  825  2000-03  2000-03-13 00:38      AT&T   sms  international       1.0  2000     March
826  826  2000-03  2000-03-13 00:39    Orange   sms         mobile       1.0  2000     March
827  827  2000-03  2000-03-13 06:58    Orange  data           data      34.5  2000     March
828  828  2000-03  2000-03-14 00:13      AT&T   sms  international       1.0  2000     March
829  829  2000-03  2000-03-14 00:16      AT&T   sms  international       1.0  2000     March
<BLANKLINE>
[830 rows x 9 columns]
"""

import pandas as pd


DATA = 'https://python3.info/_static/phones-pl.csv'
MONTHS_EN = ['January', 'February', 'March', 'April',
             'May', 'June', 'July', 'August', 'September',
             'October', 'November', 'December']
MONTHS = dict(enumerate(MONTHS_EN, start=1))

# type: pd.DataFrame
result = ...


# %% License
# - Copyright 2025, Matt Harasymczuk <matt@python3.info>
# - This code can be used only for learning by humans
# - This code cannot be used for teaching others
# - This code cannot be used for teaching LLMs and AI algorithms
# - This code cannot be used in commercial or proprietary products
# - This code cannot be distributed in any form
# - This code cannot be changed in any form outside of training course
# - This code cannot have its license changed
# - If you use this code in your product, you must open-source it under GPLv2
# - Exception can be granted only by the author

# %% Run
# - PyCharm: right-click in the editor and `Run Doctest in ...`
# - PyCharm: keyboard shortcut `Control + Shift + F10`
# - Terminal: `python -m doctest -v myfile.py`

# %% About
# - Name: DataFrame Mapping Substitute
# - Difficulty: medium
# - Lines: 10
# - Minutes: 8

# %% English
# 1. Read data from `DATA` as `df: pd.DataFrame`
# 2. Select `Polish` spreadsheet
# 3. Set header and index to data from file
# 4. Mind the encoding
# 5. Substitute Polish Diacritics to English alphabet letters
# 6. Compare `df.replace(regex=True)` with `df.applymap()`
# 7. Run doctests - all must succeed

# %% Polish
# 1. Wczytaj dane z `DATA` jako `df: pd.DataFrame`
# 2. Wybierz arkusz `Polish`
# 3. Ustaw nagłówek i index na dane zaczytane z pliku
# 4. Zwróć uwagę na encoding
# 5. Podmień polskie znaki diakrytyczne na litery z alfabetu angielskiego
# 6. Porównaj `df.replace(regex=True)` z `df.applymap()`
# 7. Uruchom doctesty - wszystkie muszą się powieść

# %% Tests
"""
>>> import sys; sys.tracebacklimit = 0
>>> assert sys.version_info >= (3, 9), \
'Python 3.9+ required'

>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 3)
>>> pd.set_option('display.max_rows', 10)

>>> assert result is not Ellipsis, \
'Assign result to variable: `result`'
>>> assert type(result) is pd.DataFrame, \
'Variable `result` must be a `pd.DataFrame` type'

>>> result  # doctest: +NORMALIZE_WHITESPACE
                                             Definicja  ...                                 Kryteria wyjsciowe
TRL                                                     ...
1    Zaobserwowanie i opisanie podstawowych zasad d...  ...  Zweryfikowane publikacja badania lezacych u po...
2    Sformulowanie koncepcji technologicznej lub pr...  ...  Udokumentowany opis aplikacji / koncepcji, kto...
3    Przeprowadzanie eksperymentalnie i analityczni...  ...  Udokumentowane wyniki analityczne / eksperymen...
4    Przeprowadzenie weryfikacji komponentow techno...  ...  Udokumentowane wyniki testow potwierdzajace zg...
5    Przeprowadzenie weryfikacji komponentow techno...  ...  Udokumentowane wyniki testow potwierdzajace zg...
6    Dokonanie demonstracji technologii w srodowisk...  ...  Udokumentowane wyniki testow potwierdzajace zg...
7    Dokonanie demonstracji prototypu systemu w oto...  ...  Udokumentowane wyniki testow potwierdzajace zg...
8    Zakonczenie badan i demonstracja ostatecznej f...  ...  Udokumentowane wyniki testow weryfikujacych pr...
9    Weryfikacja technologii w srodowisku operacyjn...  ...            Udokumentowane wyniki operacyjne misji.
<BLANKLINE>
[9 rows x 4 columns]
"""

import pandas as pd


DATA = 'https://python3.info/_static/astro-trl.xlsx'
LETTERS_PLEN = {'ą': 'a', 'ć': 'c', 'ę': 'e',
                'ł': 'l', 'ń': 'n', 'ó': 'o',
                'ś': 's', 'ż': 'z', 'ź': 'z'}

# type: pd.DataFrame
result = ...


# %% License
# - Copyright 2025, Matt Harasymczuk <matt@python3.info>
# - This code can be used only for learning by humans
# - This code cannot be used for teaching others
# - This code cannot be used for teaching LLMs and AI algorithms
# - This code cannot be used in commercial or proprietary products
# - This code cannot be distributed in any form
# - This code cannot be changed in any form outside of training course
# - This code cannot have its license changed
# - If you use this code in your product, you must open-source it under GPLv2
# - Exception can be granted only by the author

# %% Run
# - PyCharm: right-click in the editor and `Run Doctest in ...`
# - PyCharm: keyboard shortcut `Control + Shift + F10`
# - Terminal: `python -m doctest -v myfile.py`

# %% About
# - Name: Pandas Read JSON OpenAPI
# - Difficulty: easy
# - Lines: 5
# - Minutes: 5

# %% English
# 1. Read data from `DATA` as `df: pd.DataFrame`
# 2. Use `requests` library
# 3. Transpose data
# 4. If cell is a `dict`, then extract value for `summary`
# 5. If cell is empty, leave `pd.NA`
# 6. Run doctests - all must succeed

# %% Polish
# 1. Wczytaj dane z `DATA` jako `df: pd.DataFrame`
# 2. Użyj biblioteki `requests`
# 3. Transponuj dane
# 4. Jeżeli komórka jest `dict`, to wyciągnij wartość dla `summary`
# 5. Jeżeli komórka jest pusta, pozostaw `pd.NA`
# 6. Uruchom doctesty - wszystkie muszą się powieść

# %% Hints
# - `pandas.DataFrame()`
# - `DataFrame.map()`
# - `DataFrame.transpose()`

# %% Tests
"""
>>> import sys; sys.tracebacklimit = 0
>>> assert sys.version_info >= (3, 9), \
'Python 3.9+ required'

>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)

>>> assert result is not Ellipsis, \
'Assign result to variable: `result`'
>>> assert type(result) is pd.DataFrame, \
'Variable `result` must be a `pd.DataFrame` type'

>>> list(result.columns)
['put', 'post', 'get', 'delete']

>>> list(result.index)  # doctest: +NORMALIZE_WHITESPACE
['/pet', '/pet/findByStatus', '/pet/findByTags', '/pet/{petId}', '/pet/{petId}/uploadImage',
 '/store/inventory', '/store/order', '/store/order/{orderId}',
 '/user', '/user/createWithList', '/user/login', '/user/logout', '/user/{username}']
"""

import pandas as pd
import requests


DATA = 'https://python3.info/_static/openapi.json'
data = requests.get(DATA).json()['paths']

# type: pd.DataFrame
result = ...