4.16. Series Mapping
Using str methods for cleaning user input
80% of machine learning and data science is cleaning data
Series.apply
- apply function to data, function can have args and kwargsSeries.map
- convert data from one to another using function or dict
4.16.1. Apply
Apply a function along an axis of the DataFrame. Can be ufunc (a NumPy function that applies to the entire Series) or a Python function that only works on single values.
>>>
... def apply(self,
... func: Any,
... axis: Any = 0,
... raw: Any = False,
... result_type: Any = None,
... args: Any = (),
... **kwargs: Any,
... ) -> Any | Series | DataFrame
Objects passed to the function are Series objects whose index is either the DataFrame's index (axis=0) or the DataFrame's columns (axis=1). By default (result_type=None), the final return type is inferred from the return type of the applied function. Otherwise, it depends on the result_type argument.
Functions that mutate the passed object can produce unexpected behavior or errors and are not supported.
>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(0)
>>>
>>>
>>> s = pd.Series(
... index = pd.date_range('2000-01-01', periods=4),
... data = np.random.randn(4))
>>>
>>> s
2000-01-01 1.764052
2000-01-02 0.400157
2000-01-03 0.978738
2000-01-04 2.240893
Freq: D, dtype: float64
>>>
>>> s.apply(int)
2000-01-01 1
2000-01-02 0
2000-01-03 0
2000-01-04 2
Freq: D, dtype: int64
>>>
>>> s.apply(lambda x: round(x, 2))
2000-01-01 1.76
2000-01-02 0.40
2000-01-03 0.98
2000-01-04 2.24
Freq: D, dtype: float64
>>>
>>> s.apply(round, ndigits=2)
2000-01-01 1.76
2000-01-02 0.40
2000-01-03 0.98
2000-01-04 2.24
Freq: D, dtype: float64
>>>
>>> s.apply(round, args=(2,))
2000-01-01 1.76
2000-01-02 0.40
2000-01-03 0.98
2000-01-04 2.24
Freq: D, dtype: float64
functools.partial(func, *args, **keywords)
:
>>> from functools import partial
>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(0)
>>>
>>>
>>> s = pd.Series(
... index = pd.date_range('2000-01-01', periods=4),
... data = np.random.randn(4))
>>>
>>> s
2000-01-01 1.764052
2000-01-02 0.400157
2000-01-03 0.978738
2000-01-04 2.240893
Freq: D, dtype: float64
>>>
>>> round2 = partial(round, ndigits=2)
>>> square = partial(pow, exp=2)
>>> cube = partial(pow, exp=3)
>>>
>>> s.apply(round2)
2000-01-01 1.76
2000-01-02 0.40
2000-01-03 0.98
2000-01-04 2.24
Freq: D, dtype: float64
>>>
>>> s.apply(square)
2000-01-01 3.111881
2000-01-02 0.160126
2000-01-03 0.957928
2000-01-04 5.021602
Freq: D, dtype: float64
>>>
>>> s.apply(cube)
2000-01-01 5.489520
2000-01-02 0.064075
2000-01-03 0.937561
2000-01-04 11.252875
Freq: D, dtype: float64
4.16.2. Map
Map values of Series according to input correspondence. Used for substituting each value in a Series with another value, that may be derived from a function, a dict or a Series.
>>>
... def map(self,
... arg: Callable | dict | Series,
... na_action: Literal['ignore'] | None = None,
... ) -> Series
When arg is a dictionary, values in Series that are not in the dictionary (as keys) are converted to NaN. However, if the dictionary is a dict subclass that defines __missing__ (i.e. provides a method for default values), then this default is used rather than NaN.
>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(0)
>>>
>>>
>>> s = pd.Series(
... index = pd.date_range('2000-01-01', periods=4),
... data = np.random.randn(4))
>>>
>>> s
2000-01-01 1.764052
2000-01-02 0.400157
2000-01-03 0.978738
2000-01-04 2.240893
Freq: D, dtype: float64
>>>
>>> s.map(int)
2000-01-01 1
2000-01-02 0
2000-01-03 0
2000-01-04 2
Freq: D, dtype: int64
>>>
>>> s.map(lambda x: round(x, 2))
2000-01-01 1.76
2000-01-02 0.40
2000-01-03 0.98
2000-01-04 2.24
Freq: D, dtype: float64
>>> import pandas as pd
>>>
>>>
>>> s = pd.Series(['Watney', 'Twardowski', pd.NA, 'Lewis'])
>>>
>>> s
0 Watney
1 Twardowski
2 <NA>
3 Lewis
dtype: object
>>>
>>> s.map({'Watney': 'Mark', 'Twardowski': 'Pan'})
0 Mark
1 Pan
2 NaN
3 NaN
dtype: object
>>>
>>> s.map('I am {}'.format)
0 I am Watney
1 I am Twardowski
2 I am <NA>
3 I am Lewis
dtype: object
>>>
>>> s.map('I am {}'.format, na_action='ignore')
0 I am Watney
1 I am Twardowski
2 <NA>
3 I am Lewis
dtype: object
4.16.3. Normalization
Comparing not normalized strings will yield invalid or at least unexpected results:
>>> 'MacGyver' == 'Macgyver'
False
Normalize strings before comparing:
>>> 'MacGyver'.upper() == 'Macgyver'.upper()
True
4.16.4. Numbers
When comparing age, height, temperature etc, the following numbers has
the same meaning. Therefore after converting to float()
it will be
exactly the same.
>>> age = 42
>>> age = 42.0
>>> age = 42.00
>>> age = '42'
>>> age = '42.0'
>>> age = '42.00'
However, when those values indicates for example a version of a program to find in text their meaning will be different. Version 21 and '21.00' will be a completely different object, so it should not be treated exactly the same.
>>> version = 21
>>> version = 21.0
>>> version = 21.00
>>> version = '21'
>>> version = '21.0'
>>> version = '21.00'
4.16.5. Addresses
Address prefix (street, road, court, place, etc.):
>>> prefix = 'ul'
>>> prefix = 'ul.'
>>> prefix = 'Ul.'
>>> prefix = 'UL.'
>>> prefix = 'ulica'
>>> prefix = 'Ulica'
>>>
>>> prefix = 'os'
>>> prefix = 'os.'
>>> prefix = 'Os.'
>>> prefix = 'osiedle'
>>> prefix = 'oś'
>>> prefix = 'oś.'
>>> prefix = 'Oś.'
>>> prefix = 'ośedle'
>>>
>>> prefix = 'pl'
>>> prefix = 'pl.'
>>> prefix = 'Pl.'
>>> prefix = 'plac'
>>>
>>> prefix = 'al'
>>> prefix = 'al.'
>>> prefix = 'Al.'
>>> prefix = 'aleja'
>>> prefix = 'aleia'
>>> prefix = 'alei'
>>> prefix = 'aleii'
>>> prefix = 'aleji'
House and apartment number:
>>> address = 'Ćwiartki 3/4'
>>> address = 'Ćwiartki 3 / 4'
>>> address = 'Ćwiartki 3 m. 4'
>>> address = 'Ćwiartki 3 m 4'
>>> address = 'Brighton Beach 1st apt 2'
>>> address = 'Brighton Beach 1st apt. 2'
>>> address = 'Myśliwiecka 3/5/7'
>>>
>>> address = 'Górczewska 180f/8f'
>>> address = 'Górczewska 180f/8'
>>> address = 'Górczewska 180/8f'
>>>
>>> address = 'Jana Pawła II 1 m. 5'
>>> address = 'Powstańców 13d bud. A piętro II sala 3'
4.16.6. Phone Numbers:
Which one is mobile, and which is landline?
>>> phone = '+48 (12) 355 5678'
>>> phone = '+48 123 555 678'
Note, the numbers. They are completely the same. Your
>>> phone = '123 555 1337'
>>> phone = '1235551337'
>>> phone = '+11235551337'
>>> phone = '+1 12 3555 1337'
>>> phone = '+1 123 555 1337'
>>> phone = '+1 (123) 555 1337'
>>> phone = '+1 (123) 555-1337'
>>> phone = '+1 (123)-555-1337'
>>> phone = '+1 (123).555.1337'
>>>
>>> phone = '+1 800-python'
>>> phone = '+1 800-798466'
>>>
>>> phone = '+48 123555133,1'
>>> phone = '+48 123555133,1,,2'
>>> phone = '+48 123 555 133 wew. 7'
4.16.7. Date and Time
>>> date = '1961-04-12'
>>> date = '12.4.1961'
>>> date = '12.04.1961'
>>> date = '12-04-1961'
>>> date = '12/04/1961'
>>> date = '4/12/61'
>>> date = '4.12.1961'
>>> date = 'Apr 12, 1961'
>>> date = 'Apr 12th, 1961'
>>> time = '12:00:00'
>>> time = '12:00'
>>> time = '12:00 pm'
>>> duration = '04:30:00'
>>> duration = '4h 30m'
>>> duration = '4 hours 30 minutes'
4.16.8. Case Study
The following code is an output from real customer relationship management
(CRM) system, that I wrote in 2000s for a swimming pool in Poznan, Poland.
The output is a result of a SELECT DISTINCT(address)
result in SQL.
Note to english speaking users:
os.
- stands forosiedle
, which means blocks of flats
ul.
- stands forulica
, which means street
Is this the same address?
>>> street = 'os. Jana III Sobieskiego'
>>> street = 'ul. Jana III Sobieskiego'
>>> street = 'ul Jana III Sobieskiego'
>>> street = 'ul.Jana III Sobieskiego'
>>> street = 'ulicaJana III Sobieskiego'
>>> street = 'Ul. Jana III Sobieskiego'
>>> street = 'UL. Jana III Sobieskiego'
>>> street = 'ulica Jana III Sobieskiego'
>>> street = 'Ulica. Jana III Sobieskiego'
>>> street = 'Jana Sobieskiego 3'
>>> street = 'Jana Sobieskiego 3ego'
>>> street = 'Jana III Sobieskiego'
>>> street = 'Jana Iii Sobieskiego'
>>> street = 'Jana IIi Sobieskiego'
>>> street = 'Jana lll Sobieskiego' # three small letters 'L'
Yes, this is the same address. Despite having information about two different geographical entities (osiedle and ulica), this is the same address. Why? It is just a simple mistake from people who entered data.
SELECT DISTINCT(address)
won't show you the number of occurrences for
each result. What seems to be a high error rate at the first glance, in
further analysis happens to be a superbly few mistakes. How come? Number of
results for os. Jana III Sobieskiego
was around 50 thousands. The other
results was one or two at most. So, few mistakes from 50k results. That's
really good result.
Why we had those errors? Browser autocomplete. User error while imputing
data. And simple shortcuts during conversation: Where do you live?
,
at Sobieskiego
. There is only one place in Poznan, Poland with that
name, so it was precise during the conversation. But, receiving party put
that incorrectly to the database assuming that it was ulica
which is
far more common then osiedle
addresses.
4.16.9. Use Case - 1
String cleaning:
>>> expected = 'Pana Twardowskiego III'
>>> text = 'UL. pana \tTWArdoWskIEGO 3'
Convert to common format:
>>> text = text.upper()
Remove unwanted whitespaces:
>>> text = text.replace('\t', '')
Remove unwanted special characters:
>>> text = text.replace('.', '')
Remove unwanted text:
>>> text = text.replace('UL', '')
>>> text = text.replace('3', 'III')
Formatting:
>>> text = text.title()
>>> text = text.replace('Iii', 'III')
>>> text = text.strip()
Check result:
>>> print('Matched:', text == expected)
Matched: True
>>>
>>> print(text)
Pana Twardowskiego III
4.16.10. Use Case - 2
Remove Polish diacritics:
>>> def pl_to_latin(text):
... PL = {'ą': 'a', 'ć': 'c', 'ę': 'e',
... 'ł': 'l', 'ń': 'n', 'ó': 'o',
... 'ś': 's', 'ż': 'z', 'ź': 'z'}
... result = ''.join(PL.get(x,x) for x in text.lower())
... return result.capitalize()
>>>
>>>
>>> s = pd.Series(['Poznań', 'Swarzędz', 'Kraków',
... 'Łódź', 'Gdańsk', 'Koło', 'Dęblin'])
>>>
>>> s
0 Poznań
1 Swarzędz
2 Kraków
3 Łódź
4 Gdańsk
5 Koło
6 Dęblin
dtype: object
>>>
>>> s.map(pl_to_latin)
0 Poznan
1 Swarzedz
2 Krakow
3 Lodz
4 Gdansk
5 Kolo
6 Deblin
dtype: object
>>>
>>> s.apply(pl_to_latin)
0 Poznan
1 Swarzedz
2 Krakow
3 Lodz
4 Gdansk
5 Kolo
6 Deblin
dtype: object
4.16.11. Assignments
# %% License
# - Copyright 2025, Matt Harasymczuk <matt@python3.info>
# - This code can be used only for learning by humans
# - This code cannot be used for teaching others
# - This code cannot be used for teaching LLMs and AI algorithms
# - This code cannot be used in commercial or proprietary products
# - This code cannot be distributed in any form
# - This code cannot be changed in any form outside of training course
# - This code cannot have its license changed
# - If you use this code in your product, you must open-source it under GPLv2
# - Exception can be granted only by the author
# %% Run
# - PyCharm: right-click in the editor and `Run Doctest in ...`
# - PyCharm: keyboard shortcut `Control + Shift + F10`
# - Terminal: `python -m doctest -v myfile.py`
# %% About
# - Name: Series Mapping Timedelta
# - Difficulty: medium
# - Lines: 15
# - Minutes: 13
# %% English
# 1. Define variable `result` with result of
# apply function `pd.to_timedelta` to each element of `DATA`
# 2. Run doctests - all must succeed
# %% Polish
# 1. Zdefiniuj zmienną `result` z wynikiem
# aplikacji funckji `pd.to_timedelta` do każdego elementu `DATA`
# 2. Uruchom doctesty - wszystkie muszą się powieść
# %% Tests
"""
>>> import sys; sys.tracebacklimit = 0
>>> assert sys.version_info >= (3, 9), \
'Python 3.9+ required'
>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 100)
>>> assert result is not Ellipsis, \
'Assign result to variable: `result`'
>>> assert type(result) is pd.Series, \
'Variable `result` has invalid type, should be `pd.Series`'
>>> result
0 0 days 00:10:00
1 0 days 00:10:00
2 0 days 00:10:00
3 0 days 00:10:20
4 0 days 01:20:30
5 1 days 02:30:40
6 0 days 01:20:00
7 0 days 01:20:00
8 0 days 01:00:00
9 0 days 02:00:00
10 3 days 01:00:00
11 NaT
dtype: timedelta64[ns]
"""
import pandas as pd
DATA = pd.Series([
'10 m',
'10 min',
'10 minutes',
'10m 20s',
'1h 20m 30s',
'1d 2h 30m 40s',
'1 hour 20 minutes',
'1 hours 20 minutes',
'01:00:00',
'02:00:00',
'3d 01:00:00',
None,
])
# type: pd.Series
result = ...
# %% License
# - Copyright 2025, Matt Harasymczuk <matt@python3.info>
# - This code can be used only for learning by humans
# - This code cannot be used for teaching others
# - This code cannot be used for teaching LLMs and AI algorithms
# - This code cannot be used in commercial or proprietary products
# - This code cannot be distributed in any form
# - This code cannot be changed in any form outside of training course
# - This code cannot have its license changed
# - If you use this code in your product, you must open-source it under GPLv2
# - Exception can be granted only by the author
# %% Run
# - PyCharm: right-click in the editor and `Run Doctest in ...`
# - PyCharm: keyboard shortcut `Control + Shift + F10`
# - Terminal: `python -m doctest -v myfile.py`
# %% About
# - Name: Series Mapping Datetime
# - Difficulty: medium
# - Lines: 15
# - Minutes: 13
# %% English
# 1. Define variable `result` with result of
# apply function `pd.to_timedelta` to each element of `DATA`
# 2. Run doctests - all must succeed
# %% Polish
# 1. Zdefiniuj zmienną `result` z wynikiem
# aplikacji funckji `pd.to_timedelta` do każdego elementu `DATA`
# 2. Uruchom doctesty - wszystkie muszą się powieść
# %% Tests
"""
>>> import sys; sys.tracebacklimit = 0
>>> assert sys.version_info >= (3, 9), \
'Python 3.9+ required'
>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 100)
>>> assert result is not Ellipsis, \
'Assign result to variable: `result`'
>>> assert type(result) is pd.Series, \
'Variable `result` has invalid type, should be `pd.Series`'
>>> result
0 2000-01-02
1 2000-01-02
2 2000-01-02
3 2000-01-02
4 2000-01-02
5 2000-02-01
6 2000-02-01
7 2000-02-01
8 2000-02-01
9 NaT
dtype: datetime64[ns]
"""
import pandas as pd
DATA = pd.Series([
'2000-01-02',
'Jan 2, 2000',
'Jan 2nd, 2000',
'2 Jan 2000',
'2nd Jan 2000',
'2/1/2000',
'02/01/2000',
'2.1.2000',
'02.01.2000',
None,
])
# type: pd.Series
result = ...
# %% License
# - Copyright 2025, Matt Harasymczuk <matt@python3.info>
# - This code can be used only for learning by humans
# - This code cannot be used for teaching others
# - This code cannot be used for teaching LLMs and AI algorithms
# - This code cannot be used in commercial or proprietary products
# - This code cannot be distributed in any form
# - This code cannot be changed in any form outside of training course
# - This code cannot have its license changed
# - If you use this code in your product, you must open-source it under GPLv2
# - Exception can be granted only by the author
# %% Run
# - PyCharm: right-click in the editor and `Run Doctest in ...`
# - PyCharm: keyboard shortcut `Control + Shift + F10`
# - Terminal: `python -m doctest -v myfile.py`
# %% About
# - Name: Series Mapping Clean
# - Difficulty: medium
# - Lines: 15
# - Minutes: 13
# %% English
# 1. Convert `DATA` to `pd.Series`
# 2. Write function to clean up data
# 3. Function takes one `str` argument
# 4. Function returns cleaned text
# 5. Apply function to all elements of `pd.Series`
# 6. Run doctests - all must succeed
# %% Polish
# 1. Przekonwertuj `DATA` do `pd.Series`
# 2. Napisz funkcję czyszczącą dane
# 3. Funkcja przyjmuje jeden argument typu `str`
# 4. Funkcja zwraca oczyszczony tekst
# 5. Zaaplikuj funkcję na wszystkich elementach `pd.Series`
# 6. Uruchom doctesty - wszystkie muszą się powieść
# %% Tests
"""
>>> import sys; sys.tracebacklimit = 0
>>> assert sys.version_info >= (3, 9), \
'Python 3.9+ required'
>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)
>>> assert result is not Ellipsis, \
'Assign result to variable: `result`'
>>> assert type(result) is pd.Series, \
'Variable `result` has invalid type, should be `pd.Series`'
>>> result
0 Pana Twardowskiego III
1 Pana Twardowskiego III
2 Pana Twardowskiego III
3 Pana Twardowskiego III
4 Pana Twardowskiego III
5 Pana Twardowskiego III
6 Pana Twardowskiego III
7 Pana Twardowskiego III
8 Pana Twardowskiego III
9 Pana Twardowskiego III
dtype: object
"""
import pandas as pd
DATA = [
'UL. Pana \tTWArdoWskIEGO 3',
'ul Pana TwaRDOWSkiego III',
'\tul. Pana Twardowskiego trzeciego',
'ulicaPana Twardowskiego III',
'Pana \nTWARDOWSKIEGO 3',
'UL. Pana TWARDowsKIEGO III',
'ULICA Pana TWARDOWSKIEGO III ',
'ULICA. Pana TWARDowsKIEGO III',
' Pana Twardowskiego 3 ',
'Pana\tTwardowskiego III ',
]
def clean(text: str) -> str:
pass
# type: pd.Series
result = ...