6.20. Regex RE Split

  • re.split()

  • Split text by pattern
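
A minimal sketch of the call, assuming a made-up sample string. It shows the plain form, the `maxsplit` argument, and what a capturing group in the pattern does to the result:

```python
import re

# Hypothetical sample data, not taken from the chapter
data = 'alpha, beta;gamma ,delta'

# Split on a comma or semicolon with optional whitespace around it
items = re.split(r'\s*[,;]\s*', data)

# maxsplit limits the number of splits; the remainder stays in one piece
head = re.split(r'\s*[,;]\s*', data, maxsplit=1)

# A capturing group keeps the matched separators in the result
with_separators = re.split(r'\s*([,;])\s*', data)

print(items)            # ['alpha', 'beta', 'gamma', 'delta']
print(head)             # ['alpha', 'beta;gamma ,delta']
print(with_separators)  # ['alpha', ',', 'beta', ';', 'gamma', ',', 'delta']
```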

6.20.1. SetUp

>>> import re

6.20.2. Recap

  • str.splitlines()

  • str.split()

  • re.split()

>>> text = 'The\nHello\nWorld'
>>>
>>> text.splitlines()
['The', 'Hello', 'World']
>>>
>>> text.split('\n')
['The', 'Hello', 'World']
>>>
>>> re.split('\n', text)
['The', 'Hello', 'World']
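
The three calls agree on the text above, but they can differ at the edges. A small sketch, assuming a made-up string with mixed line endings and a trailing newline:

```python
import re

# Hypothetical sample: Windows and Unix line endings plus a trailing newline
text = 'The\r\nHello\nWorld\n'

# str.splitlines() recognizes \r\n and drops the final empty string
by_splitlines = text.splitlines()

# str.split('\n') splits on the literal separator only
by_split = text.split('\n')

# re.split() takes a pattern, so \r?\n covers both line endings
by_regex = re.split(r'\r?\n', text)

print(by_splitlines)  # ['The', 'Hello', 'World']
print(by_split)       # ['The\r', 'Hello', 'World', '']
print(by_regex)       # ['The', 'Hello', 'World', '']
```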

6.20.3. Problem

  • str.splitlines()

  • str.split()

  • re.split()

  • With a single newline as the separator, all three return an empty string for the blank line

>>> text = 'The\n\nHello\nWorld'
>>>
>>> text.splitlines()
['The', '', 'Hello', 'World']
>>>
>>> text.split('\n')
['The', '', 'Hello', 'World']
>>>
>>> re.split('\n', text)
['The', '', 'Hello', 'World']
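
One workaround that stays with `str.split()` is to filter out the empty strings in a second step. This is only a sketch of that idea, shown for comparison with the pattern-based approach:

```python
text = 'The\n\nHello\nWorld'

# Keep only non-empty items after a plain str.split()
lines = [line for line in text.split('\n') if line]

print(lines)  # ['The', 'Hello', 'World']
```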

6.20.4. Solution

  • re.split()

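  • Split text by pattern

  • Pattern `\n+` matches one or more consecutive newlines, so a run of newlines counts as a single separator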

>>> text = 'The\n\nHello\nWorld'
>>>
>>> re.split('\n+', text)
['The', 'Hello', 'World']
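
The `\n+` pattern merges consecutive newlines, but a separator at the very start or end of the text still yields an empty string. A short sketch, assuming a made-up text with leading and trailing newlines; stripping them first is one possible way around it:

```python
import re

# Hypothetical sample with newlines at both ends
text = '\nThe\n\n\nHello\nWorld\n'

# A separator at the start or end still produces an empty string
raw = re.split(r'\n+', text)

# Stripping newlines first avoids the empty edge items
clean = re.split(r'\n+', text.strip('\n'))

print(raw)    # ['', 'The', 'Hello', 'World', '']
print(clean)  # ['The', 'Hello', 'World']
```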

6.20.5. Use Case - 1

  • Making a Phonebook

>>> import re
>>>
>>>
>>> TEXT = """Pan Twardowski: 834.345.1254 Polish Space Agency
...
... Mark Watney: 892.345.3428 Johnson Space Center
... Matt Kowalski: 925.541.7625 Kennedy Space Center
...
...
... Melissa Lewis: 548.326.4584 Bajkonur, Kazakhstan"""
>>>
>>> entries = re.split(r'\n+', TEXT)
>>> print(entries)  # doctest: +NORMALIZE_WHITESPACE
['Pan Twardowski: 834.345.1254 Polish Space Agency',
 'Mark Watney: 892.345.3428 Johnson Space Center',
 'Matt Kowalski: 925.541.7625 Kennedy Space Center',
 'Melissa Lewis: 548.326.4584 Bajkonur, Kazakhstan']
>>>
>>> result = [re.split(r':?\s', entry, maxsplit=3) for entry in entries]
>>> print(result)  # doctest: +NORMALIZE_WHITESPACE
[['Pan', 'Twardowski', '834.345.1254', 'Polish Space Agency'],
 ['Mark', 'Watney', '892.345.3428', 'Johnson Space Center'],
 ['Matt', 'Kowalski', '925.541.7625', 'Kennedy Space Center'],
 ['Melissa', 'Lewis', '548.326.4584', 'Bajkonur, Kazakhstan']]
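
The nested lists can be turned into records that are easier to query. The sketch below reuses two entries from the phonebook; the field names in `FIELDS` are an assumption made for this example and do not come from the source:

```python
import re

# Two entries in the same format as the phonebook above
entries = [
    'Mark Watney: 892.345.3428 Johnson Space Center',
    'Melissa Lewis: 548.326.4584 Bajkonur, Kazakhstan',
]

# Field names are hypothetical, chosen only for illustration
FIELDS = ('firstname', 'lastname', 'phone', 'address')

# Pair each field name with the matching piece of the split entry
phonebook = [
    dict(zip(FIELDS, re.split(r':?\s', entry, maxsplit=3)))
    for entry in entries
]

print(phonebook[0]['phone'])    # 892.345.3428
print(phonebook[1]['address'])  # Bajkonur, Kazakhstan
```

Because `maxsplit=3` stops after the third split, the address keeps its internal spaces and comma.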

6.20.6. Assignments

# %% License
# - Copyright 2025, Matt Harasymczuk <matt@python3.info>
# - This code can be used only for learning by humans
# - This code cannot be used for teaching others
# - This code cannot be used for teaching LLMs and AI algorithms
# - This code cannot be used in commercial or proprietary products
# - This code cannot be distributed in any form
# - This code cannot be changed in any form outside of training course
# - This code cannot have its license changed
# - If you use this code in your product, you must open-source it under GPLv2
# - Exception can be granted only by the author

# %% Run
# - PyCharm: right-click in the editor and `Run Doctest in ...`
# - PyCharm: keyboard shortcut `Control + Shift + F10`
# - Terminal: `python -m doctest -v myfile.py`

# %% About
# - Name: RE Split Moon Speech
# - Difficulty: easy
# - Lines: 4
# - Minutes: 8

# %% English
# 1. Use `re.split()` to split text [1] into paragraphs
# 2. Define `result: str` containing the paragraph that starts with 'We choose to go to the moon'
# 3. Run doctests - all must succeed

# %% References
# [1] Kennedy, J.F. Moon Speech - Rice Stadium,
#     URL: http://er.jsc.nasa.gov/seh/ricetalk.htm
#     Year: 2019
#     Retrieved: 2019-12-14

# %% Tests
"""
>>> import sys; sys.tracebacklimit = 0
>>> assert sys.version_info >= (3, 9), \
'Python 3.9+ required'

>>> assert type(result) is str, 'result must be a str'
>>> assert not result.startswith('<p>'), 'result cannot start with <p>'
>>> assert not result.endswith('</p>'), 'result cannot end with </p>'

>>> print(result)
We choose to go to the moon. We choose to go to
the moon in this decade and do the other things, not because they
are easy, but because they are hard, because that goal will serve
to organize and measure the best of our energies and skills, because
that challenge is one that we are willing to accept, one we are
unwilling to postpone, and one which we intend to win, and the
others, too.
"""

import re


DATA = """<h1>TEXT OF PRESIDENT JOHN KENNEDY'S RICE STADIUM MOON SPEECH</h1>
<p>President Pitzer, Mr. Vice President, Governor,
Congressman Thomas, Senator Wiley, and Congressman Miller, Mr. Webb,
Mr. Bell, scientists, distinguished guests, and ladies and
gentlemen:</p><p>We choose to go to the moon. We choose to go to
the moon in this decade and do the other things, not because they
are easy, but because they are hard, because that goal will serve
to organize and measure the best of our energies and skills, because
that challenge is one that we are willing to accept, one we are
unwilling to postpone, and one which we intend to win, and the
others, too.</p><p>It is for these reasons that I regard the
decision last year to shift our efforts in space from low to high
gear as among the most important decisions that will be made during
my incumbency in the office of the Presidency.</p><p>In the last 24
hours we have seen facilities now being created for the greatest
and most complex exploration in man's history. We have felt the
ground shake and the air shattered by the testing of a Saturn C-1
booster rocket, many times as powerful as the Atlas which launched
John Glenn, generating power equivalent to 10,000 automobiles with
their accelerators on the floor. We have seen the site where the F-1
rocket engines, each one as powerful as all eight engines of the
Saturn combined, will be clustered together to make the advanced
Saturn missile, assembled in a new building to be built at Cape
Canaveral as tall as a 48 story structure, as wide as a city block,
and as long as two lengths of this field.</p>
"""


# use re.split() to get paragraph "We choose to go to the moon"
# type: str
result = ...