7.15. Regex Quantifier Lazy
A quantifier specifies how many occurrences of the preceding qualifier or character class to match
A lazy quantifier matches as few repetitions as the rest of the pattern allows
7.15.1. SetUp
import re
7.15.2. Lazy
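Prefers the shortest possible match
Works well with text, where a match should stop at the nearest delimiter
Gives poor results for numbers, because a run of digits is split into the shortest allowed pieces
Also called non-greedy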
{n,m}? - minimum n repetitions, maximum m repetitions, prefer shorter
{,n}? - maximum n repetitions, prefer shorter
{n,}? - minimum n repetitions, prefer shorter
{0,1}? - minimum 0 repetitions, maximum 1 repetition (maybe), prefer shorter
Min/max:
TEXT = 'On Sun, Jan 1st, 2000 at 12:00 AM Alice <alice@example.com> wrote'
re.findall(r'\d{2,4}?', TEXT)
['20', '00', '12', '00']
No upper limit:
TEXT = 'On Sun, Jan 1st, 2000 at 12:00 AM Alice <alice@example.com> wrote'
re.findall(r'\d{2,}?', TEXT)
['20', '00', '12', '00']
No lower limit:
TEXT = 'On Sun, Jan 1st, 2000 at 12:00 AM Alice <alice@example.com> wrote'
re.findall(r'\d{,4}?', TEXT)
['', '', '', '', '', '', '', '', '', '', '', '', '', '1', '', '',
'', '', '', '2', '', '0', '', '0', '', '0', '', '', '', '', '',
'1', '', '2', '', '', '0', '', '0', '', '', '', '', '', '', '',
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '',
'', '', '', '', '', '', '', '', '', '', '', '', '']
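Lazy means "prefer shorter", not "always shortest": when the rest of the pattern cannot match, the engine extends the lazy quantifier one repetition at a time. A minimal sketch of that behaviour, reusing TEXT from above; the trailing ` at` literal in the second pattern is only an illustrative assumption, added to force the extension:

```python
import re

TEXT = 'On Sun, Jan 1st, 2000 at 12:00 AM Alice <alice@example.com> wrote'

# On its own, the lazy quantifier stops at the minimum of two digits
short = re.findall(r'\d{2,4}?', TEXT)

# Followed by a literal ' at', it has to grow until that literal can match
extended = re.findall(r'\d{2,4}? at', TEXT)

print(short)     # ['20', '00', '12', '00']
print(extended)  # ['2000 at']
```

The same reasoning explains the empty strings in the last example: with a minimum of zero and nothing after the quantifier, the shortest acceptable match is the empty string.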
7.15.3. Shorthand Lazy
*? - minimum 0 repetitions, no maximum, prefer shorter (alias to {0,}?)
+? - minimum 1 repetition, no maximum, prefer shorter (alias to {1,}?)
?? - minimum 0 repetitions, maximum 1 repetition, prefer shorter (alias to {0,1}?)
Plus:
TEXT = 'On Sun, Jan 1st, 2000 at 12:00 AM Alice <alice@example.com> wrote'
re.findall(r'\d{1,}?', TEXT)
['1', '2', '0', '0', '0', '1', '2', '0', '0']
TEXT = 'On Sun, Jan 1st, 2000 at 12:00 AM Alice <alice@example.com> wrote'
re.findall(r'\d+?', TEXT)
['1', '2', '0', '0', '0', '1', '2', '0', '0']
Star:
TEXT = 'On Sun, Jan 1st, 2000 at 12:00 AM Alice <alice@example.com> wrote'
re.findall(r'\d{0,}?', TEXT)
['', '', '', '', '', '', '', '', '', '', '', '', '', '1', '', '', '',
'', '', '2', '', '0', '', '0', '', '0', '', '', '', '', '', '1', '',
'2', '', '', '0', '', '0', '', '', '', '', '', '', '', '', '', '', '',
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '',
'', '', '', '', '', '', '', '']
TEXT = 'On Sun, Jan 1st, 2000 at 12:00 AM Alice <alice@example.com> wrote'
re.findall(r'\d*?', TEXT)
['', '', '', '', '', '', '', '', '', '', '', '', '', '1', '', '', '',
'', '', '2', '', '0', '', '0', '', '0', '', '', '', '', '', '1', '',
'2', '', '', '0', '', '0', '', '', '', '', '', '', '', '', '', '', '',
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '',
'', '', '', '', '', '', '']
Question mark:
TEXT = 'On Sun, Jan 1st, 2000 at 12:00 AM Alice <alice@example.com> wrote'
re.findall(r'\d{0,1}?', TEXT)
['', '', '', '', '', '', '', '', '', '', '', '', '', '1', '', '', '',
'', '', '2', '', '0', '', '0', '', '0', '', '', '', '', '', '1', '',
'2', '', '', '0', '', '0', '', '', '', '', '', '', '', '', '', '', '',
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '',
'', '', '', '', '', '', '']
TEXT = 'On Sun, Jan 1st, 2000 at 12:00 AM Alice <alice@example.com> wrote'
re.findall(r'\d??', TEXT)
['', '', '', '', '', '', '', '', '', '', '', '', '', '1', '', '', '',
'', '', '2', '', '0', '', '0', '', '0', '', '', '', '', '', '1', '',
'2', '', '', '0', '', '0', '', '', '', '', '', '', '', '', '', '',
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '',
'', '', '', '', '', '', '', '', '']
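The pairs of examples above can also be compared programmatically instead of by eye. A small sketch, assuming the same TEXT; the PAIRS list is a hypothetical helper introduced only for this check:

```python
import re

TEXT = 'On Sun, Jan 1st, 2000 at 12:00 AM Alice <alice@example.com> wrote'

# (shorthand, explicit form) pairs taken from the list above
PAIRS = [
    (r'\d*?', r'\d{0,}?'),
    (r'\d+?', r'\d{1,}?'),
    (r'\d??', r'\d{0,1}?'),
]

results = []
for shorthand, explicit in PAIRS:
    # Both spellings should produce exactly the same list of matches
    same = re.findall(shorthand, TEXT) == re.findall(explicit, TEXT)
    results.append(same)
    print(shorthand, explicit, same)
```

Each printed line should end with True, since a shorthand is only a different spelling of its explicit form.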
7.15.4. Greedy vs. Lazy
r'\d+' - Greedy
r'\d+?' - Lazy
TEXT = 'On Sun, Jan 1st, 2000 at 12:00 AM Alice <alice@example.com> wrote'
re.findall(r'\d+', TEXT)
['1', '2000', '12', '00']
TEXT = 'On Sun, Jan 1st, 2000 at 12:00 AM Alice <alice@example.com> wrote'
re.findall(r'\d+?', TEXT)
['1', '2', '0', '0', '0', '1', '2', '0', '0']
TEXT = 'Email from Alice. Received on Sunday.'
sentence = r'[A-Z].+\.'
re.findall(sentence, TEXT)
['Email from Alice. Received on Sunday.']
TEXT = 'Email from Alice. Received on Sunday.'
sentence = r'[A-Z].+?\.'
re.findall(sentence, TEXT)
['Email from Alice.', 'Received on Sunday.']
Note the number of sentences in each case. The greedy pattern returns a single result that runs from the first capital letter to the last possible dot. The lazy pattern splits the text into two results, each running from a capital letter to the nearest dot.
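Where each match starts and ends can be made visible with re.finditer(), which yields match objects instead of strings. A minimal sketch, assuming the same TEXT as above; the helper name spans is hypothetical:

```python
import re

TEXT = 'Email from Alice. Received on Sunday.'


def spans(pattern, text):
    """Return (start, end) index pairs for every match of pattern in text."""
    return [match.span() for match in re.finditer(pattern, text)]


greedy = spans(r'[A-Z].+\.', TEXT)
lazy = spans(r'[A-Z].+?\.', TEXT)

print(greedy)  # [(0, 37)]
print(lazy)    # [(0, 17), (18, 37)]
```

The greedy match ends at index 37, the end of the text, while the first lazy match ends at index 17, right after the first dot.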
7.15.5. Use Case - 1
HTML = '<p>We choose to go to the Moon</p>'
tag = r'<.+>'
re.findall(tag, HTML)
['<p>We choose to go to the Moon</p>']
tag = r'<.+?>'
re.findall(tag, HTML)
['<p>', '</p>']
7.15.6. Use Case - 2
import re
HTML = '<h1>Header 1</h1><p>Paragraph 1</p><p>Paragraph 2</p>'
re.findall(r'<p>.*</p>', HTML)
['<p>Paragraph 1</p><p>Paragraph 2</p>']
re.findall(r'<p>.*?</p>', HTML)
['<p>Paragraph 1</p>', '<p>Paragraph 2</p>']
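When only the text inside each paragraph is needed, a capturing group can be placed around the lazy part. A short sketch reusing the same HTML string; the variable names are illustrative only:

```python
import re

HTML = '<h1>Header 1</h1><p>Paragraph 1</p><p>Paragraph 2</p>'

# The lazy group captures only the text between one <p> and the nearest </p>
paragraphs = re.findall(r'<p>(.*?)</p>', HTML)
print(paragraphs)  # ['Paragraph 1', 'Paragraph 2']

# With a greedy group, the first <p> pairs with the last </p>
greedy = re.findall(r'<p>(.*)</p>', HTML)
print(greedy)  # ['Paragraph 1</p><p>Paragraph 2']
```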
7.15.7. Use Case - 3
import re
HTML = '<h1>Header 1</h1><p>Paragraph 1</p><p>Paragraph 2</p>'
re.findall(r'</?.*>', HTML)
['<h1>Header 1</h1><p>Paragraph 1</p><p>Paragraph 2</p>']
re.findall(r'</?.*?>', HTML)
['<h1>', '</h1>', '<p>', '</p>', '<p>', '</p>']
7.15.8. Use Case - 4
import re
HTML = '<h1>Header 1</h1><p>Paragraph 1</p><p>Paragraph 2</p>'
re.findall(r'<.+>', HTML)
['<h1>Header 1</h1><p>Paragraph 1</p><p>Paragraph 2</p>']
re.findall(r'<.+?>', HTML)
['<h1>', '</h1>', '<p>', '</p>', '<p>', '</p>']
re.findall(r'</?.+?>', HTML)
['<h1>', '</h1>', '<p>', '</p>', '<p>', '</p>']
re.findall(r'</?(.+?)>', HTML)
['h1', 'h1', 'p', 'p', 'p', 'p']
tags = re.findall(r'</?(.+?)>', HTML)
sorted(set(tags))
['h1', 'p']
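The group (.+?) captures everything up to the closing bracket, so it works here only because the tags carry no attributes. A hedged sketch of what might happen otherwise; the HTML_ATTR string and the \w+ variant are assumptions added for illustration, not part of the examples above:

```python
import re

# Hypothetical input: the opening tag carries an attribute
HTML_ATTR = '<p class="lead">Paragraph</p>'

# The lazy group runs to the nearest '>', so the attribute is captured too
with_attrs = re.findall(r'</?(.+?)>', HTML_ATTR)
print(with_attrs)  # ['p class="lead"', 'p']

# Restricting the group to word characters keeps only the tag name
names = re.findall(r'</?(\w+)', HTML_ATTR)
print(names)  # ['p', 'p']
```

Regular expressions like these are suited to small, predictable snippets; for arbitrary HTML a real parser is the safer choice.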