7.15. Regex Quantifier Lazy
A quantifier specifies how many occurrences of the preceding qualifier or character class to match
A lazy quantifier matches as few occurrences as possible
7.15.1. SetUp
>>> import re
7.15.2. Lazy
Prefers the shortest possible match
Also called non-greedy
Works better with text
Gives poor results for numbers, because it splits them into the shortest allowed chunks
{n,m}? - minimum n repetitions, maximum m repetitions, prefer shorter
{,n}? - maximum n repetitions, prefer shorter
{n,}? - minimum n repetitions, prefer shorter
{0,1}? - minimum 0 repetitions, maximum 1 repetition (maybe), prefer shorter
Min/max:
>>> TEXT = 'On Sun, Jan 1st, 2000 at 12:00 AM Alice <alice@example.com> wrote'
>>> re.findall(r'\d{2,4}?', TEXT)
['20', '00', '12', '00']
No limit:
>>> TEXT = 'On Sun, Jan 1st, 2000 at 12:00 AM Alice <alice@example.com> wrote'
>>> re.findall(r'\d{2,}?', TEXT)
['20', '00', '12', '00']
Max only:
>>> TEXT = 'On Sun, Jan 1st, 2000 at 12:00 AM Alice <alice@example.com> wrote'
>>> re.findall(r'\d{,4}?', TEXT)
['', '', '', '', '', '', '', '', '', '', '', '', '', '1', '', '',
'', '', '', '2', '', '0', '', '0', '', '0', '', '', '', '', '',
'1', '', '2', '', '', '0', '', '0', '', '', '', '', '', '', '',
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '',
'', '', '', '', '', '', '', '', '', '', '', '', '']
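Laziness is a preference, not a hard limit: when the rest of the pattern cannot match after the shortest repetition, the engine extends the repetition one step at a time. A minimal sketch of that behavior, assuming a trailing word boundary `\b` (not used in the examples above) to force the extension; the names `shortest` and `extended` are only illustrative:

```python
import re

TEXT = 'On Sun, Jan 1st, 2000 at 12:00 AM Alice <alice@example.com> wrote'

# On its own, the lazy quantifier stops as soon as the minimum is satisfied.
shortest = re.findall(r'\d{2,4}?', TEXT)

# With a word boundary after it, the same lazy quantifier has to keep
# extending until the digits end, so '2000' comes back in one piece.
extended = re.findall(r'\d{2,4}?\b', TEXT)

print(shortest)  # ['20', '00', '12', '00']
print(extended)  # ['2000', '12', '00']
```

The single digit in `1st` is skipped in both cases, because the quantifier still requires at least two digits.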
7.15.3. Shorthand Lazy
*? - minimum 0 repetitions, no maximum, prefer shorter (alias for {0,}?)
+? - minimum 1 repetition, no maximum, prefer shorter (alias for {1,}?)
?? - minimum 0 repetitions, maximum 1 repetition, prefer shorter (alias for {0,1}?)
Plus:
>>> TEXT = 'On Sun, Jan 1st, 2000 at 12:00 AM Alice <alice@example.com> wrote'
>>> re.findall(r'\d{1,}?', TEXT)
['1', '2', '0', '0', '0', '1', '2', '0', '0']
>>> TEXT = 'On Sun, Jan 1st, 2000 at 12:00 AM Alice <alice@example.com> wrote'
>>> re.findall(r'\d+?', TEXT)
['1', '2', '0', '0', '0', '1', '2', '0', '0']
Star:
>>> TEXT = 'On Sun, Jan 1st, 2000 at 12:00 AM Alice <alice@example.com> wrote'
>>> re.findall(r'\d{0,}?', TEXT)
['', '', '', '', '', '', '', '', '', '', '', '', '', '1', '', '', '',
'', '', '2', '', '0', '', '0', '', '0', '', '', '', '', '', '1', '',
'2', '', '', '0', '', '0', '', '', '', '', '', '', '', '', '', '', '',
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '',
'', '', '', '', '', '', '', '']
>>> TEXT = 'On Sun, Jan 1st, 2000 at 12:00 AM Alice <alice@example.com> wrote'
>>> re.findall(r'\d*?', TEXT)
['', '', '', '', '', '', '', '', '', '', '', '', '', '1', '', '', '',
'', '', '2', '', '0', '', '0', '', '0', '', '', '', '', '', '1', '',
'2', '', '', '0', '', '0', '', '', '', '', '', '', '', '', '', '', '',
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '',
'', '', '', '', '', '', '']
Question mark:
>>> TEXT = 'On Sun, Jan 1st, 2000 at 12:00 AM Alice <alice@example.com> wrote'
>>> re.findall(r'\d{0,1}?', TEXT)
['', '', '', '', '', '', '', '', '', '', '', '', '', '1', '', '', '',
'', '', '2', '', '0', '', '0', '', '0', '', '', '', '', '', '1', '',
'2', '', '', '0', '', '0', '', '', '', '', '', '', '', '', '', '', '',
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '',
'', '', '', '', '', '', '']
>>> TEXT = 'On Sun, Jan 1st, 2000 at 12:00 AM Alice <alice@example.com> wrote'
>>> re.findall(r'\d??', TEXT)
['', '', '', '', '', '', '', '', '', '', '', '', '', '1', '', '', '',
'', '', '2', '', '0', '', '0', '', '0', '', '', '', '', '', '1', '',
'2', '', '', '0', '', '0', '', '', '', '', '', '', '', '', '', '',
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '',
'', '', '', '', '', '', '', '', '']
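The long outputs above are hard to compare by eye. As a rough check, the shorthand and explicit forms can be compared programmatically; the names `PAIRS` and `same` are introduced here only for illustration:

```python
import re

TEXT = 'On Sun, Jan 1st, 2000 at 12:00 AM Alice <alice@example.com> wrote'

# Each pair holds a shorthand lazy quantifier and its explicit form.
PAIRS = [
    (r'\d*?', r'\d{0,}?'),
    (r'\d+?', r'\d{1,}?'),
    (r'\d??', r'\d{0,1}?'),
]

# True when both spellings return the same list of matches for TEXT.
same = [re.findall(short, TEXT) == re.findall(explicit, TEXT)
        for short, explicit in PAIRS]

print(same)  # [True, True, True]
```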
7.15.4. Greedy vs. Lazy
r'\d+' - Greedy
r'\d+?' - Lazy
>>> TEXT = 'On Sun, Jan 1st, 2000 at 12:00 AM Alice <alice@example.com> wrote'
>>> re.findall(r'\d+', TEXT)
['1', '2000', '12', '00']
>>> TEXT = 'On Sun, Jan 1st, 2000 at 12:00 AM Alice <alice@example.com> wrote'
>>> re.findall(r'\d+?', TEXT)
['1', '2', '0', '0', '0', '1', '2', '0', '0']
>>> TEXT = 'Email from Alice. Received on Sunday.'
>>> sentence = r'[A-Z].+\.'
>>> re.findall(sentence, TEXT)
['Email from Alice. Received on Sunday.']
>>> TEXT = 'Email from Alice. Received on Sunday.'
>>> sentence = r'[A-Z].+?\.'
>>> re.findall(sentence, TEXT)
['Email from Alice.', 'Received on Sunday.']
Note the number of sentences in each case. The greedy quantifier returns a single result that runs from the first capital letter to the last possible dot. The lazy quantifier returns two results, each running from a capital letter to the closest following dot.
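One way to see where each match stops is to inspect the match object returned by `re.search()`. The sketch below reuses the sentence patterns from above and compares the end position of the first match against the positions of the first and last dot; the variable names `greedy` and `lazy` are illustrative:

```python
import re

TEXT = 'Email from Alice. Received on Sunday.'

greedy = re.search(r'[A-Z].+\.', TEXT)
lazy = re.search(r'[A-Z].+?\.', TEXT)

# Both matches start at the first capital letter.
print(greedy.start(), lazy.start())

# The greedy match ends right after the last dot in the text,
# the lazy match ends right after the first one.
print(greedy.end() == TEXT.rindex('.') + 1)
print(lazy.end() == TEXT.index('.') + 1)
```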
7.15.5. Use Case - 1
>>> HTML = '<p>We choose to go to the Moon</p>'
>>>
>>> tag = r'<.+>'
>>> re.findall(tag, HTML)
['<p>We choose to go to the Moon</p>']
>>>
>>> tag = r'<.+?>'
>>> re.findall(tag, HTML)
['<p>', '</p>']
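A lazy quantifier is not the only way to stop at the nearest closing bracket. A common alternative, shown here as a sketch rather than as a replacement for the pattern above, is a negated character class that cannot match `>` at all:

```python
import re

HTML = '<p>We choose to go to the Moon</p>'

# [^>]+ means "one or more characters that are not >", so the match
# cannot run past the closing bracket even though + is greedy here.
tag = r'<[^>]+>'
result = re.findall(tag, HTML)

print(result)  # ['<p>', '</p>']
```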
7.15.6. Use Case - 2
>>> import re
>>> HTML = '<h1>Header 1</h1><p>Paragraph 1</p><p>Paragraph 2</p>'
>>> re.findall(r'<p>.*</p>', HTML)
['<p>Paragraph 1</p><p>Paragraph 2</p>']
>>> re.findall(r'<p>.*?</p>', HTML)
['<p>Paragraph 1</p>', '<p>Paragraph 2</p>']
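When only the paragraph text is needed, the lazy part of the pattern can be wrapped in a capturing group, in which case `re.findall()` returns the group contents instead of the whole match. A minimal sketch (the name `paragraphs` is illustrative):

```python
import re

HTML = '<h1>Header 1</h1><p>Paragraph 1</p><p>Paragraph 2</p>'

# The parentheses capture only the text between <p> and </p>;
# the lazy .*? stops at the nearest closing tag.
paragraphs = re.findall(r'<p>(.*?)</p>', HTML)

print(paragraphs)  # ['Paragraph 1', 'Paragraph 2']
```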
7.15.7. Use Case - 3
>>> import re
>>> HTML = '<h1>Header 1</h1><p>Paragraph 1</p><p>Paragraph 2</p>'
>>> re.findall(r'</?.*>', HTML)
['<h1>Header 1</h1><p>Paragraph 1</p><p>Paragraph 2</p>']
>>> re.findall(r'</?.*?>', HTML)
['<h1>', '</h1>', '<p>', '</p>', '<p>', '</p>']
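The lazy `.*?` may match zero characters, while `.+?` needs at least one. For the HTML above both give the same result, but the difference shows up on input that contains an empty pair of brackets. The `SAMPLE` string below is a made-up input used only to illustrate this:

```python
import re

SAMPLE = 'a <> b <i>'  # hypothetical input containing an empty "<>"

# .*? may match nothing, so '<>' is a match on its own.
with_star = re.findall(r'<.*?>', SAMPLE)

# .+? must consume at least one character, so the first '>' is taken as
# content and the match extends to the next '>'.
with_plus = re.findall(r'<.+?>', SAMPLE)

print(with_star)  # ['<>', '<i>']
print(with_plus)  # ['<> b <i>']
```

Lazy therefore means "as few repetitions as the rest of the pattern allows", not "the shortest substring in the text".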
7.15.8. Use Case - 4
>>> import re
>>> HTML = '<h1>Header 1</h1><p>Paragraph 1</p><p>Paragraph 2</p>'
>>> re.findall(r'<.+>', HTML)
['<h1>Header 1</h1><p>Paragraph 1</p><p>Paragraph 2</p>']
>>> re.findall(r'<.+?>', HTML)
['<h1>', '</h1>', '<p>', '</p>', '<p>', '</p>']
>>> re.findall(r'</?.+?>', HTML)
['<h1>', '</h1>', '<p>', '</p>', '<p>', '</p>']
>>> re.findall(r'</?(.+?)>', HTML)
['h1', 'h1', 'p', 'p', 'p', 'p']
>>> tags = re.findall(r'</?(.+?)>', HTML)
>>> sorted(set(tags))
['h1', 'p']
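The capturing group above returns everything up to the first `>`, which is only the tag name when tags carry no attributes. A sketch of one possible adjustment, assuming a hypothetical input with an attribute (the `HTML` value and the names `raw` and `names` are not from the original examples):

```python
import re

# Hypothetical input: an opening tag that carries an attribute.
HTML = '<a href="/moon">Moon</a>'

# The lazy group takes everything up to the first '>', attributes included.
raw = re.findall(r'</?(.+?)>', HTML)

# Capturing \w+ keeps just the name; the lazy .*? skips the rest of the tag.
names = re.findall(r'</?(\w+).*?>', HTML)

print(raw)    # ['a href="/moon"', 'a']
print(names)  # ['a', 'a']
```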