7.16. Regex Quantifier Recap
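A quantifier specifies how many occurrences of the preceding character, group, or character class to match
Greedy (default) - match as many occurrences as possible
Lazy (quantifier followed by ?) - match as few occurrences as possible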
7.16.1. SetUp
import re
7.16.2. Numbers
r'\d+' - Greedy
r'\d+?' - Lazy
TEXT = 'On Sun, Jan 1st, 2000 at 12:00 AM Alice <alice@example.com> wrote'
re.findall(r'\d{1,}', TEXT)
['1', '2000', '12', '00']
TEXT = 'On Sun, Jan 1st, 2000 at 12:00 AM Alice <alice@example.com> wrote'
re.findall(r'\d{1,}?', TEXT)
['1', '2', '0', '0', '0', '1', '2', '0', '0']
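The two results above differ only because of the trailing `?`. The sketch below is an illustration, not part of the original examples: it checks that `+` and `{1,}` behave the same way, and shows that a lazy quantifier still grows when the rest of the pattern requires it. The variable names and the `\d+?:` pattern are introduced here for demonstration only.

```python
import re

TEXT = 'On Sun, Jan 1st, 2000 at 12:00 AM Alice <alice@example.com> wrote'

# `+` is shorthand for `{1,}`: both mean "one or more"
greedy_plus = re.findall(r'\d+', TEXT)
greedy_braces = re.findall(r'\d{1,}', TEXT)
print(greedy_plus == greedy_braces)  # True

# Lazy: each match stops as soon as the minimum (one digit) is satisfied
lazy = re.findall(r'\d+?', TEXT)
print(lazy)  # ['1', '2', '0', '0', '0', '1', '2', '0', '0']

# Lazy does not mean "always one character": here the pattern must
# reach a colon, so the quantifier expands until it finds one
lazy_until_colon = re.findall(r'\d+?:', TEXT)
print(lazy_until_colon)  # ['12:']
```

A lazy quantifier takes the shortest match that still lets the whole pattern succeed, which is why the last call returns both digits of the hour.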
7.16.3. Text
r'[A-Z].+\.' - Greedy
r'[A-Z].+?\.' - Lazy
TEXT = 'Email from Alice. Received on Sunday.'
sentence = r'[A-Z].+\.'
re.findall(sentence, TEXT)
['Email from Alice. Received on Sunday.']
TEXT = 'Email from Alice. Received on Sunday.'
sentence = r'[A-Z].+?\.'
re.findall(sentence, TEXT)
['Email from Alice.', 'Received on Sunday.']
Note the number of sentences in each case. The greedy quantifier returns a single result, spanning from the first capital letter to the last possible dot. The lazy quantifier stops at the closest dot, so the text is split into two results, each running from a capital letter to the nearest dot.
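One way to see where each match stops is to compare the end positions reported by `re.search`. This is a minimal sketch using the same text and patterns as above; the names `greedy`, `lazy`, and `first_dot` are illustrative only.

```python
import re

TEXT = 'Email from Alice. Received on Sunday.'

# re.search returns the first match only
greedy = re.search(r'[A-Z].+\.', TEXT)
lazy = re.search(r'[A-Z].+?\.', TEXT)

# Position just after the first dot in the text
first_dot = TEXT.index('.') + 1

# Greedy runs to the last dot, which here is the end of the text
print(greedy.group())            # Email from Alice. Received on Sunday.
print(greedy.end() == len(TEXT)) # True

# Lazy stops right after the closest dot
print(lazy.group())              # Email from Alice.
print(lazy.end() == first_dot)   # True
```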
7.16.4. Use Case - 1
HTML = '<p>We choose to go to the Moon</p>'
tag = r'<.+>'
re.findall(tag, HTML)
['<p>We choose to go to the Moon</p>']
tag = r'<.+?>'
re.findall(tag, HTML)
['<p>', '</p>']
7.16.5. Use Case - 2
import re
HTML = '<h1>Header 1</h1><p>Paragraph 1</p><p>Paragraph 2</p>'
re.findall(r'<p>.*</p>', HTML)
['<p>Paragraph 1</p><p>Paragraph 2</p>']
re.findall(r'<p>.*?</p>', HTML)
['<p>Paragraph 1</p>', '<p>Paragraph 2</p>']
7.16.6. Use Case - 3
import re
HTML = '<h1>Header 1</h1><p>Paragraph 1</p><p>Paragraph 2</p>'
re.findall(r'</?.*>', HTML)
['<h1>Header 1</h1><p>Paragraph 1</p><p>Paragraph 2</p>']
re.findall(r'</?.*?>', HTML)
['<h1>', '</h1>', '<p>', '</p>', '<p>', '</p>']
7.16.7. Use Case - 4
import re
HTML = '<h1>Header 1</h1><p>Paragraph 1</p><p>Paragraph 2</p>'
re.findall(r'<.+>', HTML)
['<h1>Header 1</h1><p>Paragraph 1</p><p>Paragraph 2</p>']
re.findall(r'<.+?>', HTML)
['<h1>', '</h1>', '<p>', '</p>', '<p>', '</p>']
re.findall(r'</?.+?>', HTML)
['<h1>', '</h1>', '<p>', '</p>', '<p>', '</p>']
re.findall(r'</?(.+?)>', HTML)
['h1', 'h1', 'p', 'p', 'p', 'p']
tags = re.findall(r'</?(.+?)>', HTML)
sorted(set(tags))
['h1', 'p']
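The last two patterns in this use case match the same text, yet `re.findall` returns different values. When a pattern contains exactly one capture group, `re.findall` returns the text of that group instead of the whole match. The sketch below illustrates this. The non-capturing variant `(?:...)` is added here purely for comparison and does not appear in the examples above.

```python
import re

HTML = '<h1>Header 1</h1><p>Paragraph 1</p><p>Paragraph 2</p>'

# No group: findall returns each whole match
whole = re.findall(r'</?.+?>', HTML)
print(whole)  # ['<h1>', '</h1>', '<p>', '</p>', '<p>', '</p>']

# One capture group: findall returns only what the group captured
names = re.findall(r'</?(.+?)>', HTML)
print(names)  # ['h1', 'h1', 'p', 'p', 'p', 'p']

# Non-capturing group: groups the sub-pattern without changing the result
non_capturing = re.findall(r'</?(?:.+?)>', HTML)
print(non_capturing == whole)  # True

# Unique tag names in alphabetical order
print(sorted(set(names)))  # ['h1', 'p']
```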