6.9. Regex Syntax Backreference

  • \g<number> - backreferencing by position

  • \g<name> - backreferencing by name

  • (?P=name) - backreferencing by name

  • Note, that for backreference, must use raw-sting or double backslash

6.9.1. SetUp

>>> import re

6.9.2. Backreference

>>> html = '<p>Hello World</p>'
>>> tag = re.findall(r'<(?P<tag>.+)>(?:.+)</(?P=tag)>', html)
>>>
>>> tag
['p']

6.9.3. Recall Group by Position

  • \g<number> - backreferencing by position

>>> html = '<p>Hello World</p>'
>>>
>>> search = r'<p>(.+)</p>'
>>> replace = r'<strong>\g<1></strong>'
>>>
>>> re.sub(search, replace, html)
'<strong>Hello World</strong>'

6.9.4. Recall Group by Name

  • \g<name> - backreferencing by name

>>> html = '<p>Hello World</p>'
>>>
>>> search = r'<p>(?P<text>.+)</p>'
>>> replace = r'<strong>\g<text></strong>'
>>>
>>> re.sub(search, replace, html)
'<strong>Hello World</strong>'

6.9.5. Example

  • (?P<tag><.*?>).+(?P=tag) - matches text inside of a <tag> (opening and closing tag is the same)

6.9.6. Use Case - 1

>>> import re
>>>
>>>
>>> year = r'(?P<year>\d{4})'
>>> month = r'(?P<month>[A-Z][a-z]{2})'
>>> day = r'(?P<day>\d{1,2})'
>>> date = f'{month} {day}(?:st|nd|rd|th), {year}'
>>>
>>> TEXT = 'Email from Mark Watney <mwatney@nasa.gov> received on: Sun, Jan 2nd, 2000 at 12:00 AM'

Recall group by position:

>>> re.sub(date, r'\g<3> \g<1> \g<2>', TEXT)
'Email from Mark Watney <mwatney@nasa.gov> received on: Sun, 2000 Jan 2 at 12:00 AM'

Recall group by name:

>>> re.sub(date, r'\g<year> \g<month> \g<day>', TEXT)
'Email from Mark Watney <mwatney@nasa.gov> received on: Sun, 2000 Jan 2 at 12:00 AM'

Although this is not working in Python:

>>> re.sub(f'{month} {day}nd, {year}', '(?P=day) (?P=month) (?P=year)', TEXT)
'Email from Mark Watney <mwatney@nasa.gov> received on: Sun, (?P=day) (?P=month) (?P=year) at 12:00 AM'

6.9.7. Use Case - 2

>>> import re
>>>
>>>
>>> HTML = '<p>Hello World</p>'
>>>
>>> re.findall(r'<(?P<tag>.*?)>(.*?)</(?P=tag)>', HTML)
[('p', 'Hello World')]

6.9.8. Use Case - 3

>>> import re
>>>
>>>
>>> html = '<p>We choose to go to the <strong>Moon</strong></p>'
>>>
>>>
>>> re.findall('<(?P<tagname>[a-z]+)>.*</(?P=tagname)>', html)
['p']
>>>
>>> re.findall('<(?P<tagname>[a-z]+)>(.*)</(?P=tagname)>', html)
[('p', 'We choose to go to the <strong>Moon</strong>')]

6.9.9. Assignments