10.10. Regex Syntax Flavors
In other programming languages
PCRE - Perl Compatible Regular Expressions

Figure 10.12. How Standards Proliferate. XKCD Standards [1]
10.10.1. SetUp
>>> import re
10.10.2. Future
Since Python 3.11, atomic grouping ((?>...)) and possessive quantifiers (*+, ++, ?+, {m,n}+) are supported in regular expressions.
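A minimal sketch of what a possessive quantifier changes compared to a greedy one. The version guard is an assumption that the snippet may also be run on interpreters older than 3.11, where this syntax is rejected; the variable names are illustrative only.

```python
import re
import sys

TEXT = 'aaa'

# Greedy `a*` first takes all three letters, then gives one back
# so that the trailing `a` can still match.
greedy = re.fullmatch(r'a*a', TEXT)

if sys.version_info >= (3, 11):
    # Possessive `a*+` never gives characters back, so the trailing
    # `a` has nothing left to match and the whole match fails.
    possessive = re.fullmatch(r'a*+a', TEXT)
    # An atomic group around a greedy quantifier behaves the same way.
    atomic = re.fullmatch(r'(?>a*)a', TEXT)
else:
    # The syntax is not available before Python 3.11.
    possessive = atomic = None

print(greedy is not None)  # True
print(possessive is None)  # True
print(atomic is None)      # True
```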
10.10.3. Enclosing
In Python we use a raw string (r'...')
In JavaScript we use /pattern/flags or new RegExp(pattern, flags)
Python:
>>> string = 'hello'
>>> string = "hello"
>>> regex = r'[a-z]+'
>>> regex = r"[a-z]+"
JavaScript:
string = "hello"
regex = /[a-z]+/
regex = new RegExp('[a-z]+')
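Why the raw string matters in Python can be shown with the word-boundary token. This is a small sketch; the variable names `plain` and `raw` are illustrative only.

```python
import re

TEXT = 'hello world'

# Without the r prefix, '\b' is the backspace character (0x08),
# so the pattern looks for backspace + 'hello' + backspace.
plain = re.findall('\bhello\b', TEXT)

# With the r prefix, the backslash reaches the regex engine,
# where \b means "word boundary".
raw = re.findall(r'\bhello\b', TEXT)

print(plain)  # []
print(raw)    # ['hello']
```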
10.10.4. Flags
In Python we pass flags with the flags argument (for example flags=re.IGNORECASE)
In JavaScript we append flags to the literal (/pattern/flags) or pass them to new RegExp(pattern, flags)
Python:
>>> TEXT = 'hello world'
>>>
>>> re.findall(r'[a-z]+', TEXT, flags=re.I)
['hello', 'world']
>>> re.findall(r'[a-z]+', TEXT, flags=re.IGNORECASE)
['hello', 'world']
JavaScript:
/[a-z]+/i
new RegExp('[a-z]+', 'i')
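Several flags can be active at once. The sketch below combines them in Python with the | operator and, alternatively, writes them inline at the start of the pattern. The rough JavaScript counterpart would be /^[a-z]+$/im. The sample text and variable names are made up for illustration.

```python
import re

SAMPLE = 'Hello\nWORLD'

# Flags passed as an argument are combined with the | operator.
combined = re.findall(r'^[a-z]+$', SAMPLE, flags=re.IGNORECASE | re.MULTILINE)

# The same two flags written inline at the start of the pattern.
inline = re.findall(r'(?im)^[a-z]+$', SAMPLE)

print(combined)  # ['Hello', 'WORLD']
print(inline)    # ['Hello', 'WORLD']
```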
10.10.5. Named Ranges
[:alnum:] - Alphabetic and numeric character [a-zA-Z0-9]
[:alpha:] - Alphabetic character [a-zA-Z]
[:blank:] - Space or tab
[:cntrl:] - Control character
[:digit:] - Digit
[:graph:] - Non-blank character (excludes spaces, control characters, and similar)
[:lower:] - Lowercase alphabetical character
[:print:] - Like [:graph:], but includes the space character
[:punct:] - Punctuation character
[:space:] - Whitespace character ([:blank:], newline, carriage return, etc.)
[:upper:] - Uppercase alphabetical character
[:xdigit:] - Digit allowed in a hexadecimal number (i.e., 0-9a-fA-F)
[:word:] - A character in one of the following Unicode general categories: Letter, Mark, Number, Connector_Punctuation
[:ascii:] - A character in the ASCII character set
In Python these named ranges do not work. The string [:alpha:] is interpreted literally as a character set that matches any one of: :, a, l, p or h.
>>> import re
>>> TEXT = 'hello world'
>>>
>>>
>>> re.findall(r'[:alnum:]', TEXT)
['l', 'l', 'l']
>>>
>>> re.findall(r'[:alpha:]', TEXT)
['h', 'l', 'l', 'l']
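One way to port a pattern that relies on named ranges is to expand them into explicit ranges before compiling. The sketch below is only an illustration: the table POSIX_CLASSES and the helper expand_posix_classes are hypothetical names, the table is ASCII-only and incomplete, and it assumes the POSIX convention of writing the name inside a bracket expression, as in [[:alpha:]].

```python
import re

# Hypothetical ASCII-only translation table; extend as needed.
POSIX_CLASSES = {
    '[:alnum:]': 'a-zA-Z0-9',
    '[:alpha:]': 'a-zA-Z',
    '[:digit:]': '0-9',
    '[:lower:]': 'a-z',
    '[:upper:]': 'A-Z',
    '[:xdigit:]': '0-9a-fA-F',
    '[:blank:]': r' \t',
}


def expand_posix_classes(pattern):
    """Replace POSIX named classes with explicit ASCII ranges."""
    for name, ranges in POSIX_CLASSES.items():
        pattern = pattern.replace(name, ranges)
    return pattern


TEXT = 'hello world'
pattern = expand_posix_classes(r'[[:alpha:]]+')

print(pattern)                    # [a-zA-Z]+
print(re.findall(pattern, TEXT))  # ['hello', 'world']
```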
10.10.6. Range
[a-Z] == [a-zA-Z]
[a-9] == [a-zA-Z0-9]
These shortcuts may work in some other regex tools, but not in Python, which compares range endpoints by code point
>>> import re
>>> TEXT = 'hello world'
>>>
>>>
>>> re.findall(r'[a-Z]', TEXT)
Traceback (most recent call last):
re.error: bad character range a-Z at position 1
>>>
>>> re.findall(r'[a-9]', TEXT)
Traceback (most recent call last):
re.error: bad character range a-9 at position 1
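The error follows from how Python orders the endpoints of a range: ord('a') is 97 and ord('Z') is 90, so a-Z runs backwards. A short sketch of catching the error and of the explicit spelling that Python accepts (the variable names are illustrative only):

```python
import re

TEXT = 'hello world'

# Range endpoints are compared by code point, and 'a' (97) comes
# after 'Z' (90), so the range is rejected at compile time.
try:
    re.compile(r'[a-Z]')
    error = None
except re.error as exc:
    error = str(exc)

# The explicit spelling lists each range in ascending order.
letters = re.findall(r'[a-zA-Z]+', TEXT)
alnum = re.findall(r'[a-zA-Z0-9]+', 'Ares 3')

print(error)
print(letters)  # ['hello', 'world']
print(alnum)    # ['Ares', '3']
```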
10.10.7. Group Backreference
$1 - grep, egrep, Jetbrains IDE
\1, \g<1> - Python
\g<name> - Python
In JavaScript, named groups use ?<name> instead of Python's ?P<name>:
Python:
'(?P<name>\d+)'
JavaScript:
'(?<name>\d+)'
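When porting a pattern from JavaScript, the named-group syntax can be rewritten mechanically. This is a naive sketch: the helper name js_to_python_groups is hypothetical, and it does not handle an escaped parenthesis such as \( followed by ?<.

```python
import re


def js_to_python_groups(pattern):
    """Rewrite (?<name>...) as (?P<name>...), leaving lookbehinds alone."""
    # (?<= and (?<! are lookbehinds, so the negative lookahead skips them.
    return re.sub(r'\(\?<(?![=!])', '(?P<', pattern)


js_pattern = r'(?<year>\d{4})-(?<month>\d{2})'
py_pattern = js_to_python_groups(js_pattern)
groups = re.search(py_pattern, 'on 2035-11-07').groupdict()

print(py_pattern)  # (?P<year>\d{4})-(?P<month>\d{2})
print(groups)      # {'year': '2035', 'month': '11'}
```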
>>> HTML = '<span>Hello World</span>'
>>> re.findall(r'<(?P<tag>.+)>(?:.+)</(?P=tag)>', HTML)
['span']
>>> ARES = 'Mark Watney of Ares 3 landed on Mars on: Nov 7th, 2035 at 13:37'
>>>
>>> year = r'(?P<year>\d{4})'
>>> month = r'(?P<month>[A-Z][a-z]+)'
>>> day = r'(?P<day>\d{1,2})'
>>> date = f'{month} {day}(?:st|nd|rd|th), {year}'
>>>
>>> re.search(date, ARES).groupdict()
{'month': 'Nov', 'day': '7', 'year': '2035'}
>>>
>>> re.sub(date, r'\g<year> \g<month> \g<day>', ARES)
'Mark Watney of Ares 3 landed on Mars on: 2035 Nov 7 at 13:37'
>>>
>>> re.sub(date, r'\g<3> \g<1> \g<2>', ARES)
'Mark Watney of Ares 3 landed on Mars on: 2035 Nov 7 at 13:37'
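A replacement template written in the $1 style can be translated to Python's \g<1> form before calling re.sub. A minimal sketch, assuming only numbered references appear in the template; the helper name dollar_to_python is hypothetical.

```python
import re


def dollar_to_python(replacement):
    # Turn $1, $2, ... into \g<1>, \g<2>, ... as expected by re.sub.
    return re.sub(r'\$(\d+)', r'\\g<\1>', replacement)


ARES = 'Mark Watney of Ares 3 landed on Mars on: Nov 7th, 2035 at 13:37'
date = r'([A-Z][a-z]+) (\d{1,2})(?:st|nd|rd|th), (\d{4})'

template = dollar_to_python('$3 $1 $2')
result = re.sub(date, template, ARES)

print(template)  # \g<3> \g<1> \g<2>
print(result)    # Mark Watney of Ares 3 landed on Mars on: 2035 Nov 7 at 13:37
```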