7.23. Regex Flavors

  • Regular expression syntax differs between programming languages and tools

  • PCRE - Perl Compatible Regular Expressions

Figure 7.5. How Standards Proliferate. XKCD Standards [1]

7.23.1. SetUp

>>> import re

7.23.2. Literals

  • In Python we use raw-string (r'...')

  • In JavaScript we use /pattern/flags

Python:

'hello'    # unicode string literal
"hello"    # unicode string literal
r'hello'   # raw-string literal
r"hello"   # raw-string literal

JavaScript:

'hello'     // string literal
"hello"     // string literal
`hello`     // template literal
/hello/     // regular expression
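
The reason for preferring raw strings in Python can be sketched with a small comparison. In a regular string literal Python processes backslash escapes itself before the regex engine sees the pattern. The sample text and variable names below are illustrative assumptions, not part of the original material.

```python
import re

# In a regular string '\b' is the backspace character (0x08),
# so the regex engine never receives a word-boundary escape.
plain = '\bcat\b'

# In a raw string the backslash is kept, so the regex engine
# receives \b and treats it as a word boundary.
raw = r'\bcat\b'

text = 'a cat sat'

plain_result = re.findall(plain, text)
raw_result = re.findall(raw, text)

print(len(plain), len(raw))  # 5 7
print(plain_result)          # []
print(raw_result)            # ['cat']
```

JavaScript regex literals (/.../) avoid this problem because the pattern is not written inside a string literal at all.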

7.23.3. Compile

Python:

result = re.compile(r'[a-z]')

JavaScript:

result = new RegExp("[a-z]")
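
A compiled pattern is an object that can be reused for many searches. The following is a minimal sketch of that usage in Python; the input string and variable names are assumptions chosen for illustration.

```python
import re

# Compile once, then reuse the pattern object's methods.
pattern = re.compile(r'[a-z]')

first = pattern.search('Hello')       # first lowercase letter
all_lower = pattern.findall('Hello')  # every lowercase letter

print(first.group())  # e
print(all_lower)      # ['e', 'l', 'l', 'o']
```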

7.23.4. Flags

  • In Python we pass the flags=... argument, combining flags with |

  • In JavaScript we use /pattern/flags or the second argument of new RegExp()

Python:

re.compile(r'[a-z]+', flags=re.I)
re.compile(r'[a-z]+', flags=re.I|re.M)

re.compile(r'[a-z]+', flags=re.IGNORECASE)
re.compile(r'[a-z]+', flags=re.IGNORECASE|re.MULTILINE)

JavaScript:

/[a-z]+/i
/[a-z]+/im

JavaScript (RegExp constructor):

new RegExp("[a-z]", "i")
new RegExp("[a-z]", "im")
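
To see what the flags change in practice, the sketch below runs the same pattern with and without re.I and re.M on the Python side. The two-line sample text is an assumption made for the example.

```python
import re

text = 'Hello\nworld'

# Without flags only lowercase letters match.
no_flags = re.findall(r'[a-z]+', text)

# re.I (re.IGNORECASE) makes [a-z] match uppercase letters too.
ignore_case = re.findall(r'[a-z]+', text, flags=re.I)

# Without re.M the ^ anchor matches only at the start of the string.
anchored = re.findall(r'^[a-z]+', text, flags=re.I)

# With re.M (re.MULTILINE) the ^ anchor matches at the start of every line.
multiline = re.findall(r'^[a-z]+', text, flags=re.I | re.M)

print(no_flags)     # ['ello', 'world']
print(ignore_case)  # ['Hello', 'world']
print(anchored)     # ['Hello']
print(multiline)    # ['Hello', 'world']
```

The short and long names refer to the same flag objects, so re.I and re.IGNORECASE are interchangeable.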

7.23.5. Named Groups

  • In Python we use (?P<name>...)

  • In JavaScript we use (?<name>...)

Python:

r'(?P<mygroup>[a-z]+)'

JavaScript:

/(?<mygroup>[a-z]+)/
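
Once a group has a name, Python exposes the captured text under that name. The following sketch assumes a made-up date string and the group names year and month purely for illustration.

```python
import re

# Two named groups using Python's (?P<name>...) syntax.
pattern = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})')

match = pattern.search('Date: 2024-03')

year = match.group('year')   # access one group by name
groups = match.groupdict()   # all named groups as a dict

print(year)    # 2024
print(groups)  # {'year': '2024', 'month': '03'}
```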

7.23.6. Range

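  • [a-Z] is sometimes written as shorthand for [a-zA-Z]

  • [a-9] is sometimes written as shorthand for [a-zA-Z0-9]

  • Such ranges are not portable: in ASCII order Z (90) and 9 (57) come before a (97), so Python, JavaScript and Perl reject them as out of order

Python:

r'[a-z]'  # ok
r'[A-Z]'  # ok
r'[A-z]'  # ok, but also matches [ \ ] ^ _ ` which lie between Z and a

r'[a-Z]'  # re.PatternError: bad character range a-Z at position 1

JavaScript:

/[a-Z]/   // SyntaxError: Invalid regular expression: /[a-Z]/: Range out of order in character class

Perl:

/[a-Z]/   # Invalid [] range "a-Z" in regex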
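
Both Python behaviours described in this section can be checked directly. The sketch below catches the error through re.error, which is the older name that re.PatternError aliases in recent Python versions; the test strings are assumptions.

```python
import re

# A range whose start sorts after its end is rejected at compile time.
message = None
try:
    re.compile(r'[a-Z]')
except re.error as exc:
    message = str(exc)

# [A-z] compiles, but it spans code points 65-122, which also
# covers the punctuation characters between 'Z' and 'a'.
extra = re.findall(r'[A-z]', '_^')

# The portable way to match ASCII letters is to list both ranges.
letters = re.findall(r'[a-zA-Z]', 'a_Z^')

print(message)  # bad character range a-Z at position 1
print(extra)    # ['_', '^']
print(letters)  # ['a', 'Z']
```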

7.23.7. Group Backreference

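  • (?P=name) - Python, inside a pattern

  • \k<name> - JavaScript, inside a pattern

  • \g<name> - Python, in a replacement string

  • \g<1> - Python, in a replacement string

  • \1 - Python, JavaScript, grep, egrep, sed

  • $1 - replacement strings in JavaScript, Perl, Jetbrains IDE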

Python:

r'(?P<word>[a-z]+)\s+(?P=word)'
r'([a-z]+)\s+\1'

JavaScript:

/(?<word>[a-z]+)\s+\k<word>/
/([a-z]+)\s+\1/
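
In Python a backreference can appear both inside the pattern and in the replacement string passed to re.sub(). The sketch below collapses a doubled word using each notation; the sample sentence is an assumption.

```python
import re

text = 'hello hello world'

# Inside the pattern: (?P=word) and \1 refer to an earlier group.
named = re.search(r'(?P<word>[a-z]+)\s+(?P=word)', text)
numbered = re.search(r'([a-z]+)\s+\1', text)

# In the replacement string: \g<word>, \g<1> and \1 insert the captured text.
by_name = re.sub(r'(?P<word>[a-z]+)\s+(?P=word)', r'\g<word>', text)
by_number = re.sub(r'([a-z]+)\s+\1', r'\g<1>', text)
by_short = re.sub(r'([a-z]+)\s+\1', r'\1', text)

print(named.group())     # hello hello
print(numbered.group())  # hello hello
print(by_name)           # hello world
print(by_number)         # hello world
print(by_short)          # hello world
```

The \g<1> form is useful when the reference is followed by a digit, because \g<1>0 is unambiguous while \10 would be read as group 10.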

7.23.8. Named Ranges

  • [:alpha:] - Alphabetic character [a-zA-Z]

  • [:alnum:] - Alphabetic and numeric character [a-zA-Z0-9]

  • [:blank:] - Space or tab

  • [:cntrl:] - Control character

  • [:digit:] - Digit

  • [:graph:] - Non-blank character (excludes spaces, control characters, and similar)

  • [:lower:] - Lowercase alphabetical character

  • [:print:] - Like [:graph:], but includes the space character

  • [:punct:] - Punctuation character

  • [:space:] - Whitespace character ([:blank:], newline, carriage return, etc.)

  • [:upper:] - Uppercase alphabetical character

  • [:xdigit:] - Digit allowed in a hexadecimal number (i.e., 0-9a-fA-F)

  • [:word:] - A character in one of the following Unicode general categories: Letter, Mark, Number, Connector_Punctuation

  • [:ascii:] - A character in the ASCII character set

In Python these named ranges do not work. The pattern [:alpha:] is interpreted as an ordinary character set that matches any one of the characters :, a, l, p or h.

>>> string = 'Hello Alice'
>>>
>>> re.findall(r'[A-Z]', string)
['H', 'A']
>>>
>>> re.findall(r'[:upper:]', string)
['e', 'e']
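
One possible workaround is to translate the named ranges into explicit sets before compiling. The sketch below is only an illustration under stated assumptions: POSIX_CLASSES and expand_posix are hypothetical names, not part of the re module, and the table covers just a few ASCII-only classes.

```python
import re

# Hypothetical lookup table: POSIX class name -> ASCII-only set body.
POSIX_CLASSES = {
    'alpha': 'a-zA-Z',
    'alnum': 'a-zA-Z0-9',
    'digit': '0-9',
    'lower': 'a-z',
    'upper': 'A-Z',
    'xdigit': '0-9a-fA-F',
}


def expand_posix(pattern):
    """Replace each [:name:] with its set body; unknown names are left as-is."""
    return re.sub(
        r'\[:([a-z]+):\]',
        lambda m: POSIX_CLASSES.get(m.group(1), m.group(0)),
        pattern,
    )


string = 'Hello Alice'

# POSIX tools write the class inside a bracket expression: [[:upper:]]
pattern = expand_posix(r'[[:upper:]]')
result = re.findall(pattern, string)

print(pattern)  # [A-Z]
print(result)   # ['H', 'A']
```

For many cases the built-in escapes such as \d, \s and \w are a simpler substitute, although in Python 3 they match Unicode characters by default rather than ASCII only.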

7.23.9. References