6.10. Regex Syntax Flag

  • re.ASCII - perform ASCII-only matching instead of full Unicode matching

  • re.IGNORECASE - case-insensitive search

  • re.MULTILINE - ^ and $ also match at the start and end of each line

  • re.DOTALL - dot (.) matches also newline characters

  • re.UNICODE - Unicode matching for \w, \W, \b, \B, \d, \D, \s and \S (the default for str patterns)

  • re.VERBOSE - ignores whitespace in the pattern (use \s, [ ] or an escaped space to match it) and allows comments

  • re.DEBUG - display debugging information during pattern compilation

The final piece of regex syntax that Python's regular expression engine offers is a means of setting the flags. Usually the flags are set by passing them as additional parameters when calling the re.compile() function, but sometimes it's more convenient to set them as part of the regex itself. The syntax is simply (?flags) where flags is one or more of the following:

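  • a - re.ASCII

  • i - re.IGNORECASE

  • L - re.LOCALE (bytes patterns only)

  • m - re.MULTILINE

  • s - re.DOTALL

  • u - re.UNICODE

  • x - re.VERBOSE

re.DEBUG has no inline letter; it can only be passed as a flags argument.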

If the flags are set this way, they should be put at the start of the regex; they match nothing, so their effect on the regex is only to set the flags. The letters used for the flags are the same as the ones used by Perl's regex engine, which is why s is used for re.DOTALL and x is used for re.VERBOSE [1].
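As a small sketch of the two ways of setting flags described above (the sample strings and variable names are illustrative and not part of the original text):

```python
import re

text = 'Hello World'

# Flag passed as an argument
by_argument = re.findall(r'[a-z]', text, flags=re.IGNORECASE)

# The same flag set inline, at the start of the pattern
by_inline = re.findall(r'(?i)[a-z]', text)

# Several flags: combine with | as an argument, or concatenate letters inline
multi_argument = re.findall(r'^h.', 'Hello\nhi', flags=re.IGNORECASE | re.MULTILINE)
multi_inline = re.findall(r'(?im)^h.', 'Hello\nhi')

print(by_argument == by_inline)
print(multi_argument, multi_inline)
```

Both forms should give the same result; the inline form is mostly useful when the pattern is stored as data and the calling code cannot pass flags.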

6.10.1. SetUp

>>> import re

6.10.2. IGNORECASE

  • Short: i

  • Long: re.IGNORECASE

  • Case-insensitive search

  • Has Unicode support, e.g. Ą and ą are treated as the same letter

>>> text = 'Hello World'
>>>
>>> re.findall('[a-z]', text)
['e', 'l', 'l', 'o', 'o', 'r', 'l', 'd']
>>>
>>> re.findall('[a-z]', text, flags=re.IGNORECASE)
['H', 'e', 'l', 'l', 'o', 'W', 'o', 'r', 'l', 'd']
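The Unicode support mentioned above can be illustrated with a short sketch (the sample string is an assumption chosen for the demonstration):

```python
import re

text = 'Ąą'

# Without the flag only the lowercase letter matches
case_sensitive = re.findall('ą', text)

# With the flag the uppercase form matches as well
case_insensitive = re.findall('ą', text, flags=re.IGNORECASE)

print(case_sensitive)    # ['ą']
print(case_insensitive)  # ['Ą', 'ą']
```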

6.10.3. UNICODE

  • Short: u

  • Long: re.UNICODE

  • On by default for str patterns

  • Turns on Unicode character support

  • Works for \w, \W, \b, \B, \d, \D, \s and \S

>>> TEXT = 'cześć'  # 'hello' in Polish
>>>
>>> re.findall(r'\w', TEXT)
['c', 'z', 'e', 'ś', 'ć']
>>>
>>> re.findall(r'\w', TEXT, flags=re.UNICODE)
['c', 'z', 'e', 'ś', 'ć']

Note that the character range [a-z] covers only ASCII letters, regardless of the flag:

>>> re.findall(r'[a-z]', TEXT)
['c', 'z', 'e']
>>>
>>> re.findall(r'[a-z]', TEXT, flags=re.UNICODE)
['c', 'z', 'e']
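One way to check that the flag is on by default is to inspect the flags attribute of a compiled str pattern. This is a minimal sketch; the variable names are illustrative:

```python
import re

default_pattern = re.compile(r'\w')
ascii_pattern = re.compile(r'\w', flags=re.ASCII)

# A str pattern compiled without flags already carries re.UNICODE
unicode_by_default = bool(default_pattern.flags & re.UNICODE)

# Asking for re.ASCII switches it off
unicode_with_ascii = bool(ascii_pattern.flags & re.UNICODE)

print(unicode_by_default)  # True
print(unicode_with_ascii)  # False
```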

6.10.4. ASCII

  • Short: a

  • Long: re.ASCII

  • Perform ASCII-only matching instead of full Unicode matching

  • Works for \w, \W, \b, \B, \d, \D, \s and \S

  • ASCII-only matching can be faster, but it does not match non-ASCII characters

>>> TEXT = 'cześć'  # 'hello' in Polish
>>> re.findall(r'\w', TEXT)
['c', 'z', 'e', 'ś', 'ć']
>>>
>>> re.findall(r'\w', TEXT, flags=re.ASCII)
['c', 'z', 'e']

Note that the character range [a-z] covers only ASCII letters, regardless of the flag:

>>> TEXT = 'cześć'  # 'hello' in Polish
>>>
>>> re.findall(r'[a-z]', TEXT)
['c', 'z', 'e']
>>>
>>> re.findall(r'[a-z]', TEXT, flags=re.ASCII)
['c', 'z', 'e']
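The flag affects more than \w. As a sketch, \d also matches non-ASCII decimal digits unless re.ASCII is set (the Arabic-Indic digit is an assumption picked for illustration):

```python
import re

# ASCII digit one followed by ARABIC-INDIC DIGIT THREE (U+0663)
text = '1\u0663'

unicode_digits = re.findall(r'\d', text)
ascii_digits = re.findall(r'\d', text, flags=re.ASCII)

print(len(unicode_digits))  # 2
print(ascii_digits)         # ['1']
```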

6.10.5. MULTILINE

  • Short: m

  • Long: re.MULTILINE

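  • Does not let a match span lines; it only changes where ^ and $ can match

  • Changes meaning of ^: it matches at the start of the string and at the start of each line

  • Changes meaning of $: it matches at the end of the string and at the end of each line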

>>> text = 'Hello\nWorld'
>>>
>>> re.findall('^[A-Z]', text)
['H']
>>>
>>> re.findall('^[A-Z]', text, flags=re.MULTILINE)
['H', 'W']

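The content of the text variable does not change; re.MULTILINE only changes where ^ and $ may match.

Without the flag, the whole string is one unit with a single start and end:

Hello\nWorld

With the flag, every line has its own start and end:

Hello
World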
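The same applies to $. The sketch below also shows that \A is not affected by the flag and still anchors only at the very start of the string (variable names are illustrative):

```python
import re

text = 'Hello\nWorld'

# $ matches only at the end of the string by default
last_default = re.findall(r'[a-z]$', text)

# With re.MULTILINE, $ also matches before each newline
last_multiline = re.findall(r'[a-z]$', text, flags=re.MULTILINE)

# \A ignores re.MULTILINE
first_absolute = re.findall(r'\A[A-Z]', text, flags=re.MULTILINE)

print(last_default)    # ['d']
print(last_multiline)  # ['o', 'd']
print(first_absolute)  # ['H']
```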

6.10.6. DOTALL

  • Short: s

  • Long: re.DOTALL

  • Dot (.) also matches newline characters

  • By default newlines are not matched by .

>>> text = 'Hello\nWorld'
>>>
>>> re.findall(r'.', text)
['H', 'e', 'l', 'l', 'o', 'W', 'o', 'r', 'l', 'd']
>>>
>>> re.findall(r'.', text, flags=re.DOTALL)
['H', 'e', 'l', 'l', 'o', '\n', 'W', 'o', 'r', 'l', 'd']

Note the \n character in the results when the re.DOTALL flag is turned on.
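In practice the flag matters most when a pattern has to span several lines. A minimal sketch, reusing the same sample text:

```python
import re

text = 'Hello\nWorld'

# The dot cannot cross the newline, so there is no match
without_flag = re.search(r'Hello.World', text)

# With re.DOTALL the dot consumes the newline
with_flag = re.search(r'Hello.World', text, flags=re.DOTALL)

print(without_flag)       # None
print(with_flag.group())
```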

6.10.7. DEBUG

  • Long: re.DEBUG

  • Display debugging information during pattern compilation

>>> x = re.compile('^[a-z]+@nasa.gov$', flags=re.DEBUG)  
AT AT_BEGINNING
MAX_REPEAT 1 MAXREPEAT
  IN
    RANGE (97, 122)
LITERAL 64
LITERAL 110
LITERAL 97
LITERAL 115
LITERAL 97
ANY None
LITERAL 103
LITERAL 111
LITERAL 118
AT AT_END

 0. INFO 4 0b0 10 MAXREPEAT (to 5)
 5: AT BEGINNING
 7. REPEAT_ONE 10 1 MAXREPEAT (to 18)
11.   IN 5 (to 17)
13.     RANGE 0x61 0x7a ('a'-'z')
16.     FAILURE
17:   SUCCESS
18: LITERAL 0x40 ('@')
20. LITERAL 0x6e ('n')
22. LITERAL 0x61 ('a')
24. LITERAL 0x73 ('s')
26. LITERAL 0x61 ('a')
28. ANY
29. LITERAL 0x67 ('g')
31. LITERAL 0x6f ('o')
33. LITERAL 0x76 ('v')
35. AT END
37. SUCCESS
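The debugging information is printed to standard output at compile time, not returned. If it needs to be inspected programmatically, one possible approach (an assumption of this sketch, not something the original prescribes) is to capture the output with contextlib.redirect_stdout. The exact dump format may differ between Python versions:

```python
import contextlib
import io
import re

buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    # re.DEBUG prints the parsed pattern while compiling
    pattern = re.compile(r'a@b', flags=re.DEBUG)

dump = buffer.getvalue()

# The compiled pattern itself works as usual
matched = pattern.fullmatch('a@b') is not None

print(dump)
```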

6.10.8. VERBOSE

  • Short: x

  • Long: re.VERBOSE

  • Ignores whitespace in the pattern (use \s, [ ] or an escaped space to match it) and allows for comments in re.compile()

>>> x = re.compile(r'[A-Z][a-z]+ \d{1,2}, \d{4}')
>>> x = re.compile(r'[A-Z][a-z]+(?#month) \d{1,2}(?#day), \d{4}(?#year)')
>>> x = re.compile(r"""
...     [A-Z][a-z]+  # month
...     [ ]          # space (kept, because it is inside a character class)
...     \d{1,2}      # day
...     ,[ ]         # separator
...     \d{4}        # year
... """, flags=re.VERBOSE)
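Because whitespace is ignored in verbose mode, a literal space has to be written explicitly, for example as [ ]. The sketch below compares a compact pattern with a verbose one that should be equivalent (the sample sentence and variable names are assumptions for illustration):

```python
import re

compact = re.compile(r'[A-Z][a-z]+ \d{1,2}, \d{4}')

verbose = re.compile(r"""
    [A-Z][a-z]+  # month
    [ ]          # literal space
    \d{1,2}      # day
    ,[ ]         # comma and space
    \d{4}        # year
""", flags=re.VERBOSE)

text = 'Apollo 11 landed on July 20, 1969.'

compact_result = compact.findall(text)
verbose_result = verbose.findall(text)

print(compact_result)  # ['July 20, 1969']
print(verbose_result)  # ['July 20, 1969']
```

Without the [ ] parts the verbose pattern would only match text with no spaces, such as 'July20,1969'.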

6.10.9. References