This is an archived version of the course. Please find the latest version of the course on the main webpage.

Chapter 6: Implementing regular expressions

Greedy operations

face Josiah Wang

In Python regular expressions, the *, +, ? and {m,n} quantifiers are all greedy.

This means that Python will try to match the pattern to the longest string possible.

As an example, let’s say our regular expression is "a.*b" (try describing what this represents in plain English!)

If we try to match the pattern to the string "abcbcacdefb" with re.match(), what happens? (Try this!)

>>> re.match("a.*b", "abcbcacdefb")
<re.Match object; span=(0, 11), match='abcbcacdefb'>

So Python will match the pattern to the whole string. This is a greedy behaviour (not necessarily a bad thing!)

However, sometimes we might only want to match the pattern to the minimum match. A good use case would be the following.

>>> re.match("<.*>", "<a> b <c>")
<re.Match object; span=(0, 9), match='<a> b <c>'>

The pattern "<.*>" is matched to the whole string "<a> b <c>". But perhaps we are actually really more interested in the tag <a> rather than matching the whole string.

The solution is to use a non-greedy version of the quantifier *, that is *? (note that the ? is different from ‘optional’ when it comes after a star)

>>> re.match("<.*?>", "<a> b <c>")
<re.Match object; span=(0, 3), match='<a>'>

In the non-greedy version, Python matches the closing ">" to the first match it finds, rather than to the last match.

If you use re.findall(), you will match both <a> and <c> separately.

>>> re.findall("<.*>", "<a> b <c>") # Greedy version
['<a> b <c>']
>>> re.findall("<.*?>", "<a> b <c>") # Non-greedy version
['<a>', '<c>']

Going back to our very first example, if we instead use a non-greedy quantifier, we can match all substrings that start with a and ends with b, rather than the longest one.

>>> re.findall("a.*b", "abcbcacdefb") # Greedy
['abcbcacdefb']
>>> re.findall("a.*?b", "abcbcacdefb") # Non-greedy
['ab', 'acdefb']

You can do the same for the other quantifiers:

  • +?: Non-greedy one or more.
  • ??: Non-greedy optional.
    • Try re.match("jo(siah)?", "josiah") vs. re.match("jo(siah)??", "josiah")
  • {m,n}?: Non-greedy m to n repeats.
    • Try re.match("ba{1,3}", "baaaa") vs. re.match("ba{1,3}?", "baaaa").