Advanced Lesson 1
Regular Expressions
Chapter 6: Implementing regular expressions
Greedy operations
In Python regular expressions, the *
, +
, ?
and {m,n}
quantifiers are all greedy.
This means that Python will try to match the pattern to the longest string possible.
As an example, let’s say our regular expression is "a.*b"
(try describing what this represents in plain English!)
If we try to match the pattern to the string "abcbcacdefb"
with re.match()
, what happens? (Try this!)
>>> re.match("a.*b", "abcbcacdefb")
<re.Match object; span=(0, 11), match='abcbcacdefb'>
So Python will match the pattern to the whole string. This is a greedy behaviour (not necessarily a bad thing!)
However, sometimes we might only want to match it to the minimum match. A good use case will be the following.
>>> re.match("<.*>", "<a> b <c>")
<re.Match object; span=(0, 9), match='<a> b <c>'>
The pattern "<.*>"
is matched to the whole string "<a> b <c>"
. But perhaps we are actually really more interested in the tag <a>
rather than matching the whole string.
The solution is to use a non-greedy version of the quantifier *
, that is *?
(note that the ?
is different from ‘optional’ when it comes after a star)
>>> re.match("<.*?>", "<a> b <c>")
<re.Match object; span=(0, 3), match='<a>'>
In the non-greedy version, Python matches the closing ">"
to the first match it finds, rather than to the last match.
If you use re.findall()
, you will match both <a>
and <c>
separately.
>>> re.findall("<.*>", "<a> b <c>") # Greedy version
['<a> b <c>']
>>> re.findall("<.*?>", "<a> b <c>") # Non-greedy version
['<a>', '<c>']
Going back to our very first example, if we instead use a non-greedy quantifier, we can match all substrings that start with a
and ends with b
, rather than the longest one.
>>> re.findall("a.*b", "abcbcacdefb") # Greedy
['abcbcacdefb']
>>> re.findall("a.*?b", "abcbcacdefb") # Non-greedy
['ab', 'acdefb']
You can do the same for the other quantifiers:
+?
: Non-greedy one or more.??
: Non-greedy optional.- Try
re.match("jo(siah)?", "josiah")
vs.re.match("jo(siah)??", "josiah")
- Try
{m,n}?
: Non-greedym
ton
repeats.- Try
re.match("ba{1,3}", "baaaa")
vs.re.match("ba{1,3}?", "baaaa")
.
- Try