Greedy operations
In Python regular expressions, the *
, +
, ?
and {m,n}
quantifiers are all greedy.
This means that Python will try to match the pattern to the largest string possible.
It is probably easier to illustrate this with an example.
So let’s say our regular expression is a.*b
(try describing what this represents in plain English!)
If we try to match the pattern to the string abcbcacdefb
with re.match()
, what happens? (Try this!)
>>> re.match("a.*b", "abcbcacdefb")
<re.Match object; span=(0, 11), match='abcbcacdefb'>
So Python will match the pattern to the whole string. So this is a greedy behaviour (not necessarily a bad thing!)
However, sometimes we might only want to match it to the minimum match. A good use case will be this.
>>> re.match("<.*>", "<a> b <c>")
<re.Match object; span=(0, 9), match='<a> b <c>'>
The pattern is matched to the whole string <a> b <c>
. But perhaps we are actually really more interested in the tag <a>
rather than matching the whole string.
The solution is to use a non-greedy version of the quantifier *
, that is *?
.
>>> re.match("<.*?>", "<a> b <c>")
<re.Match object; span=(0, 3), match='<a>'>
In the non-greedy version, Python matches the closing >
to the first match it finds, rather than the last match.
If you use re.findall()
, you will match both <a>
and <c>
.
### Greedy version
>>> re.findall("<.*>", "<a> b <c>")
['<a> b <c>']
### Non-greedy version
>>> re.findall("<.*?>", "<a> b <c>")
['<a>', '<c>']
In our first example, if we instead use a non-greedy quantifier, we can match all substrings that start with a
and ends with b
, rather than the longest one.
>>> re.findall("a.*?b", "abcbcacdefb")
['ab', 'acdefb']
You can do the same for the other quantifiers:
+?
: Non-greedy one or more.??
: Non-greedy optional.- Try
re.match("jo(siah)?", "josiah")
vs.re.match("jo(siah)??", "josiah")
- Try
{m,n}?
: Non-greedym
ton
repeats.- Try
re.match("ba{1,3}", "baaaa")
vs.re.match("ba{1,3}?", "baaaa")
.
- Try