Advanced Lesson 1
Regular Expressions
Chapter 4: Boundaries
Word boundary
Suppose you want to search for the word "red" anywhere in a piece of text. You construct the regular expression "red", but it also matches the "red" inside words like "hundred" and "reddit".
So, how do you limit your regular expression to match "red" only when it is the whole word "red", and not when it is part of another word like "hundred"?
The solution is to use the word boundary marker (\b).
The regular expression "red\b" (with an end-of-word marker) prevents a match inside "reddit".
The regular expression "\bred" (with a beginning-of-word marker) prevents a match inside "hundred".
You can combine both, as in "\bred\b", to rule out both cases.
Note: you need to be careful when implementing this in Python. In an ordinary Python string literal, \b is the backspace character, so you have to escape the backslash by writing \\b instead, i.e. "\\bred\\b". Alternatively, and preferably, use a Python raw string literal: r"\b" is equivalent to "\\b".
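To see why the raw string matters, here is a minimal sketch comparing the three spellings. The variable names and the sample text are chosen only for illustration.

```python
import re

plain = "\bred\b"      # "\b" here is the backspace character (\x08), not a word boundary
escaped = "\\bred\\b"  # backslashes escaped by hand
raw = r"\bred\b"       # raw string literal: backslashes are kept as written

text = "yellow, red, and blue"

print(plain == "\x08red\x08")  # the plain literal really contains backspaces
print(escaped == raw)          # the escaped and raw spellings are the same string
print(re.search(plain, text))  # looks for literal backspaces, so nothing is found
print(re.search(raw, text))    # finds the whole word "red"
```

The plain literal does not raise an error; it silently searches for backspace characters, which is why this mistake can be hard to spot.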
>>> re.search("red", "A hundred reddit posts a day.") # Matches red in hundred
<re.Match object; span=(6, 9), match='red'>
>>> re.search(r"\bred", "A hundred reddit posts a day.") # Matches red in reddit
<re.Match object; span=(10, 13), match='red'>
>>> re.search(r"red\b", "A hundred reddit posts a day.") # Matches red in hundred
<re.Match object; span=(6, 9), match='red'>
>>> re.search(r"\bred\b", "A hundred reddit posts a day.") # None
>>> re.search(r"\bred\b", "yellow, red, and blue")
<re.Match object; span=(8, 11), match='red'>
Notice that in the last example, the comma right after red is not treated as part of the word: a comma is not a word character, so \b finds a boundary between the "d" and the ",".
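In Python's re module, the word characters that \b looks at are letters, digits, and the underscore. The following sketch, using made-up sample strings, shows which neighbours count as a boundary; note that an underscore or a digit right after "red" does not.

```python
import re

pattern = re.compile(r"\bred\b")

# Hypothetical sample strings, chosen to probe different neighbours of "red".
samples = ["red", "red, white", "(red)", "red-hot", "red_car", "red2", "hundred"]

# Keep only the samples in which "red" appears as a whole word.
matched = [s for s in samples if pattern.search(s)]
print(matched)
```

Punctuation such as a comma, parenthesis, or hyphen ends the word, as does the start or end of the string, so the first four samples match. "red_car" and "red2" do not, because "_" and "2" are word characters.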