Chapter 4: Boundaries

Word boundary

Josiah Wang

Suppose you want to search for the word "red" anywhere in a piece of text. You write the regular expression "red", but it also matches the "red" inside words like "hundred" and "reddit".

So, how do you limit your regular expression to match "red" only when it is the whole word "red", and not as part of another word like "hundred"?

The solution is to use the word boundary marker (\b).

The regular expression "red\b" (with an end of word marker) will prevent a match to "reddit".

The regular expression "\bred" (with a beginning of word marker) will prevent a match to "hundred".

You can combine both as "\bred\b" to rule out both cases, so that "red" matches only as a whole word.

Note: be careful when writing this in Python. In an ordinary Python string literal, \b is the backspace character, so you have to escape the backslash and write \\b instead, i.e. "\\bred\\b". Alternatively, and preferably, use a Python raw string literal: r"\b" is equivalent to "\\b". Note the r before the opening quote.
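To see the difference concretely, the short sketch below compares the three spellings and then runs the same search with and without the raw string. The variable names (plain, escaped, raw, text) are only illustrative.

```python
import re

plain = "\b"      # one character: backspace (0x08)
escaped = "\\b"   # two characters: a backslash followed by b
raw = r"\b"       # the same two characters, written as a raw string

assert len(plain) == 1 and plain == "\x08"
assert escaped == raw and len(raw) == 2

text = "yellow, red, and blue"

# Without the r prefix, the pattern contains literal backspace characters,
# which do not occur in the text, so the search finds nothing.
no_match = re.search("\bred\b", text)

# With the r prefix, \b reaches the regex engine as a word boundary.
match = re.search(r"\bred\b", text)

print(no_match)       # None
print(match.group())  # red
```

Forgetting the r prefix does not raise an error; the search simply fails to match, which is why the mistake can be hard to spot.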

>>> import re
>>> re.search("red", "A hundred reddit posts a day.")  # Matches red in hundred
<re.Match object; span=(6, 9), match='red'>
>>> re.search(r"\bred", "A hundred reddit posts a day.") # Matches red in reddit
<re.Match object; span=(10, 13), match='red'>
>>> re.search(r"red\b", "A hundred reddit posts a day.") # Matches red in hundred
<re.Match object; span=(6, 9), match='red'>
>>> re.search(r"\bred\b", "A hundred reddit posts a day.") # None
>>> re.search(r"\bred\b", "yellow, red, and blue")
<re.Match object; span=(8, 11), match='red'>

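Notice that in the final example, the comma right after "red" does not prevent the match. \b matches at the position between a word character and a non-word character, and punctuation such as a comma is a non-word character, so it is not treated as part of the word.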
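As a rough sketch of where \b does and does not find a boundary, the example below checks a handful of made-up sample strings. It relies on Python's re module counting letters, digits, and the underscore as word characters; the sample strings and the names pattern, samples, and results are assumptions for illustration.

```python
import re

pattern = re.compile(r"\bred\b")

# Each sample maps to whether we expect "red" to match as a whole word.
samples = {
    "yellow, red, and blue": True,   # comma and space are non-word characters
    "red-hot": True,                 # a hyphen is a non-word character
    "(red)": True,                   # so are parentheses
    "red_flag": False,               # the underscore counts as a word character
    "red2": False,                   # digits count as word characters
    "hundred": False,                # no boundary between "d" and "r"
}

results = {text: bool(pattern.search(text)) for text in samples}

for text, matched in results.items():
    print(f"{text!r}: {matched}")
```

The underscore and digit cases are the ones most likely to surprise: "red_flag" and "red2" do not contain "red" as a whole word as far as \b is concerned.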