Backreferences and substitution
This topic is more advanced, but it is sometimes useful.
You might want to reuse the content of a group captured earlier in the same pattern. For example, you might require that a word matched earlier appears again later in the string. Backreferences do this: write "\\1" (or, as a raw string, r"\1") to refer to the content of the first captured group.
### Note the r in front of the string - this is a raw string!
>>> pattern = r"([A-Za-z]+) went to ([A-Za-z]+)\. The cuisine of \2 fascinated \1\."
>>> string1 = "Josiah went to Japan. The cuisine of Japan fascinated Josiah."
>>> match = re.match(pattern, string1)
>>> match.group()
?????
>>> match.groups()
?????
>>> string2 = "Harry went to Greece. The cuisine of Greece fascinated William."
>>> match = re.match(pattern, string2)
>>> print(match)
None
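As a further illustration of the same idea, here is a minimal sketch that uses a backreference to find a word typed twice in a row. The name find_doubled_words and the sample sentence are made up for this example and are not part of the tutorial.

```python
import re

# \b(\w+) captures one word; \s+\1\b requires the same word to follow it.
doubled_word = re.compile(r"\b(\w+)\s+\1\b")

def find_doubled_words(text):
    """Return each word that appears twice in a row in text."""
    return [m.group(1) for m in doubled_word.finditer(text)]

sample = "This is is a test of the the pattern."
print(find_doubled_words(sample))  # ['is', 'the']
```

The \b anchors matter here: without them, the "is" at the end of "This" could pair up with the following "is".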
In Python, you can also use named groups, written (?P<name>...), and refer to the content of a named group with (?P=name):
>>> pattern = (r"(?P<person>[A-Za-z]+) went to (?P<place>[A-Za-z]+)\. "
...            r"The cuisine of (?P=place) fascinated (?P=person)\.")
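A short sketch of how the named groups might then be read back from a match. The sentences about Maria are invented for illustration; group() with a name and groupdict() are standard methods of match objects.

```python
import re

pattern = re.compile(
    r"(?P<person>[A-Za-z]+) went to (?P<place>[A-Za-z]+)\. "
    r"The cuisine of (?P=place) fascinated (?P=person)\."
)

match = pattern.match("Maria went to Peru. The cuisine of Peru fascinated Maria.")
person = match.group("person")   # look a group up by name
place = match.group("place")
groups = match.groupdict()       # all named groups as a dict
print(person, place, groups)

# The backreference (?P=place) must repeat "Peru", so this does not match.
mismatch = pattern.match("Maria went to Peru. The cuisine of Chile fascinated Maria.")
print(mismatch)  # None
```

Named groups are still numbered, so match.group(1) returns the same value as match.group("person").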
Regular expression substitution
The true power of backreferencing can be seen when you need to find and replace a string.
Let’s say you want to turn the section headers in your LaTeX document into chapters (perhaps you are converting your paper into a book?).
You can replace all instances of section with chapter while keeping the original header titles by using backreferences. (We’re omitting the backslashes from LaTeX for simplicity.)
The function re.sub(pattern, replacement, string) or the method pattern.sub(replacement, string) of a Pattern object can be used for this. It is similar to the str.replace() method, except that you can also search for substrings using regular expressions.
>>> pattern = r"section{([^}]*)}"
>>> replacement = r"chapter{\1}"
>>> string = "section{Introduction} section{Literature review}"
>>> re.sub(pattern, replacement, string)
'chapter{Introduction} chapter{Literature review}'
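The example above leaves out the LaTeX backslashes. As a hedged sketch of what the same substitution might look like with them included, using a compiled Pattern object: the variable names are chosen for this example only. A literal backslash has to be written as \\ in both the pattern and the replacement, even inside raw strings.

```python
import re

# \\ matches a literal backslash; \{ and \} match literal braces.
section_re = re.compile(r"\\section\{([^}]*)\}")

# In the replacement, \\ produces one backslash and \1 inserts the title.
replacement = r"\\chapter{\1}"

latex = r"\section{Introduction} \section{Literature review}"

converted = section_re.sub(replacement, latex)          # method on a Pattern
same_result = re.sub(section_re.pattern, replacement, latex)  # module function
print(converted)
```

Both calls produce the same string; compiling the pattern first is mostly useful when the same pattern is applied many times.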
As you might expect, you can also use named groups for this. Use \g<name> to refer to a named group from the original pattern. \g<1> works too, and is equivalent to \1.
>>> pattern = r"section{(?P<title>[^}]*)}"
>>> string = "section{Introduction} section{Literature review}"
>>> re.sub(pattern, r"chapter{\1}", string)
?????
>>> re.sub(pattern, r"chapter{\g<1>}", string)
?????
>>> re.sub(pattern, r"chapter{\g<title>}", string)
?????
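Two situations where the \g<...> form is handy, sketched with made-up data (the date string and the "5x 7x" input are illustrative only). The first reorders named groups; the second shows that \g<1> lets a digit follow a group reference without ambiguity.

```python
import re

# Reorder the parts of an ISO-style date using named groups.
date_re = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
reordered = date_re.sub(r"\g<day>/\g<month>/\g<year>", "Released on 2021-03-15.")
print(reordered)  # Released on 15/03/2021.

# \g<1>0 means "group 1, then a literal 0".
# Writing \10 instead would be read as a reference to group 10.
padded = re.sub(r"(\d)x", r"\g<1>0", "5x 7x")
print(padded)  # 50 70
```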
There is also a re.subn() function, which performs the same substitution but returns a tuple containing the new string and the number of substitutions made.
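A minimal sketch of re.subn(), reusing the section-to-chapter example from above. The result is unpacked into two variables whose names are chosen for this example.

```python
import re

text = "section{Introduction} section{Literature review}"

# subn returns (new_string, number_of_substitutions_made).
new_text, count = re.subn(r"section{([^}]*)}", r"chapter{\1}", text)

print(new_text)  # chapter{Introduction} chapter{Literature review}
print(count)     # 2
```

The count can be useful for checking that a replacement touched as many places as you expected.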