Advanced Lesson 1
Regular Expressions
Chapter 1: Regular expressions?
Mmmbop!
Below are some lyrics from the 1997 song “MMMBop” by the band Hanson. If you are too young to have heard this song, go check it out on YouTube! And if you want to take a trip down memory lane, go check it out on YouTube too!
lyrics = """Mmmbop, ba duba dop
Ba du bop, ba duba dop
Ba du bop, ba duba dop
Ba du, yeah-e-yeah
Mmmbop, ba duba dop
Ba du bop, ba duba dop
Ba du bop, ba duba dop
Ba du, yeah-e-yeah
Said oh yeah
In an mmmbop they're gone
Yeah yeah
"""
Now, let’s say you need to search for the word “bop” in the text. How would you do this in Python? I’ll give you a few seconds to think…
Any solutions? You might use the str.find() or str.index() methods for this (how do these two methods differ?).
start = lyrics.find("bop")
if start >= 0:
    print(f"bop found between {start} to {start+3}")
else:
    print("Cannot find bop")
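If you want to check your answer to that question, here is a small sketch. The string `sample` and the search word "zap" are made up for illustration; they are not part of the lesson's data.

```python
# A short made-up string, just for this comparison.
sample = "Mmmbop, ba duba dop"

# Both methods return the start position when the substring exists.
found_at = sample.find("bop")

# str.find() signals "not found" by returning -1 ...
missing_at = sample.find("zap")

# ... while str.index() raises a ValueError instead.
try:
    sample.index("zap")
    index_raised = False
except ValueError:
    index_raised = True

print(found_at, missing_at, index_raised)
```

This is why the code above tests `start >= 0`: with str.find(), a missing word shows up as -1 rather than as an exception.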
What if you now need to search for either "bop" or "dop"? You might write something like this.
start = lyrics.find("bop")
if start >= 0:
    print(f"bop found between {start} to {start+3}")
else:
    # Only look for "dop" when "bop" is absent.
    start = lyrics.find("dop")
    if start >= 0:
        print(f"dop found between {start} to {start+3}")
    else:
        print("Cannot find bop or dop")
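One thing to notice about this approach: it only looks for "dop" when "bop" is missing, so it reports "bop" even if a "dop" appears earlier in the text. Below is a minimal sketch of one way to get the earliest of several words instead. The helper name `find_first` and the string `sample` are hypothetical, chosen only for this example.

```python
def find_first(text, words):
    """Return (start, word) for the earliest occurrence of any word, or None."""
    # Look up every word, then keep only the ones that were actually found.
    hits = [(text.find(word), word) for word in words]
    hits = [(start, word) for start, word in hits if start >= 0]
    # Tuples compare by their first element, so min() picks the smallest start.
    return min(hits) if hits else None


# In this made-up string, "dop" comes before "bop".
sample = "Ba duba dop, ba du bop"
result = find_first(sample, ["bop", "dop"])
print(result)
```

It works, but the code keeps growing with every new requirement, which is the point of this chapter.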
Or you might be smarter, and exploit the fact that "bop" and "dop" have the same ending. So you might slide an overlapping three-character window over the text and test each window in turn.
found = False
for (index, character) in enumerate(lyrics):
    # Take the three characters starting at this position.
    trigram = lyrics[index:index+3]
    if ((trigram.startswith("b") or trigram.startswith("d")) and
            trigram.endswith("op")):
        print(f"bop or dop found between {index} to {index+3}")
        found = True
        break
if not found:
    print("Cannot find bop or dop")
With regular expressions
Regular expressions make this simpler. You simply need to say “find me (b|d)op”! That’s it! 👍👍👍
>>> import re
>>> match = re.search("(b|d)op", lyrics)
>>> print(match)
<re.Match object; span=(3, 6), match='bop'>
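The match object holds more than its printed form suggests. Here is a rough sketch of a few things you can ask it, plus one way to find every match rather than just the first. The string `sample` is a made-up one-line excerpt, and all the variable names are only for illustration.

```python
import re

# A made-up one-line excerpt, so the positions are easy to count.
sample = "Mmmbop, ba duba dop"

match = re.search(r"(b|d)op", sample)
first_span = match.span()       # (start, end) of the first match
first_text = match.group(0)     # the whole matched text
first_letter = match.group(1)   # what the (b|d) group captured

# re.finditer() yields one match object per non-overlapping match.
all_matches = [(m.start(), m.group(0))
               for m in re.finditer(r"(b|d)op", sample)]

# The character class [bd] matches the same single letters as (b|d).
same_span = re.search(r"[bd]op", sample).span()

# When nothing matches, re.search() returns None.
no_match = re.search(r"(b|d)op", "yeah-e-yeah")

print(first_span, first_text, first_letter)
print(all_matches)
print(same_span, no_match)
```

Because re.search() returns None when there is no match, it is worth checking the result before calling methods such as span() on it.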