This is an archived version of the course. Please find the latest version of the course on the main webpage.

Chapter 1: Regular expressions?

Mmmbop!

Josiah Wang

Below are the lyrics of the song “MMMBop” by the band Hanson from 1997. If you are too young to have heard this song, go check it out on YouTube! And if you want to take a trip down memory lane, go check it out on YouTube too!

lyrics = """Mmmbop, ba duba dop
Ba du bop, ba duba dop
Ba du bop, ba duba dop
Ba du, yeah-e-yeah
Mmmbop, ba duba dop
Ba du bop, ba duba dop
Ba du bop, ba duba dop 
Ba du, yeah-e-yeah
Said oh yeah
In an mmmbop they're gone
Yeah yeah
"""

Now, let’s say you need to search for the word “bop” in the text. How would you do this in Python? I’ll give you a few seconds to think…

Any solutions? You might use the str.find() or str.index() methods for this (how do these two methods differ?)

start = lyrics.find("bop")
if start >= 0:
    print(f"bop found between {start} and {start+3}")
else:
    print("Cannot find bop")
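On the question of how str.find() and str.index() differ: both return the position of the first occurrence, but they behave differently when the substring is missing. Here is a small sketch using a shortened sample string (the variable name text is just for illustration):

```python
text = "Mmmbop, ba duba dop"

# str.find() returns -1 when the substring is absent
print(text.find("bop"))    # 3
print(text.find("xyz"))    # -1

# str.index() returns the same position on success...
print(text.index("bop"))   # 3

# ...but raises ValueError when the substring is absent
try:
    text.index("xyz")
except ValueError:
    print("index() raised ValueError")
```

This is why the snippet above checks start >= 0 after calling find(); with index() you would wrap the call in a try/except instead.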

What if you now need to search for either "bop" or "dop"? You might write something like this.

start = lyrics.find("bop")
if start >= 0:
    print(f"bop found between {start} and {start+3}")
else:
    # Only look for "dop" if "bop" was not found anywhere
    start = lyrics.find("dop")
    if start >= 0:
        print(f"dop found between {start} and {start+3}")
    else:
        print("Cannot find bop or dop")
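Note that the nested version reports "bop" whenever it occurs anywhere in the text, even if a "dop" appears earlier. If you wanted whichever word comes first, one possible sketch is below. The helper find_first and the variable sample are hypothetical names made up for this illustration, not part of the course code:

```python
def find_first(text, words):
    """Return (start, word) for the earliest occurrence of any word, or None.

    find_first is a hypothetical helper, for illustration only.
    """
    best = None
    for word in words:
        start = text.find(word)
        # Keep this word only if it was found and occurs earlier than the best so far
        if start >= 0 and (best is None or start < best[0]):
            best = (start, word)
    return best


sample = "Ba duba dop, ba du bop"
result = find_first(sample, ["bop", "dop"])
if result is not None:
    start, word = result
    print(f"{word} found between {start} and {start + len(word)}")
else:
    print("Cannot find bop or dop")
```

Here "dop" is reported because it appears before "bop" in the sample string. It works, but it is already a fair amount of code for a very small task.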

Or you might be smarter and exploit the fact that "bop" and "dop" share the same ending. So you might slide an overlapping three-character window over the text and test each trigram.

found = False
# Slide a three-character window over the text, one position at a time
for index in range(len(lyrics) - 2):
    trigram = lyrics[index:index+3]
    # Match "bop" or "dop": starts with "b" or "d" and ends with "op"
    if trigram.startswith(("b", "d")) and trigram.endswith("op"):
        print(f"bop or dop found between {index} and {index+3}")
        found = True
        break

if not found:
    print("Cannot find bop or dop")

With regular expressions

Regular expressions make this simpler. You simply need to say “find me (b|d)op”! That’s it! 👍👍👍

>>> import re
>>> match = re.search("(b|d)op", lyrics)
>>> print(match)
<re.Match object; span=(3, 6), match='bop'>
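To reproduce the messages printed by the earlier snippets, you can read the position and the matched text off the match object. Below is a minimal sketch, assuming a shortened two-line copy of the lyrics so that it runs on its own. It also uses re.finditer() to list every occurrence rather than just the first:

```python
import re

# Shortened copy of the lyrics so the example is self-contained
lyrics = "Mmmbop, ba duba dop\nBa du bop, ba duba dop\n"

# re.search() returns a match object for the first occurrence, or None
match = re.search(r"(b|d)op", lyrics)
if match:
    print(f"{match.group()} found between {match.start()} and {match.end()}")
else:
    print("Cannot find bop or dop")

# re.finditer() yields one match object per non-overlapping occurrence
for m in re.finditer(r"(b|d)op", lyrics):
    print(m.group(), m.span())
```

The pattern is written as a raw string (the r prefix), which is a common habit for regular expressions in Python because it avoids surprises once backslashes appear in the pattern.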