This is an archived version of the course. Please find the latest version of the course on the main webpage.

Chapter 1: Regular expressions?

Mmmbop!

Josiah Wang

Here is another example of how you can match strings concisely.

We’ll take the lyrics of “MMMBop” again.

lyrics = """Mmmbop, ba duba dop
Ba du bop, ba duba dop
Ba du bop, ba duba dop
Ba du, yeah-e-yeah
Mmmbop, ba duba dop
Ba du bop, ba duba dop
Ba du bop, ba duba dop 
Ba du, yeah-e-yeah
Said oh yeah
In an mmmbop they're gone
Yeah yeah
"""

Let’s say you now want to match either "du" or "duba", and return the first match. So for the lyrics above, you should match "duba" and return the indices between 11 and 15.

The ‘sliding window’ approach from earlier works well here, as long as you check the longer string first at each position. Otherwise you would return just "du" (indices 11 to 13) when it is in fact part of "duba" (indices 11 to 15).

found = False
for index in range(len(lyrics)):
    # Get four characters from current index
    quadgram = lyrics[index:index+4]

    # Check "duba" first
    if quadgram == "duba":
        print(f"duba found between {index} and {index+4}")
        found = True
        break

    # If "duba" is not matched, then could the first two characters be "du"?
    if quadgram[:2] == "du":
        print(f"du found between {index} and {index+2}")
        found = True
        break

if not found:
    print("Cannot find du or duba")
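
The same longest-first check can be wrapped in a small helper so that it works for any set of candidate strings. This is only a minimal sketch: the function name `find_first` and its return format are made up for illustration, and the shortened `sample` string is there so the snippet runs on its own.

```python
def find_first(text, candidates):
    """Return (start, end, matched) for the earliest match of any candidate.

    At each position, longer candidates are tried first, so "duba" wins
    over "du". Returns None if no candidate occurs in the text.
    """
    # Try longer strings first so a prefix like "du" never shadows "duba"
    ordered = sorted(candidates, key=len, reverse=True)
    for index in range(len(text)):
        for candidate in ordered:
            # startswith with a start offset compares in place, no slicing needed
            if text.startswith(candidate, index):
                return index, index + len(candidate), candidate
    return None


sample = "Mmmbop, ba duba dop\nBa du bop, ba duba dop\n"
result = find_first(sample, ["du", "duba"])
print(result)  # (11, 15, 'duba')
```

Sorting by length is what replaces the hand-written "check duba first" ordering in the loop above.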

With regular expressions

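Regular expressions make this much easier and more concise. You simply need to say “find me du(ba)?”. The ? makes the group before it optional, so this regular expression means “du, optionally followed by ba”. Because ? is greedy, the ba is included whenever it is present, so "duba" is preferred over a bare "du".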

>>> import re
>>> match = re.search("du(ba)?", lyrics)
>>> print(match)
<re.Match object; span=(11, 15), match='duba'>
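
The match object carries the same details that the sliding-window version printed by hand. Here is a short sketch of how you might read them back, using a shortened `lyrics` string so that the snippet runs on its own:

```python
import re

lyrics = "Mmmbop, ba duba dop\nBa du bop, ba duba dop\n"

match = re.search(r"du(ba)?", lyrics)
if match is None:
    # re.search returns None when the pattern does not occur anywhere
    print("Cannot find du or duba")
else:
    start, end = match.span()
    print(f"{match.group()} found between {start} and {end}")
    # prints: duba found between 11 and 15

# When only "du" is present, the optional group did not take part in the match
short = re.search(r"du(ba)?", "Ba du bop")
print(short.group(), short.group(1))  # du None
```

Always check for None before calling methods on the result, since a failed search does not return an empty match object.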