Chapter 7: Groups

re.split()

Josiah Wang

You should be familiar with the str.split() method by now.

The re module also provides a split() function (or a .split() method on the Pattern object) that allows you to split a string based on a given regular expression. It works like str.split(), except that the delimiter is a regular expression rather than a fixed string.
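As a minimal sketch of the two forms mentioned above, the snippet below splits the same text once with the module-level function and once with a compiled Pattern object. The delimiter pattern and the sample text are made up for illustration.

```python
import re

# Hypothetical delimiter: a comma or semicolon with optional spaces around it.
delimiter = r"\s*[,;]\s*"
text = "apple, banana;cherry ;  date"

# Module-level function: the pattern is passed as the first argument.
fields_from_function = re.split(delimiter, text)

# Pattern object: compile once, then call its .split() method.
compiled = re.compile(delimiter)
fields_from_method = compiled.split(text)

print(fields_from_function)  # ['apple', 'banana', 'cherry', 'date']
print(fields_from_method)    # ['apple', 'banana', 'cherry', 'date']
```

Compiling the pattern first is mostly a convenience when the same delimiter is reused many times.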

Let’s say we want to split our string wherever non-alphanumeric characters occur. We can use the regular expression "\W+", which matches one or more non-word characters (anything other than letters, digits and the underscore).

>>> import re
>>> pattern = r"\W+"
>>> string = "doe, a deer, a female deer."
>>> re.split(pattern, string)
['doe', 'a', 'deer', 'a', 'female', 'deer', '']

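In the code above, the string is split at runs of non-word characters: the commas, the spaces and the full stop. Note the empty string at the end of the resulting list. It appears because the string ends with a delimiter (the full stop), so the piece after it is empty.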
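If the empty strings are not wanted, one common approach (a sketch, not the only way) is to filter them out after splitting. The variable names parts and words are chosen here for illustration.

```python
import re

string = "doe, a deer, a female deer."

# Splitting leaves an empty string after the final full stop.
parts = re.split(r"\W+", string)

# Keep only the non-empty pieces.
words = [part for part in parts if part]
print(words)  # ['doe', 'a', 'deer', 'a', 'female', 'deer']

# A delimiter at the start of the string produces a leading empty string too.
print(re.split(r"\W+", "...doe"))  # ['', 'doe']
```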

You can also limit the maximum number of splits with the maxsplit argument. Let’s say we want to split the string at no more than 3 points.

>>> re.split(pattern, string, maxsplit=3)
['doe', 'a', 'deer', 'a female deer.']
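Limiting the number of splits is handy when only the first delimiter is meaningful. The sketch below uses a made-up "key: value" line where the value itself contains colons. The sample text and variable names are assumptions for illustration.

```python
import re

# Hypothetical header line: only the first colon separates key from value.
line = "title: re.split(): a tutorial"

# maxsplit=1 stops after the first match, leaving the rest untouched.
key, value = re.split(r"\s*:\s*", line, maxsplit=1)
print(key)    # title
print(value)  # re.split(): a tutorial

# The default, maxsplit=0, means "no limit".
print(re.split(r"\s*:\s*", line, maxsplit=0))  # ['title', 're.split()', 'a tutorial']
```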

If you need to keep the delimiters as well, wrap the delimiter pattern in a capturing group. The matched delimiters are then included in the resulting list:

>>> pattern2 = r"(\W+)"
>>> re.split(pattern2, string)
['doe', ', ', 'a', ' ', 'deer', ', ', 'a', ' ', 'female', ' ', 'deer', '.', '']
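With a single capturing group, the resulting list alternates between text and delimiter, so nothing from the original string is lost. The sketch below checks this by joining the pieces back together and then separating the two kinds of item by index. The variable names are chosen here for illustration.

```python
import re

string = "doe, a deer, a female deer."
pieces = re.split(r"(\W+)", string)

# Joining all pieces reproduces the original string exactly.
rebuilt = "".join(pieces)
print(rebuilt == string)  # True

# Even indices hold the text between delimiters, odd indices the delimiters.
words = pieces[0::2]
delimiters = pieces[1::2]
print(words)       # ['doe', 'a', 'deer', 'a', 'female', 'deer', '']
print(delimiters)  # [', ', ' ', ', ', ' ', ' ', '.']
```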