CO558: Python Programming | Department of Computing

Now, let’s put everything you have learnt about regular expressions into practice.

You are given some challenging “real-world” problems with noisy texts, and your task is to use regular expressions to extract the desired information.

Some of these may end up being pretty hacky, but that is how it is with using regular expressions in practice.

Many thanks to the following people for contributing to these exercises: Josiah Wang

Task 1: I love experiments

Download these two files:

These are results from scikit-learn’s classification_report() for two classifiers evaluated on different datasets.

Tasks:

Extract the classification accuracy from each file using regular expressions.

Extract the f1-score per class for each file using regular expressions. Print the f1-scores in the following format:

f1-score for class 0 = 0.88
f1-score for class 1 = 0.82
f1-score for class 2 = 0.87
f1-score for class 3 = 0.82
f1-score for class 4 = 0.83

Your code should be general enough to read the results for any number of classes, so it should process both files without you having to specify the number of classes in each dataset.
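As a starting point, here is a minimal sketch of one possible approach. It does not use the course files: `sample_report` is a hypothetical string that imitates the usual layout of classification_report() output, and its numbers are invented. The variable names and the exact patterns are assumptions too, so you may need to adapt them to the real files.

```python
import re

# Hypothetical sample imitating the layout of classification_report();
# the numbers are invented and are NOT taken from the course files.
sample_report = """\
              precision    recall  f1-score   support

           0       0.90      0.80      0.85        10
           1       0.70      0.84      0.76         5

    accuracy                           0.81        15
   macro avg       0.80      0.82      0.81        15
weighted avg       0.83      0.81      0.82        15
"""

# Accuracy row: the word "accuracy", one score, then the support count.
accuracy_match = re.search(
    r"^\s*accuracy\s+(\d+\.\d+)\s+\d+\s*$", sample_report, re.MULTILINE
)
accuracy = float(accuracy_match.group(1))

# Per-class rows: an integer label, then precision, recall, f1-score, support.
# Only the label and the f1-score are captured.
class_row = re.compile(
    r"^\s*(\d+)\s+\d+\.\d+\s+\d+\.\d+\s+(\d+\.\d+)\s+\d+\s*$", re.MULTILINE
)
f1_scores = [(label, float(f1)) for label, f1 in class_row.findall(sample_report)]

lines = [f"f1-score for class {label} = {f1:.2f}" for label, f1 in f1_scores]
print(f"accuracy = {accuracy:.2f}")
for line in lines:
    print(line)
```

Because the per-class pattern requires the row to start with an integer label, it skips the "macro avg" and "weighted avg" rows and works for any number of classes.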

Task 2: Grocery shopping!

Here is a (modified) receipt from Josiah’s grocery shopping trip last weekend.

receipt.txt

Tasks:

Extract each item name and its price, and represent each pair as a tuple. You should end up with a list of tuples:
- [(W MILK 4 PT, 1.09), (CANNED FRUIT, 1.00), (FRESH CREAM, 1.85), (BLUEBERRIES, 1.56), (GARLIC, 0.78), (SALAD, 0.60), (ICE LOLLIES, 2.00), (FROZEN VEG, 1.10), (BM MADELEINES, 2.00)]
Extract the total listed on the receipt.
Compute the total yourself by adding up the prices of all items.
Check that the total printed on the receipt matches the total you computed.
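The steps above could be sketched roughly as follows. The layout of `sample_receipt` is an assumption made for illustration (upper-case item name, spaces, then a price prefixed with £); the real receipt.txt is noisier and may well be laid out differently, so treat the patterns as a template rather than a solution.

```python
import math
import re

# Hypothetical receipt layout, invented for illustration only.
sample_receipt = """\
CORNER SHOP (hypothetical)
W MILK 4 PT        £1.09
GARLIC             £0.78
SALAD              £0.60
TOTAL              £2.47
CASH               £5.00
CHANGE             £2.53
"""

# Item rows: an upper-case name followed by a price, excluding summary rows.
item_re = re.compile(
    r"^(?!TOTAL|CASH|CHANGE)([A-Z][A-Z0-9 ]*?)[ \t]+£(\d+\.\d{2})$",
    re.MULTILINE,
)
items = [(name, float(price)) for name, price in item_re.findall(sample_receipt)]

# The total printed on the receipt.
total_match = re.search(r"^TOTAL[ \t]+£(\d+\.\d{2})$", sample_receipt, re.MULTILINE)
listed_total = float(total_match.group(1))

# The total recomputed from the extracted prices.
computed_total = sum(price for _, price in items)

# Compare with a small tolerance, since floats cannot represent every price exactly.
totals_match = math.isclose(computed_total, listed_total, abs_tol=0.005)

print(items)
print(f"listed = {listed_total:.2f}, computed = {computed_total:.2f}, match = {totals_match}")
```

The tolerance in `math.isclose` is a design choice: summing floats such as 1.09 and 0.78 can produce tiny rounding errors, so testing with `==` may fail even when the totals agree to the penny.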

Task 3: Harvesting emails

You are given a made-up email.

email.txt

Using regular expressions, your task is to generate a list of students and their emails as listed in the “To:” field.

Don Garrett, don.garrett@food.bz
Daryl Sears, daryl.sears@defg.com
Szymon Walmsley, szymon.walmsley@ijk.ac.uk 
Frances Hurst, frances.hurst@silly.org
Teegan Moses, teegan.moses@silly.org
...
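One way to approach this is in two steps: first isolate the “To:” field, then pull name and address pairs out of it. The sketch below assumes a hypothetical header in which each recipient is written as `Name <address>` and the whole field sits on one line. The names, addresses and layout of `sample_email` are invented, and email.txt may differ (for example, the field may wrap over several lines).

```python
import re

# Hypothetical email, invented for illustration only.
sample_email = """\
From: Tutor <tutor@example.org>
To: Ada Example <ada.example@example.org>, Bob Sample <bob.sample@example.ac.uk>
Cc: Carol Copy <carol.copy@example.com>
Subject: Welcome

Hello all
"""

# Step 1: keep only the contents of the "To:" line, so Cc and From are ignored.
to_field = re.search(r"^To:[ \t]*(.+)$", sample_email, re.MULTILINE).group(1)

# Step 2: each recipient is a name followed by an address in angle brackets.
recipient_re = re.compile(r"([A-Za-z][^,<>]*?)\s*<([^<>@\s]+@[^<>@\s]+)>")
students = recipient_re.findall(to_field)

for name, email in students:
    print(f"{name}, {email}")
```

Restricting the search to the “To:” field first keeps the second pattern simple, since it never has to tell recipients apart from senders or addresses in the message body.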

Task 4: Screen scraper

Here is the HTML source code of the department’s academic staff page.

staff.txt

Using regular expressions, extract the name, email and webpage URL of each staff member. Not every staff member has a webpage, so your regular expressions should be able to handle a missing URL. This process is called screen scraping.

Print out the list of staff in the following format.

Dr Dalal Alrajeh, dalal.alrajeh04@imperial.ac.uk, https://www.doc.ic.ac.uk/~da04/
Dr Francesco Belardinelli, francesco.belardinelli@imperial.ac.uk, https://www.doc.ic.ac.uk/~fbelard/
Dr Mario Berta, mberta@imperial.ac.uk, http://marioberta.info/
Dr Cristian Cadar, c.cadar@imperial.ac.uk, https://www.doc.ic.ac.uk/~cristic/
Dr Giuliano Casale, g.casale@imperial.ac.uk, http://wp.doc.ic.ac.uk/gcasale/
Dr Antoine Cully, a.cully@imperial.ac.uk, None
Prof. Andrew Davison, a.davison@imperial.ac.uk, https://www.doc.ic.ac.uk/~ajd/index.html
Dr Yves-Alexandre de Montjoye, demontjoye@imperial.ac.uk, http://demontjoye.com/index.html
...
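A possible way to handle the optional URL is to cut the page into one block per person first, and then search each block separately, so that a missing link simply yields None. The markup in `sample_html` below is hypothetical (the tag names, class names and people are invented); the real staff.txt will use different markup, so the patterns need adapting.

```python
import re

# Hypothetical markup, invented for illustration only.
sample_html = """\
<div class="person">
  <span class="name">Dr Ada Example</span>
  <a href="mailto:ada.example@example.org">Email</a>
  <a href="https://example.org/~ada/">Website</a>
</div>
<div class="person">
  <span class="name">Prof. Bob Sample</span>
  <a href="mailto:bob.sample@example.org">Email</a>
</div>
"""

# One block per person; DOTALL lets "." span the line breaks inside a block.
block_re = re.compile(r'<div class="person">(.*?)</div>', re.DOTALL)
name_re = re.compile(r'<span class="name">([^<]+)</span>')
email_re = re.compile(r'href="mailto:([^"]+)"')
url_re = re.compile(r'href="(https?://[^"]+)"')

staff = []
for block in block_re.findall(sample_html):
    name = name_re.search(block).group(1).strip()
    email = email_re.search(block).group(1)
    url_match = url_re.search(block)
    # Not everyone has a webpage, so fall back to None when no link is found.
    url = url_match.group(1) if url_match else None
    staff.append((name, email, url))

for name, email, url in staff:
    print(f"{name}, {email}, {url}")
```

Searching block by block also prevents a URL belonging to the next person from being attached to someone who has none, which can happen with a single pattern applied to the whole page.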

Note that this is not the best way to perform screen scraping in reality, as regular expressions written against raw HTML are brittle and break easily when the markup changes. In practice, it is better to use an HTML parser to parse the source code into a DOM tree structure first, and then extract the information that you need from the tree. That is not the aim of this exercise though! 😊

And congratulations! You’ve now mastered regular expressions! Use your newfound skills wisely!

Python Programming

Module 14

Real world application
