Real-world application
Now, let’s put everything you have learnt about regular expressions into practice.
You are given some challenging “real-world” problems with noisy texts, and your task is to use regular expressions to extract the desired information.
Some of these may end up being pretty hacky, but that is how it is with using regular expressions in practice.
Many thanks to the following people for contributing to these exercises: Josiah Wang
Task 1: I love experiments
Download these two files:
These are results from scikit-learn’s classification_report() for two classifiers evaluated on different datasets.
Tasks:
- Extract the classification accuracy from each file using regular expressions.
- Extract the f1-score per class for each file using regular expressions. Print the f1-scores in the following format:
f1-score for class 0 = 0.88
f1-score for class 1 = 0.82
f1-score for class 2 = 0.87
f1-score for class 3 = 0.82
f1-score for class 4 = 0.83
Your code should be general enough to read the results for any number of classes, so it should process both files without you having to specify the number of classes in each dataset.
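One possible approach to Task 1 is sketched below. It assumes the files contain the plain-text table that classification_report() prints, with integer class labels. The sample_report string, its numbers, and the helper names extract_accuracy and extract_f1_scores are made up for illustration and are not the contents of the exercise files.

```python
import re

# Made-up report in the layout printed by classification_report();
# the numbers are illustrative only, not taken from the exercise files.
sample_report = """\
              precision    recall  f1-score   support

           0       0.90      0.86      0.88        50
           1       0.79      0.85      0.82        50
           2       0.88      0.86      0.87        50

    accuracy                           0.86       150
   macro avg       0.86      0.86      0.86       150
weighted avg       0.86      0.86      0.86       150
"""


def extract_accuracy(report):
    """Return the accuracy as a float, or None if no accuracy row is found."""
    match = re.search(r"^[ \t]*accuracy[ \t]+(\d+\.\d+)", report, flags=re.MULTILINE)
    return float(match.group(1)) if match else None


def extract_f1_scores(report):
    """Return a list of (class_label, f1_score) string pairs, one per class row."""
    # A class row is: integer label, precision, recall, f1-score, support.
    # Rows such as "macro avg" start with letters, so they do not match.
    row = (r"^[ \t]*(\d+)[ \t]+\d+\.\d+[ \t]+\d+\.\d+"
           r"[ \t]+(\d+\.\d+)[ \t]+\d+[ \t]*$")
    return re.findall(row, report, flags=re.MULTILINE)


accuracy = extract_accuracy(sample_report)
f1_scores = extract_f1_scores(sample_report)

print(f"accuracy = {accuracy}")
for label, f1 in f1_scores:
    print(f"f1-score for class {label} = {f1}")
```

Because the pattern matches every row that begins with an integer label, the number of classes never has to be specified. Keeping the f1-score as a string preserves the two-decimal formatting when printing.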
Task 2: Grocery shopping!
Here is a (modified) receipt from Josiah’s grocery shopping trip last weekend.
Tasks:
- Extract each item name and its corresponding price, and represent each pair as a tuple. You should end up with a list of tuples:
[(W MILK 4 PT, 1.09), (CANNED FRUIT, 1.00), (FRESH CREAM, 1.85), (BLUEBERRIES, 1.56), (GARLIC, 0.78), (SALAD, 0.60), (ICE LOLLIES, 2.00), (FROZEN VEG, 1.10), (BM MADELEINES, 2.00)]
- Extract the total listed on the receipt.
- Compute the total by adding the prices of all items.
- Make sure that the total computed by the supermarket matches the total you computed.
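The steps above could be sketched as follows. The receipt layout here is an assumption (upper-case item name, some spaces, then a price prefixed with £, and a line starting with TOTAL); the real receipt is noisier and will likely need different patterns. The sample_receipt text and its total are made up for illustration.

```python
import math
import re

# Made-up receipt; the layout and the total are assumptions for illustration.
sample_receipt = """\
EXAMPLE STORE
W MILK 4 PT        £1.09
GARLIC             £0.78
SALAD              £0.60
3 ITEMS
TOTAL              £2.47
"""

# An item line: upper-case name, spaces, then a £ price with two decimals.
# The negative lookahead keeps the TOTAL line out of the item list.
ITEM_PATTERN = re.compile(r"^(?!TOTAL)([A-Z][A-Z0-9 ]*?) +£(\d+\.\d{2})$", re.MULTILINE)
TOTAL_PATTERN = re.compile(r"^TOTAL +£(\d+\.\d{2})$", re.MULTILINE)

# List of (name, price) tuples, with the price converted to a float.
items = [(name, float(price)) for name, price in ITEM_PATTERN.findall(sample_receipt)]

listed_total = float(TOTAL_PATTERN.search(sample_receipt).group(1))
computed_total = sum(price for _, price in items)

# Compare with a tolerance, since summing floats can introduce tiny errors.
totals_match = math.isclose(listed_total, computed_total, abs_tol=0.005)

print(items)
print(f"listed = {listed_total:.2f}, computed = {computed_total:.2f}, match = {totals_match}")
```

The tolerance-based comparison is a design choice: adding prices as floats may not give an exactly representable result, so a direct == check could fail even when the totals agree to the penny.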
Task 3: Harvesting emails
You are given a made-up email.
Using regular expressions, your task is to generate a list of students and their emails as listed in the “To:” field.
Don Garrett, don.garrett@food.bz
Daryl Sears, daryl.sears@defg.com
Szymon Walmsley, szymon.walmsley@ijk.ac.uk
Frances Hurst, frances.hurst@silly.org
Teegan Moses, teegan.moses@silly.org
...
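A minimal sketch of one way to approach this task follows. It assumes the recipients appear as Name <address> pairs separated by commas, and that a long “To:” header may continue on following lines that start with whitespace. The header layout, the sender, and the helper name extract_recipients are assumptions for illustration; the actual email may be formatted differently.

```python
import re

# Made-up message; only the header layout matters for this sketch.
sample_email = """\
From: Example Sender <sender@example.com>
To: Don Garrett <don.garrett@food.bz>, Daryl Sears <daryl.sears@defg.com>,
 Szymon Walmsley <szymon.walmsley@ijk.ac.uk>
Subject: Example

Body text mentioning other@example.com, which should be ignored.
"""


def extract_recipients(email_text):
    """Return (name, email) pairs found in the To: header only."""
    # Step 1: isolate the To: header, including continuation lines
    # (lines that begin with a space or tab).
    header = re.search(r"^To:(.*(?:\n[ \t]+.*)*)", email_text, flags=re.MULTILINE)
    if header is None:
        return []
    # Step 2: within the header, pair each name with the address in <...>.
    pair = r"([^<>,\s][^<>,]*?)\s*<([^<>\s]+@[^<>\s]+)>"
    return re.findall(pair, header.group(1))


recipients = extract_recipients(sample_email)
for name, address in recipients:
    print(f"{name}, {address}")
```

Isolating the header first keeps addresses from the “From:” field and the message body out of the result.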
Task 4: Screen scraper
Here is the HTML source code of the department’s academic staff page.
Using regular expressions, extract the name of each staff member together with their email address and webpage URL (if any; not all staff have a webpage, so your regular expressions should be able to handle a missing one). This process is called screen scraping.
Print out the list of staff in the following format.
Dr Dalal Alrajeh, dalal.alrajeh04@imperial.ac.uk, https://www.doc.ic.ac.uk/~da04/
Dr Francesco Belardinelli, francesco.belardinelli@imperial.ac.uk, https://www.doc.ic.ac.uk/~fbelard/
Dr Mario Berta, mberta@imperial.ac.uk, http://marioberta.info/
Dr Cristian Cadar, c.cadar@imperial.ac.uk, https://www.doc.ic.ac.uk/~cristic/
Dr Giuliano Casale, g.casale@imperial.ac.uk, http://wp.doc.ic.ac.uk/gcasale/
Dr Antoine Cully, a.cully@imperial.ac.uk, None
Prof. Andrew Davison, a.davison@imperial.ac.uk, https://www.doc.ic.ac.uk/~ajd/index.html
Dr Yves-Alexandre de Montjoye, demontjoye@imperial.ac.uk, http://demontjoye.com/index.html
...
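The optional webpage can be handled with an optional group, as in the sketch below. The markup in sample_html is invented for illustration (a span for the name, a mailto: link, and possibly a second link); the real staff page uses different HTML, so the pattern would need to be adapted to it.

```python
import re

# Invented markup; the real page's HTML structure will differ.
sample_html = """\
<div class="staff">
  <span class="name">Dr Mario Berta</span>
  <a href="mailto:mberta@imperial.ac.uk">Email</a>
  <a href="http://marioberta.info/">Website</a>
</div>
<div class="staff">
  <span class="name">Dr Antoine Cully</span>
  <a href="mailto:a.cully@imperial.ac.uk">Email</a>
</div>
"""

STAFF_PATTERN = re.compile(
    r'<span class="name">(?P<name>[^<]+)</span>\s*'
    r'<a href="mailto:(?P<email>[^"]+)">[^<]*</a>\s*'
    # The whole webpage link is optional; if it is absent,
    # group("url") returns None.
    r'(?:<a href="(?P<url>https?://[^"]+)">[^<]*</a>)?'
)

staff = [
    (m.group("name"), m.group("email"), m.group("url"))
    for m in STAFF_PATTERN.finditer(sample_html)
]

for name, email, url in staff:
    print(f"{name}, {email}, {url}")
```

An unmatched optional group yields None, which is what produces the None entry in the required output format.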
Note that this is not the best way to perform screen scraping in reality: HTML source can change or be malformed, which makes regular expressions written against it brittle and prone to errors. In practice, it is better to use an HTML parser to parse the source code into a DOM tree structure first, and then extract the information that you need from the tree. That is not the aim of this exercise though! 😊
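For comparison, here is a rough sketch of the parser-based route using Python’s built-in html.parser module. Note that HTMLParser is event-based rather than a DOM tree builder, so it only approximates the approach described above. The markup and the StaffParser class are invented for illustration.

```python
from html.parser import HTMLParser

# Invented markup; the real page's HTML structure will differ.
sample_html = """\
<div class="staff">
  <span class="name">Dr Mario Berta</span>
  <a href="mailto:mberta@imperial.ac.uk">Email</a>
  <a href="http://marioberta.info/">Website</a>
</div>
<div class="staff">
  <span class="name">Dr Antoine Cully</span>
  <a href="mailto:a.cully@imperial.ac.uk">Email</a>
</div>
"""


class StaffParser(HTMLParser):
    """Collect name, email and URL for each staff entry (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.staff = []        # one dict per staff member
        self._in_name = False  # True while inside <span class="name">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "name":
            # A name span starts a new staff entry.
            self.staff.append({"name": "", "email": None, "url": None})
            self._in_name = True
        elif tag == "a" and self.staff:
            href = attrs.get("href") or ""
            if href.startswith("mailto:"):
                self.staff[-1]["email"] = href[len("mailto:"):]
            elif href.startswith(("http://", "https://")):
                self.staff[-1]["url"] = href

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_name = False

    def handle_data(self, data):
        if self._in_name:
            self.staff[-1]["name"] += data


parser = StaffParser()
parser.feed(sample_html)
parser.close()

for person in parser.staff:
    print(f"{person['name']}, {person['email']}, {person['url']}")
```

The parser deals with attribute quoting and whitespace itself, so the extraction logic depends only on tag names and attributes rather than on the exact text of the source.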
And congratulations! You’ve now mastered regular expressions! Use your newfound skills wisely!