Advanced Lesson 1
Regular Expressions
Chapter 8: Real world applications
Challenge 4 - Screen scraper
You are given the HTML source code of the department’s academic staff page. Download staff.txt.
Using regular expressions, extract each staff member's name, email address and webpage URL. This process is called screen scraping.
Print the list of staff in the following format. Not every staff member has a webpage URL, so your regular expressions must cope with a missing URL: print None in its place.
Dr Dalal Alrajeh, dalal.alrajeh04@imperial.ac.uk, https://www.doc.ic.ac.uk/~da04/
Dr Wenjia Bai, w.bai@imperial.ac.uk, https://www.doc.ic.ac.uk/~wbai/
Dr Francesco Belardinelli, francesco.belardinelli@imperial.ac.uk, https://www.doc.ic.ac.uk/~fbelard/
Dr Mario Berta, mberta@imperial.ac.uk, http://marioberta.info/
Prof. Michael Bronstein, m.bronstein@imperial.ac.uk, http://www.cs.technion.ac.il/~mbron/
Dr Cristian Cadar, c.cadar@imperial.ac.uk, https://www.doc.ic.ac.uk/~cristic/
Dr Giuliano Casale, g.casale@imperial.ac.uk, http://wp.doc.ic.ac.uk/gcasale/
Dr Ronnie Clark, ronald.clark@imperial.ac.uk, https://www.imperial.ac.uk/people/ronald.clark
Dr Antoine Cully, a.cully@imperial.ac.uk, None
Prof. Andrew Davison, a.davison@imperial.ac.uk, https://www.doc.ic.ac.uk/~ajd/index.html
Dr Yves-Alexandre de Montjoye, demontjoye@imperial.ac.uk, http://demontjoye.com/index.html
...
Warning: This is slightly tedious and hacky work. Start small. Try to extract the staff name first, then the URL (this will end up being optional), then the email address. Keep the text that is common to every entry as literal characters, and replace whatever varies between entries with a pattern.
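As a rough sketch of this incremental approach, the snippet below works on an invented two-entry HTML fragment. The markup (the staff div, the name span and the link layout), the names and the addresses are all assumptions made for illustration; the real staff.txt is likely structured differently, so the patterns will need adapting.

```python
import re

# Hypothetical markup for illustration only, NOT the real contents of staff.txt.
SAMPLE_HTML = """
<div class="staff"><span class="name">Dr Ada Example</span>
  <a href="mailto:a.example@example.org">Email</a>
  <a href="https://example.org/~ada/">Webpage</a></div>
<div class="staff"><span class="name">Prof. Bob Sample</span>
  <a href="mailto:b.sample@example.org">Email</a></div>
"""

# Step 1: start small and match only the name.
NAME_ONLY = re.compile(r'<span class="name">([^<]+)</span>')

# Steps 2 and 3: keep the common markup literal, capture what varies,
# and wrap the webpage link in an optional non-capturing group "(?: ... )?".
STAFF = re.compile(
    r'<span class="name">(?P<name>[^<]+)</span>\s*'
    r'<a href="mailto:(?P<email>[^"]+)">[^<]*</a>\s*'
    r'(?:<a href="(?P<url>https?://[^"]+)">[^<]*</a>)?'
)


def extract_staff(html):
    """Return (name, email, url) tuples; url is None when the link is absent."""
    return [
        (m.group("name"), m.group("email"), m.group("url"))
        for m in STAFF.finditer(html)
    ]


for name, email, url in extract_staff(SAMPLE_HTML):
    # A group that did not take part in the match is None, which prints as "None".
    print(f"{name}, {email}, {url}")
```

Because the URL group sits inside an optional group, `m.group("url")` is `None` for entries without a webpage link, which gives the required output without any special-case code.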
Note that this is not the best way to perform screen scraping in practice: regular expressions written against raw HTML are brittle and tend to break as soon as the markup changes. It is better to use an HTML parser (such as BeautifulSoup) to parse the source code into a DOM tree structure first, and then extract the information that you need from the tree. That is not the aim of this exercise though! 😊
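For comparison only, here is a minimal sketch of the parser-based idea. It uses Python's standard-library html.parser module as a stand-in for BeautifulSoup, and it reacts to parse events rather than building a DOM tree, so it is a simplification of what is described above. The sample markup, class names and addresses are again invented for illustration and do not come from staff.txt.

```python
from html.parser import HTMLParser

# Hypothetical markup for illustration only, NOT the real contents of staff.txt.
SAMPLE_HTML = """
<div class="staff"><span class="name">Dr Ada Example</span>
  <a href="mailto:a.example@example.org">Email</a>
  <a href="https://example.org/~ada/">Webpage</a></div>
<div class="staff"><span class="name">Prof. Bob Sample</span>
  <a href="mailto:b.sample@example.org">Email</a></div>
"""


class StaffParser(HTMLParser):
    """Collects (name, email, url) tuples from the assumed markup.

    Simplification: assumes staff divs are not nested inside each other.
    """

    def __init__(self):
        super().__init__()
        self.staff = []          # finished entries
        self._current = None     # entry currently being built
        self._in_name = False    # True while inside the name span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "staff":
            self._current = {"name": "", "email": None, "url": None}
        elif self._current is None:
            return
        elif tag == "span" and attrs.get("class") == "name":
            self._in_name = True
        elif tag == "a":
            href = attrs.get("href") or ""
            if href.startswith("mailto:"):
                self._current["email"] = href[len("mailto:"):]
            elif href:
                self._current["url"] = href

    def handle_data(self, data):
        if self._in_name:
            self._current["name"] += data

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_name = False
        elif tag == "div" and self._current is not None:
            entry = self._current
            self.staff.append((entry["name"].strip(), entry["email"], entry["url"]))
            self._current = None


def parse_staff(html):
    parser = StaffParser()
    parser.feed(html)
    parser.close()
    return parser.staff


for name, email, url in parse_staff(SAMPLE_HTML):
    print(f"{name}, {email}, {url}")
```

The parser looks at tags and attributes rather than exact character sequences, so changes in whitespace, attribute quoting or link text do not affect it in the way they would affect a regular expression.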