Scraping HTML with Python

Have you ever had to write a script that scrapes data from an HTML page? Was the page’s HTML a horrible mess too? If so, you probably know how annoying and time-consuming it can be to write a script that reliably fetches data from such a mess.

I was recently asked to write a script that scrapes an ASP.NET page at work. It was a paginated list of people, and each person was linked to a page with more details about them. Naturally, the HTML code was also a horrible mess.

The usual first step in such cases is to see whether the XML functionality in your language of choice can make something workable out of the code. Since that only works with well-formed XHTML, it was not an option here. The next thing to try is regular expressions, but they are a huge pain to write and maintain for something like pulling specific data out of HTML.
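To illustrate why the XML route tends to fail, here is a minimal sketch using Python’s standard xml.etree.ElementTree module. The markup in messy_html and the helper parses_as_xml are invented for illustration and are not taken from the page I scraped.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag soup: an unquoted attribute plus unclosed <td> and <br> tags
messy_html = "<table><tr><td width=100>Name:<td>John Doe<br></tr></table>"


def parses_as_xml(markup):
    """Return True if a strict XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(parses_as_xml(messy_html))                         # False
print(parses_as_xml("<p>well-formed <b>XHTML</b></p>"))  # True
```

A strict parser gives up at the first malformed construct, so there is nothing partial to work with.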

Luckily, there’s a better way to do it in Python, using a library called BeautifulSoup. It’s definitely the best tool for this job I’ve seen.

No more regular expressions

I have used my fair share of regular expressions to get data out of HTML. Writing a regex that reliably pulls some data from HTML is often not a simple task, and sometimes you’ll need dozens of them to get all the data you want.

With BeautifulSoup, it’s really simple. You probably know which tag the data is in, and what’s near it. You can look up elements by their attributes or their parents, find text nodes based on what they contain, and so on.
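Those kinds of lookups might look roughly like this. It is only a sketch: it assumes the current beautifulsoup4 package (imported as bs4) and Python’s built-in html.parser, and the sample markup and variable names are made up.

```python
from bs4 import BeautifulSoup  # assumes the beautifulsoup4 package is installed

# Made-up markup for illustration only
sample_html = """
<div id="people">
  <p class="person"><b>Name:</b> <span class="value">John Doe</span></p>
  <p class="person"><b>Name:</b> <span class="value">Jane Roe</span></p>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Look up elements by attribute
values = [span.get_text() for span in soup.find_all("span", attrs={"class": "value"})]

# Find the first text node with the given contents, then walk up to its parents
label = soup.find(string="Name:")
enclosing_tag = label.parent.name        # name of the tag directly holding the text
enclosing_row = label.find_parent("p")   # nearest <p> ancestor of the text node

print(values)
print(enclosing_tag, enclosing_row.name)
```

Each lookup describes where the data lives in the document tree instead of what the surrounding characters look like.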

For example, if the page has a person’s name and a phone number listed in a table:

# BeautifulSoup 4 is imported from bs4; the older "from BeautifulSoup import ..." form is BeautifulSoup 3
from bs4 import BeautifulSoup
# Let's assume the_html contains the HTML code for the page
s = BeautifulSoup(the_html, "html.parser")
# Look up the text node "Name:", and get the string contents of the next node
name = s.find(string="Name:").next_element.string
# Do the same for "Phone:"
phone = s.find(string="Phone:").next_element.string

Doing the above with regular expressions could have been tricky, for example if there were a varying amount of whitespace between the tds, or if the tds weren’t laid out quite like that.
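As a rough illustration of that point, the sketch below runs a naive regex and a BeautifulSoup lookup against two made-up layouts of the same table row. The pattern, the sample markup and the helper names are all invented for this example, and it assumes the beautifulsoup4 package. It uses find_next("td") so that whitespace text nodes between the cells are skipped.

```python
import re

from bs4 import BeautifulSoup  # assumes the beautifulsoup4 package is installed

# Two made-up layouts of the same table row
compact = "<tr><td>Name:</td><td>John Doe</td></tr>"
spaced = "<tr>\n  <td>Name:</td>\n\n   <td class='v'>John Doe</td>\n</tr>"

# A naive pattern written against the compact layout only
NAIVE = re.compile(r"<td>Name:</td><td>(.*?)</td>")


def name_with_regex(markup):
    """Return the captured name, or None if the pattern does not match."""
    match = NAIVE.search(markup)
    return match.group(1) if match else None


def name_with_soup(markup):
    """Find the "Name:" label and return the text of the next <td> after it."""
    soup = BeautifulSoup(markup, "html.parser")
    label = soup.find(string="Name:")
    # find_next("td") ignores whitespace text nodes and returns the next <td> tag
    return label.find_next("td").get_text(strip=True)


for markup in (compact, spaced):
    print(name_with_regex(markup), name_with_soup(markup))
```

The regex only matches the layout it was written for, while the tree-based lookup returns the same name for both rows.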

That was just one small example. BeautifulSoup has much more functionality to offer, so you should definitely check out the BeautifulSoup homepage for downloads and documentation!