Subscribe to this blog

Follow by Email

Python: Some Notes on lxml

I wrote a webcrawler that uses lxml, XPath, and Beautiful Soup to easily pull data from a set of poorly formatted Web pages. In summary, it works, and I'm quite happy :)

The script needs to pull data from hundreds of Web pages, but not millions, so I opted to use threads. The script actually takes the list of things to look for as a set of XPath expressions on the command line, which makes it super flexible. Let me give you some hints for the parts that I found difficult.

You also have to be careful of thread safety issues. I was sharing an effectively read-only instance of the etree.XPath class between multiple threads, but that ended up causing bus errors. Ah, the joys of extensions written in C! It's a good reminder that the safest way to do multithreaded programming is to have each thread live in its own process ;)

lxml permits access to regular expressions from within XPath expressions. That's super useful. I had a hard time getting it working though. I forgot to pass in the right XML namespace in one part of the code. For some reason, I wasn't getting an error message. (As a general rule, I love it when software fails fast and complains loudly when I do something stupid.) Furthermore, my knowledge of XSLT was weak enough that I had a really hard time figuring out how to combine the XPath expression with the regex. Anyway, here's how to create an etree.XPath instance containing a regex: