Thoughts on computation, social science, and lifehacking from an up-and-coming data scientist.

Saturday, February 18, 2012

Q&A about web scraping with python

Here are highlights from a recent email exchange with a tech-minded friend in the polisci department. His questions were similar to others I've been fielding recently, and very to-the-point, so I thought the conversation might be useful for others looking at getting into web scraping for research (and fun and profit).

> ... I have an unrelated question: What python libraries do you use to scrape websites?

urllib2, lxml, and re do almost everything I need. Sometimes I use wget to mirror a site, then use glob and lxml to pull out the salient pieces. For the special case of directed crawls, I built snowcrawl.
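To make the mirror-then-extract workflow concrete, here is a minimal sketch using only glob and re from the standard library. It assumes the site has already been mirrored to a local directory (for example with `wget --mirror`) and that the "salient piece" is each page's title. The function name `titles_in_mirror` and the directory layout are hypothetical, and for anything more structural than a title, lxml would be the better tool.

```python
import glob
import os
import re

# Assumption: the "salient piece" we want is the <title> of every page.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def titles_in_mirror(mirror_dir):
    """Map each mirrored .html file (path relative to mirror_dir) to its title, or None."""
    results = {}
    # "**" with recursive=True matches the top-level directory and all subdirectories.
    pattern = os.path.join(mirror_dir, "**", "*.html")
    for path in sorted(glob.glob(pattern, recursive=True)):
        with open(path, encoding="utf-8", errors="replace") as fh:
            match = TITLE_RE.search(fh.read())
        # Collapse runs of whitespace so multi-line titles come out on one line.
        title = " ".join(match.group(1).split()) if match else None
        results[os.path.relpath(path, mirror_dir)] = title
    return results
```

Calling `titles_in_mirror("some_mirror_dir")` returns a dict keyed by relative file path, which is easy to dump to a CSV or feed into a second pass.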

> I need to click on some buttons, follow some links, parse (sometimes ugly) html, and convert html tables to csv files.

Ah. That's harder. I've done less of this, but I'm told mechanize is good. I don't know of a good table-to-csv converter, but that's definitely a pain point in some crawls -- if you find anything good, I'd love to hear about it!

It strikes me that you could do some nice table-scraping with cleverly deployed xpath to pull out rows and columns -- the design would look a little like functional programming, although you'd still have to use loops. Python is good for that, though.
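One way the table-scraping idea might look, as a rough sketch rather than a finished converter: find the table with a path expression, loop over its rows, and hand the cell text to the csv module. The function names `table_to_rows` and `rows_to_csv` are made up for illustration. The code uses lxml when it is installed and otherwise falls back to the standard library's ElementTree, which only understands a subset of XPath and needs well-formed markup; for genuinely ugly HTML you would likely want lxml's HTML parser instead.

```python
import csv
import io

try:
    from lxml import etree  # the library recommended in the post
except ImportError:
    # Stdlib fallback: same calls below, but well-formed markup only.
    import xml.etree.ElementTree as etree


def table_to_rows(markup, table_path=".//table"):
    """Return the first table matching table_path as a list of rows of cell text."""
    root = etree.fromstring(markup)
    table = root.find(table_path)
    if table is None:
        return []
    rows = []
    for tr in table.iter("tr"):  # note: also picks up rows of nested tables
        cells = [c for c in tr if c.tag in ("th", "td")]
        # Join all text inside each cell and collapse whitespace.
        rows.append([" ".join("".join(c.itertext()).split()) for c in cells])
    return rows


def rows_to_csv(rows):
    """Serialize rows to a CSV string, quoting cells only where needed."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()
```

The list comprehensions carry most of the work, which is roughly the functional flavor described above; the explicit loop over rows is still there.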

I got into lxml early and it works wonderfully. The only pain point is (sometimes) installation, but I'm sure you can handle that. My impression is that lxml does everything BeautifulSoup does, faster and with slightly cleaner syntax, but that it's not so much better that everyone has switched. I don't know much about html5lib.

Definitely. The syntax is quite easy, very similar to jquery/css selectors. It also makes for faster development: get a browser plugin for xpath and you can test your searches directly on the pages you want to crawl. This will speed up your inner loop for development time tremendously -- much better than editing and re-running scripts, or running tests from a console.
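To illustrate the resemblance to CSS selectors, here is a small sketch that pairs two selectors with approximate path equivalents. The page snippet and its class names are invented for the example. The expressions are kept within the XPath subset that both lxml and the standard library's ElementTree accept via `findall`, so the code runs either way; lxml's own `xpath()` method supports the full language.

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback, XPath subset only

# Hypothetical page, kept well-formed so either parser can read it.
PAGE = """
<html><body>
  <div class="post"><h2>First</h2><a href="/one">read</a></div>
  <div class="post"><h2>Second</h2><a href="/two">read</a></div>
  <div class="sidebar"><a href="/about">about</a></div>
</body></html>
""".strip()

root = etree.fromstring(PAGE)

# CSS: div.post > h2    roughly   XPath: .//div[@class='post']/h2
# (the XPath form matches the class attribute exactly, not as one token among several)
headings = [h.text for h in root.findall(".//div[@class='post']/h2")]

# CSS: div.post a[href]  roughly  XPath: .//div[@class='post']//a[@href]
links = [a.get("href") for a in root.findall(".//div[@class='post']//a[@href]")]

print(headings)
print(links)
```

Expressions like these can be tried out in a browser first and then pasted into a script once they select the right nodes.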

Web scraping is a technique for extracting information or data from websites. The process is largely about transforming unstructured web content into structured data, which makes it easy to analyze and to store in a database or even a spreadsheet.

Hey, thanks for the valuable article! Big data has the potential to help organizations and companies improve their growth rate and make better decisions, so scraping data from the web can really help organizations improve their operations.