Basic Web Scraping with Emacs

Oct 17, 2018

Web scraping is the extraction of data from web pages. But most web pages aren’t designed to accommodate automated data extraction; instead, they’re designed to be easily read by humans, with colors and fonts and pictures and all sorts of junk. This makes web scraping tricky. There are two predominant techniques for web scraping: HTML parsing and browser automation.

Before going on, I must confess a shameful secret: I don’t understand HTML very well. It’s just too ugly to get me interested. Every so often I’ll try to sit down and read about HTML, and I usually get bored and quit right around the time they get to unordered lists (<ul>). Why couldn’t they just use S-expressions? Do the brackets and explicit close tags actually add anything? Whatever, it doesn’t matter. The bottom line is that I hate dealing with HTML and I’d prefer to avoid it if I possibly can.

So I’m left with browser automation if I want to scrape. But for simple scraping tasks, especially one-off tasks, most browser automation tools seem like overkill. You have to download the thing, then figure out how to use it, wade through documentation, learn the relevant APIs, blah blah blah. Maybe it’s another personal failing, but I hate doing all that stuff, and again I would prefer to avoid it if I possibly can.

So what am I to do when I want to scrape? As usual, the answer is easy: Emacs. And why not? In most cases the data I want to scrape is text, and Emacs is an all-purpose text-handling tool, so really, what else would I use?

As an example, consider Hurriyet Daily News [1], an English-language Turkish news site. Comparing it to American news outlets in terms of journalistic quality, I would say it’s like CNN – not a real journalism organization like the New York Times, but also not a propaganda dissemination machine like Fox News. If you want to keep up with Turkish news and you don’t speak Turkish, it’s not a bad option.

The centerpiece of the landing page is a box containing a half dozen or so headlines with accompanying images. These headlines scroll through one at a time. Suppose, for whatever reason, that I’m interested in tracking the headlines that show up in that box. Here’s how I would do it in Emacs.

First, pop the site open in the Emacs browser eww [2]. It should look like this (and if it doesn’t, then the website has changed and this post is out of date):

There are the headlines, together with topic keywords and some kind of text object corresponding to the story (I think it doesn’t show up because JavaScript isn’t run). Now, open an empty buffer. I usually call my empty buffers asdf, but you can call yours something else if you want. Our ultimate goal is to copy each of those headlines into the empty buffer, at which point we can do whatever with them.

To do this, we’ll use a keyboard macro [3]. Steve Yegge once said “I believe I can state without the slightest hint of exaggeration that Emacs keyboard macros are the coolest thing in the entire universe”, and he’s not wrong. Keyboard macros make boring, repetitive tasks quick and even fun (I’ll sometimes spend more time trying to craft the perfect macro than it would have taken me to do it manually). The way keyboard macros work is you start recording, then hit some keys, then stop recording. When you play back the macro, the keys you recorded will be entered again. The meaning of the keys is not recorded, just the keys themselves, so be careful!
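To make the “just the keys” point concrete, here is a minimal sketch that replays a key sequence from Lisp with execute-kbd-macro. The helper name my/replay-keys-on-text and the sample text are made up for illustration, and it assumes the default global key bindings; with different bindings, the same keys would do something else.

```elisp
;; Hypothetical helper, not part of the scraping workflow itself.
(defun my/replay-keys-on-text (text keys count)
  "Insert TEXT in a temporary buffer, replay KEYS COUNT times, return the result.
KEYS is a string in `kbd' notation.  Assumes default key bindings."
  (save-window-excursion
    (with-temp-buffer
      ;; Keyboard macros act on the selected window's buffer,
      ;; so the temporary buffer has to be displayed first.
      (switch-to-buffer (current-buffer))
      (insert text)
      (goto-char (point-min))
      (execute-kbd-macro (kbd keys) count)
      (buffer-string))))

;; "C-e ! C-f": go to the end of the line, type "!", step onto the next line.
(my/replay-keys-on-text "one\ntwo\n" "C-e ! C-f" 2)
;; => "one!\ntwo!\n"
```

Nothing in the key string says “move to end of line”; it only says C-e, and Emacs looks up whatever C-e happens to mean at playback time.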

Now, pay attention here, because the details are important (except for the details of the headlines, which don’t matter at all). For reference, the first few lines of the headline section look something like this:

Move point (cursor) to the beginning of the line in the eww buffer that says Home Page.

Start recording a keyboard macro. The default binding for this is C-x (.

Hit TAB. This will jump down to the first topic keyword, which in this case is WORLD. (This is a link of some kind).

For whatever reason, the headlines can’t be reached by TAB-jumping, so move the cursor down three lines (C-n, or <down>).

The cursor should now be at the beginning of the line that says “Saudi consul…”. If it isn’t, move it there with C-a. Now highlight the whole line. This can be done by setting the mark and moving to the end of the line (C-SPC C-e), but it can be done other ways too.

Copy the highlighted text, or kill it or whatever the weird Emacs terminology is. I use C-k for this, but that isn’t the default binding, which I can never remember (it’s M-w, which runs kill-ring-save).

Jump over to the empty buffer. The default binding for this is C-x b, which is an unbelievably shitty way to do something as common as changing buffers. Anyway, hit that and then enter the name of the empty buffer (asdf for me).

The cursor should be at the beginning of the buffer, which should have nothing in it. Paste in (or yank or whatever) the copied text. I use C-v for this, which again is not the standard binding (that would be C-y).

Enter a newline (RET). The cursor should be at the beginning of an empty line at the end of the buffer.

Jump back to the eww buffer. The cursor should be at the end of the “Saudi consul…” line.

Stop recording the macro. The default binding for this is C-x ).

At this point the previously empty buffer should have the first headline, with an inactive cursor at the beginning of an empty line below it, and the active cursor should be at the end of the first headline in the eww buffer. Good? Okay, now execute the macro with C-x e. If it worked, the situation should be the same, but with the second headline copied into the other buffer, and the cursor at the end of the second headline in the eww buffer. Neat, right? If it didn’t work, something got screwed up, and there’s no telling what happened. Undo whatever it did and try again.

There are a few more headlines, so execute the macro as many times as needed to get all of them. For convenience, after hitting C-x e the first time, the macro can be replayed again by just hitting e.
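For what it’s worth, the recorded steps can also be written down as a key string and turned into a named command. This is only a sketch under a few assumptions: it uses the default bindings M-w and C-y rather than my custom ones, it assumes the eww buffer is the one C-x b offers as its default when jumping back, and the names my/hurriyet-copy-keys and my/hurriyet-copy-headline are made up.

```elisp
;; Hypothetical names; the key sequence assumes default Emacs bindings.
(defconst my/hurriyet-copy-keys
  (kbd (concat "TAB "               ; jump to the next topic link
               "C-n C-n C-n "       ; down three lines to the headline
               "C-a C-SPC C-e M-w " ; select the line and copy it
               "C-x b asdf RET "    ; switch to the empty buffer
               "C-y RET "           ; paste, then add a newline
               "C-x b RET"))        ; back to the previous (eww) buffer
  "Key sequence roughly equivalent to the macro recorded above.")

;; A symbol whose function definition is a key sequence counts as a command,
;; so it can be called with M-x or bound to a key.
(defalias 'my/hurriyet-copy-headline my/hurriyet-copy-keys)

;; To replay it six times from Lisp, with point on the "Home Page" line:
;; (execute-kbd-macro my/hurriyet-copy-keys 6)
```

Interactively, C-u 6 C-x e does the same thing as the commented-out call: it runs the last recorded macro six times.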

Now, if you wanted to leave it at that, you could, and you would, as far as anyone could tell, have a function that did exactly what the macro did. You could call it, bind it to a key, whatever. However, with a macro as complex as this one, it’s usually better just to write a real function. This can be done without too much trouble, as the bulk of the work is just figuring out what commands the key presses are bound to, and then putting those in the function. It doesn’t have to be fancy.
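Figuring out what the keys are bound to can be done one at a time with C-h k, or in bulk from Lisp. The sketch below looks up the global bindings for some of the keys used in the steps above. The function name is made up, and it only consults global-map, so mode-specific bindings (like TAB in eww) won’t show up.

```elisp
;; Hypothetical helper: map each key in the macro to its global command.
(defun my/describe-macro-keys (keys)
  "Return an alist of (KEY . COMMAND) for each key string in KEYS.
Lookups are done in `global-map', so mode-specific bindings are ignored."
  (mapcar (lambda (key)
            (cons key (lookup-key global-map (kbd key))))
          keys))

(my/describe-macro-keys '("C-n" "C-e" "M-w" "C-x b" "C-y"))
```

On a stock configuration this should report next-line, move-end-of-line, kill-ring-save, switch-to-buffer and yank; a customized Emacs (like mine) may report something else for some of them.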

Here’s a function for scraping Hurriyet based on that macro. It grabs the headlines and then dumps them into a file called hurriyet-headlines along with a timestamp. Some example output:

(require 'shr)

(defun scrape-hurriyet-headlines ()
  "Scrape the top Hurriyet Daily News headlines.
The Hurriyet home page is expected to be laid out as follows:
<front matter>
Home Page
<topic -- LINK>
<story>
<headline>
<topic -- LINK>
<story>
<headline>
<topic -- LINK>
<story>
<headline>
...
The scraping strategy will be to jump to that home page section, then
walk down the first seven links and copy the headlines associated with
them, pasting them in to a result file.
"
  (interactive)
  (let ((site "http://www.hurriyetdailynews.com/")
        (file (find-file "~/hurriyet-headlines"))
        (headline-count 7))

    ;; Add date and time
    (switch-to-buffer file)
    (goto-char (point-min))
    (insert (format-time-string "%F %T %Z" nil t))
    (newline 2)

    ;; Give eww some time to load
    (eww site)
    (sit-for 2)

    ;; Jump to "Home Page" header
    (re-search-forward "^home page$")

    ;; Stories look like this in eww:
    ;; <topic -- LINK>
    ;; <story>
    ;;
    ;; <headline>
    (dotimes (_ headline-count)

      ;; Navigate to headline
      (shr-next-link)
      (dotimes (_ 3)
        (forward-line))

      ;; Copy headline
      (set-mark-command nil)
      (move-end-of-line nil)
      (kill-ring-save t t t)
      (deactivate-mark)

      ;; Paste headline
      (switch-to-buffer file)
      (yank)
      (newline)
      (switch-to-buffer "*eww*"))

    ;; Save and prepare file for next invocation
    (switch-to-buffer file)
    (newline 2)
    (save-buffer file)))

To be clear, this is NOT elegant Elisp, and it definitely does stuff that would be inappropriate in a distributed package. It’s also brittle, as scrapers tend to be – if the Hurriyet website changed its format, I would have to dump it in the trash and start over. Nonetheless, it works fine for personal use.
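As one example of what a less macro-shaped version might do differently, here is a sketch of the copy-and-paste step that leaves the kill ring and the window layout alone. It is not what the function above does, and the function names are made up; it just reads the line at point as a string and appends it to another buffer.

```elisp
;; Hypothetical helpers, shown only to illustrate the idiom.
(defun my/current-line-text ()
  "Return the text of the line at point, without text properties."
  (buffer-substring-no-properties (line-beginning-position)
                                  (line-end-position)))

(defun my/append-line-to-buffer (target)
  "Append the line at point to buffer TARGET, followed by a newline.
TARGET is a buffer name; the buffer is created if it does not exist."
  (let ((text (my/current-line-text)))
    (with-current-buffer (get-buffer-create target)
      (goto-char (point-max))
      (insert text "\n"))))
```

With point on a headline in the eww buffer, (my/append-line-to-buffer "asdf") would stand in for the select, copy, switch, yank, newline and switch-back steps.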

Footnotes

[1] hürriyet is a Turkish word derived from the Arabic حرية, meaning “freedom”.