This is a 2008 tutorial demonstrating how to write a Haskell program to automatically archive Internet links into WebCite & Internet Archive to avoid linkrot, by parsing WP dumps, downloading & parsing WP articles for external links with the TagSoup HTML parsing library, using the WebCite/IA APIs to archive them, and optimizing runtime. This approach is suitable for one-off crawls but not for live archiving using the RSS feed, see Wikipedia RSS Archive Bot for a demonstration of how one could write a RSS-oriented daemon. These Haskell scripts are (hopefully) obsoleted by later IA/WMF initiatives dealing with linkrot. For a more general approach suitable for personal use, see the writeup of archiver-bot in Archiving URLs.

One of the best ways to learn something like Haskell is to simply dive in and make it do something useful. In my case, I’ve wanted to learn Haskell for a long time: the syntax was gorgeous compared to Common Lisp, Bash, or Java (the only other languages I had experience with); various ideas such as monads, lazy evaluation, and functional purity struck me as fascinating and exciting; and the community was excellent.

But I didn’t have any practical use for it, and I’m too lazy to learn it unless I have to. Bash worked well enough for my shell scripts, and I generally only messed with Common Lisp because of my use of the Stump window manager, but I had finished configuring and extending StumpWM to my liking1. Fortunately, I am a heavy user and long-time editor of Wikipedia, and some technical details bug me.

the feature of MediaWiki that by default, Fujiwara no Teika is a pagename distinct from Fujiwara no teika. Going to the latter does not automatically redirect to the former. One must either create redirects to the right pagename2 , or simply accept that mis-capitalizations will result in broken links and needless searches.

Links to other non-Wikipedia websites, external links, often break. Because articles can evolve so fast, the Internet Archive doesn’t necessarily manage to archive all external links; worse, the Internet Archive is dependent for its data on data dumps from Alexa & its toolbar, which are provided an eternity (6 months or more) after the original spidering of the external link. An external link could be added to an article, used as a reference, fall victim to the infamous 404 error, be noticed to be a dead link and removed on that basis within a few weeks much less 9 months!

The beginning of a solution is to realize that the Internet Archive is not usable here. Their FAQ specifically says that archiving is done only through Alexa, and Alexa only does it through their spiders, the Alexa toolbar (proprietary and MS Windows/IE-only), and some web form which doesn’t appear to do anything3. Their Archive-It service does do on-demand archiving, but at steep institution-targeted fees; not something for individual users.

Fortunately, if we poke around on the subject of archiving URLs, we’ll eventually discover something like WebCite, which is tailor-made for our purposes. They offer persistent archiving services, for free, and best of all they can archive on demand!

Reading through their documentation, it seems they will eventually allow you to upload a file containing a large number of URLs to cite, but not yet. That’s alright - just trying out the archiving (or reading the technical FAQ) reveals the submission URL encodes everything in a very simple format:

So going back to our problem. Our problem is that we want to take all the articles of the English Wikipedia, somehow extract all the external links from said articles, transform those URLs according to the previous paragraph, and then actually open them (which will cause the embedded URL to be archived).

Our first question is naturally How on Earth do we even get the addresses of the millions of Wikipedia articles? Saying they’re all somewhere in en.wikipedia.org/wiki/ doesn’t help. I’m going to cop out a bit here and not actually explain how I hit upon this approach, but we’ll borrow some code from the bot I coded to fix Problem 1, which looks a little like this:

Notice that spaces are being escaped as underscores, and that each page name is on a separate line. We are getting all this information from a special file kindly provided by Wikipedia precisely for people like us, who don’t want to spend the effort to parse Special:Allpages to get the live list of article names; parsing Allpages is how all the serious Wikipedia bots work, but relying on old database dumps gets us 90% of the functionality at 10% the effort. Go visit http://download.wikimedia.org/enwiki/latest/ and look for a file called all_titles_in_ns0.gz. When it’s downloaded and uncompressed to a ~60M file, it provides suitable fodder.

So, we have getContents lazily slurping in names of articles to standard in, and we have words (it’s a pure non-IO function, but liftM lets us use it in the IO monad) breaking a single long string with lots of newlines into a list of strings.

But what is this mystery function fetchArticleText? Well, it will take as its single argument a String, and it must return a list of Strings since any given article might have many different external links in it. And since we certainly aren’t loading into memory an immutable copy of Wikipedia (regardless of whether we download a full dump of the articles or decide to fetch articles over the Internet), the result will be promoted into the IO monad. So the type signature must be:

Purely arbitrarily, we’ll have fetchArticleURLs work on pages from the live English Wikipedia; but you could probably also go the route of working from an offline database dump, which would have the advantage of halving network requests and parsing out URLs much faster (since there’d be no need to request individual pages over the network).

It’s not too hard to see how to get the URL for any given article ("http://en.wikipedia.org/wiki/"++article), nor how to use Network.HTTP to download the HTML file (couldn’t be more than 10 lines) - but reliably parsing that HTML to extract genuine external links doesn’t seem trivial at all! We could go with some sort of clumsy regexp matching on http://, but hit-or-miss regular expressions just aren’t The Haskell Way.

Stumped, the thing to do is to go bum around on the Haskell wiki and #haskell, where some excellent fellow will eventually inform us that Hey, you know, parsing Wikipedia pages’ HTML to extract out all external links kind of reminds me of one of the examples for Neil Mitchell’s tagsoup library. You should go check it out.

Indeed, we need some sort of library to parse HTML for us and maybe handle the network business, and TagSoup fits the bill admirably. One of the examples in Example.hs (googleTechNews) does almost precisely what we want, except for Google Tech News. The example in its entirety:

We could replace undefined with whatever parsing we cook up, but we might want to use it somewhere else, and besides, it’s bad practice to muddy up a function that could be pure (like our hypothetical extractURLs function) with IO. Separation of concerns, and all that.

Alright. This list comprehension has 3 parts, which we will read from top down. We start with our arg, a String. It immediately passes to parseTags, which turns it into [Tag] - notice it’s not a tree or hierarchical structure like our String actually represents; this is why it’s soup.

A Tag can be a lot of things, but we quickly realize we only want TagOpen.4 The "a" is there because <href> actually begins with <a href=...>.

But how does TagOpen "a" atts do anything? We can think of a list comprehension as syntax sugar over a bunch of filters and whatnot. In this case, what’s really happening is something like filter (\TagOpen "a" x -> x) (parseTags arg); any entry which isn’t a TagOpen is omitted from the intermediate list.

The second line is exactly the same. We’re still working on a [Tag], and our little filter is still testing away, but now our test is really just pulling out the second entry in each TagOpen’s [Attribute] - it’s equivalent to a map of snd. And this gives us a [String] of URLs! The first entry in the lists are href, and the second are the URLs the href points to. But when we test out our two-liner, we realize to our dismay we are getting junk results like ["/wiki/Japanese_poetry",..]. A quick isPrefixOf and we fix that, and that finishes our list comprehension.

By the time urllist is generated, we’ve got an IO [String] which is containing all the external links. Now we can once again indulge the desire to map some sort of function over it. Another map, another hypothetical function:

But we can improve on this! We are using mapM which will go to all the trouble of retaining all the output from archiveURL being run on the dozens, hundreds, thousands, or millions of entries in urllist.

Can’t we be more efficient somehow? This sort of situation is exactly where naming conventions can come in handy. I happen to remember that just as foo' denotes some stricter version of a function foo, a foo_ denotes some sort of function that will throw away its output. One blindly tries mapM_ instead, and it works!

We could have been more consistent. We know that mapM :: (a -> m b) -> [a] -> m [b], and we want to get rid of the output, or turn the [b] into nothing, not even a list, which means (); we throw the new type signature into Hoogle, and the very first result is mapM_.

We could also throw in a speculative request to Alexa, which handles archiving for the Internet Archive. Supposedly any URL visited by someone using their toolbar will be spidered and then the spidered copy given to the Internet Archive. We can mimic the request of the toolbar. It turns out to be not that difficult to do, not nearly as difficult as Alexa could have made it. (Ironically, my first attempt was based on trying to read the JavaScript source code to the Alexa toolbar, which I followed in detail - and the moment I checked an actual request URL, I realized I had gotten it completely wrong.) This code, for brevity’s sake, will not be in the final WP archive bot code, but it is included in the general-purpose archiver daemon which will archive any URLs in a file in multiple ways and can be seen as the generalized version of this WP archive bot.

With the handy Firefox extension Live HTTP Headers, we can easily snoop on the toolbar’s traffic. I turned it on, visited a few domains with short names, saved the traffic to a file, and grepped for alexa.com and http://:

We can rule out widgets.alexa.com - the URL structure and the name indicates that these requests are being made by the part of the toolbar that is showing the little traffic-vs-time graph in the Firefox GUI. So:

&url=http%3A%2F%2Fwebcitation.org%2F (the URL-encoding form of http://webcitation.org/)

We can hardwire #1 & #3; I am guessing that we can hardwire #2 and #4; #6 can be left empty since none of the examples used it; #7 is just the payload; but unfortunately, #5 seems a little more difficult. We have 4 examples here for the cdt field:

rq is presumably short for request, and an incrementing integer suggests that it’s counting how many pages one has visited. For the archiver bot, we can hardwire it as 0 because there will be a good 20 seconds between each request (due to the WebCite timeout).

wid is troubling because it’s constant, and a fairly large number. It doesn’t seem large enough to be a user ID - surely the toolbar has had more than 7000 users, since the toolbar page says it has had 1.6 million downloads. Is it a user ID? We’ll leave it for now.

ttl is also troubling because it changes but isn’t even in the first request. Does it stand for time to live? Does it vary if I load the same page many times, is it a measure of how fast pages load or something (a metric very important to search engines, who rank higher the sites which are faster)? Can we get away with omitting it?

We need more information on ttl and wid. So let’s visit a bunch more URLs and see whether more data says anything:

The smallest in any ttl is 376, the largest is 2241. This is a range which smells suspiciously like a page load time - few sites load much faster than 0.25 seconds and most clock in at a second or two. Combined with the observation that the first sample URL doesn’t have a ttl, I think we can safely omit the ttl field.

And wid? We speculated it’s some kind of user ID. In this second batch, the field is wid%3D15698 or wid=15698. So the wid is not consistent across Firefox sessions, but it is consistent across multiple page loads. This suggests to me that Alexa is only aggregating URLs by a browser session ID (maybe it originally tracked individual windows, hence the w part of the ID). We can probably just generate a random number between 1000 and 20000 every time we archive a URL. So, all told our pseudocode goes like this:

We compile the entire program, and it is short. It is sweet. It works, and does what we want, and has explicit types and import statements, and we’ve reasoned out every bit of it. So we should be pretty happy with it. But I’ve noticed one thing while playing around in GHCi: it’s naive and inefficient. Not only does it eat up ungodly amounts of memory and time, it also seems to be making duplicate requests! The reason for this is that on every Wikipedia webpage containing an article, there are a number of external links which aren’t part of the article itself; this includes linking to the WikiMedia Foundation (the non-profit organization which runs all the Wikipedias and associated projects), links to donations, etc.

Well, not a hard problem. We want to ensure there are no duplicate entries in our list; one suspects this is a common desire in list processing.

The type signature would be [a] -> [a], but unfortunately, this is far too general: Hoogle will not even put the right function on the first page.

We need to think: any function of this sort must go through the list, and compare entries. Compare what? For greater-than, less-than, etc.? That would imply an Ord constraint, but punched in, we see nothing useful. No, those are useful for sorting, but we only want the entries to be unique and not sorted in anyway. We only need ==, which means an Eq constraint. nub is the very first hit for Eq a => [a] -> [a].

But alas, we see an all too common problem with beautiful concise Haskell: it doesn’t perform well. We use it from the shell like thus:

$ head -n 20 enwiki-all-titles-in-ns0 | archive-bot

It performs acceptably for small values of nn, but if we try 100 or more, we quickly notice extreme slowness & memory usage. Looking through the code, nub immediately comes to mind as a culprit. Everywhere else, we’re using maps and the like, which are lazy and make nice streams with constant memory-use, but nub operates on the whole list of URLs at once, so perhaps it’s slow. nub is O(n2)O(n^2) because it compares every entry against every other entry in a list. Let’s optimize it.

nub has to do O(n2)O(n^2) matches in part because it’s trying to preserve order (which is sort of important for a list). But we don’t care at all about order, so we’re free to do something clever… like convert the list to the mathematical structure of sets and then back! We know in sets each member is unique (no {1,1,2} business), so we know it works, and the performance is better than nub. So add in:

To convert to ByteString, we first convert to ByteString equivalents wherever possible - B.words $ B.getContents, for example. Unfortunately for our main function, ByteString introduces some weirdness, so we de-sugar the do notation to use the good old =<< operator to make sure everything fits together. But take heart, it’s shorter than before:

Notice how we’ve split out our concatenating and uniqueifying code to its own little points-free local definition. You may’ve also noticed that we’re using the regular concat and not ByteString.concat; this is because we actually aren’t concatenating the ByteStrings themselves but merely the list of lists.

The other bit of trickiness is that TagSoup regular Strings, so we need manually convert back and forth between Strings and ByteStrings using B.pack and B.unpack, and of course the type signatures need to be updated. But this is all pretty simple and mechanical; the final result of all this fiddling is the noticeably uglier but also noticeably faster and leaner code:

What’s that, you say? While ByteStrings has indeed made some difference, it’s marginal?

Fortunately, we can go still further - there’s not a whole lot we can do for the algorithm itself yet, but we can investigate parallelizing matters.

Remember that archiveURL was a bit inefficient because even with mapM_, each entry was being processed one at a time? Well, is this not Haskell? Aren’t functional programming languages famous for having functions like map which are embarrassingly easy to parallelize? If archiveURL were pure, we’d use Control.Parallel.Strategies, but it’s not.

We’re throwing away the output from openURL inside archiveURL. Ideally we would somehow be mapping a function over urllist which would say in effect Take this single ByteString and go off somewhere by yourself and do whatever needs doing; don’t bother coming back. In short, we want to fork archiveURL off.

But what function will do this for us? What is its signature? Let us think about it.

Clearly, if it really is going away forever, it cannot be pure, because such a function would be useless. If we can only write fork (1+1), how do we know it ever ran at all? So the result of the argument must be something in IO. Perhaps it is IO [a]? But that’s not good, since the argument doesn’t have to evaluate to a list - what about arguments which mess around with tuples, or sets, or all the other data structures we have? The most general argument type is IO a.

We could imagine a goAway :: (a -> IO a) -> a -> IO (), and many other variants; but we cannot communicate with goAway - that’s the whole point - and so like a spaceship, it must have everything it needs and its arguments combined must be IO a - so why insist on one particular call-scheme? goAway :: IO a -> IO () is more general.

Unfortunately, Hoogle cannot help us here. If we ask for IO a -> IO () or IO a -> IO a or other plausible types we could think of, we get a wide variety of functions, none of which we want. What we want is forkIO, with the signature IO a -> IO ThreadId.

This is a good illustration of why purity is important: it narrows down the range of possibilities. When a function starts with IO actions and ends with IO actions, it could be doing anything, and our search results reflect that. In comparison, a type like (a,b) -> a is so specific that djinn can mechanically generate the function for that type5

Much better! I’ve noticed improvements of 20-100%, since we can run on multiple CPUs. That little change was certainly worth it.

One small tweak occurs to me; if we ask a GHC 7 version of Hoogle for a function with the type m a -> m (), it will return a hit named Control.Monad.void, which is our >> return () in functor garb. If void is available, we would rewrite archiveURL like this:

But these speed increases have made me greedy - I want to be able to process a few thousand names at once, not a measly few hundred. This should cause us to ponder. Don Stewart has a quite fast program to check the viability of links, urlcheck; it delves pretty heavily into threads and gets great performance but it honestly looks pretty scary and hard to read. Do we have to resort to such nastiness?

Well, there’s a Haskell saying: the best way to optimize a program is to make it lazier or stricter.

It’s already strict in a lot of places because of the IO, and going further doesn’t seem like a good option - how can one pipe in thousands of articles if the program is going to strictly and immediately eat all one’s RAM? Laziness is the way to go.

To figure out how to get greater laziness, let’s go through step by step. We know getContents is lazy because it makes use of unsafeInterleaveIO according to the documentation - which is why we can have lazy IO. Let’s hold onto that thought.

Next is B.words. A quick experiment in GHCi (take 100 $ unwords $ repeat Yes; that sort of thing) convinces me that words is lazy as one would expect.

Next is mapM fetchArticleURLs. Here’s a problem. fetchArticleURLs does IO and so is strict, and mapM means it is little better than a loop over the list - the entire list. There goes our laziness since the whole list will be processed before being passed onto forkIO and archiveURL. But laziness and IO can co-exist in peace and harmony!

Remember getContents is lazy, and it can achieve this magic using unsafeInterleaveIO. Reading through the documentation, it seems to be safe in certain situations: when the IO target isn’t going to change, isn’t going to change as a result of whatever is done - or one doesn’t care when the IO is performed.

In our situation, it’s irrelevant whether we access an article at time nn or time n+1n+1. When you’re working on millions of articles, it doesn’t matter much if one article is changed or even deleted. So we can safely use unsafeInterleaveIO! We could stick it in main or fetchArticleURLs, but for convenience’s sake, I put it in the latter.

Let’s think about the flow now. getContents keeps on delivering a few characters to words, which is waiting for a newline, at which point it turns those characters into a String and passes them onto mapM which will immediately run fetchArticleURLs - and fetchArticleURLs will immediately finish, leaving behind some thunks or whatever holding a promise that it’ll run the IO actions (downloading a particular Web page) whenever it’s needed. And that promise gets passed into nub, which is busy concatenating and then uniqueifying the results of those thunks.

So now the program will run in constant space (I notice that even when I pass it 10000 or 20000 names it’ll use a steady 7.6% or so of memory as opposed to a varying ~40% with previous versions running on a few hundred names). It may take a while longer, but this is a small price to pay - now my computer will still be usable while the program is running.

Are we tired of this program yet? I’ve thought of one last and particularly nasty thing to do to speed things up and recover the speed lost to laziness. We have uniq in there to remove duplicate entries and be a little friendlier to WebCite, correct? But if the program is hardwired to always run on Wikipedia pages, then we can just run fetchArticleURLs on a few pages, note down all the duplicates to help pages and legal notices etc., and hard-wire in a filter to remove them. This is a hack.

But here is the final, ugly, fast and frugal version. This combines the lazy IO covered previously and adds in the hardwired filter.