Wikipedia for the iPhone/OLPC: storing the sum of human knowledge in 2GB

This talk tells the technical story behind creating a GPL’d application for storing and reading a copy of Wikipedia on the iPhone and OLPC, in effect realizing The Hitchhiker’s Guide to the Galaxy. The application is now one of the most popular iPhone apps (with over 100,000 downloads) and a pre-installed OLPC app (bringing Wikipedia to kids who’ve never used the internet). We describe the challenges and hacks involved in making it all work.

Storing Wikipedia in 2GB requires highly space-efficient compression for both article text and search indices. Doing this inside an interactive application, on a device with very limited processing power and memory, introduces further constraints that call for unusual solutions.
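
To make the tension between compact storage and limited memory concrete, here is a minimal Python sketch of one possible approach: compress articles in small independent bzip2 streams and keep an offset table, so that reading one article only decompresses one block. This is an illustrative assumption, not the application's actual file format; the names `build_archive`, `read_article` and `block_size`, and the NUL separator, are all hypothetical.

```python
import bz2

def build_archive(articles, block_size=4):
    """Compress articles in independent bzip2 streams (hypothetical layout).

    Returns (archive_bytes, index), where index maps each title to
    (block_offset, block_length, position_within_block).
    """
    archive = bytearray()
    index = {}
    items = list(articles.items())
    for start in range(0, len(items), block_size):
        group = items[start:start + block_size]
        # Join article bodies with a NUL byte, assumed never to occur in text.
        raw = b"\x00".join(text.encode("utf-8") for _, text in group)
        compressed = bz2.compress(raw)
        for pos, (title, _) in enumerate(group):
            index[title] = (len(archive), len(compressed), pos)
        archive += compressed
    return bytes(archive), index

def read_article(archive, index, title):
    """Decompress only the block that holds the requested article."""
    offset, length, pos = index[title]
    raw = bz2.decompress(archive[offset:offset + length])
    return raw.split(b"\x00")[pos].decode("utf-8")
```

Larger blocks generally compress better but cost more time and memory per lookup, so `block_size` is the knob that trades archive size against responsiveness on a small device.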

Some of the hacks we describe include:

partial decompression and indexing of bzip2 files;

fast prefix and substring matching using a single compressed index, based on James A. Woods’ work (“Finding Files Fast”, USENIX ;login: February 1983), to enable near-instantaneous search of article titles;

efficient storage, parsing and rendering of MediaWiki markup, using fairly computationally intensive preprocessing to enable a rapid, single-pass parser on the device itself.
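
As a rough illustration of the title-index item above, the following Python sketch shows incremental (front) encoding of a sorted title list, in which each entry stores only the length of the prefix it shares with the previous entry plus the remaining suffix, together with a sequential scan that answers both prefix and substring queries. It is a simplified sketch of the general idea only and does not claim to match the index described in the talk; the function names `front_encode`, `front_decode` and `search` are hypothetical.

```python
import os

def front_encode(sorted_titles):
    """Encode each title as (shared_prefix_length, remaining_suffix)."""
    prev = ""
    encoded = []
    for title in sorted_titles:
        shared = len(os.path.commonprefix([prev, title]))
        encoded.append((shared, title[shared:]))
        prev = title
    return encoded

def front_decode(encoded):
    """Rebuild titles one at a time, keeping only the previous title in memory."""
    prev = ""
    for shared, suffix in encoded:
        prev = prev[:shared] + suffix
        yield prev

def search(encoded, query, prefix_only=False):
    """Scan the single encoded list for prefix or substring matches."""
    if prefix_only:
        return [t for t in front_decode(encoded) if t.startswith(query)]
    return [t for t in front_decode(encoded) if query in t]

titles = sorted(["Galaxy", "Galileo", "Gallium", "Hitchhiker"])
encoded = front_encode(titles)
print(encoded)
print(search(encoded, "Gal", prefix_only=True))
print(search(encoded, "hh"))
```

Because neighbouring titles in a sorted list tend to share long prefixes, most entries shrink to a small count and a short suffix, and the same list serves both kinds of query without a second index.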

In addition, we describe some of the systems involved on the server side, such as creating an ad-hoc CDN for distributing our customised dumps.

Patrick Collison

Stripe

Patrick is an Irish Lisp, Smalltalk and C hacker. He won the Irish Young Scientist of the Year award in 2005, for work on a new dialect of Lisp. Later in 2005, he came second in the European Union Contest for Young Scientists. He started college at MIT in 2006, but deferred to cofound Auctomatic in early 2007. Ten months later, Auctomatic was acquired by Live Current Media for circa $5 million. He’s currently working on his second start-up.