Offline copies of Wikipedia

I have been involved for a
number of years with Hilton Theunissen and the Shuttleworth Foundation in
their efforts to bring computers to township schools. Part of that software
suite was an offline copy of Wikipedia.

Early attempts

I have blogged before about
my own project Wizzy Digital
Courier, which put thin-client labs down in South African classrooms. That
also included a copy of the English-language Wikipedia.

Initially, in 2003, I took the whole of the then-existing English Wikipedia and
installed a copy of the MediaWiki software in conjunction with MySQL and Apache
as database and webserver respectively. The whole thing was around 18 gigabytes
- quite a handful.

It worked well, but I had various complaints about the unsuitability of the
material - it was a single snapshot, and had not (and could not) been proofread,
so it contained vandalism, and some quite explicit articles around sex. Oops.
But - it had search, and it had a vast amount of useful information on all
manner of subjects.

Very soon the English Wikipedia ballooned to hundreds of gigabytes, making it
completely unmanageable in terms of size. I couldn't download it, and I
couldn't proof it. What to do?

Wikipedia Version 1.0

Wikipedia has its own community of people, and among them I connected with
Andrew Cates, of the
SOS Children website, who came up with a selection of articles (1000 or so) as
an HTML dump that he and some others painstakingly proof-read for suitability
as a children's educational resource. (He has a larger article collection now).
Jonathan Carter helped
package this for the tuXlab installs for the Shuttleworth Foundation.

There is a lot of work to do preparing such a collection:

- Which articles should be selected?
- All the article text in the HTML dump must be stripped of links to articles not in the dump
- It must be proofread
- Associated pictures must be incorporated

Other people became involved, in particular Martin Walker from the State
University of New York at Potsdam. Systems were put in place to help with
article selection. A project was started - the Wikipedia Version 1.0 Editorial
Team. Articles were assessed both for quality (from Featured Article down to
Stub) and for importance (from Top to None). These assessments are placed on
the article Talk page, and a robot goes through them all on a regular basis and
collects the results on project pages like this and this - conveniently doing
all the heavy lifting to present sortable tables of the state of all the
articles.

Thus you can find Top Importance articles of poor quality, and can highlight
that page for improvement. You can cherry-pick Featured Articles to add to the
collection. These tools made the article selection process far more
manageable.

I assisted in the post-processing by writing a script that searches all
the chosen articles for 'bad
words' - an indication that an article may have been vandalised - so that a
cleanup crew can go through the flagged articles to check and possibly remove
material.
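A minimal sketch of such a scan might look like the following - the word list and the file layout are placeholders here, not the real ones:

```python
import os
import re

# Placeholder word list - the real list was curated by hand.
BAD_WORDS = {"badword1", "badword2"}

PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(w) for w in sorted(BAD_WORDS)) + r")\b",
    re.IGNORECASE,
)

def flag_articles(dump_dir):
    """Yield (path, matched words) for articles needing manual review."""
    for root, _dirs, files in os.walk(dump_dir):
        for name in files:
            if not name.endswith(".html"):
                continue
            path = os.path.join(root, name)
            with open(path, encoding="utf-8", errors="replace") as f:
                hits = {m.group(0).lower() for m in PATTERN.finditer(f.read())}
            if hits:
                yield path, sorted(hits)

if __name__ == "__main__":
    for path, words in flag_articles("dump"):
        print(path, ", ".join(words))
```

The script only flags articles; the decision to cut material stays with the human cleanup crew.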

Now to package all this conveniently. A French company called Linterweb came up
with Okawix - all these articles and
pictures packaged in a file, with a cross-platform reader to navigate the
collection. Why do we need a reader? To implement search.

Search

For many of the places we put an offline Wikipedia down, it became the
'Internet' for the children in the classroom. They had no net connection, but
the principles of self-paced learning, hyperlinks for tangential information,
and other net paradigms made it the 'killer app' of their little school. For
the Internet, you need search. For a Wikipedia collection of a few thousand
articles, you need search. Search needs a computer - you cannot put search on a
CD or DVD or USB stick.

For a standalone computer, a reader is needed to perform this function. With
a basic HTML dump navigated in a browser, JavaScript can be pressed into
service, but I have found it inadequate. In the tuXlab thin-client labs, a
small network of old computers is connected to a powerful server - and I want
the Wikipedia collection to be browsed via HTTP, and the search to be performed
server-side.
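A server-side search along those lines can be sketched with nothing but the Python standard library - a naive in-memory inverted index over the HTML dump, queried over HTTP. This is an illustration of the idea, not the actual tuXlab implementation; the dump location and URL scheme are assumptions:

```python
import http.server
import os
import re
import urllib.parse

DUMP_DIR = "dump"  # assumed location of the HTML article tree

def build_index(dump_dir):
    """Build a naive inverted index: word -> set of article paths."""
    index = {}
    for root, _dirs, files in os.walk(dump_dir):
        for name in files:
            if not name.endswith(".html"):
                continue
            path = os.path.join(root, name)
            with open(path, encoding="utf-8", errors="replace") as f:
                text = re.sub(r"<[^>]+>", " ", f.read())  # crude tag strip
            for word in set(re.findall(r"[a-z0-9]+", text.lower())):
                index.setdefault(word, set()).add(path)
    return index

def search(index, query):
    """Return paths of articles containing every term in the query (AND)."""
    hits = None
    for term in re.findall(r"[a-z0-9]+", query.lower()):
        matches = index.get(term, set())
        hits = matches if hits is None else hits & matches
    return sorted(hits or [])

INDEX = build_index(DUMP_DIR)

class SearchHandler(http.server.SimpleHTTPRequestHandler):
    def do_GET(self):
        url = urllib.parse.urlparse(self.path)
        if url.path != "/search":
            return super().do_GET()  # fall back to serving the dump itself
        query = urllib.parse.parse_qs(url.query).get("q", [""])[0]
        body = "\n".join(search(INDEX, query)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # e.g. http://server:8000/search?q=river+poland from any thin client
    http.server.HTTPServer(("", 8000), SearchHandler).serve_forever()
```

The index lives in memory on the powerful server; the old thin clients only need a browser.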

Categories

A related problem is how to organise all these thousands of articles. They
are mushed together in a big web of information, but where is the structure?
Wikipedia proper has categories - useful for grouping similar articles, but the
arbitrariness of the invented categories means it has been a problem
incorporating them into the static dump. Martin battled for days to make a
river in Poland appear 'automatically' and conveniently under a Poland
hierarchy.

Metadata

In computer jargon, this is called metadata. It is structure beyond
the mere linking of articles. There is other metadata - like the
Importance assessment scale. We need to extract all that metadata and
place it alongside the article tree so it can be used for indexes.
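As an illustration, metadata records might sit alongside the article tree like this - the field names here are my own invention, not an agreed format - and indexes can then be generated from them mechanically:

```python
import json
from collections import defaultdict

# Hypothetical per-article metadata records, as they might be stored
# in a metadata.json file alongside the article tree.
metadata = [
    {"title": "Vistula", "file": "vistula.html",
     "quality": "Featured Article", "importance": "Top",
     "categories": ["Rivers", "Poland"]},
    {"title": "Rugby union", "file": "rugby_union.html",
     "quality": "B", "importance": "Mid",
     "categories": ["Sport"]},
]

def index_by(records, field):
    """Group article titles under each value of a metadata field,
    so that index pages (by importance, by category) can be generated."""
    groups = defaultdict(list)
    for rec in records:
        values = rec[field]
        if not isinstance(values, list):
            values = [values]
        for value in values:
            groups[value].append(rec["title"])
    return dict(groups)

print(json.dumps(index_by(metadata, "categories"), indent=2))
```

With records like these, the river in Poland lands under the Poland heading without anyone battling for days.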

On the 24th of November we had an IRC meeting - an online chat between all the
interested parties spread around the world. Much of this was discussed, and
one thing became clear - the Wikimedia Foundation needs to concentrate more on
the process of generating a release, rather than on end products like Okawix.
That means tools to work with the metadata, and tools to package the pictures
and article references in such a way that they can be optional. Perhaps
targeted article collections, like Mathematics, Chemistry, Africa, Oceans. Let
other organisations do the work of packaging and marketing.

To allow computers to do the work - we need good metadata. Assess articles.
Rugby articles are not Top importance, except in the context of sport.

I think the article collection should consist of a number of different
pieces, to be incorporated as necessary:

- The text of the selected articles
- Pictures for those articles
- Metadata to support this collection
- A text search index, like one created by Namazu, for those tools that can use it

Future efforts

Though a lot of effort on offline Wikipedia collections is targeted at
schools and the Third World, there are other target markets. One we have not
really addressed yet is the cellphone as a Wikipedia platform. A cellphone
implies connectivity, but these days it is becoming a universal platform - a
camera, a music player, a gaming box, a GPS. Personally, I would like a
text-only Wikipedia collection of lots of articles, but only the lede paragraph
- the first section of a Wikipedia page that introduces the subject. It is a
song by the Black Eyed Peas. It is a river in Poland. That way, I can carry all
of this on my phone without paying airtime.
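Extracting just the lede from an HTML dump can be sketched crudely - this assumes the lede is the first paragraph element in the article, which a real extractor would have to verify against the actual dump format:

```python
import re

def lede(html):
    """Return the plain text of the first paragraph of an article's HTML.
    A crude sketch - a real extractor would parse the markup properly."""
    match = re.search(r"<p>(.*?)</p>", html, re.DOTALL | re.IGNORECASE)
    if not match:
        return ""
    text = re.sub(r"<[^>]+>", "", match.group(1))   # drop inline tags
    return re.sub(r"\s+", " ", text).strip()        # collapse whitespace

article = ("<h1>Wisla</h1>"
           "<p>The <b>Vistula</b> is a river\nin Poland.</p>"
           "<p>More detail that the phone edition would drop...</p>")
print(lede(article))  # -> The Vistula is a river in Poland.
```

Run over the whole collection, this would shrink each article to its opening summary - the part that answers "what is this?" without the airtime cost.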

Cellphones

Cellphones have huge penetration in the Third World. I tell tourists I take
to the townships that South Africans spend their money on cellphones and hair.
Maybe we should concentrate there as much as on schools?