Oracle Blog

Blog for migi

Wikipedia for indexers testing

Mikkel from the tracker mailing list had a nice idea: dump Wikipedia to text files for testing indexers. See: http://mail.gnome.org/archives/tracker-list/2007-January/msg00180.html
So I've created a small application, called JWikiTextGetter, that can be run from the command line as well as with a Swing GUI.
To run it, just grab the binaries from HERE.
Unpack them and run with JRE 1.5 or newer as follows:
java -jar JWikiTextGetter.jar --gui
or:
java -jar JWikiTextGetter.jar --help
This is a quickly written application, so don't expect too much! It just does what it should. The quality of the text extracted from the wiki depends on the htmlparser library. Each file (for en.wikipedia) contains 59 lines that should perhaps be deleted, but I left them :-)
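To give a rough idea of what the extraction step involves: JWikiTextGetter delegates the real work to the htmlparser library, but a crude stdlib-only sketch of turning a fetched Wikipedia page into plain text could look like this (the class and method names below are hypothetical illustrations, not the application's actual code, and a regex-based stripper handles entities and malformed markup far worse than htmlparser does):

```java
// Hypothetical sketch of HTML-to-text extraction; the real application
// uses the htmlparser library instead of regular expressions.
public class WikiTextSketch {

    // Drop script/style blocks, strip all remaining tags,
    // then collapse runs of whitespace into single spaces.
    public static String stripTags(String html) {
        String text = html
                .replaceAll("(?s)<script.*?</script>", " ")
                .replaceAll("(?s)<style.*?</style>", " ")
                .replaceAll("<[^>]+>", " ");
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String sample =
            "<html><body><h1>Title</h1><p>Some <b>article</b> text.</p></body></html>";
        System.out.println(stripTags(sample));  // prints: Title Some article text.
    }
}
```

In the real tool, swapping this for htmlparser is what makes the output quality depend on that library, as noted above.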
There are configuration files for each wiki, because the different localized Wikipedias have different URLs. If you want to create your own, go to the wikipedias folder in the main application folder and write it following the two examples.
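For illustration only, such a per-wiki config might look something like the fragment below. The keys and file name here are hypothetical guesses (the actual format is defined by the two example files shipped in the wikipedias folder); the point is just that each localized Wikipedia needs its own base URL, since even special pages like "random article" are named differently per language:

```
# wikipedias/de.conf -- hypothetical example, check the bundled examples
# for the real key names and format.
name=de
baseUrl=http://de.wikipedia.org/wiki/
randomPage=Spezial:Zuf%C3%A4llige_Seite
```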
UPDATE:
According to the Wikipedia page below, using web crawlers to download large numbers of articles is not welcome:
http://en.wikipedia.org/wiki/Wikipedia:Database_download#Please_do_not_use_a_web_crawler

Rather than downloading text from Wikipedia, I once searched for some stuff with Google and downloaded the whole result set (all PDF documents, about 700 MB in total) --- nobody complained :) - might that be an option?

I know that the wiki administrators ask people not to use crawlers; that is why I added the update to my weblog. But on the other hand, downloading the whole wiki dumps just to get a few pages can make other people (Internet providers, other users on the same subnet) unhappy, and it is even worse if we want a few documents in each language. So in my opinion, downloading 50 pages per language, which is a one-time process, is the lesser evil compared to downloading whole dumps.

There is another issue, a legal one. Getting documents from a Google search doesn't tell you anything about the license of those documents. Probably nobody will ever notice that someone is using such a dataset, but the one I am making is also used inside my company, and I have to be aware of the licenses.