WikiMiner

WikiMiner is a search engine dedicated to the DVD edition of Wikipedia. It was created for the DVD edition of the Polish Wikipedia (235,000 articles), but can easily be localized to any other language version. So far two language files have been created: Polish and English. The program is now being tested under various operating systems, and some minor changes are being implemented. The program's original name (WikiBrowser) was changed because of a conflict with another Wikimedia project, WikiBrowse.

Main features of the application:

The program is a standalone Java application. It requires the Java Runtime Environment (JRE) version 1.5 or higher. The index can be placed on a hard drive (fast search) or on the DVD.

It supports case-insensitive searching.

Boolean phrases are supported.

Search result entries include the page title, the number of occurrences of the searched keywords, Wikipedia categories and excerpts from the article content.

The index is meant to be installed on the user's hard drive, so the DVD is not used until the user clicks on a link to an article.

The resulting index for the whole Polish Wikipedia (2 GB of text) takes 120 MB.

In minimum installation mode, only the Java Runtime Environment has to be installed on the hard drive.

Only part of the index is loaded into memory.

Searching is fast once the index has been loaded at program startup.

You can search in Japanese as well as in French or Polish: full Unicode is supported.

Grammatical suffixes can be specified and are then cut during indexing and searching.

Command-line mode, stopwords and redirects are also supported.

Alphabetical sorting uses a simplified UCA (Unicode Collation Algorithm), which respects the order of non-ASCII characters.

The index is created from HTML pages, which have to be formatted in a specific way (UTF-8 encoding, article text starting with <p> and ending with <div id='footer'>, etc.).

The program is released under the GNU GPL license.

It is independent of the operating system (tested and working at least under Windows and Linux).

Unlike some GNU tools such as Regain, no web server is installed on the client's hard drive, the program does not raise a security alert on Windows XP, and it requires no special rights for any applet or application. Standard security settings are sufficient.

Search results are written to a temporary HTML file, and then the default HTML browser is launched (or another one, depending on the program configuration). When opening the result page, the program checks whether the Wikipedia DVD with the article base is present, and if not, shows an appropriate warning. Temporary files are removed when the program exits.
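The temporary result page described above can be sketched as follows. This is only an illustration, not the program's actual code: the class name ResultPageSketch and its methods are hypothetical, and the sketch uses Java 7+ features for brevity although the program itself targets JRE 1.5 and launches the browser through a modified BrowserLauncher class.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: render result titles as HTML and store them in a temp file.
class ResultPageSketch {

    /** Escapes the characters that would otherwise be read as HTML markup. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Builds a minimal UTF-8 HTML page with one list item per result title. */
    static String renderResultPage(List<String> titles) {
        StringBuilder html = new StringBuilder();
        html.append("<html><head><meta charset=\"UTF-8\"></head><body><ol>\n");
        for (String title : titles) {
            html.append("<li>").append(escape(title)).append("</li>\n");
        }
        html.append("</ol></body></html>\n");
        return html.toString();
    }

    /** Writes the page to a temporary file that the JVM deletes on normal exit. */
    static File writeResultPage(List<String> titles) throws IOException {
        File page = File.createTempFile("wikiminer-results", ".html");
        page.deleteOnExit();
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream(page), StandardCharsets.UTF_8)) {
            out.write(renderResultPage(titles));
        }
        return page;
    }

    public static void main(String[] args) throws IOException {
        File page = writeResultPage(Arrays.asList("Warszawa", "Tom & Jerry"));
        System.out.println("Results written to " + page.toURI());
        // On Java 6+ the page could then be opened with
        // java.awt.Desktop.getDesktop().browse(page.toURI());
    }
}
```

Registering the file with deleteOnExit is one simple way to get the "removed when the program exits" behaviour; the real program may clean up differently.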

Snapshot of the search panel in the English version of the interface:

A - expression to be found

B - starts searching

C - link to DVD article #1 (for example to the main page of the project)

D - link to DVD article #2 (for example to the help page)

E - database size

F - maximum number of results on a single HTML page

G - number of the first result on the output page

H - previous page (subtracts F from G) and starts searching

J - next page (adds F to G) and starts searching

The program searches for words in all articles in the main Wikipedia namespace (ns=0). Grammatical suffixes (like -s in English, or about 30 suffixes in Polish) can be cut. The list of suffixes is configurable.
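A minimal sketch of configurable suffix cutting, assuming the longest matching suffix is removed and a minimum stem length is kept. Both rules, the class name SuffixStripperSketch and the sample Polish suffixes are assumptions for illustration; the program's actual rules and suffix list may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: cut one configured grammatical suffix from a word.
class SuffixStripperSketch {
    private final List<String> suffixes;
    private final int minStemLength;

    SuffixStripperSketch(List<String> suffixes, int minStemLength) {
        this.suffixes = new ArrayList<String>(suffixes);
        this.minStemLength = minStemLength;
    }

    /** Removes the longest configured suffix that leaves a long enough stem. */
    String strip(String word) {
        String best = "";
        for (String suffix : suffixes) {
            boolean longer = suffix.length() > best.length();
            boolean stemOk = word.length() - suffix.length() >= minStemLength;
            if (longer && stemOk && word.endsWith(suffix)) {
                best = suffix;
            }
        }
        return word.substring(0, word.length() - best.length());
    }

    public static void main(String[] args) {
        SuffixStripperSketch english = new SuffixStripperSketch(Arrays.asList("s"), 3);
        System.out.println(english.strip("articles"));

        // Illustrative subset only; the real Polish list has about 30 entries.
        SuffixStripperSketch polish =
                new SuffixStripperSketch(Arrays.asList("ami", "ach", "om"), 3);
        System.out.println(polish.strip("domami"));
    }
}
```

Applying the same function to every word at indexing time and to every query word at search time makes different grammatical forms meet at the same stem.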

Search is case-insensitive. All non-ASCII Unicode characters that resemble Latin characters, like ą, Ü and about 500 other letters, can also be typed as their nearest ASCII equivalent. Standard transliteration of German and Dutch letters (Ü=ue, etc.) is also supported.
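One way to obtain this kind of folding is sketched below, assuming Unicode decomposition followed by removal of combining marks, plus a small table for letters that do not decompose. The class name AsciiFoldSketch and the table contents are hypothetical, java.text.Normalizer requires Java 6 or later, and the program may instead use its own mapping table. The German and Dutch transliteration (Ü=ue) is not covered here.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: fold text to lower-case ASCII-like form for matching.
class AsciiFoldSketch {
    // Letters without a canonical decomposition need explicit entries.
    // Illustrative subset only.
    private static final Map<Character, String> EXTRA = new HashMap<Character, String>();
    static {
        EXTRA.put('\u0142', "l");   // Polish l with stroke
        EXTRA.put('\u00f8', "o");   // o with stroke
        EXTRA.put('\u00df', "ss");  // German sharp s
    }

    /** Lower-cases the text and replaces accented letters by their base letters. */
    static String fold(String text) {
        String lower = text.toLowerCase(Locale.ROOT);
        // NFD splits e.g. "a with ogonek" into "a" + a combining ogonek mark.
        String decomposed = Normalizer.normalize(lower, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < decomposed.length(); i++) {
            char c = decomposed.charAt(i);
            if (Character.getType(c) == Character.NON_SPACING_MARK) {
                continue; // drop the combining accent
            }
            String mapped = EXTRA.get(c);
            out.append(mapped != null ? mapped : String.valueOf(c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u0141\u00f3d\u017a is the city name Lodz written with Polish letters.
        System.out.println(fold("\u0141\u00f3d\u017a"));
    }
}
```

If both indexed words and query words pass through the same fold function, a query typed in plain ASCII matches the accented form in the article.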

When searching for a sequence of words, the and operator is assumed by default, and the program finds all articles that contain all the required words (in any order).

The resulting list can be navigated using hotkeys, which is especially important for blind users. On Windows, Alt+1 jumps to the first result, Alt+2 to the second one, etc.

Operators, unless grouped by parentheses, are evaluated in the following order (from first to last):

title:, categ:

not

and

or

These keywords can also be translated into other languages with no programming required. For example, in the Polish Wikipedia we were able to use kateg, tytuł, i, lub and nie alongside categ, title, and, or and not.
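The precedence described above (not before and before or, with and assumed between adjacent words) can be illustrated with a small recursive-descent evaluator. This is a sketch under assumptions, not the program's parser: the class name QuerySketch is hypothetical, a query is evaluated against the word set of a single article, only the English keywords are recognized, and the title: and categ: prefixes are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: evaluate a boolean query against one article's words.
class QuerySketch {
    private final List<String> tokens;
    private final Set<String> words;
    private int pos = 0;

    private QuerySketch(List<String> tokens, Set<String> words) {
        this.tokens = tokens;
        this.words = words;
    }

    /** Returns true if the article's word set satisfies the query. */
    static boolean matches(String query, Set<String> articleWords) {
        String spaced = query.toLowerCase(Locale.ROOT)
                .replace("(", " ( ").replace(")", " ) ").trim();
        List<String> tokens = new ArrayList<String>(Arrays.asList(spaced.split("\\s+")));
        QuerySketch parser = new QuerySketch(tokens, articleWords);
        boolean result = parser.parseOr();
        if (parser.pos != tokens.size()) {
            throw new IllegalArgumentException("Unexpected token: " + tokens.get(parser.pos));
        }
        return result;
    }

    private String peek() {
        return pos < tokens.size() ? tokens.get(pos) : null;
    }

    // Lowest precedence: or
    private boolean parseOr() {
        boolean value = parseAnd();
        while ("or".equals(peek())) {
            pos++;
            boolean right = parseAnd(); // always parse, so tokens are consumed
            value = value || right;
        }
        return value;
    }

    // Middle precedence: explicit "and", or two operands simply placed side by side
    private boolean parseAnd() {
        boolean value = parseNot();
        while (peek() != null && !"or".equals(peek()) && !")".equals(peek())) {
            if ("and".equals(peek())) {
                pos++;
            }
            boolean right = parseNot();
            value = value && right;
        }
        return value;
    }

    // Highest precedence of the three: not
    private boolean parseNot() {
        if ("not".equals(peek())) {
            pos++;
            return !parseNot();
        }
        return parsePrimary();
    }

    // A parenthesized sub-query or a single word
    private boolean parsePrimary() {
        String token = peek();
        if (token == null) {
            throw new IllegalArgumentException("Unexpected end of query");
        }
        pos++;
        if ("(".equals(token)) {
            boolean value = parseOr();
            if (!")".equals(peek())) {
                throw new IllegalArgumentException("Missing ')'");
            }
            pos++;
            return value;
        }
        return words.contains(token);
    }

    public static void main(String[] args) {
        Set<String> article = new HashSet<String>(Arrays.asList("cat", "dog"));
        System.out.println(matches("cat or bird and fish", article));
        System.out.println(matches("(cat or bird) and fish", article));
    }
}
```

Supporting translated keywords would only require comparing tokens against a configurable list of names for each operator instead of the fixed strings used here.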

To launch the HTML browser and show the search results, the program uses a modified BrowserLauncher class. Its copyright notice:

This code is Copyright 1999-2001 by Eric Albert (ejalbert at cs.stanford.edu) and may be
redistributed or modified in any form without restrictions as long as the portion of this
comment from this paragraph through the end of the comment is not removed. The author
requests that he be notified of any application, applet, or other binary that makes use of
this code, but that's more out of curiosity than anything and is not required. This software
includes no warranty. The author is not repsonsible for any loss of data or functionality
or any adverse or unexpected effects of using this software.
Credits:
Steven Spencer, JavaWorld magazine (Java Tip 66)
Thanks also to Ron B. Yeh, Eric Shapiro, Ben Engber, Paul Teitlebaum, Andrea Cantatore,
Larry Barowski, Trevor Bedzek, Frank Miedrich, and Ron Rabakukk
@author Eric Albert (ejalbert at cs.stanford.edu)
@version 1.4b1 (Released June 20, 2001)

The Java sources of the program are included in its jar file. You can open it with, for example, WinRAR.