Hook into Wikipedia using Java and the MediaWiki API

The MediaWiki API makes it possible for web developers to access, search and integrate all Wikipedia content into their applications.
Given that Wikipedia is the ultimate online encyclopedia, there are dozens of use cases in which this might be useful.

I used to post a lot of articles on this blog about using the web service APIs of third-party sites. This is another post like that.
This post describes how to use the Java Wikipedia API to fetch and format the contents of a Wikipedia article.

The Wikipedia API

The Wikipedia API makes it possible to interact with Wikipedia/MediaWiki through a web service instead of the normal browser-based web interface.

A request to this URL will return an XML document which includes the current wiki markup for the page titled “Web service”. As the request parameters indicate, these requests are highly configurable. For example, formats other than XML, such as JSON, are possible. For a full list of available parameters, visit http://en.wikipedia.org/w/api.php.
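To give an idea of what such a request looks like, here is a small sketch that assembles a query URL for the wiki markup of a page. The parameter set (action=query, prop=revisions, rvprop=content) is my assumption about a typical content request, and the class and method names are made up for this illustration. Only the endpoint comes from the text above.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class ApiUrlExample {
    // Endpoint mentioned in the text.
    static final String ENDPOINT = "http://en.wikipedia.org/w/api.php";

    // Builds a URL asking for the current wiki markup of one page.
    // The query parameters are an assumed, typical combination.
    static String contentQueryUrl(String title, String format) {
        String encodedTitle = URLEncoder.encode(title, StandardCharsets.UTF_8);
        return ENDPOINT
                + "?action=query&prop=revisions&rvprop=content"
                + "&titles=" + encodedTitle
                + "&format=" + format;
    }

    public static void main(String[] args) {
        System.out.println(contentQueryUrl("Web service", "xml"));
        System.out.println(contentQueryUrl("Web service", "json"));
    }
}
```

Switching the last argument is all it takes to ask for another output format.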

Basic example: getting the contents of an article

At the time of writing, the basic usage example given in the library's documentation does not compile with the current version of the library.
Therefore, we will start with a basic usage example that is not listed there, and then extend it.

We are going to list the code to fetch the content of the “Web service” page and render it as HTML. Note that to get a specific page, you need to know its title.
If the page does not exist, a result with one empty page will be returned.
For ambiguous titles, the disambiguation page will be returned too, so even if you get a non-empty result, you still need to check it thoroughly.
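The checks described above could look roughly like the sketch below. It works on the raw wiki markup as a plain string and does not use the library at all. The disambiguation test is only a heuristic: it assumes the page carries a template whose name starts with "disambig", which will not cover every case. All names are invented for this example.

```java
import java.util.Locale;

class PageCheckExample {
    enum Kind { MISSING, DISAMBIGUATION, ARTICLE }

    // Classifies the raw wiki markup that came back for a title.
    static Kind classify(String wikiText) {
        if (wikiText == null || wikiText.trim().isEmpty()) {
            return Kind.MISSING; // empty page: the title does not exist
        }
        String lower = wikiText.toLowerCase(Locale.ROOT);
        // Heuristic (assumption): disambiguation pages carry a
        // {{disambiguation}} or {{disambig}} style template.
        if (lower.contains("{{disambig")) {
            return Kind.DISAMBIGUATION;
        }
        return Kind.ARTICLE;
    }

    public static void main(String[] args) {
        System.out.println(classify(""));
        System.out.println(classify("'''Mercury''' may refer to:\n{{Disambiguation}}"));
        System.out.println(classify("A '''web service''' is a service offered over a network."));
    }
}
```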

We instantiate a user on the English Wikipedia endpoint. Since we are only going to read, we can log in anonymously.
We query the English Wikipedia for the specified titles and get one page as a result in the listOfPages variable.
We then instantiate a WikiModel. This class renders the HTML, and its constructor parameters, imageBaseUrl and linkBaseUrl, determine where the rendered images and links will point to. For example, if you want these to point to local files, you would supply a local path. In the example, I made them completely relative. In the official documentation, these are “http://www.mywiki.com/wiki/${image}” and “http://www.mywiki.com/wiki/${title}”, which you would use if you were hosting a Wikipedia copy at http://www.mywiki.com/wiki/.
We then render the page as HTML and print it to the console.
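To make the role of the two base URL patterns a bit more concrete, here is a small standalone sketch of how a ${title} or ${image} placeholder can be filled in. This is not the library's actual implementation, just an illustration of the idea. The class and method names are hypothetical, and replacing spaces with underscores is my assumption about how page names end up in links.

```java
class BaseUrlExample {
    // Fills the ${placeholder} in a base URL pattern with a page or image name.
    static String expand(String baseUrl, String placeholder, String name) {
        // Assumption: page names in links use underscores instead of spaces.
        String target = name.trim().replace(' ', '_');
        return baseUrl.replace("${" + placeholder + "}", target);
    }

    public static void main(String[] args) {
        // Pattern from the official documentation, as quoted above.
        System.out.println(expand("http://www.mywiki.com/wiki/${title}", "title", "Web service"));
        // Completely relative pattern.
        System.out.println(expand("${image}", "image", "Logo.png"));
    }
}
```

With a completely relative pattern, the generated link is just the page or image name, so it resolves against whatever location the HTML is served from.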

The rendered output is very rudimentary and far from complete, though:

By default, all markup is rendered. However, you might need to leave certain parts out or modify the content a bit before it is displayed for your particular use case.

Handling magic variables

Most magic words are not supported by the Java Wikipedia API.
We need to implement their rendering ourselves.
If you want to do some advanced converting of the Wikipedia content, such as handling these magic words, you need to extend the WikiModel class. More info about this is at http://code.google.com/p/gwtwiki/wiki/Mediawiki2HTML.

The getRawWikiContent method overridden in the above MyWikiModel code returns null for most magic words in its default implementation. A magic word such as {{InfoBox}} would pass through this code with the default namespace=“Template” and articleName=“InfoBox”. If null is returned, the magic word is output in the rendered HTML as is (so for {{InfoBox}}, this would be {{InfoBox}}). As a result, the HTML is full of these unreadable tags, which makes it look anything but clean.
What we do in the above code to solve this is return “” instead of null, so the magic word does not get rendered at all.
Nothing is stopping you from returning something else though.
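The difference between returning null and returning “” can be mimicked with a few lines of plain Java. The sketch below is a stand-in, not the library's code: it only handles simple, non-nested {{...}} constructs, and the resolver function plays the role that getRawWikiContent plays in the real model. All names are made up for the illustration.

```java
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MagicWordExample {
    // Matches simple, non-nested {{Name}} or {{Name|args}} constructs.
    private static final Pattern TEMPLATE =
            Pattern.compile("\\{\\{([^{}|]+)(\\|[^{}]*)?\\}\\}");

    static String expand(String wikiText, Function<String, String> resolver) {
        Matcher m = TEMPLATE.matcher(wikiText);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(wikiText, last, m.start());
            String replacement = resolver.apply(m.group(1).trim());
            // null means "unknown": the original markup is kept as is.
            out.append(replacement == null ? m.group() : replacement);
            last = m.end();
        }
        out.append(wikiText, last, wikiText.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "Intro {{InfoBox}} body";
        System.out.println(expand(text, name -> null)); // markup stays in the text
        System.out.println(expand(text, name -> ""));   // markup is dropped
    }
}
```

Returning something other than “” from the resolver is how you would substitute your own rendering for a given magic word.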

Controlling the rendering of the HTML by implementing an ITextConverter

For my particular use case, I also did not want to render any HTML links, references or images. The WikiModel class does not implement support for leaving out these things. However, the overloaded render method of WikiModel can take an ITextConverter object as an argument. This is the object that is responsible for converting the parsed nodes to HTML (or another format, like PDF or plain text). The default ITextConverter, used when none is specified as an argument, is HTMLConverter, with its property noLinks set to false by default.
However, there is an HTMLConverter constructor which sets the noLinks boolean. By passing true to this constructor, no links will be rendered. Their content will be rendered as plain text instead.
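The effect of dropping links while keeping their text can be approximated on the raw wiki markup with a regular expression. This is only a rough illustration of the outcome, not how HTMLConverter works internally (the converter operates on parsed nodes at render time). The class name is hypothetical, and the pattern only covers plain [[Target]] and [[Target|label]] links.

```java
import java.util.regex.Pattern;

class LinkStripExample {
    // [[Target]] or [[Target|label]]; group 1 is the visible text.
    private static final Pattern WIKI_LINK =
            Pattern.compile("\\[\\[(?:[^\\]|]*\\|)?([^\\]]*)\\]\\]");

    // Replaces every wiki link by its visible text.
    static String unlink(String wikiText) {
        return WIKI_LINK.matcher(wikiText).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(unlink("A [[Web service|service]] uses [[XML]]."));
    }
}
```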
Since I still had to leave out the reference and image elements, I still ended up subclassing the HTMLConverter.
First, I made a more extensible version of it:

The functionality of the above HTMLConverter is almost the same as the original one, but the code is divided into more methods, for easier overriding, and a noImages boolean is added as well, which leaves out all the images at render time if set to true.

If the converter encounters a “ref” HTML tag, it does not render that tag or its content. This results in no references getting rendered at all.
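As a text-level approximation of the same idea, the sketch below removes ref elements from a string with a regular expression. The converter described above skips these nodes while rendering; this standalone version, with a made-up class name, just shows the intended result.

```java
import java.util.regex.Pattern;

class RefStripExample {
    // Either a self-closing <ref ... /> or a full <ref ...>...</ref> element.
    private static final Pattern REF = Pattern.compile(
            "<ref\\b[^>]*/>|<ref\\b[^>]*>.*?</ref>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Removes all ref elements, including their content.
    static String stripRefs(String text) {
        return REF.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String in = "SOAP<ref name=\"w3c\">W3C note</ref> is a protocol<ref name=\"w3c\" />.";
        System.out.println(stripRefs(in));
    }
}
```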
I also changed the rendering of the content a bit. Returning “” for the magic words (see above) might leave “(, ” or “()” in the text, and the line that replaces these cleans up the rendered HTML.
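A cleanup of this kind can be sketched as two regular expression replacements. The exact patterns are my own guess at what needs tidying, and the class and method names are invented for this example.

```java
class CleanupExample {
    // Tidies up parentheses that lost part of their content.
    static String tidyParentheses(String html) {
        return html
                // "(, text" or "(; text" becomes "(text"
                .replaceAll("\\(\\s*[,;]\\s*", "(")
                // empty "()" is removed, together with the space before it
                .replaceAll("\\s*\\(\\s*\\)", "");
    }

    public static void main(String[] args) {
        System.out.println(tidyParentheses("Rome (, Italian: Roma) is () a city"));
    }
}
```

The order matters: fixing the leading comma first means that a leftover such as “(, )” is reduced to “()” and then removed by the second replacement.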

The end result

We now have a properly formatted, completely offline article, with all its external links, images and references stripped:

Comments

5-4-12

Pietro says:

I am having some difficulty importing the info.bliki.wiki library into Eclipse. I tried to import the zip archive into a new empty project, but when I create a new class I get an “import” error. I also tried adding the same zip archive to the build path, but ran into the same error.

Can you help me by giving a simple guide on how to use (import) the Bliki engine with (in) Eclipse?

Everything worked pretty well for me, but I’m having trouble accessing some Wikipedia articles. It works OK, for example, with Rome, but I can’t get it to access St. Marys’s College of Maryland. Is there any way to get the correct title or to pass the link directly to the API?

I must say your explanation is very clear, and I always enjoy such concrete examples because they’re easy to implement, but for some reason this one won’t work for me.

There are basically two scenarios:
- In the first, I try this piece of code with the links you provided (so the MediaWiki API page). I then get a connection timeout. Most likely this has something to do with the proxy settings; I’ll have to ask IT about that.
- In the second scenario I try to access a local Wiki and get information from the Main Page. The problem is that no matter how many input Title Strings I give, I always get back 2 results and both have all “null” values.

Do these problems sound familiar to you? Is there any way around them that you know of?

Hello,
thank you for your code, it’s very helpful.
I am looking for a way to get a list of the pages that contain the words of the query, and not only the page whose title is the same as the query (I mean, I need to do a full-text search in Wikipedia). Do you have any idea how to do this?
Thank you

The examples are great. I’m trying to figure out how I can extract the XML of an article (instead of HTML as in your example). I want to do this without the use of an XML dump. I want to obtain the XML for each article query. Can bliki do this?

I do not want to use Maven; my project runs very slowly when using Maven.
I went to the website, and on the download page all projects are source only.
Can you provide the exact link to get the jar files needed to run this?