We are pleased to announce the first release of the Wiki2Tei software. Wiki2Tei is a converter from the MediaWiki format to XML (TEI vocabulary).

The MediaWiki format is used by the Wikimedia Foundation wikis (Wikipedia, Wikibooks, Wikisource) and by many other wikis running the MediaWiki software. Large amounts of free, high-quality structured text are available in this format, and these texts are used more and more often in NLP (natural language processing) projects. However, the MediaWiki parser is oriented towards rendering, and the MediaWiki syntax is complex and hard to parse.

The Wiki2Tei converter makes the information contained in the wiki syntax (structure, highlighting, etc.) available, and allows the plain text to be retrieved properly. The conversion is intended to preserve all the properties of the original text. Wiki2Tei is closely coupled with the MediaWiki software, which allows it to convert all the features of the MediaWiki syntax.
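To illustrate why an XML representation makes plain text and highlighting easy to retrieve, here is a minimal sketch in Python. The TEI fragment below is hypothetical and hand-written for this example; the actual Wiki2Tei output may use different elements and attributes, and the use of the TEI P5 namespace is an assumption. The function names are illustrative only and are not part of Wiki2Tei.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI fragment, NOT actual Wiki2Tei output.
# Assumption: the document uses the TEI P5 namespace.
TEI_SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div>
        <head>Example</head>
        <p>Some <hi rend="bold">highlighted</hi> text.</p>
      </div>
    </body>
  </text>
</TEI>"""

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def plain_paragraphs(tei_xml):
    """Return the plain text of each <p>, with inline markup removed."""
    root = ET.fromstring(tei_xml)
    paragraphs = []
    for p in root.iterfind(".//tei:p", TEI_NS):
        # itertext() walks the paragraph and yields text from nested elements too.
        text = "".join(p.itertext())
        # Normalize runs of whitespace to single spaces.
        paragraphs.append(" ".join(text.split()))
    return paragraphs


def highlighted_spans(tei_xml):
    """Return (rend, text) pairs for every <hi> element."""
    root = ET.fromstring(tei_xml)
    return [
        (hi.get("rend"), "".join(hi.itertext()))
        for hi in root.iterfind(".//tei:hi", TEI_NS)
    ]


if __name__ == "__main__":
    print(plain_paragraphs(TEI_SAMPLE))
    print(highlighted_spans(TEI_SAMPLE))
```

Once the text is in XML, both the plain text and the highlighting information can be read with a standard XML parser, without any knowledge of the wiki syntax.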

The Wiki2Tei converter provides a rich set of tools for converting MediaWiki text from several sources (file, MediaWiki database) and for managing collections of files to be converted. The TEI vocabulary used is documented in an ODD document, in accordance with the TEI Guidelines. The code is open source and may be downloaded from the SourceForge download area:

This entry was posted on Wednesday, October 10th, 2007 at 23:21 and is filed under General, Tools.