24 Nov 2009

Wikipedia

Wiktionary

25 Nov 2009

Interview w/Kelson re: OpenZim

OpenZim is a storage-and-retrieval database/compression format that requires an external reader. It works exclusively from HTML output, such as that of htmlDump.
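
A minimal sketch of what "requiring an external reader" looks like in practice, assuming the python-libzim bindings (which postdate this interview); the file name and entry path are hypothetical:

    from libzim.reader import Archive

    archive = Archive("wikipedia_en_all.zim")   # hypothetical dump file

    # Entries are addressed by path; the reader resolves them internally
    # against the compressed archive.
    if archive.has_entry_by_path("A/OpenZIM"):
        entry = archive.get_entry_by_path("A/OpenZIM")
        html = bytes(entry.get_item().content).decode("utf-8")
        print(html[:200])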

Follow-up question regarding redlinks: if the linked files are not part of the repository, how they are handled is up to the reader software. It's possible to import an article subset into a MW install with the noredlinks skin extension, avoiding the issue.
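
Reader software can detect redlinks because MediaWiki's HTML output marks links to missing pages with class="new" and a redlink=1 query parameter. A sketch using only the standard library:

    from html.parser import HTMLParser

    class RedlinkFinder(HTMLParser):
        def __init__(self):
            super().__init__()
            self.redlinks = []

        def handle_starttag(self, tag, attrs):
            a = dict(attrs)
            if tag == "a" and ("new" in (a.get("class") or "").split()
                               or "redlink=1" in (a.get("href") or "")):
                self.redlinks.append(a.get("href"))

    finder = RedlinkFinder()
    finder.feed('<p><a href="/w/index.php?title=Foo&amp;redlink=1" '
                'class="new">Foo</a></p>')
    print(finder.redlinks)   # -> ['/w/index.php?title=Foo&redlink=1']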

"the database dumps require a lot of things which are not 100% possible to recreate to render into usable form"

Example issue: templates are not expanded.

Directly related: the parser has no spec and cannot be run outside the MediaWiki software.

Direct result: templates cannot be expanded outside MediaWiki.

"It would be really nice if dumps were made available with templates already expanded. You never know what a template does until you expand it. and as often as not it uses other templates so you still don't know what it does until you also expand those. It's a recursive process."

Template expansion came up 4 separate times during this interview.

Parser specification: without one, there is no way to build an independent parser that can expand templates.

The output of the MediaWiki parser is unusual, in part because it isn't a true parser: it never fails on malformed input, and it has accreted over time.

"There are quirky heuristics and special cases all through the parser. The ones for french punctuation are a famous one."

"There is no description of the parser so that it can be independently reproduced in other programming languages."

Working with dumps is expensive in terms of computing, networking, and storage.

As an independent developer working from only a netbook, he cannot realistically download a dump and import it into a local MW installation; this constraint led to the development of the FF remote dump reader.

Working remotely is also less than ideal: the toolserver setup does not allow server-intensive tasks, nor is the full content available there.

His current project parses content from en.Wiktionary, extracting every dictionary field or attribute into a database.

This would allow querying on a huge range of variables and words.

It would also allow relational querying (parent/child sections).
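
A sketch of the kind of relational schema this implies: every section becomes a row with a parent pointer, so parent/child queries are plain SQL. Table and column names are illustrative, not Kelson's:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE section (
        id        INTEGER PRIMARY KEY,
        word      TEXT NOT NULL,   -- headword, e.g. 'run'
        heading   TEXT NOT NULL,   -- e.g. 'English', 'Noun', 'Verb'
        parent_id INTEGER REFERENCES section(id)
    );
    """)
    conn.execute("INSERT INTO section VALUES (1, 'run', 'English', NULL)")
    conn.execute("INSERT INTO section VALUES (2, 'run', 'Noun', 1)")
    conn.execute("INSERT INTO section VALUES (3, 'run', 'Verb', 1)")

    # Relational query: all child sections of the English entry for 'run'.
    for (heading,) in conn.execute(
            "SELECT heading FROM section WHERE parent_id = 1"):
        print(heading)   # -> Noun, Verb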

Suggestions

"a parser spec is the #1 item"

"i would dump in more formats."

Got any specific ones?

"one with full HTML but probably without the "wrapping" page. just the generated part as you would see for action=render or action=print"

"and another is flat text that people can use without having to handle either wikitext or HTML, including mediawiki interface elements such as tables of contents and edit links etc"

"but my personal pet wish is for a minimally formatted dump which preserves only generic block level elements and converts inline elements to plain flat text. this would preserve a minimal amount of sentence/paragraph context for applications that want to analyse how language is used"

Note: this latter format would be used to build corpora on which linguistic usage and frequency studies could be based.
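
A rough sketch of how these suggested formats relate, using only the standard library: action=render is an existing index.php mode that already returns the unwrapped article body, and a small filter over that HTML approximates the minimally formatted corpus dump (block boundaries kept, inline markup flattened). The User-Agent string and block-element list are illustrative:

    from html.parser import HTMLParser
    import urllib.parse
    import urllib.request

    BLOCKS = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "li", "blockquote"}

    def render(title, base="https://en.wikipedia.org/w/index.php"):
        # action=render returns just the generated body: no <head>, skin,
        # sidebar, or footer -- the "unwrapped" HTML format above.
        qs = urllib.parse.urlencode({"title": title, "action": "render"})
        req = urllib.request.Request(
            f"{base}?{qs}",
            headers={"User-Agent": "dump-notes-example/0.1"})
        with urllib.request.urlopen(req) as resp:
            return resp.read().decode("utf-8")

    class Flattener(HTMLParser):
        # Keep generic block-level boundaries, flatten inline markup to
        # plain text: the "minimally formatted" corpus-oriented dump.
        def __init__(self):
            super().__init__()
            self.out = []

        def handle_data(self, data):
            self.out.append(data)

        def handle_endtag(self, tag):
            if tag in BLOCKS:
                self.out.append("\n")   # block boundary -> newline

        def flatten(self, html):
            self.feed(html)
            return "".join(self.out)

    print(Flattener().flatten(render("Wiktionary"))[:500])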

(in discussion regarding a dictionary-specific dump of Wiktionary) Make Wiktionary content more regular

Templates are easier to parse than prose text, but are harder for contributors to work with.

Prose text is very hard to get contributors to write in a regular way that is easy to parse.

If you do build a parser to manage one language's templates, you'll need a different parser for each language.
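
A sketch of why templates are the more tractable target: a definition line like "# {{plural of|cat}}" can be matched with one regex, whereas the equivalent prose ("plural form of cat") varies freely between editors, and the template name itself differs on each language's Wiktionary. The template shown is en.Wiktionary's; newer entries also carry a language code:

    import re

    # Matches both {{plural of|cat}} and {{plural of|en|cat}}.
    TEMPLATE = re.compile(r"\{\{plural of\|(?:en\|)?(?P<lemma>[^|}]+)")

    line = "# {{plural of|en|cat}}"
    m = TEMPLATE.search(line)
    if m:
        print("plural of:", m.group("lemma"))   # -> 'cat'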

Wiktionary needs a voice among the developers.

"well the most obvious thing is that nobody who is a developer or sysadmin at WMF is a wiktionarian so the foundation has little idea what we need. when we go to the trouble of learning sql and php and make a mediawiki extension it never gets installed"

"wiktionary needs a voice inside wmf. at least one person who cares to represent us."

"otherwise all we can do is offline processing and javascript extensions or wait patiently and faithfully for a few more years."