Description

As part of trying to figure out the best way of supporting language variants in Parsoid in a more data/table-driven fashion, @cscott explored the use of the Finite State Transducer (FST) formalism, specifically the Foma-based tools.

Note to reviewers: Scott has been working on a Finite-State-Transducer based framework for language variant conversions. You can think of FSTs as finite state machines with bidirectional edges and output in each direction: as the machine processes the input, it emits an output string, so it effectively acts as a converter. He is using an existing FST implementation (Foma) to generate a transition table from the formal schema representing a conversion.
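To make the transition-table idea concrete, here is a minimal sketch of a runner over a hypothetical table format (this is not Foma's actual output format, and `runFst` is an illustrative name, not a function from the patch): each state maps an input symbol to a next state plus an output string, and the machine accumulates output as it consumes input.

```javascript
// Hypothetical transition table: state -> input symbol -> { next, out }.
const table = {
  0: { a: { next: 1, out: 'x' }, b: { next: 0, out: 'y' } },
  1: { a: { next: 0, out: 'x' }, b: { next: 1, out: 'z' } },
};

function runFst(table, input, start = 0) {
  let state = start;
  let out = '';
  for (const sym of input) {
    const edge = table[state] && table[state][sym];
    if (!edge) {
      return null; // no transition on this symbol: input rejected
    }
    out += edge.out;
    state = edge.next;
  }
  return out;
}

console.log(runFst(table, 'ab')); // 'xz'
```

A real runner also has to track final states and handle epsilon transitions, but the core loop is this simple.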

Additionally, instead of using the C binary (which is part of the Foma implementation) to execute the conversion, he has written a tool that compresses the Foma output into JSON, which is then run by a JS runner he has implemented. This second step is optional; we could run this with the C code as well. That tradeoff is something to evaluate too.
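For context, the AT&T-style text format that FST toolkits can export is line-oriented, which makes this kind of conversion straightforward. Below is a hedged sketch only: the field order ("src&lt;TAB&gt;dst&lt;TAB&gt;inSym&lt;TAB&gt;outSym", with lone-number lines marking final states) follows the common AT&T convention, and the `@0@` epsilon marker is an assumption (epsilon spelling varies by tool); the actual converter in this patch may differ in both respects.

```javascript
// Sketch: parse AT&T-style transducer text into a JSON-friendly table.
// Assumptions (may not match the patch): tab-separated fields in
// src/dst/inSym/outSym order; a line with a single number is a final
// state; '@0@' denotes epsilon (emit nothing).
function attToJson(attText) {
  const transitions = {};
  const finals = [];
  for (const line of attText.trim().split('\n')) {
    const fields = line.split('\t');
    if (fields.length === 1) {
      finals.push(Number(fields[0]));
      continue;
    }
    const [src, dst, inSym, outSym] = fields;
    transitions[src] = transitions[src] || {};
    transitions[src][inSym] = {
      next: Number(dst),
      out: outSym === '@0@' ? '' : outSym, // epsilon emits nothing
    };
  }
  return { transitions, finals };
}
```

The resulting object can be serialized with `JSON.stringify` and shipped to any runtime (node.js or PHP) that has a small runner, which is the portability argument made below.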

The hope is that the formalism will make it easier to implement, maintain, and read bi-directional language variant implementations.

See the followup patch that uses this code to implement language variants on a DOM. Effectively, it walks the DOM and translates text nodes; this is where the FSTs are used.
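As an illustration only (this is not the actual followup patch, and `convertTextNodes` is a hypothetical name), the DOM walk amounts to a recursive traversal that rewrites every text node, with the FST-backed conversion standing in as the `convert` callback:

```javascript
// Sketch: recursively walk a DOM tree and run a converter function over
// every text node. `convert` stands in for the FST-backed variant
// conversion; attributes, <script>/<style> contents, etc. would need
// special handling in a real implementation.
function convertTextNodes(node, convert) {
  if (node.nodeType === 3 /* Node.TEXT_NODE */) {
    node.nodeValue = convert(node.nodeValue);
    return;
  }
  for (let child = node.firstChild; child; child = child.nextSibling) {
    convertTextNodes(child, convert);
  }
}
```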

As the piglatin example shows, some conversions aren't as well suited to this approach, though they are still doable. Other variants seem more straightforward.

Note that the ATT and JSON files will let us use these on the PHP side as well; all that needs to be "ported" is the runner (or we use the C implementation) and the classes that hook the FSTs into the parser.

That long note is to give you the context for this patch. Take a look at this approach and leave feedback. This is not a finished implementation yet: zhwiki is still to be completed, and there is also a test runner which we plan to use with the round-trip testing framework. Some of that work is not completed yet.

Really impressive work. I agree with the choice of FSTs for the converter. I am less experienced with the Foma formalism, but have good experience with the SFST formalism. (In fact, the pet project I have been working on for the last year is FST based.)

Can we simplify the foma -> att -> json/js -> converter path in this approach further? The custom code we have written adds a maintenance burden, and it is radically different from the kind of code one usually sees in WMF projects. Specifically, how about having a node-gyp binding for foma and using the generated node module for the whole conversion? Have you considered this path? (I see a python binding in their repo: https://github.com/mhulden/foma/blob/master/foma/python/foma.py) The language converter has historically been a black box of code that only a very small number of people understood; I wonder whether we are improving that situation or creating another black box around the FST definitions and wrapper code.

Here we are not abstracting away knowledge of the att file format, its JSON/JS representation, or its interpretation. These are complex; can we expect the awesomeness of @cscott in future maintainers?

Good thoughts, Santhosh. That is precisely the feedback I was looking for. Scott and I were discussing some of these bits (the custom att -> js converter plus the custom FST runner) earlier.

As you noticed, there are two pieces to this puzzle: (a) the .foma based files for language conversion, and (b) the custom pieces to process them, versus using the Foma-provided runners.

If we went with just (a), I think it improves the language variant situation significantly, since it would be far more maintainable: you still need to learn the FST format, but that is essentially a set of regexps plus some additional features on top, and is not complex.

With (b), yes, it does impose an additional maintenance burden and bus-factor issues. Whether we want to use this custom path is still something to evaluate.

Scott is exploring those custom bits for two reasons: (1) to eliminate the dependency on the C binary, whether in node.js or PHP; the .foma / .att / .json files will be portable across language boundaries; (2) a compressed format and potentially performance (I don't remember our conversation on this point).

Just in case it wasn't clear, I think this (the use of FSTs) is a very promising approach and something that makes sense to adopt. The only unresolved thing for me is the path to integrating it into the codebase and the build/deploy processes, i.e. whether we write custom tools or rely on Foma's tools. I can go either way at this time. We (the Parsing team, as well as the WMF) maintain a lot of packages and tools internally, so custom tools are not a problem by themselves. But the con is that we are adding to the list of things we will be on the hook for. One option would be to get these tools and runners into the upstream foma repo. @cscott is on vacation and might be able to weigh in better when he is back.

One option would be to get these tools and runners into the upstream foma repo

@ssastry, the att->json conversion and processing in JS might not be a general solution for JS binding/integration, because in practice the morphology analysers (Foma's main use case) are very big. The compiled transducers in binary formats grow to many megabytes. For example, http://wiki.apertium.org/wiki/Foma lists some real-world transducers with their compilation times, RAM requirements, and final sizes while explaining the usage; some of them easily go beyond 50 MB. The reason for the large transducer sizes is the inclusion of lexicon subsets. This is something to consider when we think about upstreaming this or making it a general solution.

I'm glad to see that future use in a PHP parser producing Parsoid-style output is being considered. IMO it would be best if such a parser could be written entirely in PHP; as @ssastry mentioned, "all that needs to be 'ported' is the runner and the classes that hook up the FSTs into the parser". Second-best would be if we can distribute standalone "runner" C binaries for common platforms, much as Scribunto does with standalone Lua interpreters. If we get to the point where end users of MediaWiki would commonly be required to download or compile something to make language conversion work, IMO we'd have to make language conversion an optional feature that is off by default. To be clear, it's fine with me if developers have to install Foma in order to edit the definitions and compile them into the att and json files; it's just that using stock MediaWiki to run a wiki shouldn't have such requirements.

Another potential concern is compatibility with edge cases and features of the existing LanguageConverter code in MediaWiki. Or am I wrong in thinking the goal is to handle everything LanguageConverter does?