The Adobe Moses Corpus Tool – And Crossing That Bridge When You Come To It.

This article was originally written in English. Text in other languages was provided by machine translation.

Here is the scenario:

It’s the 1950s. You are the brave leader of an expedition in Nepal: a dozen mountaineers plus a couple hundred porters, all walking deep into the Himalayas in search of an unclimbed summit. The risks of the journey are high, but you will be showered in glory by your nation, ticker tape parade and everything, when you return home successful. Entering a deep valley, you come upon a long and narrow rope bridge which the whole expedition will have to cross. The bridge is too weak to hold more than one person at a time, and it takes 5 minutes for each person to cross.

You can get the first 12 climbers across in an hour.

(12 Climbers x 5 minutes each = 60 minutes) so 1 hour to cross.

But the very last porter won’t make it across until almost 2 days after the first climber starts out.
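To make the arithmetic concrete, here is a small Python sketch. The headcount of 200 porters and the 10 hours of usable crossing time per day are assumptions chosen for illustration; the story only says "a couple hundred" porters and does not say how long the party can keep crossing each day.

```python
MINUTES_PER_CROSSING = 5

def crossing_minutes(people, minutes_each=MINUTES_PER_CROSSING):
    """Minutes until the last person is across a one-at-a-time bridge."""
    return people * minutes_each

climbers = 12
porters = 200        # assumption: "a couple hundred"
daylight_hours = 10  # assumption: crossings only happen in daylight

climbers_only = crossing_minutes(climbers)
everyone = crossing_minutes(climbers + porters)

print(f"Climbers across after {climbers_only} minutes")
print(f"Everyone across after {everyone / 60:.1f} hours of crossing")
print(f"That is {everyone / 60 / daylight_hours:.1f} days at {daylight_hours} hours per day")
```

Under these assumptions the full party needs roughly 17.7 hours on the bridge, which is where the "almost 2 days" comes from once nights are taken into account.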

The success of the entire expedition is at stake. Valuable resources (food, tents, climbing gear, etc.) are going to end up spread all up and down the trail with their respective porters. This means they won’t be arriving at base camp when and where you need them. This is not a good way to get started.

The bridge crossing metaphor used here is a textbook example of encountering the limiting factor in your process chain. No matter how many resources you can bring to bear on the project, there is a choke point. It can take many forms, but identifying and solving this problem will be critical to reaching your goals. It doesn’t matter how fast you proceed through all the other steps of your plan; you are going to lose those 2 days here unless something changes.

Does the narrow rope bridge which will only let one person across at a time sound like an unlikely obstacle to face in your machine translation project? It’s not. When we launched the Adobe Moses MT project last spring, getting across this bridge was the first problem we faced. Why? Quite simply, we had years of translation memory (TM) stored up from Adobe localization projects. All those years of TM were the raw materials to be used in building Adobe-specific engines. We knew that with them we could build better engines for translating Adobe products than we would ever find on the open market. However, the sheer volume of TM that needed to be processed into a Moses-ready corpus represented a blockage of serious proportions.

A quick back-of-the-napkin metric to put this in perspective:

We found, given the existing tooling for corpus work, that it required 1-2 weeks of an engineer’s time to process 5-10 million words of translation from .tmx format into a pair of aligned flat corpus files (i.e. Moses-ready: one segment per line, with line N of the source file matching line N of the target file).
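For readers who have not done this conversion, the core of it can be sketched in a few lines of Python. This is a minimal illustration and not the code inside our tool: the function names `tmx_to_pairs` and `write_corpus` are made up for this example, language codes are assumed to match exactly (for instance "en-US"), and inline markup inside a segment is simply flattened to its text.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def tmx_to_pairs(tmx_path, src_lang, tgt_lang):
    """Yield (source, target) segment pairs from a TMX file."""
    root = ET.parse(tmx_path).getroot()
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            # Older TMX files use a plain "lang" attribute instead of xml:lang.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also returns text nested inside inline elements;
                # split/join collapses newlines and repeated spaces.
                segs[lang] = " ".join("".join(seg.itertext()).split())
        src = segs.get(src_lang.lower())
        tgt = segs.get(tgt_lang.lower())
        if src and tgt:  # skip units missing either side
            yield src, tgt

def write_corpus(pairs, src_out, tgt_out):
    """Write pairs to two line-aligned files and return the pair count."""
    count = 0
    with open(src_out, "w", encoding="utf-8") as fs, \
         open(tgt_out, "w", encoding="utf-8") as ft:
        for src, tgt in pairs:
            fs.write(src + "\n")
            ft.write(tgt + "\n")
            count += 1
    return count
```

A call such as `write_corpus(tmx_to_pairs("ui.tmx", "en-US", "fr-FR"), "corpus.en", "corpus.fr")` (file names hypothetical) would produce the pair of aligned flat files described above. The time sink is not this step on its own, but repeating it and the cleaning that follows across many files and languages.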

Moses does come with a set of support scripts for working these problems (tokenizer.perl, clean-corpus-n.perl, etc.), and they are functional. That said, the effort is time consuming. The scripts are all run from the command line, and a great deal of organization and discipline is required of the user, or the sequence of required steps can quickly get confusing.
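To give a feel for the bookkeeping involved, here is a hedged Python sketch that only assembles the command lines for two of those scripts. The checkout location `mosesdecoder/scripts` and the helper names are assumptions for this example; the tokenizer is named `tokenizer.perl` in current Moses checkouts, and the argument order for `clean-corpus-n.perl` follows its usual usage (corpus stem, source extension, target extension, output stem, minimum and maximum length).

```python
import subprocess
from pathlib import Path

MOSES_SCRIPTS = Path("mosesdecoder/scripts")  # assumption: local checkout path

def tokenize_cmd(lang):
    """Command for the Moses tokenizer, which reads stdin and writes stdout."""
    script = MOSES_SCRIPTS / "tokenizer" / "tokenizer.perl"
    return ["perl", str(script), "-l", lang]

def clean_cmd(corpus_stem, src, tgt, out_stem, min_len=1, max_len=80):
    """Command for the length-based cleaning of an aligned corpus pair."""
    script = MOSES_SCRIPTS / "training" / "clean-corpus-n.perl"
    return ["perl", str(script), corpus_stem, src, tgt, out_stem,
            str(min_len), str(max_len)]

def run_tokenizer(lang, in_path, out_path):
    """Run the tokenizer on one file (requires perl and a Moses checkout)."""
    with open(in_path, "rb") as fin, open(out_path, "wb") as fout:
        subprocess.run(tokenize_cmd(lang), stdin=fin, stdout=fout, check=True)
```

Every language pair needs the tokenizer run once per side and the cleaner run once per pair, each with its own input and output names. Multiply that by a few dozen .tmx files and it is easy to see how the steps get confusing.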

If you have millions of words across multiple languages, as Adobe did, you can see it’s going to take a long time for that one engineer to process those .tmx files. If you add a couple more engineers, you can speed up the process, but the overall time required per unit of .tmx cleaned hasn’t gone down. This would be the equivalent of building a couple more bridges across that chasm in the Himalayas. It speeds things up, but it’s expensive now and doesn’t lower costs in the future.

So if we’ve only got one bridge to cross then the solution is to reduce the time it takes us to cross that bridge.

The Adobe Moses Corpus tool was our solution to this problem. While none of the individual steps in taking a .tmx file to a Moses-ready state is too time consuming, those small steps all add up. We decided to solve the problem once and for all by developing a lightweight, modular, GUI-based AIR app which any user could install and use to process TM files for Moses. What does it do? Quite simply, it lets you automate your corpus cleaning to improve efficiency. It takes the multiple command line options available and lets the user orchestrate them on any .tmx without the worry of calling scripts and passing parameters. How much does it help? While these numbers are loose, we’ve been able to increase the productivity of a single engineer working on corpus cleaning by up to 10x.

We can now do in 2 days what used to take 2 weeks.

When you have millions of words of translation memory, this is a big deal. If you want to do MT for yourself, you will need to solve this problem. For us, the Adobe Moses Corpus tool continues to evolve as we learn more about the cleaning steps we want access to and how to order these steps. It is our vision that it will fit into a larger, more comprehensive package of MT-related tools which may include the automatic testing and tuning of engines. We continue to consider all the possibilities this tool would open up for the wider MT-interested public and are open to ideas and collaborations with others around its improvement and extension.

There are plenty of bridges to cross on the way to building MT systems. Corpus handling is just one of them. Hopefully this knowledge makes your journey a bit clearer. Now get out there and build an engine!

A quick (but by no means complete) list of things that could be done to improve MT engine quality:

This is a short list of the steps the Adobe Moses Corpus tool can currently perform. We are open to suggestions about adding other steps or refining the nature of these steps.

Clean Placeholder Tags

Clean URLs

Tokenize

Lowercase

Clean Numbers

Clean Duplicate Lines

Clean Long Segments

Clean Misaligned Pairs

The efficacy of each of these steps could be debated around the MT round table, but in general most people will need to process their TM files through these steps, both so that the files can be used with Moses for engine building and to improve quality.
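As a rough illustration of what such steps look like when chained together, here is a minimal Python sketch. It is not the implementation inside the Adobe Moses Corpus tool: the regular expressions, the 80-token limit, the 3:1 length ratio used to flag misaligned pairs, the reading of "Clean Numbers" as dropping number-only segments, and the crude regex tokenizer are all assumptions made for the example.

```python
import re

# Assumed placeholder shapes: inline tags like <b> and slots like {0}.
PLACEHOLDER = re.compile(r"<[^>]+>|\{\d+\}")
URL = re.compile(r"(?:https?://|www\.)\S+")
# Crude tokenizer stand-in: runs of word characters, or single punctuation marks.
TOKEN = re.compile(r"\w+|[^\w\s]")
NUMBER_ONLY = re.compile(r"^[\d\s.,:/%+-]+$")

def clean_text(text):
    """Strip placeholders and URLs, then tokenize and lowercase one segment."""
    text = PLACEHOLDER.sub(" ", text)
    text = URL.sub(" ", text)
    return " ".join(TOKEN.findall(text)).lower()

def clean_corpus(pairs, max_tokens=80, max_ratio=3.0):
    """Yield cleaned (source, target) pairs, dropping unusable ones."""
    seen = set()
    for src, tgt in pairs:
        src, tgt = clean_text(src), clean_text(tgt)
        if not src or not tgt:
            continue  # nothing left after cleaning
        if NUMBER_ONLY.match(src) or NUMBER_ONLY.match(tgt):
            continue  # segments made up only of numbers
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if max(n_src, n_tgt) > max_tokens:
            continue  # long segments
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue  # lengths differ too much: likely misaligned
        if (src, tgt) in seen:
            continue  # duplicate lines
        seen.add((src, tgt))
        yield src, tgt

sample = [
    ("Open <b>File</b> {0}", "Ouvrir <b>le fichier</b> {0}"),
    ("Open File", "Ouvrir le fichier"),
    ("Visit http://www.adobe.com now", "Visitez http://www.adobe.com maintenant"),
    ("12 345", "12 345"),
    ("OK", "Ceci est une phrase beaucoup trop longue pour correspondre"),
]
cleaned = list(clean_corpus(sample))
print(cleaned)
```

On the five sample pairs only two survive: the second pair becomes a duplicate of the first once tags and case are removed, the fourth is number-only, and the fifth fails the length-ratio check. The order of the steps matters, which is one reason we want the tool to let users rearrange them.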

Testing the tool – We are considering options for making it available for others to test in the near future. When we find the right solution, it will definitely be announced here.

To clean placeholders tags TM should be parsed twice. – Can you clarify what you mean here? Are you suggesting running the placeholder-removing regex against the same document twice? Or against both languages in the pair?

To clean duplicate lines – These strings are definitely duplicates after the white space delimiters are removed, but I am trying to understand why you would detokenize the strings here. Are you looking to prepare the corpus for a particular MT engine style?

@Jeff – this post is brilliant, thanks for sharing and also for taking the time to use figurative examples to develop a clear idea about the problem in the readers’ minds.
What happened to the tool at the end? Is it maybe integrated in m4loc?