
TheLocustNMI asks: "I'm currently employed as a systems analyst for a mid-sized consulting firm, and we have been charged with the task of finding a multilingual solution for an entire enterprise system. After much study, the question remains: Is there an effective multi-lingual machine translation system out there? Could one be built? Would it be a massive distributed knowledge system, akin to Everything2? Could it be free to the net public? Ideas?"

An even worse problem is ambiguity due to cultural issues. The phrase "cold slice" in the US almost always means leftover pizza. How do you translate something like that into another language for a culture that may not eat pizza, may not eat convenience foods, may not eat leftovers, etc.?

I think a major problem with the undertaking is that (last time I checked) the fundamentals of how human brains deal with semantic/pragmatic issues (i.e. how we process concepts) were far from nailed down. Half the linguistics community thinks Jerry Fodor [rutgers.edu] is out there, and half of them think he has a point, when he says all concepts are innate.

Now whether or not you believe what he says, the fact that there is so little consensus about something as fundamental as what a concept is and how we process them is a bit worrisome. And all of this rests on the assumption that whatever we build in software should mimic how *we* process concepts (yet another open question).

You may wonder what all this has to do with machine translation. Well, one of the difficulties mentioned before is known as the "scope problem" -- i.e. if the computer is to use any semantic knowledge (of the world - i.e. concepts) to sort through ambiguity, where does it begin? How do you create an understanding of the world so that there is an understanding of definitions?

I once wrote a PROLOG based Natural Language Parser in college, and thought it was pretty cool until I realized that it was the biggest can of worms I ever opened...
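To give a feel for how fast that can of worms opens, here's a toy sketch (in Python rather than PROLOG, with a made-up five-rule grammar) that just *counts* the parses of a classically ambiguous sentence -- even this tiny grammar already can't decide who has the telescope:

```python
from collections import defaultdict

# Toy grammar in Chomsky Normal Form -- rules invented for illustration.
# Binary rules: head -> (left, right); terminal rules: word -> heads.
BINARY = {
    "S":  [("NP", "VP")],
    "VP": [("V", "NP"), ("VP", "PP")],
    "NP": [("NP", "PP"), ("Det", "N")],
    "PP": [("P", "NP")],
}
TERMINAL = {
    "I": ["NP"], "saw": ["V"], "the": ["Det"],
    "man": ["N"], "telescope": ["N"], "with": ["P"],
}

def count_parses(words, start="S"):
    """CYK-style chart that counts distinct parse trees for each span."""
    n = len(words)
    # chart[i][j][A] = number of ways to derive words[i:j] from symbol A
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        for head in TERMINAL.get(w, []):
            chart[i][i + 1][head] += 1
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                for head, rules in BINARY.items():
                    for left, right in rules:
                        chart[i][j][head] += chart[i][k][left] * chart[k][j][right]
    return chart[0][n][start]

print(count_parses("I saw the man with the telescope".split()))  # prints 2
```

Two parses for seven words with five rules; real grammars over real sentences make the chart explode, which is exactly where the worms come from.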

the *same* word can have different meanings in different countries that share the same language... For example, in Mexico and most of Latin America, "pollo" means chicken; in Spain, it means "cock," in the less traditional sense. Moreover, how could you translate the word "boy" into Spanish when in Mexico they may call him "chamaco," in El Salvador "cipote," etc.?

That's why I mentioned the Everything2-sorta thing. When it first starts out, the translator will only be as good as current engines are, but if it is developed to allow "learning" -- and by learning I mean the application of grammatical rules, etc., from users -- it could become rather extensive and deep. It would be quite the undertaking, but if enough analysis and design was put behind it, it could be done. Of course, that's like saying we could develop a matter transporter in two years if we had all the scientists and engineers in the world working on it, too... hehe.

The problem with translation systems is grammatical ambiguity. Computers lack the massive "database" of context which we humans have for resolving ambiguities.

How do you translate "plane"? It could be "plane" as in mathematics, or as in flying...

That's just one simple type of ambiguity. When you have a fluid grammar like English -- for which a BNF grammar cannot be made, regardless of how many tokens of lookahead you have -- the lack of context makes translation all but futile.
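The classic workaround for the "plane" problem is to score each dictionary sense against the surrounding words -- a bare-bones version of the Lesk heuristic. A sketch (the senses and glosses below are invented for illustration), which also shows how little context a single sentence actually provides:

```python
# Naive word-sense picker: choose the sense whose gloss shares the most
# words with the sentence. Senses and glosses are made up for this example.
SENSES = {
    "plane": {
        "aircraft": "a winged vehicle that flies through the air",
        "geometry": "a flat surface where every point satisfies a linear equation",
        "tool": "a hand tool for smoothing wood",
    },
}

def choose_sense(word, sentence):
    context = set(sentence.lower().split())
    scored = [(len(context & set(gloss.split())), sense)
              for sense, gloss in SENSES[word].items()]
    return max(scored)[1]  # highest overlap wins; ties broken arbitrarily

print(choose_sense("plane", "the plane banked and began to descend through the air"))
print(choose_sense("plane", "every point on the plane satisfies the equation"))
```

It gets these two sentences right only because the glosses happen to share surface words with them; change a synonym ("aeroplane", "descend" vs. "flies") and the overlap vanishes, which is the lack-of-context problem in miniature.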

Assembling such a database is not a theoretical impossibility. It is, however, a *practical* impossibility for the present and foreseeable future.

Using user-submitted data is a double-edged sword. On the one hand, it gives you a more realistic way to gather the necessary information, but then you have quality control issues. Bad data will necessarily lead to bad translations, and just because you speak a language fluently or natively does not mean that you can accurately relate the rules of the language or describe contextual/circumstantial disambiguating information.
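One common mitigation is to let the crowd vote on each other's submissions so bad data sinks. A minimal sketch of such a crowd-sourced translation memory (a hypothetical design, not any existing system -- and note it still only filters disagreement, not shared misconceptions):

```python
# Crowd-sourced translation memory with naive quality voting.
# All phrases and the voting scheme are illustrative assumptions.
from collections import defaultdict

memory = defaultdict(list)  # source phrase -> [{"text": ..., "votes": ...}]

def submit(source, translation):
    memory[source].append({"text": translation, "votes": 0})

def vote(source, translation, delta):
    for entry in memory[source]:
        if entry["text"] == translation:
            entry["votes"] += delta

def best(source):
    candidates = memory.get(source, [])
    return max(candidates, key=lambda e: e["votes"])["text"] if candidates else None

submit("cold slice", "pizza fría recalentada")  # idiomatic, user-contributed
submit("cold slice", "rebanada fría")           # literal, misses the idiom
vote("cold slice", "pizza fría recalentada", +3)
vote("cold slice", "rebanada fría", +1)
print(best("cold slice"))  # pizza fría recalentada
```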

Such an engine would *not* be as good as current systems at the start because current systems are designed with their own weaknesses in mind. To get translation comparable to what current systems can do you'd either have to *use* a current system until enough data was provided, or you would have to start with a VERY large dataset.

But then there is an insurmountable theoretical problem: the smaller the corpus being translated, the less accuracy you will be able to get, no matter how large your database. A single ambiguous sentence by itself often cannot be disambiguated even by a human. If it is in the context of a paragraph or a page, then more context is provided.

But the extraction and interpretation of contextual information is a task which by itself remains almost wholly unaddressed.

But let us for a moment assume that you have a system that, using a large enough database (I'd guess we're talking TBs here, but that's pure speculation...), was able to gather contextual information and disambiguate sentences. So now you have a data structure that represents all the ideas you wish to convey in your text in a language-neutral fashion.

Then you have the problem of going the other direction. Now, you need a similar database for your target language, and your databases both must contain information about language-specific idioms and customs.

As my Japanese professor said, "you do not *translate* English to Japanese (or vice versa), you *restate* your idea."

Even if your program could understand cultural context and understand the weight certain grammatical constructs/words/phrases are given in different contexts, how does the software "restate" your idea in a culturally meaningful manner?

Consider the Japanese phrase: "Nodo ga kawakimashita." Accurately translated it is "My throat has become dry." But a translation is clumsy at best here. A better way of stating it is "I am thirsty."
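In code terms, the professor's point is that generation has to go through the idea, not the words: each language renders the same language-neutral concept with its own idiom. A sketch (the concept name and the per-language templates are pure invention):

```python
# Restating an idea per language instead of translating word-for-word.
# "SPEAKER_IS_THIRSTY" and the render tables are made-up placeholders.
CONCEPT = "SPEAKER_IS_THIRSTY"

RENDER = {
    "en": {"SPEAKER_IS_THIRSTY": "I am thirsty."},
    # Literally "my throat has become dry" -- the idiomatic Japanese form.
    "ja": {"SPEAKER_IS_THIRSTY": "Nodo ga kawakimashita."},
}

def restate(concept, lang):
    return RENDER[lang][concept]

print(restate(CONCEPT, "en"))  # I am thirsty.
print(restate(CONCEPT, "ja"))  # Nodo ga kawakimashita.
```

The hard part, of course, is everything this sketch hides: getting *into* the concept from source text, and choosing among renderings by register, politeness, and context.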

So you would need, for each language, a HUGE, extremely detailed database of grammatical, lexical, contextual, idiomatic, and cultural information, interrelated at a higher level than just words or sentences. It would have to have culturally-significant weighting of concepts, ideas, grammatical structures, and words. On top of that, you would need a very complex, detailed mapping between the databases of each language.

Just as it is theoretically possible to move three tons of sand from New York to California using only a unicycle and a pair of tweezers, such a project is possible. However, the complexity and resource constraints make it all but impossible, even for thousands of go-get-'em open source coding wizards and hundreds of thousands of community members.