In his 2002 book Lost Languages, Andrew Robinson, then the literary editor of the London Times’ higher-education supplement, declared that “successful archaeological decipherment has turned out to require a synthesis of logic and intuition … that computers do not (and presumably cannot) possess.”

Regina Barzilay, an associate professor in MIT’s Computer Science and Artificial Intelligence Lab, Ben Snyder, a grad student in her lab, and the University of Southern California’s Kevin Knight took that claim personally. At the Annual Meeting of the Association for Computational Linguistics in Sweden next month, they will present a paper on a new computer system that, in a matter of hours, deciphered much of the ancient Semitic language Ugaritic. In addition to helping archaeologists decipher the eight or so ancient languages that have so far resisted their efforts, the work could also help expand the number of languages that automated translation systems like Google Translate can handle.

To duplicate the “intuition” that Robinson believed would elude computers, the researchers’ software makes several assumptions. The first is that the language being deciphered is closely related to some other language: In the case of Ugaritic, the researchers chose Hebrew. The next is that there’s a systematic way to map the alphabet of one language onto the alphabet of the other, and that correlated symbols will occur with similar frequencies in the two languages.
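The paper’s actual model is probabilistic, but the frequency assumption on its own can be illustrated with a minimal sketch: rank the symbols of each corpus by how often they occur and pair them rank-for-rank. The corpora below are invented toy strings, not real Ugaritic or Hebrew:

```python
from collections import Counter

def frequency_mapping(lost_text, known_text):
    """Hypothesize that the i-th most frequent symbol in the lost script
    corresponds to the i-th most frequent symbol in the known script.
    This is only a starting guess, to be refined later."""
    lost_ranked = [sym for sym, _ in Counter(lost_text).most_common()]
    known_ranked = [sym for sym, _ in Counter(known_text).most_common()]
    return dict(zip(lost_ranked, known_ranked))

# Toy corpora: "a" and "X" are the most frequent symbols in their
# respective scripts, so the sketch pairs them, and so on down the ranking.
mapping = frequency_mapping("abacabad", "XYXZXYXW")
```

A real system would weight such hypotheses probabilistically rather than committing to a single ranking, since frequency ties and corpus noise make a rank-for-rank pairing unreliable on its own.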

The system makes a similar assumption at the level of the word: The languages should have at least some cognates, or words with shared roots, like main and mano in French and Spanish, or homme and hombre. And finally, the system assumes a similar mapping for parts of words. A word like “overloading,” for instance, has both a prefix — “over” — and a suffix — “ing.” The system would anticipate that other words in the language will feature the prefix “over” or the suffix “ing” or both, and that a cognate of “overloading” in another language — say, “surchargeant” in French — would have a similar three-part structure.
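The word-structure assumption can be sketched the same way: given candidate prefixes and suffixes, enumerate the possible prefix-stem-suffix analyses of a word. The affix lists here are illustrative, not the model’s learned inventory:

```python
def affix_splits(word, prefixes, suffixes):
    """Enumerate (prefix, stem, suffix) analyses of `word`, allowing the
    empty prefix and suffix, and requiring a non-empty stem."""
    analyses = []
    for p in [x for x in prefixes if word.startswith(x)] + [""]:
        for s in [x for x in suffixes if word.endswith(x)] + [""]:
            stem = word[len(p):len(word) - len(s)]
            if stem:
                analyses.append((p, stem, s))
    return analyses

# "overloading" analyzes as over + load + ing, among other candidates.
splits = affix_splits("overloading", ["over"], ["ing"])
```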

Crosstalk

The system plays these different levels of correspondence off each other. It might begin, for instance, with a few competing hypotheses for alphabetical mappings, based entirely on symbol frequency — mapping symbols that occur frequently in one language onto those that occur frequently in the other. Using a type of probabilistic modeling common in artificial-intelligence research, it would then determine which of those mappings seems to have identified a set of consistent suffixes and prefixes. On that basis, it could look for correspondences at the level of the word, and those, in turn, could help it refine its alphabetical mapping. “We iterate through the data hundreds of times, thousands of times,” says Snyder, “and each time, our guesses have higher probability, because we’re actually coming closer to a solution where we get more consistency.” Finally, the system arrives at a point where altering its mappings no longer improves consistency.
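The researchers’ system does this with probabilistic inference over all the levels at once; as a loose, much-simplified illustration of “iterate until consistency stops improving,” here is a toy hill-climb that swaps two letter assignments whenever the swap makes more transliterated words match a known lexicon. The cipher and lexicon are invented:

```python
from itertools import permutations

def consistency(mapping, lost_words, known_lexicon):
    """Count lost-script words whose transliteration under `mapping`
    is an attested word of the known language."""
    translit = lambda w: "".join(mapping.get(c, "?") for c in w)
    return sum(translit(w) in known_lexicon for w in lost_words)

def refine(mapping, lost_words, known_lexicon):
    """Greedily swap pairs of letter assignments while doing so improves
    consistency; stop when no swap helps -- the toy analogue of the point
    where altering the mappings no longer improves consistency."""
    improved = True
    while improved:
        improved = False
        best = consistency(mapping, lost_words, known_lexicon)
        for a, b in permutations(mapping, 2):
            trial = dict(mapping)
            trial[a], trial[b] = mapping[b], mapping[a]
            if consistency(trial, lost_words, known_lexicon) > best:
                mapping, improved = trial, True
                break
    return mapping
```

The real model scores consistency at the alphabet, morpheme, and word levels simultaneously and maintains distributions over mappings rather than a single greedy hypothesis.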

Ugaritic has already been deciphered: Otherwise, the researchers would have had no way to gauge their system’s performance. The Ugaritic alphabet has 30 letters, and the system correctly mapped 29 of them to their Hebrew counterparts. Roughly one-third of the words in Ugaritic have Hebrew cognates, and of those, the system correctly identified 60 percent. “Of those that are incorrect, often they’re incorrect only by a single letter, so they’re often very good guesses,” Snyder says.

Furthermore, he points out, the system doesn’t currently use any contextual information to resolve ambiguities. For instance, the Ugaritic words for “house” and “daughter” are spelled the same way, but their Hebrew counterparts are not. While the system might occasionally get them mixed up, a human decipherer could easily tell from context which was intended.

Babel

Nonetheless, Andrew Robinson remains skeptical. “If the authors believe that their approach will eventually lead to the computerised ‘automatic’ decipherment of currently undeciphered scripts,” he writes in an e-mail, “then I am afraid I am not at all persuaded by their paper.” The researchers’ approach, he says, presupposes that the language to be deciphered has an alphabet that can be mapped onto the alphabet of a known language — “which is almost certainly not the case with any of the important remaining undeciphered scripts,” Robinson writes. It also assumes, he argues, that it’s clear where one character or word ends and another begins, which is not true of many deciphered and undeciphered scripts.

“Each language has its own challenges,” Barzilay agrees. “Most likely, a successful decipherment would require one to adjust the method for the peculiarities of a language.” But, she points out, the decipherment of Ugaritic took years and relied on some happy coincidences — such as the discovery of an axe that had the word “axe” written on it in Ugaritic. “The output of our system would have made the process orders of magnitude shorter,” she says.

Indeed, Snyder and Barzilay don’t suppose that a system like the one they designed with Knight would ever replace human decipherers. “But it is a powerful tool that can aid the human decipherment process,” Barzilay says. Moreover, a variation of it could also help expand the versatility of translation software. Many online translators rely on the analysis of parallel texts to determine word correspondences: They might, for instance, go through the collected works of Voltaire, Balzac, Proust and a host of other writers, in both English and French, looking for consistent mappings between words. “That’s the way statistical translation systems have worked for the last 25 years,” Knight says.

But not all languages have such exhaustively translated literatures: At present, Snyder points out, Google Translate works for only 57 languages. The techniques used in the decipherment system could be adapted to help build lexicons for thousands of other languages. “The technology is very similar,” says Knight, who works on machine translation. “They feed off each other.”

Comments

allen mendelson

July 1, 2010

The universe is a machine, and everything in it, we human beings included, is a "submachine" of it. Some machines have only hardware; others have both hardware and software. For us human beings, the software is saved in our brains and DNA and controls the movement of our hardware. We can make a machine greater than ourselves. So artificial intelligence can do everything.

John

July 7, 2010

Theoretically you are right. But your argument would be meaningless if making a machine greater than human beings themselves took an amount of time that made the project unrealistic.

hasibuddin

July 7, 2010

It's great that computers can automatically decipher ancient languages. Ultimately, it's the human brain that made computers.

bert

July 7, 2010

Fascinating new technology!

But surely, to be fair to Andrew Robinson, telling the program that the language is similar to Hebrew comes under the "logic and intuition" part?

Dave

July 7, 2010

As a Christian I would question the assumption "the universe is a machine". That subset of it which is susceptible to rational investigation, maybe. I don't believe we humans are just "hardware and software".

An un-deciphered text presumably had rational meaning to both its author and its intended readers (assuming both were sane), but their understanding might depend crucially on contextual information no longer available to us.

I look forward to better and better machine translators; but as to "AI can do everything", the jury must remain out until AI has actually done everything. I'm not holding my breath...

Ravi

July 8, 2010

Cheers to Regina and her team!!

After 500 years of science, I wonder why homocentrism still persists.

william reed

July 8, 2010

There are many instances where the context provides a better explanation of a phenomenon than decomposing the phenomenon into component parts. So, viewing the "machine" as a component of a larger "machine" often provides more insight than analysing the "submachines." In fact, isn't this what this software system does? So in fact, one step is to select the correct embedding system with which to seek relationships to illuminate the phenomenon under study - here, selecting the analogue language. Another way to look at this technique: The woman had a heart attack. Analysis said she had a weak heart. The systems view said she was poor and had to walk up stairs on a hot day carrying her groceries, causing her heart attack. What was the cause of the heart attack?

Carlos Andres

July 12, 2010

Languages are so common and at the same time so necessary for us; they are the base of our society. The truth is that we are so interested in other things in life that we let our language take its own path. Nevertheless, now more than ever it is important to go directly to the base of communication itself in order to find ways to improve our language, using all the knowledge we have of ourselves, so that globalization, together with technology, can redirect the concept we have of the way we communicate.

Thomas Petzold

July 27, 2010

"Whilst languages may stop to exist due to its last speakers dying, the interaction between languages would never cease."

This impressive result shows the power of clever use of good statistical methods. Mapping an unknown script into a known script for a related language is a hard problem, and this is an important step forward on that problem.

The title both of the original article and the MIT News article, however, might be a bit misleading for those who don't read the content. It is not really *languages* that are being deciphered, but *scripts* for languages. And though the MIT News article says it could "help expand the number of languages that automated translation systems like Google Translate can handle", that is a different problem: *language translation*, not *script decipherment*. In fact, it is this work that borrows from techniques in language translation rather than vice versa.

I'll be very interested to learn more about the current technique, and to follow further work on it. How closely related do the languages and writing systems need to be for it to work? I would guess it would do nicely in mapping Russian in Cyrillic to Czech in Latin (both Slavic), but might not do so well mapping Moldovan in Cyrillic to French in Latin (both Romance), because the languages are too different; or mapping Ancient Greek in Greek script to Linear B, because the scripts are too different (alphabetic vs. syllabic) and also perhaps because the corpora are too different. Can it improve on the currently accepted decipherments of various scripts?

So, lots more to be done here, but this is an interesting and important piece of work.