One of my favorite things about watching old movies is hearing how people might have spoken in eras past -- the expressions they used, their old-school smack talk. But what did the languages from thousands of years back sound like? Hollywood, as far as I know, has yet to make a movie in which characters talk in authentic proto-Austronesian.

The Rosetta Stone in the British Museum. Inscriptions in hieroglyphics and Demotic and Greek yielded early first clues to the deciphering of Egyptian hieroglyphics.
The British Museum

The language nerd in me, was, therefore, excited to discover that scientists from UC Berkeley and the University of British Columbia have created a computer program to rapidly reconstruct vocabularies of ancient languages using only their modern language descendants.

The program, a linguistic time machine of sorts, can quickly crunch data on some of the earliest-known "proto-languages" that gave rise to modern languages such as Hawaiian, Javanese, Malay, Tagalog, and others spoken in Southeast Asia, parts of continental Asia, Australasia, and the Pacific.

The speed and scale of the work is key here, as proto-languages are typically reconstructed through a timely and painstaking manual process that involves comparing two or more languages that hail from a shared ancestor.

"What excites me about this system is that it takes so many of the great ideas that linguists have had about historical reconstruction, and it automates them at a new scale: more data, more words, more languages, but less time," says Dan Klein, an associate professor of computer science at UC Berkeley and co-author of a paper on the system published this week in the journal Proceedings of the National Academy of Sciences.

The program relies on an algorithm known as the Markov chain Monte Carlo sampler to examine hundreds of modern Austronesian languages for words of common ancestry from thousands of years back. The computational model is based on the established linguistic theory that words evolve as if along the branches of a genealogical tree.

Related stories

In their paper (if you want more specifics on how the researchers did things like associate a context vector to each phoneme, read this PDF), the team details the process of reconstructing more than 600 proto-Austronesian languages from an existing database of more than 140,000 words.

They say more than 85 percent of the reconstructions come within a character of those done manually by a linguist.

Linguists needn't worry about competition from the computer, however. "This really is not an attempt to put historical linguists out of a job," Klein tells CNET. "It's not like chess where we're seeing if we can one-up the humans.

"It would take a very long time for people to be able to cross-reference every language and every word that we feed into the program," Klein adds. "Of course in principle a human could do that, but in practice they just won't. Humans are not well-suited to crunching all of this data. Humans are well-suited to other things."

They can, for example, examine linguistic influences a computer can't yet grok -- the sociological and geographical factors that lead to morphological changes that turn, say, "cat" into "kitty cat" and how spelling errors reveal sound confusions for languages that are also written.

Implications for future-speak?
"It's the same way that astronomers haven't been replaced by computerized telescopes," Klein says. "We hope that linguists can use tools like ours to see further into the past -- and more of it, too."

Clues from the past, yes. But the work may also be able to extrapolate how language might change in the future, according to Tom Griffiths, associate professor of psychology, director of UC Berkeley's Computational Cognitive Science Lab, and another co-author of the paper.

Which of course begs the question: how will Twitter-speak look 6,000 years from now?

Ever wondered what a family tree for Austronesian languages looks like? You're welcome.
Proceedings of the National Academy of Sciences