Languages are like species. They evolve in mostly predictable ways, splitting into new species or dying out over time. Now, a group of linguists and computer scientists in the US and Canada have created a piece of software that can analyze enormous groups of languages to reconstruct what the earliest human languages might have sounded like. It sounds like a subplot from Neal Stephenson's novel Snow Crash, but it's quite real. By using this program and others like it, linguists may one day know how people sounded when they talked 20,000 years ago, long before there was writing.

University of British Columbia statistician Alexandre Bouchard-Côté began working on the program when he was a graduate student at UC Berkeley. He used common algorithms to compare sounds and cognates (words in different languages that descend from the same ancestral word) across hundreds of different modern languages. By doing this, he could predict which language groups were most closely related to each other, and which kinds of sounds would be preserved most often. A sound that remained the same across distantly related languages was probably a sound that existed early in our linguistic evolutionary tree.
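
To make the idea of a "preserved" sound concrete, here is a toy Python sketch. It is not the researchers' model, which is a far more sophisticated statistical system. It simply lines up a few cognates letter by letter and tallies how often each sound is identical in every language. The word list, the use of simplified spellings as stand-ins for sounds, and the function name `sound_stability` are all assumptions made for illustration.

```python
from collections import Counter

# Simplified spellings of cognates in three Austronesian languages.
# Spelling is only a rough stand-in for pronunciation (illustrative data).
COGNATES = {
    "five": {"Hawaiian": "lima", "Maori": "rima", "Malay": "lima"},
    "eye":  {"Hawaiian": "maka", "Maori": "mata", "Malay": "mata"},
    "two":  {"Hawaiian": "lua",  "Maori": "rua",  "Malay": "dua"},
}


def sound_stability(cognates):
    """Return {sound: (times_preserved, times_seen)}.

    Assumes every form in a cognate set has the same length, so that
    position i in one language lines up with position i in the others.
    """
    preserved = Counter()
    seen = Counter()
    for forms in cognates.values():
        words = list(forms.values())
        if len({len(w) for w in words}) != 1:
            raise ValueError("toy alignment needs equal-length forms")
        # zip(*words) walks through the words one aligned position at a time.
        for column in zip(*words):
            sounds = set(column)
            for sound in sounds:
                seen[sound] += 1
            if len(sounds) == 1:
                # Every language has the same sound in this position.
                preserved[column[0]] += 1
    return {s: (preserved[s], seen[s]) for s in seen}


if __name__ == "__main__":
    for sound, (kept, total) in sorted(sound_stability(COGNATES).items()):
        print(f"{sound}: identical in {kept} of {total} aligned positions")
```

In this tiny sample the vowels and "m" never change, while the first consonant of "five" and "two" varies from language to language. A real system has to cope with words of different lengths, shifts in meaning and thousands of entries, which is where the statistics come in.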

By putting these sounds together, Bouchard-Côté's program was able to reconstruct the sounds and words that were most likely to have been used in prehistoric languages. Linguists speculate that the languages that led to today's modern ones include Proto-Indo-European, Proto-Afroasiatic and Proto-Austronesian. Bouchard-Côté and his colleagues focused on Proto-Austronesian, which led to today's Polynesian languages, as well as languages in Southeast Asia and parts of continental Asia. Working from more than 600 modern Austronesian languages, they were able to reconstruct the ancient languages those descended from.

In their paper, published this week in Proceedings of the National Academy of Sciences, the researchers write:

We have developed an automated system capable of large-scale reconstruction of protolanguage word forms, cognate sets, and sound change histories. The analysis of the properties of hundreds of ancient languages performed by this system goes far beyond the capabilities of any previous automated system and would require significant amounts of manual effort by linguists. Furthermore, the system is in no way restricted to applications like assessing the effects of functional load: It can be used as a tool to investigate a wide range of questions about the structure and dynamics of languages.

"Functional load" is a mid-twentieth century theory that suggests some sounds are more important than others in a language because they're used to distinguish between words that would otherwise sound the same. For example, in the words "dog" and "tog," there's one important feature that distinguishes them: the voicing of the "d". Your tongue is in the same place to make both sounds. The only difference is that "d" requires you to use your voice, and "t" is just expelling air. An important distinction like that voicing is probably going to be preserved over time, because it's used in a lot of places to distinguish between words.

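One simple way to put a number on functional load is to count "minimal pairs," words that differ only in the contrast in question. The Python sketch below does that for a small, made-up English word list. It is an illustration under stated assumptions, not the measure used in the paper: letters stand in for sounds, and the word list and the function name `minimal_pairs` are hypothetical.

```python
# A tiny, hand-picked word list (illustrative only). Letters stand in for sounds.
LEXICON = ["dog", "tog", "den", "ten", "dip", "tip",
           "bad", "bat", "sip", "zip", "cat"]


def minimal_pairs(words, sound_a, sound_b):
    """Return pairs of words that differ only by sound_a versus sound_b
    in a single position."""
    lexicon = set(words)
    pairs = []
    for word in sorted(lexicon):
        for i, sound in enumerate(word):
            if sound == sound_a:
                # Swap in the other sound and see if that is also a word.
                candidate = word[:i] + sound_b + word[i + 1:]
                if candidate in lexicon:
                    pairs.append((word, candidate))
    return pairs


if __name__ == "__main__":
    for a, b in [("d", "t"), ("s", "z")]:
        found = minimal_pairs(LEXICON, a, b)
        print(f"{a}/{b}: {len(found)} minimal pairs {found}")
```

In this word list the "d"/"t" contrast separates four pairs of words and the "s"/"z" contrast only one, so by this rough count "d"/"t" carries the heavier load.
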
Ultimately, this program could allow linguists to hear languages that haven't been spoken in millennia, reconstructing a lost era when those languages spread across the world, evolving as they went.

Over time, this program could be used for linguistic futurism, too. In a release, UC Berkeley cognitive scientist Tom Griffiths said:

Our statistical model can be used to answer scientific questions about languages over time, not only to make inferences about the past, but also to extrapolate how language might change in the future.