This month the world’s linguistic diversity took another high-profile hit with the loss of the dialect Cromarty. Bobby Hogg, the last native speaker, passed away in early October and was mourned internationally, along with his unique linguistic knowledge. Starting in the 01950s, the traditional fishing methods of the remote village of Cromarty began to be replaced and industrialized. In turn, the formerly strong bond between the village’s cultural, economic and linguistic identities started to erode. By 02011 Bobby was the only remaining speaker of the dialect. He and his brother can be heard speaking Cromarty in these recordings hosted by the Highlands cultural archive Am Baile.

Cromarty was a dialect of Modern Scots, a language with numerous spoken varieties, mostly found in the Scottish lowlands, with a total of around 200,000 speakers. Despite being part of the same linguistic family, the dialects of Modern Scots differ from each other so greatly that they often lack mutual intelligibility. For instance, a speaker of Northern Scots, although only separated by a few hundred miles, may not understand or be understood by a speaker of Southern Scots. The highly distinctive nature of Scots dialects means that each variety, such as Cromarty, is a crucial element in the unique linguistic landscape of Scotland.

The loss of Cromarty is undoubtedly both significant and tragic, but it is essential to recognize that even in death, languages and dialects leave us with more than cause for lamentation. Perhaps most notably, the death of Cromarty tells us a great deal about the role of culture and community in the complex phenomenon of language loss. With its origins in a traditional fishing village, the dialect was highly specialized for talking about particular nautical techniques and equipment but simultaneously vulnerable to obsolescence as a result of technical and practical developments in this field. The recorded vocabulary of the Hogg brothers is peppered with these highly culturally specific terms such as “beetyach: a small knife for beeting (mending) nets” and “aave: the boy who acts as scummer (bailer) as well as the instrument he uses”. Understandably, as these tools, roles and techniques began to disappear, the words used to refer to them followed suit.

Additionally, for communities in similarly precarious cultural and linguistic situations, the disappearance of Cromarty may well reveal the truly precious nature of regional languages and dialects. The island of Tangier, set in the Chesapeake Bay, Virginia, plays host to an equally intriguing dialect. Like Cromarty, the Tangier Tidewater dialect is tied to local fishing methods and, as a result of its dramatically remote location, is hypothesized to sound similar to the English spoken by the original American colonists. As such, in addition to encoding unique cultural information about the Tangier community, this dialect potentially reveals tantalizing details about the history and evolution of the English language. Understanding the sudden vulnerability and subsequent loss of Cromarty may be key in equipping communities such as Tangier with the tools and motivation to record, document or even galvanize the use of their own tongue.

Finally, the lesson of Cromarty reveals the importance of thorough documentation and archiving when dealing with highly endangered languages. In the final years of Cromarty’s use, a number of researchers (primarily Janine Donald) set about recording the dialect, as spoken by the Hogg brothers, with the hope of preserving and publicizing their unique tongue. Research findings, along with numerous original audio clips, were published in an easily accessible online cultural archive named Am Baile and were also printed in this document for local, public distribution. Since the death of Cromarty, these materials have delivered key information about the dialect’s words, phrases and sounds to a truly international audience, along with fascinating cultural insight into the lives of Cromarty residents. Although these documents are by no means exhaustive, they set a great example of responsible and culturally sensitive ways to record, document and potentially preserve a language.

The Transcribe Bushman project was recently developed by Ngoni Munyaradzi, a master's student at the University of Cape Town's Computer Science Department. The goal is to transcribe the large number of manuscripts of |xam and !kun language documentation in the Bleek and Lloyd Collection.

To get through the laborious transcription process sooner rather than later, Ngoni Munyaradzi has developed an interface where anyone can log in and help. One of its most impressive aspects is an elegant system for entering the special characters needed to capture some of the most complex phonetics found anywhere in human language.

As Khoisan languages, |xam and !kun are famous for having clicks as consonant sounds, several of which are written with unique symbols not used in other writing systems. They also have several tones, as well as many secondary articulations affecting both the manner of articulating vowels and their voice quality. Collectively, these are represented by multiple diacritics both above and below vowels.

If you'd like to give it a try, the project could use your help. There is a tutorial video to watch first, and then you can jump right into transcribing. You can do just a single page, or as many as you wish.

As reported in Science today, scientists George Church, Yuan Gao and Sriram Kosuri have written a 5.27-megabit "book" in DNA - encoding far more digital data in DNA than has ever been achieved.

Writing messages in DNA was first demonstrated in 1988, and the largest amount of data written in DNA previously was 7,920 bits. The challenge in writing more information than this has been creating long perfect sequences. The current project uses shorter sequences, each encoding a 96-bit data block, along with a 19-bit address that specifies the location of the data block within the larger data set. Redundancy then reduces errors: each base encodes only a single bit (A and C are both "0", G and T are both "1"), and each data block has several molecular copies.
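The addressing scheme described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' actual pipeline: the function names, the placement of the address before the data block, the zero padding of the last block, and the random choice between the two bases allowed for each bit are all assumptions made for the example. A real design would presumably choose between the two bases to keep the sequence easy to synthesize and read.

```python
import random

ADDRESS_BITS = 19  # address length given in the text
DATA_BITS = 96     # data block length given in the text

# One bit per base, as described: A or C stand for 0, G or T stand for 1.
BASES_FOR_BIT = {"0": "AC", "1": "GT"}
BIT_FOR_BASE = {"A": "0", "C": "0", "G": "1", "T": "1"}


def bits_to_bases(bits, rng):
    """Turn a bit string into bases, picking one of the two allowed bases at random."""
    return "".join(rng.choice(BASES_FOR_BIT[bit]) for bit in bits)


def bases_to_bits(bases):
    """Turn a base sequence back into a bit string."""
    return "".join(BIT_FOR_BASE[base] for base in bases)


def encode(message, rng=None):
    """Split a byte string into addressed fragments (assumed layout: address, then data)."""
    rng = rng or random.Random(0)
    bits = "".join(f"{byte:08b}" for byte in message)
    bits += "0" * (-len(bits) % DATA_BITS)  # pad the last block with zeros
    fragments = []
    for index in range(len(bits) // DATA_BITS):
        address = f"{index:0{ADDRESS_BITS}b}"
        block = bits[index * DATA_BITS:(index + 1) * DATA_BITS]
        fragments.append(bits_to_bases(address + block, rng))
    return fragments


def decode(fragments, length):
    """Reassemble `length` bytes from fragments given in any order."""
    blocks = {}
    for fragment in fragments:
        bits = bases_to_bits(fragment)
        blocks[int(bits[:ADDRESS_BITS], 2)] = bits[ADDRESS_BITS:]
    bits = "".join(blocks[index] for index in sorted(blocks))
    return bytes(int(bits[i:i + 8], 2) for i in range(0, length * 8, 8))
```

Because every fragment carries its own address, `decode(encode(message)[::-1], len(message))` returns the original message even though the fragments arrive in reverse order, which is the point of addressing short sequences instead of synthesizing one long one.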

DNA has several advantages for archival data storage - information density, energy efficiency, and stability. With regard to stability, DNA offers readability "despite degradation in non-ideal conditions over millennia" - by which they mean 400,000 years! (See Church and Regis, in their forthcoming book on the subject.)

If we wish to intentionally use this technology for active long-term information storage (imagine some crucial message we need to convey to the future), we should probably anticipate the possibility of a discontinuity in technological knowledge and access to tools that could read the information. This raises questions of discoverability, decodability, and readability.

Ubiquity aids discoverability - if the information is everywhere it is easier to find, or even stumble upon by accident. Still, clear signals / signposts could aid discovery (neon green cockroaches anyone?). With regard to decodability, I'll simply mention that there are several layers of encoding to be unraveled here: spoken human language > written language in text form > digital / binary > DNA. And presumably readability requires tools on the order of at least what we have available today, unless you can make the expression of the information obvious in some biological way.

Wonderfully exciting new stuff to conjure with from the perspective of technologies for the Long Now Library. We are also delighted to be working with Dr. George Church to provide Rosetta / PanLex data that may be written in a new "edition" of the DNA book, so check back for updates!

This weekend, the New York Times published an article about the extremely endangered Siletz Dee-ni language - an Athabaskan language spoken in the coastal Northwestern United States. It is striking that this is not a story of last-speaker language death - such stories are of course highly newsworthy, but also quite depressing from the vantage point of those working to preserve global linguistic diversity. Instead here is a story of linguistic and cultural restoration and revival, and the incredible efforts of a few people that are bringing it about. We are increasingly seeing such stories in mainstream media, and it is encouraging.

At the core of the Siletz Dee-ni language revitalization project is the creation of a now 10,000+ word dictionary, assembled over the course of many years, from materials and recordings created by tribal members as well as those compiled by linguists over the past century, now housed in many different language archives and university library special collections. Bud Lane, one of the main dictionary developers, has recorded most of the 10,000 entries himself.

For several years, the dictionary database was maintained off-line and password protected so that only tribal members could access it. Recently, however, the project team decided to create an open online version - a "talking dictionary" - that has significantly raised the profile of the dictionary team's efforts and of the language itself on a global stage, while highlighting the value the language has to the Siletz people.

You can explore the talking dictionary through its Siletz Dee-ni / English search interface. Some default searches produce extensive results, some with pictures as well as sound - for example "basket" or "salmon" or a basic verb like "put" that illustrates the internal complexity of words that Athabaskan languages are famous for.

It is often said by lexicographers that a dictionary is never finished - this is partly because the task of compiling one is gargantuan, but also because a healthy language is always changing. Some words become obsolete, falling into disuse, while other novel words emerge as speakers name and talk about new entities in their world. For a language that has gone through a period of obsolescence, many new words need to be created to name and talk about the modern world. At this point, the Siletz appear to be focusing on compiling all of the vocabularies available to them, but with language expansion being a primary goal, one can imagine a future effort devoted to vocabulary creation.

[N.B. the interface and database designer for the Siletz Talking Dictionary project is former Rosetta Project intern Jeremy Fahringer, now at Swarthmore College ITS. Well done, Jeremy!]

The Rosetta Project was created to begin the work of filling Long Now’s 10,000 Year Library and in 02011 student filmmaker Scott Oller offered to help tell the story of the project’s aspirations and achievements. This short documentary, Oller's senior thesis, was shot over the course of several weeks in the Spring of 02012 and explores the contents of the Rosetta Project’s collection of linguistic data, the Internet Archive’s role in hosting and making accessible that data, and the aesthetics and functionality of the Rosetta Disk itself.

The Rosetta Project and PanLex Project at The Long Now Foundation are excited to announce that we are participating in a new initiative called the Endangered Languages Project, which is backed by the Alliance for Linguistic Diversity.

As a member organization of the Alliance, we will be providing support for the Project, which aims to:

support communities engaged in protecting or revitalizing their languages, and

raise awareness about ways to address threats to endangered languages.

Through the Endangered Languages Project, endangered language communities and scholars are able to contribute their own materials by uploading language documentation via Google tools such as Google Docs and YouTube. Alliance members will help maintain the project as an open space so that any user can find, share, and discuss the most comprehensive and up-to-date information and primary data on endangered languages.

As part of our contribution, the PanLex project has offered to make accessible its compilation of a half-billion pairwise translations among 17 million lexemes in 6,000 languages. Our hope is that this data can be made available through the Endangered Languages Project to promote collaboration with researchers and enable more than a trillion additional inferred lexical translations.
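To give a rough sense of how attested pairwise translations can yield additional inferred ones, here is a small Python sketch. It assumes one simple inference strategy, pivoting through a shared intermediate expression; the text does not say how such inferences would actually be produced, and the data, language codes, and function names below are hypothetical. Pivoting can also introduce errors when the intermediate word is ambiguous, which a real system would need to handle.

```python
from collections import defaultdict

# Hypothetical attested translation pairs, each expression given as (language, form).
attested = [
    (("eng", "water"), ("spa", "agua")),
    (("spa", "agua"), ("lav", "ūdens")),
    (("eng", "water"), ("deu", "Wasser")),
]


def build_graph(pairs):
    """Index translation pairs so they can be followed in either direction."""
    graph = defaultdict(set)
    for left, right in pairs:
        graph[left].add(right)
        graph[right].add(left)
    return graph


def inferred_translations(graph, source, target_lang):
    """Return target-language expressions reachable through one pivot but not attested directly."""
    neighbours = graph.get(source, set())
    direct = {expr for expr in neighbours if expr[0] == target_lang}
    inferred = set()
    for pivot in neighbours:
        for candidate in graph.get(pivot, set()):
            if candidate[0] == target_lang and candidate != source and candidate not in direct:
                inferred.add(candidate)
    return inferred
```

With the sample data, `inferred_translations(build_graph(attested), ("eng", "water"), "lav")` finds the Latvian expression through the Spanish pivot, even though no English-Latvian pair was attested.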

For those in the San Francisco area looking for a great Friday night out, the San Francisco Center for the Book is opening a new exhibit tonight, "Exploding the Codex, Theater of the Book," which includes a Rosetta Disk. The event runs from 6:00 to 8:00 pm at the San Francisco Center for the Book, 300 De Haro Street, and is free and open to the public. The exhibit runs through August 31 in the Austin Burch Gallery.

On July 9, Rosetta Project director Laura Welcher will be giving a talk in the Long Now museum on "Bringing the World's ~ 7,000 Languages Online." This talk is part of an ongoing series offered by SF Globalization, a San Francisco meetup group interested in software localization and internationalization.

"There are nearly 7,000 languages spoken in the world today, but the vast majority of them are contracting dramatically in use, rapidly approaching obsolescence and extinction. While computers, mobile devices and the Internet could offer an entirely new domain of language use – infusing these languages with modern vitality and vigor – there are few languages that can be used with ease in this domain today. In this talk, Dr. Laura Welcher will present the work of The Rosetta Project that she directs at The Long Now Foundation, their efforts to build resources and capacity for all human languages, and what it takes to bring these languages online."

This summer, the Rosetta Project is working on a series of Record-a-thon events. While previous Record-a-thons have collected recordings from many languages at once, the Record-a-thon events planned this summer are focused on specific languages. The aim is to capture, through video recording, spoken language samples from a number of speakers. We aim to record a small group of speakers for each language, with at least 2 hours of recording per language.

The Record-a-thon events taking place this summer will collect many hours of video and audio linguistic data. While this is certainly valuable information, it would be much more useful for future research if it were in an easily searchable format. Generally, as the amount of information grows, it becomes harder for a human to get anything out of it. This is, of course, the problem of the information age.

With the advent of modern recording techniques, this is also a large problem for speech scientists and speech technologists. A common way of formatting audio data is to segment the speech, which is continuous (no pauses between sounds or words), into speech sounds (phonemes) and to label them (e.g., the second sound in the word Rosetta would be labeled 'o', and would be marked as occurring from the end of the 'r' sound to the beginning of the 'z' sound). Having myself segmented and labeled speech sounds, I can attest that it is an extremely laborious process. To have data that is already formatted, processed, and ready for analysis is an enormous boon.

Recordings of spoken language have tremendous value, but are not always immediately useful for those who would have the greatest interest in the data. It is my hope to make the data collected this summer immediately useful to researchers, both in the academic world and in industry. In particular, I am interested in creating an audio corpus (body of data) for each language recorded, ideally one that is divided into individual speech sounds (segmented), with each sound labeled. This type of corpus is called an aligned speech corpus. Recently, there have been attempts to automate this process, letting a computer segment the speech sounds. This greatly reduces the amount of time needed to turn raw speech data into formatted, more immediately useful data.
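For readers who think in code, one way to picture a segmented and labeled recording is as a list of labeled time intervals. The Python sketch below is only an illustration: the `Segment` class and its field names are hypothetical, the timings for the word Rosetta are invented rather than measured, and real aligned corpora use their own annotation formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    label: str    # the speech sound, e.g. "o"
    start: float  # seconds from the start of the recording
    end: float    # seconds from the start of the recording


# Invented timings for the word "Rosetta"; these are not measured data.
rosetta = [
    Segment("r", 0.00, 0.07),
    Segment("o", 0.07, 0.18),
    Segment("z", 0.18, 0.26),
    Segment("e", 0.26, 0.35),
    Segment("t", 0.35, 0.41),
    Segment("a", 0.41, 0.52),
]


def is_contiguous(segments, tolerance=1e-6):
    """Check that each segment begins where the previous one ends (continuous speech)."""
    return all(
        abs(previous.end - following.start) <= tolerance
        for previous, following in zip(segments, segments[1:])
    )


def duration(segment):
    """Length of a segment in seconds."""
    return segment.end - segment.start
```

Here the 'o' segment starts exactly where the 'r' ends and stops where the 'z' begins, so `is_contiguous(rosetta)` holds, and `duration(rosetta[1])` gives the length of that vowel in seconds.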

Who exactly is interested in spoken language data? I previously mentioned phoneticians. Unlike Professor Henry Higgins in the movie My Fair Lady, today's phoneticians are interested in the question of how language is actually spoken rather than how it is supposed to be spoken. One question a phonetician might be interested in is whether vowels are longer at the end of a word than at the beginning. Generally, English-speakers don't think of speech sounds as being longer or shorter, but in fact different sounds differ from one another in length, and even the same speech sound may vary depending on its location in a word. This kind of question is answerable by looking at spoken data that has the beginnings and ends of sounds marked (segmented), and the sounds themselves labeled.
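As a rough illustration of how the vowel-length question could be asked of segmented and labeled data, here is a short Python sketch. Everything in it is an assumption made for the example: the tuple layout, the simplified vowel set, and above all the timings, which are invented and say nothing about any real language.

```python
from statistics import mean

VOWELS = {"a", "e", "i", "o", "u"}  # simplified vowel inventory, assumed for the example

# Each word is a list of (label, start_seconds, end_seconds) with invented timings.
corpus = [
    [("a", 0.00, 0.06), ("t", 0.06, 0.12), ("o", 0.12, 0.26)],
    [("e", 0.40, 0.47), ("n", 0.47, 0.53), ("i", 0.53, 0.68)],
    [("m", 0.80, 0.86), ("a", 0.86, 0.99)],
]


def vowel_durations(words):
    """Collect durations of word-initial and word-final vowels separately."""
    initial, final = [], []
    for word in words:
        first_label, first_start, first_end = word[0]
        last_label, last_start, last_end = word[-1]
        if first_label in VOWELS:
            initial.append(first_end - first_start)
        if last_label in VOWELS:
            final.append(last_end - last_start)
    return initial, final


initial, final = vowel_durations(corpus)
print(f"word-initial vowels: mean {mean(initial) * 1000:.0f} ms over {len(initial)} tokens")
print(f"word-final vowels:   mean {mean(final) * 1000:.0f} ms over {len(final)} tokens")
```

With a real aligned corpus, the same comparison could be run over thousands of words per speaker, which is exactly where hand segmentation becomes impractical.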

Speech and hearing scientists could also find value in speech data that has been segmented and labeled. Speech and hearing science has the goal of identifying and treating language disorders. But in order to do this, one must have examples of "normal" language. How is language normally spoken? What is a sign of a disorder, vs normal variation within the language? Having many examples of a language, from different speakers, would be helpful for answering this question.

Finally, the language technology community could also find labeled and segmented speech data extremely useful. Babies are not born knowing a particular language, but require continuous exposure to a particular language in order to learn it. An automatic speech recognition system is the same; it needs to be trained in a particular language in order for it to recognize words, phrases, and sentences in that language. Many speech recognition systems are trained on aligned speech corpora.

Several aligned speech corpora exist for English. What would be especially valuable about creating aligned corpora from data collected this summer is that the languages we anticipate collecting are not the extremely well-studied, more common languages. Having corpora for these languages provides wider access to them. A researcher in Tucson (where I live) might struggle to find 10 speakers of Latvian, but could have access to a substantial spoken Latvian corpus with access to the internet. Further, having a corpus that is already labeled would allow the researcher to have a much larger quantity of usable data than is collectible by a single person, at least within a reasonable time-frame. Having data for less common languages allows for a better understanding of language in general, better understanding and diagnosis of language disorders, and the expansion of speech technologies to new populations.

The Rosetta Project was just featured in the radio show "Lingua Franca," presented by Maria Zijlstra and broadcast on ABC Radio National Australia. The full program is available here as a podcast on the Lingua Franca website.