My original question in 2008 came from looking at trends in digitizing books – notably Google Books – and machine translation (MT). It elicited some interesting responses, including Kirti Vashee’s mention of an Asia Online project planning to link these two trends.

So while the technologies for digitization and for MT – the two pieces in localizing libraries of information – are established and improving, each has encountered some combination of legal, political, or funding issues limiting their use individually for mass expansion of access to knowledge, as well as their potential use in tandem.

However, could the Norwegian program, announced in 2013, and the project it has with Nigeria, announced earlier this year, introduce a new dynamic, at least for mass digitization? Could and should large national libraries take the lead in this area?

The idea of digitizing libraries has generally been advocated in terms of access to knowledge, without particular reference to the languages in which publications are written. But languages are critical not only for access to knowledge, but also for facilitating scholarship and the interfacing of ways of knowing. Hence the need to associate mass digitization and MT.

There is at least one proposed project mentioning the potential for translation of books that have been digitized – Internet Archive’s initiative to digitize 4 million books (a semifinalist in the MacArthur Foundation’s 100 & Change grant competition).

Any such digital data produced by the Nasjonalbiblioteket, Google Books, Internet Archive, or any other organization could be translated with MT into other languages, with a few caveats (quality of optical character recognition [OCR]; how well resourced a particular language is; and of course the accuracy of the MT). This means that potentially any mass digitization could be mass translated into a large number of languages, given legal cover and sufficient funding.
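The caveats above suggest that any digitize-then-translate pipeline would need to check OCR quality before handing text to MT. A minimal sketch of that idea follows; the `Page` structure, confidence threshold, and `translate` function are hypothetical stand-ins, not the API of any actual OCR or MT system:

```python
from dataclasses import dataclass

@dataclass
class Page:
    text: str               # OCR output for one page
    ocr_confidence: float   # 0.0-1.0, as reported by a hypothetical OCR engine

def translate(text: str, target_lang: str) -> str:
    """Placeholder for a call to a real MT engine."""
    return f"[{target_lang}] {text}"

def localize_book(pages: list[Page], target_lang: str,
                  min_confidence: float = 0.8) -> list[str]:
    """Translate only pages whose OCR quality clears a threshold,
    flagging the rest for human review -- one way of handling the
    OCR-quality caveat noted above."""
    output = []
    for page in pages:
        if page.ocr_confidence >= min_confidence:
            output.append(translate(page.text, target_lang))
        else:
            output.append("[NEEDS REVIEW: low OCR confidence]")
    return output
```

For example, `localize_book([Page("Hello", 0.95), Page("???", 0.4)], "fr")` would translate the first page and flag the second. The point of the sketch is simply that quality gates can be built into a mass pipeline rather than bolted on afterward.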

What about the accuracy of MT, and how useful could mass MT of mass digitization be if there are inaccuracies? These are critical questions for any project using MT to translate digitized texts. Responses could reference, for instance, domain-specific MT, which is generally more accurate than general MT, provided of course that the material matches the domain used. Or perhaps some system for post-editing could be devised.
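One way domain-specific MT could be applied at scale is to route each digitized text to a domain model only when its vocabulary matches that domain, falling back to a general model otherwise. The sketch below is purely illustrative – the keyword lists and the `pick_engine` routing function are invented for this example, not taken from any real system:

```python
# Hypothetical keyword lists standing in for a real domain classifier.
DOMAIN_KEYWORDS = {
    "medicine": {"patient", "dosage", "diagnosis"},
    "law": {"plaintiff", "statute", "jurisdiction"},
}

def pick_engine(text: str) -> str:
    """Return the name of the MT engine (domain-specific or general)
    whose vocabulary best matches the input text."""
    words = set(text.lower().split())
    best, hits = "general", 0
    for domain, vocab in DOMAIN_KEYWORDS.items():
        overlap = len(words & vocab)
        if overlap > hits:
            best, hits = domain, overlap
    return best
```

So `pick_engine("the patient received a new dosage")` would select the medical engine, while text matching no domain falls through to the general one. A production system would use a trained classifier rather than keyword overlap, but the routing logic would be similar.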

This is an exciting area that needs more attention and policy support. Books and other production in print can be digitized on a mass scale, making the knowledge in them more widely available. Digitized text can be machine translated into other languages, and the quality of that can be made high enough for use by speakers of the target languages. As much as the printing press revolutionized access to knowledge in its age, so too the potential to digitize and translate what is in print promises another revolution benefiting more people directly.

In the previous post, I looked at a possible ramification of “mass digitization” of text. But what about the spoken word? And more precisely verbal presentations, performance, and broadcast in languages often described as having “oral traditions” (and generally less material in writing)? Can we do something meaningful to capture significant amounts of speech in such languages in digital recordings?

There are some projects to digitize older recordings on tape, and certainly a need for more effort in this area, but what I am thinking of here is recording contemporary use of language that is normally ephemeral (gone once uttered), along with gaining access to recordings of spoken language that may not be publicly accessible. One place to start might be community radio stations in regions where less-resourced languages are spoken.

The object would be to build digital audio libraries for diverse languages that don’t have much if any publication in text. This could permit various kinds of work. In the case of endangered tongues, this kind of effort would fall under the rubric of language documentation (for preservation and perhaps revitalization), but what I am suggesting is a resource for language development for languages spoken by wider communities.

Digital audio is more than just a newer format for recording. As I understand it, digital storage of audio has some qualitative differences, notably the potential to search by sound (without the intermediary of writing) and eventually, one presumes, to be manipulated and transformed in various ways (including rendering in text). Such a resource could be of use in other ways, such as collecting information on things like emerging terminologies in popular use (a topic that has interested me since hearing how community radio stations in Senegal come up with ways to express various new concepts in local languages). Altogether, digital audio seems to have the potential to be used in more ways than we are used to thinking about in reference to sound recordings.

Put another way, recordings can be transcribed and serve as “audio corpora” in the more established way. But what if one had large volumes of untranscribed digital recordings, with the potential to search the audio (without text) and later to convert it into text? Accuracy will be one of the challenges here, since such conversion would not involve the speaker-specific training typical of current speech-to-text programs.
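As a toy illustration of what “searching the audio without text” could mean: given raw waveforms as arrays, one can locate where a short query clip occurs in a longer recording by sliding it along and scoring each alignment with cross-correlation. This is a deliberate simplification – real acoustic search works on features such as MFCCs rather than raw samples, and the example below uses synthetic noise in place of actual speech:

```python
import numpy as np

def find_clip(recording: np.ndarray, query: np.ndarray) -> int:
    """Return the sample offset where `query` best matches `recording`.
    Cross-correlation here is a toy stand-in for real acoustic search,
    which would compare extracted features, not raw samples."""
    # Slide the query over the recording; the best alignment scores highest.
    scores = np.correlate(recording, query, mode="valid")
    return int(np.argmax(scores))

# Synthetic example: a short "sound" embedded in a longer signal.
rng = np.random.default_rng(0)
signal = rng.standard_normal(1000)
clip = signal[400:450]          # pretend this is a spoken word
offset = find_clip(signal, clip)
```

Here `offset` recovers the position (sample 400) where the clip was taken from, with no transcription involved at any step – which is the qualitative difference digital audio offers over analog tape.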

Can digital technology do for audio content something analogous to what it can do for text? What sort of advantages might such an effort bring for education and development in communities which use less-resourced languages? Could it facilitate the emergence of “neo-oral” traditions that integrate somehow with developing literate traditions in the same languages?

The question is not as crazy as it might seem. Projects for “mass digitization of books” have been using technology like robots for some years already, with the idea of literally digitizing all books and entire libraries. This goes way beyond the concept of e-books championed by Michael Hart and Project Gutenberg. Currently, Google Book Search and the Open Content Alliance (OCA) seem to be the main players among a varied lot of digital library projects. Despite the closing of Microsoft’s Live Search Books, it seems likely that projects digitizing older publications, plus appropriate cycling in of new publications (everything today is digital before it’s printed anyway), will continue to vastly expand what is available for digital libraries and book searches.

The fact of having so much in digital form could open other possibilities besides just searching and reading online.

Consider the field of localization, which is actually a diverse academic and professional language-related field covering translation, technology, and adaptation to specific markets. The localization industry is continually developing new capacities to render material from one language in another. Technically this involves computer-assisted translation tools (basically translation memory and, increasingly, machine translation [MT]) and methodologies for managing content. The aims heretofore have been fairly focused on the particular needs of companies and organizations to reach linguistically diverse markets (localization still plays a relatively minor role in international development and wherever markets are less lucrative).

I suspect however that the field of localization will not remain confined to any particular area. For one thing, as the technologies it is using advance, they will find diverse uses. In my previous posting on this blog, I mentioned Lou Cremers’ assertion that improving MT will tend to lead to a larger amount of text being translated. His context was work within organizations, but why not beyond?

Keep in mind also that there are academic programs now in localization, notably the Localisation Research Centre at the University of Limerick (Ireland), which by their nature will also explore and expand the boundaries of their field.

At what point might one consider harnessing the steadily improving technologies and methodologies for content localization to the potential inherent in vast and increasing quantities of digitized material?