Polyglot processing

22 August 2013

English, the internet, and the future of Southasian languages online.

flickr/ Alpha Tauri

From the 1990s to the early 2000s, the phrase ‘digital divide’ saw its popularity peak. Now that less is heard of it, has the divide been bridged forever? Access seems not to be as big a problem now as it appeared back then. With smartphone technology, the availability of cheap consumer electronics, and improving wireless networks, getting more people online seems merely a matter of waiting. If the present trend continues, it won’t take very long before almost everybody on the planet becomes a part of this massive network. The question now beginning to be asked is whether merely having access to the technology means the divide has been bridged, or whether we should be concerned about new technological divides based on differences in patterns of access, gains in productivity, improvements in the quality of life, and the way in which modern communication promotes desirable societal and political goals.

One of the world’s first fully functioning computers was developed in 1941 by a German inventor named Konrad Zuse, who also described Plankalkül, considered the forerunner of modern programming languages. His work went unnoticed for a long time in the non-German-speaking parts of the world. Had it been otherwise, the language of computing could possibly have been German rather than English. As it turned out, many of the more widely recognised and disseminated advances in computing came from the English-speaking world. English had other advantages too. At the dawn of the computer age, English had already been introduced to vast stretches of the world by the British Empire. It was then helped by the United States becoming the dominant centre of the global economy and culture after the Second World War and the Cold War. The end of the Cold War opened a global market for the likes of MTV and CNN, and with them the English language. However, when it comes to the proliferation of English, nothing comes close to the immense spread, rate of adoption and effect of the internet.*

The internet as we know it today is largely an American phenomenon. Our daily online needs are served almost exclusively by US internet giants based in Silicon Valley: Google, Facebook, Twitter, YouTube, Dropbox, Amazon, eBay and more. As a result, the internet’s design and evolution have been shaped by Western democratic values. We’d likely not have the internet in its relatively unstructured and decentralised current form had it come out of Soviet Russia. But with those values also came the language – English. The American Standards Association’s original ASCII code, the dominant encoding scheme of the web until a few years ago, defines only 128 characters to represent all the textual information a computer needs, to the exclusion of characters alien to English. A German equivalent, had Germany taken the lead, would certainly have accommodated accented characters. Still, regardless of which Western culture computing advances might have come from, for Southasia and other regions with non-Latin alphabets computing would still have had to be done in a foreign language and alphabet, or in unintuitive versions of their own languages.

UTF-8, an encoding of the Unicode character set popularised over the last decade, transcends the limitations of ASCII, representing over a hundred thousand characters under a single encoding scheme. This makes it possible to represent many different writing systems at once, instead of having to use a separate encoding for each. Today, it is the most popular character encoding on the web.
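The difference is easy to demonstrate. In this minimal Python sketch, the sample word is purely illustrative: ASCII simply cannot carry Devanagari text, while UTF-8 encodes each Devanagari letter as a short sequence of bytes.

```python
# Why ASCII cannot carry Devanagari, while UTF-8 can.
word = "नेपाली"  # 'Nepali' written in Devanagari: six Unicode characters

# ASCII defines only 128 code points, so encoding fails outright.
try:
    word.encode("ascii")
except UnicodeEncodeError:
    print("ASCII cannot represent Devanagari")

# UTF-8 maps each Unicode code point to one to four bytes;
# every Devanagari character here takes three.
utf8 = word.encode("utf-8")
print(len(word), "characters,", len(utf8), "bytes")
```

The same scheme that yields one byte per character for English text stretches to cover Devanagari, Sinhala, Tamil and the rest, which is precisely what makes a single web-wide encoding workable.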

Do you speak English?

Computer users from Southasia will remember the pains of typing and reading their native languages on computers until a few years ago. Today, the smoothness with which one can communicate in regional languages is remarkable, even though the transition to Unicode is not yet complete. In Nepal, for example, several government organisations and news outlets still use old standards, but online discourse that used to be dominated by Romanised transliterations has been replaced by streams of conversation in the original alphabet, accompanied by almost an abhorrence of Romanised variants.

The Unicode standard makes communication easier not only for the present, but also for the future and with the past, with the potential to accommodate many Southasian languages that do not yet have a digital presence. Languages that have become extinct can also benefit; with the development of proper Optical Character Recognition technologies, copies of ancient texts can be preserved in their original languages and forms.

But speakers of languages other than English are not the only ones struggling with the internet’s linguistic peculiarities. There have been worries about how the web has changed, or even ruined, standard English. Txt-speak, the abbreviated spellings and usages of a young generation enamoured of short mobile-phone messages and character-limited tweets, has introduced new words into the language (hashtag, LOL), and also altered the way many words are spelled.

For many, the use of English online might initially have been just the result of a lack of other linguistic alternatives, but the old habit has persisted even with the more varied options available today. On the internet, native English speakers have now been outnumbered by those who use it as a second language. Consequently, the English language on the internet exists in different, ‘impure’ forms, mixed with Chinese, Indian, Spanish or Singaporean specificities. These non-native speakers usually lack idiomatic knowledge of the language, and, together with the txt-speakers, are generally uninterested in the correctness of their English since their only purpose is to be understood. They make easy scapegoats for the falling standards of English online, where polished and elegant English prose is allegedly becoming rarer. And the same trap might be awaiting other languages now finding their footing online.

But that view might be too simplistic. Some analysts of the internet are quick to point to the nature of the network, which seems to be governed by the power-law distribution, or what is popularly called the ‘long tail’. For many things on the internet (search engines, for instance), there is a ‘long tail’ accounting for thousands of similar things, and a ‘head’ accounting for a few items that are most sought after (this would be Google). Someone at the online retailer Amazon described the phenomenon like this: “We sold more books today that didn’t sell at all yesterday than we sold today of all the books that did sell yesterday.” Put another way, the internet allows Amazon to profit from selling small numbers of each of thousands of less popular books in addition to the blockbusters, where a traditional bookshop would only stock the more popular titles. From this perspective, it’s not that high-quality material (and language) is disappearing online, but simply that there’s a lot more diversity of quality on offer. Since almost everyone who can write now publishes on the web, one may be mistaking the increasingly long tail of the distribution for the complete picture. The greater variety of linguistic usage online will not necessarily erode the standards of language in general, just as the sheer number of search engines available will not, in and of itself, challenge Google’s supremacy.

The power law can also be observed in the representation of different languages on the web. Until 2003, over 85 percent of all pages on the web were in English, forming the ‘head’ of the distribution. The number of non-English webpages has been increasing steadily, with just about half of all pages now in English. Within a few years, Chinese users will outnumber those of any other nationality on the internet. With the speed at which things change and are re-shaped by the internet, accurately predicting the impact all of this will have on language is a risky endeavour. What is certain is that the number of Chinese-language webpages will see a sharp rise, as will the number of pages in other languages.

Davids and Goliaths

The Chinese e-commerce website Alibaba is slated to become the world’s largest such venture in the next few years, of a size and nature unlike anything we have seen in the past. Chinese web services, like the search engine Baidu and the social network YY, will follow similar trajectories. Baidu already has a trans-Chinese presence, with regional offices in Singapore and designs to venture into the other languages of South-East Asia, of which Tamil is one. The search engine economy dictates that a company that already has more information attracts further information. Users will naturally be pulled towards services that provide better search results, and further improving searches will require knowing a large number of users’ preferences and search behaviour. Meanwhile, giants like Google and Amazon will keep attracting global talent, enabling them to improve and innovate faster. Internationally, other e-commerce websites and search engines serving local markets and languages persist – in France, Germany, Russia, Vietnam, Japan, Korea, and elsewhere – but, in the present environment, is there really a future for small, local-language online services, or are they simply waiting to be acquired by larger giants?

The scenario looks bleak for small local companies, but there is reason for some hope on the linguistic front. Online services are a market, after all, and can be understood better if we treat them as such and consider the incentives that service providers face. Internet companies that provide free services make their profits by selling advertising. They invest massively in developing models that can target ads more accurately to their users, which means showing each user those ads most likely to result in her clicking on them, preferably followed by a desirable action such as a purchase. In order to understand the user better, web service providers collect huge amounts of data from millions of users. But to understand what you’re looking for and discussing on the web, they need to understand your language, creating an obvious incentive to invest in computers that understand new languages, and hence make those languages usable online.

Given the world’s linguistic diversity, however, companies will, at least initially, prioritise linguistic groups deemed to have greater purchasing power. That explains why Google’s translation service is currently available for Latvian, with about two million speakers, but not for Nepali, with about thirty million. Other factors come into play too: a linguistic group’s aptitude for computer technologies, or a language’s affinity to the languages of groups with high earning potential, as is the case among several European languages.

On a pessimistic note, however, this does not bode well for the immediate online future of the languages of Southasia’s less stable and less economically prosperous peoples and areas. Nepal, with its more than 100 languages, is a good example. The country’s long-running political and economic stalemate has reduced the incentives for facilitating the use of its languages online, denying potential benefits to many minority groups and languages. The same holds true for other Southasian countries, where at least 500 different languages are spoken. Major languages like Hindi, Urdu and Bengali have a respectable presence in the global market and on the internet, although these are not yet comparable to European languages. But Sinhala, Assamese, Maithili and many others lag even further behind. Nor are market incentives alone sufficient to guarantee the online adoption of all the region’s languages: some languages are spoken by too few users to tempt attention and investment from private, profit-seeking companies.

The complexity of Southasian languages can also make the internet experience less than easy. Spelling variations and common misspellings can make it difficult to find what you are looking for using today’s search engines, which by and large are not designed to accommodate the sensibilities of our languages. For example, consider my search for an editorial in Nepal’s Kantipur daily titled Māobādī Atibād (Maoist extremism). Google cannot find the relevant webpage if one makes the minor and common mistake of spelling the first word as Māobādi (substituting the short ikaar for the long, which shortens the final i sound). A search engine familiar with the language would probably have suggested the correct spelling and search terms to the user, or would have automatically searched for variant spellings of the search words.
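One way such language-aware matching might work can be sketched in a few lines of Python. The normalisation table and the headline string below are illustrative assumptions, not how Google or Kantipur actually handle text: the idea is simply to collapse the short/long vowel-sign distinction before comparing.

```python
# A sketch of spelling-tolerant Devanagari matching: normalise short and
# long vowel signs to one form before searching.
NORMALISE = str.maketrans({
    "\u093f": "\u0940",   # ि (short i, hrasva ikaar) -> ी (long i, dirgha)
    "\u0941": "\u0942",   # ु (short u) -> ू (long u)
})

def loose_match(query: str, document: str) -> bool:
    """Match after collapsing short/long vowel-sign distinctions."""
    return query.translate(NORMALISE) in document.translate(NORMALISE)

headline = "माओवादी अतिवाद"   # Māobādī Atibād, as published
misspelt = "माओवादि"          # Māobādi, with a short final i

print(misspelt in headline)            # a literal search misses it
print(loose_match(misspelt, headline)) # the normalised search finds it
```

A production system would go much further – suggesting corrected spellings, handling conjuncts and transliteration variants – but even this crude normalisation recovers the query that a literal substring search loses.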

The positive side of this is that it presents exciting challenges for language engineers, but how much demand there is for refined searches in ‘non-mainstream’ languages remains unknown. Most Southasian internet users, for instance, tend to be educated people capable of using some amount of English when online, and who even seem to prefer it over their local languages, perhaps because of the higher quality of services and richness of experience offered in English, not to mention the experience of participating in a global community.

Social software

The real good news is that computers are getting better at understanding some human languages, and that this understanding may then apply to all human languages. Recent advancements have allowed researchers to quickly decrypt ancient texts in forgotten languages – work that would otherwise have taken months or years of manual effort.

Automatic language comprehension and translation could even help bring dead languages back to life, or to preserve and spread languages currently in limited use, by creating forums for their organic use, development and growth even when their users are geographically scattered. This is in stark contrast to the view that the internet is dumbing languages down and driving them to extinction, as critics would have us believe. There are already efforts by social media users and companies to encourage more content and communication in diverse languages, and to record, preserve and teach them. A great example is indigenoustweets.com, which documents the use of diverse languages on Twitter.

One could debate whether it is technology alone that drives changes in society, and to what extent social factors play a role. But, borrowing from internet pundit Clay Shirky, it seems that “when we change the way we communicate, we change society”. The internet is a new global medium of communication, and it carries certain cultural, linguistic and political values with it. What could that mean for our languages in the foreseeable future? Efforts to make it as neutral and free a medium as possible, both linguistically and otherwise, will take a long time, if they succeed at all. The decrease in the share of English-language webpages does not indicate true representation on the internet of the full diversity of the world’s languages. Advances in automatic translation could make people less interested in learning non-native languages in the future. That technology has not yet advanced to the stage where it can understand the intricacies of idiomatic usage, satire, complicated sentence constructions, and much else. Too much reliance on such limited automatic translation methods in order to understand other tongues could also lead to a simplification of common usage in certain languages, not because of the language users but due to the need to allow translation by machines. Time will soon tell if languages like Nepali and Sinhala will have to evolve simplified versions suitable for online use.

Still, how strong a language is on the internet does also depend on factors beyond the technology. Societies with strong traditions of maintaining public-domain resources, keeping information open and enriching public archives, seem to get ahead in the internet-language race. For instance, current language technologies like automatic translation require huge amounts of archived translations – what is called a parallel corpus – to allow for comparison between original texts and various translations. But even more limited, monolingual text collections have their benefits, as do publicly available electronic dictionaries and thesauri, creating strong incentives for digitisation and electronic archiving.

Perhaps it is not technology itself that brings about positive change, but the way we deal with technology. Existing social-political structures, the state of the civil society, the condition of scientific and human resources and the inclinations of the market are among the factors that determine what benefits a technology can bring about. One change leads to another, and snowballing effects enable societies that are well poised to take advantage of one wave of change to remain ahead during subsequent waves. Technological globalisation may have its downside, but hope for preserving rare languages and cultural heritage is found in some of the same technologies that can sometimes threaten them. What is clear is that we have never before had so many tools at our disposal to help our languages survive and flourish.

* In this article, for the sake of simplicity, the terms internet and World Wide Web are used interchangeably. While the former is the vast network of devices connected to each other, the latter is the system of interlinked documents that resides on top of the internet.