Presentations and working papers in the area of encoding

Encoding

For those working with minority languages, one of the first needs is the ability to work with the orthography of the language on a computer. The prerequisite for this is an adequate description of the orthography. But how does one go about writing that description, and what does it need to contain? This paper provides a detailed outline for such a description, with practical suggestions for a variety of script families, and addresses particular domain-specific issues.

The following presentation was prepared for the Non-Roman Technical Consultation 2001. It shows how an orthography statement can help in understanding and implementing a particular writing system.

The following paper was presented at the 31st Internationalization & Unicode Conference in October 2007. It gives an overview of the challenges encountered in the process of writing a Unicode proposal for the Tai Viet script.

Language Identification

An Analysis of ISO 639: Preparing the way for Advancements in Language Identification Standards

Authors: Peter Constable, Gary F. Simons

Abstract

Globalisation has led to an interest in an increasingly diverse variety of languages. Across industry, academic, and government sectors, there is a felt need for language identification standards that go beyond what are currently available. In response to these needs, ISO TC 37/SC 2 has resolved to begin a new work initiative to extend the ISO 639 family of standards.

Toward a Model for Language Identification: Defining an Ontology of Language-Related Categories

Author: Peter Constable

Source: 21st International Unicode Conference, Dublin, Ireland, May 2002

Abstract

This paper proposes a set of notions for language-related categories that are relevant for information technologies (IT) and examines the relationships between them. It explores some usage scenarios and considers what ways of formulating identifiers might be appropriate for the various scenarios.

Language Identification and IT: Addressing Problems of Linguistic Diversity on a Global Scale

Authors: Peter Constable and Gary Simons

Abstract

Many processes used within information technology need to be customized to work for specific languages. For this purpose, systems of tags are needed to identify the language in which information is expressed. Various systems exist and are commonly used, but all of them cover only a minor portion of languages used in the world today, and technologies are being applied to an increasingly diverse range of languages that go well beyond those already covered by these systems. Furthermore, there are several other problems that limit these systems in their ability to cope with these expanding needs. This paper examines five specific problem areas in existing tagging systems for language identification and proposes a particular solution that covers all the world's languages while addressing all five problems.

Software Issues

Unicode on the Front Lines: Endangered Languages and Unicode

This paper was presented at the 31st Internationalization & Unicode Conference in October 2007. It was part one of a four-part presentation and panel discussion on Endangered Languages and Unicode. Written from the perspective of SIL projects, it gives a brief look at the state of Unicode in relation to smaller language communities, such as those speaking endangered languages.

Transitioning a Vastly Multilingual Corporation to Unicode

In 2001 SIL International began a concerted effort to transition the organization to using Unicode.

Many steps were involved in our transition. These included getting computer support people on board; helping upper management to understand the complexities and the need; and developing tools for converting legacy data to Unicode, fonts that cover the Latin and Cyrillic repertoires as well as non-Roman fonts, software that uses Unicode, and training in the use of Unicode tools.

We have discovered that training is an ongoing process. We continue to find areas where we need to focus our training efforts. This paper addresses the steps taken, the problems that arose, and the solutions that were found.

This document was written for presentation at the 26th International Unicode Conference.

The State of Software Technologies Affecting NR Solutions

This document outlines the state of the art with regard to script-related technologies as it existed in early 1997, and as it exists as of October 2000. This was originally prepared as a departmental report for the benefit of NRSI staff and members of the NRSI Field Advisory Board in anticipation of the February 2000 meeting, the agenda of which included discussion of a new multi-year plan. I have made some revisions, primarily to add more up-to-date information (a surprising number of changes have occurred in the past eight months) but also to make it more suitable for a wider audience.

This gives a wide (but not deep) description of the technologies that affect us. Since several members of the NRSI staff and the Advisory Board found it helpful, I thought it might also be beneficial for others in SIL. Please note, however, that this is not a highly polished paper, and it is not intended as an introduction to script-related technologies. Explanations of technologies and technology issues are generally not very detailed, and a fair amount of familiarity with these things is assumed.

That said, there is a lot of material that will be of interest to people involved in language software support for field linguists. This includes both those who work with non-Roman scripts and those who work with extended Roman script and IPA.

Understanding Multilingual Software on MS Windows

Significant changes are occurring in the software industry to make software more capable in terms of handling multilingual and multi-script data. We are in a transitional period while these changes take place, and users often do not understand the newer standards and technologies that recent software is built on, or encounter problems bridging between old ways of doing things and the newer standards and technologies.

This paper is intended to help advanced users and computer support personnel by explaining the various paradigms that have been used for working with multilingual data on Microsoft Windows. It covers versions of Windows ranging from Windows 3.1 up to the current versions, Windows Me and Windows 2000.

The discussion is somewhat technical, but an attempt has been made to keep this to a minimum so as to reach the widest possible audience while still providing a complete explanation.