Engineering Language Documentation

This introductory chapter has four sections: first, a discussion of difficulties in the current practice of digital language documentation; second, a first introduction to a more abstract model of the information which constitutes data in documentation, and an account of why that model is designed as it is; third, a detailed description of a software library called docling.js, which implements the object model and allows the design of user interfaces for manipulating it; and finally, a bird’s-eye view of the set of technical skills necessary to understand sections 2 and 3.

Problems in current practice

While there are many applications for language documentation available to, and in broad use by, linguists, there is also widespread dissatisfaction with the interoperability of such software. While both ELAN and FLEx (to mention two very popular tools) are feature-rich and mature — ELAN for time-aligned transcription and FLEx for morphological analysis — it is no simple matter to use the two in concert, as they were designed independently. And such difficulties grow combinatorially as other tools are added into a workflow: if one wants to add Praat to a workflow which already includes ELAN and FLEx (a step which its fantastically robust range of features for phonetic analysis surely warrants), then the relationship of data from Praat to both ELAN and FLEx must be managed. Considering that many linguists work with as many tools again during their documentation (a spreadsheet such as Excel, a metadata tool such as CMDI Maker, a web dictionary generation tool like Lexique Pro, etc.), there may be dozens of interrelated steps in relaying data back and forth across the complete tool set.

This problem is primarily a problem of data: to add morphological glosses to a time-aligned ELAN transcript, one must first export ELAN’s XML format to something readable by FLEx, annotate, and then export from FLEx and import the result back into ELAN. Such a routine is essentially recognized as standard practice by influential linguists. In a description of a typical day’s work, one source lists the following steps as what it takes to digitize a day’s worth of fieldwork with a speaker:

Later that afternoon I use the handwritten notes in conjunction with the recording of the morning session to digitally transcribe the legacy text in ELAN. I then export it to Fieldworks [FLEx], where I simultaneously create consistent multi-tier IGT [Interlinear Glossed Text] and populate my growing lexical database. Once IGT is complete, I export the data back to ELAN to create the final time-aligned transcription. The transcript and the Fieldworks database are also prepared for archiving, entered into my metadata catalogue, and backed up locally. (Berez & Thieberger 2011: 114)

As an even more extreme example, consider the fact that FLEx does not run by default on the Macintosh operating system, and as a result linguists must (and do) jump through many hoops in order to make it do so. One common solution is to run a Windows “emulator” — an entire version of the Windows operating system — on top of the Macintosh operating system, so that FLEx and ELAN can be used on a single computer.

To step back and consider what data is actually being produced in such a workflow, imagine a single sentence with a transcription and a free translation:

The only difference in “information” between these two interlinears is that the one on the right has a timestamp: a start and a stop time. Enabling users to capture that information is, ultimately, the primary function of tools such as ELAN. Consider in that light the fact that adding such timestamps to a sentence, at least for some users, requires emulating an entire operating system:

Put simply, the problem is not just that the described workflow is a lot of data management work for one day of fieldwork (although it is). The problem is that the amount of work in the workflow is so out of proportion to the scale of the changes that are produced. Annotating a sentence with a timestamp should not require emulating an operating system.

Hierarchy and documentation

This is an extreme example of the kinds of difficulties documentary linguists face. But it is symptomatic of what I believe is the key shortcoming of current documentation software considered as a whole: linguistic data is replete with hierarchy, and software which is used to model linguistic data must treat each datum as part of that hierarchy. I will have much more to say about the specifics of the model in Chapter 3, but let us briefly consider, in a general way, what it means to “treat” linguistic data as hierarchical.

The most familiar visualization of hierarchy in linguistics is the syntactic tree, where the “leaves” of a tree usually correspond to words, and the intermediary nodes correspond to syntactic constituents. But other hierarchical relationships are also implied in a syntactic tree. For instance, the tree itself corresponds to some sort of entity with an identifiable status: in some linguistic schools, a syntactic tree is meant to be understood as a grammatically complete, independent unit. I hope to demonstrate that the question of whether that unit is a “grammatical sentence” or some more discourse-based entity like an intonation unit or turn is in fact largely orthogonal to the approach described here (this despite the author’s theoretical bias, which is firmly within the discourse-based tradition).

Principles for a data model

We aim for a simple, generic, practical model of linguistic data:

A simple model can be remembered easily.

A generic model doesn’t impose too many assumptions about what a linguist will wish to annotate.

A practical model is one which enables the design and implementation of interfaces that are intuitive and useful for working linguists.

We will name the levels of the hierarchy using the most familiar terminology available, but those terms will be used in a very general way. Thus, we will define a “sentence” as the documentary equivalent of “a container of words.”

(We will see in Chapter 3 that Sentences may also have attributes in their own right: in particular, one or more orthographic transcriptions, one or more free translations, and so forth.)

Of course, this begs the question of how a “word” is to be identified. Again, I take the most generic, least committal definition imaginable: the only non-negotiable criterion for a word is that it has a form and a gloss. (The default definition for the content of forms and glosses will be “a string describable by the Leipzig Glossing Rules,” but even that convention is malleable; see Chapter 3.) Terminology relating to the inner structure of words is extensive, and quite old: labels such as “affix,” “stem,” and “root” are all used in particular documentary and descriptive traditions. Appealing again to a reductionist, theory-agnostic interpretation of the most generic term available, I will emphasize the notion of “words as containers” by referring to their parts simply as “morphemes.”

This simple kind of containment hierarchy, where a unit may be analyzed into smaller units, and composed into higher units, is the basis of this dissertation. We will see that there are many advantages to modeling documentary data in accord with a basic, “core” hierarchy which extends from (at least) the level of the morpheme to the level of the corpus, and beyond.

For much of linguistic theory, the sentence > word > morpheme relationships are perhaps the most common focal points. But documentary linguists especially must climb higher, characterizing data at higher levels which bleed into the domains of discourse and even archiving. Just as a sentence may be generically labeled “a container of words,” a text may be labeled (very) generically as a container of sentences. And in turn, texts may be collected into a sequence which constitutes a “corpus” level.

corpus

text

sentence

word

morpheme

phoneme

A containment hierarchy for standard levels in documentation

The actual implementation of this hierarchy will be developed in Chapters 2 and 3, and put to use in application development in Chapters 4 and 5. Each higher-level element contains a sequence of elements at the adjacent lower level. We will see (in Chapter 3) that this simple design allows the cross-level searches which are of such utility to the process of documentation: one should be able to search a corpus for words of particular kinds; it should be possible to filter out particular morphemes from an entire text, and so forth.
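The utility of such cross-level searches can be previewed with a small sketch. The snippet below is illustrative only, not docling.js’s actual API: the data shapes and the function name `findWords` are assumptions made for this example, and the word forms are placeholders rather than real data.

```javascript
// A toy corpus fragment in which each level contains a sequence of the
// next level down (corpus > text > sentence > word).
const corpus = {
  texts: [
    {
      title: "Text A",
      sentences: [
        { words: [{ form: "kuní-un", gloss: "PAST-2S" },
                  { form: "ita", gloss: "flower" }] },
        { words: [{ form: "kuní-un", gloss: "PAST-2S" }] }
      ]
    }
  ]
};

// A cross-level search: collect every word in the corpus whose gloss
// satisfies a predicate, regardless of which text or sentence holds it.
function findWords(corpus, predicate) {
  return corpus.texts
    .flatMap(text => text.sentences)
    .flatMap(sentence => sentence.words)
    .filter(predicate);
}

// All words glossed with PAST, drawn from the whole corpus at once:
const pastWords = findWords(corpus, w => w.gloss.includes("PAST"));
```

Filtering particular morphemes out of a text would work the same way, with one more step down from words to their morphemes.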

However, many existing software tools for language documentation are built on a model of documentary data which is fragmented in accord with particular steps in the documentation process, as opposed to beginning with a complete, hierarchical model and designing tools that address data within that hierarchy.

A change in direction


Status quo:

Train linguists to better use existing tools

Train linguists in workflows which allow integration of data between existing tools

Institutionalize knowledge gained in order to increase the reusability (Bird & Simons 2003) of documentary data

If we look at Thieberger and Berez (2011), this seems to be the practical advice.

This dissertation advocates a drastic change in direction, for at least some subset of linguists working in language documentation:

Train linguists to design and implement new user-facing applications for language documentation.

This is a drastic change in direction: is it worth it? Is it possible?

Here I suggest that not only is it possible, it may be the most feasible course of action for the field moving forward. But only if we meet certain criteria:

prioritize universality

This means using web standards, and that means using the most durable aspects of the web technology ecosystem: HTML, CSS, and Javascript.

Structure of this dissertation

Without a general model of documentary data which addresses (at least) all levels of the hierarchy described above, application development within the field will continue to result in applications that “silo” the data they consume and produce within a subset of that hierarchy. For instance, time alignment applications will not allow users to directly address the morphological structure of words, and conversely, applications for producing morphologically annotated texts will not evolve to allow time alignment information to be integrated into their interfaces. This state of affairs is a serious problem for the task of increasing the number of linguists who are effectively trained to apply their theoretical understanding to language documentation, because good documentation must inevitably “shift levels”: it must describe not only the morphology and syntax independently, for instance, but also how the two interact. And interactions may span more than two levels: the study of prosody, for instance, refers to phonetic data, but its associated phenomena may be expressed at the syntactic or discourse levels.

It is the current problems of this kind — those arising from an incompletely expressed hierarchical model — that this dissertation will take on. My proposed solution is bipartite. First, it includes an “object model” of documentary data, which extends beyond the core hierarchy above to include lexicons, grammatical inventories, language profiles, and a flexible multi-language or multi-dialect level called a compendium; this model will be described in Chapter 2. Second, it includes an understandable implementation which uses that model to reflect actual workflow practices.

Linguistic units as computational objects

What will be described here, then, is an “object model” of all documentary data which represents each unit in the data hierarchy with a computational mechanism called an object, which, put briefly, is simply a bundle of attributes. Those objects are grouped into the aforementioned sequences by means of the array (or ‘list’) mechanism, which is specifically designed to capture the concept of a sequence, in that it may be sorted, filtered, combined with other arrays, and so forth. We will begin to broach the details of how these objects are implemented (programmed) at the end of Chapter 2. In the meantime, we will simply address objects and arrays as abstractions themselves, using a convenient visualization called a tabulation.
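As a first concrete glimpse of these abstractions, here is a sketch of a word and a sentence as Javascript objects and arrays. The attribute names are assumptions made for illustration (the model’s actual definitions come in Chapters 2 and 3), and the second word is a placeholder rather than real data.

```javascript
// A word is an object: a bundle of attributes.
const word = { form: "kuní-un", gloss: "PAST-2S" };

// A sentence has attributes of its own, plus a sequence (array) of words.
const sentence = {
  transcription: "…",
  translation: "…",
  words: [
    { form: "kuní-un", gloss: "PAST-2S" },
    { form: "ita", gloss: "flower" }     // placeholder word
  ]
};

// An attribute is recovered by property name, a member by offset:
const gloss = sentence.words[1].gloss;

// The array mechanism supplies sorting, filtering, and combining:
const glosses = sentence.words.map(w => w.gloss);
const sortedByForm = [...sentence.words].sort(
  (a, b) => a.form.localeCompare(b.form)
);
```

The “tabulation” visualization introduced below is simply a convenient way of drawing structures like these.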

Designing applications for documentary workflows

The second component of this dissertation is a detailed description of a functioning software library which provides a usable computational implementation of the object model, as well as a detailed account of how to build user interfaces which allow linguists to create, edit, display, and repurpose documentary data. Considerations of data portability, archivability, accessibility, and maintainability are brought to bear, suggesting that web browser technologies (HTML, CSS, JSON, and Javascript) are an excellent choice for this implementation (Chapters 4-5).

The two stages must be described together, because the object model imposes a constraint on the design of any application which makes use of it: the application must never sever the link between any data that it creates or modifies and the other objects in the object model hierarchy. For example, an application which edits time alignment information at the sentence level must not discard data about morphological analysis which is associated with one of its sentences. Adhering to this constraint does not restrict the range of applications which may be designed and implemented to carry out specific steps in a documentary workflow. To the contrary, it expands that range by orders of magnitude.

The sixth and final chapter is a plan for adoption: how can a community be built to participate in the creation of new kinds of interfaces for documentation workflows? Some linguists will wish to learn to program, others will not. Even so, learning to understand the object model is a valuable component of understanding the system as a whole, and well within the reach of linguists who already have the technical skills to manage complex software — that is to say, essentially all linguists.

Data: an object model for documentary data

One important benefit of creating an object model is that it separates content from presentation. An interesting way to appreciate this fact is to consider the Boasian Trilogy. Documentary linguists are familiar with this notion, which Evans (2010: 223) describes as follows:

the so-called Boasian trilogy, named after Franz Boas, which can present a rich portrait of a language in three mutually illuminating volumes: grammar, texts, and dictionary.

There is something of a paradox lurking in this topic, as Boas himself, as well as other well-known “Boasians” (most notably, perhaps, Edward Sapir), did not themselves seem to actually produce dictionaries in anywhere near the volume with which they produced grammars and corpora.

Haviland ascribes this to the sheer scale of the lexicon:

In the Boasian trilogy for language description of grammar, wordlist, and text, it is surely the dictionary whose compilation is most daunting. The process begins with a learner’s first encounters with a language, and it ends, seemingly, never. (Haviland 2006:129)

This is certainly true, but there is another way of thinking about “dictionaries”: the dictionary is the aggregation of all unique words in the corpus. From this stance, Boas and company did produce dictionaries, or at least, a “lexicon,” insofar as they provided word-level glosses for every word in their corpus. Perhaps they knew that deriving a dictionary from a corpus was at least feasible, whereas the opposite was not, and they prioritized accordingly. Take, for example, this short text in San Martín Duraznos Mixtec:

Concatenate all the glossed words in all the sentences into a single list

Remove duplicates from the list

Sort the list alphabetically

…then the result may be called a “dictionary” of sorts. Because the object model developed here makes “all the glossed words in a sentence” an addressable unit, processes such as concatenation and sorting may be made meaningful. Thus, the following small “dictionary” is derived directly from the text above: there is no separate lexicon file; it is simply a rearrangement of the word objects nested within the text.

Note also that the appearance of the words from a text may be changed drastically; for instance, here is a rather anachronistic view presenting the same lexical data in a format one might expect to see in a footnote in an early 20th-century edition of IJAL (except, of course, that the glosses would not be in Leipzig Glossing Rules notation!):

Note, again, that both of these presentations are generated dynamically. There is no “lexicon” file behind those presentations, there is only a single text. Of course it will be useful to combine the vocabularies of multiple texts (a corpus) into a lexicon in order to support search and semi-automated glossing, processes which will be described in Chapters 3-5.

Here is one more example of the sort of dynamic display which may be built on top of an object model: once again, we render the same text, but this time in a side-by-side, “parallel” text view. Such a rendering may be useful in revitalization or literacy contexts. Note that hovering over either side highlights the corresponding sentence on the other “side”.
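The essential logic behind that hover behavior can be sketched independently of the browser. Because both “sides” are rendered from the same sentence objects, each rendered element can record which sentence it came from; highlighting then reduces to finding every rendered element that records the same sentence. The shapes and names below are illustrative assumptions, not an existing API; in a browser, the same lookup would run inside a mouseenter listener over elements tagged with, say, a data-sentence-id attribute.

```javascript
// Rendered elements on both sides of the parallel view, each recording
// the id of the sentence object it was generated from.
const rendered = [
  { side: "original",    sentenceId: "s1" },
  { side: "original",    sentenceId: "s2" },
  { side: "translation", sentenceId: "s1" },
  { side: "translation", sentenceId: "s2" }
];

// Hovering an element highlights every element sharing its sentence id,
// including the corresponding rendering on the other side.
function toHighlight(rendered, hoveredId) {
  return rendered.filter(el => el.sentenceId === hoveredId);
}

const hits = toHighlight(rendered, "s1");
// hits contains the "s1" rendering from each side.
```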

Once again, note that all three of these “views” are derived from a single text.

Information, not artefacts

Transformations of documentary data of this type are possible only because the display of the data is separated from the way in which it is stored. One may speak of “artefacts” as opposed to “information”. While it is traditional in language documentation to speak of physical objects as “data”, the term is very diffuse in meaning. In many documentary contexts, it seems to be used in a sense which is close to “artefact”: “data” is to be understood as “all the physical and virtual objects which are produced during fieldwork.” For instance, in a general discussion of data in documentation, Thieberger and Berez (2012: 91) define “data” broadly, in terms of the materials that result from fieldwork:

For this chapter, ‘data’ is considered to be the material that results from fieldwork, which may include: primary observational notes, recordings, and transcripts; derived or secondary material such as lexical databases and annotated texts; and tertiary material, in the form of analytical writing, published collections of texts, or dictionaries. In addition, fieldwork results in elicited data: elicitation is always a part of fieldwork, and many linguists also use questionnaires and experimental field methods […]. All of these types of data and their associated metadata (see below) together make up the larger documentary corpus (Thieberger and Berez 2012:91)

In a seminal paper (Bird & Simons 2003) on the portability of data in language documentation — that is to say, the degree to which data is not tied to a particular software tool — the authors state at the outset that data and software are separate issues:

Portability is usually viewed as an issue for software, but here our focus is on data.

However, they immediately explain that their definition of data is to be understood as “artefacts” of the documentary process: equating files, publications, and manuscript objects:

By ‘data’ we mean any information that documents or describes a language, such as a published monograph, a computer data file, or even a shoebox full of handwritten index cards.

While the current dissertation will discuss a specific software implementation, and will describe (in Chapter 3) a common file format for storing data produced by that tool, it is important to point out that “data” may also be understood in a way which is more abstract than any sort of “artefact”, digital or physical. As a preview of the sense of “data” which will be developed in Chapter 2, we may define (documentary) data as all recorded information about a language, organized in such a way that all of its component parts (and component subsets) may be recovered, either by retrieving the value of an attribute by property name or by numerical offset within a list. There will be much more to say about this definition as we progress, but let us quickly summarize the types of information which documentary linguists find useful in their work, taking as evidence the kinds of data handled by the most commonly used applications for language documentation.

Ochs’s “decisions” on layout versus modern views

It is worth noting that even the possibility of this kind of flexible re-rendering of data has only fairly recently come within reach of the larger research community. In a classic 1979 paper, Elinor Ochs describes the question of whether a transcript of a child-adult interaction should be displayed with the child or the adult on the left:

“In this situation [where there is a side-by-side transcription with a child and an adult transcribed as columns], the transcriber who has opted for parallel placement of speaker turns has to decide which speaker is to be assigned to the leftmost speaker column and which to the right.” (Ochs 1979)

Ochs is saying that if you don’t decide wisely with regard to the layout of your data, then you stand to limit the reader’s understanding of the data: there are theoretical consequences which arise from choices about layout and presentation. There is an implied assumption that such decisions are made once, and then the result is printed or otherwise published in a “final” form — in a print journal, a book, or as an online PDF, for instance. Of course, some form of “print” publication containing linguistic analysis will remain the primary currency of linguistics, writ large. But the thorough way in which the computational world has insinuated itself into our work processes demands a reconsideration of just what documentary “data” is.

I hope that the brief gallery of transformations of textual and lexical data above demonstrates that such decisions are no longer as final as they were when Ochs was writing. Designing useful and even novel displays of data can bolster the research process, rather than constrain it.

The range of data types in current use

Documentary linguists currently use an array of software tools to capture data. Many of these tools are highly capable in allowing linguists to create and interact with the data which they were designed to handle. I would suggest that current difficulties in software for language documentation are not due to the design or functionality of any of the individual applications (indeed, several are quite mature and robust); the problem is that each tool is designed in terms of only a subset of the levels of the hierarchy mentioned above.

Tool            Phonetics   Morphology   Syntax   Time Alignment   Media
Word Processor  no          no           no       no               no
Spreadsheet     no          ?            ?        no               no
ELAN            no          no           no       yes              yes
Toolbox         no          yes          yes      no               no
FLEx            no          yes          yes      no               no
Praat           yes         no           no       yes              yes

Note that no tool covers all the categories of data. In this dissertation I hope to show that it is possible to build an ecosystem of applications which are truly interoperable (addressing a unified object model) using the Open Web Platform.

Hierarchy, Search, and Digital Documentation

In principle, the workflow of documentation imposes no particular constraints on the medium used to inscribe the documentary data: notebook and fileslip records are just as capable of encoding a completely categorized documentary corpus as anything one might create with a computer, through the traditional means of textual cross-reference. (For an astonishingly detailed instance of a purely textual cross-referenced grammar, see Heath 1975.)

In this dissertation, I take up the task of using computers to enable documentary workflows that are entirely digital. What does digital documentation really offer in terms of making data creation and usage efficient and effective?

One feature of digital representations must be foregrounded, as it is particularly germane to data about human language: digital representations are particularly well-suited to encoding and manipulating hierarchical relationships. This is because computers can address and recover data via reference, in a way that in physical guise would require duplication. This issue is addressed with exemplification from the history of linguistic scholarship in Chapter 2, but briefly, consider the fact that the word kuní-un ‘PAST-2S’ may be simultaneously thought of as a “word” with an interpretable independent meaning, and also as the second token in a sentence (utterance) of four words. To capture both of those conceptualizations in a paper world would require at least two pieces of paper. In a computer, if the system is carefully designed, the “word” and “token” guises of the word may be synchronized, in the sense that a database can be made to recognize that the word and the token are in some useful sense the same thing. In short, documentary data itself will be represented as a tree structure. Much more will be said about the details of this tree-shaped data structure in Chapter 3.
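In Javascript, this synchronization falls out of the fact that objects are held by reference. The sketch below is illustrative only (the empty objects are placeholders standing in for the other three words of the sentence):

```javascript
// One word object…
const kuniUn = { form: "kuní-un", gloss: "PAST-2S" };

// …serving simultaneously as a lexical entry (the "word" guise)…
const lexicon = [kuniUn];

// …and as the second token of a four-word sentence (the "token" guise).
const sentence = { words: [{}, kuniUn, {}, {}] };

// Both guises are literally the same object, so an edit made through one
// is immediately visible through the other, with no duplication:
kuniUn.gloss = "PST-2SG";
// sentence.words[1].gloss is now "PST-2SG" as well.
```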

Applications: using the object model

Understanding a model of documentary data is a useful first step, but without the ability to interact with that data it is of only theoretical interest. Language documentation is in part a craft. It has workflows and interleaved stages of acquisition and analysis. It is undeniable that some of those stages may involve tedium. If the physical work of entering data is too onerous, it may become unergonomic, or even physically painful. This work is not primarily a theory of language documentation in the abstract, although it includes a simple and extensible model of documentary data. What it primarily is, rather, is a sort of construction manual: a guide to understanding and building tools which are in turn used for building and using documentary databases.

This might seem at first blush to be a task which is hopelessly remote from the painstaking work of doing documentation itself, but it is my opinion, after working in this field, that most working linguists are unhappy with their software options, and spend a fairly shocking percentage of their working time trying to force independently designed software applications to interoperate across a motley array of data formats, operating systems, and sundry other variables.

Implementation with the Open Web Platform

Fortunately, we do not have to start from scratch. We can build on the most familiar and widely supported application development environment ever created: the web browser. I emphasize at the outset that I will be describing a system of building out applications within the web browser, as opposed to “on” the World Wide Web (the internet).

More specifically, we will be employing what is sometimes called the “Open Web Platform” — the combination of technologies which makes web browsers work. The Open Web Platform (henceforth the OWP) is a family of technical standards which has evolved, and continues to evolve, since Tim Berners-Lee proposed the World Wide Web in 1989 and published the first web pages in 1991. Nearly three decades later, the basic design of that system continues to function. When one bears in mind that HTML predates the first iPhone by more than fifteen years, the fact that an HTML page from 1991 is still usable on essentially all web-enabled devices today is something quite remarkable. (Astonishingly, a restored copy of the first web page ever created is still visible at http://info.cern.ch/hypertext/WWW/TheProject.html.)

Since that time, the functionality of the OWP has grown far beyond the simple document exchange format of early HTML. 1993 saw the first inclusion of images in web pages, in the now-extinct NCSA Mosaic browser. In a pattern which would repeat itself, the extension to HTML that Mosaic introduced outlived the browser itself, eventually becoming part of a web standard maintained by an independent international standards body, the W3C. Images were eventually followed by full support for Unicode text, audio, and most recently video content within web pages.

Most consequentially from the point of view of this dissertation, in 1995 a full programming language called Javascript was added to the browser. At first this meant that simple modifications of the content of a web page could be programmed directly within the HTML file itself: so, for instance, a title heading might be made to blink or exhibit some other irritating and largely useless animation. In its early days, Javascript thus garnered a reputation as a sort of “toy” programming language, useful only for trivial antics on web pages. But its scope has grown to the point that it can now be used to control essentially all of the content within an HTML document.

Thus, it can be programmed to “listen” for user behavior such as clicks, and then to respond accordingly, by updating or otherwise modifying the content of a rendered HTML page. Crucially, because Javascript is itself subject to a formal standards process (it is standardized as ECMAScript by Ecma International), the same program can run identically in most modern browsers. Meanwhile, browsers themselves are now “evergreen”, automatically updating to handle the newest standard specification of Javascript. These factors combine to create a totally unprecedented programming environment, one which stands to address some of the vexing cross-platform issues documentary linguists face today. But that promise can only be realized if we can create the necessary applications.
That is the challenge to which I hope to invite my colleagues. We have before us an opportunity to re-create the way we interact with our data, if only we can “raise the boat” enough in terms of the level of expertise in application development in the field.

Where we want to get

We saw above a few “views” of a text and a simple lexicon. Below I show a slightly different display of the same content, with a more complex interface. This is the “fundamental” application in this dissertation: an editable, searchable, interlinear text interface with word search and time-aligned media playback.

In order to implement new applications, one needs to learn how to structure data, how to query data, how to present data, how to make those presentations interactive, and, closing the loop, how to use an interactive presentation to modify data (“editing”). We will use a programming technique called object-oriented programming to implement this functionality.

Finally, modern documentation is not exclusively research-oriented. It should also serve as the basis for revitalization and language learning, and so we should consider those uses throughout the documentation process. Throughout, I strive to help newcomers to programming understand how the code in the implemented system works. This work is intended to convince documentary linguists that they can participate in the design of the next generation of interfaces for language documentation, whether they choose only to take part in the discussion of how interfaces could best present data, or to delve fully into the process of learning to program in their own right. (This author had no formal training in programming, as is the case for many other web developers.)

View Classes: classes that associate the presentation and the data of a particular object (such as the parallel view of a text above).

The models will be exhaustively detailed in Chapter 3, but by way of preview, the list includes: Language, Word, Sentence, Text, Corpus, Lexicon, (Grammatical) System, and Grammar. Each of these models of data has a basic default view, called WordView, SentenceView, TextView, etc., which enables basic presentation and editing of that model. These basic views can be composed as the subviews of larger “applications,” each of which is programmed to manage the interactions of its subviews. Thus, a GlossingView could orchestrate the interactions of its subviews (a SentenceView, a TextView, and a LexiconView) to assist in interactive, semi-automated interlinear glossing of a narrative or other text content. This process will be laid out in Chapter 4.
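The composition idea can be previewed with a deliberately simplified sketch. The render methods and their string output below are assumptions made for illustration, not the library’s actual API; the real view classes are specified in Chapters 3 and 4.

```javascript
// A view pairs a model with a way of presenting it.
class WordView {
  constructor(word) { this.word = word; }
  render() { return `${this.word.form} '${this.word.gloss}'`; }
}

// A SentenceView does not render words itself: it composes WordViews
// as subviews and delegates the rendering of each word to them.
class SentenceView {
  constructor(sentence) {
    this.subviews = sentence.words.map(w => new WordView(w));
  }
  render() { return this.subviews.map(v => v.render()).join(" "); }
}

const view = new SentenceView({
  words: [
    { form: "kuní-un", gloss: "PAST-2S" },
    { form: "ita", gloss: "flower" }   // placeholder word
  ]
});
const line = view.render();
// line: "kuní-un 'PAST-2S' ita 'flower'"
```

A larger application view would compose SentenceViews, TextViews, and so on, in exactly the same way, mediating the interactions among them.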

References

Evans, Nicholas. 2010. Dying Words: Endangered Languages and What They Have to Tell Us. Vol. 22. John Wiley & Sons.