We present MExiCo (short for "Multimodal Experiment Corpora"), a programming API and library for the creation, management and analysis of multimodal corpora and data collections. It is being developed in the infrastructure project of a Collaborative Research Centre that investigates the phenomenon of alignment in communication, as introduced by Pickering & Garrod (2004). The Centre's 13 projects bring together researchers from many different fields (among them psychology, linguistics, computer science and cognitive science) who have produced a variety of data sets containing communicative phenomena. These data sets cover multiple modalities (speech, gesture, gaze, facial expressions, etc.), are represented and stored in different formats, and are coded according to different theories. With interoperability as one of its major goals, our infrastructure project evaluated several theories and data models for corpus and data management, looking for a candidate that was able to
• express microstructural entities and relations such as elements in transcripts, annotation documents, treebanks, etc.;
• express macrostructural entities and relations such as the experimental setup (consisting of roles and types of participants, variables, number, type, and duration of trials, etc.) and the resources that together form a corpus or data collection (such as audio and video files, speech transcripts and annotation documents as a whole, etc.);
• allow for multiple versions of data sets (e.g., multiple annotations of the same phenomenon as a basis for agreement calculations).
Our result was that none of these models, although perfectly suitable for particular subsets of our data, was able to handle the entirety of our data collection. In many cases we identified the main problem as a differing understanding of central terms such as "corpus" or "transcript". Although "corpus" is usually defined as "a finite set of concrete linguistic utterances that serves as an empirical basis for linguistic research" (Bußmann 1996:106), along with subsequent annotations, this definition is too narrow for our field. Even with the addition of an abstract timeline for anchoring multiple events (as in, among others, Bird & Liberman 2001, or Evert et al. 2003), we require a more complex axis system. It must support multiple timelines (for cases where data sets are bound to several timelines for which no synchronisation has been defined yet) as well as spatial systems (necessary for modelling, e.g., gestures, head movements, and actions in dialogue games where spatial actions are of interest, such as object arrangement games).
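To make the axis requirement concrete, the following is a minimal sketch in Ruby of how events might be anchored to several timelines and a spatial axis at once. All names (Axis, Link, Item, comparable?) are hypothetical and chosen for illustration only; they are not taken from MExiCo's data model.

```ruby
# Hypothetical sketch; none of these names are taken from MExiCo itself.

# An axis is one named dimension: a timeline or a spatial coordinate.
Axis = Struct.new(:name, :kind, :unit)

# A link anchors an item to the interval [from, to] on exactly one axis.
Link = Struct.new(:axis, :from, :to)

# An item is one annotated event; it may be linked to several axes.
Item = Struct.new(:label, :links) do
  # Returns the item's link on the given axis, or nil if there is none.
  def link_on(axis)
    links.find { |link| link.axis == axis }
  end
end

# Two items can only be related on an axis that both are anchored to.
def comparable?(a, b, axis)
  !a.link_on(axis).nil? && !b.link_on(axis).nil?
end

audio_time = Axis.new("audio recording", :temporal, "s")
video_time = Axis.new("video recording", :temporal, "s") # not yet synchronised with audio_time
table_x    = Axis.new("table, x direction", :spatial, "cm")

word    = Item.new("word: there", [Link.new(audio_time, 12.4, 12.9)])
gesture = Item.new("pointing gesture",
                   [Link.new(video_time, 3.1, 4.0), Link.new(table_x, 20.0, 35.0)])

puts comparable?(word, gesture, audio_time) # => false
```

In this sketch the word and the gesture cannot be related in time until a synchronisation between the two timelines is defined, while the gesture already carries a spatial extent of its own.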
On the basis of those theories we propose MExiCo, a generic data model capable of dealing with heterogeneous data collections such as those present in our Collaborative Research Centre. It will be available to researchers in three ways: as a library to be used in console scripts, as an HTTP API that can be accessed as a web service, and as the backend of Phoibos, a web-based corpus management application (Menke & Mehler 2011, Menke & Cimiano 2012). Through Phoibos, researchers can benefit from MExiCo's functionality without having to do any programming themselves. Programming against it is not difficult either: being implemented in Ruby, MExiCo’s core functionality benefits from Ruby’s flexible syntax and is designed as a DSL (domain-specific language). Researchers can therefore formulate queries, scripts and batch processes in an easy-to-understand language that attempts to stay as close to human language as possible, with as few of the formal requirements of a programming language as possible.
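As an illustration of what a Ruby-internal DSL for queries can look like, here is a minimal sketch. It is not MExiCo's actual API: the names Annotation, Query, find_in, on_layer and by_participant are assumptions made up for this example, and the technique shown (evaluating a block with instance_eval) is only one common way to build such a DSL in Ruby.

```ruby
# Hypothetical sketch of a Ruby-internal DSL; this is not MExiCo's actual API.

Annotation = Struct.new(:layer, :participant, :value)

class Query
  def initialize(items)
    @items = items
    @filters = []
  end

  # Each DSL word registers one filter condition.
  def on_layer(name)
    @filters << ->(a) { a.layer == name }
    self
  end

  def by_participant(name)
    @filters << ->(a) { a.participant == name }
    self
  end

  # Returns the items that satisfy every registered filter.
  def results
    @items.select { |a| @filters.all? { |f| f.call(a) } }
  end
end

# Evaluates the block in the context of a new Query,
# so the DSL words can be written without a receiver.
def find_in(items, &block)
  query = Query.new(items)
  query.instance_eval(&block)
  query.results
end

corpus = [
  Annotation.new("gesture", "A", "pointing"),
  Annotation.new("speech",  "A", "there"),
  Annotation.new("gesture", "B", "nod")
]

hits = find_in(corpus) do
  on_layer "gesture"
  by_participant "A"
end

puts hits.map(&:value).inspect # prints ["pointing"]
```

Because Ruby allows method calls without parentheses and blocks without extra ceremony, the query body reads almost like a short list of conditions, which is the property the paragraph above attributes to a DSL design.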