Baumgarten N, Herkenrath A, Schmidt T, Wörner K and Zeevaert L (2007), "Studying Connectivity with the Help of Computer-Readable Corpora: Some Exemplary Analyses from Modern and Historical, Written and Spoken Corpora", In Connectivity in Grammar and Discourse. Amsterdam Vol. 5, pp. 259-289. Benjamins.

Abstract: This paper discusses methodological aspects of the use of electronic language corpora for the study of connectivity. We demonstrate how a corpus-based approach was used to investigate functional characteristics of coordinating elements in sentence- or utterance-initial position across different languages (English, German, Old Swedish and Turkish), across different modalities (written and spoken) and across the diachronic dimension (historic and modern languages). Our focus is on the difficulties we encountered in this study when attempting to transfer corpus-based methods developed for the analysis of corpora of modern, written language to the analysis of corpora of historic or spoken language. We suggest an abstract corpus-linguistic workflow and discuss where and how this workflow differs according to the corpus type, and how well its individual steps are supported by current corpus technology.

@phdthesis{Becher,
author = {Viktor Becher},
title = {Explicitation and implicitation in translation. A corpus-based study of English-German and German-English translations of business texts},
school = {Universität Hamburg},
year = {2011}
}

Belz M and Klapi M (2013), "Pauses following fillers in L1 and L2 German Map Task dialogues", In Proceedings of Disfluency in Spontaneous Speech, DiSS 2013.

Abstract: This article describes the corpus of spoken Catalan elaborated within the research project “Phonoprosodic development of Catalan in its current bilingual context”. The corpus contains 174 interviews with speakers from three districts of Barcelona varying on the presence of Spanish. The subjects belong to three age groups: children aged 3 to 5, young people aged 19 to 23 and adults aged 32 to 40. The collected data consist of semi-spontaneous speech, free conversations, a role-play, a reading task and a sociolinguistic questionnaire. The goals of the project include auditory and acoustic analyses of Catalan segments (exemplified here by some results on vowels), the study of loan words and of cognates with different gender across Catalan and Spanish, as well as prosodic analyses of intonational phrasing of declaratives and interrogatives.

Abstract: This paper explores the question how language corpora can enhance discourse analytic research as well as communication trainings. To do this, we refer to the language corpus “Interpreting in hospitals”, and begin by describing it in detail. Subsequently, the paper exemplifies how the corpus was used to analyse ad-hoc-interpreting in medical settings, focusing on the function of specific linguistic elements and speech actions (Bührig & Meyer 2004). Finally, the paper shows how research findings based on the corpus and data from the corpus can be used in communication trainings, describing a training for bilingual hospital employees. The corpus allows trainers to identify relevant training contents, and it offers the possibility to integrate sections of authentic discourse in the training. The paper illustrates how the training participants accepted and worked with the discourse data, and draws conclusions concerning the use of corpus based analyses in trainings on workplace communication.

Abstract: The present paper reflects on methodological aspects of the data gathering, analysis, and reuse and will present the practical experience from designing a test battery and selecting and approaching the participants, to conducting the experiments. Our project aims to provide a descriptive survey of contact-induced change in two groups of bilingual (L1 Polish) speakers currently living in Germany. The corpus contains written and spoken, elicited and free data that complement each other with regard to investigating morphosyntactic phenomena. This paper gives a description of each part of our tests: selecting the topics, conducting the experiment, and choosing technical equipment needed for recording of the speech data; stimuli presentation and description of the software used for grammaticality judgments; construction of a gapped text, and finally the sociolinguistic questionnaire and self-evaluation tasks.

Abstract: This paper describes a new research initiative addressing the issue of sustainability of linguistic resources. This initiative is a cooperation between three linguistic collaborative research centres in Germany, which comprise more than 40 individual research projects altogether. These projects are involved in creating manifold language resources, especially corpora, tailored to their particular needs. The aim of the project described here is to ensure an effective and sustainable access of these data by third-party researchers beyond the termination of these projects. This goal involves a number of measures, such as the definition of a common data format to completely capture the heterogeneous information encoded in the individual corpora, the development of user-friendly and sustainably usable tools for processing (e.g. querying) the data, and the specification of common inventories of metadata and terminology. Moreover, the project aims at formulating general rules of best practice for creating, accessing, and archiving linguistic resources.

Abstract: The paper outlines and illustrates the design of the Hamburg Corpus of Argentinean Spanish (HaCASpa, compiled 2008−2009), which comprises oral data from two varieties of Argentinean Spanish (Buenos Aires and Neuquén, Northern Patagonia). Both varieties are characterized by prosodic features that can plausibly be traced back to the contact with Italian during the period of large streams of immigration between 1830 and 1950. After providing the reader with general information on the historical situation of Spanish-Italian bilingualism in Buenos Aires, the contribution focuses on the data types contained in the corpus and the speakers recorded. In addition, the main findings stemming from the analyses performed thus far based on the corpus are summarized.

Abstract: We give an overview of the content and the technical background of a number of corpora which were developed in various projects of the Research Centre on Multilingualism (SFB 538) between 1999 and 2011 and which are now made available to the scientific community via the Hamburg Centre for Language Corpora.

Abstract: This article discusses questions concerning the creation, annotation and sharing of spoken language corpora. We use the Hamburg Map Task Corpus (HAMATAC), a small corpus in which advanced learners of German were recorded solving a map task, as an example to illustrate our main points. We first give an overview of the corpus creation and annotation process including recording, metadata documentation, transcription and semi-automatic annotation of the data. We then discuss the manual annotation of disfluencies as an example case in which many of the typical and challenging problems for data reuse – in particular the reliability of interpretative annotations – are revealed.

Hedeland H and Wörner K (2012), "Experiences and Problems creating a CMDI profile from an existing Metadata Schema", In Proceedings of LREC-Workshop "Describing LRs with Metadata: Towards Flexibility and Interoperability in the Documentation of LR". ELRA.

Abstract: The paper presents a methodology for empirical multilingual data analysis that combines quantitative and qualitative research. The data is a bilingual Turkish-German and a monolingual Turkish corpus of spoken child language. The methodology proceeds in several steps: (1) description of transcribed data (PartiturEditor) and of the concepts of ‘constellation’ and ‘Evocative Field Experiment’ (EFE), (2) the methodological role of the linguistic unit ‘utterance’, its marking as ‘segment’ in transcriptions and its importance for corpus formation (CoMa), (3) search procedures and frequency assignment of the findings (EXAKT), (4) classification according to constellative features of the data, (5) contextual interpretation of the items, (6) consultation of the transcript where needed, (7) contextually based categorisation of the items resulting in an empirical determination of their varieties. The objective of the methodological stages is an empirical foundation of discourse-based linguistic analysis of multilingual corpora, which we call ‘Pragmatic Corpus Analysis’ (PCA).

Abstract: The synchronic and diachronic variability of historical texts poses substantial difficulties in the annotation and analysis of historical corpora. One main problem is that ongoing language change and particularly grammaticalisation phenomena lead to syntactic ambiguity. This contribution shows how such issues are dealt with in the TEI-based Hamburg Corpus of Old Swedish with Syntactic Annotation (HaCOSSA). The focus is on the development of strictly operational, explicitly defined, largely theory-neutral, language-specific and diachronically broad annotation categories.

Abstract: This paper presents results from an interdisciplinary cooperation within the Collaborative Research Centre on Multilingualism. First results of this cooperation were published in an earlier paper (BAUMGARTEN et al. 2007) concentrating on an investigation of functional characteristics of coordinating elements in English, German, Old Swedish and Turkish corpora. The aim of the second part of the cooperation was to develop corpus linguistic methods in order to be able to examine word order change in subordinate clauses in older Swedish and Danish texts in comparison to Old West Norse. The starting point for the investigation was the observation that the word order in Swedish main clauses is rather stable from the earliest written sources up to contemporary Swedish, whereas in subordinate clauses, from a diachronic perspective, far-reaching changes can be observed. Starting from the hypothesis that language contact triggered this change, a comparison of an Old Swedish, an Old Danish and an Old West Norse version of the Story of Charlemagne was performed. The West Norse version almost exclusively shows verb second order and no examples of verb late order. In the Danish and the Swedish versions, verb second is also the main option, but more examples of the finite verb in a later position can be found in both texts. In our opinion it seems to be reasonable to suggest that the development of new text types based on Latin models triggered the change that can be observed in the East Norse texts.

Abstract: This paper describes how to access and use a corpus of comparable consecutive and simultaneous interpreting (Brazilian Portuguese and German). The corpus is available free of charge. Our aim is to stimulate discussions on the use and the accessibility of corpora in interpreting studies, and, more generally, the need for corpus-based studies of interpreting.

Abstract: This contribution deals with the possibilities of distinguishing features of an established contact variety from singly occurring, transient elements using a corpus-based approach. It emphasizes the potential that lies in including different language registers (informal spoken language and formal written language) in the analysis of language contact, and hypothesizes that the register-specific establishment of contact phenomena is possible. This is shown through the example of Danish as it is used on the Faroe Islands, represented by the only two existing digitized and annotated corpora that reflect the bilingualism on the Faroe Islands.

Abstract: The HABLA-corpus (Hamburg Adult Bilingual LAnguage) comprises data in the form of semi-structured interviews gathered in the project E11, Linguistic Aspects of Language Attrition and Second Language Acquisition in adult bilinguals (German-French and German-Italian). E11 investigated the language of adult bilinguals (2L1 speakers) who grew up in Germany, Italy or France being exposed to two languages simultaneously from birth, comparing them to advanced second language (L2) learners. In this contribution, we explain the motivation for creating the corpus and introduce the corpus design, including information about the subjects, data acquisition and labelling, quality and transcription conventions, with the purpose of providing an overview of the corpus and facilitating its use.

Abstract: This article describes two longitudinal language corpora of child German and child Spanish. One of the corpora, PAIDUS, is comprised of the utterances produced by monolingual German and monolingual Spanish children, between the ages of 1 and ca. 3 years. The German children grew up in Hamburg (Germany) and the Spanish children in Madrid (Spain). The other corpus, PhonBLA, is comprised of utterances produced by German-Spanish bilingual children, between the ages of 1 and ca. 7 years, growing up in Hamburg (Germany). The bilingual children have a Spanish-speaking mother and a German-speaking father. All corpora were collected, transcribed and analyzed within various research projects supported by the DFG, between 1986 and 2011. Several analyses of the data have been published in international journals and books (see References).

Abstract: EXMARaLDA is a system for creating, managing and analyzing spoken language corpora (Schmidt & Wörner 2009, Schmidt et al. 2011), developed between 2000 and 2011 at the Research Centre on Multilingualism (SFB 538) at the University of Hamburg. It is now maintained at the Hamburg Center for Speech Corpora (HZSK), and since November 2011, also in cooperation with the Archive for Spoken German (AGD) at the Institute for the German Language (IDS) in Mannheim. It comprises tools for transcribing spoken language (Partitur-Editor), managing metadata (Corpus Manager), and querying spoken language corpora (EXAKT). The software components are freely available and operate on all platforms (Windows, Linux, Macintosh). EXMARaLDA forms the basis for 23 multilingual corpora of spoken language at the Hamburg Center for Speech Corpora. Its primary scope of application covers discourse and conversation analysis, first and second language acquisition studies, and dialectology (cf. Schmidt 2009: 158). This paper reviews the software from the perspective of its application in the GeWiss project, one of several larger corpus projects that have used EXMARaLDA. As a starting point, the review will introduce the software requirements of the project, and their role in choosing the EXMARaLDA package for the creation of the GeWiss Corpus. As we worked with all three components of the software, the review will then deal in turn with the Partitur-Editor (version 1.5.1), the Corpus Manager (version 1.9), and EXAKT (version 1.1). In conclusion, some remarks concerning support and compatibility of the software will be made.

Schmidt T (2015), "Good practices in the compilation of FOLK (Research and Teaching Corpus of Spoken German)", In Compilation and Annotation of Spoken Corpora: Towards Best Practice. (Special issue of the International Journal of Corpus Linguistics). John Benjamins Publishing Company.

Abstract: The Database for Spoken German (Datenbank für Gesprochenes Deutsch, DGD2, http://dgd.ids-mannheim.de) is the central platform for publishing and disseminating spoken language corpora from the Archive of Spoken German (Archiv für Gesprochenes Deutsch, AGD, http://agd.ids-mannheim.de) at the Institute for the German Language in Mannheim. The corpora contained in the DGD2 come from a variety of sources, some of them in-house projects, some of them external projects. Most of the corpora were originally intended either for research into the (dialectal) variation of German or for studies in conversation analysis and related fields. The AGD has taken over the task of permanently archiving these resources and making them available for reuse to the research community. To date, the DGD2 offers access to 19 different corpora, totalling around 9000 speech events, 2500 hours of audio recordings or 8 million transcribed words. This paper gives an overview of the data made available via the DGD2, of the technical basis for its implementation, and of the most important functionalities it offers. The paper concludes with information about the users of the database and future plans for its development.

Abstract: FOLK is the "Forschungs- und Lehrkorpus Gesprochenes Deutsch (FOLK)" (eng.: research and teaching corpus of spoken German). The project has set itself the aim of building a corpus of German conversations which a) covers a broad range of interaction types in private, institutional and public settings, b) is sufficiently large and diverse and of sufficient quality to support different qualitative and quantitative research approaches, c) is transcribed, annotated and made accessible according to current technological standards, and d) is available to the scientific community on a sound legal basis and without unnecessary restrictions of usage. This paper gives an overview of the corpus design, the strategies for acquisition of a diverse range of interaction data, and the corpus construction workflow from recording via transcription and annotation to dissemination.

Abstract: This paper presents some concepts and principles used in the development of a database of multilingual spoken discourse at the University of Hamburg. The emphasis of the first part is on general considerations for the handling of heterogeneous data sets: After showing that diversity in transcription data is partly conceptually and partly technologically motivated, it is argued that the processing of transcription corpora should be approached via a three-level architecture which separates form (application) and content (data) on the one hand, and logical and physical data structures on the other hand. Such an architecture does not only pave the way for modern text-technological approaches to linguistic data processing, it can also help to decide where and how a standardization in the work with heterogeneous data is possible and desirable and where it would run counter to the needs of the research community. It is further argued that, in order to ensure user acceptance, new solutions developed in this approach must take care not to abandon established concepts too quickly. The focus of the second part is on some practical experiences with users and technologies gained in the four years’ project work. Concerning the practical development work, the value of open standards like XML and Unicode is emphasized and some limitations of the “platform-independent” JAVA technology are indicated. With respect to users of the EXMARaLDA system, a predominantly conservative attitude towards technological innovations in transcription corpus work can be stated: individual users tend to stick to known functionalities and are reluctant to adopt themselves to the new possibilities. Furthermore, an active commitment to cooperative corpus work still seems to be the exception rather than the rule.
It is concluded that technological innovations can contribute their share to a progress in the work with heterogeneous linguistic data, but that they will have to be supplemented, in the long run, with an adequate methodological reflection and the creation of an appropriate infrastructure.

Abstract: This paper attempts a new look at computer assisted transcription as it is commonly practised within the fields of discourse analysis and language acquisition studies. The first part proposes a bridge between discourse analytical methodology and text technological methods with the concept of modelling as its central idea. The second part demonstrates the EXMARaLDA system, a set of formats and tools for computer assisted transcription that builds on the ideas developed in the first part and implements them in a way that can lead to significant improvement in current research practice.

Abstract: This paper describes EXMARaLDA, an XML-based framework for the construction, dissemination and analysis of corpora of spoken language transcriptions. Departing from a prototypical example of a “partitur” (musical score) transcription, the EXMARaLDA “single timeline, multiple tiers” data model and format is presented alongside with the EXMARaLDA Partitur-Editor, a tool for inputting and visualizing such data. This is followed by a discussion of the interaction of EXMARaLDA with other frameworks and tools that work with similar data models. Finally, this paper presents an extension of the “single timeline, multiple tiers” data model and describes its application within the EXMARaLDA system.

Abstract: EXMARaLDA is a system for computer transcription of spoken discourse that is being developed at the SFB ‘Mehrsprachigkeit’ as a basis of a multilingual discourse database into which the transcriptions in use at the SFB will be integrated at a later point in time. The present paper describes the theoretical background of the development – a formal model of discourse transcription based on the annotation graph formalism (Bird/Liberman (2001)) – and its practical realisation in the form of an XML-based data format and several tools for input, output and manipulation of the data.

Schmidt T (2001), "The transcription system EXMARaLDA: An application of the annotation graph formalism as the Basis of a Database of Multilingual Spoken Discourse", In Proceedings of the IRCS Workshop On Linguistic Databases, 11-13 December 2001. Philadelphia, pp. 219-227. Institute for Research in Cognitive Science, University of Pennsylvania.

Abstract: This paper describes EXMARaLDA, a system for computer transcription of spoken discourse developed and used by the SFB "Mehrsprachigkeit" at the University of Hamburg. EXMARaLDA consists of several DTDs for XML coding of transcription data and some input and output tools for these formats. Apart from being a transcription system in its own right, EXMARaLDA also plays the role of a mediator between older existing data formats at the SFB and between these formats and a planned database of multilingual spoken discourse.

Abstract: This paper describes a new research initiative addressing the issue of sustainability of linguistic resources. The initiative is a cooperation between three collaborative research centres in Germany – the SFB 441 “Linguistic Data Structures” in Tübingen, the SFB 538 “Multilingualism” in Hamburg, and the SFB 632 “Information Structure” in Potsdam/Berlin. The aim of the project is to develop methods for sustainable archiving of the diverse bodies of linguistic data used at the three sites. In the first half of the paper, the data handling solutions developed so far at the three centres are briefly introduced. This is followed by an assessment of their commonalities and differences and of what these entail for the work of the new joint initiative. The second part then sketches seven areas of open questions with respect to sustainable data handling and gives a more detailed account of two of them – integration of linguistic terminologies and development of best practice guidelines.

Abstract: We present some recent and planned future developments in EXMARaLDA, a system for creating, managing, analysing and publishing spoken language corpora. The new functionality concerns the areas of transcription and annotation, corpus management, query mechanisms, interoperability and corpus deployment. Future work is planned in the areas of automatic annotation, standardisation and workflow management.

Slavcheva A and Meißner C (2014), "Building and Maintaining the GeWiss Corpus: Perspectives on the Construction, Sustainability and Further Enrichment of Spoken Corpora, A Showcase", In Best Practices for Speech Corpora in Linguistic Research, pp. 20-35. Cambridge Scholars Publishing.

Abstract: This article describes a database of Spanish recorded speech comprised of four corpora. The corpora contain cross-sectional data of Spanish spoken in contact with German. The first corpus, ALCE-BLA (Bilingual Language Acquisition at school age), is comprised of the utterances of 23 Spanish-German simultaneous bilingual children living in Germany and attending the Spanish complementary school at the first level. The second corpus, Phon-cL2, contains the utterances of 15 German children who have learned (or are learning) Spanish after the age of 2;0. The third corpus, Madrid-PhonBLA, contains utterances of 71 Spanish-German simultaneous bilingual children from Madrid (Spain). The fourth corpus, PhonMAS, contains utterances of monolingual Spanish children, who have been recorded with the purpose to be compared with the bilingual corpora.

Abstract: The paper takes a look at existing metadata schemes for transcriptions of spoken language as well as written texts and emphasizes their advantages and disadvantages. It introduces the metadata model of EXMARaLDA, which has an implementation in the EXMARaLDA Corpus Manager (Coma). The paper justifies the decisions that led to a data model that does not presuppose many metadata items (thus risking inconsistencies) and relies on XML files (thus potentially sacrificing performance).

Abstract: Linguistic corpora have been annotated by means of SGML-based markup languages for almost 20 years. We can, very roughly, differentiate between three distinct evolutionary stages of markup technologies. (1) Originally, single SGML tree-based document instances were deemed sufficient for the representation of linguistic structures. (2) Linguists began to realize that alternatives and extensions to the traditional model are needed. Formalisms such as, for example, NITE were proposed: the NITE Object Model (NOM) consists of multi-rooted trees. (3) We are now on the threshold of the third evolutionary stage: even NITE's very flexible approach is not suited for all linguistic purposes. As some structures, such as these, cannot be modeled by multi-rooted trees, an even more flexible approach is needed in order to provide a generic annotation format that is able to represent genuinely arbitrary linguistic data structures.

Abstract: The ALesKo learner corpus is a small-scale comparable corpus consisting of two subcorpora: annotated essays by advanced Chinese learners of German and comparable essays by German native speakers. The motivation for its compilation was the investigation of discourse-related phenomena such as local coherence in second-language acquisition of German. After introducing how the texts were compiled and annotated, the article focuses on quantitative studies at the token level. We discuss problems of tokenisation and part-of-speech tagging and compare the inventory of the two subcorpora in terms of frequently used N-grams and lexical richness, among other aspects. We conclude the article by describing possible applications of the study in foreign language acquisition research and language teaching.