Text Typology and Selection Criteria for a Balanced Corpus: the Integrated Language Database of 8th-21st-Century Dutch

Abstract

The Institute for Dutch Lexicology is compiling the Integrated Language Database of 8th-21st-Century Dutch, which will consist of three components: a text component, a dictionary component and a lexicon component. In this paper we describe the work done so far on the text component, which will contain a balanced, diachronic corpus of texts. Section 2 shows the role of the text typology, which helps the designers build a corpus that is a good representation of the Dutch language and enables users to make many different subcorpus selections. We describe how the text typology for the Integrated Language Database was designed, with the primary aim of the texts as the governing principle. Section 3 presents additional classification features that give the user still more possibilities for subcorpus selection. Section 4 discusses the selection criteria for individual texts, dividing texts into 'originals' and 'editions' and presenting the rules for choosing among the available originals or editions. Section 5 concludes the paper with a short description of the next empirical step: the building of a prototype.