We are annotating the complete 20-million-word Dutch PAROLE corpus with PoS and lemma.
The morphosyntactic tagging of 250,000 words during the PAROLE project was the first
confrontation of the fine-grained Dutch PAROLE tagset and its 'functional' mode of
application with real corpus data. Correcting the manual tagging and
compiling a 100,000-word training corpus for the automatic tagger prompted an
evaluation of the suitability of the tagset and of the methodology of tag assignment,
both of which are discussed in this paper. The reality of corpus data brought about a
number of adaptations, linguistic restrictions and generalisations. The most salient tagger
results will be presented. Our experience is relevant for a new project: the Integrated
Language Database of 8th - 21st Century Dutch (ILD), which will contain a text corpus
covering all these centuries. The corpus will be annotated with lemma and PoS, a
process in which historical lexica will be used. We will therefore have to tailor the
tagset and the methodology of tag assignment to these purposes.