Swedish multiword-entity extraction

Goal

.Finding long (>2 words) word/lemma n-grams

Background

Multiword entitites have received much attention lately both in computional linguistics and in general linguistics. There is a long tradition in computational and corpus linguistics of mining multiword entities from text by applying (a wide range of) collocation measures to pairs of entities (text words, lemmas, syntactic dependencies), contiguous or non-contiguous, in order to find two-word lexical units or terms. Attempts to discover longer units are much more rare in the literature, in part because good collocation measures seem to be lacking for this problem.

Problem description

The aim of this work is to refine a purely frequency-based way of finding contiguous word n-grams in annotated text, for instance by applying methods from work on automatic word segmentation. The preferred target language is Swedish. English is also acceptable, but in this case, Språkbanken can provide only limited support wrt annotation tools and linguistic expertise.