Multiword expressions

Multiword expressions are lexical units that consist of two or more words (tokens) but exhibit special syntactic, semantic, pragmatic or statistical features. From an NLP point of view, their treatment is not straightforward. On the one hand, the system should recognize that they count as one lexical unit (and not as two or more connected words), so it is advisable to store them as one unit in the lexicon. On the other hand, special rules for their treatment should also be included in the system.

Identifying multiword expressions is not straightforward, since constructions with similar syntactic structure (e.g. verb + noun combinations) can belong to different subclasses on the productivity scale (i.e. productive combinations, light verb constructions and idioms). That is why well-designed and tagged corpora of multiword expressions are invaluable resources for training and testing algorithms that identify multiword expressions.

A compound is a lexical unit that consists of two or more elements that also exist on their own. Orthographically, a compound may be written with spaces (high school), with a hyphen (well-known) or as a single word (headmaster).
Light verb constructions (LVCs) consist of a nominal and a verbal component, where the noun is usually taken in one of its literal senses but the verb usually loses its original sense to some extent, e.g. to give a lecture, to come into bloom, the problem lies (in).
Verb-particle constructions (VPCs, also called phrasal verbs or phrasal-prepositional verbs) consist of a verb and a particle/preposition that can be adjacent (as in put off) or separated by an intervening object (turn the light off).
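
The difference between adjacent and split verb-particle constructions can be illustrated with a minimal Python sketch. It is not the method used to build the corpora described here: the lexicon VPC_LEXICON, the function find_vpc_candidates and the max_gap parameter are hypothetical names introduced only for illustration, and the matching works on surface forms without lemmatization or part-of-speech information.

```python
# Hypothetical toy lexicon of (verb, particle) pairs, for illustration only.
VPC_LEXICON = {("put", "off"), ("turn", "off"), ("give", "up")}


def find_vpc_candidates(tokens, max_gap=3):
    """Return (verb_index, particle_index) pairs found in the token list.

    The particle may directly follow the verb (adjacent case) or be
    separated from it by at most max_gap intervening tokens (split case).
    """
    lowered = [token.lower() for token in tokens]
    matches = []
    for i, verb in enumerate(lowered):
        # Scan the positions after the verb, up to max_gap tokens apart.
        for j in range(i + 1, min(i + 2 + max_gap, len(lowered))):
            if (verb, lowered[j]) in VPC_LEXICON:
                matches.append((i, j))
                break  # keep only the nearest particle for this verb
    return matches


adjacent = find_vpc_candidates("She put off the meeting".split())
split = find_vpc_candidates("Please turn the light off".split())
```

A purely lexical matcher like this overgenerates (for instance, it cannot tell a particle from a preposition heading its own phrase), which is one reason why manually annotated corpora are needed for training and evaluation.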

Corpora

We have created several manually annotated corpora, which are listed below.

SzegedParalellFX
SzegedParalellFX is based on the SzegedParalell English-Hungarian parallel corpus, in which light verb constructions have been manually annotated in both languages. Three novels, texts from magazines and language books, and economic and legal texts were selected for annotation. The corpus has 14,261 sentence alignment units, which contain 1,370 occurrences of light verb constructions.

Szeged Treebank FX
The Szeged Treebank, a database in which words are morphosyntactically tagged and sentences are syntactically parsed, was manually annotated for light verb constructions. The corpus texts cover the following domains: student essays, short business news, newspaper texts, laws, computer texts and literature. This version of the Treebank contains 6,734 occurrences of 1,215 light verb constructions altogether in 82,099 sentences.

Wiki50
The Wiki50 corpus contains 50 English Wikipedia articles (4,350 sentences), in which several types of multiword expressions and four classes of Named Entities were manually annotated by professional linguists. This is the first corpus in which multiword expressions and named entities are annotated at the same time. The corpus data make it possible to investigate the co-occurrences of different types of MWEs and NEs within the same domain, and also to train and evaluate MWE detectors and NER applications.

4FX
The 4FX corpus contains English, Spanish, German and Hungarian legislative texts from the JRC-Acquis Multilingual Parallel Corpus, which are manually annotated for light verb constructions, following standardized annotation principles. The corpus contains 673 LVCs in English, 806 in German, 938 in Spanish and 1,059 in Hungarian.

CoNLL-2003 dataset annotated for LVCs
The CoNLL-2003 dataset was originally developed for named entity recognition in the short news domain. A total of 500 randomly selected pieces of short news were taken from the CoNLL-2003 dataset, and the LVCs in them were annotated. This corpus contains 381 occurrences of manually annotated LVCs in 8,467 sentences.
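
For a rough comparison of how frequent LVCs are across the corpora, the counts given above can be normalized per 1,000 units. The sketch below is only an illustration: the names CORPUS_COUNTS and lvc_density are hypothetical, and the figures are simply those stated in the descriptions. Note that the unit for SzegedParalellFX is the sentence alignment unit rather than the sentence, so the values are not strictly comparable. Wiki50 and 4FX are left out because the descriptions above do not give both an LVC count and a sentence count for them.

```python
# Figures copied from the corpus descriptions: (LVC occurrences, number of units).
CORPUS_COUNTS = {
    "SzegedParalellFX": (1370, 14261),    # units: sentence alignment units
    "Szeged Treebank FX": (6734, 82099),  # units: sentences
    "CoNLL-2003 (LVC)": (381, 8467),      # units: sentences
}


def lvc_density(occurrences, units, per=1000):
    """Return the number of LVC occurrences per `per` units."""
    return per * occurrences / units


densities = {
    name: lvc_density(occurrences, units)
    for name, (occurrences, units) in CORPUS_COUNTS.items()
}

for name, value in densities.items():
    print(f"{name}: {value:.1f} LVCs per 1,000 units")
```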