Multiword Expressions

Multiword expressions (MWEs) are expressions that consist of at least two words and that can be syntactically and/or semantically idiosyncratic. Moreover, they act as a single unit at some level of linguistic analysis. Sag et al.[1] roughly define MWEs as "idiosyncratic interpretations that cross word boundaries".
MWEs can be regarded as lying at the interface of grammar and lexicon: they are usually instances of productive syntactic patterns but nevertheless show peculiar lexical behaviour.[2]

They are also commonly used in every area of language: Jackendoff[3] estimates that the number of MWEs in a speaker's lexicon is comparable to the number of single words. Examples of MWEs include idioms such as "kick the bucket", compound nouns such as "telephone box" and "post office", verb-particle constructions such as "look sth. up", and proper names such as "San Francisco". Because MWEs are so frequent, the NLP (Natural Language Processing) community is increasingly aware of the problems they pose.

2 Classification of MWEs

MWEs can be divided into lexicalized phrases, which have at least partly idiosyncratic syntax or pragmatics, and institutionalized phrases, which are syntactically and semantically compositional. Lexicalized phrases can be further subclassified into fixed expressions, semi-fixed expressions, and syntactically flexible expressions.[4]

2.1 Fixed expressions

Fixed expressions are fully lexicalized and allow neither morphosyntactic variation nor internal modification. Examples are in short, by and large, and every which way. They are fixed in the sense that one cannot say in shorter or in very short.
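
Because fixed expressions never vary, one simple way to handle them is to treat each one as a single token. The following minimal Python sketch illustrates this idea; the lexicon contents, the underscore-joined output format, and the function name merge_fixed_expressions are assumptions made for illustration, not part of any particular system.

```python
# Hypothetical lexicon holding the three fixed expressions cited in the text.
FIXED_EXPRESSIONS = {
    ("in", "short"),
    ("by", "and", "large"),
    ("every", "which", "way"),
}
MAX_LEN = max(len(expr) for expr in FIXED_EXPRESSIONS)


def merge_fixed_expressions(tokens):
    """Greedily merge the longest fixed expression starting at each position."""
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest possible span first, down to two tokens.
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in FIXED_EXPRESSIONS:
                merged.append("_".join(candidate))
                i += n
                break
        else:
            # No fixed expression starts here; keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


print(merge_fixed_expressions("By and large it works".split()))
print(merge_fixed_expressions("in very short".split()))
```

Only exact matches are merged, so variants such as in shorter or in very short are left as separate tokens, which mirrors the behaviour described above.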

2.2 Semi-fixed expressions

In semi-fixed expressions, word order and composition are strictly invariable, while inflection, variation in reflexive form, and determiner selection are possible.

In non-decomposable idioms (i.e. idioms whose meaning cannot be assigned to the parts of the MWE) such as kick the bucket, the verb can be inflected to fit the context: he kicks the bucket. On the other hand, non-decomposable idioms do not undergo syntactic variation. For example, a passive sentence such as the bucket was kicked is not possible, or at least does not have the same meaning.

Another type of semi-fixed expression is the compound nominal, such as car park or peanut butter. Compound nominals are syntactically unalterable but can inflect for number: two car parks.

Proper names are semi-fixed expressions as well, since they can occur in different forms. For example, the name of the U.S. sports team the San Francisco 49ers can occur as the 49ers or as a modifier in the compound noun a 49ers player.
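
The combination of fixed word order and permitted inflection can be sketched as a lexicon in which every position of an expression lists the word forms it accepts. The Python example below is only an illustration under stated assumptions: the names SEMI_FIXED and find_semi_fixed are hypothetical, and the inflected forms are listed by hand where a real system would presumably rely on a morphological analyser.

```python
# Each entry is a sequence of slots; a slot is the set of word forms it accepts.
# The forms are hand-listed assumptions for illustration.
SEMI_FIXED = {
    "kick the bucket": [
        {"kick", "kicks", "kicked", "kicking"},
        {"the"},
        {"bucket"},
    ],
    "car park": [{"car"}, {"park", "parks"}],
}


def find_semi_fixed(tokens):
    """Return (start_index, entry_name) for every contiguous match."""
    lowered = [t.lower() for t in tokens]
    hits = []
    for name, slots in SEMI_FIXED.items():
        width = len(slots)
        for start in range(len(lowered) - width + 1):
            window = lowered[start:start + width]
            # Word order is fixed: slot k must match the k-th word of the window.
            if all(word in slot for word, slot in zip(window, slots)):
                hits.append((start, name))
    return hits


print(find_semi_fixed("He kicks the bucket".split()))
print(find_semi_fixed("the bucket was kicked".split()))
print(find_semi_fixed("two car parks".split()))
```

The inflected form kicks and the plural parks are accepted, while the passive the bucket was kicked is not matched because the slots must appear in their original order.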

2.3 Syntactically-Flexible Expressions

Syntactically-flexible expressions have a wider range of syntactic variability than semi-fixed expressions. They occur in the form of decomposable idioms, verb-particle constructions, and light verb constructions.

Decomposable idioms are likely to be syntactically flexible to some degree. Examples are let the cat out of the bag and sweep under the rug. Yet, it is hard to predict which kind of syntactic variation a given idiom can undergo.

Verb-particle constructions, such as write up and look up, are made up of a verb and one or more particles. They are either semantically idiosyncratic, such as brush up on, or compositional, such as break up in the meteorite broke up in the earth's atmosphere. In some transitive verb-particle constructions, such as call s.o. up, an NP argument can occur either between the verb and the particle(s) or after them: call Kim up or call up Kim, respectively. In addition, adverbs can often be inserted between the verb and the particle, as in fight bravely on.
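
Because the object NP or an adverb may intervene, a verb-particle construction cannot be found by looking for adjacent words only. The sketch below, written in Python, searches for the particle within a small window after the verb. The lexicon VPCS, the window size MAX_GAP, and the function name find_vpcs are assumptions chosen for illustration; a purely string-based window like this will also over-match (for example when up is a preposition), so it is not a complete solution.

```python
# Hypothetical lexicon: entry name -> (accepted verb forms, particle).
VPCS = {
    "call up": ({"call", "calls", "called", "calling"}, "up"),
    "look up": ({"look", "looks", "looked", "looking"}, "up"),
}
MAX_GAP = 3  # assumed limit on tokens allowed between verb and particle


def find_vpcs(tokens):
    """Return (verb_index, particle_index, entry_name) for each match."""
    lowered = [t.lower() for t in tokens]
    hits = []
    for i, word in enumerate(lowered):
        for name, (verb_forms, particle) in VPCS.items():
            if word not in verb_forms:
                continue
            # Accept the particle directly after the verb or up to MAX_GAP tokens later.
            for j in range(i + 1, min(i + 2 + MAX_GAP, len(lowered))):
                if lowered[j] == particle:
                    hits.append((i, j, name))
                    break
    return hits


print(find_vpcs("call Kim up".split()))
print(find_vpcs("call up Kim".split()))
```

Both orders from the text, call Kim up and call up Kim, are matched by the same entry; only the position of the particle differs.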

For light verb constructions such as make a mistake or give a demo, it is difficult to predict which light verb combines with a given noun. Although they are highly idiosyncratic, they have to be distinguished from idioms: "the noun is used in a normal sense, and the verb meaning appears to be bleached, rather than idiomatic."[5]

2.4 Institutionalized Phrases

Institutionalized phrases are conventionalized phrases such as salt and pepper, traffic light, and to kindle excitement. They are semantically and syntactically compositional but statistically idiosyncratic. In the phrase traffic light, for example, traffic and light both retain their simplex senses and combine constructionally to produce a compositional reading.

3 Problems for NLP

One problem that arises in NLP when MWEs are treated by general, compositional methods of linguistic analysis is the overgeneration problem. From attested expressions, a system could derive other, putatively possible expressions that would be equivalent in meaning but are not used because they have not been institutionalized. "A generation system that is uninformed about both the patterns of compounding and the particular collocational frequency of the relevant dialect would correctly generate telephone booth (American) or telephone box (British/Australian), but might also generate such perfectly compositional, but unacceptable examples as telephone cabinet, telephone closet, etc."[6]
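
The overgeneration problem can be made concrete with a toy generator. In the Python sketch below, a naive compositional step combines a modifier with any head noun from a set of near-synonyms, and a second step keeps only the compounds that are conventional in a given dialect. The word lists and the names generate_compounds and filter_by_dialect are hypothetical; the dialect data is taken only from the example quoted above.

```python
from itertools import product

# Hypothetical near-synonyms that a naive generator treats as interchangeable.
MODIFIERS = ["telephone"]
HEADS = ["booth", "box", "cabinet", "closet"]

# Institutionalized compounds per dialect, following the quoted example.
INSTITUTIONALIZED = {
    "American": {"telephone booth"},
    "British/Australian": {"telephone box"},
}


def generate_compounds():
    """Every modifier + head combination, regardless of conventional usage."""
    return [f"{m} {h}" for m, h in product(MODIFIERS, HEADS)]


def filter_by_dialect(candidates, dialect):
    """Keep only the candidates that are institutionalized in the dialect."""
    return [c for c in candidates if c in INSTITUTIONALIZED[dialect]]


candidates = generate_compounds()
print(candidates)
print(filter_by_dialect(candidates, "American"))
print(filter_by_dialect(candidates, "British/Australian"))
```

Without the filter, the generator also produces telephone cabinet and telephone closet; the information needed to exclude them is lexical and dialect-specific rather than grammatical.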

Another problem is the idiomaticity problem. It is difficult to predict the meaning of an expression like kick the bucket, since its meaning is not related to the meanings of kick, the, and bucket, even though the expression appears to conform to the grammar of English verb phrases.
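
The idiomaticity problem can be illustrated in the same toy style. The Python sketch below contrasts a word-by-word interpretation with a lookup in an idiom lexicon. The glosses, the dictionaries WORD_SENSES and IDIOMS, and the function name interpret are all assumptions made for this illustration.

```python
# Hypothetical word-level glosses used by the compositional fallback.
WORD_SENSES = {
    "kick": "strike with the foot",
    "the": "DEF",
    "bucket": "pail",
}

# Hypothetical idiom lexicon: the whole phrase maps to one meaning.
IDIOMS = {("kick", "the", "bucket"): "die"}


def interpret(tokens):
    """Prefer the idiom lexicon; otherwise glue the word glosses together."""
    key = tuple(t.lower() for t in tokens)
    if key in IDIOMS:
        return IDIOMS[key]
    # Compositional fallback; unknown words are glossed as "?".
    return " + ".join(WORD_SENSES.get(t, "?") for t in key)


print(interpret("kick the bucket".split()))
print(interpret("kick the ball".split()))
```

The idiomatic meaning has to be stored for the phrase as a whole, because nothing in the glosses of kick, the, and bucket leads to it.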