Finite-state automata and Morphology

2 Outline
  Morphology
    What is it?
    Why do we need it?
    How do we model it?
  Computational model: finite-state transducer

3 Structure of Words
  What are words?
    Orthographic tokens separated by white space.
    In some languages the distinction between words and sentences is less clear.
      Chinese, Japanese: no white space between words
        nowhitespace → no white space / no whites pace / now hit esp ace
      Turkish: words could represent a complete “sentence”
        E.g.: uygarlastiramadiklarimizdanmissinizcasina
        “(behaving) as if you are among those whom we could not civilize”
  Morphology: the structure of words
    Basic elements: morphemes
    Morphological rules: how to combine morphemes
  Syntax: the structure of sentences
    Rules for ordering words in a sentence
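The segmentation ambiguity in the "nowhitespace" example can be reproduced with a short sketch. The lexicon below is a hypothetical toy word list chosen only to cover the slide's example, and `segmentations` is an illustrative helper name, not something from the slides.

```python
def segmentations(text, lexicon):
    """Return every way to split `text` into words found in `lexicon`."""
    if not text:
        return [[]]  # the empty string has exactly one segmentation: no words
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in lexicon:
            # Keep this prefix and segment the remainder recursively.
            for rest in segmentations(text[end:], lexicon):
                results.append([prefix] + rest)
    return results


# Hypothetical toy lexicon (an assumption for illustration only).
LEXICON = {"no", "now", "white", "whites", "hit", "space", "pace", "esp", "ace"}

for words in segmentations("nowhitespace", LEXICON):
    print(" ".join(words))
```

With this toy lexicon the sketch finds the three readings listed on the slide. A real segmenter would also need a way to rank the candidates, which is not shown here.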

4 Morphology and Syntax
  Interplay between syntax and morphology
    How much information does a language allow to be packed into a word, and how easy is it to unpack?
    More information → less rigid syntax → freer word order
    E.g., Hindi: for “John likes Mary”, all six word orders are possible because of rich morphological information.

5 Why Study Morphology?
  Morphology
    provides systematic rules for forming new words in a language.
    can be used to verify whether a word is legitimate in a language.
    enables efficient storage methods.
    improves the lexical coverage of a system.
    groups words into classes.
  Applications
    Improving recall in search applications
      Try “fish” as a query in a search engine
    Text-to-speech synthesis
      The category of a word determines its pronunciation
    Parsing
      Morphological information eliminates spurious parses for a sentence

16 Finite-State Transducers
  Finite-state acceptors represent regular sets.
  Finite-state transducers represent regular relations.
  Relation
    If A and B are two regular sets, a relation R ⊆ A × B
    Example: {(x, y) | x ∈ a*, y ∈ b*}
  FSTs can be considered as
    Translators (Hello:Ciao)
    Parser/generators (Hello:How may I help you?)
    As well as Kimmo-style morphological parsing
  Examples of FSTs on board
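As a small illustration of the example relation {(x, y) | x ∈ a*, y ∈ b*}, the sketch below tests whether a pair of strings belongs to it. The function name `in_relation` is hypothetical. Because this particular relation is the full cross product of two regular sets, each component can be checked independently; relations that couple the two strings symbol by symbol need a transducer instead.

```python
import re


def in_relation(x, y):
    """True if (x, y) is in R = {(x, y) | x in a*, y in b*}."""
    # Each side is checked against its own regular set.
    return re.fullmatch(r"a*", x) is not None and re.fullmatch(r"b*", y) is not None


print(in_relation("aaa", "b"))   # both sides match their regular sets
print(in_relation("ab", "b"))    # "ab" is not in a*
```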

17 Finite-State Transducers: formally speaking
  An FST is a 5-tuple consisting of
    Q: a set of states {q0, q1, q2, q3, q4}
    Σ: an alphabet of complex symbols, each an i:o pair such that i ∈ I (an input alphabet) and o ∈ O (an output alphabet), with Σ ⊆ I × O
    q0: a start state
    F: a set of final states in Q, {q4}
    δ(q, i:o): a transition function mapping Q × Σ to Q
  [Figure: transducer with states q0, q1, q2, q3, q4 and arc labels b:m, a:o, !:?]
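The 5-tuple can be written down directly as a data structure. The sketch below is one possible encoding, not the slide's own; the class name `FST` and method `transduce` are hypothetical. The slide gives only the states and the arc labels b:m, a:o, !:?, so the arc layout used here (q0 to q1 on b:m, then a:o arcs through q2 and q3 with a loop on q3, then !:? into q4) is an assumption. The sketch also assumes at most one matching arc per state and input symbol.

```python
from dataclasses import dataclass


@dataclass
class FST:
    states: set      # Q
    alphabet: set    # Sigma: set of (input, output) pairs
    start: str       # q0
    finals: set      # F
    delta: dict      # maps (state, (i, o)) to the next state

    def transduce(self, text):
        """Return the output string, or None if the input is rejected."""
        state, out = self.start, []
        for ch in text:
            # Look for a pair i:o with i == ch that has an arc from `state`.
            for (i, o) in self.alphabet:
                if i == ch and (state, (i, o)) in self.delta:
                    out.append(o)
                    state = self.delta[(state, (i, o))]
                    break
            else:
                return None  # no arc for this input symbol
        return "".join(out) if state in self.finals else None


# Assumed arc layout (the figure's topology is not recoverable from the text).
B_M, A_O, BANG_Q = ("b", "m"), ("a", "o"), ("!", "?")
fst = FST(
    states={"q0", "q1", "q2", "q3", "q4"},
    alphabet={B_M, A_O, BANG_Q},
    start="q0",
    finals={"q4"},
    delta={
        ("q0", B_M): "q1",
        ("q1", A_O): "q2",
        ("q2", A_O): "q3",
        ("q3", A_O): "q3",
        ("q3", BANG_Q): "q4",
    },
)

print(fst.transduce("baa!"))
print(fst.transduce("ba!"))
```

Under the assumed layout, inputs of the form b, two or more a's, then ! are accepted and rewritten pair by pair; anything else returns None.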

23 Role of Morphology in Machine Translation
  Every MT system contains a bilingual lexicon
    Bilingual lexicon: a table mapping each source-language token to target-language token(s).
  Two options:
    1. Full-form lexicon
      every word form of the source token is paired with the target token
      the table becomes large when the vocabulary is large, as in morphologically rich languages
    2. Root-form lexicon
      pairing of stems from the two languages
      reduces the size of the lexicon
      requires morphological analysis for the source language
        bats → (bat, V, 3sg) (bat, N, pl)
      and morphological generation for the target language
        (bat, V, 3sg) → bats
  Unknown words: words not covered in the bilingual lexicon
    - with morphology, one can guess the syntactic function
  Compounding of words:
    English: simple juxtaposition (“car seat”), sometimes fusion (“seaweed”)
    German: fusion is more common (“Dampfschiffahrtsgesellschaft” → steamship company)
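The analysis and generation steps of a root-form lexicon can be sketched with a toy example built around the slide's "bats" case. The tables and function names below (`STEM_CATEGORIES`, `SUFFIX_FEATURES`, `analyze`, `generate`) are hypothetical and cover only a single suffix; a real system would use a full lexicon and a finite-state transducer rather than a suffix table. The bilingual stem-to-stem lookup itself is left out.

```python
# Hypothetical toy data: which categories each stem can have,
# and which features a (category, suffix) combination signals.
STEM_CATEGORIES = {"bat": ["V", "N"]}
SUFFIX_FEATURES = {("V", "s"): "3sg", ("N", "s"): "pl"}


def analyze(word):
    """Morphological analysis: surface form -> list of (stem, category, features)."""
    analyses = []
    for (cat, suffix), feat in SUFFIX_FEATURES.items():
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if cat in STEM_CATEGORIES.get(stem, []):
                analyses.append((stem, cat, feat))
    return analyses


def generate(stem, cat, feat):
    """Morphological generation: (stem, category, features) -> surface form."""
    if cat not in STEM_CATEGORIES.get(stem, []):
        return None
    for (c, suffix), f in SUFFIX_FEATURES.items():
        if c == cat and f == feat:
            return stem + suffix
    return None


print(analyze("bats"))
print(generate("bat", "V", "3sg"))
```

In a translation pipeline the analysis output would be mapped stem by stem through the bilingual lexicon, and generation would then run on the target-language side.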