
Automatic Full Phonetic Transcription of Arabic Script with and without Language Factorization

Based on research conducted by RDI's NLP group (2003-2009)
http://www.RDI-eg.com/RDI/Technologies/Arabic_NLP.htm
Mohsen Rashwan, Mohamed Al-Badrashiny, and Mohamed Attia


This residual ambiguity arises from our incomplete knowledge of the underlying dynamics of the linguistic phenomenon, and perhaps also from the lack of higher language-processing layers constraining such a phenomenon; e.g. the absence of a semantic-analysis layer constraining morphological and syntactic analysis.

Statistical methods are well known to be among the most (if not the most) effective, feasible, and widely adopted approaches for automatically resolving that ambiguity.

Sometimes, such ambiguous NLP tasks are not pursued for the sake of their outputs themselves, but as an intermediate step toward inferring another, final output.

An example is the problem of automatically obtaining the phonetic transcription of a given raw Arabic text w1 … wn, which can be directly inferred as a one-to-one mapping from the diacritics on the characters of the input words. But these diacritics are typically absent from MSA script! For example, the undiacritized string كتب may be read as كَتَبَ (kataba, "he wrote"), كُتِبَ (kutiba, "it was written"), or كُتُبٌ (kutubun, "books").

Some researchers, however, argue that if statistical disambiguation is eventually deployed to get the most likely sequence of outputs, why not go fully statistical; i.e. work un-factorized from the very beginning and give up the burden of rule-based methods?

For our example, this means that the statistical disambiguation (as well as the statistical language models) is built from manually diacritized text corpora where the spelling characters and their full diacritics are both supplied for each word.

The obvious answer in many such cases (including that of our example) is to overcome the problem of poor coverage when the input language entities are produced by a highly generative linguistic process; e.g. Arabic morphology.

However, that sound question may be refined to ask about the performance (accuracy and speed) of statistically disambiguating un-factorized language entities (at least the frequent ones that can be covered without factorization) as compared to statistically disambiguating factorized language entities.

The rest of this presentation discusses 4 issues in this regard:

1- The statistical disambiguation methodology deployed in both cases.

2- The related Arabic NLP factorization models and the architecture of the factorizing system.

In other pattern-recognition problems, e.g. OCR and ASR, the term P(O|I), referred to as the likelihood probability, is modeled via probability distributions; e.g. HMMs.

Our language factorization models enable us to do better by viewing the availability of possible structures for a given input string - in terms of probabilities - as a binary decision of whether the observed string complies with the formal rules of the factorization models or not. This simplifies the MAP formula into:
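A plausible form of the simplified formula (our reconstruction: the likelihood P(O|I) collapses to a 0/1 compliance indicator, so the maximization runs only over compliant structures):

\hat{I} = \arg\max_{I} P(O \mid I)\, P(I) = \arg\max_{I \in R(O)} P(I)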

where R(O) is the part of the factorization model's space corresponding to the observed input string; i.e. (reconstructing the missing definition from context):
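R(O) = \{\, I : \text{synthesizing } I \text{ yields the observed string } O \,\}

so that P(O \mid I) = 1 for I \in R(O) and 0 otherwise.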

In the case of the factorizing system, I is now restricted to only those factorized sequences that can generate (via synthesis) the input sequence, and the ^ denotes the most likely one.

In the case of the un-factorizing system, I is a possible sequence of diacritics matching the input sequence, and the ^ denotes the most likely one.

The term P(I) is conventionally called the (Statistical) Language Model (SLM).

Let us replace the conventional symbol I with Q, which is more convenient for our specific problem.

With the aid of the first graph in this presentation, the problem is now reduced to searching for the most likely sequence q_{i,f(i)}, 1 ≤ i ≤ L, i.e. the one with the highest marginal probability through the following lattice:

This creates a Cartesian search space:
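If word i has M_i candidate entities (M_i is our notation, not the slide's), the space is presumably the product

\{ q_{1,1}, \dots, q_{1,M_1} \} \times \dots \times \{ q_{L,1}, \dots, q_{L,M_L} \},

containing \prod_{i=1}^{L} M_i full paths, so exhaustive enumeration is infeasible and a smarter search is needed.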

The A* search algorithm is guaranteed to exit with the most likely path via two tree-search strategies:

1- Heuristic probability estimation of the rest of the path to be expanded next. This is called the h* function.

combined with

2- Best-first tree expansion of the path with the highest sum of the start-to-expansion probability (the g function) plus the h* function.
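The presentation contains no code; the sketch below is our own minimal Python illustration of this g + h* best-first search (the lattice layout, scoring functions, and all names are assumptions, not RDI's implementation). The h* bound optimistically sums the best per-word scores of the remaining words, which keeps it admissible because transition log-probabilities are never positive.

```python
import heapq
from itertools import count

def astar_best_path(lattice, bigram_lp):
    """lattice[i]: list of (candidate, unigram_logprob) pairs for word i.
    bigram_lp(prev, cur): transition log-probability (always <= 0).
    Returns (most likely path, its log-probability)."""
    L = len(lattice)
    # h*[i]: optimistic bound on the best log-prob attainable from word i onward
    # (best unigram score per remaining word, transition terms ignored).
    h = [0.0] * (L + 1)
    for i in range(L - 1, -1, -1):
        h[i] = h[i + 1] + max(lp for _, lp in lattice[i])

    tie = count()  # tie-breaker so the heap never has to compare paths
    heap = [(-h[0], next(tie), 0, 0.0, [])]  # (-(g + h*), tie, next word, g, path)
    while heap:
        _, _, i, g, path = heapq.heappop(heap)
        if i == L:
            return path, g  # h* is admissible, so the first finished path is optimal
        for cand, lp in lattice[i]:  # best-first expansion of the popped path
            g2 = g + lp + (bigram_lp(path[-1], cand) if path else 0.0)
            heapq.heappush(heap, (-(g2 + h[i + 1]), next(tie), i + 1, g2, path + [cand]))

# Toy usage: two words, two candidate diacritizations each.
lat = [[("qa", -0.5), ("qi", -1.5)], [("ta", -0.7), ("tu", -1.0)]]
print(astar_best_path(lat, lambda p, c: -0.1))
```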

It is then required to estimate the marginal probability of any whole or partial possible path in the lattice. Via the chain rule and the attenuating-correlation assumption, this probability is approximated by the formula:
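The formula is presumably the standard h-gram approximation (with the same history length h used for the probability database below):

P(q_1 \dots q_L) \approx \prod_{i=1}^{L} P(q_i \mid q_{i-h+1} \dots q_{i-1})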

These conditional probabilities are primarily calculated via the famous Bayesian formula. Due to Zipfian sparseness, the Good-Turing discounting and Katz back-off techniques are also deployed to obtain smooth distributions as well as reliable estimates of rare and unseen events, respectively.
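As a rough illustration of how such a smoothed lookup can behave at runtime (the table names `discounted` and `alpha` are hypothetical, not the authors' data structures):

```python
import math

def backoff_logprob(q, context, discounted, alpha):
    """Katz-style back-off lookup (illustrative sketch).
    discounted[(context, q)]: Good-Turing-discounted probability of q after
    `context` (a tuple), stored for events actually seen in training.
    alpha[context]: back-off weight redistributing the held-out mass."""
    if (context, q) in discounted:
        return math.log(discounted[(context, q)])  # seen event: discounted estimate
    if not context:
        return math.log(1e-9)                      # floor for truly unseen unigrams
    # Unseen with this history: back off to the shorter context, scaled by alpha.
    return math.log(alpha.get(context, 1.0)) + backoff_logprob(q, context[1:], discounted, alpha)
```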

While the database of elementary n-gram probabilities P(q1 … qn), 1 ≤ n ≤ h, is built during the training phase, the task of statistical disambiguation at runtime is reduced to retrieving these probabilities and searching the lattice for the most likely path.

Although Arabic is an intensively diacritized language, Modern Standard Arabic (MSA) is typically written by contemporary natives without diacritics!

So it is the task of the NLP system to accurately infer all the missing diacritics of all the words in the input Arabic text, and also to amend those diacritics to account for the mutual phonetic effects among adjacent words in continuous pronunciation.

Modern Standard Arabic (MSA) is typically written without diacritics.

MSA script typically contains many common spelling mistakes.

The extremely derivational and inflectional nature of Arabic necessitates treating it as a morpheme-based rather than a vocabulary-based language; the size of the generable Arabic vocabulary is on the order of billions!

One (or more) diacritic in about 65% of the words in Arabic text depends on the syntactic case ending of the word.

Lexical and syntactic grammars alone produce a high average number of possible solutions at each word of the text (high ambiguity).

7.5% of open-domain Arabic text consists of transliterated words, which lack any Arabic constraining model. Moreover, many of these words are confusingly analyzable as normal Arabic words!

For transliterated (foreign) words, an intra-word Arabic phonetic grammar is deployed to constrain the statistical search for the most likely diacritization matching the spelling of each input transliterated word.

A comprehensive Arabic lexicon has been built as the repository of the linguistic (orthographic, phonological, morphological, syntactic) description of each Arabic morpheme; all of its possible mutual interactions with other morphemes are registered as extensively as possible in a compact, structured format.

The un-factorizing diacritizer simply tests the spelling of each input word against a dictionary of final-form words; i.e. a vocabulary list.

The possible diacritizations of each word in a sequence of input words (henceforth a "segment") that are all covered by that dictionary are directly retrieved without any language factorization. The resulting diacritization lattice of each segment is then statistically disambiguated.

Uncovered segments (along with the disambiguated diacritizations of the covered segments) are then sent to the factorizing transcriptor, which infers the most likely diacritization of the uncovered segments and phonetically concatenates the words across all segments, as sketched below.
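A minimal Python sketch of this hybrid flow, under our own naming assumptions (`dictionary`, `factorize`, and `disambiguate` stand in for the final-form lexicon, the factorizing analyzer, and the statistical search described earlier; none of these names come from the presentation):

```python
def hybrid_diacritize(words, dictionary, factorize, disambiguate):
    """dictionary: final-form word -> candidate diacritizations (un-factorizing path).
    factorize:  word -> candidate diacritizations via morphological analysis.
    disambiguate: lattice (list of candidate lists) -> most likely sequence."""
    output, covered = [], []

    def flush_covered():
        # A maximal run of dictionary-covered words forms a "segment":
        # retrieve its candidates by lookup and disambiguate statistically.
        if covered:
            output.extend(disambiguate([dictionary[w] for w in covered]))
            covered.clear()

    for w in words:
        if w in dictionary:
            covered.append(w)
        else:
            flush_covered()
            # Uncovered word: hand it to the factorizing transcriptor instead.
            output.extend(disambiguate([factorize(w)]))
    flush_covered()
    return output
```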

Both groups evaluated their performance by training and testing their systems on LDC's Arabic Treebank of diacritized news stories (LDC2004T11; text part 3, v1.0), published in 2004.

This Arabic text corpus, which includes a total of 600 documents (≈ 340K words) from the An-Nahar (Lebanese) newspaper, is split into training data of ≈ 288K words and test data of ≈ 52K words.

In order to obtain a fair comparison with the work of Habash & Rambow’s group, and with Zitouni et al.’s group:

 We used the same aforementioned training and test corpus from LDC’s Treebank.

We adopted their same error-counting metrics when evaluating our hybrid system vs. theirs.

As each of the other two groups deploys more sophisticated statistical tools than ours, one can attribute the superior performance of our system to the hybridization of the un-factorizing transcriptor with the factorizing one in our architecture.

It is very insightful to know not only how much better the hybrid transcriptor is than the purely factorizing one, but also how the error margin evolves in both cases as the size of the annotated training text corpora increases.

To this end, a domain-balanced annotated Arabic training text corpus with a total size of 3,250K words was developed (over years), with manually supervised full Arabic morphological analysis and diacritization applied to every word.

Another domain-balanced (tough) test set of 11K words was also prepared in both annotated and un-annotated formats.

At approximately log-scale steps of the training-corpus size, the statistical models (with the same equivalent h) were built, and the following metrics were measured for each of the two architectures:

Justification: Despite being put in two different formats, the SLMs of both systems are built from the same data and hence have the same information content.

 The hybrid system has a faster learning curve than the purely factorizing one.

Justification: The un-factorizing component suggests fewer candidate diacritizations (by dictionary lookup) than the factorizing component (which generates all the possibilities), which in turn leads to less ambiguity. And due to the Zipfian distribution of natural language, a small dictionary (built from small training data) can quickly capture the frequent words.

N. Habash and O. Rambow, "Arabic Diacritization through Full Morphological Tagging," Proceedings of the Human Language Technologies Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), 2007.

I- Transcription System

A given statistical disambiguation technique operating on either factorized or un-factorized sequences of linguistic entities asymptotes to the same disambiguation accuracy as the size of the annotated training corpora grows arbitrarily large.