
Today I presented a paper (co-authored with Nizar Habash) on statistical parsing at the IWPT 2011 conference. A three-day event at Dublin City University, this was a great opportunity to meet some leading names in the fields of Computational Linguistics and Natural Language Parsing, and to discuss ideas for further research and future collaboration.

Due to family commitments, I was only able to attend the first day of the conference to present my talk, missing out on the rest of the three-day event. However, it was still a great experience.

Presentation

The paper I presented, One-Step Statistical Parsing of Hybrid Dependency-Constituency Syntactic Representations, was well received, judging by the feedback and response I got after the talk. I managed to get across the key points of the research in the presentation: the linguistic context for why the Quranic Treebank uses a hybrid syntactic representation, the rich morphological features annotated in the treebank, and the challenges these give rise to for statistical parsing. I mentioned that although there could be many ways to solve the hybrid parsing problem, we focused on transition-based shift-reduce parsing, as opposed to a graph-based parsing algorithm – in other words, more like MaltParser than MSTParser.
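For readers unfamiliar with shift-reduce parsing, a minimal arc-standard sketch in Python gives the flavour of the transition system. This is a toy for illustration only – the action names and the oracle interface are simplified assumptions, not the parser described in the paper, and a real transition-based parser replaces the oracle with a trained classifier:

```python
# Minimal arc-standard shift-reduce dependency parsing sketch.
# A toy for illustration: real parsers choose actions with a classifier
# over rich features of the stack and buffer.

def parse(words, oracle):
    """Parse a sentence given an oracle that picks the next action.

    words  -- list of token strings
    oracle -- function (stack, buffer) -> "SHIFT", "LEFT-ARC" or "RIGHT-ARC"
    Returns a list of (head, dependent) arcs over word indices.
    """
    stack, buffer, arcs = [], list(range(len(words))), []
    while buffer or len(stack) > 1:
        action = oracle(stack, buffer)
        if action == "SHIFT" and buffer:
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC" and len(stack) >= 2:
            dep = stack.pop(-2)          # second-from-top depends on top
            arcs.append((stack[-1], dep))
        elif action == "RIGHT-ARC" and len(stack) >= 2:
            dep = stack.pop()            # top depends on second-from-top
            arcs.append((stack[-1], dep))
    return arcs
```

The hybrid parsing in the paper extends this style of transition system with additional actions for raising subgraphs to phrase nodes.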

At the end of the talk, I had time to answer a few questions.

In the first question, Joakim Nivre wanted some further clarification on exactly what the input to the parser was. Although the presentation described the input as gold-standard morphologically tagged text with segmentation, I did not make clear during the talk if empty categories were assumed in the input, or if these were generated by the parser. This was a fair point by Joakim. The paper does cover this in more detail – the parser handles elision directly and this is not assumed in the input. We take only the original source text, segmented and annotated with morphological features.

The second question, from Feiyu Xu, related to how the parser produced phrase structure – in particular, how it was possible to produce complete subgraphs under a phrase or clause. The assumption, I explained, was that at a certain point in its operation, the parser would learn to recognize the head of a subgraph that should be raised to a phrase from the top of the stack. Of course, not all phrases could be formed this way, but given the strong accuracy of the parser for hybrid phrase structure reported in the paper, this would appear to be a reasonable assumption.

In the last question, Mark Steedman wanted to know more about the traditional Arabic grammar used as the linguistic framework for annotating the Quranic Arabic Treebank. In particular, the question was under what conditions the grammar would treat a chunk as a phrase and give it a phrase label, as opposed to using only dependency structure. My answer, based on having worked through numerous examples from the grammatical gold-standard reference texts, was that phrases appear to be made explicit in the grammar when a chunk can stand alone, independent of the rest of the sentence, such as an embedded sentence or subordinate clause.

Opportunities for Future Work and Collaboration

I met a lot of interesting and smart people at the conference, too many to list here by name. Overall, I received two pieces of common feedback when discussing my parsing research. The first was that the hybrid representation was interesting and appealing as a research idea given that not much work has been done in this area, and that there is definitely merit in combining the best of both representations into a single treebank. Secondly, a lot of the feedback I received centred around the next logical step in the research, which would be to integrate morphological analysis into the parser. This would allow the parser to run against raw text instead of using gold-standard morphological input. Different people had different ideas about how this could be done, but nearly everyone agreed it was an important next step.

I also learnt that although some recent initial work has been done on integrating POS-tagging and transition-based dependency parsing for Chinese, there does not appear to be any work on joint morphological analysis for transition-based dependency parsing in any language. Kenji Sagae confirmed my own hunch that for a fully integrated transition approach some form of non-deterministic parsing would be necessary, in order to explore the joint disambiguation search space. He pointed me to his 2010 ACL paper on introducing dynamic programming into shift-reduce parsing. He suggested that I might want to get in touch with Takuya Matsuzaki (also at IWPT 2011) whose 2011 IJCNLP paper uses the same algorithm as Kenji’s to perform joint POS-tagging and syntactic dependency parsing for Chinese. Interestingly, Nizar had pointed out a related 2011 EMNLP paper to me back in July, also on joint tagging for Chinese, but with a focus on graph algorithms instead of transition parsing – another good paper.

I later met with Khalil Sima’an who, it turns out, speaks Arabic as well as Hebrew. Interestingly, Khalil was Reut Tsarfaty’s co-supervisor during her PhD thesis on joint morphological and syntactic analysis for Hebrew. Khalil also knows Eric Atwell, my PhD supervisor at the University of Leeds. He advised that research into joint morphological and syntactic analysis for Arabic was definitely needed.

Finally, I ended the day with a follow-up discussion with Joakim Nivre after the main conference talks had ended. Joakim was open to the idea of collaborating on future research, especially if this involved doing further work on transition-based parsing. Some ideas could include revisiting his work on hybrid parsing for Swedish and German. He seemed impressed with my presentation and the paper, and especially liked the strong empirical results – achieving around 90% accuracy (near state-of-the-art) for dependency parsing. Confirming the other feedback I had received today, he thought that joint morphological and syntactic analysis would be the way to go for further research into parsing Classical Arabic. He also liked the way in which the basic MaltParser algorithm had been extended using additional parser actions to handle hybrid parsing – apparently something he had wanted to do himself for some time.

We also talked briefly about different possible ways to add non-determinism to the parser, as a step towards joint morphological disambiguation. Dynamic programming could be one way, but Joakim suggested that even experimenting with vanilla beam search would be a good first step.
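To make the vanilla beam search idea a little more concrete, here is a generic toy sketch. Everything here – the state, action, scoring and termination interfaces – is a hypothetical illustration of keeping the k best partial analyses at each step, not a concrete proposal from the discussion:

```python
# Toy beam search over transition sequences, as a first step toward
# non-deterministic parsing. The score_fn stands in for a trained model
# scoring each candidate action in a given parser state.
import heapq

def beam_search(initial_state, actions_fn, apply_fn, score_fn, is_final,
                beam_width=4):
    """Keep the beam_width highest-scoring partial analyses at each step."""
    beam = [(0.0, initial_state)]
    while not all(is_final(state) for _, state in beam):
        candidates = []
        for total, state in beam:
            if is_final(state):
                candidates.append((total, state))  # finished analyses carry over
                continue
            for action in actions_fn(state):
                candidates.append((total + score_fn(state, action),
                                   apply_fn(state, action)))
        beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return max(beam, key=lambda c: c[0])
```

With a beam width of 1 this reduces to the usual deterministic greedy parser; widening the beam is what would let joint morphological hypotheses compete.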

Conclusion

All in all, a great day. I met some very intelligent people, experts in their respective fields, and also got to listen to some very interesting and relevant talks. A shame I could only stay for the first day instead of the full three-day conference, but I do have a lot else going on right now with work and family. I would definitely like to pursue some of the ideas discussed today for further collaboration. Hybrid parsing was of interest to the conference delegates given that it is a bit different and not often studied. I also heard from nearly everyone I spoke to that joint morphological disambiguation and syntactic parsing would be a very interesting next step. From what I could tell, the state-of-the-art in this particular research area for transition-based parsing was to include only POS tagging as a joint task. Joakim suggested that including morphological analysis directly into a transition-based parser would be new research, but something that other researchers might soon be looking at as well.

I’ve been working hard the last few days on PhD research. I’ve submitted a suggested plan for our joint submission to Arabica for Eric’s review. The current working title is Detailed Grammatical Analysis of the Quran using Artificial Intelligence. At the same time, I’ve also been running some machine learning experiments for parsing the latest version of the Quranic Arabic Dependency Treebank, for a separate paper. The good news is that I’ve finally managed to figure out how to use SVMs to parse the treebank, with an F-measure accuracy score of around 90%! This should lead to a stronger submission for SPMRL 2011.

Previously, I was working with a C4.5 decision tree classifier, which although competitive, had a slightly lower accuracy score. To get SVMs working, I closely followed Hall and Nivre’s approach to parsing German:

All symbolic features were converted to numerical features and we use the quadratic kernel of the LIBSVM package (Chang and Lin, 2001) for mapping histories to parser actions and arc labels. All results are based on the following settings of LIBSVM: gamma = 0.2 and r = 0 for the kernel parameters, C = 0.5 for the penalty parameter, and epsilon = 1.0 for the termination criterion. We also split the training instances into smaller sets according to the fine-grained part-of-speech of the next input token to train separate one-versus-one multi-class LIBSVM-classifiers.

Not using Weka and instead going direct to LIBSVM has helped a bit with training time. But the three main things I needed to do to get this working were (1) using the right kernel parameters as above, (2) binarizing the features, and crucially (3) training multiple classifiers, one for each part-of-speech of the next input token – essential for reducing training time considerably.
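To illustrate point (3), here is a sketch of splitting training instances by the fine-grained part-of-speech of the next input token and training one small classifier per group. The MajorityClassifier is a deliberately trivial stand-in so the example is self-contained; the actual experiments used LIBSVM models with the quadratic kernel settings quoted above in place of it:

```python
# Sketch of per-POS classifier splitting (the Hall & Nivre setup quoted
# above). Each POS group gets its own model, which keeps each individual
# training problem small and cuts total training time considerably.
from collections import Counter, defaultdict

class MajorityClassifier:
    """Trivial stand-in for a LIBSVM model: predicts the most frequent
    parser action seen in its training group."""
    def fit(self, instances):
        # instances: list of (features, action) pairs
        self.action = Counter(a for _, a in instances).most_common(1)[0][0]
    def predict(self, features):
        return self.action

def train_per_pos(instances):
    """instances: list of (next_token_pos, features, action) triples.
    Returns a dict mapping each fine-grained POS tag to its own model."""
    groups = defaultdict(list)
    for pos, feats, action in instances:
        groups[pos].append((feats, action))
    models = {}
    for pos, group in groups.items():
        clf = MajorityClassifier()
        clf.fit(group)
        models[pos] = clf
    return models
```

At parse time, the POS of the next input token selects which model to consult for the next action.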

Running a complete end-to-end test takes only around two minutes. This includes training on 90% of the data, then testing against an unseen 10% of the data to work out the F-measure score. I’m very happy to have finally got this working. SVMs are way cool!
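For completeness, the F-measure over attachment decisions can be computed from gold and predicted arc sets along these lines. This is a generic sketch of the standard precision/recall/F1 computation, not the exact evaluation script used in the experiments:

```python
# Generic F-measure over dependency arcs: an arc counts as correct only
# if the same (head, dependent) pair appears in the gold standard.

def f_measure(gold_arcs, predicted_arcs):
    gold, pred = set(gold_arcs), set(predicted_arcs)
    if not gold or not pred:
        return 0.0
    correct = len(gold & pred)
    precision = correct / len(pred)
    recall = correct / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

When the parser attaches every word exactly once, precision and recall coincide and the F-measure equals plain attachment accuracy.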

I’ve got a couple of packed months ahead with regards to research. This is a good thing, as having this research going through peer-review and being accepted for publication can only increase the chances of being awarded a doctorate for the final thesis. As well as working on a submission for the Arabica Journal (co-authored with PhD supervisor Eric Atwell), I’m also working on a separate submission with another collaborator for SPMRL 2011.

For SPMRL, it is also possible to dual-submit to the main IWPT conference. The deadline for this is the 18th of July, with the conference and workshop being held at Dublin City University.

Workshop Important Dates (SPMRL)

Submission deadline: July 31st, 2011

Notification to authors: September 5th, 2011

Camera ready copy: September 20th, 2011

Workshop: October 6th, 2011

From the workshop’s FAQ: Can I submit the same paper to IWPT and SPMRL? Yes, double submissions are permitted but obviously the same paper will not be published at both venues. Another possibility for those working on parsing MRLs is to submit two different papers: in this scenario, we encourage authors to view the SPMRL workshop as a venue for more detailed analysis papers.

I recently submitted an abstract to Brill’s Arabica Journal. Eric has kindly agreed to be a co-author on this paper, and his advice and guidance on this will be very valuable. Sébastien Garnier from the journal replied to us to let us know that such a paper would fall within the scope of the Journal’s publication, but that we must submit the full paper for formal peer-review. Sounds promising. Now we need to write the full paper! Here is the abstract we sent to Arabica:

Abstract

The Quranic Arabic Corpus is a recently annotated linguistic resource used to study the Quran through the historical grammar for Classical Arabic known as i’rāb (Dukes, Atwell and Sharaf, 2010). The website (http://corpus.quran.com) is used by Arabic linguists and Quranic students, and provides detailed morphological and syntactic analysis for each word in the Quran. This grammatical information was originally generated by an Artificial Intelligence (AI) computer program, by applying techniques from corpus linguistics to ‘learn’ how to recognize reoccurring patterns in Arabic text. To ensure a high level of accuracy, the website has been proofread by volunteers, who compare the automatic analysis against traditional sources of Quranic grammar through online collaboration (Dukes, Atwell and Habash, 2011). This paper is organized into two parts. In the first part, we describe this new annotated resource and its online interface, and illustrate how its features are used to study the Arabic language of the Quran. In the second part of the paper, we consider a further linguistic application of the resource and its use as a comprehensive grammar of Quranic Arabic. We discuss grammatical analysis for several representative examples from the Quran.

In part one, we describe the Quranic Arabic Corpus and its associated website. This annotated linguistic resource shows the Arabic grammar, syntax and morphology for each word in the Quran. The corpus provides three levels of analysis: morphological annotation, a syntactic treebank and a semantic ontology. Most other Quranic websites include the Arabic text of the Quran, English translations and possibly audio recitation. The Quranic Arabic Corpus goes beyond this by applying annotation techniques from modern corpus linguistics. The website provides detailed part-of-speech and morphological tagging, syntactic dependencies, a word-by-word interlinear translation into English, a hyperlinked concordance and a morphological dictionary organized by Arabic root and lemma. In addition, a comment-based system allows online visitors to discuss the resource in detail, and to suggest corrections online. This feedback mechanism allows corrections to be reviewed and integrated back into the dataset over time, resulting in a highly accurate annotated resource. The project has grown from a small research effort into a significant worldwide study site for Arabic, now used by 100,000 visitors each month, including academic researchers, Quranic scholars and students of Arabic (Dukes et al., 2010).

In part two, we provide further linguistic details, and discuss how dependency graphs are used to annotate grammatical relationships between words. This provides a novel way to visually understand the syntactic structure of Quranic Arabic. These graphs are collected into a treebank that models linguistic dependencies such as verb and subject (fi’il wa fa’il) and subject and predicate (mubtada wa khabar). Similar dependency treebanks have been developed for English (Cinková et al., 2009), Chinese (Liu et al., 2006) and more recently for Modern Standard Arabic (Habash and Roth, 2009). However, using the notation of dependency syntax presents a special set of challenges when applied to i’rāb. This paper addresses the question of how well modern dependency graphs can represent traditional analyses, and we consider the differences between these two approaches. Central to this is the relationship between the concepts of amal and amil from i’rāb (action and actor), and the modern notion of heads and dependents. We show that by extending standard dependency grammar to include hidden nodes, it is possible to support the key technique from i’rāb known as hadhf wa taqdeer (elision and reconstruction). In addition, to fully account for the classical treatment of conjunctions and prepositional phrases, it is necessary to go beyond the syntactic representation used in most other treebanks, by introducing phrase nodes into dependency graphs.
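As a rough sketch of the kind of structure this implies – ordinary word nodes, hidden nodes for elision, and phrase nodes grouping a subgraph – one could model the graphs along the following lines. The field names, node kinds and relation labels here are illustrative assumptions, not the treebank's actual schema:

```python
# Illustrative model of a hybrid dependency-constituency graph:
# word nodes, hidden (elided) nodes for hadhf wa taqdeer, and phrase
# nodes that group a complete subgraph under one label.
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    kind: str = "word"          # "word", "hidden" or "phrase"
    children: list = field(default_factory=list)  # (relation, Node) edges

def add_dependent(head, relation, dep):
    """Attach dep to head under the given grammatical relation."""
    head.children.append((relation, dep))
    return dep
```

A hidden node stands in for an elided word reconstructed by the grammarian, while a phrase node lets a complete subgraph act as a single dependent, which is exactly where the representation departs from plain dependency treebanks.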

We present a new linguistic resource for the Quran, and the first treebank for Quranic Arabic. We also present a novel contribution to Arabic grammatical theory, by investigating the assumed linguistic dependency framework that underlies i’rāb, and by modeling this using formal structures that correspond to other recent treebanking efforts for Modern Standard Arabic (MSA).

In version 0.5 of the Quranic Arabic Corpus, it is planned to extend the Quranic Arabic Dependency Treebank to include chapters 9-10 and 50-58. The following annotation plan divides this work into 36 blocks of roughly 200 words each:

Version 0.4 of the Quranic Arabic Corpus was released on the 1st May, 2011. Here are the release notes from the website:

The Quranic Arabic Corpus (http://corpus.quran.com) is an international collaborative linguistic project initiated at the University of Leeds, that aims to bridge the gap between the traditional Arabic grammar of i’rab and techniques from modern computational linguistics. This open source resource includes part-of-speech tagging for the Quran, morphological segmentation and a formal representation of Quranic syntax using dependency graphs. Version 0.4 of the corpus provides several improvements over the previous release:

*** [Increased coverage for the syntactic treebank]. Version 0.4 of the treebank covers 40% of the Quran by word count (30,895 out of 77,429 words). The treebank provides syntactic annotation using dependency grammar for chapters 1-8 and 59-114 of the Quran.

*** [Improved Quran dictionary and lemmatization]. The list of roots and lemmas that group related derived words has been made more consistent with traditional Arabic lexicons. The online Quran dictionary now also includes concordance lines from Quranic verses as context.

*** [Readability and navigation improvements]. The content of the website has been better organized, with improvements to navigation and layout. Several typing mistakes and omissions have been corrected in the word by word interlinear translation into English.

*** [More accurate tagging of proper nouns]. Eight new named entities have been added to the semantic ontology that were previously tagged only as nouns: Al-Ahqaf, Al-Jahiliyah, Al-Jumu’ah, Baal, Magians, Salsabil, Sirius, and Zaqqum.

*** [More accurate tagging for particles waw and fa]. In accordance with traditional Arabic grammar, for certain words, the particle fa is now tagged as a supplemental particle (harf za’id), such as in the combination a-fa-man.

*** [Version 0.4 of the morphologically annotated corpus] is freely available for download from the Quranic Arabic Corpus website.

The Quranic Arabic Corpus is an open source project. Contributions or questions about the research are more than welcome. Please direct any correspondence to Kais Dukes, PhD researcher at the School of Computing, University of Leeds:

Completing the Quranic Arabic Corpus requires smart use of my limited research time, so discipline and planning are essential. My aim is to have the syntactic treebank cover 100% of the Quran. At present, version 0.3 covers chapters 1-5 and 59-114 of the Quran (30.08%). My plan is to have the treebank include the following additional chapters in each release: