CAMS-KG: a Classical Arabic Morpho-Semantic Knowledge Graph

Tracking #: 1914-3127

Authors:

Ibrahim Bounhas

Nadia Soudani

Yahya Slimani

Responsible editor:

Guest Editors Knowledge Graphs 2018

Submission type:

Full Paper

Abstract:

Abstract. In this paper we propose to build a morpho-semantic knowledge graph from Arabic vocalized corpora. Our work focuses on classical Arabic as it has not deeply investigated in related works. We use a tool suite which allows analyzing and disambiguating Arabic texts, taking into account short diacritics to reduce ambiguities. At the morphological level, we combine Ghwanmeh stemmer and MADAMIRA which are adapted to extract a multi-level lexicon from Arabic vocalized corpora. At the semantic level, we infer semantic dependencies between tokens by exploiting contextual knowledge extracted by a concordancer. Both morphological and semantic links are represented through compressed graphs, which are accessed through lazy methods. These graphs are mined using BM25 measure to compute on-to-many similarity. Indeed, we propose to evaluate CAMS-KG in the context of Arabic Information Retrieval (IR). Several scenarios of document indexing and query expansion are assessed. That is, we vary indexing units for Arabic IR based on different levels of morphological knowledge, a challenging issue which is not yet resolved in related works. We also experiment several combinations of morpho-semantic query expansion. This permits to validate our resource and to study its impact on IR based on state-of-the art evaluation metrics.

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

-------- overall review (scale: Excellent, Very good, Good, Fair, Weak)
originality: Very good
significance of the results: Fair
quality of writing: Good

This paper tackles an interesting problem (semantic interpretation of Arabic corpora), with a focus on improving Information Retrieval through Query Expansion (QE). This problem is challenging, given the complex morphology of the Arabic language. CAMS-KG tries to combine morphological and semantic features for a superior QE method. The paper could be written more fluently, and the results could be analyzed better. There are some ambiguities in terms of design choices and evaluation results, which I explain in detail below.

-------- detailed review
summary of the paper:

This paper introduces a morpho-semantic knowledge graph built from a vocalized Classical Arabic corpus. The authors go over the morphology of the Arabic language in detail, and over the principal processes that form the Arabic lexicon: derivation, inflection, and word construction. This analysis reveals some of the challenges of semantic interpretation of the Arabic language. The authors then discuss state-of-the-art resources for morphological and semantic analysis of Arabic. CAMS-KG combines existing tools for morphological analysis (Ghwanmeh) and disambiguation (MADAMIRA), and implements a concordance builder tool and a KG representation for morpho-semantic features.
Given an Arabic corpus, it extracts the morpho-semantic features and stores them in a KG (nodes: root, verbed pattern, lemma, stem, vocalized word, non-vocalized word; links: morphological, e.g. the flectional relation between a lemma and a stem), which is then used for NLP or IR tasks (morpho-semantic query expansion). BM25 ranking is used for retrieving the documents related to a given query.
The authors evaluate CAMS-KG on two large datasets (Tashkeela and ZAD), and compare their morpho-semantic QE with several baselines (use of different morphemes, nodes and links in the KG, and different indexing methods) and with other morphological/semantic QE methods, on 25 queries from the ZAD dataset.

Strengths
1 - The paper tackles a challenging problem of combining morphological and semantic features for Arabic lexicon, in order to improve information retrieval from Arabic corpus.
2 - The authors review previous methods in detail, and explain the challenges with examples, which is insightful for the reader.
3 - The experiments contain many different scenarios and the IR metrics are presented for different baseline methods, as well as state of the art systems.

Major comments
1 - The paper is written in a verbose manner. The provided examples help with understanding the problem; however, I think reformatting some parts of the paper will make it easier to follow.

- there are many methods and tools mentioned in the text (AMIRA, MADAMIRA, Sebawi, etc.); categorizing these methods by their capabilities will help the reader understand the limitations of current methods, differentiate CAMS-KG and understand what gaps it is trying to fill.
- differentiating the methods by their design, algorithm choices, capabilities, etc. may be one way (one table saves many lines of writing)
- the discussion of the experimental results should offer more analysis and conclusions, rather than just reporting numbers.

2 - The results are not very conclusive. In many cases ZAD and CAMS perform similarly, and some of the results described as significant by the authors are only marginally different (possibly within experimental error: 0.535 vs. 0.534). Although in Table 9 the authors show that the difference between methods is significant, this is not reflected in the other results. Running the method multiple times and reporting average scores might remove the effect of experimental error.
3 - The number of retrieved documents may be reduced by adding relevance scores to the expanded tokens and treating them differently from the original query tokens.
4 - Given that there is not always a single best scenario across queries (Tables 10 and 11), it would be worth exploring an ensemble method that combines the different methods to improve accuracy.
5 - Based on the results in Table 12, Sebawi performs better than Ghwanmeh. Why don't the authors use Sebawi in their framework instead of Ghwanmeh? Or maybe IBM-LM, which is stated to perform better than Sebawi?
6 - in Fig. 5, the bars for Q23 are missing.
7 - in table 12, the results for [77], [39], and [13] are reported for stem/root indexing, while “our result” is for lemma indexing. Given that in previous sections it is discussed that indexing method affects the result of IR, it would be better to compare all systems using lemma indexing.
8 - in Table 6, it is not clear what is the difference between “BM25 (lemma, MADAMIRA)” and “BM25 (lemma, MorphToolKit)”; given that MorphToolKit uses MADAMIRA for disambiguation.
9 - the authors introduce Arabic WordNet (AWN) at the beginning of the paper; however, it is not considered in the evaluation results. It would be interesting to compare AWN with the KG created by CAMS-KG.
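
To make comments 3 and 4 concrete, here is a minimal sketch of what I have in mind. It is only an illustration under my own assumptions, not a description of the authors' system: the function names, the expansion weight of 0.3, the toy frequency-based scorer, and the placeholder tokens and document ids are all hypothetical. The fusion step uses reciprocal rank fusion, which is one possible ensemble technique among several; k = 60 is a commonly used default.

```python
from collections import defaultdict


def weighted_query(original_terms, expanded_terms, expansion_weight=0.3):
    """Weight original query terms at 1.0 and expansion terms lower (assumed 0.3)."""
    weights = {term: expansion_weight for term in expanded_terms}
    # Original terms keep full weight even if they also appear as expansions.
    weights.update({term: 1.0 for term in original_terms})
    return weights


def score(doc_tokens, weights):
    """Toy scorer: sum over query terms of weight times raw term frequency."""
    return sum(w * doc_tokens.count(t) for t, w in weights.items())


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids (best first) into one ranking."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(fused, key=lambda d: (-fused[d], d))


# Placeholder tokens: "t1" is the original query term, "t2"/"t3" are expansions.
weights = weighted_query(["t1"], ["t2", "t3"])
docs = {"d1": ["t1", "t1"], "d2": ["t2", "t3", "t3"]}
doc_scores = {doc_id: score(tokens, weights) for doc_id, tokens in docs.items()}

# Three hypothetical scenarios (e.g. different indexing units) ranking the same documents.
fused_ranking = reciprocal_rank_fusion([
    ["d1", "d2", "d3"],
    ["d2", "d1", "d3"],
    ["d2", "d3", "d1"],
])
```

With such a weighting, a document that matches only expansion tokens scores lower than one that matches the original query tokens, so a score threshold can prune the former. The fused ranking rewards documents that are ranked well by several scenarios rather than by a single one.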

Minor comments
1 - in table 4, #relevant documents per query is stated as “([0,72])” which seems to be a typo.
2 - sometimes the highlighted numbers in the evaluation tables are not best results.

Review #2

By Milan Dojchinovski2 submitted on 03/Oct/2018

Suggestion: Reject

Review Comment:

The paper describes a linguistic knowledge graph generated based on Arabic corpora with classical writings.
The paper provides background information on the Arabic language and its specific morphological features (Section 2). Next (in Section 3), it describes the core information which is available as part of the generated knowledge graph. In Section 4 the authors describe the KG construction process and Section 5 presents an evaluation and provides an answer to the question to what extent the KG improves the performance of information retrieval and query expansion.
The key contribution (novelty) of the paper is the generated Arabic knowledge graph.
On the other hand, the method applied for the generation of the Arabic linguistic knowledge graph is not novel - it integrates and makes use of already existing tools and methods.

In general, the authors presented a nice piece of work, however, there are several crucial problems with the paper and presented work:

- the work is not well positioned - the related work is scattered across the paper and thus it is difficult to clearly identify the problems with the existing solutions and the contributions of the authors' work

- the paper is difficult to read and follow - the motivation, the data source, the generation process, the data model and the evaluation are not clearly presented. More importantly, due to the lack of simple examples, it is difficult to understand and interpret particular parts of the paper.

- the evaluation is not satisfactory - the authors primarily focused on experiments in order to evaluate the benefit from the generated knowledge graph. However, they do not evaluate the quality of the created knowledge.

More detailed comments follow:

* abstract - the abstract does not clearly state the contribution of the paper, which is the knowledge graph. In what respect is it better than the related work?

* introduction - I would expect the authors to motivate their work by presenting the existing/related work on knowledge graphs for Arabic. The authors write "Nevertheless, our literature review (cf. section 4.1 and 4.2.1) shows that existent works mainly focused on modern texts which are actually produced and shared on the Web." - what is wrong with these texts? Why would a knowledge graph based on these texts be worse than your approach using classical texts?

* Section 3. CAMS-KG specification - I would expect the authors to discuss also linguistic models for capturing morphological information. There are several relevant works which are not discussed in the paper. This includes 1) lemon-ontolex - a model for representing lexical information and 2) MMoOn - a data model for morphological language data [2]. Also, there is no discussion of related knowledge graphs, such as BabelNet [3] and the Semantic Quran [4].
Also, in section 3 the data model is very poorly described. It is not clear how the information is modeled - concepts and relationships.

- I also find it odd that the authors present the data source (Tashkeela), which is used for the generation of the knowledge graph, very late. I would expect this to be briefly presented in the introduction section, and in more detail in Section 3.

* Section 4 describes the KG generation process, but only actual extraction components are very briefly described (4.1.3 and 4.2.2). From the text in Section 4, it is difficult to understand what is the input and what is the output of the process. A concrete example would help a lot here.

* Section 5 describes several experiments executed in order to identify the benefit of the generated KG. However, the evaluation does not evaluate the quality of the generated KG. This is very important.

- Also, I found nowhere in the paper statistics about the knowledge graph. Also, there is no information on the availability of the dataset (download link, query interface, license).

In general, wrt (1) originality - the key contribution (lexical knowledge graph for Arabic) is original, (2) significance of results - the community can benefit from the results (the KG) as it was presented in the experiments, and (3) quality of writing - the paper has to be significantly improved in order to reach a level for publishing.

Overall, the paper and some parts of the actual work have to be significantly improved, thus I do not recommend acceptance of the paper. However, I would encourage the authors to consider the comments above, and consider resubmission of the paper.

The paper presents the problem of KG extraction from classical Arabic. It defines the problem well, lists the challenges of Arabic processing, and lists the limits of the existent resources.
The contributions of the paper start appearing late, in Section 3.2 and then in Section 4. Before that, the reader has no idea of what the approach of the paper is.
I suggest that these contributions be listed at the end of the introduction section for clarity, so that the reader can focus and know what to expect. The first paragraph of Section 3.2 is definitely better placed in the intro.
Figure 1 (or a cleaned version of it with fewer legends, fewer colors, and a simplified label instead of contextual knowledge extraction) might be better placed and discussed briefly in the introduction, together with a list of contributions.

Overall the paper needs restructuring. The related work is spread all over, redundant in many places and sometimes not consistent. For example, some morphological analyzers are introduced and screened in Section 4, while Arabic morphology is discussed in Section 2.

I suggest that the authors use a Background section where they house a shortened version of the Arabic morphology, along with the other material their work builds upon, such as the essential node, edge and code descriptions for a WebGraph construction.
The paper also lacks a related work section where the authors discuss other graph construction and analysis techniques and compare against them.

The English in the first three sections of the paper needs work. I provided detailed comments below including clarification questions. The English in the rest of the paper is cleaner. The authors are encouraged to carry out a thorough English revision of the whole document especially with the restructuring.

The contextual knowledge extraction box (either capitalize k or reduce C and E) is not explained and seems out of place, as it feeds into morphological analysis, which ignores context, and then into MADAMIRA, which is an entry point to context. So a clarification about that is due in Section 4. I believe it is better to include that box in the disambiguation phase, if my understanding of what the authors are trying to do is correct.

The description of nodes and edges is not clear to the regular reader. It is instrumental that the authors add a motivating example towards the end of the introduction section of how a KG is constructed from text. Then, later on in Section 4, explain in detail the different nodes, edges, their construction and the information/tokens stored in them.

In relation to "Other important funds such as classical Arabic poesy and literature are not yet well studied, while they are available on the Web." You may want to consider the work of Zaraket and his group to generate KGs from hadith documents using morphology and semantic relations expressed in finite state machines.

3.1.2. Morpho-Semantic resources:
Here you may want to refer to the work of Mustafa Jarrar on digital Arabic ontologies as well.

"Unfortunately, most Arabic NLP tools (cf. section 4.1.1), do not take them into account. Besides, only few works tried to exploit them [33, 34]."
The two sentences say the same thing! Maybe you want to look at "Zaraket, Makhlouta: Arabic Morphological Analyzer with Agglutinative Affix Morphemes and Fusional Concatenation Rules. COLING 2012".

The paper mentions classical Arabic and modern standard Arabic. They are definitely different, yet the paper does not make that difference clear. It should, especially since it claims addressing classical Arabic as a contribution (Section 3.2).

The following sentence is the essence of its section.
"It contains six types of nodes corresponding to the different layers of the Arabic lexicon (root, verbed pattern, lemma, stem, vocalized word, non-vocalized word)."
The six node types should probably be itemized, and surely not mentioned inside parentheses!
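
An explicit enumeration, as in the sketch below, is one way to make the data model unambiguous. The six node types are taken from the quoted sentence and the two link kinds from the abstract; everything else (the class names, the `Node` and `Edge` records, the placeholder labels) is a hypothetical illustration of mine, not the authors' schema.

```python
from dataclasses import dataclass
from enum import Enum


class NodeType(Enum):
    """The six lexicon layers named in the paper."""
    ROOT = "root"
    VERBED_PATTERN = "verbed pattern"
    LEMMA = "lemma"
    STEM = "stem"
    VOCALIZED_WORD = "vocalized word"
    NON_VOCALIZED_WORD = "non-vocalized word"


class EdgeKind(Enum):
    """The two kinds of links mentioned in the abstract."""
    MORPHOLOGICAL = "morphological"
    SEMANTIC = "semantic"


@dataclass(frozen=True)
class Node:
    node_type: NodeType
    label: str  # surface form or identifier; placeholders are used below


@dataclass(frozen=True)
class Edge:
    source: Node
    target: Node
    kind: EdgeKind


# Placeholder labels stand in for real Arabic forms.
lemma = Node(NodeType.LEMMA, "LEMMA_PLACEHOLDER")
stem = Node(NodeType.STEM, "STEM_PLACEHOLDER")
# A flectional (morphological) link between a lemma and one of its stems.
edge = Edge(lemma, stem, EdgeKind.MORPHOLOGICAL)
```

A listing of this kind, paired with one real example per node type, would also answer the question of what exactly is stored in each node and edge.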

The BM25 formalization is interesting. The authors need to specify what it means, though. Typically BM25(document, query) returns a score of how relevant the document is to the query. Here, it is not clear what BM25(C, n) returns, where C is a collection of weighted nodes and n is a candidate node (maybe from within the nodes in C).
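
For reference, the conventional document-query form of the score can be sketched as follows. This is a minimal illustration of standard Okapi BM25 over toy token lists, not the authors' BM25(C, n): the function name `bm25_score`, the parameter values k1 = 1.2 and b = 0.75, the non-negative IDF variant, and the placeholder tokens are my own assumptions.

```python
import math
from collections import Counter


def bm25_score(query, doc, corpus, k1=1.2, b=0.75):
    """Okapi BM25 relevance of one tokenized document to a tokenized query.

    corpus is the list of all tokenized documents; it supplies the document
    frequencies and the average document length.
    """
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    term_freq = Counter(doc)
    total = 0.0
    for term in set(query):
        doc_freq = sum(1 for d in corpus if term in d)
        if doc_freq == 0:
            continue  # a term unseen in the corpus contributes nothing
        # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)).
        idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
        freq = term_freq[term]
        length_norm = 1 - b + b * len(doc) / avg_len
        total += idf * freq * (k1 + 1) / (freq + k1 * length_norm)
    return total


# Placeholder tokens; in the paper's setting these could be lemmas or stems.
corpus = [["a", "b", "a"], ["b", "c"], ["c", "d", "e"]]
scores = [bm25_score(["a"], doc, corpus) for doc in corpus]
```

In this form the roles of "query" and "document" are explicit. The paper should state just as explicitly which of C and n plays which role, and what the corpus statistics (document frequency, average length) correspond to in the graph.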

The authors mention information retrieval at the beginning, yet suddenly in the results section they mention that they also perform query expansion. This is a surprise. It should not be. The authors should discuss that at the beginning.

Section 5 describes the corpora and lists a number of experiments the authors performed. However, it does not say what research questions the experiments were designed to answer. The authors should specify the research questions and then say how each experiment is designed to answer some of the questions.
It would also be interesting to include runtime numbers to show how complex the computations were.

As the reader moves towards the end of the experimental results section, new approaches that were not described before appear! They are good surprises! But they should not be. The authors should tell us briefly about their experiments and motivate the different approaches they list in the introduction and methodology sections of the paper.
They should also discuss the results and provide intuitive explanations for the highlighted numbers.

Abstract:
=========
it has not deeply investigated ---> has not been deeply
using BM25 measure ---> using measure BM25 ( I would even spell what BM25 is: Best Match 25th relevance iteration)
to compute on-to-many ---> to compute one-to-many
in related works ---> in related work (applies across the paper)

Introduction:
==================
transformed in ---> transformed into
last sentence of the first paragraph is hard to understand. needs rewriting: split in two for simplicity.
transforming them in ---> transforming them into
IRS in second sentence second par ---> spell it out here and not later.
Due to the shortness of queries and language ambiguities ---> Due to language ambiguities and shortness of queries, (flip to disambiguate)

Other important funds ---> Other important resources

taking advantage of the richness of the Arabic lexicon and a contextual ---> taking advantage of both (1) the richness of the Arabic
lexicon and (2) a contextual
to build and asses CAMS-KG ---> to build and assess CAMS-KG

in referring to Sections, sometimes you use section and sometimes Section with capital S. Be consistent with preference to Section. This applies throughout the paper.

Arabic morphology:
==================
very lengthy section with plenty of repetition of what is available out there in the literature: not a big contribution and no real added value. cite one or more references to your analyzers and shorten the section.

the use of 'base', 'stem' and 'root' interchangeably is confusing, either define the three terms (before you use them) or use one of them only.

use a standard transliteration for the Arabic and cite it for reference (I am assuming you are using Buckwalter?)
spell out the pronunciation for the patterns as well

(iii) ---> the term 'word construction' is not necessarily accurate with all clitics. for example, a conjunction clitic "and teachers" is not necessarily a single word.

Sarf's (sourceforge.net/projects/sarf) last modified date is 2007. There are plenty of newer systems out there that you may want to refer to. SAMA is one. You use MADAMIRA, which is good.

3.1.3. Synthesis: maybe you want to explain why the section is named synthesis?

3.2.1. Nodes and edges
----------------------
the types of nodes which should be included ---> nodes included ("should be" gives the impression that the work is not done yet)

surface words ---> what are these? define them shortly
revolves that returning ---> "revolves" is possibly not the best term here; maybe "suggests"?
morphological analyzer [41] evaluates any ...relies on ... ---> "evaluates"/"relies" does not make sense here!
LDC's Arabic Treebank ---> citation missing
Authors showed that the lemma-based semantic retrieval outperforms all the other indexing units. ---> Another support to this claim is the work of Zaraket to extract narrator graphs from Hadith documents (FLAIRS 2012, Cicling 2012)

token as detailed bellow. ---> ach token as detailed below.

An example is unfolded in table 2.. ---> lose the last '.'

Table 1. POS row: use part of speech, lose the '.'. Not sure 'grammatical category' is an accurate description let alone a common attribute! You can also list a few of the main attributes.

CAMS-KG construction and mining
================================
In section 3.2, you seemed to extract lexical and semantic relations, but at the beginning here you refer to 'text mining' techniques! This is a little bit confusing.

"identify the correct morphological solution based on contextual information" ---> I think "correct" here is a big claim. They disambiguate and eliminate the solutions that are less probable given the extracted context features.

The authors here mix NLP tools, morphological analysis tools for Arabic, and stemmers. Even if the justification of their choice of tools is somewhat clear, the comparison is not sound: these are apples and oranges, and the comparison does not stand in its current form. It would be better to discuss the utility of each of these tools for its specific task. So I would make the comparative study Section more of a Tool Utility and Selection Section. The statement that Ghwanmeh is better is repeated several times; I think that should be reduced. In one place, the authors say "cited works" and only cite one reference (77), and then use the term "glean". The enthusiasm here should be reduced a bit.

in table 1. ---> in Table 1.

from glued punctuation signs --- what do you mean by glued here? what signs exactly?
the position in the sentence, the frequency, ---- in the sentence, document or corpora?
and the type --- don't you need morphological analysis for that?

of all the tokens types ---> of all the token types
by any of morpho-syntactic attribute --> by any morpho-syntactic attribute