Tracking #: 1871-3084

In information retrieval (IR), user queries are generally imprecise and incomplete, which is challenging, especially for complex languages such as Arabic. IR systems are limited by the term-mismatch phenomenon, since they employ models based on exact matching between documents and queries to compute relevance scores. In this article, we propose to integrate domain terminologies into the Query Expansion (QE) process in order to improve Arabic IR results. To this end, we investigate different semantic similarity models: word embedding, Latent Semantic Analysis (LSA), and probabilistic graph-based models. To evaluate our approaches, we conduct multiple experimental scenarios. All experiments are performed on a test collection called Kunuz, whose documents are organized into several domains. This allows us to assess the impact of domain knowledge on QE. According to multiple state-of-the-art evaluation metrics, results show that incorporating domain terminologies into the QE process outperforms the same process without terminologies. Results also show that deep-learning-based QE enhances recall.
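For readers unfamiliar with the scheme the abstract alludes to, terminology-based QE with word embeddings can be sketched roughly as follows. This is only an illustration of the general idea, not the authors' implementation: the vectors, terms, and function names below are all invented.

```python
import math

# Toy embedding table standing in for a trained word2vec model
# (made-up vectors and terms, for illustration only).
VECS = {
    "bank":    (0.90, 0.10, 0.00),
    "finance": (0.80, 0.20, 0.10),
    "loan":    (0.85, 0.15, 0.05),
    "river":   (0.10, 0.90, 0.20),
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def expand_query(query_terms, terminology, n):
    """Rank terminology terms by their mean cosine similarity to the
    query terms and append the top n as expansion terms."""
    scored = []
    for t in terminology:
        if t in VECS and t not in query_terms:
            sims = [cosine(VECS[t], VECS[q]) for q in query_terms if q in VECS]
            if sims:
                scored.append((sum(sims) / len(sims), t))
    scored.sort(reverse=True)
    return query_terms + [t for _, t in scored[:n]]

# A one-word query pulls in its closest in-domain term ("loan").
print(expand_query(["bank"], ["finance", "loan", "river"], 1))
```

The key design question the reviews return to is exactly the one left open here: which terminology the similarities are computed against, and how many terms (`n`) are added.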

The paper describes a Query Expansion strategy for the Arabic language. The task is particularly challenging, as Arabic exhibits a high level of ambiguity.

I regret to say that the paper is not ready for publication and that I do not see it fitting the journal very well, as there is not much deep learning in it, except for the use of similarity over word embeddings.

The paper is neither properly planned nor proofread. The text is often redundant and contains sentences disconnected from their context (e.g., "Querying a local data collection can hardly be considered…" does not seem related to the focus of the introduction in any way).

Despite the effort and the practical importance of QE for Arabic, the method does not seem very innovative: it mostly combines a series of similarity estimations using multiple representations (LSA, word embeddings, graphs). The authors have carried out a careful analysis of the results, which is interesting, but it remains quantitative rather than qualitative, limiting the insights to a minimum.

I recommend that the authors rewrite the paper in a shorter form, perhaps for a workshop or an IR/QE journal, avoiding redundancy and carefully planning every section (e.g., around the few points they want to communicate). English proofreading is necessary.

Typos:
- Title: infromation —> information
- Vqand —> Vq and
- isn’t —> is not
- Probabilistic Graph Mining section —> there is a strange “(1)” in the text, which I am not sure about.
- Table 1: What do the numbers represent?
- Using word embedding similarity does not mean using deep learning

Review #2

By Cristina España i Bonet submitted on 29/Apr/2018

Suggestion: Major Revision

Review Comment:

This is a complete work that studies query expansion (QE) via in-domain terminologies in an Arabic IR system. In my opinion, the title does not fully correspond to the content of the paper: "deep learning" refers only to word vectors and, besides, the effect of word vectors is not completely analysed. Results favour BM25 over word vectors for the similarity calculus, but the authors state that "they seem to be affected by the size of the training set and its nature". This should be investigated further before drawing conclusions.

(1) originality

(General-domain) word vectors have been used before for query expansion, and the terminology extraction procedure has already been published by the same authors. The main contribution is therefore the target-domain detection used to estimate the similarities, and the fact that the approach is applied to Arabic.

(2) significance of the results

Results are inconclusive, as they depend heavily on the query parameters, and sometimes the best system is the one without query expansion. Significance tests could help in assessing the significance of the results. The methodology is interesting, though, and I miss details on the characteristics of the word vectors if that is the focus of the paper.
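As a concrete illustration of the significance-test suggestion, a paired randomization (sign-flip) test over per-query scores is simple to run and makes no normality assumption. The per-query numbers below are invented purely to show the procedure; they are not the paper's results.

```python
import random

def randomization_test(a, b, trials=10000, seed=0):
    """Two-sided paired randomization test: under the null hypothesis,
    the sign of each per-query difference (a_i - b_i) is equally likely
    to flip, so we compare the observed mean difference against the
    distribution of sign-flipped mean differences."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(trials):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            hits += 1
    return hits / trials

# Hypothetical per-query average precision, baseline vs. expanded queries.
baseline = [0.31, 0.25, 0.40, 0.22, 0.35, 0.28, 0.33, 0.27]
expanded = [0.36, 0.29, 0.41, 0.30, 0.37, 0.31, 0.38, 0.28]
p = randomization_test(baseline, expanded)
```

With only 34 queries, such a test (or a Wilcoxon signed-rank test) would show whether the reported differences between the expanded and non-expanded runs are more than parameter noise.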

(3) quality of writing

One of the weak points of the paper is the way it is written. Some paragraphs are hard to understand and to connect with the previous ones. The writing would also benefit from revision by a native speaker. Besides:
- There are typos all around, but the most important places are the title itself and Fig. 4.
- The same acronym is defined several times throughout the article (QE is defined four times, for instance)
- The spacing in tables and figures makes captions unclear
- References should be listed alphabetically by surname. In the text, please substitute "and al" by "et al." as in the example:
"Alromima and al,"
Alromima \emph{et al.},

- "Also, sentences in Arabic do not follow an exact structure as in French or English: Subject + Verb + Complement, which makes the treatment of these texts difficult."
Why?

- "All query expansion approaches can be classified into two main groups: (1) Global analysis and (2) Local analysis."
But then the following resources are not described under this classification; there is a mix with the interactive and automatic QE approaches.

- How can Abbache [1] and Abderrahim [24] assert that "the use of AWN is much better than using AMD" if [1] did not obtain improvements with a plain use of AWN?

- "we noticed that some researchers have tested their models using a limited number of queries."
But you do not improve on this: your results are based on 34 queries.

3. Terminology-based QE using word embedding

- "QE can contribute in solving the problems of short and imprecise queries and in handling morphological and semantic variations"
In the example "Drinks", how can QE help if there is no context?

3.1 Preprocessing

- AyedTool
No reference is given

- "The comparison reveals that MADAMIRA gives the lower ambiguity levels"
Is the comparison yours? If so, could you give more details? If not, please add the reference.

3.2 Terminology extraction

- "we obtain a minimal terminology"
What is the criterion used for choosing the appropriate size? Number of elements? A TF-IDF threshold? Have you experimented with different sizes, or compared the results with those obtained with a true terminology of a given domain?

3.4 Knowledge representation similarity calculus

- "In this work, we compare two different IR models"
Okapi BM25, and which is the other one?

- "Where IDF is the Inverse Term Frequency of the term i"
You should change the name "IDF" to distinguish it from the standard Inverse Document Frequency
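For reference, the standard Okapi BM25 term weight uses the Robertson/Spärck Jones inverse *document* frequency sketched below (textbook formulation with the usual k1/b defaults, not necessarily the paper's exact variant); whatever per-term quantity the paper computes should carry a different symbol to avoid a clash with it.

```python
import math

def bm25_idf(N, df):
    """Robertson/Sparck Jones inverse document frequency:
    N = number of documents in the collection,
    df = number of documents containing the term."""
    return math.log((N - df + 0.5) / (df + 0.5) + 1.0)

def bm25_term_score(tf, doc_len, avg_doc_len, N, df, k1=1.2, b=0.75):
    """Contribution of a single query term to the BM25 score of one
    document; tf is the term's frequency in that document."""
    idf = bm25_idf(N, df)
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm
```

Note that this IDF is a collection-level statistic (rarer terms weigh more), which is precisely why reusing the name for a per-term frequency quantity is confusing.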

- "we train our model on different representative terminologies of a large number of areas."
Do you train w2v on your own terminologies or on other ones? On all of yours? How many areas? Please add more details.

- "seep-learning"
deep-learning

4. Experimental study

- The introduction is not clear:

"The main goal of our experiments is to assess the role of terminologies in enhancing IR through QE using word embedding. This approach will be compared to a baseline which consists on carrying out a QE without TDD."

Word embeddings learned on terminologies versus word embeddings in general would lead to another (also interesting) QE experiment. It is not clear either that the next sentence describes the baseline which, on the other hand, is already defined later.

- "We add the n top ranked terms to q. n is equal to query length."
(Here and in the other scenarios.) Why? Has this been shown to be the best choice? Does lowering the number for "full queries" not help query expansion?

- "Table 2 recapitulates experimented in this paper."
POSS has not been defined and does not appear anywhere else. Either remove it or show its results.

- Section numbering is wrong from 1.1 on.

- Tables in general.
The metrics shown in the tables should be introduced somewhere. What is RP? P@0 should be P@1, right?

- "word embeddings seem to be efficient in QE, but it seems (they seem) to be affected by the size of the training set and its nature"
This is relevant to the conclusions and should be analysed quantitatively.

2. Conclusions

should be 5. Conclusions

Review #3

Anonymous submitted on 29/May/2018

Suggestion: Major Revision

Review Comment:

The paper focuses on the use of deep learning for query expansion over Arabic domain texts. The authors aim to show that incorporating domain-specific terminology improves the results and that deep learning techniques improve the coverage of the retrieved data.
As a whole, these ideas are not new: that terminology helps query expansion and that query expansion leads to better coverage are both established. I see some novelty in the following respects: the specific language, and the experiments in which the target domain is detected automatically. I also see the comparison of the graph-based methods with the neural ones as an advantage.
First of all, I think that the paper needs substantial polishing of the English phrasing. There are typos, grammatical errors, and unclear sentences.
The paper claims to build on some previous work [41], but that work is in press, so it cannot be consulted. I would not call the preprocessing "a sophisticated method for linguistic analysis".
It would be nice to see some information about the domains used, about the influence of the data size on each method, and some error analysis across the various domains.
The presented results are interesting but too aggregated, and the discussion involves too many parameters, such as the stop words, the text length, etc. The narrative should be made clearer and more concise.