Tracking #: 2070-3283

This paper is currently under review

Authors:

Wiem Lahbib

Ibrahim Bounhas

Yahya Slimani

Responsible editor:

Guest Editors Semantic Deep Learning 2018

Submission type:

Full Paper

Abstract:

Term mismatch negatively affects the performance of Information Retrieval (IR). User queries are generally imprecise and incomplete, and thus lack important terms needed to understand the user's need. Detecting semantically similar words in the matching process therefore becomes more challenging, especially for complex languages such as Arabic. Classic models that compute relevance scores through exact matching between documents and queries cannot solve this problem. In this article, we propose to integrate domain terminologies into the Query Expansion (QE) process in order to improve Arabic IR results. We investigate several experimental parameters, such as the corpus size, the query length, the expansion method and the word representation model, namely (i) word embeddings and (ii) graph-based representations. In the first category, we use neural embedding models (i.e., word2vec and GloVe). In the second, we build a co-occurrence-based probabilistic graph and compute similarities with BM25. We compare Latent Semantic Analysis (LSA) with both of them. To evaluate our approaches, we conduct multiple experimental scenarios. All experiments are performed on a test collection called Kunuz, which provides documents in several domains, allowing us to assess the impact of domain knowledge on QE. According to multiple state-of-the-art evaluation metrics, results show that incorporating domain terminologies into the QE process outperforms the same process without terminologies. Results also show that deep learning-based QE enhances recall.
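The embedding-based expansion described above can be illustrated with a minimal sketch: candidate terms are ranked by cosine similarity to the query and the top-k are appended. This is not the authors' exact pipeline; the function name `expand_query` and the toy 3-dimensional vectors below are hypothetical stand-ins for word2vec/GloVe embeddings trained on the collection.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def expand_query(query_terms, embeddings, k=2):
    """Return query_terms plus the k vocabulary terms most similar
    to the centroid of the query's known term vectors."""
    known = [embeddings[t] for t in query_terms if t in embeddings]
    if not known:
        return list(query_terms)
    dims = len(known[0])
    centroid = [sum(vec[i] for vec in known) / len(known) for i in range(dims)]
    candidates = [(cosine(centroid, vec), term)
                  for term, vec in embeddings.items()
                  if term not in query_terms]
    candidates.sort(reverse=True)
    return list(query_terms) + [term for _, term in candidates[:k]]

# Hypothetical toy embeddings (real ones would come from word2vec/GloVe).
emb = {
    "bank":    [0.9, 0.1, 0.0],
    "finance": [0.8, 0.2, 0.1],
    "money":   [0.7, 0.3, 0.0],
    "river":   [0.0, 0.9, 0.4],
}
print(expand_query(["bank"], emb, k=2))  # → ['bank', 'finance', 'money']
```

In the paper's setting the candidate vocabulary would be restricted to a domain terminology, so that only in-domain terms are added to the query.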