Summarizing Arabic Text with AI

The amount of information we consume from multiple sources is growing like never before. To handle this massive amount of information, people are turning to computers to provide summarizations that can be quickly read. Advanced algorithms and machine learning techniques have recently been developed by major software companies that successfully paraphrase lengthy English texts better than anything previously available. However, work on algorithms capable of summarizing Arabic text has not been progressing as quickly, until now.

A group of researchers from Khalifa University's Emirates ICT Innovation Center (EBTIC) and the Khalifa University College of Engineering have developed artificial intelligence (AI) algorithms that can automatically summarize long Arabic texts to produce coherent briefs. Their system overcomes some of the major challenges to automating Arabic summarization, which stem from the complicated nature of the Arabic language.

"The Arabic language tends to have a high degree of ambiguity due to its structure," explained Ahmad Al-Rubaie, EBTIC's Head of Research, Operations and Strategy. In addition, colloquial and classical variations, pronunciation marks above and below letters that are often omitted in writing but change word meanings, and differences between formal written classical Arabic and its modern counterpart complicate matters further. Moreover, evaluation standards for Arabic summarization have not reached the maturity of those for English. Although this is starting to change, different Arabic summarization systems still use different evaluation methods.

To tackle the issues of automatic Arabic summarization using AI, EBTIC proposed and implemented a unique system that combines the advantages of current state-of-the-art summarization research with methods specifically developed for the Arabic language's complex structure.

The system was developed by Lamees Al Qassem as part of her MSc by Research in Engineering. Al Qassem was supervised by Dr. Hassan Barada, Associate Dean of the College of Engineering, Dr. Di Wang, EBTIC Senior Researcher, Ahmad Al-Rubaie and Dr. Nawaf Almoosa, Director of EBTIC and Assistant Professor of Electrical Engineering and Computer Science. She is now pursuing her PhD in Engineering at KU.

Developed as a complete end-to-end solution, the system was designed with a back-end component that collected articles from various UAE-based newspapers and online news outlets, archived them, and produced a summary for each article. The output was served to users through a mobile application developed for the Android operating system. Summaries of relevant stories were provided to users based on their profiles. A limited trial was conducted at KU to test and improve the system, followed by demonstrations at various showcases and events where EBTIC was involved. The most recent of these events was EBTIC's 10th Anniversary celebration in April 2019. The system remains operational and there are plans to develop it further for use by EBTIC partners.

The EBTIC Arabic text summarizer leverages Natural Language Processing (NLP), the branch of AI that helps computers understand, interpret and produce written human language. It works by first running the Arabic text through an algorithm designed by Al Qassem that detects and extracts nouns, as nouns are representative of the key information contained in sentences. The extracted nouns, as well as a number of other features selected through research and experimentation, are fed into a Fuzzy Logic engine, a type of scoring system that determines the degree to which each sentence is important and thus whether it should be included in the final summary. Fuzzy Logic is well suited to this task because sentence importance is a matter of degree rather than a yes-or-no property.
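The extract-then-rank idea described above can be illustrated with a minimal Python sketch. It is not the EBTIC implementation: the names `noun_density` and `summarize` are hypothetical, the noun set is supplied by hand instead of coming from an Arabic noun extractor, and a single noun-density feature stands in for the full feature set and the Fuzzy Logic engine. English tokens are used only to keep the example readable.

```python
from typing import List, Set


def noun_density(sentence: str, nouns: Set[str]) -> float:
    """Fraction of whitespace-separated tokens that are in the noun set."""
    tokens = sentence.split()
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if token in nouns) / len(tokens)


def summarize(sentences: List[str], nouns: Set[str], k: int = 2) -> List[str]:
    """Keep the k highest-scoring sentences, in their original order."""
    # Score every sentence; remember its position so ties favor earlier ones.
    scored = [(noun_density(s, nouns), i) for i, s in enumerate(sentences)]
    top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:k]
    # Restore document order so the summary reads naturally.
    keep = sorted(index for _, index in top)
    return [sentences[i] for i in keep]


if __name__ == "__main__":
    article = ["the cat sat", "hello there", "dog and cat play"]
    print(summarize(article, nouns={"cat", "dog"}, k=2))
```

A real Arabic pipeline would replace the hand-written noun set with morphological analysis and would combine several features before deciding which sentences to keep.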

"In traditional binary logic, a sentence in an article or text can be either important or not important. In Fuzzy Logic, the level of importance can take any of an infinite range of values, enabling the representation of vague concepts. For example, in Fuzzy Logic, a sentence can be very important, slightly important, not that important, or not important at all," Dr. Di Wang said.
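The contrast between binary and fuzzy importance can be sketched in a few lines of Python. This is an illustrative assumption, not the membership functions used in the EBTIC system: the triangular shapes, the three labels, and the 0-to-1 score scale are all chosen here for the example.

```python
from typing import Dict


def triangular(x: float, a: float, b: float, c: float) -> float:
    """Triangular membership: 0 at or beyond a and c, rising to 1 at b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def binary_important(score: float, threshold: float = 0.5) -> bool:
    """Binary logic: a sentence is either important or it is not."""
    return score >= threshold


def fuzzy_importance(score: float) -> Dict[str, float]:
    """Fuzzy logic: a sentence belongs to each label to some degree."""
    return {
        "low": triangular(score, -0.5, 0.0, 0.5),
        "medium": triangular(score, 0.0, 0.5, 1.0),
        "high": triangular(score, 0.5, 1.0, 1.5),
    }


if __name__ == "__main__":
    for score in (0.4, 0.6):
        print(score, binary_important(score), fuzzy_importance(score))
```

Scores of 0.4 and 0.6 fall on opposite sides of the binary threshold, yet their fuzzy memberships are nearly identical: both are mostly "medium", with a small degree of "low" or "high". This is the kind of vagueness that a strict important-or-not split cannot express.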

The team followed rigorous evaluation criteria when researching and developing the summarization system, which has already garnered significant interest from EBTIC's partners. A paper describing the system, titled "Automatic Arabic Text Summarization Based on Fuzzy Logic," won the Best Paper Award at the 2019 UAE Graduate Student Research Competition (GSRC). An extended version of the paper has been accepted for publication at the International Conference on Natural Language Processing and Speech Recognition 2019, to be held at Trento University in Italy. Three other papers have already been published.

"The proposed method and its components were tested and compared to existing state-of-the-art systems, at both the component level and system level," Dr. Barada explained. "For example, our noun extraction algorithm was evaluated against the Stanford morphological analyser to ensure it matches or outperforms the current state-of-the-art." The team also looked at the various Arabic summarization engines available and evaluated their system against the others, using the same Arabic texts and evaluation method where available and possible.

"The research and development process was demanding, but the result was that we outperformed most, if not all, existing methods that have been implemented," Al-Rubaie said.

Dr. Almoosa added, "The team is currently in discussion with a UAE government agency to adapt the system to be able to summarize information that is more demanding, which would require further enhancements and a greater understanding of the underlying meanings in sentences."