The Needles in the Haystacks: ISIS, Social Media and Big Language

Social media, email and chat have become a primary means of communication for those who have access to technology and the internet. It is how kids communicate, how businesses market, how individuals organize to protest their governments and support their causes; it is also how terrorist groups recruit, share information, and execute terror attacks.

Social media and other forms of informal communication can provide the intelligence needed to help our security forces prevent future attacks and identify terror networks and the individual actors within them. However, this data is generated in great volume, at a speed and with an uncontrolled variability unlike any other data source used by the intelligence and security community.

80% of all social network communications are in languages other than English.

We saw the power of social media from December 2010 through 2011, when citizens in Tunisia, Egypt, Libya and many other countries in the region organized using applications like Facebook, YouTube and Twitter to protest against the oppression and corruption of their governments and to seek immediate change.

The volume and speed of this data were overwhelming for observers who were trying to understand what was going on and what it implied for each country and the region. A great deal of valuable intelligence was in these social media posts, but often in varying Arabic dialects and in Arabezi, a Latin-script representation of Arabic. Intelligence organizations did not have enough translators available to review, translate and digest all of this data.

This was Big Language Data – the kind that is full of slang, acronyms, and idioms – all in a variety of languages and dialects spoken within the region. This Big Language data has only become bigger and more diverse. How big? Today, there are more than 6 Million Twitter users in the Arab world producing 17 Million Arabic tweets per day.

The Intelligence Challenge of Informal Languages in Social Media

For the past half-decade, ISIS and similar terror groups have harnessed the global reach that social media enables. They have used social media, chat and SMS to recruit, communicate plans, and enable the execution of their acts of terror. This communication occurs in many languages: English, Arabic dialects, Arabezi, French, French dialects from the Sahel, and many more. What makes this even more complicated is that the slang and idioms in use keep changing. There are now reportedly more than 45,000 ISIS supporter accounts on Twitter, with an additional presence on Facebook, WhatsApp, Kik, and other sites. ISIS-related tweets are now approaching 2 million per day. Terrorist groups use social media to hide in the open.

The reality is that our intelligence and security organizations, and their contractors, do not employ enough linguists to collect, translate and analyze all of this data, most of which is not relevant to the issue. There isn't time to wait while we recruit, screen, test and employ more linguists. We need to integrate technology that provides some level of filtering so that linguists and analysts can focus on the important information.
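To make the idea of filtering concrete, here is a minimal sketch in Python of a first-pass triage step. It is an illustration only, not a description of any fielded system: the function names, the `Flagged` record and the watch terms are all hypothetical, and a real pipeline would rely on machine translation and trained models rather than simple substring matching. The sketch tags each message by its dominant script, so that Arabic-script text and Latin-script text such as Arabezi can be routed differently, and it keeps only messages that contain a term an analyst has asked to watch.

```python
from dataclasses import dataclass


def dominant_script(text: str) -> str:
    """Return 'arabic', 'latin', or 'other' for a message.

    Counts characters in the main Arabic Unicode block (U+0600 to U+06FF)
    against ASCII letters. This is a rough heuristic, not language ID.
    """
    arabic = sum(1 for ch in text if "\u0600" <= ch <= "\u06FF")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    if arabic == 0 and latin == 0:
        return "other"
    return "arabic" if arabic >= latin else "latin"


@dataclass
class Flagged:
    text: str      # the original message
    script: str    # result of dominant_script()
    matched: list  # watch terms found in the message


def triage(messages, watch_terms):
    """Keep only messages containing a watch term; tag each with its script."""
    flagged = []
    for msg in messages:
        lowered = msg.lower()
        hits = [term for term in watch_terms if term.lower() in lowered]
        if hits:
            flagged.append(Flagged(msg, dominant_script(msg), hits))
    return flagged


if __name__ == "__main__":
    # Placeholder messages and watch terms, invented for illustration.
    sample = ["the meeting is tonight", "nice weather today"]
    for item in triage(sample, ["meeting"]):
        print(item.script, item.matched, item.text)
```

Even a crude pass like this shows the shape of the workflow: the machine discards the bulk of irrelevant traffic and labels what remains, and the linguist spends time only on the messages that survive.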

The example below shows how complex this problem is, and that it goes beyond Arabic. The following tweets come from four countries in Africa and illustrate how varied informal language is. Now multiply this by the number of countries where terror cells are being formed or are currently active, and by the number of regional languages in common use. It is not uncommon to see Arabezi, French dialects, and tribal languages in informal messages and communications.

Informal Language Geopolitical Boundaries

Using technology to help with the gisting and filtering of informal language data is critical. Many defense and security organizations have already moved down this path. However, it is important to understand that a “standard” Arabic Machine Translation (MT) engine is not going to catch the unique words and terms used in informal dialects of Arabic. Each block in the maps below represents an Arabic dialect, sometimes along country borders and sometimes not. As ISIS continues to push into non-Arabic-speaking countries, we will see more code-switching in their communications.

Informal Languages and Machine Translation Solutions

For this reason, SDL Government has released a new Arabic-to-English engine that provides machine translation of Modern Standard Arabic (MSA), Arabezi, 8 Arabic dialects (from Morocco, Algeria, Tunisia, Libya, UAE, Yemen, Syria and Egypt), and a variety of informal terms and slang from across the Arabic-speaking regions of Africa and the Middle East. An MT solution such as this provides a level of automation that reduces the amount of data that ‘falls to the floor’ because we just don’t have the linguists to view all of it. It is critical that the US intelligence and defense communities invest in the continued development of informal-language corpora, especially for languages in regions where the threat of recruitment by ISIS and other terrorist organizations is high.

MT can’t do it all; it is not a replacement for the linguist. But it is a feasible and accessible means to enable Big Language data translation and analytics across a continuum of sources and content. If we continue to treat multilingual social media as just a Big Data problem, we are going to continue to fall behind.

Finding the needles in this haystack is a truly complex problem, but we need to move quickly and effectively as a community. Solving the Big Language problem (1) requires the correct tools and approach when dealing with multilingual data, (2) requires a balance between translation speed and accuracy when translating the content into a target language, and (3) requires a balance between human and automated translation. The need is there, and the technology solution is ready today.

SDL Government introduced the first Informal Arabic to English language pair for machine translation.

About the Author

S Danny Rajan is CEO of SDL Government in Herndon, VA.

SDL Government provides innovative Language Technology Solutions for the US Defense and Intelligence community, built around its suite of COTS products and its technical solutions team.