Although English is the most commonly used language in darknet marketplaces, the number of non-English darknet marketplaces has been growing steadily since 2013. Text classification has been used to detect cyber threats in English darknet marketplaces, but the task is considerably harder in non-English marketplaces because of language barriers and a lack of reliable data. Currently available approaches rely on monolingual models and machine-translated data to mitigate these problems. However, translation errors can greatly undermine the reliability of classification results. The abundance of information obtained from English darknet marketplaces could instead be used to understand non-English threats without relying on machine translation.

A recently published research study showed that a deep cross-lingual approach, which learns a common representation shared by two different languages, identifies cyber threats in non-English darknet marketplaces considerably better than a monolingual approach applied to machine-translated data. Unlike previous research, this approach does not rely on any external data sources such as bilingual lexicons or bilingual word embeddings. The study ran its experiments on a Russian darknet marketplace and reported better results than state-of-the-art approaches for detecting non-English cyber threats in online hacker communities. Let’s take a look at the design and results of this research study.

Research design:

The transfer learning-based cross-lingual framework used for cyber threat detection comprises three main components, as illustrated in the figure below:

A- Data collection:

Seven English darknet marketplaces and one Russian darknet marketplace were selected based on data obtained from DeepDotWeb.com. A purpose-built web spider that circumvents anti-crawling measures crawled each marketplace to collect all product descriptions. In total, 95,095 product listings were extracted and parsed into a database. The listings covered products and services related to cyber threats (e.g. keyloggers, ransomware, and DDoS attack kits) as well as other goods (e.g. drugs, books, weapons, and digital goods).

B- Bilingual testbed generation:

To train the proposed model, the product listings in each language were randomly sampled while preserving the ratio between cyber threat products and non-cyber threat products. This yielded 2,373 product listings: 1,821 English and 552 Russian. The English listings were labeled as cyber threats or non-cyber threats by two cybersecurity experts. The Russian listings were labeled in the same manner by a Russian speaker and a cybersecurity expert.
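Ratio-preserving (stratified) sampling of this kind can be sketched in a few lines of Python. This is only an illustration, not the researchers' code: the function name `stratified_sample`, the `is_threat` field, and the toy data are assumptions. In particular, the study does not say how listings were grouped before expert labeling, so the provisional `is_threat` flag stands in for whatever signal was used (for example, a marketplace category).

```python
import random
from collections import defaultdict


def stratified_sample(listings, sample_size, key="is_threat", seed=0):
    """Randomly sample listings while keeping each class's share of the total.

    `key` names a hypothetical per-listing field used to form the classes.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for listing in listings:
        groups[listing[key]].append(listing)

    sample = []
    for group in groups.values():
        # Each class gets a quota proportional to its share of the full set.
        # Because of rounding, the total can differ slightly from sample_size.
        quota = round(sample_size * len(group) / len(listings))
        sample.extend(rng.sample(group, min(quota, len(group))))
    rng.shuffle(sample)
    return sample


# Toy data: 20 threat-related listings out of 100 (a 1:4 ratio).
listings = [{"id": i, "is_threat": i < 20} for i in range(100)]
sample = stratified_sample(listings, sample_size=10)
threat_count = sum(1 for item in sample if item["is_threat"])
print(threat_count, len(sample) - threat_count)  # prints "2 8"
```

In a bilingual setting, the same function would simply be called once per language so that the ratio is preserved within the English and the Russian listings separately.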

C- Cross-lingual cyber threat detection:

The researchers developed a cross-lingual long short-term memory (CL-LSTM) framework for identifying Russian cyber threats. The CL-LSTM architecture comprises three layers: a language-independent BiLSTM layer that learns the common representation shared by English and Russian, and two language-specific LSTM layers that interpret this common representation to detect cyber threats in English and Russian separately.
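To make the data flow through these three layers concrete, here is a minimal, untrained sketch in plain Python. It is not the researchers' implementation. The class names (`LSTMCell`, `CrossLingualLSTM`), the layer sizes, the random weights, the sigmoid output unit, and the assumption that every token arrives as a pre-computed embedding vector are all illustrative choices; a real model would be trained with a deep learning library.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class LSTMCell:
    """A single LSTM layer with random, untrained weights (illustration only)."""

    def __init__(self, input_size, hidden_size, rng):
        self.n = hidden_size
        width = input_size + hidden_size + 1  # input, previous state, bias
        # Rows are grouped by gate: input, forget, candidate, output.
        self.weights = [[rng.uniform(-0.1, 0.1) for _ in range(width)]
                        for _ in range(4 * hidden_size)]

    def run(self, sequence, reverse=False):
        """Return one hidden state per input position."""
        n = self.n
        h, c = [0.0] * n, [0.0] * n
        outputs = []
        steps = list(reversed(sequence)) if reverse else sequence
        for x in steps:
            z = list(x) + h + [1.0]
            pre = [sum(w * v for w, v in zip(row, z)) for row in self.weights]
            i = [sigmoid(p) for p in pre[:n]]
            f = [sigmoid(p) for p in pre[n:2 * n]]
            g = [math.tanh(p) for p in pre[2 * n:3 * n]]
            o = [sigmoid(p) for p in pre[3 * n:]]
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
            outputs.append(h)
        # Re-align backward outputs with the original token positions.
        return outputs[::-1] if reverse else outputs


class CrossLingualLSTM:
    """Shared BiLSTM encoder followed by one LSTM head per language."""

    def __init__(self, embed_size=8, hidden_size=4, langs=("en", "ru"), seed=0):
        rng = random.Random(seed)
        # Language-independent layer: used for every input, whatever its language.
        self.forward_cell = LSTMCell(embed_size, hidden_size, rng)
        self.backward_cell = LSTMCell(embed_size, hidden_size, rng)
        # Language-specific layers: one LSTM and one output unit per language.
        self.heads = {lang: LSTMCell(2 * hidden_size, hidden_size, rng)
                      for lang in langs}
        self.out = {lang: [rng.uniform(-0.1, 0.1) for _ in range(hidden_size + 1)]
                    for lang in langs}

    def encode(self, embedded_tokens):
        """Common representation: forward and backward states concatenated."""
        fwd = self.forward_cell.run(embedded_tokens)
        bwd = self.backward_cell.run(embedded_tokens, reverse=True)
        return [f + b for f, b in zip(fwd, bwd)]

    def predict(self, embedded_tokens, lang):
        """Probability-like score that the listing is a cyber threat."""
        shared = self.encode(embedded_tokens)
        last_state = self.heads[lang].run(shared)[-1]
        weights = self.out[lang]
        return sigmoid(sum(w * v for w, v in zip(weights, last_state + [1.0])))


# Hypothetical three-token listing with 3-dimensional embeddings.
model = CrossLingualLSTM(embed_size=3, hidden_size=2, seed=1)
tokens = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4], [0.5, 0.5, -0.2]]
print(model.predict(tokens, "en"), model.predict(tokens, "ru"))
```

The point of the design is that `encode` is shared: English examples, which are more plentiful, shape the same encoder that Russian listings pass through, while each language keeps its own head for the final decision.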

Results and accuracy of the proposed method:

The researchers compared their proposed method, CL-LSTM, with previously used methods for detecting non-English cyber threats in darknet marketplaces on four metrics: accuracy, recall, precision, and F1 score. They reported that CL-LSTM outperforms all of the previously used methods on each of these metrics by a statistically significant margin. These results suggest that CL-LSTM is not only effective at detecting non-English cyber threats in darknet marketplaces but, given its higher precision, can also reduce the incidence of false positives.
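For readers unfamiliar with these four metrics, the following short Python sketch shows how they are computed from binary labels (1 for cyber threat, 0 for non-threat). The function name `classification_metrics` and the toy labels are illustrative assumptions and are not taken from the study.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = threat)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # threats correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed threats
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-threats correctly passed

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy example: 3 real threats among 8 listings, one missed and one false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Precision is the metric tied to false positives: it is the share of flagged listings that really are threats, so a higher precision means fewer false alarms for analysts to review.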

Final thoughts:

The proposed framework can greatly help cyber threat intelligence (CTI) professionals identify cyber threats in non-English darknet marketplaces without having to rely on machine translation. However, future studies are needed to develop approaches that work reliably on very short product descriptions. Moreover, the framework still needs to be validated for other languages, such as Chinese and Arabic.