Taia Global's Chief Science Advisor Dr. Shlomo Argamon, one of the country's preeminent researchers in authorship analysis and stylometry, led a team that conducted native language identification (NLI) analysis on the 20 messages left by Sony's hackers. Their results do not support the U.S. government's charge that North Korea was responsible for the network attack against Sony Pictures Entertainment. This post is a mini-version of the full report, which can be downloaded from the Taia Global website.

The Problem

The specific question we address in this report is: what is the (non-English) native language of the authors of the electronic messages (emails and forum posts) signed by the “Guardians of Peace,” putatively the group that hacked into Sony Pictures Entertainment, stole its data, and posted some of it publicly?

The Data

For our analysis, we used twenty messages reported in the media and posted to Pastebin that have been attributed to the “Guardians of Peace” (GOP) group (see Appendix A in the report).

Assumptions and Caveats

To do our analysis, we must first rule out two alternative scenarios.

The first is that one or more native English speakers wrote the messages and then intentionally inserted errors to make it appear as if non-native English speakers had written them.

The second is that the messages are the result of automatic translation of foreign-language original texts, in which case it would be difficult, if not impossible, to determine the original language from the English texts (at least without knowing specifically what translation software had been used). See Appendix D in the report for examples using Google Translate.

Methodology

We apply a two-pronged methodology to analyze the native language of the messages’ authors.
First, we examine a number of candidate languages, including Korean, to see whether we can rule them out and, if not, which one seems the most likely native language.
Second, as a further check, we perform an independent test of the messages’ similarity to English written by native Korean speakers, to see how similar the non-fluencies are.
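
The first prong can be pictured as a simple tally. The sketch below is an illustration only, not the procedure used in the report: it assumes each non-fluency found in the messages has already been annotated by a linguist with the set of candidate languages whose influence could plausibly explain it. The function names and the placeholder labels (lang_a, lang_b, lang_c) are hypothetical and carry no findings from the study.

```python
def tally_support(observed_errors, candidates):
    """Count, per candidate language, how many observed errors it could explain.

    observed_errors: iterable of sets; each set holds the candidate languages
    judged compatible with one error (an assumed, hand-made annotation).
    """
    counts = {lang: 0 for lang in candidates}
    for compatible in observed_errors:
        for lang in compatible:
            if lang in counts:  # ignore labels outside the candidate list
                counts[lang] += 1
    return counts


def rank_candidates(counts):
    """Sort candidates from most to least supported; ties break by name."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Placeholder annotations, invented purely to exercise the code.
    annotations = [{"lang_a", "lang_b"}, {"lang_a"}, set()]
    counts = tally_support(annotations, ["lang_a", "lang_b", "lang_c"])
    for lang, n in rank_candidates(counts):
        print(lang, n)
```

A candidate that explains few or none of the errors is the kind of language one would consider ruling out; the second prong, the comparison against English written by native Korean speakers, is a separate check and is not modeled here.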

The Results

We conclude that it is unlikely that the messages were written by native Korean speakers, though it is not impossible. It is far more likely that they were written by native Russian speakers. It is virtually impossible, however, that they were written by native German or Mandarin Chinese speakers.

If You'd Like To Join Our Study

This study is limited by the small number of languages examined, as well as by the limited comparison with L2 English samples. We plan an expanded study of the messages, comparing against a wider range of candidate languages and performing statistical comparisons against L2 English samples written by native speakers of Korean and other languages.

Taia Global is looking for linguists with academic backgrounds to join this research project commencing in early January 2015. Interested candidates should contact Dr. Shlomo Argamon.