Why linguistics can't always identify cyber attackers' nationality

The security whodunnit: analysing the language used in an attack is just one tool to assign attribution, and it’s not always reliable.

Malware. Data theft. Ransomware. Everyone wants to know who was behind the latest audacious attack. Several attempts have been made over the years to use linguistics to identify perpetrators, but when it comes to attribution, there are limitations to using this method.

Linguistic analysis came up recently when analysts at intelligence firm Flashpoint said there was a Chinese link to the WannaCry ransomware. Much of the security research until then had pointed to North Korean ties, as the attacks reused infrastructure components associated with the shadowy Lazarus Group. Before that, a Taia Global report suggested The Shadow Brokers’ manifesto was actually written by a native English speaker, despite the broken English. Linguistic analysis was also used to suggest that Guccifer 2.0, who released documents stolen from the Democratic National Committee, was likely not Romanian as claimed. Back in 2014, Taia Global said linguistic clues in the Sony breach pointed to Russian actors, and not to the North Koreans as the United States government had claimed.

Attribution is hard enough, and relying on linguistic tools appears to add to the confusion. Was WannaCry the work of the Chinese or the North Koreans? Is Guccifer 2.0 Romanian or Russian? Linguistic analysis will very rarely lead to the smoking gun. At the very least, it will uncover a set of clues for researchers to track down, and at best, it will support (or confirm) other pieces of evidence uncovered by technical research and forensic methods. Linguistic analysis is one more tool in the arsenal when it comes to attribution.

“Linguistic evidence, to be reliable, must show a consistent pattern of different features pointing in a single direction,” says Shlomo Argamon, a professor at the Illinois Institute of Technology. He was behind Taia’s original analysis of the Sony hackers and that of The Shadow Brokers.

Understanding the analysis

There are two kinds of analysis: one looks at the source code, and the other examines the text that accompanies an attack. In the first kind, the analysis focuses on code style and patterns to find similarities to other known code samples. Many researchers have relied on this method to link different attacks to a single actor, but this isn’t linguistic analysis.

The second method relies on human language, such as error messages, dialog boxes and messages directly shown to victims. For it to be effective, there needs to be text and plenty of it. Flashpoint’s WannaCry analysis focused on the ransom notes that victims were shown. Argamon analyzed The Shadow Brokers’ rambling manifesto. In the case of Guccifer 2.0, Argamon looked at the interview that Motherboard’s Lorenzo Franceschi-Bicchierai conducted with Guccifer 2.0 over Twitter. In some cases, there is text in the code itself, such as comments, but that is typically considered too little to be useful. “You have to have enough text,” says Argamon.
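
To make the "enough text" point concrete, here is a minimal Python sketch of how a researcher might pull human-readable strings out of a sample and check whether there is much to work with. It is only an illustration, not the method used by Flashpoint or Argamon: the function names are hypothetical, and the 200-word threshold is an assumption, since no specific figure is given for how much text is enough.

```python
import re

# Hypothetical threshold: no real figure for "enough text" is given here.
MIN_WORDS = 200


def extract_text_runs(data: bytes, min_len: int = 6) -> list[str]:
    """Return runs of printable ASCII that are at least min_len bytes long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


def word_count(runs: list[str]) -> int:
    """Count alphabetic word tokens across all extracted runs."""
    return sum(len(re.findall(r"[A-Za-z']+", run)) for run in runs)


def has_enough_text(data: bytes, min_words: int = MIN_WORDS) -> bool:
    """Rough check: is there enough readable text to attempt an analysis?"""
    return word_count(extract_text_runs(data)) >= min_words


# Toy input standing in for a binary with two embedded messages.
sample = b"\x00\x01Your files have been encrypted\x00\xff\x02ab\x00Send payment now\x00"
runs = extract_text_runs(sample)
print(runs)
print(word_count(runs), has_enough_text(sample))
```

A couple of short messages like these fall far below any plausible threshold, which is why comments and stray strings in code rarely support a linguistic conclusion on their own. The sketch only handles ASCII; text in other scripts would need a different extraction step.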