How the NSA (Or Anyone Else) Can Crack Tor's Anonymity

Researchers identified 81 percent of people using the service with a honeypot scheme and some statistical analysis.

With recent deep web raids from the FBI and Europol, the dark net appears to have more law enforcement eyes on it than ever before. And now, this: Tor, the most common way of accessing the dark net, isn't nearly as anonymous as it purports to be, according to a new study by researchers at Columbia University.

Tor keeps web traffic anonymous by encrypting it and relaying it through a series of servers (that's why Tor tends to run slower than your standard browser), making it look like the traffic is coming from a place that it's not (the last server in the chain, whose IP address is the one the website "sees," is called an "exit node"). If the end server of the site you visited can also detect the point where your traffic enters Tor (the "entry node" or "entry relay"), then anonymity is lost.

This is how Tor normally works. Image: Chakravarty et al.

"Tor (like all current practical low-latency anonymity designs) fails when the attacker can see both ends of the communications channel," the Tor Project wrote in its frequently asked questions page. "For example, suppose the attacker controls or watches the Tor relay you choose to enter the network, and also controls or watches the website you visit. In this case, the research community knows no practical low-latency design that can reliably stop the attacker from correlating volume and timing information on the two sides."

It turns out that the method devised by Sambuddho Chakravarty, the study's lead author, takes advantage of exactly this weakness, using a traffic-monitoring feature built into almost every commercial router and some statistical analysis to figure out who is who. It's a complicated setup, but it goes something like this:

He set up a fake server and a fake website on the deep web, from which the victim has to download a large file. While the file downloads, he collects data from a feature of most routers called NetFlow, which was developed by Cisco to keep summary records of the traffic passing through a router: which addresses are exchanging data, how much, and when.

While that's happening, the server is also sending data back along Tor's various nodes, which are servers designed to disguise where someone is coming from. If the user continues to be routed through these nodes (which requires the file to be continuously downloaded for at least several minutes, perhaps as long as an hour), Chakravarty is able to use the NetFlow information he's collecting to basically guess where that original user's entry node is. He does this, with the help of some advanced statistical analysis, by comparing the pattern of traffic leaving the server with the traffic that routers record flowing between Tor users and their entry nodes.

Graphically, it looks like this:

Image: Chakravarty et al.
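
For readers who want a feel for the matching step, here is a minimal sketch in Python. It is not the researchers' code. It assumes the "statistical analysis" can be stood in for by a plain Pearson correlation between traffic volumes per time interval, and every number, flow name, and function name in it is made up for illustration.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is flat)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def best_match(server_series, client_series):
    """Score every client-side flow against the server-side flow; return the top one."""
    scores = {name: pearson(server_series, s) for name, s in client_series.items()}
    return max(scores, key=scores.get), scores


rng = random.Random(0)  # fixed seed so the toy data is repeatable

# Hypothetical server-side record: kilobytes sent per time interval, with the
# server deliberately alternating between fast and slow bursts.
server_series = [1000, 1000, 100, 100] * 3

# The victim's entry-side flow carries the same pattern plus some jitter;
# the other clients are unrelated traffic. All values are invented.
client_series = {"victim": [kb + rng.randint(-50, 50) for kb in server_series]}
for i in range(5):
    client_series[f"client_{i}"] = [rng.randint(100, 1000) for _ in server_series]

match, scores = best_match(server_series, client_series)
print(match, round(scores[match], 3))
```

The flow whose ups and downs track the server's most closely gets the highest score. Real flow records are far noisier and coarser than this toy data, which is part of why the download has to run for minutes rather than seconds.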

That sounds complicated and perhaps a bit farfetched, but a "powerful adversary," he says, only needs to make a fake website and have some rudimentary number-crunching capabilities in order to recreate it. If that adversary happens to run a lot of other nodes, it becomes even easier.

In the real world, this would mean that the NSA, or FBI, or anyone, really, can set up a honeypot situation where, if you visit a fake site that's rigged with, say, illegal drugs or child porn or something and download a relatively large file from it (around 100 MB, he suggests), your identity can be discovered 81 percent of the time.

From the paper:

"In our attack model, we assume that the victim is lured to access a particular server through Tor, while the adversary collects NetFlow data corresponding to the traffic between the exit node and the server, as well as between Tor clients and victim's entry node. The adversary has control of the particular server (and potentially many others, which victims may visit), and thus knows which exit node the victim traffic originates from."

None of this is particularly easy to do (though Chakravarty noted that he used all open-source tools to do it), but it is well within the grasp (and interests) of the FBI and NSA.

"We assume a powerful adversary, capable enough of observing traffic entering and leaving the Tor network nodes at various points," he wrote. "Such an adversary might be a powerful nation state or colluding nation states that can collaborate."

Because Chakravarty's system is essentially making a very educated guess at who you are, false positives (about 6 percent of the time) are inevitable, which is why Roger Dingledine, president of the Tor Project, says that it's not good enough to be useful in the real world.

"That sounds like it means if you see a traffic flow at one side of the Tor network, and you have a set of 100,000 flows on the other side and you're trying to find the match, then 6,000 of those flows will look like a match. It's easy to see how at scale, this 'base rate fallacy' problem could make the attack effectively useless," Dingledine wrote in a blog post.
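
Dingledine's arithmetic is easy to check. The sketch below is a rough illustration, not anything taken from the paper or the blog post: it assumes the 81 percent figure can be treated as the chance of flagging the one true flow and the 6 percent figure as the chance of flagging any unrelated flow, and the function name is hypothetical.

```python
def expected_matches(n_flows, true_positive_rate, false_positive_rate):
    """Expected hits when one flow out of n_flows is the real match.

    Returns (expected true hits, expected false hits, share of hits that are real).
    """
    true_hits = true_positive_rate * 1                 # the single real flow
    false_hits = false_positive_rate * (n_flows - 1)   # every unrelated flow
    precision = true_hits / (true_hits + false_hits)
    return true_hits, false_hits, precision


true_hits, false_hits, precision = expected_matches(100_000, 0.81, 0.06)
print(f"expected false matches: {false_hits:.0f}")
print(f"chance a flagged flow is the real one: {precision:.4%}")
```

Under those assumptions, the roughly 6,000 false matches swamp the single real one, so any individual flagged flow is almost certainly the wrong person.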

That's a fair point, but it assumes that a powerful adversary doesn't have additional methods of tracking people once it has their IP addresses—which an entity like the NSA or FBI does. It also assumes that this attacker couldn't simply wait for one IP address to show up multiple times, perhaps even over the course of months—which the NSA or FBI could. Chakravarty noted that it doesn't necessarily have to be a powerful state actor launching the attack and that others who have plenty of Tor nodes at their disposal could do it, as well.
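
The "wait for it to show up again" argument can also be put in rough numbers. This sketch assumes, optimistically for the attacker, that false matches are independent from one session to the next and that the same pool of 100,000 candidates is in play each time; neither assumption comes from the paper, and the function names are hypothetical.

```python
def expected_repeat_false_matches(n_flows, false_positive_rate, sessions):
    """Expected number of innocent candidates falsely flagged in every session."""
    return n_flows * false_positive_rate ** sessions


def sessions_needed(n_flows, false_positive_rate, target=1.0):
    """Smallest number of sessions after which fewer than `target` innocent
    candidates are expected to have been flagged every single time."""
    if not 0 < false_positive_rate < 1:
        raise ValueError("false_positive_rate must be between 0 and 1")
    k = 1
    while expected_repeat_false_matches(n_flows, false_positive_rate, k) >= target:
        k += 1
    return k


for k in range(1, 6):
    left = expected_repeat_false_matches(100_000, 0.06, k)
    print(f"{k} session(s): about {left:.2f} innocent candidates still flagged")

print("sessions needed:", sessions_needed(100_000, 0.06))
```

The catch for the attacker is that the real user also has to be caught every time, and if each session succeeds 81 percent of the time, the odds of an unbroken streak shrink as the sessions add up.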

Tor has always said that this type of attack could be possible, but that the incidence of false positives makes it infeasible for law enforcement to use. In other words, Tor is preaching safety in numbers: If there were even more people running Tor relays (also called nodes), it would become much more difficult for the adversary to actually use this attack.

"I should also emphasize that whether this attack can be performed at all has to do with how much of the internet the adversary is able to measure or control," Dingledine wrote. "It seems to me that there's a lot of work left here in showing that it would work in practice."