Defending NSA Prism's Big Data Tools

The more you know about NSA's Accumulo system and graph analysis, the less likely you are to suspect Prism is a privacy-invading fishing expedition.

It's understandable that democracy-loving citizens everywhere are outraged by the idea that the U.S. Government has back-door access to digital details surrounding email messages, phone conversations, video chats, social networks and more on the servers of mainstream service providers including Microsoft, Google, Yahoo, Facebook, YouTube, Skype and Apple.

But the more you know about the technologies being used by the National Security Agency (NSA), the agency behind the controversial Prism program revealed last week by whistleblower Edward Snowden, the less likely you are to view the project as a ham-fisted effort that's "trading a cherished American value for an unproven theory," as one opinion piece contrasted personal privacy with big data analysis.

The centerpiece of the NSA's data-processing capability is Accumulo, a highly distributed, massively parallel processing key/value store capable of analyzing structured and unstructured data. Accumulo is based on Google's BigTable data model, but the NSA added a cell-level security feature that makes it possible to set access controls on individual bits of data. Without that capability, valuable information might remain out of reach to intelligence analysts who would otherwise have to wait for sanitized data sets scrubbed of personally identifiable information.
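To make the idea of cell-level security concrete, here is a minimal sketch in Python. It is a toy model, not Accumulo's API: the `Cell` class, the `scan` function, the token names and the sample call record are all hypothetical, and the label is simplified to a set of tokens the reader must hold in full. Accumulo's own visibility labels are richer, allowing Boolean expressions over tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """One key/value cell carrying its own access label (hypothetical model)."""
    row: str
    column: str
    value: str
    required: frozenset  # the reader must hold every token in this set


def scan(cells, authorizations):
    """Return only the cells whose label is satisfied by the reader's tokens."""
    auths = frozenset(authorizations)
    return [cell for cell in cells if cell.required <= auths]


# Made-up record: the call details are broadly readable, the name is not.
cells = [
    Cell("call-001", "duration_sec", "340", frozenset({"analyst"})),
    Cell("call-001", "caller_name", "Jane Doe", frozenset({"analyst", "pii"})),
    Cell("call-001", "cell_tower", "T-17", frozenset({"analyst"})),
]

analyst_view = scan(cells, {"analyst"})         # no PII clearance
cleared_view = scan(cells, {"analyst", "pii"})  # PII clearance granted

print([cell.column for cell in analyst_view])
print([cell.column for cell in cleared_view])
```

Because the label travels with each cell rather than with the table, the same record can serve both readers: the analyst without the `pii` token still sees the call duration and tower, while the name stays hidden.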

As InformationWeek reported last September, the NSA has shared Accumulo with the Apache Software Foundation, and the technology has since been commercialized by Sqrrl, a startup launched by six former NSA employees together with former White House cybersecurity strategy director (and now Sqrrl CEO) Ely Kahn.

"The reason NSA built Accumulo and didn't go with another open source project, like HBase or Cassandra, is that they needed a platform where they could tag every single piece of data with a security label that dictates how people can access that data and who can access that data," said Kahn in an interview with InformationWeek.

Having left government employment in 2010, Kahn says he has no knowledge of the Prism program and what information the NSA might be collecting, but he notes that Accumulo makes it possible to interrogate certain details while blocking access to personally identifiable information. This capability is likely among the things James R. Clapper, the U.S. director of National Intelligence, was referring to in a statement on the Prism disclosure that mentioned "numerous safeguards that protect privacy and civil liberties."

Are They Catching Bad Guys?

So the NSA can investigate data with limits, but what good is partial information? One of Accumulo's strengths is finding connections among seemingly unrelated information. "By bringing data sets together, [Accumulo] allowed us to see things in the data that we didn't necessarily see from looking at the data from one point or another," Dave Hurry, head of NSA's computer science research section, told InformationWeek last fall. Accumulo gives NSA the ability "to take data and to stretch it in new ways so that you can find out how to associate it with another piece of data and find those threats."

The power of this capability is finding patterns in seemingly innocuous public network data -- which is how one might describe the data accessed through the Prism program -- yet those patterns might somehow correlate with, say, a database of known terrorists or data on known cyber warfare initiatives.
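As a rough illustration of that kind of correlation, the Python sketch below checks one data set against a list of known identifiers. Everything in it is assumed for the example: the `correlate` function, the field names, the phone numbers and the watchlist are made up, and nothing here reflects how the NSA actually stores or matches data.

```python
def correlate(records, watchlist, key):
    """Return the records whose `key` field appears in the watchlist."""
    return [record for record in records if record.get(key) in watchlist]


# Hypothetical call metadata; no names or content, only numbers and durations.
call_records = [
    {"caller": "555-0101", "callee": "555-0199", "minutes": 4},
    {"caller": "555-0123", "callee": "555-0142", "minutes": 12},
    {"caller": "555-0142", "callee": "555-0177", "minutes": 1},
]

# Hypothetical set of numbers already known to investigators.
watchlist = {"555-0199"}

hits = correlate(call_records, watchlist, "callee")
print(hits)
```

Each record looks innocuous on its own; it only becomes interesting once it is set beside the second data set.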

Sqrrl has supplemented the Accumulo technology with analytical tools including SQL interfaces, statistical analytics interfaces, text search and graph search engines, and there's little doubt the NSA has done the same, according to Kahn. Graph search, in particular, is a powerful tool for investigation, as the NSA itself revealed last month when it shared at a Carnegie Mellon technical conference an in-depth presentation on the 4.4-trillion-node graph database it's running on top of Accumulo.

Nodes are essentially bits of information -- phone numbers, numbers called, locations -- and the relationships between those nodes are edges. NSA's graph uncovered 70.4 trillion edges among those 4.4 trillion nodes. That's truly an ocean of information, but just as Facebook's graph database can help you track down a long-lost high school classmate within seconds, security-oriented graph databases can quickly spot threats.
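The node-and-edge model is easy to sketch at small scale. The Python example below builds a tiny undirected graph from made-up phone numbers and uses a breadth-first search to find the shortest chain of calls linking two of them. The function names and data are hypothetical, and a search that works on five nodes says nothing about the engineering needed to run one across trillions.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the shortest node chain, or None."""
    if start == goal:
        return [start]
    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor in previous:
                continue
            previous[neighbor] = node
            if neighbor == goal:
                # Walk the back-pointers from goal to start, then reverse.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


# Hypothetical edges: each pair means "these two numbers exchanged a call".
edges = [
    ("555-0101", "555-0142"),
    ("555-0142", "555-0177"),
    ("555-0177", "555-0199"),
    ("555-0101", "555-0123"),
]
graph = build_graph(edges)

print(shortest_path(graph, "555-0101", "555-0199"))
```

Breadth-first search is used here because it returns the chain with the fewest hops, which is the usual first question when asking how two nodes are connected.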

Kahn says a Sqrrl partner company that does graph analysis of internal network activity for security purposes recently identified suspicious activity using a graph algorithm. "Five days later, they got a knock on the door from the FBI letting them know that data was being exfiltrated from their network, likely by a foreign entity," Kahn reports.

As we've reported, graph database technology dates back to the 1950s, but only recently has it advanced to truly big data scale, with Facebook exposing its Graph Search capabilities in January and NSA sharing details of its graph search capabilities last month.

Where prior intelligence techniques have largely relied on knowing patterns in advance and alerting authorities when those patterns are detected, security and intelligence analysts now use big data techniques to uncover patterns they did not know to look for.

"Graph analysis is just one really good technique for finding unknown patterns in data," Kahn explains.

Do You Trust The Government?

In the end, assurances from Clapper, a former White House employee like Kahn or even President Barack Obama may do little to assuage the concerns of privacy hawks, critics inside government or large swaths of American citizens. But those who know the technology used by the NSA know that Prism is not a simplistic, "collect first, ask questions later" expedition, and it's not based on an "unproven theory."

One government insider informs InformationWeek that he knows with certainty that "semantic and visual analytics tools have prevented multiple acts of terrorism." That insight predates recent advances in graph analysis that are undoubtedly giving the U.S. Government even more powerful tools. Privacy concerns and the desire for checks on government access to private information must be considered, but we can't naively turn a blind eye to very real threats by not making the most of advanced big data intelligence tools now at our disposal.
