Big Data: What NBA stats can teach you about NSA surveillance

Slides from the secret deck published by the Guardian outlining the mission of "Boundless Informant", the NSA's Hadoop-based big data analytics tool.

Leaks of highly classified National Security Agency documents detailing the intelligence organization's international surveillance program have introduced many members of the public to terms like "metadata" for the first time — leaving some frustratingly confused or resigned to apathy.

But in spite of their complexity, big data tools shouldn't be so intimidating. Indeed, average people use them every day, whether it's to follow their favorite athletes or avoid heavy traffic.

For diehard NBA fans, comprehending the NSA’s PRISM program and Boundless Informant indexing tool may be easier than expected.

The information technology industry’s infatuation with "big data" (vast collections of information that would be unwieldy without sophisticated tools to store, index and analyze them efficiently) has grown as advances in storage technology have cut the time it takes to process enormous amounts of data.

The benefits of big data technology have mostly been enjoyed by businesses and, it's now clear, the intelligence community. But since last February, basketball fans have become quite familiar with the tools' power, too.

At stats.nba.com, fans can use big data analytics to delve into professional basketball stats dating back to 1946, giving them the facts to settle an argument or craft a statistically perfect fantasy team. The basketball-obsessed looking for historic hoops trends can find up to 4.5 quadrillion statistical combinations. That’s 4,500,000,000,000,000 numerical amalgamations.

With so much information in one searchable database, users can hunt for answers to very specific questions. Just a few clicks will show a Miami Heat fan exactly where on the court LeBron James is most likely to sink a three-point shot and who usually tosses him the ball beforehand. The user could further figure out if he’s more likely to make the shot while his team is up or down, or when the clock is closer to the buzzer.

Boundless Informant functions similarly, albeit for very different reasons. Analyzing personal metadata and other information collected through programs like PRISM, the NSA's Boundless Informant can show intelligence analysts when and where a terrorist attack is likely to occur — or, perhaps, when a reporter is on the verge of a potentially embarrassing scoop.

In the NSA’s own words, Boundless Informant “use[s] Big Data technology to query SIGINT ["signals intelligence," or information gained from the interception of electronic signals] collection in the cloud to produce near real-time business intelligence describing the agency’s available SIGINT infrastructure and coverage.”

The power of big data analytics lies in its ability to process every piece of information in a given database, no matter how large. In the hands of the NSA, that power has created fresh privacy concerns among those who understand how the application functions.

While relatively little is known about the NSA’s big data analytics tool, leaked documents indicate that Boundless Informant uses the Hadoop Distributed File System, open-source software used to manage large amounts of data. Hadoop serves much the same purpose as SAP HANA, the platform behind the NBA’s stats site; indeed, the two are sometimes used in tandem.

“Every valid [digital network intelligence] and [dial number recognition] metadata record is aggregated to provide a count at the appropriate level,” the classified Boundless Informant explainer says, giving some idea of the amount of data analyzed by the tool.

The Boundless Informant intelligence map leaked to the Guardian depicts almost three billion pieces of intelligence collected in the US over a 30-day period ending in March 2013. If an NSA analyst uses Boundless Informant to review that information, they'll "drill down" from aggregated summaries to the more pertinent details.

But before they can do that, every individual from whom the data was harvested is effectively scrutinized, since their personal information is plugged into the tool for analysis.
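For readers who think in code, the idea of aggregating records into counts and then drilling down can be sketched in a few lines of Python. This is only an illustration: the records, field names and helper functions below are invented for the example, and nothing here reflects how Boundless Informant or stats.nba.com is actually built.

```python
from collections import Counter

# Hypothetical metadata records. Every field name and value is invented
# purely for illustration and does not come from any leaked document.
RECORDS = [
    {"country": "A", "kind": "DNI", "day": 1},
    {"country": "A", "kind": "DNR", "day": 1},
    {"country": "A", "kind": "DNI", "day": 2},
    {"country": "B", "kind": "DNR", "day": 1},
    {"country": "B", "kind": "DNR", "day": 2},
]


def aggregate(records, *keys):
    """Count records grouped by the given field names."""
    return Counter(tuple(r[k] for k in keys) for r in records)


def drill_down(records, **filters):
    """Keep only the records that match every filter."""
    return [r for r in records if all(r[k] == v for k, v in filters.items())]


# Top-level summary: one count per country.
by_country = aggregate(RECORDS, "country")

# Drill down into country "A", then summarize by record type.
in_a_by_kind = aggregate(drill_down(RECORDS, country="A"), "kind")

print(by_country)
print(in_a_by_kind)
```

Note that both helpers read every record they are given. Even in this toy version, producing a summary count means each individual record passes through the tool first.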

LeBron may not mind that every minute detail of his in-game performance is logged, categorized, stored and then analyzed by fans and pundits. The average non-celebrity, however, who is also probably making less than LeBron's $17.5 million, may not be as happy about the exposure.