Data-driven analysis debunks claims that NSA is out of control (Special Report)

If these numbers were reported in a corporate situation, they would be considered an absolute triumph of big data management and implementation. UPDATE: Response/corrections/clarifications from Washington Post reporter.

IMPORTANT UPDATE: Please see the end of this article for a detailed response by Barton Gellman of the Washington Post. He clarifies some of my statements, calls me out on some, and gives us a much better understanding of others. He posted this response to the comments, but I don't want it to get lost.

Just how heinous is the National Security Agency? If press reports and blog postings are to be believed, the NSA and the entire government surveillance apparatus of the United States are completely out of control and we're headed for a Gestapo-style state.

But is that really true? What does the data have to say about it?

Let's start with a basic problem. Big numbers are hard for people to visualize. Really, really, really big numbers are impossible to visualize.

The gotcha that comes out of this cognitive limitation is that it's possible to distort public perception by tossing out big-sounding numbers. Even if an attempt is made to put those numbers in perspective, most readers grab the most savory bit of information, usually from the headline, and that's what becomes their internal representation of the facts.

So let me summarize the results of my data-driven investigation, and then take you through the details:

Facebook captures almost 20 times as much data per day (for just its server logs, not counting everyone's posts) as the NSA captures in total.

The NSA's selection systems are actually insanely accurate. If you compared all the data they capture to a year's worth of time, the errors they make would amount to about a quarter of a millisecond.

The actual byte quantity of erroneous data the NSA records amounts to less than one MP3 track per week.

If these numbers were reported in a corporate situation, they would be considered an absolute triumph of big data management and implementation.

So, there you go. Headlines hyper-inflate the facts. Now let me take you through all the details. Let's start with what happened on Thursday.

According to the Post, an NSA audit described "2,776 incidents in the preceding 12 months of unauthorized collection, storage, access to or distribution of legally protected communications." This describes the period of about May 2011 to May 2012.

It's important to note, before I go further, that I have an incredible degree of respect for both Bart and the EFF. But, to paraphrase President Clinton, it's time to employ some arithmetic.

Volume of NSA data

Here's where the really big numbers come from. According to the NSA itself, in a document released to the public (PDF), the Internet as a whole carries 1,826 petabytes of information per day. Hang with me here. The numbers are not going to make much sense for a little while, but I'll knit them together so you can grasp the big picture.

Of that 1,826 petabytes, the NSA "touches" 1.6%, or just under 30 petabytes. While the NSA doesn't define "touches" in detail, we can assume from context that they mean the data briefly passes through their networks and/or data collection centers. I know you can't picture either 1,826 petabytes or 30 petabytes, but don't worry about that for now. Stick with me. This will make sense soon.

The NSA disclosed that of that 30 petabytes it "touches," only 0.025% is "selected for review". That number is about 7.3 terabytes. We can fairly assume that "selected for review" means those 7.3 terabytes are added to the NSA's global databases and may be examined by federal agents.
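For anyone who wants to check the arithmetic, here is a minimal Python sketch of the two percentages. It assumes decimal units (1 petabyte = 1,000 terabytes), which the NSA document does not spell out, and the variable names are purely illustrative.

```python
# Figures as quoted above from the NSA document.
INTERNET_PB_PER_DAY = 1826     # petabytes carried by the Internet per day
TOUCHED_FRACTION = 0.016       # 1.6% "touched"
SELECTED_FRACTION = 0.00025    # 0.025% of touched data "selected for review"

TB_PER_PB = 1000               # assumption: decimal units

# Data the NSA "touches" each day, in petabytes.
touched_pb = INTERNET_PB_PER_DAY * TOUCHED_FRACTION

# Data "selected for review" each day, converted to terabytes.
selected_tb = touched_pb * SELECTED_FRACTION * TB_PER_PB

print(f"Touched per day:  {touched_pb:.1f} PB")
print(f"Selected per day: {selected_tb:.1f} TB")
```

The results come out at roughly 29.2 petabytes touched and 7.3 terabytes selected, which matches the rounded figures in the text.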

I'll come back to the Washington Post's 2,776 "incidents" in a minute. First, let's get some picture of the difference between petabytes and terabytes.

Picturing the scale of data

The best way I've found to picture these data sizes is by comparing them to money. A single byte, roughly one character (like "B"), could be compared to a penny. If one byte is one penny, then the 140 characters in a tweet are worth about $1.40 (140 pennies).

Okay, let's raise the stakes a bit. A kilobyte is roughly a thousand bytes (I know, 1024, but work with me), or about a thousand characters of text. So far, in this article, you've read about three times that many characters. In terms of pennies, a kilobyte would be about ten bucks, or just about the cost of two Subway sandwiches.

Following along, then, a megabyte is worth about a million pennies, or about $10,000, which is roughly the cost of a used 1998 Toyota Camry. A gigabyte (which in video form will hold just about one episode of a TV show) would be a billion pennies, or about $10 million, the price of a very fancy mansion.

Do you see how these numbers just get insanely bigger? When we go from a kilobyte (a thousand or so) to a gigabyte (a billion or so), we go from a few sandwiches to a Hollywood celebrity's mansion.

Hang with me. I'll bring this back to the NSA in a minute, but you still need to get the full picture. Let's punch it up. Let's go from a gigabyte to a terabyte. Let's say a terabyte is worth a trillion pennies. In dollars, that's about $10 billion, which puts you in billionaire territory: roughly the net worth of Microsoft's Steve Ballmer, and about half the net worth of Jeff Bezos, who just bought the Washington Post for what, for him, is pocket change.

So a terabyte in money terms puts you in Mark Zuckerberg, Bruce Wayne, Lex Luthor territory. So what about a petabyte? We've been flinging the term petabyte around the news all last week, but how much is that? How can we picture it?

Let's use money again. If we're talking a penny a byte, a petabyte is one quadrillion pennies, or about $10 trillion. If it's hard picturing billionaire-level wealth, try this one out for size: $10 trillion is the entire gross domestic product of China and Japan...combined.
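The whole penny-per-byte ladder can be reproduced with a minimal Python sketch. It assumes round decimal units (a kilobyte as 1,000 bytes rather than 1,024), and the helper name `dollars_for` is hypothetical, chosen only for illustration.

```python
PENNIES_PER_BYTE = 1  # the analogy used above: one byte = one penny

# Assumption: round decimal units, as the text suggests ("work with me").
UNITS = {
    "kilobyte": 10**3,
    "megabyte": 10**6,
    "gigabyte": 10**9,
    "terabyte": 10**12,
    "petabyte": 10**15,
}

def dollars_for(num_bytes):
    """Return the dollar value of num_bytes at one penny per byte."""
    return num_bytes * PENNIES_PER_BYTE / 100

print(f"140-character tweet = ${dollars_for(140):,.2f}")
for name, size in UNITS.items():
    print(f"1 {name:<8} = ${dollars_for(size):,.0f}")
```

Each step up the ladder multiplies the dollar figure by a thousand, which is why the comparisons jump from sandwiches to national economies so quickly.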

Okay, so let's go back to trying to picture what the NSA is doing, and doing wrong. Now that we have a frame of reference (ranging from the cost of a submarine sandwich to the total income of China and Japan combined), we can get a feel for the relationship of the terms the press is flinging around.

Parsing the NSA data flow using what we now understand

Let's start with the biggest number first. While the NSA "touches" about 30 petabytes (in the dollar analogy, about $300 trillion, or roughly 20 times America's GDP), it only selects for review about 7.3 terabytes (about the net worth of Bill Gates and Jeff Bezos combined).

By the way, as a reality check, according to Robert Johnson (Facebook Director of Engineering), back in 2011 Facebook collected 130 terabytes of log data each day. Facebook, just in terms of log data (not counting all the cat pictures and recipes everyone posts), gathers almost 20 times as much data each day as the NSA grabs in total.
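The Facebook comparison is a single division, sketched here in Python. It treats the NSA's daily "selected for review" volume (7.3 terabytes) as the figure being compared, which is an interpretation of the text rather than something the sources state directly.

```python
FACEBOOK_LOG_TB_PER_DAY = 130   # 2011 figure attributed to Robert Johnson
NSA_SELECTED_TB_PER_DAY = 7.3   # "selected for review" figure from earlier

# How many times larger Facebook's daily log volume is.
ratio = FACEBOOK_LOG_TB_PER_DAY / NSA_SELECTED_TB_PER_DAY

print(f"Facebook logs vs. NSA selected data: {ratio:.1f}x")
```

The ratio works out to a little under 18, which is where "almost 20 times" comes from.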

Now, let's look at the number 2,776, which is what has everyone all upset.

Before we start playing with this number, let's add one more fact. This number is over the course of a year, while the other data we're looking at is over the course of a day.

2,776 is the number of erroneous data accesses by the NSA that the Washington Post reported. First of all, how much data is that? Since we're talking about metadata, we're not talking full messages. A typical email header has about 4,500 bytes (or about 4K). Let's give the naysayers the benefit of the doubt and let each NSA error be 32K.

Putting it all into perspective

So now, we can start putting the heinousness in perspective. 32K times 2,776 errors is a little under 90 megabytes — or about the size of one Justin Bieber album downloaded as MP3s — per year.

To fit this into the daily numbers we've been working with, let's divide that 90 megabytes by 365. That gives us about 252K. In penny-per-byte terms, that's about $2,500 (or about the cost of one nicely equipped iMac).
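Here is a minimal Python sketch of that calculation. It assumes binary units (1 megabyte = 1,024K) for the per-day figure, since that is what reproduces 252K, and it prices a kilobyte at $10 as in the analogy. The 32K per incident is the generous guess made above, not a measured value.

```python
INCIDENTS_PER_YEAR = 2776   # figure reported by the Washington Post
KB_PER_INCIDENT = 32        # generous per-incident guess from the text
KB_PER_MB = 1024            # assumption: binary units
DAYS_PER_YEAR = 365
DOLLARS_PER_KB = 10         # penny-per-byte analogy, with 1K ~ 1,000 bytes

# Yearly volume of erroneous data.
yearly_kb = INCIDENTS_PER_YEAR * KB_PER_INCIDENT   # 88,832K
yearly_mb = yearly_kb / KB_PER_MB                  # 86.75 MB

# Daily volume, starting from the rounded 90 MB used in the text.
daily_kb_rounded = 90 * KB_PER_MB / DAYS_PER_YEAR

# Daily volume, starting from the unrounded yearly total.
daily_kb_exact = yearly_kb / DAYS_PER_YEAR

print(f"Per year: {yearly_mb:.2f} MB")
print(f"Per day (from 90 MB):  {daily_kb_rounded:.0f}K")
print(f"Per day (unrounded):   {daily_kb_exact:.0f}K")
print(f"Per day in dollars:    ${daily_kb_rounded * DOLLARS_PER_KB:,.0f}")
```

Rounding the yearly total up to 90 megabytes first gives about 252K per day; without that rounding the figure is closer to 243K. Either way it is about a quarter of a megabyte.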

In terms of dollars, which is the analogy we've been using throughout this article, the NSA mistakenly grabs the penny-per-byte data equivalent of an iMac as compared to the penny-per-byte equivalent of the overall net worth of Bill Gates plus Jeff Bezos.

The bottom line is this: the NSA runs about 30 quadrillion bytes through its systems each day. It records about 7 trillion of those bytes. It mistakenly records less than a megabyte a day — less than one MP3 worth of data per day.

Let's put it another way. When we talk about our goals for measuring excellent data center high-availability performance, we look for "five nines" of service availability, meaning that uptime is 99.999 percent. In terms of operating time, five nines means the network will be down all of 5 minutes and 15 seconds for the entire year.

If we picture the NSA's accuracy by comparing it to the commonly accepted IT goal of five nines of high availability (or about five and a quarter minutes per year), the NSA's error rate (described in terms of time) would be 0.2649 milliseconds per year. That's not the Holy Grail of five nines of accuracy. That's more like eleven nines.
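Here is a minimal Python sketch of the availability comparison. It assumes a 365-day year, takes the daily error volume as 252K (252,000 bytes) and the daily "touched" volume as 30 petabytes (30 × 10^15 bytes), and counts "nines" as the negative base-10 logarithm of the error fraction. That last step is a common convention, not something taken from the NSA documents.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # 31,536,000 seconds

# Five nines: 99.999% uptime leaves 0.001% of the year as downtime.
FIVE_NINES_UNAVAILABILITY = 1e-5
five_nines_downtime_s = SECONDS_PER_YEAR * FIVE_NINES_UNAVAILABILITY
minutes, seconds = divmod(five_nines_downtime_s, 60)

# NSA figures from the text (decimal units assumed).
ERROR_BYTES_PER_DAY = 252_000            # about 252K of erroneous data
TOUCHED_BYTES_PER_DAY = 30 * 10**15      # about 30 petabytes touched

# Fraction of touched data that is erroneous, mapped onto a year of time.
error_fraction = ERROR_BYTES_PER_DAY / TOUCHED_BYTES_PER_DAY
error_ms = error_fraction * SECONDS_PER_YEAR * 1000

# Number of nines implied by that error fraction.
nines = -math.log10(error_fraction)

print(f"Five nines downtime: {int(minutes)} min {seconds:.0f} s per year")
print(f"NSA error as time:   {error_ms:.4f} ms per year")
print(f"Equivalent nines:    {nines:.1f}")
```

Note that the comparison is against everything the NSA touches, not just the 7.3 terabytes it selects. Measured against the selected data alone, the error fraction would be larger.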

These numbers don't look to me like a heinous disregard for privacy on the part of the NSA's coders and systems engineers. Instead, it looks to me more like a triumph of IT and database engineering.

Of course, information like that doesn't cause outrage, it doesn't sell newspapers, and it doesn't generate page views. It's just accurate. Looking at actual data rather than breathless hyperbole paints a far clearer picture of the activities of America's most advanced technical intelligence gathering operation.

The following was posted to the comments for this article by Barton Gellman. I'm thrilled he's participating in our conversation. Thanks, Bart, for joining us and sharing clarifications.

From the author of the Washington Post story (Barton Gellman)

I'm the author of The Washington Post story. There's a newsroom expression. "Danger: reporter doing math." I'm not going to audit David, but in any case the math won't be the problem here. The problem is that he misunderstands what he's counting. I don't blame him for that: This is a very complex set of legal, technical and operational questions. I have been following them closely since 2005, and devoted two chapters of my last book to them, and I still don't find them easy. No time for a treatise but a few quick points:

* The "compliance incidents" do not all involve collection. As the story and the documents note, they can take place anywhere along the spectrum of electronic surveillance: collection, retention, processing or distribution. Any of them can range from the minor, with little privacy impact, to the very serious.

* David assumes the surveillance is all about metadata. It is not. Much of it -- an unknown quantity, because the report does not break this down -- is content. As the story notes, the NSA does not "target" Americans for content collection but it does collect a great deal of American content "inadvertently," "incidentally" or deliberately when one party is known to be a foreign target overseas. Most of it stays in databases, and a single search can pull up gigabytes.

* A crucial point to understand: the last two categories of collection on Americans -- "incidental" and deliberate, when one party is overseas -- account for the highest volume of American data in NSA hands. They DO NOT COUNT as incidents. NONE of them are among the 2,776 incidents. As the NSA interprets the law, it is not a violation to collect, keep and process it. Until my story that had never been clear, and the White House still works hard to obscure the difference between forbidden and routine collection (including collection of content) from Americans. "Minimization" rules strip out identities by default, but there are many exceptions and requests from "customers" to unmask identities are readily granted.

* It is not possible to calculate or even estimate within several orders of magnitude the quantity of data involved in 2,776 incidents, nor the number of people affected, even if you know whether you're dealing with metadata or content. A small but unknown number of incidents -- those involving unlawful search terms but obtaining no results -- do not collect, process or disseminate any data at all and thus have zero privacy impact. Other incidents may involve only a few surveillance subjects but include large volumes of data, either because collection takes place over a span of time or because the previously collected data set is very large. One "incident" in the May 2012 report involved over 3,000 database files, and each file contained an unknown (but typically very large) number of records. Another episode -- not counted as an "incident" at all -- collected data on all calls from Washington, DC for an unknown period of time. There is no way to tell from the report alone, but based on the routine procedures and scale of NSA operations it is likely that some of these individual incidents (1 of 2,776) affected hundreds of thousands of people.

* By the way, as again the story notes, the 2,776 cover only Ft. Meade and nearby offices. There would be substantially more incidents in an audit that included the SIGINT Directorate's huge regional operations centers in Texas, Georgia, Colorado and Hawaii -- and the activities of other directorates such as Technology, and such as Information Assurance, that also touch enormous volumes of data.

* It's fair game to take a full data set and challenge a reporter's (or researcher's) analysis of the data. But this was not a full data set and it's a mistake for David to think he can suss out the whole story from the limited number of documents we posted alone. I drew upon other documents and filled the gaps with many hours of old-fashioned interviews. I took some primary material, combined it with other leads, and applied journalism in order to understand what the material says, what it doesn't say, and what inferences can and can't be drawn from it. That's among the reasons we don't just dump documents into the public domain. There are not many stories in the Snowden archive that can be told by documents alone.

* Despite all this, David is surely right to say the error rate is very low in percentage terms. That is important in assessing individual performance, and maybe that's the end of the story for you. That's your choice. For some people, the public policy question considers the absolute number as well. We might not accept the more mundane harm of 1 million lost airline bags a year, even if 99.9 percent of 1 billion bags checked annually made it to their destinations. Some systems have to be designed with less fault tolerance than others. That's a political and social decision, but we have been unable to debate it until the Snowden disclosures.

* Part of the importance of this story is that the government worked so hard to obscure it. In public releases of semi-annual reports to Congress, the administration blacked out ALL statistical data. (By the way, note that the tables in the 14-page document I posted are unclassified. In the DOJ/DNI report to Congress, they were marked Top Secret // Special Intelligence, which made public release impossible and restricted the readership in Congress.) Alongside the refusal to release any data, the government left the very strong impression that mistakes were vanishingly rare and abuse non-existent. That may depend on the definition of "abuse." Marcy Wheeler quotes a TV interview in which I discussed that and makes some additional points here.