Thursday, August 29, 2013

Capacity of the Utah Data Center

The Utah Data Center (UDC), a one-million-square-foot data storage warehouse being built by the NSA at Camp Williams, Utah, is scheduled for completion next month. Estimates of the storage capacity of this facility range from as high as a yottabyte or more of data, enough to store this year’s global Internet traffic more than one thousand times over, to as low as three exabytes. However, an assessment based on the facility’s size and projected power consumption suggests that its initial capacity is likely to be around 7 to 10 exabytes. A 7- to 10-exabyte data warehouse would be an extremely large facility, but even at 10 exabytes it would be only one hundred-thousandth the size of a yottabyte facility.

How big is the Utah Data Center?

The NSA has not released any information about the data storage capacity of the UDC. But it has released some details about its size and physical infrastructure.

According to the NSA, the facility will contain four data halls with a total of 100,000 square feet of raised floor. Other structures totaling approximately 900,000 square feet will host support functions such as materials storage, administration, cooling, and backup power.

The NSA has also reported that the facility is being built with the capability to deliver 65 megawatts “technical load to the raised floor”. This figure has often been taken to represent the total electrical power consumption of the entire facility, but it is likely that this careful wording refers only to the power delivered to the IT equipment in the data halls. Data centers have other power needs, most significantly with respect to providing cooling to the IT equipment.

The scale of the cooling requirements at the UDC can be gleaned from the projected water consumption at the facility. According to the NSA, the UDC’s water usage at full load will be “approx. 1.7 million gal/day”. Cooling towers lose water, principally through evaporation, as part of their normal operation. Typically, about two gallons of water are lost per hour to provide one “ton” of cooling. This suggests that the cooling towers at the UDC will have a cooling capacity of about 36,000 tons. Photos of the facility show that 36 cooling tower cells, each with a 12-foot fan, have been constructed (18 at each end of the complex), which suggests that each cell can provide about 1000 tons of cooling. Another estimate puts the cooling capacity at the facility at 60,000 tons, but this does not seem likely unless additional, alternative cooling technologies that do not require water consumption are also used.

One ton of cooling typically also consumes about 0.6 kW of electrical power, which suggests that the UDC will require approximately 22 MW for cooling when running at its full 36,000-ton load (or 36 MW if 60,000 tons is accurate). Assuming the lower figure is closer to the truth, the total power consumption (IT, cooling, and other) of the facility at peak is likely to be around 90–95 MW: 65 MW for IT, about 22 MW for cooling, and a few megawatts for everything else.
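The chain of rules of thumb above can be reproduced in a few lines. This is only a sketch of the arithmetic described in the text; the constant and variable names are illustrative, and the inputs are the approximate figures quoted above.

```python
# Inputs quoted in the text (all approximate rules of thumb).
WATER_GAL_PER_DAY = 1.7e6   # NSA's stated water use at full load
GAL_PER_TON_HOUR = 2.0      # water lost per hour per "ton" of cooling
KW_PER_TON = 0.6            # electrical power drawn per ton of cooling
IT_LOAD_MW = 65.0           # technical load delivered to the raised floor

# Water use per hour, divided by water per ton-hour, gives tons of cooling.
cooling_tons = WATER_GAL_PER_DAY / 24 / GAL_PER_TON_HOUR

# Tons of cooling times kW per ton, converted from kW to MW.
cooling_mw = cooling_tons * KW_PER_TON / 1000

# IT load plus cooling; "other" loads are not modelled here.
it_plus_cooling_mw = IT_LOAD_MW + cooling_mw

print(f"cooling capacity: {cooling_tons:,.0f} tons")
print(f"cooling power:    {cooling_mw:.1f} MW")
print(f"IT + cooling:     {it_plus_cooling_mw:.1f} MW")
```

Without rounding, the result is a little over 35,000 tons and a little over 21 MW, slightly below the rounded 36,000 tons and 22 MW used in the text. The difference is well within the uncertainty of the rules of thumb.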

Not all of this power will be needed immediately, however. According to the NSA, the UDC will require only 30 MW in its data halls when it first opens, which suggests an initial total power requirement of around 40 MW at the facility. This lower initial power requirement probably relates to the initial rate at which data will be delivered to the facility and accessed for processing by NSA. It would not make sense to run, or even to install, the facility’s full potential capacity months or years before it needs to be utilized. It will probably be a couple of years before the NSA needs to run the UDC at or near its full power.

Moore’s Law

And even at that point the facility is unlikely to be “full”. As long as storage technology continues to advance, the replacement of older storage drives with newer generations will enable the capacity of the UDC to continue to increase. Hard drive storage capacity per unit cost has grown by a factor of 10 every 4.2 years for the last three decades, and storage capacity per watt has grown at a similar rate. If these growth rates can be sustained, the capacity of the UDC could reach the yottabyte range, an almost unthinkably huge amount of data today, in as little as 20 years’ time.
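The growth projection can be checked with a short calculation. The sketch below assumes that capacity at the site grows at the same rate as capacity per unit cost and per watt (a factor of 10 every 4.2 years), which is the premise of the paragraph above rather than an established fact. The function names and the starting capacities are illustrative.

```python
import math

GROWTH_FACTOR = 10.0   # tenfold growth ...
PERIOD_YEARS = 4.2     # ... every 4.2 years, as quoted in the text
EB_PER_YB = 1e6        # 1 yottabyte = 1,000,000 exabytes (decimal prefixes)


def projected_capacity_eb(initial_eb: float, years: float) -> float:
    """Capacity after `years` of sustained exponential growth."""
    return initial_eb * GROWTH_FACTOR ** (years / PERIOD_YEARS)


def years_to_reach(initial_eb: float, target_eb: float) -> float:
    """Years of sustained growth needed to go from initial_eb to target_eb."""
    return PERIOD_YEARS * math.log10(target_eb / initial_eb)


for start_eb in (10.0, 20.0):  # hypothetical starting capacities
    after_20 = projected_capacity_eb(start_eb, 20) / EB_PER_YB
    to_yb = years_to_reach(start_eb, EB_PER_YB)
    print(f"start {start_eb:.0f} EB: {after_20:.2f} YB after 20 years, "
          f"1 YB reached after {to_yb:.1f} years")
```

Starting from 10 exabytes, a full yottabyte is reached after 21 years under these assumptions; starting from 20 exabytes it takes a little under 20 years.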

Ongoing growth in storage capacity does appear to be the NSA’s intent. As the Salt Lake Tribune reported in June 2013, NSA officials will not provide “exact numbers on how much data the NSA is preparing to store at Bluffdale,” but they will confirm that “they built the center’s capacity with an eye on Moore’s law”.

How much data can the UDC hold now?

Assuming that the UDC will operate roughly at, but not significantly beyond, the best level of technology available to large commercial data centers, the amount of data that can be stored at the facility can probably be estimated within a reasonable range by looking at comparable commercial large-scale storage systems.

In June 2013 the Internet Archive’s Brewster Kahle estimated that current cloud storage typically uses about 5 kW per petabyte, or 5 MW per exabyte (IT load only). He later reduced this estimate to about 4.2 MW per exabyte (IT load only). Kahle’s newer estimate suggests that the initial storage capacity of the UDC may be about 7.1 exabytes (30 MW divided by 4.2 MW per exabyte), or perhaps as high as 15 exabytes if the site’s full projected 65 MW technical load is used as the basis of calculation. Based on the size of its data halls and perhaps also his power consumption estimates, Kahle himself estimated a 12-exabyte capacity for the data center.
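The power-based estimate is a single division: IT load in megawatts over megawatts per exabyte. The sketch below applies both of Kahle's figures to the initial and full technical loads reported by the NSA. The function name is illustrative.

```python
# Power intensity of cloud storage, IT load only, per Kahle's two estimates.
MW_PER_EB_EARLIER = 5.0
MW_PER_EB_REVISED = 4.2

# Technical load to the raised floor, per the NSA.
INITIAL_IT_LOAD_MW = 30.0
FULL_IT_LOAD_MW = 65.0


def capacity_from_power_eb(it_load_mw: float, mw_per_eb: float) -> float:
    """Storage capacity (EB) supportable by a given IT load."""
    return it_load_mw / mw_per_eb


for label, load in (("initial", INITIAL_IT_LOAD_MW), ("full", FULL_IT_LOAD_MW)):
    low = capacity_from_power_eb(load, MW_PER_EB_EARLIER)
    high = capacity_from_power_eb(load, MW_PER_EB_REVISED)
    print(f"{label} load ({load:.0f} MW): {low:.1f} to {high:.1f} EB")
```

With the revised figure, the initial load supports roughly 7 exabytes and the full load roughly 15.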

Gladwin was recently quoted by Forbes to the effect that a data facility capable of storing 10 exabytes would have needed about two million square feet in January 2012 but would need only one million square feet now, which also suggests that 10 exabytes may be a good estimate of the current storage capacity of the one-million-square-foot UDC.

The overall size of a facility is at best a very crude measure of its storage capacity, however. A more promising approach to the question might be to compare the amount of raised-floor space available in the facility’s data halls to the physical space required to house a given level of storage capacity with today’s technology.

Disk drives and their associated servers are commonly installed in 19”-wide equipment racks. An efficiently laid out IT room can require as little as 25–30 square feet per rack, including aisles and other space for accessing the equipment, which suggests that the UDC’s data halls could hold as many as 3,300–4,000 racks in total. The amount of storage that can be accommodated in a single rack is constantly growing, but numbers in excess of 2 petabytes are already feasible. The manufacturer Xyratex Ltd., for example, announced in 2012 that its products could accommodate up to 2.5 petabytes per rack, which suggests that the UDC may be capable of storing as much as 8.3–10 exabytes of data, i.e., about the same as the capacity suggested by the facility’s power consumption numbers.
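The floor-space estimate can be laid out the same way. This sketch uses only the figures quoted above and assumes, as the text does, that the entire raised floor is given over to storage racks at the quoted density. The function names are illustrative.

```python
RAISED_FLOOR_SQFT = 100_000   # total raised floor in the four data halls
PB_PER_RACK = 2.5             # per-rack density announced by Xyratex in 2012


def rack_count(sqft_per_rack: float) -> float:
    """Racks that fit on the raised floor at a given footprint per rack."""
    return RAISED_FLOOR_SQFT / sqft_per_rack


def rack_capacity_eb(sqft_per_rack: float) -> float:
    """Total capacity in exabytes (1 EB = 1,000 PB)."""
    return rack_count(sqft_per_rack) * PB_PER_RACK / 1000


for footprint in (30.0, 25.0):  # square feet per rack, including aisles
    print(f"{footprint:.0f} sq ft/rack: {rack_count(footprint):,.0f} racks, "
          f"{rack_capacity_eb(footprint):.1f} EB")
```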

At least one expert believes that storage of this scale is not yet feasible at the facility. Paul Vixie, also quoted in the Forbes article, estimates that there will be “less than three exabytes of data capacity” at the UDC when it opens. However, the facility’s projected power consumption suggests that Vixie’s estimate is likely to be low. If it were correct, it would mean that even at its low initial power consumption, the UDC would require more than 10 MW of power for every exabyte of data stored (and closer to 15 MW once cooling requirements are included). This seems implausibly inefficient for an all-new facility that is seeking LEED Silver certification.
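The efficiency implied by a given capacity estimate is simply power divided by capacity. The sketch below compares Vixie's upper bound with a capacity of about 7 exabytes; the 2.7-exabyte case is a hypothetical value chosen only to illustrate a capacity "less than three".

```python
INITIAL_IT_LOAD_MW = 30.0   # initial data hall load, per the NSA
INITIAL_TOTAL_MW = 40.0     # rough initial total including cooling


def mw_per_eb(power_mw: float, capacity_eb: float) -> float:
    """Power required per exabyte stored."""
    return power_mw / capacity_eb


# 3.0 EB is Vixie's upper bound; 2.7 EB is hypothetical; 7.1 EB is the
# power-based estimate derived from Kahle's 4.2 MW per exabyte.
for capacity in (3.0, 2.7, 7.1):
    it_only = mw_per_eb(INITIAL_IT_LOAD_MW, capacity)
    with_cooling = mw_per_eb(INITIAL_TOTAL_MW, capacity)
    print(f"{capacity:.1f} EB: {it_only:.1f} MW/EB (IT only), "
          f"{with_cooling:.1f} MW/EB (with cooling)")
```

At exactly three exabytes the figures are 10 MW and about 13 MW per exabyte. Both rise as the assumed capacity falls further below three exabytes.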

Thus, the most plausible range for the UDC’s initial storage capacity seems likely to be roughly 7 to 10 exabytes.

What about tape storage?

The numbers used up to this point assume that all of the data stored at the UDC will be stored on disk. This will undoubtedly be the case for data that the NSA needs to have instantaneous access to, such as the metadata used for data mining or indexing intercepted communications. But for cost reasons the agency may prefer to store data that does not require rapid or frequent access on tapes that are loaded into a tape drive only when required. According to a recent Clipper Group study, the purchase cost of storage tapes, tape drives, and automated tape library systems is significantly lower than the cost of comparable disk storage systems. Furthermore, for data that is rarely accessed, the energy costs of tape storage can be as low as 1% of those for disk storage.

The Clipper Group study estimated that tape storage uses roughly one-quarter of the data hall space required for the same amount of disk storage. If, for the sake of analysis, we were to assume that the UDC dedicated half of its data hall space to tape storage, we might expect the total storage capacity at the facility to be around 2.5 times the amount that could be provided solely using disk systems. If the amount of disk-based storage that can currently be accommodated in half of the UDC’s data hall space is around 5 exabytes, for example, the expected overall storage capacity of the UDC would be around 25 exabytes.
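The disk and tape mix works out as follows. This is a sketch of the hypothetical split described above, assuming that tape holds four times as much data as disk in the same floor space (the inverse of the Clipper Group's one-quarter figure). The function name and the 10-exabyte all-disk baseline are illustrative.

```python
TAPE_DENSITY_VS_DISK = 4.0   # tape needs one-quarter of the space of disk


def mixed_capacity_eb(all_disk_eb: float, tape_fraction: float) -> float:
    """Total capacity when tape_fraction of the floor space holds tape.

    all_disk_eb is the capacity the halls would have if filled with disk.
    """
    disk_eb = all_disk_eb * (1 - tape_fraction)
    tape_eb = all_disk_eb * tape_fraction * TAPE_DENSITY_VS_DISK
    return disk_eb + tape_eb


baseline_eb = 10.0  # hypothetical all-disk capacity
half_tape_eb = mixed_capacity_eb(baseline_eb, 0.5)
print(f"half tape: {half_tape_eb:.0f} EB, "
      f"{half_tape_eb / baseline_eb:.1f}x the all-disk figure")
```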

This number is probably too high, however. The NSA is likely to limit the proportion of its data that it is willing to consign to tape storage, and the extensive power and cooling distribution systems constructed at the UDC, together with its large projected power consumption, also argue against the possibility that tape will displace disk-based systems for a large proportion of the site’s storage. Thus, although it is possible that the NSA has opted to use tape at the UDC, tape storage seems unlikely to significantly exceed the scale of disk storage at the site. Allowing for the possibility of tape storage does mean, however, that the initial storage capacity of the site could plausibly be as high as 15–20 exabytes.

NSA whistleblower William Binney provided a somewhat similar estimate of the facility’s size in a September 2012 legal declaration, stating that the facility is likely to store “in the range of multiples of ten exebytes [sic]” of data. (Binney has also been associated with much larger estimates of the facility’s size, but it seems likely that he was referring to its possible future growth in those comments.)

What will NSA store at the UDC?

As the revelations from another whistleblower, Edward Snowden, have confirmed, the torrent of information flowing through the global Internet is now NSA’s largest target. The 7–10 exabytes (or more) that NSA may soon be able to store at the UDC is a very large amount of data, but it is minuscule in comparison to the total amount of information flowing through the Internet, which according to an estimate recently cited by the NSA is nearly 670 exabytes per year. Even 20 exabytes represents only 3% of that gargantuan annual total.

But 7–10 exabytes is not minuscule in comparison to the amount of Internet data that the same NSA document says the agency “touches” (i.e., pulls out of the data streams flowing past) per year. According to that document, NSA currently processes in one form or another about 29 petabytes of Internet data per day—or about 10.6 exabytes per year. (A much smaller proportion of that processed data is actually viewed by analysts, of course.)

And that 10.6 exabytes of data, although it represents only 1.6% of the total data flow, probably comprises a fairly significant proportion of the Internet data that NSA is actually interested in monitoring and is capable of accessing.
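The proportions quoted in the last few paragraphs follow directly from the two figures in the NSA document. A minimal sketch, with illustrative variable names and decimal prefixes (1 EB = 1,000 PB):

```python
INTERNET_EB_PER_YEAR = 670.0   # total Internet traffic cited by the NSA
TOUCHED_PB_PER_DAY = 29.0      # Internet data the NSA "touches" per day

touched_eb_per_year = TOUCHED_PB_PER_DAY * 365 / 1000
touched_share = touched_eb_per_year / INTERNET_EB_PER_YEAR
twenty_eb_share = 20.0 / INTERNET_EB_PER_YEAR

print(f"touched per year: {touched_eb_per_year:.1f} EB")
print(f"share of all Internet traffic: {touched_share:.1%}")
print(f"20 EB as a share of annual traffic: {twenty_eb_share:.1%}")
```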

Even if it were capable of doing so, NSA would have no interest in copying everything transmitted through the Internet. It is estimated, for example, that various forms of video (movies, TV shows, YouTube) constitute roughly 85% of all Internet traffic. No intelligence agency needs to store a separate, complete copy of Batman Returns every time someone streams the movie to her home. Music and online gaming, which are also responsible for large data flows, would also be of little interest.

Similarly, although NSA might want to record the web-browsing histories of individuals, it would not need to record a separate, complete copy of the front page of The New York Times every time one of that paper’s millions of readers downloaded it. One copy of each webpage, updated whenever the page was changed, would be enough to record its content. Metadata documenting the webpages visited is all that would be needed for most individual files.

The main data that the NSA would be interested in recording is original, user-generated content. Authoritative numbers for such traffic are hard to come by, but it is safe to conclude that original voice, text, chat, e-mail, and document traffic comprise only a small proportion of Internet data flows.

Cisco Systems estimates that global consumer Voice Over IP (VOIP) traffic, for example, currently accounts for approximately 159 petabytes per month [Cisco Visual Networking Index: Forecast and Methodology, 2011–2016, Cisco White Paper, 30 May 2012], which would total about 1.9 exabytes per year if it could all be accessed.

E-mail volumes are more difficult to estimate, but an indication of the scale of this traffic can be found in a recent report by the market research company Radicati Group, which estimated that 183 billion e-mails are sent every day by the world’s 2.4 billion e-mail users. Since this amounts to 76 e-mails per user per day, this total must include spam and other multiple-recipient traffic. If we generously assume that 10% of e-mails are original texts written by the sender (Symantec recently estimated that approximately 70% of e-mail is spam, and mailing lists and other forms of duplicated mail must account for a substantial part of the remainder), the maximum number of e-mails that NSA might wish to store, assuming it could access them all, would probably be no more than 7 trillion per year. If the average size of these e-mails, excluding attachments, is approximately 75 kilobytes (Microsoft implied in this graphic that the 150 petabytes of stored e-mail that it migrated from Hotmail to Outlook.com in 2013 was equivalent to about 2.2 trillion e-mails), the total volume it might wish to store would be something like 0.5 exabytes, assuming no further compression of the data were possible. Storing original attachments as well would of course raise this number substantially, perhaps adding several exabytes to the total.
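The e-mail estimate chains several rough assumptions together, so it may help to see them side by side. In this sketch the 10% share of original mail and the 75-kilobyte average size are the assumptions made in the text above, not measured values, and a kilobyte is taken as 1,000 bytes.

```python
EMAILS_PER_DAY = 183e9        # Radicati Group estimate
EMAIL_USERS = 2.4e9           # Radicati Group estimate
ORIGINAL_SHARE = 0.10         # assumed share of original, sender-written mail
AVG_EMAIL_BYTES = 75e3        # assumed average size, excluding attachments
BYTES_PER_EB = 1e18

emails_per_user_per_day = EMAILS_PER_DAY / EMAIL_USERS
original_emails_per_year = EMAILS_PER_DAY * ORIGINAL_SHARE * 365
email_volume_eb = original_emails_per_year * AVG_EMAIL_BYTES / BYTES_PER_EB

print(f"e-mails per user per day: {emails_per_user_per_day:.0f}")
print(f"original e-mails per year: {original_emails_per_year / 1e12:.1f} trillion")
print(f"storage required: {email_volume_eb:.2f} EB")
```

Because the result scales linearly with both assumptions, halving or doubling either one halves or doubles the total.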

Tweets and other forms of texting would take much less space. More than 400 million tweets are sent every day. But storing a tweet probably takes less than one kilobyte, even when associated metadata is included, so a year’s worth of the world’s tweets would probably take less than 150 terabytes to store. Chat and instant messaging texts presumably pose similar storage requirements.
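The tweet figure is an upper bound, and the arithmetic is short. This sketch assumes exactly one kilobyte (1,000 bytes) per stored tweet, which the text treats as a ceiling.

```python
TWEETS_PER_DAY = 400e6       # "more than 400 million" per the text
BYTES_PER_TWEET = 1000       # assumed ceiling, metadata included
BYTES_PER_TB = 1e12

tweets_tb_per_year = TWEETS_PER_DAY * 365 * BYTES_PER_TWEET / BYTES_PER_TB
print(f"a year of tweets: about {tweets_tb_per_year:.0f} TB")
```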

These statistics suggest that the 10 or so exabytes of Internet traffic currently being processed by the NSA may contain a large proportion of the original, user-generated content being transmitted over the part of the Internet to which NSA and its allies have access. At the moment, the vast majority of that data is reportedly deleted after three days, presumably for lack of space to store it. Completion of the UDC will give the NSA the option to save most of that data indefinitely.

The Internet is not the only source of data that the NSA might wish to save, of course. Non-Internet telephony intercepts, to name perhaps the most important example, might also run into the exabyte range. An indication of the scale of this traffic can be gleaned from the report How Much Information?, which estimated that the total data content of all U.S. domestic phone calls (voice only) in 2008 was 1.36 exabytes. This figure counted each conversation twice, however: once for each participant (see footnote 45 on page 36 of the report). Storing each phone call once—a total of 39 billion hours of phone conversation—would thus take only 680 petabytes. Furthermore, if all of the calls were stored with the quality and compression of cell phone speech, it might be possible to reduce this number to approximately 180 petabytes.

It does not appear likely that the NSA is currently storing U.S. domestic phone calls. (Only the storage of phone call metadata has been confirmed.) But these statistics give a sense of the scale of the storage task posed by the world’s telephone communications. Assuming that the global per capita use of voice telephony is no greater than that of the United States, the storage space required to save all the phone calls made by everyone in the world could be as little as 3.6 exabytes per year.

The Swedish telecommunications company Ericsson recently estimated that worldwide mobile voice traffic comprises about 180 petabytes per month, or roughly 2.2 exabytes per year. Addition of landline traffic (not estimated by Ericsson) would make this number significantly larger, but it probably wouldn’t double it, suggesting that a 3.6-exabyte estimate for all non-Internet voice traffic may well be in the right ballpark.
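The telephony figures can be cross-checked against one another. In the sketch below, the world-to-U.S. scaling factor is not stated in the text; it is simply the ratio implied by the 3.6-exabyte world estimate and the 180-petabyte U.S. figure, and should be read as an inference. The variable names are illustrative.

```python
US_CALLS_2008_EB_DOUBLE_COUNTED = 1.36   # from "How Much Information?"
US_CALLS_CELL_QUALITY_PB = 180.0         # with cell phone quality compression
WORLD_ESTIMATE_EB = 3.6                  # the text's estimate for all calls
ERICSSON_MOBILE_PB_PER_MONTH = 180.0     # worldwide mobile voice traffic

# Each conversation was counted once per participant, so halve the total.
us_calls_once_pb = US_CALLS_2008_EB_DOUBLE_COUNTED * 1000 / 2

# Ratio implied by the text's world estimate (an inference, not a quoted figure).
implied_world_to_us_ratio = WORLD_ESTIMATE_EB * 1000 / US_CALLS_CELL_QUALITY_PB

# Ericsson's monthly mobile voice figure, annualised.
ericsson_mobile_eb_per_year = ERICSSON_MOBILE_PB_PER_MONTH * 12 / 1000

print(f"U.S. calls stored once: {us_calls_once_pb:.0f} PB")
print(f"implied world-to-U.S. ratio: {implied_world_to_us_ratio:.0f}")
print(f"Ericsson mobile voice: {ericsson_mobile_eb_per_year:.2f} EB per year")
```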

These examples are not meant to be exhaustive. Telephony and Internet-based communications are not the only forms of data/metadata that NSA might wish to target for acquisition and storage. But these examples do demonstrate that the data streams generated by human communications and online activities are not impossibly large: it is becoming feasible for spy agencies with big budgets to store much of the voice and text data that humans generate.

With an initial capacity probably in the 7- to 10-exabyte range, the Utah Data Center is unlikely to be large enough to store even a single year’s worth of the Internet data, telephony, and other information currently being processed (and mostly discarded) by NSA. But it may well have the room to store the material that NSA deems to be of potential current or future interest.

And within just a few years, if Moore’s Law holds, the UDC’s storage capacity will be large enough—and be growing fast enough—to accommodate the entire ten or more exabytes of Internet data acquired every year and all of the additional telephony, purloined computer files, and other data that the NSA obtains and might want to save. At that point the most significant limitation on the amount of data being stored at the UDC may well lie in the NSA’s ability to access the data it seeks and the capacity of its communications circuits to haul it all back to Utah from the various intercept points around the world.

[Update 11:00 pm: Corrected two instances where I wrote "terabytes" instead of "petabytes". Thanks for catching that, Brewster!]