The Future of Networking

On a twisting road that runs behind the main campus of Johns Hopkins University, there's a building that contains the entire known universe. Or at least pictures of it.

The building is the Space Telescope Science Institute, home to the team that has run the Hubble Space Telescope for years. This space will house the operations for the James Webb Space Telescope when it launches in 2018. Currently, it also stores the Mikulski Archive for Space Telescopes (MAST). And MAST is more than just the universe's photo album; it's becoming the archive of record for nearly all astronomy imagery and data.

About 30 miles southwest of STSCI's Baltimore locale, another federally funded research effort is grappling with a different sort of big data. The Office of Information Technology at the National Institutes of Health is working to implement changes to the Institutes' networks and infrastructure. The hope is to allow researchers inside and outside its Bethesda campus to make better use of the agency's high-performance computing resources, massive databases of genomes, and other research data.

Both STSCI and NIH, despite vast differences in focus, are representative of the shifting demands of computational research. Today, organizations must move beyond the silos that have prevented collaborating researchers from getting to data—and knowledge—that could drive new science. Spaces like these have become hubs of collaboration for scientists in their respective fields, and they face ever-growing demands both for raw information and for ways to turn that data into knowledge. The need is immediate: these hosts must provide computing power on demand for large-scale analysis of existing data while opening up resources to allow more open research beyond their institutions' walls.

This is driving an evolution in how these organizations build their networks and computing infrastructure, pushing them toward something that looks a lot like the infrastructure of Google and other Web giants. It's also requiring bigger investments in high-speed networks, both internal and external.

The challenges of really big research data

There are some fundamental obstacles that STSCI's MAST and NIH's research networks both face. First is the scale of the data. At STSCI, the big data is petabytes of imagery and sensor data from NASA's space telescopes and other astronomy missions, as well as data from ground-based astronomy. At NIH, the "big data" that is most often in play is genomic data—a single individual's genome is three billion base pairs of data. Hundreds of thousands of genomes are being analyzed at a time by researchers hoping to find patterns in genes related to cancer or other illnesses.
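
To put that in perspective, here's a back-of-the-envelope sizing sketch in Python (my own arithmetic, with an assumed cohort size; these are not NIH figures). Real sequencing output, with its read redundancy and quality scores, runs considerably larger per genome than this raw count suggests.

```python
# Back-of-the-envelope estimate of raw genomic data volume.
# Assumptions (not NIH figures): 3 billion base pairs per genome,
# one byte per base uncompressed, and a hypothetical 100,000-genome study.
BASE_PAIRS_PER_GENOME = 3_000_000_000
BYTES_PER_BASE = 1           # uncompressed text; ~2 bits/base if bit-packed
GENOMES_IN_STUDY = 100_000   # hypothetical cohort size

per_genome_gb = BASE_PAIRS_PER_GENOME * BYTES_PER_BASE / 1e9
study_pb = per_genome_gb * GENOMES_IN_STUDY / 1e6

print(f"~{per_genome_gb:.0f} GB per genome (uncompressed)")
print(f"~{study_pb:.1f} PB for a {GENOMES_IN_STUDY:,} genome study")
```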

With all that data comes a big—and growing—demand for access. MAST currently serves up, on average, between 14 and 18 terabytes' worth of downloads per month to scientists through its various applications. (About half of that is from the Hubble Space Telescope.) Much of this needs to be transformed from raw data into processed imagery before it's delivered, based on calibration data for the telescopes that collected it. Similarly, the demands at NIH aren't necessarily for raw genomic information but for analysis data derived from it. That requires access to high-performance computing resources that researchers themselves may not have.
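
To translate that monthly volume into sustained network load, a quick bit of arithmetic (mine, not STSCI's) shows why this is as much a networking problem as a storage one; real request patterns are far burstier than a monthly average, which is part of why the institutions are investing in much faster links than these numbers alone would suggest.

```python
# Convert MAST's quoted monthly download volume into average sustained bandwidth.
# The 14-18 TB/month figures come from the article; the rest is arithmetic.
SECONDS_PER_MONTH = 30 * 24 * 3600

for tb_per_month in (14, 18):
    bits = tb_per_month * 1e12 * 8          # terabytes -> bits
    mbps = bits / SECONDS_PER_MONTH / 1e6   # average megabits per second
    print(f"{tb_per_month} TB/month is roughly {mbps:.0f} Mbps sustained")
```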

This is a challenge to the openness of research built on these massive data stores. Most of the work served by both STSCI and NIH is done by networks of researchers who are as likely to be across the street from the institutions as on the other side of the globe. These researchers need not only access to data and computing power but also ways to collaborate around projects. That has led both institutions toward providing an increasing number of collaborative tools on top of their mission-specific applications.

From “human jukebox” to petabytes on disk

STSCI has been serving up images from the Hubble Telescope since before there was a World Wide Web. I remember fetching some of the earliest public Hubble images, of the Shoemaker-Levy 9 comet colliding with Jupiter, in 1994 using Gopher. But back then, aside from a small number of low-resolution images posted via Gopher and FTP, the majority of Hubble's data was stored on optical disk. "We started writing to 12-inch LMSI optical platters back in 1992," said Karen Levay, STSCI's Archive Sciences Branch chief. "Then we switched to (12-inch) 6.5 gigabyte optical platters by Sony—we had four jukeboxes of those online and another collection offline." For everything that didn't fit into the jukeboxes, access meant a technician fetching the right platter from the library. "We referred to it as the 'human jukebox,'" she said. It wasn't until 2003 that STSCI moved to keeping all its imagery on spinning disks, using optical disks only for backup.

Red Hat and Windows on Dell and other servers power most of STSCI's MAST archive.

Sean Gallagher

Today, STSCI's archives have "crested the petabyte level in terms of capacity," said Gretchen Green, STSCI's chief engineer for Data Management Systems. That number, she told Ars, doesn't even include the Panoramic Survey Telescope and Rapid Response System (Pan-STARRS) data that STSCI is preparing to take on. This ground-based telescope imagery (being used in the search for asteroids and comets) will add another two to three petabytes of imagery data to the archive by itself. Adding to the storage demands, MAST has become the archive of record for many previous NASA astronomy missions, such as the Extreme Ultraviolet Explorer (EUVE) and the Swift gamma-ray and X-ray space telescope. It's also helping with other space agencies' observation data, including the European Space Agency's XMM-Newton X-ray telescope.

While Levay says that the archive currently handles between 14 and 18 terabytes of downloads a month across all its sources, a significant portion of that has to be transformed before it can be delivered. (That doesn't include requests from scientists who want their data mailed to them on CD-ROM—still a frequently used option.)

The CD-ROM burner, still postal-netting astronomy data after all these years.

Sean Gallagher

As the MAST archive has expanded, it has required other sorts of technology shifts beyond an improvement in storage. "We've had four different incarnations of hardware," said Ron Russell, STSCI's Principal Technologist. "We had VMS, then Ultrix, then Sun Solaris 7, 8, 9, and 10."

That last hardware generation was installed in 2003. A hardware refresh last year retired Solaris in favor of Red Hat Linux and Windows Server, and the group ditched Sybase in favor of Microsoft SQL Server—mostly for the economics of it, Russell said. There's also a smattering of MySQL and PostgreSQL databases for various projects at STSCI, and one application uses an Xgrid of Apple Xserve servers for image processing.

An Apple Xgrid serving up image-processing cycles in one of STSCI's computer rooms.

Storage has been a major focus, for a very simple reason: the data sets aren't getting any smaller. While the stored Hubble data grows by about one terabyte per month on average, new projects are bringing in data that will make Hubble's archives seem small in comparison.

Not all of that new data is imagery. The Galaxy Evolution Explorer, an all-sky ultraviolet survey satellite mission launched in 2003, will soon have its data added to the MAST archive. This includes a database from the spacecraft's photon detectors that records the time, vector, and energy level detail of every photon intercepted by the sensor. "They're now gathering the photon list—the project ends in April," said Green. "It's a 100-terabyte database."
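
The article doesn't describe how that photon list is laid out, but a hypothetical fixed-size event record (the field names and sizes below are my assumptions, not GALEX's actual schema) gives a feel for why a per-photon database reaches 100 terabytes.

```python
# Hypothetical photon-event record: one row per detected photon with a
# timestamp, a direction vector, and an energy value. Sizes are illustrative
# only and do not come from the GALEX project.
import struct

# time (8-byte double), x/y direction components and energy (4-byte floats each)
PHOTON_RECORD = struct.Struct("<d f f f")   # 20 bytes per photon

photons_in_100_tb = 100e12 / PHOTON_RECORD.size
print(f"{PHOTON_RECORD.size} bytes/photon -> ~{photons_in_100_tb:.1e} photons in 100 TB")
```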

And then there's the Webb Telescope, which will be managed by STSCI. It will have a mirror six times the size of Hubble's. The Webb will collect massive quantities of new data, and it will be able to detect galaxies so distant that they appear as they were when they first formed in the early post-"big bang" universe. That research is going to mean potentially petabytes more of imagery for researchers to process and add to the archive.

27 Reader Comments

Question: why can't a distributed computing model like, say, Folding@home be applied to data storage (trusted sources participating, of course), i.e. several different locations collaborating on the storage of the data?

In that situation I would imagine you would require a whole lot of data duplication for safety reasons. If you stripe the data (à la RAID 0), one person's PS3 crapping out can cripple the whole data set.

And that's why Sony and Microsoft are pushing so hard for always-online game consoles. You just can't have some ding-dong holding up major scientific institutions just because he wants to take his games with him on vacation.

Additionally, what happens when there's an outage in an area with an unusually high concentration of storage users? Half the data becomes inaccessible because there are 50 key data stores offline.

Consumer-grade bandwidth is another issue, especially if those people are actually using their Internet connections in any meaningful way.

How do you choose "trusted sources"? How much time and expense do you spend doing that vs. just adding drives to your existing data store?

Finally, who would offer to give up tens of GBs of precious HDD space indefinitely, and offer up half (or more) of their 'net connection for free?

Distributed computing works because it requires none of these: very little 'net access or storage space is needed (even if the data set is a few MBs, most of these projects will download one unit, then work on it for a few hours or days), trusted sources aren't required (though it's still a good idea to verify by having each work unit done by two different users' computers), and it doesn't consume computing resources the user would otherwise be using. The clients are designed to be able to pause their work at a moment's notice, and the larger system can gracefully handle large delays in responses from the workers.

Distributed storage is almost diametrically opposed to that: lots of 'net access, lots of storage space required, trusted sources are helpful (but can be worked around via encryption, at the cost of slowing data access), consuming computing resources that the user would like to have (HDD space, Internet bandwidth), barely able to pause work quickly, and intolerant of large (and varying) response times.

Once Google Fiber (tm) has rolled out to the whole of the US, and HDD prices drop to the "pennies per TB" range, this becomes a possibility. Until then, though...

Storage can use a distributed computing model; many companies already use it (Google, Facebook, etc.) and offer it to customers as "cloud" storage. The data is replicated at several different locations (server farms) that are linked with high-speed connections.

The major issue with distributing data storage is latency (time delay experienced in the system). If you rely on distributed data your ability to access that data is then limited by the infrastructure between the user and the data.

It's analogous to loading a file on your own computer: if you load it from your hard drive, it will load with almost no latency; however, try to load the same file from distributed storage and you will wait longer (while it downloads) before it can load.

A factual correction: VOSpace was not developed initially by the Canadian Astronomy Data Center. It started as a SOAP-based interface offering the same functionality, developed jointly by Caltech, JHU, and the now-defunct UK Astrogrid project about seven years ago. It was then updated to include a REST-based interface, reflecting the changing web services landscape; CADC was a contributing partner to this effort.

Thanks, Matthew. I'll correct that. I had been digging into VOSpace and hadn't found the original authorship, though I knew it was originally based on SOAP.

I'm excited by some of the new open source software infrastructure coming out for big science computing. Lustre isn't new, but it has grown in leaps and bounds. And I think Michael Stonebraker's SciDB is really exciting -- it's the first compelling scientific unified distributed storage and computation system I've seen.

I wonder if they use ZFS, or some similar file store, to protect against and detect bitrot? 'Cause with that much data, standard RAID storage doesn't cut it.

I expect they are using Hadoop or similar. If your data gets that large, it seems like there are three options:
1. Hadoop
2. Buy a big pile of SAN
3. Design your own distributed architecture into the software layer

In this case, #3 will be very hard given that the data is accessed in all kinds of ways. That might be where they're at right now, though.

If your article is going to mention genomics, NIH, and the movement of massive datasets, you may want to mention the technology they are using that allows for high-speed bulk data transfer across any type of network connection.

Take a look at the high-speed download option from Aspera. The software can transfer data at up to 10Gbps in a single transfer session if the infrastructure allows for it. The underlying protocol is called FASP.

They [STScI] are using some Hadoop. But they don't really have the compute power right now. Most of their compute is focused on image pipelining. The main data stores are indexed by database, with some web products and other low-res content stored as BLOBs. A good deal of data is kept offline and not accessed unless it's requested. They keep the most commonly requested content in an online cache.

Remember that they don't need sub-second response like Facebook. And if they find bitrot on a hash check of files, they can restore from electro-optical.
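
That kind of hash check can be as simple as recomputing a digest for each file and comparing it against a stored manifest. Here's a minimal sketch of the general technique (an illustration, not STSCI's actual verification tooling; the manifest file name is made up):

```python
# Minimal bitrot check: compare each file's current SHA-256 digest against a
# stored known-good value and flag mismatches for restore from backup media.
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_bitrot(manifest_file: str) -> list[str]:
    """manifest_file is a JSON map of file paths to known-good digests."""
    manifest = json.loads(Path(manifest_file).read_text())
    return [p for p, digest in manifest.items()
            if sha256_of(Path(p)) != digest]

if __name__ == "__main__":
    for damaged in find_bitrot("archive_manifest.json"):  # hypothetical manifest
        print(f"checksum mismatch, restore from backup: {damaged}")
```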

The typical procedure for requesting data from MAST is to go to their website, search for data of interest, and submit a list of datasets you want. Then you wait anywhere from a few hours to a day or two for your data to be processed and staged. Once it's ready you can download it over anonymous FTP or have it uploaded to your own servers over anonymous FTP.
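
Once a staged dataset is ready, the download step is easy to script. Here's a rough sketch using Python's standard ftplib (the host and paths are placeholders I've made up, not MAST's real layout):

```python
# Sketch of pulling staged files over anonymous FTP with the standard library.
# The host, directory, and file names are placeholders -- use the ones from
# the archive's staging notification.
from ftplib import FTP

def fetch_staged(host: str, remote_dir: str, filenames: list[str]) -> None:
    with FTP(host) as ftp:
        ftp.login()                  # anonymous login
        ftp.cwd(remote_dir)
        for name in filenames:
            with open(name, "wb") as out:
                ftp.retrbinary(f"RETR {name}", out.write)
            print(f"downloaded {name}")

# Example (hypothetical host and staging path):
# fetch_staged("archive.example.edu", "/stage/anonymous/12345", ["obs001.fits"])
```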

I was wondering the same thing [about bitrot], and also, are they working on using long-term storage drives (like the one Hitachi announced that can store data for millennia: http://gizmodo.com/5946110/this-piece-o ... ta-forever)? Data from something like astronomical observations or the genome doesn't change much over time; it just gets added to. Format interoperability would be an issue for future generations, but other than that I see no reason why future generations wouldn't want to mine some of this data too.

I no longer know about Virtual Observatory nuts and bolts -- we were playing at providing the REST endpoints for our own archives, which at the time didn't need much more than a modest iSCSI SAN.

I do know that the European Southern Observatory (ESO) created a clever meta-system out of cheap white-box Linux disk servers; the archive manages data-redundancy and moves things around in an attempt to avoid bit-rot, flags disks that host data that fails checksums and so on. I suppose that it is a "file-level" recapitulation of ZFS (where ZFS mostly works with sub-file, block allocation units).

ESO's archive grew out of yet another sneaker-net: this time, a data pipe between the top of Paranal in Chile and the primary data center outside Munich, built on top of FedEx.

(I've used ZFS every day for at least the past five years. I don't know for sure if anyone uses it for astronomical data archive implementations. Probably considered an implementation detail.)

Hmm, I wonder: as these databases get larger and larger, and as huge specialized SANs are developed to get the computations closer to the storage, the latency problems caused by many different programs trying to access the same database at the same time might become problematic enough to justify some kind of "copipelining" (not sure exactly what to call it). Analysis runs written by different researchers, but which are all amenable to sequential streaming of chunks of the dataset, could be merged to "pool" the streaming overhead. That is, the streamed-in data, as it comes off the media, would be distributed over a really high-speed local bus to the different analysis programs, which analyze it as it streams by, so each analysis doesn't have to "pay" for a unique streaming of the data.
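
For what it's worth, the core of that idea is simple to sketch: read each chunk of the dataset once and fan it out to several analyses so they share a single pass over storage. A toy illustration (assuming in-memory chunks rather than a real SAN or high-speed bus):

```python
# Toy "copipelining": stream each chunk of a large dataset once and hand it to
# every registered analysis, so N analyses share one pass over the storage
# instead of each paying for its own full read.
from typing import Callable, Iterable

def shared_stream(chunks: Iterable[bytes],
                  analyses: list[Callable[[bytes], None]]) -> None:
    for chunk in chunks:          # one sequential pass over the data
        for analyze in analyses:  # fan the chunk out to every analysis
            analyze(chunk)

# Two trivial "analyses" sharing one pass over fake data chunks:
totals = {"bytes": 0, "chunks": 0}

def count_bytes(chunk: bytes) -> None:
    totals["bytes"] += len(chunk)

def count_chunks(_chunk: bytes) -> None:
    totals["chunks"] += 1

shared_stream((b"x" * 1024 for _ in range(8)), [count_bytes, count_chunks])
print(totals)   # {'bytes': 8192, 'chunks': 8}
```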

Nice to see my workplace written up on Ars! Sean, let me know next time you're around the JHU campus or STScI.

Maybe sometime I can show you around the optics lab we've just built, down the other end of the hallway from MAST, to work on next generation mirror control technologies for exoplanet imaging. (Among other things. But direct imaging of exo-Earths is the main driver for where we'd like to get over the next 10-20 years...)

VOSpace is a pretty great technology - there's a FUSE filesystem front end that makes it work like Dropbox, but at a scale better suited for this kind of collaboration. We're using a couple of terabytes as team collaborative space for the Gemini Planet Imager project, with all the instrument test data being written there from the Lab for AO at UC Santa Cruz so that it becomes available right away to our intercontinentally spread-out team. No way we could afford that much space at Dropbox or equivalent rates. Amazon S3 wouldn't be so bad, but you'd have to run your own front end, handle user authentication for the hosted file system, etc., while VOSpace basically does all that infrastructure stuff for us so we astronomers can concentrate more on science.

Now if only we could get our Canadian colleagues to put in a better internet pipe to Victoria so there's not so much damn lag to their island server farm... Storage and bandwidth are like telescopes - you always want bigger!

It actually is a grandfathered halon system. The STScI building went up in the early 80s, with what was a state-of-the-art computing setup back then. Still using the same server rooms today. Admittedly with mostly different gear inside :-)

Sean Gallagher / Sean is Ars Technica's IT Editor. A former Navy officer, systems administrator, and network systems integrator with 20 years of IT journalism experience, he lives and works in Baltimore, Maryland.