Not research data as such (although it could be the subject of research), but a long and interesting blog post about how Tumblr manages huge amounts of user-generated data. It’s interesting not just because of the scale of the day-to-day task, but also because it offers some lessons learned about scaling up to an ingest of several terabytes a day. When we talk about ‘big data’ in the sciences, is it this big? Bigger? How is big science actually managing data on this scale? I really don’t know.

* 500 million page views a day
* 15B+ page views a month
* ~20 engineers
* Peak rate of ~40k requests per second
* 1+ TB/day into the Hadoop cluster
* Many TB/day into MySQL/HBase/Redis/Memcache
* Growing at 30% a month
* ~1000 hardware nodes in production
* Billions of page visits per month per engineer

Posts are about 50GB a day. Follower list updates are about 2.7TB a day.

The Dashboard runs at a million writes a second and 50K reads a second, and it is growing.
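As a quick sanity check on the Tumblr figures above, the daily and monthly numbers are consistent with each other (decimal units, 30-day month — this is just back-of-envelope arithmetic on the quoted stats, not new data):

```python
# Sanity-checking the quoted Tumblr numbers (decimal units, 30-day month).
page_views_per_day = 500e6
page_views_per_month = page_views_per_day * 30  # matches the quoted "15B+ a month"

# Average page-view rate implied by the daily figure. The quoted ~40k req/s is a
# peak request rate, and one page view triggers multiple requests, so the
# average *view* rate being much lower than 40k/s is consistent.
avg_views_per_second = page_views_per_day / 86_400

print(f"{page_views_per_month / 1e9:.0f}B views/month")      # 15B views/month
print(f"{avg_views_per_second / 1e3:.1f}k views/s average")  # 5.8k views/s average
```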

The (JISC-funded) project “Managing Research Data: Gravitational Waves” was nominally about gravitational-wave data, but used that as a route into talking about ‘big science’ data in general. The project URL is , and its report is at .

Section 1.2 of the report gives some ‘big data’ numbers.

The scale for this is, I think, set by the ATLAS experiment (one of the two big ones, out of four, at the LHC). That preserves what I now think of as ‘1 LHC’, namely 10 PB/yr. That’s in the region of 20-30 TB/day, in a mixture of bulk data and RDBMS data (I don’t know the mix), though the peak rates will be higher. The current LIGO experiment (gravitational waves) stores about 1 PB/yr when running (it’s having a refit at present).

The SKA radio observatory will, everyone hopes, be commissioned around 2020, and will require transporting, though not necessarily storing, about 1Tb/s locally and about 100Gb/s intercontinentally. That’s 0.5 EB/yr, and 0.05% of the predicted 1ZB/yr worldwide IP traffic for 2015.
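The unit conversions above are easy to check. A short sketch, using decimal prefixes (1 PB = 10^15 B) and a 365-day year; the quoted 0.5 EB/yr and 0.05% figures come out of rounding the sustained link rate up slightly:

```python
# Back-of-envelope conversions for the experiment data rates quoted above.
SECONDS_PER_YEAR = 365 * 86_400

# '1 LHC' = 10 PB/yr, as preserved by ATLAS.
lhc_bytes_per_year = 10e15
lhc_tb_per_day = lhc_bytes_per_year / 365 / 1e12
print(f"10 PB/yr ≈ {lhc_tb_per_day:.0f} TB/day")  # ≈ 27 TB/day, in the 20-30 range

# SKA intercontinental link: ~100 Gb/s sustained (bits, not bytes).
ska_bytes_per_year = 100e9 / 8 * SECONDS_PER_YEAR
print(f"100 Gb/s ≈ {ska_bytes_per_year / 1e18:.2f} EB/yr")  # ≈ 0.39 EB/yr

# ...as a fraction of the predicted 1 ZB/yr of worldwide IP traffic for 2015.
fraction = ska_bytes_per_year / 1e21
print(f"≈ {fraction * 100:.2f}% of 1 ZB/yr")  # ≈ 0.04%, i.e. roughly 0.05%
```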

So those are respectable data volumes.

That bulk experiment data is pretty rich (lots of metadata, though the metadata will be dwarfed by the size of the bulk data). In contrast, astronomical object databases — which are relational, and carry substantially more information per byte — come in at around the 1-10 TB scale, though those are carefully curated, highly reduced datasets.

There are some more details, and discussion of the consequences of all this, in the project report, which might be interesting to read.

It occurs to me to add that this volume of data is typically _not_ stored at an institution.

* CERN is the single Tier-0 site, with copies of all the data on both spinning disks and tapes.
* There are 11 Tier-1 sites around the world, all of which (I think) hold a copy of all of the data. The Rutherford Lab is the Tier-1 for the UK.
* There are multiple Tier-2 sites associated with each Tier-1, which hold shifting fractions of the data. Glasgow Uni is the Tier-2 for Scotland.
* Individual institutions and departments are ‘Tier-3’.
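The tier structure above can be sketched as a small data model. The Tier-2 fraction here is an illustrative guess, not a real LCG number, and the site lists are deliberately incomplete:

```python
# Toy model of the LHC tiered replication scheme described above.
# Tier-0 and Tier-1 sites hold full copies; Tier-2 sites hold shifting
# fractions (the 0.1 below is an illustrative guess, not an LCG figure).
tiers = {
    "Tier-0": {"example_sites": ["CERN"], "fraction_of_data": 1.0},   # disk + tape
    "Tier-1": {"example_sites": ["Rutherford Lab (UK)"], "fraction_of_data": 1.0},
    "Tier-2": {"example_sites": ["Glasgow (Scotland)"], "fraction_of_data": 0.1},
}

ONE_LHC_PB_PER_YEAR = 10  # '1 LHC' of data

for tier, info in tiers.items():
    held = ONE_LHC_PB_PER_YEAR * info["fraction_of_data"]
    print(f"{tier}: ~{held:.0f} PB/yr per site")
```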

The data management was designed and is run by the ‘LCG’ — the LHC Computing Grid: http://lcg.web.cern.ch/ — as a development group with roughly equal status to the detector and accelerator engineering groups.

At UH, big = Physics, Astronomy and Maths research. They have a 90-core HPC cluster with 200 TB of storage, which is nearly full. Nothing compared to Tumblr, but enough that I’d never sleep if I were responsible for it.