Sunday, May 26, 2013

Under the auspices of the Digital Preservation Coalition, Maureen Pennock has written a very comprehensive overview of Web Archiving. It is an excellent introduction to the field, and has a lot of useful references.

the best DNA storage can do with those dimensions [a gram of dry DNA] is 5.6*10^15 bits.

A Bekenstein-bound storage device with those dimensions would store about 1.6*10^38 bits.

So, there is about a factor of 3*10^22 in bits/gram beyond DNA. He also compares the Bekenstein limit with Stanford's electronic quantum holography, which stored 35 bits per electron. A Bekenstein-limit device the size of an electron would store 6.6*10^7 bits, so there's plenty of headroom there too. How reliable storage media this dense would be, and what their I/O bandwidth would be, are open questions, especially since the limit describes the information density of a black hole.
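The bound itself is simple enough to sketch. The figures below are my own illustrative assumptions (DNA density of ~1.7 g/cm^3, hence a sphere of radius ~0.52 cm for a gram of dry DNA), not numbers from the original comparison:

```python
import math

# Bekenstein bound: I <= 2*pi*R*E / (hbar * c * ln 2) bits,
# with E = m*c^2 for a mass m enclosed within radius R.
HBAR = 1.0546e-34  # reduced Planck constant, J*s
C = 2.998e8        # speed of light, m/s

def bekenstein_bits(mass_kg, radius_m):
    """Upper bound on bits storable in a sphere of given mass and radius."""
    energy = mass_kg * C**2
    return 2 * math.pi * radius_m * energy / (HBAR * C * math.log(2))

# Assumed figures: 1 g of dry DNA at ~1.7 g/cm^3 occupies ~0.59 cm^3,
# i.e. roughly a sphere of radius 0.52 cm.
bits = bekenstein_bits(1e-3, 5.2e-3)
print(f"{bits:.1e}")  # on the order of 10^38 bits
```

With those assumed dimensions the bound comes out around 1.3*10^38 bits, the same order of magnitude as the figure quoted above.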

Thursday, May 16, 2013

In my various posts over the last six years on A Petabyte For A Century I made the case that the amounts of data being stored, and the time for which they needed to be kept, had reached a scale at which the required reliability was infeasible. I'm surprised that I don't seem to have referred to the parallel case being made in high-performance computing, most notably in a 2009 paper, Toward Exascale Resilience by Franck Cappello et al:

From the current knowledge and observations of existing large systems, it is anticipated that Exascale systems will experience various kind of faults many times per day. It is also anticipated that the current approach for resilience, which relies on automatic or application level checkpoint-restart, will not work because the time for checkpointing and restarting will exceed the mean time to failure of a full system.
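Their argument can be sketched with a back-of-the-envelope model. Using Young's approximation for the optimal checkpoint interval, tau = sqrt(2 * delta * MTTF) where delta is the time to write a checkpoint, and a simplified overhead model (my own rough sketch, with purely illustrative numbers), the fraction of time spent on useful work collapses as delta approaches the system MTTF:

```python
import math

def useful_fraction(checkpoint_secs, mttf_secs):
    """Rough efficiency of checkpoint-restart under Young's approximation.

    Optimal checkpoint interval tau = sqrt(2 * delta * MTTF). Overhead is
    roughly delta/tau (writing checkpoints) plus the expected rework and
    restart cost (tau/2 + delta) amortized over the MTTF.
    """
    tau = math.sqrt(2 * checkpoint_secs * mttf_secs)
    overhead = checkpoint_secs / tau + (tau / 2 + checkpoint_secs) / mttf_secs
    return max(0.0, 1 - overhead)

# Illustrative: a machine with a 1-day MTTF and a 10-minute checkpoint
# still does mostly useful work...
print(useful_fraction(600, 86400))  # ~0.88
# ...but with a 30-minute MTTF and the same checkpoint time, the model
# predicts no useful work gets done at all.
print(useful_fraction(600, 1800))   # 0.0
```

This is exactly the regime Cappello et al warn about: once checkpointing and restarting take a significant fraction of the mean time to failure, the approach stops working.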

Here is a fascinating presentation by Horst Simon of the Lawrence Berkeley Lab, who has bet against the existence of an Exaflop computer before 2020. He points out all sorts of difficulties standing in the way other than reliability, but the key slide is #35, which does include a mention of reliability. This slide makes the same case as Cappello et al on much broader grounds, namely that to get more than an order of magnitude or so beyond our current HPC technology will take a complete re-think of the programming paradigm. Among the features required of the new programming paradigm is a recognition that errors and failures are inevitable and there is no way for the hardware to cover them up. The same is true of storage.

The overall effect is that we’re having a conversation in which issues get hashed over with a cycle time of months or even weeks, not the years characteristic of conventional academic discourse.

Second, the corruption of the reviewing process:

In reality, while many referees do their best, many others have pet peeves and ideological biases that at best greatly delay the publication of important work and at worst make it almost impossible to publish in a refereed journal. ... anything bearing on the business cycle that has even a vaguely Keynesian feel can be counted on to encounter a very hostile reception; this creates some big problems of relevance for proper journal publication under current circumstances.

Third, reproducibility:

Look at one important recent case ...
Alesina/Ardagna on expansionary austerity. Now, as it happens the original A/A paper was circulated through relatively “proper” channels: released as an NBER working paper, then published in a conference volume, which means that it was at least lightly refereed. ... And how did we find out that it was all wrong? First through critiques posted at the Roosevelt Institute, then through detailed analysis of cases by the IMF. The wonkosphere was a much better, much more reliable source of knowledge than the proper academic literature.

We believe the [Elsevier] adds relatively little value to the publishing process. We are not attempting to dismiss what 7,000 people at [Elsevier] do for a living. We are simply observing that if the process really were as complex, costly and value-added as the publishers protest that it is, 40% margins wouldn’t be available.

The world's research and education budgets pay [Elsevier, Springer & Wiley] about $3.2B/yr for management, editorial and distribution services. Over and above that, the world's research and education budgets pay the shareholders of these three companies almost $1.5B for the privilege of reading the results of research (and writing and reviewing) that these budgets already paid for.

What this $4.7B/yr pays for is a system which encourages, and is riddled with, error and malfeasance. If these value-subtracted aspects were taken into account, it would be obvious that the self-interested claims of the publishers as to the value that they add were spurious.

Tuesday, May 7, 2013

"We
certainly have ways in national security investigations to find out
exactly what was said in that conversation. ... No, welcome to America. All of that stuff is being captured as we speak whether we know it or like it or not." and "all digital communications in the past" are
recorded and stored