Wednesday, December 30, 2015

Everyone interested in academic communication should read Amy Fry's magisterial Conventional Wisdom or Faulty Logic? The Recent Literature on Monograph Use and E-book Acquisition. She shows how, by endlessly repeating the conclusion of a single study of a single library now more than 35 years out of date, publishers and their consultants have convinced librarians that traditional collection development has failed and needs to be replaced by patron-driven acquisition. And how, building on this propaganda victory, they moved on to convince librarians that, despite studies showing the opposite, readers across all disciplines preferred e-books to print. Below the fold, some details.

On Tuesday, we announced a major new initiative to bring this vision to reality, supported by a coalition of over 40 of the world’s essential scholarly organizations, such as JSTOR, PLOS, arXiv, HathiTrust, Wiley and HighWire Press, who are linking arms to establish a new paradigm of open collaborative annotation across the world’s knowledge.

Over the past fifteen years, our perspective on tackling information
interoperability problems for web-based scholarship has evolved
significantly. In this opinion piece, we look back at three efforts that
we have been involved in that aptly illustrate this evolution: OAI-PMH,
OAI-ORE, and Memento. Understanding that no interoperability
specification is neutral, we attempt to characterize the perspectives
and technical toolkits that provided the basis for these endeavors. With
that regard, we consider repository-centric and web-centric
interoperability perspectives, and the use of a Linked Data or a
REST/HATEOAS technology stack, respectively. We also lament the lack of
interoperability across nodes that play a role in web-based scholarship,
but end on a constructive note with some ideas regarding a possible
path forward.

They describe their evolution from OAI-PMH, a custom protocol that used the Web simply as a transport for remote procedure calls, to Memento, which uses only the native capabilities of the Web. They end with a profoundly important proposal they call Signposting the Scholarly Web which, if deployed, would be a really big deal in many areas. Some further details are on GitHub, including this somewhat cryptic use case:

Use case like LOCKSS is the need to answer the question: What are all the components of this work that should be preserved? Follow all rel="describedby" and rel="item" links (potentially multiple levels perhaps through describedby and item).

Below the fold I explain what this means, and why it would be a really big deal for preservation.
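The traversal the use case calls for is straightforward to sketch. The code below is a toy illustration, not LOCKSS code: the `LINK_GRAPH` mapping stands in for each resource's parsed HTTP Link header (a real crawler would fetch each URL and read its Link header instead), and all the URLs are made up. It collects the transitive closure of `rel="describedby"` and `rel="item"` links starting from a work's landing page.

```python
# Toy sketch of the Signposting use case above: given a work's landing
# page, find every component of the work that should be preserved by
# following rel="describedby" and rel="item" links, potentially through
# multiple levels.
# LINK_GRAPH stands in for parsed HTTP Link headers (URL -> [(rel, target)]);
# all URLs are hypothetical, for illustration only.

LINK_GRAPH = {
    "https://example.org/work": [
        ("describedby", "https://example.org/work.meta"),
        ("item", "https://example.org/work.pdf"),
    ],
    "https://example.org/work.pdf": [
        ("describedby", "https://example.org/work.pdf.meta"),
    ],
}

def components_to_preserve(start, link_graph):
    """Return the set of URLs reachable from start via describedby/item links."""
    to_visit = [start]
    seen = set()
    while to_visit:
        url = to_visit.pop()
        if url in seen:
            continue
        seen.add(url)
        for rel, target in link_graph.get(url, []):
            if rel in ("describedby", "item"):
                to_visit.append(target)
    return seen

print(sorted(components_to_preserve("https://example.org/work", LINK_GRAPH)))
```

A preservation crawler could use such a closure as its list of URLs to collect, instead of guessing at a work's components from its landing page.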

And yet people still wonder why many people are hesitant to allow any
sort of software update to install. Philips isn't just turning their
product into a walled garden. They're teaching more people that "software
update"="things stop working like they did".

Tuesday, December 15, 2015

When Cliff Lynch found out that I was writing a report for the Mellon Foundation, the Sloan Foundation and IMLS entitled Emulation & Virtualization as Preservation Strategies he asked me to give a talk about it at the Fall CNI meeting, and to debug the talk beforehand by giving it at UC Berkeley iSchool's "Information Access Seminars". The abstract was:

20 years ago, Jeff Rothenberg's seminal Ensuring the Longevity of Digital Documents compared migration and emulation as strategies for digital preservation, strongly favoring emulation. Emulation was already a long-established technology; as Rothenberg wrote, Apple was using it as the basis for their transition from the Motorola 68K to the PowerPC. Despite this, the strategy of almost all digital preservation systems since has been migration. Why was this?

Preservation systems using emulation have recently been deployed for public use by the Internet Archive and the Rhizome Project, and for restricted use by the Olive Archive at Carnegie-Mellon and others. What are the advantages and limitations of current emulation technology, and what are the barriers to more general adoption?

Below the fold, the text of the talk with links to the sources. The demos in the talk were crippled by the saturated hotel network; please click on the linked images below for Smarty, oldweb.today and VisiCalc to experience them for yourself. The Olive demo of TurboTax is not publicly available, but it is greatly to Olive's credit that it worked well even on a heavily-loaded network.

Thursday, December 10, 2015

Four and a half years ago Ian Adams, Ethan Miller and I proposed DAWN, a Durable Array of Wimpy Nodes, pointing out that a suitable system design could exploit the characteristics of solid-state storage to make system costs for archival storage competitive with hard disk despite greater media costs. Since then, the cost differential between flash and hard disks has decreased substantially. Below the fold, an update.

Tuesday, December 8, 2015

For some years now the LOCKSS team has been working with countries to implement National Hosting of electronic resources, including subscription e-journals and e-books. JISC's SafeNet project in the UK is an example. Below the fold I look at the why, what and how of these systems.

Thursday, November 19, 2015

The title is a quote from Coach Junior, who teaches my elder grand-daughter soccer. It comes in handy when, for example, the random team selection results in a young lady being on the opposite team to her best friend. It came to mind when I read Kalev Leetaru's How Much Of The Internet Does The Wayback Machine Really Archive? documenting the idiosyncratic and evolving samples the Internet Archive collects of the Web, and the subsequent discussion on the IIPC mail alias. Below the fold, my take on this discussion.

The publisher will argue that this one-sided agreement, often transferring all possible rights to the publisher, is absolutely necessary in order that the article be published. Despite their better-than-average copyright policy, ACM's claims in this regard are typical. I dissected them here.

The SPARC addendum was written by a lawyer, Michael W. Carroll of Villanova University School of Law, and is intended to be attached to, and thereby modify, the publisher's agreement. It performs a number of functions:

Preserves the author's rights to reproduce, distribute, perform, and display the work for non-commercial purposes.

Acknowledges that the work may already be the subject of non-exclusive copyright grants to the author's institution or a funding agency.

Imposes as a condition of publication that the publisher provide the author with a PDF of the camera-ready version without DRM.

The kicker is the final paragraph, which requests that the publisher return a signed copy of the addendum, and makes it clear that publishing the work in any way indicates assent to the terms of the addendum. This leaves the publisher with only three choices: agree to the terms, refuse to publish the work, or ignore the addendum.

Of course, many publishers will refuse to publish, and many authors at that point will cave in. The SPARC site has useful advice for this case. The more interesting case is the third, where the publisher simply ignores the author's rights as embodied in the addendum. Publishers are not above ignoring the rights of authors, as shown by the history of my article Keeping Bits Safe: How Hard Can It Be?, published both in ACM Queue (correctly with a note that I retained copyright) and in CACM (incorrectly claiming ACM copyright). I posted analysis of ACM's bogus justification of their copyright policy based on this experience. There is more here.

So what will happen if the publisher ignores the author's addendum? They will publish the paper. The author will not get a camera-ready copy without DRM. But the author will make the paper available, and the "kicker" above means they will be on safe legal ground. Not merely did the publisher constructively agree to the terms of the addendum, but they failed to deliver on their side of the deal. So any attempt to haul the author into court, or send takedown notices, would be very risky for the publisher.

Publishers don't need anything except permission to publish. Publishers want the rights beyond this to extract the rents that generate their extraordinary profit margins. Please use the SPARC addendum when you get the chance.

Tuesday, November 10, 2015

Enough has happened while my report on emulation was in the review process that,
although I announced its release last week,
I already have enough material for a follow-up post. Below the fold, the details,
including a really important paper from the recent SOSP workshop.

Thursday, November 5, 2015

Big companies have embraced the cloud more slowly than expected. Some
are holding back because of the cost. Others are wary of entrusting
sensitive data to another firm’s servers. Should companies be doing most
of their computing in the cloud?

It was sponsored by Microsoft, who larded it with typical cloud marketing happy-talk such as:

The Microsoft Cloud creates technology that becomes essential
but invisible, to help you build something amazing. Microsoft Azure
empowers organizations with the creation of innovative apps. Dynamics
CRM helps companies market smarter and more effectively, while Office
365 enables employees to work from virtually anywhere on any device. So
whether you need on-demand scalability, real-time data insights, or
technology to connect your people, the Microsoft Cloud is designed to
empower your business, allowing you to do more and achieve more.

Tuesday, November 3, 2015

I'm very grateful that funding from the Mellon Foundation on behalf of themselves, the Sloan Foundation and IMLS allowed me to spend much of the summer researching and writing a report, Emulation and Virtualization as Preservation Strategies (37-page PDF, CC-By-SA). I submitted a draft last month, it has been peer-reviewed and I have addressed the reviewers' comments. It is also available on the LOCKSS web site.

I'm old enough to know better than to give a talk with live demos. Nevertheless, I'll be presenting the report at CNI's Fall membership meeting in December complete with live demos of a number of emulation frameworks. The TL;DR executive summary of the report is below the fold.

Tuesday, October 27, 2015

On the 17th of last month Amazon, in some regions, cut the Glacier price from 1c/GB/month to 0.7c/GB/month. It had been stable since it was announced in August 2012. As usual with Amazon, they launched at an aggressive and attractive price, and stuck there for a long time. Glacier wasn't under a lot of competitive pressure, so they didn't need to cut the price. Below the fold, I look at how Backblaze changed this.

Wednesday, October 21, 2015

ISO standards are regularly reviewed. In 2017, the OAIS standard ISO14721 will be reviewed. The DPC is spearheading a praiseworthy effort to involve the digital preservation community in the process of providing input to this review, via this Wiki.

I've been critical of OAIS over the years, not so much of the standard itself, but of the way it was frequently mis-used. Its title is Reference Model for an Open Archival Information System (OAIS), but it is often treated as if it were entitled The Definition of Digital Preservation, and used as a way to denigrate digital preservation systems that work in ways the speaker doesn't like by claiming that the offending system "doesn't conform to OAIS". OAIS is a reference model and, as such, defines concepts and terminology. It is the concepts and terminology used to describe a system that can be said to conform to OAIS.

I was therefore asked to inaugurate the DPC's OAIS review Wiki with a post that I entitled The case for a revision of OAIS. My goal was to encourage others to post their thoughts. Please read my post and do so.

Tape has a very credible roadmap out to LTO10 with 48TB/cartridge somewhere around 2022.

Optical's roadmap shows increases from the current 100GB/disk to 200, 300, 500 and 1000GB/disk, but there are no dates on them. At least two of those increases will encounter severe difficulties making the physics work.

The hard disk roadmap shows the slow increase in density that has prevailed for the last 4 years continuing until 2017, when it accelerates to 30%/yr. The idea is that in 2017 Heat Assisted Magnetic Recording (HAMR) will be combined with shingling, and then in 2021 Bit Patterned Media (BPM) will take over, and shortly after be combined with HAMR.

The roadmap for NAND flash is for density to increase in the near term by 2-3X and over the next 6-8 years by 6-8X. This will require significant improvements in processing technology but "processing is a core expertise of the semiconductor industry so success will follow".

The recommendations they submitted are radical but sensible and well-justified by events:

Any vendor of software-defined radio (SDR), wireless, or Wi-Fi radio must make public the full and maintained source code for the device driver and radio firmware in order to maintain FCC compliance. The source code should be in a buildable, change-controlled source code repository on the Internet, available for review and improvement by all.

The vendor must assure that secure update of firmware be working at time of shipment, and that update streams be under ultimate control of the owner of the equipment. Problems with compliance can then be fixed going forward by the person legally responsible for the router being in compliance.

The vendor must supply a continuous stream of source and binary updates that must respond to regulatory transgressions and Common Vulnerability and Exposure reports (CVEs) within 45 days of disclosure, for the warranted lifetime of the product, or until five years after the last customer shipment, whichever is longer.

Failure to comply with these regulations should result in FCC decertification of the existing product and, in severe cases, bar new products from that vendor from being considered for certification.

Additionally, we ask the FCC to review and rescind any rules for anything that conflicts with open source best practices, produce unmaintainable hardware, or cause vendors to believe they must only ship undocumented “binary blobs” of compiled code or use lockdown mechanisms that forbid user patching. This is an ongoing problem for the Internet community committed to best practice change control and error correction on safety-critical systems.

As the submission points out, experience to date shows that vendors of home router equipment are not motivated to, do not have the skills to, and do not, maintain the security of their software. Locking down the vendor's insecure software so it can't be diagnosed or updated is a recipe for even more such disasters. The vendors don't care if their products are used in botnets or steal their customers' credentials. Forcing the vendors to use open source software and to respond in a timely fashion to vulnerability discoveries on pain of decertification is the only way to fix the problems.

Doing so likely violated copyright. Even though The Crossing was not an "orphan work":

in 2009, the year the paper went under, Vaughan began asking for permission—from the [Denver Public] library and from E.W. Scripps, the company that owned the Rocky—to resurrect the series. After four years of back and forth, in 2013, the institutions agreed to let Vaughan bring it back to the web.

Four years, plus another two to do the work. Imagine how long it would have taken had the story actually been orphaned. Vaughan also just missed another copyright problem:

This is the orphan font problem that I've been warning about for the last 6 years. There is a problem with the resurrected site:

It also relied heavily on Flash, once-ubiquitous software that is now all but dead.
“My role was fixing all of the parts of the website that had broken due
to changes in web standards and a change of host,” said [Kevin's son] Sawyer, now a
junior studying electrical engineering and computer science. “The
coolest part of the website was the extra content associated with the
stories... The problem with the website is that all of this content was
accessible to the user via Flash.”

There is a problem with the article. It correctly credits the Internet Archive with its major contribution to Web archiving, and analogizes it to the Library of Alexandria. But it fails to mention any of the other Web archives and, unlike Jill Lepore's New Yorker "Cobweb" article, doesn't draw the lesson from the analogy. Because the Library of Alexandria was by far the largest repository of knowledge in its time, its destruction was a catastrophe. The Internet Archive is by far the largest Web archive, but it is uncomfortably close to several major faults. And backing it up seems to be infeasible.

it doesn't once mention or consider the question of what we are going to do about the billions of orphan works that are being "born digital" every day.

Instead, the Copyright Office proposes to "solve" the orphan works problem with legislation that would impose substantial burdens on users that would only work for one or two works at any given time. And because that system is so onerous, the Report also proposes a separate licensing regime to support so-called "mass digitization," while simultaneously admitting that this regime would not really be appropriate for orphans (because there's no one left to claim the licensing fees). These proposals have been resoundingly criticized for many valid reasons.

She represented the Internet Archive in responding to the report, so she knows whereof she writes about the born-digital user-generated content that documents today's culture:

We are looking down the barrel of a serious crisis in terms of society's ability to access much of the culture that is being produced and shared online today. As many of these born-digital works become separated from their owners, perhaps because users move on to newer and cooler platforms, or because the users never wanted their real identity associated with this stuff in the first place, we will soon have billions upon billions of digital orphans on our hands. If those orphans survive the various indignities that await them ... we are going to need a way to think about digital orphans. They clearly will not need to be digitized so the Copyright Office's mass digitization proposal would not apply.

The born-digital "orphan works" problem is intertwined with the problems posed by the fact that much of this content is dynamic, and its execution depends on other software not generated by the user, which is both copyright and covered by an end-user license agreement, and is not being collected by the national libraries under copyright deposit legislation.

Tuesday, October 13, 2015

Worst of all, our obsession with providing access ultimately results in the loss of access. Librarians created the serials crisis
because they focussed on access instead of control. The Open Access
movement has had limited success because it focusses on access to
articles instead of remaking the economics of academic careers. Last
week Proquest announced it had gobbled up Ex-Libris, further
centralising corporate control over the world’s knowledge. Proquest will
undoubtedly now charge even more for their
infinitely-replicable-at-negligible-cost digital files. Libraries will
pay, because ‘access’. At least until they can’t afford it. The result of ceding control over journal archives has not been more access, but less.

and:

As Benjamin Franklin might have said if he was a librarian: those who
would give up essential liberty, to purchase a little access, deserve
neither and will lose both.

From the very beginning of the LOCKSS Program 17 years ago, the goal has been to provide librarians with the tools they need to take control and ownership of the content they pay for. As I write, JSTOR has been unable to deliver articles for three days and libraries all over the world have been deprived of access to all the content they have paid JSTOR for through the years. Had they owned copies of the content, as they did on paper, no such system-wide failure would have been possible.

Friday, October 9, 2015

Back in May I posted Time For Another IoT Rant. Since then I've added 28 comments about the developments over the last 132 days, or more than one new disaster every 5 days. Those are just the ones I noticed. So it's time for another dispatch from the front lines of the IoT war zone on which I can hang reports of the disasters to come. Below the fold, I cover yesterday's happenings on two sectors of the front line.

Yesterday Doctorow pointed to another of Maciej Cegłowski's barn-burning speeches. This one is entitled Haunted by Data, and it is just as much of a must-read. Doctorow is obviously a fan of Cegłowski's and now so am I. It is hard to write talks this good, and even harder to ensure that they are relevant to stuff I was posting in May. This one takes the argument of The Panopticon Is Good For You, also from May, and makes it more general and much clearer. Below the fold, details.

Tuesday, October 6, 2015

After patting myself on the back about one good prediction, here is another. Ever since Dave Anderson's presentation to the 2009 Storage Architecture meeting at the Library of Congress, I've been arguing that for flash to displace disk as the bulk storage medium would require flash vendors to make such enormous investments in new fab capacity that there would be no possibility of making an adequate return on the investments. Since the vendors couldn't make money on the investment, they wouldn't make it, and flash would not displace disk. 6 years later, despite the arrival of 3D flash, that is still the case.

expected to account for less than 10 per cent of the total storage capacity the industry will need by 2020.

Stifel estimates that:

Samsung is estimated to be spending over $23bn in capex on its 3D NAND for an estimated ~10-12 exabytes of capacity.

If it is fully ramped-in by 2018 it will make about 1% of the capacity the disk manufacturers will make that year. So the investment to replace disk's capacity with flash would be about $2.3T, which clearly isn't going to happen. Unless the investment to make a petabyte of flash per year is much less than the investment to make a petabyte of disk, disk will remain the medium of choice for bulk storage.
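The arithmetic behind the $2.3T figure takes only a couple of lines to check. The flash numbers are Stifel's as quoted; the 100-to-1 disk-to-flash capacity ratio is the ~1% estimate above, so the result is only as good as those inputs.

```python
# Rough check of the fab-investment argument, using the quoted figures.
flash_capex = 23e9   # Samsung's 3D NAND capex, in dollars (Stifel estimate)
flash_eb = 11.0      # midpoint of the ~10-12 exabytes that capex buys
capex_per_eb = flash_capex / flash_eb

# That flash capacity is put at ~1% of what disk will ship that year,
# so replacing disk's output means building ~100x the capacity:
disk_replacement_capex = capex_per_eb * flash_eb * 100
print(f"${disk_replacement_capex / 1e12:.1f}T")  # → $2.3T
```

However the exabyte estimate moves within the 10-12EB range, the answer stays in the trillions, which is the point.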

Sunday, October 4, 2015

I've had occasion to note the work of Steve Randy Waldman before. Today, he has a fascinating post up entitled 1099 as Antitrust that may not at first seem relevant to digital preservation. Below the fold I trace the important connection.

The LOCKSS Program develops and supports libraries using open source peer-to-peer digital preservation software. Although initial development and deployment was funded by grants including from NSF and the Mellon Foundation, grant funding is not a sustainable basis for long-term preservation. The LOCKSS Program runs the "Red Hat" model of free, open source software and paid support. From 2007 through 2012 the program was in the black with no grant funds at all.

The demands of the "Red Hat" model make it hard to devote development resources to enhancements that don't address immediate user demands but are targeted at longer-term issues. After discussing this issue with the Mellon Foundation, the LOCKSS Program was awarded a grant to cover a specific set of infrastructure enhancements. It made significant functional and performance improvements to the LOCKSS software in the areas of ingest, preservation and dissemination. The LOCKSS Program's experience shows that the "Red Hat" model is a viable basis for long-term digital preservation, but that it may need to be supplemented by occasional small grants targeted at longer-term issues.

Wednesday, September 16, 2015

My third post to this blog, more than 8 years ago, was entitled Format Obsolescence: the Prostate Cancer of Preservation. In it I argued that format obsolescence for widely-used formats, such as those on the Web, would be rare and, if it ever happened, would be a very slow process allowing plenty of time for preservation systems to respond.

Thus devoting a large proportion of the resources available for preservation to obsessively collecting metadata intended to ease eventual format migration was economically unjustifiable, for three reasons. First, the time value of money meant that paying the cost later would allow more content to be preserved. Second, the format might never suffer obsolescence, so the cost of preparing to migrate it would be wasted. Third, if the format ever did suffer obsolescence, the technology available to handle it when obsolescence occurred would be better than when it was ingested.
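The time-value-of-money point can be made concrete with made-up numbers: a migration cost deferred for twenty years is much cheaper in present-value terms, so a fixed preservation budget ingests more content now. The $100 cost, 5% discount rate and 20-year horizon below are illustrative assumptions, not figures from the post.

```python
# Illustrative only: a $100 format-migration cost paid now versus the
# same cost deferred 20 years, discounted at an assumed 5%/year.
cost = 100.0
rate = 0.05
years = 20

present_value = cost / (1 + rate) ** years
print(round(present_value, 2))  # → 37.69
```

In other words, under these assumptions preparing for migration at ingest costs nearly three times as much, in today's dollars, as paying for it if and when obsolescence actually strikes.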

Below the fold, I ask how well these predictions have held up in the light of subsequent developments.

Friday, September 11, 2015

The Library of Congress' Storage Architectures workshop gave a group of us each 3 minutes to respond to a set of predictions for 2015 and questions accumulated at previous instances of this fascinating workshop. Below the fold, the brief talk in which I addressed one of the predictions. At the last minute, we were given 2 minutes more, so I made one of my own.

Tuesday, September 8, 2015

I've been writing a report about emulation as a preservation strategy.
Below the fold, a discussion of one of the ideas that I've been thinking
about as I write, the unique position national libraries are in
to assist with building the infrastructure emulation needs to succeed.

Tuesday, August 18, 2015

Last week's Storage Valley Supper Club provided an update on developments in solid state memories.

First, the incumbent technology, planar flash, has reached the end of its development path at the 15nm generation. Planar flash will continue to be the majority of flash bits shipped through 2018, but the current generation is the last.

Second, all the major flash manufacturers are now shipping 3D flash, the replacement for planar. Stacking the cells vertically provides much greater density; the cost is a much more complex manufacturing process and, at least until the process is refined, much lower yields. This has led to much skepticism about the economics of 3D flash, but it turns out that the picture isn't as bad as it appeared. The reason is, in a sense, depressing.

It is always important to remember that, at bottom, digital storage media are analog. Because 3D flash is much denser, there are a lot more cells. Because of the complexity of the manufacturing process, the quality of each cell is much worse. But because there are many more cells, the impact of the worse quality is reduced. More flash controller intelligence adapting to the poor quality or even non-functionality of the individual cells, and more of the cells used for error correction, mean that 3D flash can survive lower yields of fully functional cells.

The advent of 3D means that flash prices, which had stabilized, will resume their gradual decrease. But anyone hoping that 3D will cause a massive drop will be disappointed.

Third, the post-flash solid state technologies such as Phase Change Memory (PCM) are increasingly real but, as expected, they are aiming at the expensive, high-performance end of the market. HGST has demonstrated a:

PCM SSD with less than two microseconds round-trip access latency for
512B reads, and throughput exceeding 3.5 GB/s for 2KB block sizes.

which, despite the near-DRAM performance, draws very little power.

But the big announcement was Intel/Micron's 3D XPoint. They are very cagey about the details, but it is a resistive memory technology that is 1000 times faster than NAND, 1000 times the endurance, and 100 times denser. They see the technology initially being deployed, as shown in the graph, as an ultra-fast but non-volatile layer between DRAM and flash, but it clearly has greater potential once it gets down the price curve.

In less than a decade, Dr. Aad, who lives in Marseilles, France, has
appeared as the lead author on 458 scientific papers. Nobody knows just
how many scientists it may take to screw in a light bulb, but it took
5,154 researchers to write one physics paper earlier this year—likely a
record—and Dr. Aad led the list.

His scientific renown is a tribute to alphabetical order.

The article includes this amazing graph from Thomson Reuters, showing the spectacular rise in papers with enough authors that their names had to reflect alphabetical order rather than their contribution to the research. And the problem is spreading:

“The challenges are quite substantial,” said Marcia McNutt, editor in
chief of the journal Science. “The average number of authors even on a
typical paper has doubled.”

Of course, it is true that in some fields doing any significant research requires a large team, and that some means of assigning credit to team members is necessary. But doing so by adding their names to an alphabetized list of authors on the paper describing the results has become an ineffective way of doing the job. If each author gets 1/5154 of the credit for a good paper it is hardly worth having compared to the whole credit for a single-author bad paper. If each of the 5154 authors gets full credit, the paper generates 5154 times as much credit as it is due. And if the list is alphabetized but is treated as reflecting contribution, Dr. Aad is a big winner.
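The three credit-assignment schemes in the preceding paragraph are easy to compare numerically. This is a toy model with the 5,154-author paper as the example; the 1.0 "unit" of credit per paper is an arbitrary assumption for illustration.

```python
# Toy comparison of credit-assignment schemes for an n-author paper
# worth 1.0 unit of credit (the unit is an arbitrary assumption).
n = 5154

# Scheme 1: split the credit evenly among all authors.
fractional_share = 1.0 / n           # each author gets a negligible sliver

# Scheme 2: give every author full credit.
total_minted = 1.0 * n               # the paper "mints" n units of credit

print(round(fractional_share, 6), total_minted)
```

Scheme 3, alphabetical order read as contribution order, needs no arithmetic: whoever sorts first collects.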

How long before the first paper is published with more authors than words?

Tuesday, August 11, 2015

Although at last count I'm a named inventor on at least a couple of dozen US patents, I've long believed that the operation of the patent system, like the copyright system, is profoundly counter-productive. Since "reform" of these systems is inevitably hijacked by intellectual property interests, I believe that at least the patent system, if not both, should be completely abolished. The idea that an infinite supply of low-cost, government enforced monopolies is in the public interest is absurd on its face. Below the fold, some support for my position.

Amazon Web Services, ... grew its revenue by 81 percent year on year in the second
quarter. It grew faster and with higher profit margins than any other
aspect of Amazon’s business.

AWS, which offers leased computing services to businesses, posted
revenue of $1.82 billion, up from $1 billion a year ago, as part of its second-quarter results.

By comparison, retail sales in North America grew only 26 percent to $13.8 billion from $11 billion a year ago.

The cloud computing business also posted operating income of $391
million — up an astonishing 407 percent from $77 million at this time
last year — for an operating margin of 21 percent, making it Amazon’s
most profitable business unit by far. The North American retail unit
turned in an operating margin of only 5.1 percent.

Revenue growing at 81% year-on-year at a 21% and growing margin despite:

Amazon's strategy is not to generate and distribute profits, but to
re-invest their cash flow into starting and developing businesses.
Starting each business absorbs cash, but as they develop they turn
around and start generating cash that can be used to start the next one.

Unfortunately, S3 is part of AWS for reporting purposes, so we can't see the margins for the storage business alone. But I've been predicting for years that if we could, we would find them to be very generous.

Tuesday, July 7, 2015

The Internet Archive has by far the largest archive of Web content but its preservation leaves much to be desired. The collection is mirrored between San Francisco and Richmond in the Bay Area, both uncomfortably close to the same major fault systems. There are partial copies in the Netherlands and Egypt, but they are not synchronized with the primary systems.

This survey also shows that long term preservation planning and
strategies are still lacking to ensure the long term preservation of web
archives. Several reasons may explain this situation: on one hand, web
archiving is a relatively recent field for libraries and other heritage
institutions, compared for example with digitization; on the other hand,
web archives preservation presents specific challenges that are hard to
meet.

I discussed the problem of creating and maintaining a remote backup of the Internet Archive's collection in The Opposite of LOCKSS. The Internet Archive isn't alone in having less than ideal preservation of its collection. It's clear the major challenges are the storage and bandwidth requirements for Web archiving, and their rapid growth. Given the limited resources available, and the inadequate reliability of current storage technology, prioritizing collecting more content over preserving the content already collected is appropriate.

Perhaps a future article in the series will describe how successive US administrations consistently strove to ensure that encryption wasn't used to make systems less insecure, and that what encryption was used was as weak as possible. They prioritized their (and their opponents') ability to spy over mitigating the risks that Internet users faced, and they got what they wanted, as we see with the compromise of the Office of Personnel Management and the possibly related compromise of health insurers including Anthem. These breaches revealed the kind of information that renders everyone with a security clearance vulnerable to phishing and blackmail. Be careful what you wish for!

Tuesday, June 23, 2015

Director Zhang began by surveying the digital landscape, emphasizing the rise of ebooks, digital journals, and machine reading. The CAS decided to embrace the digital-first approach, and canceled all print subscriptions for Chinese-language journals. Anything they don’t own they obtain through consortial relationships ...

This approach works well for a growing proportion of the CAS constituency, which Xiaolin referred to as “Generation Open” or “Generation Digital”. This group benefits from – indeed, expects – a transition from print to open access. For them, and for our presenter, “only ejournals are real journals. Only smartbooks are real books… Print-based communication is a mistake, based on historical practicality.” It’s not just consumers, but also funders who prefer open access.

We were inspired by a 2009 paper, FAWN: A Fast Array of Wimpy Nodes, in which David Andersen and his co-authors from CMU showed that a network of large numbers of small CPUs coupled with modest amounts of flash memory could process key-value queries at the same speed as the networks of beefy servers used by, for example, Google, but using two orders of magnitude less power.

As this McElroy slide shows, power cost is important and it varies over a 3x range (a problem for Kaminska's thesis about the importance of 21 Inc's bitcoin mining hardware). He specifically mentions the need to get the computation close to the data, with ARM processors in the storage fabric. In this way the amount of data to be moved can be significantly reduced, and thus the capital cost, since as he reports the cost of the network hardware is 25% of the cost of the rack, and it burns a lot of power.

At present, eBay relies on tiering, moving data to less expensive storage such as consumer hard drives when it hasn't been accessed in some time. As I wrote last year:

Fundamentally, tiering like most storage architectures suffers from the idea that in order to do anything with data you need to move it from the storage medium to some compute engine. Thus an obsession with I/O bandwidth rather than what the application really wants, which is query processing rate. By moving computation to the data on the storage medium, rather than moving data to the computation, architectures like DAWN and Seagate's and WD's Ethernet-connected hard disks show how to avoid the need to tier and thus the need to be right in your predictions about how users will access the data.
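A toy calculation illustrates the point (the record count and query selectivity are purely hypothetical): when the query runs on a CPU in the storage fabric, only the results cross the network, not the raw data.

```python
# Hypothetical illustration of "move the computation to the data":
# compare bytes crossing the network when a storage node ships raw
# records to a compute engine versus evaluating the query locally
# and shipping only the matching records.
RECORD_SIZE = 1024            # bytes per record (illustrative)
records = 1_000_000           # records held on one storage node
selectivity = 0.001           # fraction of records the query matches

# Tiered architecture: move all the data to the compute engine.
bytes_moved_to_compute = records * RECORD_SIZE

# DAWN-style architecture: run the query on the storage node's CPU.
bytes_moved_to_client = int(records * selectivity) * RECORD_SIZE

print(f"data-to-compute: {bytes_moved_to_compute / 1e6:.0f} MB over the network")
print(f"compute-to-data: {bytes_moved_to_client / 1e6:.1f} MB over the network")
```

For this (made-up) query the network traffic drops by a factor of a thousand, which is why the application-visible metric that matters is query processing rate, not raw I/O bandwidth.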

That post was in part about Facebook's use of tiering, which works well because Facebook has highly predictable data access patterns. McElroy's talk suggests that eBay's data accesses are somewhat predictable, but much less so than Facebook's. This makes his implication that tiering isn't a good long-term approach plausible.

Tuesday, June 16, 2015

I'm not a regular reader of the Financial Times, so I really regret I hadn't noticed that Izabella Kaminska and others at the FT's Alphaville blog have been posting excellent work in their BitcoinMania series. For a taste, see Bitcoin's lien problem, in which Kaminska discusses the problems caused by the fact that the blockchain records the transfer of assets but not the conditions attached to the transfer:

For example, let's hypothesise that Tony Soprano was to start a bitcoin loan-sharking operation. The bitcoin network would have no way of differentiating bitcoins being transferred from his account with conditions attached - such as repayment in x amount of days, with x amount of points of interest or else you and your family get yourself some concrete boots - and those being transferred as legitimate and final settlement for the procurement of baked cannoli goods.

Now say you've lost all the bitcoin you owe to Tony Soprano on the gambling website Satoshi Dice. What are the chances that Tony forgets all about it and offers you a clean slate? Not high. Tony, in all likelihood, will pursue his claim with you.

Indeed, given the high volume of fraud and default in the bitcoin network, chances are most bitcoins have competing claims over them by now. Put another way, there are probably more people with legitimate claims over bitcoins than there are bitcoins. And if they can prove the trail, they can make a legal case for reclamation.

This contrasts considerably with government cash. In the eyes of the UCC code, cash doesn't take its claim history with it upon transfer. To the contrary, anyone who acquires cash starts off with a clean slate as far as previous claims are concerned. ... According to Fogg there is currently only one way to mitigate this sort of outstanding bitcoin claim risk in the eyes of US law. ... investors could transform bitcoins into financial assets in line with Article 8 of the UCC. By doing this bitcoins would be absolved from their cumbersome claim history.

The catch: the only way to do that is to deposit the bitcoin in a formal (a.k.a licensed) custodial or broker-dealer agent account.

In other words, to avoid the lien problem you have to submit to government regulation, which is what Bitcoin was supposed to escape from. Government-regulated money comes with a government-regulated dispute resolution system. Bitcoin's lack of a dispute resolution system is seen in the problems Ross Ulbricht ran into.

Tuesday, June 9, 2015

It’s been hard to make a living as a journalist in the 21st century, but it’s gotten easier over the last few years, as we’ve settled on the world’s newest and most lucrative business model: invasive surveillance. News site webpages track you on behalf of dozens of companies: ad firms, social media services, data resellers, analytics firms — we use, and are used by, them all.
...
I did not do this. Instead, over the years, I only enabled others to do it, as some small salve to my conscience. In fact, I made a career out of explaining surveillance and security, what the net was doing and how, but on platforms that were violating my readers as far as technically possible.
...
We can become wizards in our own right, a world of wizards, not subject to the old powers that control us now. But it’s going to take a lot of work. We’re all going to have to learn a lot — the journalists, the readers, the next generation. Then we’re going to have to push back on the people who watch us and try to control who we are.

a 67.5% reduction in the number of HTTP cookies
set during a crawl of the Alexa top 200 news sites. [and] a 44%
median reduction in page load time and 39% reduction in data
usage in the Alexa top 200 news sites.

Sunday, June 7, 2015

I gave a brief talk during the meeting at Columbia on Web Archiving Collaboration: New Tools and Models to introduce the session on Tools/APIs: integration into systems and standardization. The title was "Web Archiving APIs: Why and Which?" An edited text is below the fold.

My colleagues at the Stanford Libraries are actively working to archive games. Back in 2013, on the Library of Congress' The Signal digital preservation blog Trevor Owens interviewed Stanford's Henry Lowood, who curates our games collection.

Saturday, May 30, 2015

As Stanford staff I get a feel-good email every morning full of stuff about the wonderful things Stanford is doing. Last Thursday's linked to this article from the medical school about Stanford's annual Big Data in Biomedicine conference. It is full of gee-whiz speculation about how the human condition can be improved if massive amounts of data are collected about every human on the planet and shared freely among medical researchers. Below the fold, I give a taste of the speculation and, in my usual way, ask what could possibly go wrong?

Thursday, May 28, 2015

I haven't posted on the looming disaster that is the Internet of Things You Don't Own since last October, although I have been keeping track of developments in brief comments to that post. The great Charlie Stross just weighed in with a brilliant, must-read examination of the potential the IoT brings for innovations in rent-seeking, which convinced me that it was time for an update. Below the fold, I discuss the Stross business model and other developments in the last 8 months.

Economists like to say there are no bad people, just bad incentives. The
incentives to publish today are corrupting the scientific literature
and the media that covers it. Until those incentives change, we’ll all
get fooled again.

Earlier this year I saw Tom Stoppard's play The Hard Problem at the Royal National Theatre, which deals with the same issue. The tragedy is driven by the characters being entranced by the prospect of publishing an attention-grabbing result. Below the fold, more on the problem of bad incentives in science.

Thursday, May 21, 2015

Trevor Pott has a post at The Register entitled Flash banishes the spectre of the unrecoverable data error in which he points out that while the Bit Error Rates (BER) quoted by hard disk manufacturers are typically 10^-14 or 10^-15, SSD BERs range from 10^-16 for consumer drives to 10^-18 for hardened enterprise drives. Below the fold, a look at his analysis of the impact of this difference of up to 4 orders of magnitude.
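To see what those exponents mean in practice, here is a back-of-envelope calculation, assuming independent bit errors and a hypothetical 4TB drive read end-to-end once, of the chance of hitting at least one unrecoverable error:

```python
import math

def p_unrecoverable_read(capacity_bytes, ber):
    """Probability of at least one unrecoverable bit error when
    reading the whole device once, assuming independent errors."""
    bits = capacity_bytes * 8
    # 1 - (1 - ber)^bits, computed stably for tiny per-bit probabilities.
    return -math.expm1(bits * math.log1p(-ber))

tb = 10**12
for ber in (1e-14, 1e-15, 1e-16, 1e-18):
    print(f"BER {ber:.0e}: {p_unrecoverable_read(4 * tb, ber):.2%}")
```

At a BER of 10^-14 a full read of such a drive has roughly a one-in-four chance of an unrecoverable error; at 10^-18 the chance is a few in a hundred thousand. That is the gap Pott's analysis turns on.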

Tuesday, May 19, 2015

I started blogging about the transition the Web is undergoing from a document to a programming model, from static to dynamic content, some time ago. This transition has very fundamental implications for Web archiving; what exactly does it mean to preserve something that is different every time you look at it? Not to mention the vastly increased cost of ingest, because executing a program takes far more computation, potentially an unlimited amount, than simply parsing a document.

The transition has big implications for search engines too; they also have to execute rather than parse. Web developers have a strong incentive to make their pages search engine friendly, so although they have enthusiastically embraced Javascript they have often retained a parse-able path for search engine crawlers to follow. We have watched academic journals adopt Javascript, but so far very few have forced us to execute it to find their content.

Adam Audette and his collaborators at Merkle | RKG have an interesting post entitled We Tested How Googlebot Crawls Javascript And Here’s What We Learned. It is aimed at the SEO (Search Engine Optimization) world but it contains a lot of useful information for Web archiving. The TL;DR is that Google (but not yet other search engines) is now executing the Javascript in ways that make providing an alternate, parse-able path largely irrelevant to a site's ranking. Over time, this will mean that the alternate paths will disappear, and force Web archives to execute the content.

Tuesday, May 12, 2015

We know that those Open Access policies that work are the ones that have teeth. Both institutional and funder policies work better when tied to reporting requirements. The success of the University of Liege in filling its repository is in large part due to the fact that works not in the repository do not count for annual reviews. Both the NIH and Wellcome policies have seen substantial jumps in the proportion of articles reaching the repository when grantees' final payments, or their ability to apply for new grants, were withheld until issues were corrected.

He points out that:

Monitoring Open Access policy implementation requires three main steps. The steps are:

Each of these steps is difficult or impossible in our current data
environment. Each of them could be radically improved with some small
steps in policy design and metadata provision, alongside the wider
release of data on funded outputs.

He makes three important recommendations:

Identification of Relevant Outputs: Policy design should include mechanisms for identifying and publicly listing outputs that are subject to the policy. The use of community standard persistable and unique identifiers should be strongly recommended. Further work is needed on creating community mechanisms that identify author affiliations and funding sources across the scholarly literature.

Discovery of Accessible Versions: Policy design should express compliance requirements for repositories and journals in terms of metadata standards that enable aggregation and consistent harvesting. The infrastructure to enable this harvesting should be seen as a core part of the public investment in scholarly communications.

Auditing Policy Implementation: Policy requirements should be expressed in terms of metadata requirements that allow for automated implementation monitoring. RIOXX and ALI proposals represent a step towards enabling automated auditing but further work, testing and refinement will be required to make this work at scale.

What he is saying is that defining policies that mandate certain aspects of Web-published materials without mandating that they conform to standards that make them enforceable over the Web is futile. This should be a no-brainer. The idea that, at scale, without funding, conformance will be enforced manually is laughable. The idea that researchers will voluntarily comply when they know that there is no effective enforcement is equally laughable.

Thursday, May 7, 2015

Barry Ritholtz points me to Ben Thompson's post The AWS IPO, in which he examines Amazon's most recent financials. They're the first in which Amazon has broken out AWS as a separate line of business, so they are the first to reveal the margins Amazon is achieving on their cloud business. The answer is:

AWS is very profitable: $265 million in profit on $1.57 billion
in sales last quarter alone, for an impressive (for Amazon!) 17% net
margin.

The post starts by supposing that Amazon spun out AWS via an IPO:

One of the technology industry’s biggest and most important IPOs
occurred late last month, with a valuation of $25.6 billion dollars.
That’s more than Google, which IPO’d at a valuation of $24.6 billion,
and certainly a lot more than Amazon, which finished its first day on
the public markets with a valuation of $438 million.

It concludes:

The profitability of AWS is a big deal in-and-of itself, particularly
given the sentiment that cloud computing will ultimately be a commodity
won by the companies with the deepest pockets. It turns out that all the
reasons to believe in AWS were spot on: Amazon is clearly reaping the
benefits of scale from being the largest player, and their determination
to have both the most complete and cheapest offering echoes their prior
strategies in e-commerce.

All the indications are that the money already invested in the research
publishing system is sufficient to enable a transformation that will be
sustainable for the future. There needs to be a shared understanding
that the money currently locked in the journal subscription system must
be withdrawn and re-purposed for open access publishing services. The
current library acquisition budgets are the ultimate reservoir for
enabling the transformation without financial or other risks.

They present:

generic calculations we have made on the basis of available publication data and revenue values at global, national and institutional levels.

These include detailed data as to their own spending on open access article processing charges (APCs), which they have made available on-line, and from many other sources including the Wellcome Trust and the Austrian Science Fund. They show that APCs are less than €2.0K/article while subscription costs are €3.8-5.0K/article, so the claim that sufficient funds are available is credible. It is important to note that they exclude hybrid APCs such as those resulting from the stupid double-dipping deals the UK made; these are "widely considered not to reflect a true market value". As an Englishman, I appreciate understatement. Thus they support my and Andrew Odlyzko's contention that margins in the academic publishing business are extortionate.
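A quick check of the headroom implied by those per-article figures:

```python
# Per-article cost under APC-funded open access versus the subscription
# system, using the figures from the post (in euros).
apc_per_article = 2000                       # "less than €2.0K/article"
subscription_low, subscription_high = 3800, 5000

print(f"subscription / APC ratio: {subscription_low / apc_per_article:.1f}x "
      f"to {subscription_high / apc_per_article:.1f}x")
```

The subscription system is extracting roughly two to two-and-a-half times what an APC-funded system would cost per article, which is the basis for the claim that the money already in the system is sufficient for the transition.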

Friday, May 1, 2015

The International Internet Preservation Consortium's General Assembly brings together those involved in Web archiving from around the world. This year's was held at Stanford and the Internet Archive. I was asked to give a short talk outlining the LOCKSS Program, explaining how and why it differs from most Web archiving efforts, and how we plan to evolve it in the near future to align it more closely with the mainstream of Web archiving. Below the fold, an edited text with links to the sources.

Tuesday, April 21, 2015

One of the most interesting sessions at the recent CNI was on the Ontario Library Research Cloud (OLRC). It is a collaboration between universities in Ontario to provide a low-cost, distributed, mutually owned private storage cloud with adequate compute capacity for uses such as text-mining. Below the fold, my commentary on their presentations.

Friday, April 10, 2015

Chris Mellor has an interesting piece at The Register pointing out that while 3D NAND flash may be dense, it's going to be expensive.

The reason is the enormous number of processing steps per wafer -
between 96 and 144 deposition layers for the three leading 3D NAND flash
technologies. Getting non-zero yields from that many steps involves
huge investments in the fab:

Samsung, SanDisk/Toshiba, and Micron/Intel have already announced +$18bn investment for 3D NAND.

This compares with Seagate and Western Digital’s capex totalling ~$4.3 bn over the past three years.

Chris has this chart, from Gartner and Stifel, comparing the annual capital expenditure per TB of storage of NAND flash and hard disk. Each TB of flash contains at least 50 times as much capital as a TB of hard disk, which means it will be a lot more expensive to buy.

As of 24th March 2015, a selection of authors submitting a biology manuscript to Scientific Reports
will be able to opt-in to a fast-track peer-review service at an
additional cost. Authors who opt-in to fast-track will receive an
editorial decision (accept, reject or revise) with peer-review comments
within three weeks of their manuscript passing initial quality checks.

Sunday, April 5, 2015

I was interviewed for an upcoming news article in Nature about the problem of link rot in scientific publications, based on the recent Klein et al paper in PLoS One. The paper is full of great statistical data but, as would be expected in a scientific paper, lacks the personal stories that would improve a news article.

I mentioned the interview over dinner with my step-daughter, who was featured in the very first post to this blog when she was a grad student. She immediately said that her current work is hamstrung by precisely the kind of link rot Klein et al investigated. She is frustrated because the dataset from a widely cited paper has vanished from the Web. Below the fold, a working post that I will update as the search for this dataset continues.

Wednesday, April 1, 2015

The Andrew W. Mellon Foundation is aggressively funding efforts to support
new forms of academic publishing, which researchers say could further
legitimize digital scholarship.

The foundation in May sent university press directors a
request for proposals to a new grant-making initiative for long-form
digital publishing for the humanities. In the e-mail, the foundation
noted the growing popularity of digital scholarship, which presented an
“urgent and compelling” need for university presses to publish and make
digital work available to readers.

Note in particular:

The foundation’s proposed solution is for groups of university
presses to ... tackle any of the moving parts that task is comprised
of, including “...(g) distribution; and (h) maintenance and
preservation of digital content.”

Below the fold, some thoughts on this based on experience from the LOCKSS Program.

Tuesday, March 24, 2015

Jill Lepore's New Yorker "Cobweb" article has focused attention on the importance of the Internet Archive, and the analogy with the Library of Alexandria; in particular on the risks implicit in the fact that both represent single points of failure, because they are so much larger than any other collection.

Typically, Jason Scott was first to respond with an outline proposal to back up the Internet Archive, by greatly expanding the collaborative efforts of ArchiveTeam. I think Jason is trying to do something really important, and extremely difficult.

The Internet Archive's collection is currently around 15PB. It has doubled in size in about 30 months. Suppose it takes another 30 months to develop and deploy a solution at scale. We're talking crowd-sourcing a distributed backup of at least 30PB growing at least 3PB/year.

To get some idea of what this means, suppose we wanted to use Amazon's Glacier. This is, after all, exactly the kind of application Glacier is targeted at. As I predicted shortly after Glacier launched, Amazon has stuck with the 1c/GB/mo price. So in 2017 we'd be paying Amazon $3.6M a year just for the storage costs. Alternatively, suppose we used Backblaze's Storage Pod 4.5 at their current price of about 5c/GB. For each copy we'd have paid $1.5M in hardware cost and would be adding $150K worth per year. This ignores running costs and RAID overhead.
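The arithmetic behind these estimates, using the post's assumptions of a 30PB collection growing at 3PB/year:

```python
# Back-of-envelope costs for backing up the Internet Archive in 2017.
PB = 10**6  # GB per PB

collection_gb = 30 * PB
growth_gb_per_year = 3 * PB

# Amazon Glacier at 1c/GB/month, storage cost only.
glacier_per_year = collection_gb * 0.01 * 12

# Backblaze Storage Pod 4.5 at ~5c/GB of raw hardware, per copy.
pod_hardware = collection_gb * 0.05
pod_growth = growth_gb_per_year * 0.05

print(f"Glacier storage: ${glacier_per_year / 1e6:.1f}M/year")
print(f"Storage Pods: ${pod_hardware / 1e6:.1f}M hardware "
      f"+ ${pod_growth / 1e3:.0f}K/year of new pods")
```

Note that the Storage Pod figure is raw hardware per copy; power, bandwidth, operations and RAID overhead all come on top.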

It will be very hard to crowd-source resources on this scale, which is why I say this is the opposite of Lots Of Copies Keep Stuff Safe. The system is going to be short of storage; the goal of a backup for the Internet Archive must be the maximum of reliability for the minimum of storage.

Nevertheless, I believe it would be well worth trying some version of his proposal and I'm happy to help any way I can. Below the fold, my comments on the design of such a system.

The exponential growth in the number of scientific papers makes it increasingly difficult for researchers to keep track of all the publications relevant to their work. Consequently, the attention that can be devoted to individual papers, measured by their citation counts, is bound to decay rapidly. ... The decay is ... becoming faster over the years, signaling that nowadays papers are forgotten more quickly. However, when time is counted in terms of the number of published papers, the rate of decay of citations is fairly independent of the period considered. This indicates that the attention of scholars depends on the number of published items, and not on real time.

Friday, March 13, 2015

The current empirical literature on the effects of journal
rank provides evidence supporting the following four conclusions:
1) Journal rank is a weak to moderate predictor of scientific impact;
2) Journal rank is a moderate to strong predictor of both intentional
and unintentional scientific unreliability;
3) Journal rank is expensive, delays science and frustrates researchers;
and, 4) Journal rank as established by [Impact Factor] violates even
the most basic scientific standards, but predicts subjective judgments
of journal quality.

I'm particularly concerned about the medical journals that participate
in advertising networks. Imagine that someone is researching clinical
trials for a deadly disease. A smart insurance company could target such
users with ads that mark them for higher premiums. A pharmaceutical
company could use advertising targeting researchers at competing
companies to find clues about their research directions. Most journal
users (and probably most journal publishers) don't realize how easily
online ads can be used to gain intelligence as well as to sell products.

I should have remembered that, less than 3 weeks ago, Brian Merchant at Motherboard posted Looking Up Symptoms Online? These Companies Are Tracking You, pointing out that health sites such as WebMD and, even less forgivably, the Centers for Disease Control, are rife with trackers selling their visitors' information to data brokers:

The CDC example is notable because it’s a government site, one we assume
should be free of the profit motive, and entirely safe for use. “It’s
basically negligence,”

If you want to look up health information online, you need to use Tor.

It claims to have much lower latency, a few seconds instead of a few hours.

It has the same (synchronous) API as Google's more expensive storage, where Glacier has a different (asynchronous) API than S3.

Its pricing for getting data out lacks Glacier's 5% free tier, but otherwise is much simpler than Glacier's.

As I predicted at Glacier's launch, Amazon stuck with the 1c/GB/mo price point while, albeit slowly, the technology got cheaper. So they have room to cut prices in response, but I'd bet that they won't.

I believe I know how Google has built their nearline technology; I wrote about it two years ago.

Thursday, March 5, 2015

Tom Coughlin uses Hetzler's touch-rate metric to argue for tiered storage for archives in a two-part series. Although there's good stuff there, I have two problems with Tom's argument. Below the fold, I discuss them.

Saturday, February 28, 2015

I was one of the crowd of people who reacted to Wednesday's news that Argonne National Labs would shut down the NEWTON Ask A Scientist service, on-line since 1991, this Sunday by alerting Jason Scott's ArchiveTeam. Jason did what I should have done before flashing the bat-signal. He fed the URL into the Internet Archive's Save Page Now, to be told "relax, we're all over it". The site has been captured since 1996 and the most recent capture before the announcement was Feb 7th. Jason arranged for captures Thursday and today.

As you can see by these examples, the Wayback Machine has a pretty good copy of the final state of the service and, as the use of Memento spreads, it will even remain accessible via its original URL.