I'm David Rosenthal, and this is a place to discuss the work I'm doing in Digital Preservation.

Thursday, June 24, 2010

JCDL 2010 Keynote

On June 23 I gave a keynote address entitled "Stepping Twice Into The Same River" to the joint JCDL/ICADL 2010 conference at Surfer's Paradise in Queensland, Australia. Below the fold is an edited text of the talk, with links to the resources.

Stepping Twice Into The Same River

I've been an engineer in Silicon Valley for a quarter of a century. I've spent the last decade and a bit working on the LOCKSS distributed digital preservation system. But before that I compiled a three-for-three record of successful startups. Marc Andreessen did even better; he did Netscape, LoudCloud and Ning, but he's also a successful VC and he's on the boards of Facebook, eBay and HP. He spoke to, or rather answered questions from, a standing-room-only crowd at Stanford's Business School last month. The video is on Stanford's web.

Marc made a lot of good points, things that I recognized from my startup experience. In particular, he stressed that the best time to start a business was when a technological or an economic discontinuity hits the industry, or preferably when they both hit.

One of Marc's examples is particularly relevant today. He has said that it is time for the New York Times to shut off the presses, that everyone there knows it, that they can't talk about it because 80% of their income still comes from paper, and that because money drives attention they are spending 80% of their attention on a dying business, which is why the remaining 20% isn't being effective in the digital world. He says that if they turned off the presses they would end up a much smaller company, but they would be communicating the same news and making money. Until they do, they won't. The fact that the New York Times is not going to turn off the presses in the foreseeable future is the reason Marc has invested in a startup news business, TPM Media.

But why am I talking about startups to a digital library conference? Because this is a talk about the intersection of technology and economics. The premier scholar in this area is W. Brian Arthur, of the Santa Fe Institute & PARC. He pointed out that technology markets have increasing returns to scale, more commonly described as network effects like Metcalfe's Law, and thus are likely to be captured by a single dominant player (Windows, the X86, iPod, and so on). This is his new book; if you aren't convinced of the importance of economics to the impact of technology in the real world you need to read it, and his earlier work.

Everyone is urging libraries to start up businesses publishing and archiving data. Here is Rick Luce from a CLIR report.

“... research libraries should focus on developing the functional requirements of a data-archiving infrastructure,”

Note the stress on funding.

“Adequate and sustained funding for long-lived data collections ... remains a vexing problem ... the widely decentralized and nonstandard mechanisms for generating data ... make this problem an order of magnitude more difficult than our experiences to date ...”

“A New Value Equation Challenge”, Rick Luce, CLIR, 2008

Indeed we find that libraries are starting up publishing businesses. But this ARL report shows that they're publishing journals and monographs.

“44% of … ARL member libraries … were delivering publishing services … 21% were in the process of planning publishing service development.”

Viewed from Stanford, there are several odd things about this report. One is that it nowhere mentions Stanford Libraries' HighWire Press. Stanford Libraries saw the technological discontinuity of the Web early, and 15 years ago last month pioneered the switch of journal publishing from paper to digital when HighWire put the "Journal of Biological Chemistry" on-line.

The goal was to seize the initiative and define the market for on-line journals in beneficial ways, before the big commercial publishers could do so. Fifteen years later, although HighWire is still a significant player in the industry, it can hardly be said to be driving the field. But it is clear that it achieved its initial objectives. Grabbing the opportunity of a discontinuity allowed HighWire to define how journals would appear on the Web. The use of HTML rather than PDF, moving pay walls, IP address access control, two-way links between citing and cited papers, and many other features we now take for granted were first introduced by HighWire Press. Note that the key to HighWire's success is that it did something different from the other players; that's easy to do if you are on the leading edge of a discontinuity.

“Agencies and the research community together need to create the digital equivalent of libraries; institutions that can take responsibility for preserving digital data ... The university research libraries are obvious candidates to assume this role. But whoever takes it on, data preservation will require robust, long-term funding.”

Here again Stanford has relevant experience to relate. Not just that HighWire has been self-sustaining for most of its 15 year history, but also that the LOCKSS program, which started twelve years ago, has been economically self-sufficient for the last six. We operate solely on the fees paid by libraries in the LOCKSS alliance, and on contract work. We have no foundation or other grants, and no subsidy from Stanford. We would be interested to learn of any other digital preservation program capable of making similar claims.

The value that good venture capitalists like Marc Andreessen bring to startups isn't just money; it is a depth of experience with the problems that the startups will run into. In this talk I'll be playing the VC, trying to draw lessons from our experience about the problems that lie ahead for digital libraries. I think there are three take-away lessons:

First, the publishing business libraries are trying to get into is not a good business and, even if it was, trying to compete by doing the same thing the existing players have been doing for the last 15 years is not likely to succeed.

Second, the business that libraries are being urged to get into is not like the existing publishing business, and the differences are not good news.

Third, technological and economic discontinuities are about to sweep over the business of publishing for scholars, and this provides a HighWire-like opportunity to seize the initiative and define the future in beneficial ways.

I'd also like to re-frame the discussion somewhat, from being about the difference between articles and data, to being about the difference between static information and dynamic information. Less about what we are preserving and more about how preserved information is accessed. Less about HTML and other formats, and more about HTTP and other protocols. The reason is that static information is a degenerate case of dynamic information; a system designed for dynamic information can easily handle static information. The converse isn't true.

Berkeley's Center for the Study of Higher Education just published a magisterial report on the future of scholarly communication.

It is based on a large number of interviews, primarily with faculty but also with other experts in related fields such as librarians. It concludes that change will be very slow if it happens at all, primarily due to the stranglehold tenure committees have on scholars' careers. They predict that the major commercial publishers will continue to dominate the industry, siphoning off the vast bulk of the value created by the vast bulk of scholars. And that the forms that scholarly communications take, journal articles, monographs and books, will remain static.

In the years up to 2008 a similar set of interviews of eminent finance experts would also have produced a report suggesting little change in prospect, spiced as the Berkeley report is with a few dissenting views. It is entirely possible that the report is right; I have a long track record of being right about what is going to happen and wildly wrong about how fast it is going to happen. And the scholarly world is remarkably and justifiably resistant to change.

Nevertheless, in this talk I am going to make the case that the report is wrong; that sudden revolutionary change is not merely possible but plausible. I'll be using Elsevier as the canonical example of a commercial publisher. This is actually a tribute to them. Over the years that we have both cooperated and competed with them, it has become clear that their dominant position is due to the very high quality of their management, marketing and technology.

I admit that predicting the imminent downfall of the big publishers also has a poor track record. But revolutionary change caused by technological discontinuity is not unprecedented. It typically happens some time after the new technology appears.

Once upon a time the music industry looked secure in its ability, fueled by its stranglehold on artists' careers, to control the distribution of music and siphon off the vast bulk of the value created by the vast bulk of artists. Now, they are fighting for their very existence, suing their customers because those customers feel no compunction in violating the industry's intellectual property rights. The customers have concluded that the rights-holders have violated the social compact underlying these rights, abusing them to rip off the actual creators of value. And technology has provided the artists and their customers with the tools to bypass the record companies.

Academic publishers have already undermined their moral position in a way the record companies haven't. They have conceded that access to the literature is a social good, and they provide it to developing countries free or on a sliding scale of discounts via initiatives such as Hinari for the biomedical literature. Pharmaceutical companies are in a similar bind, desperately fighting "drug importation" bills in Congress.

Note that what precipitated the panic in the record industry and led to them suing their customers in droves was not that they started losing money, it was simply that the rate at which they were increasing the amount of money they made decreased. Or look at the crisis in the newspaper industry. It isn't that newspapers stopped making money. They have just proved incapable of making the astonishing amounts of money that, during the last two booms, their proprietors expected them to make. Crises of the kind I am talking about are caused by failure to fulfill inflated expectations; they strike long before the failing business model starts to lose actual cash.

"Audiences are at once fragmenting into niches and consolidating around blockbusters. Of course, media consumption has not risen much over the years, so something must be losing out. That something is the almost but not quite popular content that occupies the middle ground between blockbusters and niches. The stuff that people used to watch or listen to largely because there was little else on is increasingly being ignored."

It concluded that the pundits who predicted that the world of the Web would be dominated by a small number of blockbuster media hits, and those like Chris Anderson who predicted that it would be dominated by the long tail of immense numbers of small, niche products, were both right. The studios that produce blockbusters like "Avatar" and the "Harry Potter" franchise are doing better than ever. The huge numbers of small players in the long tail are doing better than ever. The impact has been felt in the middle.

When we started Nvidia, Jen-Hsun Huang drew this fascinating graph. It related to the graphics chip market in those days, which we were eventually to conquer. Most technology markets have a similar graph for most of their history. The X-axis is unit price of the product, and the Y-axis is the total number of margin dollars generated by products sold at that price for all suppliers in the market:

At the left, low-price end of the graph are the mass-market consumer graphics chips sold in huge volumes to motherboard manufacturers. These have wafer-thin margins, so despite their huge volumes they generate few of the total margin dollars. Think the Honda Fit in cars.

At the right, high-price end of the graph are the extremely expensive high-end chips, sold in small numbers on add-in cards to dedicated gamers. Each sale generates a lot of margin dollars, but there aren't many sales so their contribution to the total is small. Think Lamborghini.

The part of the market that generated the bulk of the margin dollars was the Goldilocks zone, the segment just below the gonzo gamer chips. It generates a lot more sales than the gonzo chips with good but not outstanding margins. Think the zone between the Porsche Boxster and the Mazda Miata.

In the past, scholarly publishers understood this graph. The journals that provided the bulk of the bottom line were the upper part of the mid range, successful enough to generate good sales and relatively cheap to run. The few blockbusters such as "Nature" were nice, but you could make a good living without one.

Then the big publishers discovered that, like Microsoft, they could inflate their bottom line by bundling.

“Libraries face a choice between Science Direct E-Selects (a single journal title purchased in electronic format only on a multiple password basis rather than a site licence), or a Science Direct Standard or Complete package (potentially a somewhat more expensive option but with the virtue of a site licence).”

Assoc. of Subscription Agents, 6 July 2007

Instead of subscribing to the individual journals they wanted, libraries could subscribe to everything that Elsevier, say, published. Over time, the price of the individual subscription was raised to encourage buying the "big deal". And the big deal acquired multi-year discounts that locked libraries in to committing the bulk of their subscriptions budget for years to come. This froze out smaller publishers and reinforced the market dominance of the big publishers.

In the world of the big deal, having a few blockbusters was essential. What drove the librarians to buy the big deal was access to the blockbusters, and the sheer number of journals in the deal. In the world of the big deal, the publishers were motivated to pour resources into the blockbusters, and to pump up the number of journals by starting vast numbers of cheap, junk journals. In doing so, they corrupted the "peer-reviewed" brand. With so many low-quality journals to be kept fed with papers, pretty much anything an author wrote could get published; it would simply find a level far enough down the hierarchy at which the reviewing was sloppy enough to accept it.

So, just like other genres of publishing, the economic value in journals has migrated from the mid-range to the extremes.

What Wall St. demands of companies such as the commercial publishers is continual growth in earnings per share. This can be delivered in one of three ways:

Increased sales, i.e. more money coming in.

Improved margins, i.e. less of the money that comes in going out.

Reduced numbers of shares, i.e. spending some of the money that came in and didn't go out on buying back shares from investors.
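The three levers listed above can be illustrated with a minimal sketch. It assumes the simplest possible model, earnings per share = sales × margin ÷ shares; the function name and all the figures are hypothetical, chosen only to show that each lever moves the same ratio, and are not any publisher's actual numbers.

```python
def earnings_per_share(sales, margin, shares):
    """Profit (sales * margin) divided by the number of shares outstanding."""
    return sales * margin / shares


# Hypothetical baseline company; the numbers are illustrative only.
baseline = earnings_per_share(sales=1000.0, margin=0.30, shares=100.0)

levers = {
    # More money coming in.
    "increased sales": earnings_per_share(1100.0, 0.30, 100.0),
    # Less of the money that comes in going out.
    "improved margins": earnings_per_share(1000.0, 0.33, 100.0),
    # Buying back shares, so the same profit is spread over fewer of them.
    "reduced shares": earnings_per_share(1000.0, 0.30, 100.0 / 1.1),
}

for name, value in levers.items():
    growth = value / baseline - 1.0
    print(f"{name}: EPS {value:.2f} ({growth:.0%} growth)")
```

In this toy model each lever is sized to deliver the same growth in earnings per share, which is why a company that cannot pull one of them has to pull harder on the others.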

As the company gets bigger, the number of dollars in each of these categories needed to have the same effect on the market's perception of the stock's value gets bigger too.

"The big minus is that our performance advantage has shrunk dramatically as our size has grown, an unpleasant trend that is certain to continue. ... huge sums forge their own anchor and our future advantage, if any, will be a small fraction of our historical edge."

Warren Buffett's "Letter to Shareholders" Feb 2010

Increased sales means either more customers paying the same amount, or the same customers paying more. When a company dominates the market the way the big publishers do, there are no new customers to find. So they need the existing customers to pay more. If your existing customers are in the straits that libraries are right now, their ability to pay more is zero. In fact, libraries are going to be paying less. They have started to cancel the big deals, and even individual subscriptions, in favor of pay-per-view.

Ironically, this process is being accelerated by the one e-journal archive that is dominating the market. Almost all publishers deposit their e-journal content into Portico. Libraries subscribe to Portico to obtain post-cancellation access, although in practice most publishers prefer to provide post-cancellation access from their own website.

But in addition libraries get four user accounts to use to "audit" the system. Apparently, these accounts give access to Portico's entire contents, not just the content to which the library used to subscribe. So long as they don't use it too much, their Portico subscription gives them access to every e-journal ever published at no additional cost. The combination of this, informal trading in copies, and pay-per-view creates a situation I believe the publishers will come to regret.

If the number of dollars coming in is going to decrease, how about reducing the number of dollars going out? There is a long history of technology companies faced with disruptive technology needing to reduce expenditures to maintain the bottom line. They are almost always unable to cut enough cost fast enough to do the trick. This is because the expectation of a certain level of margins gets baked in to the structure of the company in very fundamental ways.

I was at Sun for a long time in the early days. When I joined it was a small startup struggling to defeat Apollo. I thought it was being run very cheaply, but I had no idea. Sometime after Sun's IPO, when the company was growing very fast and had acquired a large and highly effective sales force, I visited Dell in its early stages. It was a revelation in cheapness. Mismatched second-hand furniture, industrial space on the outskirts of Austin. Dell's build-to-order model was fundamentally more efficient than Sun's; it didn't have a sales force, it didn't have to finance inventory, it didn't have to predict sales.

Just as with the New York Times, although everyone at Sun knew the solution, they couldn't bring themselves to endure the pain of implementing it. The result is clear: Sun no longer exists and Michael Dell is a billionaire.

In 2008 I was on the jury for the Elsevier Grand Challenge, a competition Elsevier ran with a generous prize for the best idea of what could be done with access to their entire database of articles. This got a remarkable amount of attention from some senior managers. Why did they sponsor the competition? They understood that, over time, their ability to charge simply for access to the raw text of scholarly articles will go away. Their idea was to evolve to charging for services based on their database instead. The contest was to generate ideas for such services. The result reinforced my conclusion that what people would pay for these services comes nowhere close to replacing the subscription income.

Reflect, the winning entry, is very impressive but it still shows that although the services that have been identified do add some value to the text, they don't add a lot. Reflect enhances the words in the display of a paper with all sorts of relevant information based on the context in which the words appear. For example, it knows about gene, protein and small molecule names, and provides them with pop-ups containing sequence (for proteins) or 2D structure (for small molecules). This is useful, but when I asked my friend who is a molecular biochemist to try it he was underwhelmed - he already knew all the information in the pop-ups. Services delivering value to non-specialists and students won't command premium prices.

More importantly, the advantage Elsevier has in delivering these services based on their database comes from the difficulty potential competitors have in delivering the same service, because they don't have access to the underlying text. But in a world where access to the text is free, Elsevier's advantage disappears and with it their ability to charge premium prices for the service. Reflect illustrates this too. Precisely because it is a Web service, you can apply it to any web page. For example, you can apply it to this page from the New England Journal of Medicine by copying and pasting its URL into this page. NEJM is not an Elsevier journal; it is on-line from HighWire.

I know it is hard for academics to believe that Elsevier is in trouble but last November the board of Reed Elsevier, the parent company, fired its CEO after only 8 months in the job. That simply doesn't happen to companies that are doing well. Industry comment regarded Elsevier's academic publishing as the only part of the company worth keeping.

"With ever-broadening competition for the core content licensing services of LexisNexis, ... Reed Elsevier looks increasingly like a company with one fairly stable boat and three heavy anchors failing to find a bottom."

and

"[O]rganic growth requires ... focusing far more aggressively on its Elsevier division. Elsevier is not without its own challenges – scientific publishing faces strong pushback from corporate and academic libraries that find it increasingly hard to afford the full range of journals"

But even the industry commentators admit that this is a risky bet. What they mean is that, to meet Wall St. expectations, Elsevier needs to "focus more aggressively" on extracting enough extra money from academic publishing, that is from teaching and research budgets, to cover for the failures of its other divisions. How much money are we talking about here? Elsevier makes $1B/yr in profit from academic publishing. To make Wall St. happy they'd need, every year for the next few years, to extract at least an additional $200M, say, from teaching and research. This is not going to happen.

There's another problem. Libraries sign multi-year "big deals" which limit how much the price can rise during the deal. Say these deals last 5 years. That means Elsevier can't raise much of the 20% increase it needs this year from 80% of its customers. The 20% increase has to come from the 20% who are re-negotiating this year. Those subscribers are therefore looking at 100% increases. This is not going to happen.
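The arithmetic behind that 100% figure can be written out as a small sketch. It assumes equal-sized customers whose deals are evenly staggered, so that one in every `deal_years` renews each year; the function name and the `locked_rise` parameter are hypothetical, added only to show what happens if the locked-in customers can be raised a little rather than not at all.

```python
def required_renewal_increase(target_increase, deal_years, locked_rise=0.0):
    """Price rise needed from the customers renegotiating this year.

    Assumes equal-sized customers, with 1/deal_years of them renewing each
    year and the rest limited to `locked_rise` by their multi-year deals.
    """
    renewing = 1.0 / deal_years  # share of customers up for renewal
    locked = 1.0 - renewing      # share still inside a multi-year deal
    return (target_increase - locked * locked_rise) / renewing


# The case in the text: 20% needed overall, 5-year deals, locked prices flat.
print(f"{required_renewal_increase(0.20, 5):.0%}")

# Hypothetical variant: locked-in customers can still be raised 5% a year.
print(f"{required_renewal_increase(0.20, 5, locked_rise=0.05):.0%}")
```

Even in the hypothetical variant where the locked-in 80% absorb a 5% rise, the renewing customers would still face an increase of roughly 80%.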

"UC Libraries are confronting an impending crisis in providing access to journals from the Nature Publishing Group ... [NPG] has insisted on increasing the price of our license for Nature and its affiliated journals by 400% beginning in 2011, which would raise our cost for their 67 journals by well over $1 million dollars per year."

"In the past 6 years, UC authors have contributed ~5300 articles to these journals, 638 of them in ... Nature. ... an analysis by CDL suggests that UC articles published in Nature alone have contributed at least $19 million dollars in revenue to NPG over the past 6 years."

Letter to Faculty, CDL, 4 June 2010

Although the commercial publishers broadly dominate the field, in some areas not-for-profit publishers are highly influential. Theresa Velden & Carl Lagoze's report on communication among chemists illustrates this; the American Chemical Society has a dominant position.

"The level of salaries and bonuses that ACS officials receive has raised the eyebrows of some academic chemists ... ACS defended its position by referring to the large membership, the $420M annual revenue, and the $1B in assets ... if one acknowledges ACS Publishing and CAS as being massive businesses, then the remuneration would seem to be in line with that size of business."

Not-for-profit publishers might seem to be immune from the discipline Wall Street imposes, but market forces are still at play. Management of these immense businesses will be just as reluctant as managers at for-profit publishers to see the bottom line decrease, and with it their lavish remuneration.

Note that at least $150M/yr of ACS' revenue is extracted from academic research budgets. It can be argued that some of this pays for the editing and peer review processes. But is the value these processes add worth the $150M/yr in foregone research? It cannot be argued that any of Elsevier's $1B/yr in profit pays for editing and peer review. That's paid for before the profit. Is the value that the editing and peer review adds to the content enough to compensate not just for its cost but the additional research that could be done with $1B/yr?

Once people start asking what value they are getting from publishers, the publishers are on a very slippery slope.

It is interesting that the discipline most attuned to the effects of technology, computer science, has essentially completed the transition that, I believe, most fields will undergo at some time in the future. Usenix, a Berkeley-based not-for-profit society, is an example in this area. They used to publish journals available only to members and institutional subscribers. But over time they have evolved to running workshops; a whole lot of workshops. These workshops have supplanted journals as the major form of communication in computer science. The papers for a workshop are available to registrants beforehand, and become open access immediately afterwards. The society funds itself by charging for attendance at the workshops; membership is buried in this fee.

The key point here is that workshop attendance is much less vulnerable to technological competition than reading journals. The same thing is happening in the music industry; increasingly concerts are where the money is. But there is no chance that switching its business model from publishing journals to running conferences will generate "organic growth" for a company Elsevier's size; very much the reverse.

The business of simply publishing the static text of scholarly articles, monographs and books and charging someone for access to the text is going away. The value publishers add to the text no longer justifies prices that can sustain the publisher's legacy organizations. Publishers are desperately looking for added-value services based on the text that would sustain their margins, but even Elsevier hasn't found them yet. There are fundamental reasons why they aren't likely to find them. No-one gives up $1B/yr in profit easily. Expect the struggle to preserve the old ways to become very nasty, but it will ultimately be futile.

Just as Marc Andreessen predicts for newspapers, I predict publishing for scholars is going to end up being a much, much smaller business than it is now. If, like Marc, libraries are investing in a shrinking business they, like Marc, need to be doing something radically different, not simply copying the existing players.

Libraries are also on a slippery slope. Compare them to Starbucks:

The library is marginally better as a place to study.

The hours at Starbucks are better.

The coffee and food at Starbucks are better, but the bandwidth probably isn't as good.

Access to subscription content at the library might be better, depending on the extent to which the university provides off-campus access, or uses pay-per-view.

Of course, Starbucks could easily afford to offer subscription content to its customers too. Vicky Reich created this fake Starbucks web page advertising subscription content access four years ago; here is another librarian posting the same idea to his blog last March. And here is Starbucks announcing it for real, starting July 1st.

I don't have to belabor the point for this audience that libraries are suffering extreme economic stress. The December-January issue of Against The Grain is all about how libraries are responding by substituting pay-per-view (PPV) for subscriptions.

The disaggregation of PPV will have effects on the journals world similar to the effects of iTunes' disaggregation on music:

A fixed price per song eliminates the publisher's ability to charge premium prices for premium content, but a variable price deters purchases.

Listeners' ability to pay for only the tracks they want reveals that most of the bundle was stuff no-one wanted to pay for anyway.

The song price has to be set so low that there is no incentive for listeners to go find a free download of the song.

If the song price is set low enough to discourage free downloads, the industry has to adjust to radically lower margins.

iTunes' use of DRM only succeeded because it was implemented in an extremely attractive hardware media player. All attempts to use DRM to raise song prices not tied to this one standout hardware platform have failed. It doesn't look like the Kindle (or even the iPad) will be the text equivalent of the iPod.

Libraries implementing PPV have two unattractive choices:

Hide the cost of access from readers. This replicates the subscription model but leads to overuse and loss of budget control.

Make the cost of access visible to readers. This causes severe administrative burdens, discourages use of the materials, and places a premium on readers finding the free versions of content.

Placing a premium on finding the open access copy is something publishers should wish to avoid. Now, many publishers permit scholars to post their work on their own websites or in institutional repositories. They correctly expect that, in most fields, authors won't bother to do so. After all, they have free access to the publisher's copy via their library's subscription, and so do their colleagues. What difference does it make? If PPV spreads with visible pay gates, the difference it makes becomes immediately obvious, and authors will be motivated to make their work open access. This should (but probably won't) motivate libraries that do choose PPV to deploy visible pay gates; the investment in dealing with the hassles will be repaid over time in more open access content.

"More than 80% use the search engine to find academic papers; close to 60% use it to get information about scientific discoveries or other scientists' research programmes; and one-third use it to find science-policy and funding news ... 'The findings are very typical of most countries in the world,' says David Bousfield, ... 'Google and Google Scholar have become indispensable tools for scientists.'"

Jane Qiu "A Land Without Google", Nature 463, 1012-1013, 24 Feb 2010

This makes it pretty much irrelevant whether a paper is accessed from the journal publisher, from an institutional or subject repository, or from the author's own website. The brand, or tag, of the journal that published it still matters as a proxy for quality, but the journal isn't essential for access.

How effective is the combination of Google and Open Access? In 2008, a team from Finland estimated that 20% of all papers published in 2006 were available in some open access form via a Google search. That is before PPV, deposit mandates and institutional repositories have their impact.

“We ... estimate that the total number of articles published in 2006 by 23,750 journals was approximately 1,350,000. ... it was also possible to estimate the number of articles which are ... (gold OA). This share turned out to be 4.6% ... at least a further 3.5% was available after an embargo period of usually one year, bringing the total share of gold OA to 8.1%. ... we also tried to estimate the proportion of the articles ... available as ... (green OA). ... For 11.3% a usable copy was found [via Google]. [Thus] we estimate that 19.4% of the total yearly output can be accessed freely.”

Bjork et al. “Global annual volume of peer reviewed scholarly articles and the share available via different Open Access options”, ELPUB2008
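The shares quoted above add up as follows. This is only a sketch of the arithmetic using the figures in the quotation; the variable names are mine, and the final article count is my own derived approximation rather than a number reported in the paper.

```python
# Figures taken from the quotation above (articles published in 2006).
total_articles = 1_350_000
gold_immediate_pct = 4.6   # open access at publication (gold OA)
gold_delayed_pct = 3.5     # open after an embargo, usually one year
green_pct = 11.3           # usable copy found elsewhere (green OA)

gold_total_pct = gold_immediate_pct + gold_delayed_pct
free_total_pct = gold_total_pct + green_pct

# Derived estimate, not a figure from the paper: articles freely accessible.
free_articles = total_articles * free_total_pct / 100

print(f"gold OA: {gold_total_pct:.1f}%")
print(f"freely accessible: {free_total_pct:.1f}%")
print(f"roughly {round(free_articles, -2):,.0f} articles")
```

On these figures, roughly a quarter of a million of the articles published in 2006 could be reached without a subscription.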

Last month, after 10 years there, my step-daughter Corinne graduated with a Ph.D. in Mechanical Engineering from U.C. Berkeley. She left behind a very traditional thesis. It discusses the problem of "Life Cycle Analysis"; how companies can evaluate the total environmental impact of their operations, including both their entire supply chain and the total life of their products through final disposal. The key to doing this is collating and combining a large number of data sources from across the Web, reporting things like the proportion of energy from coal in the local electricity supply, or the capacity and fuel consumption of trucks, at locations around the world. Her thesis shows how this data can be combined with financial and other information about a company to estimate its environmental footprints in terms of carbon, water, pollution, and so on.

Like a true Silicon Valley native, she is doing a startup. Stanford's endowment would have had stock in the company, but all Berkeley has is the thesis. Part of what the company needs to do is to build the web service which collects and collates the data as it changes, reflecting new power stations being built in China and wind farms in Germany, or old trucks being replaced by new in Brazil. The data isn't static, it is dynamic. I've read her thesis carefully and I'm sure replicating her work simply on the basis of the description there is too much work to be practical, so the thesis isn't that useful to scholars. What scholars need to build on her work is a web service answering queries about the collated data. Berkeley should have had the infrastructure in place so that Corinne left this web service behind her.

Assuming the service were successful, it would have been more of an asset for the University and other scholars than a thesis. If it wasn't successful, the cost would be minimal.

Preserving the service Corinne hypothetically left behind is problematic. Pre-Web documents and early Web content were static, and thus relatively easy to preserve. This is what the early Web looked like - the BBC News front page of 1st December 1998 from the Internet Archive. The evolution of the Web was towards dynamic content. This started with advertisements. No-one preserves the ads. The reason is that they are dynamic; every visitor to the page sees a different set of ads. What would it mean to preserve the ads?

Now the Web has become a world of services, typically standing on other services, and so on ad infinitum, like the service Corinne's startup needs. Everyone talks about data on the Web. But what is out there is not just data, it is data wrapped in services of varying complexity, from simple, deliver me this comma-separated list, to complex, such as Google Earth.

Here is the real-estate site Zillow. It is a mash-up of other services, like Bing and MLS and local government property registers. Note that the ads are an essential part of the site.

What would it mean for an archive to preserve Zillow? Could it preserve the experience a user has with the system? Scraping the web site freezes it in both time and space, as being one day last March and a small part of Palo Alto. I'll bet no-one but me saw that combination. Preserving just my view of the service is pretty pointless.

Could the archive preserve the internals of the Zillow service? If Zillow would agree, it could. But Zillow is a thin layer on top of other services that Zillow doesn't own. Zillow could point to those services in turn, and it would turn out that in turn they depend on services they don't own, and so on.

We know how to build these services. We don't know any technical or legal means of copying them to an archive in such a way that the result is useful.

A question that seems almost too obvious to ask is: "why did the Web have the impact that it did?" What was it that the Web provided that had been missing before? My answer is that the Web wrapped content in a network service, the service of providing access to it. Absent HTTP, the network service that provides access, HTML would have been just another document format. The key was to encapsulate content in this and other formats inside Web servers, providing the network service of accessing them.

Because preservation has focused on the content (the HTML), and not on the service (the HTTP), it hasn't been either effective or cost-effective. It hasn't been effective because it hasn't preserved important parts of the user experience provided by the service, such as links and dynamic interactions that work, and it hasn't been cost-effective because it has spent money to reduce the value of the content by, for example, breaking the links to it and lowering its search engine ranking.

When what scholars wanted to publish was "papers", that is static articles, monographs and books, the problems caused by preserving only the content and not the service could be ignored. But this is just an artifact of the new medium of the Web starting out by emulating paper, the old medium it was supplanting.

A decade and a half into the growth of the Web it is becoming hard to ignore the preservation problems caused by not preserving the service. We see that even with PDF, the Portable Document Format, the format most closely identified with the old medium, paper. PDF has a variant, PDF/A, geared to archival use. PDF/A forbids the use of essentially all of the newer, dynamic features of the full PDF specification. For example, it forbids the use of PDF's capabilities for embedding interactive 3D models into documents. This is a feature that even the conservative chemists surveyed by Velden and Lagoze love, because it allows them to publish interactive 3D models of molecules.

Less conservative fields than chemistry are already making use of the dynamic, interconnected world of web services to publish their work in a way that others can re-use directly rather than by re-implementing it. A wonderful example is myExperiment.org, work led by David de Roure of Southampton and Carol Goble of Manchester Universities.

Scientific workflows encapsulate web services like databases, simulations and computations so that experiments "in silico" can be built up from them. Scientists build them, often by assembling them partly from other workflows other scientists have created. The execution of a compound workflow involves multiple services across the Web; the scientist invoking the execution is unlikely to know all of the servers involved. Workflows are made freely available, and are the ultimate example of easy re-use of others' work. The problems come in the longer term, with their stability and permanence.

Here is an example workflow uploaded to myExperiment last March. It queries three different Web services and processes the results. It could itself be a service accessed by other workflows.

What scholars are going to want to publish are dynamic services, not static content whether it be papers or data. The entire communication model we have is based on the idea that what is being communicated is static. That is the assumption behind features of the current system including copyright, peer review, archiving and many others.

We are in danger of making the same mistake with data as it grows that we made with the Web as it grew - paying attention to the content instead of the service and so building publishing and archiving infrastructure that is oriented to static rather than dynamic content.

Turning to archives, we now have several years' experience operating several digital archives at scale. The lessons can be summarized as:

Operating an archive at scale is expensive.

Ingesting content is what costs the money.

Except for a few essentially dark archives with legal mandates (such as government & state archives) the only thing that funders view as justifying the cost is access to the content now.

These considerations mean that light archives are like technology markets; they have increasing returns to scale and are subject to market capture. Think about it. Is anyone going to build a new archive to compete with the Internet Archive? And that means that no matter how much people like to talk about the risks of putting all their eggs in one basket, in practice that is precisely what they will do.

Another worrying problem is that funders don't see value in content that is accessible from the archive but also from its original publisher. In the Web context, archives add no value to open access content until it disappears, when it is too late to add value. The Internet Archive has built up a critical mass of vanished open access content, but only because of its non-selective collection policy. Selecting and preserving high-value open-access content means you're choosing stuff that is likely not to go away, so not likely to enhance the value of your archive.

The converse of that problem happened to the LOCKSS system. We designed it to address the problem that moving Web content somewhere else reduces its value by breaking the links to it. Because the LOCKSS daemon was designed to act as a proxy from the original publisher web site, not as another publisher, the preserved content remained accessible at its original URL and the links continued to work.

The LOCKSS system was transparent to readers, it just made the content much more durable. The problem was that, precisely because it was transparent, it was not visibly providing any value. In fact, if it was installed and configured as originally designed, you had to be quite a Web expert to detect that it was doing anything at all. This made it hard to persuade libraries to pay for it. In order to get paid, we had to add the capability for LOCKSS to act as an alternate publisher. This made access to the preserved content not transparent, and added complexity to the user experience.

Memento, a recent proposal that uses the content negotiation capabilities of HTTP to allow Web sites to direct browsers to preserved versions of themselves, is important in this context. Here's a demo using the Memento plugin for Firefox. It's transparent; using the slider I can view all the preserved versions of the page without worrying about where they came from. Unless I look carefully in the navigation bar I won't even know where they came from. This is definitely better than the LOCKSS proxy technique, and we will implement it. But it shares the problem of being a double-edged sword. Precisely because the technique is so transparent, it is the web site that gets credit for providing access to past versions, not the archives that are holding them.
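
As a rough illustration of what "content negotiation in the time dimension" means, here is a minimal Python sketch that builds the request header a client might send to ask for a past version of a page. The header name `Accept-Datetime` is the one used in the Memento specification as it was later standardized (RFC 7089); the 2010 proposal may have differed in detail. The function name `memento_headers` is made up for this example, and nothing here contacts a real server.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def memento_headers(when):
    """Build headers asking for the version of a page closest to `when`.

    `when` must be a timezone-aware datetime. The header name follows the
    Memento specification (RFC 7089); the function itself is hypothetical.
    """
    # HTTP dates are expressed in GMT, so normalise to UTC first.
    as_utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(as_utc, usegmt=True)}


if __name__ == "__main__":
    wanted = datetime(2010, 6, 24, tzinfo=timezone.utc)
    print(memento_headers(wanted))
    # {'Accept-Datetime': 'Thu, 24 Jun 2010 00:00:00 GMT'}
```

A client would send this header with an ordinary GET for the page's original URL, or to a service that negotiates on the site's behalf, and be redirected to the preserved copy nearest that date. The reader never needs to know which archive holds it, which is exactly the transparency discussed above.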

This dilemma is a big problem for archive business models; to get funding they need to be perceived as delivering value now. But the only way to create that perception is to get into the reader's face in ways that reduce the value of the archived content.

Ingesting the content from the original publisher into the archive and preparing it for access costs money, so access from an archive is always going to cost more than access from the original publisher. And in fact we see that this is the case in both the ways that archives are supposed to deliver access to preserved journals:

As regards post-cancellation access, despite the rash of cancellations, only three institutions are receiving post-cancellation access from Portico. When it comes to the crunch, most publishers prefer to deliver it from their own system. That way they continue to capture the hits.

As regards preserving the record of scholarship, the journals that have been rescued from oblivion by preservation systems needed rescue precisely because their value was so low that no-one would pay to access them.

The fundamental problem here is that the process of copying content to an archive and republishing it there both costs money, and acts to reduce the value of the content in many ways, for example by breaking the links to it and reducing the ranking search engines will give it. And given that the content accessed from the archive will be lower value to start with, reducing its value has a disproportionately large impact.

The numbers that are emerging for the cost of data archiving support this case.

"the cost of archiving activities (archival storage and preservation planning and actions) is consistently a very small proportion of the overall costs and significantly lower than the costs of acquisition/ingest or access activities"

They suggest that the dominant cost is ingest; a broad definition of ingest cost that includes persuading data holders to let you do it is over half the total. This is not surprising; the big cost in archiving journals is ingest too.

In both cases this means that most of the cost of preservation is paid up-front. There are two reasons this is a big problem for preservation business models. The time value of money makes it important to delay costs, not to front-load them. And you have to pay the full cost for every preserved item regardless of whether it will eventually get used or not. Most will not get used. Finding ways to delay costs until preserved information is accessed is important.
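
The two effects can be seen in a small Python sketch. Every number in it (item count, per-item cost, fraction ever accessed, delay, discount rate) is hypothetical and chosen only to show the shape of the argument; the function names are invented for this example.

```python
def present_value(cost, years_from_now, discount_rate):
    """Value today of a cost that is paid `years_from_now` years in the future."""
    return cost / (1 + discount_rate) ** years_from_now


def upfront_total(n_items, cost_per_item):
    """Front-loaded model: pay the full cost for every item at ingest time."""
    return n_items * cost_per_item


def deferred_total(n_items, cost_per_item, fraction_used,
                   years_until_access, discount_rate):
    """Deferred model: pay only for items that are accessed, and pay later."""
    items_used = n_items * fraction_used
    return items_used * present_value(cost_per_item, years_until_access,
                                      discount_rate)


if __name__ == "__main__":
    # Hypothetical figures: 1000 items at 10 units each, 10% of them ever
    # accessed, on average 5 years after creation, with a 5% discount rate.
    print(upfront_total(1000, 10.0))
    print(deferred_total(1000, 10.0, 0.10, 5, 0.05))
```

With these made-up figures the deferred model costs less than a tenth of the front-loaded one. Most of the saving comes from not paying for items nobody uses; discounting the delayed payments adds to it.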

Even if we focus on the data rather than the service, we can say that if journal publishers and archives are in trouble then data publishers and archives have some serious business model problems.

Here is the business model that people are assuming. Scholars accumulate data then transfer it, doing their side of the transfer for free and not paying the archive to do the other side, to an archive which ingests it, adds metadata, and provides open access to it indefinitely. This model spends most of its money up-front doing things that reduce the value of the data. The legal basis for the things it does is shaky at best. There's no existing cash flow that the archives can tax to pay for their activities, so they're dependent on grant funding. Funders don't see any value in paying to preserve stuff that is open access until it goes away, by which time it's too late. Apart from mandates from the funding agencies, there is no incentive for scholars to cooperate.

This environment is ripe for Potemkin Village data archives, which offer loss-leader pricing in the hope of capturing the market then using their monopoly to raise prices enough to actually implement the service they have been selling all along. When the house of cards collapses it'll be too late. Alternatively, it is subject to capture by some well-funded company who can see the long-term advantage of owning the world's data.

If we look at the services instead of just the data, we know which company that will be. Although it's hard to tell, according to Marc Andreessen Amazon has better than 90% market share in "the cloud". We use Amazon's services, you use them, everyone uses them. Unless something dramatic happens, scholars who want to publish services wrapped around their, or other people's, data will take the path of least resistance and use Amazon's services. Miss a credit card payment, your data and service are history. Worse, do we really want to end up with Amazon owning the world's science and culture?

So we see that all three intermediaries in scholarly communication are in trouble:

The publishing world is splitting into the blockbusters and the long tail; the area in the middle is not viable. The customers can't afford to keep the blockbuster publishers in the style to which they have become accustomed. Wall St. will punish any failure to maintain their lavish lifestyle severely.

Libraries' only role in the electronic space is to sign the checks for the publishers. They can't afford to keep doing this. To the extent to which they respond by switching to pay-per-view they will speed both their irrelevance and the demise of the publishers' business models.

Archives have discovered that while everyone pays lip service to long-term preservation, the only thing people will pay for is access to the content now.

The fundamental change caused by the Web is that none of the intermediaries are any longer necessary to the process of communication from author to reader. Authors can post their thoughts on their own websites and the infrastructure of the Web will allow readers to find and access them without involving any of the traditional intermediaries. Thus they can no longer justify their traditional costs.

The bottom line is that there is no longer enough money to keep the current system of scholarly communication running as it has been. Just as with the banks, the experts may all agree that the system is basically sound, but at some point after the customers stop paying the numbers fail to add up and radical change happens. This is the kind of economic discontinuity VCs love.

Will we get a technological discontinuity to go along with it? Through much of the history of computing there's been a debate pitting the advocates of centralized against distributed systems. At the lower levels of the system architecture, the last decade and a half have decisively resolved the debate in favor of distribution. Does anyone here remember Alta Vista? It was the first real Web search engine, and the last to be built on a few large powerful computers. It was decisively defeated by Google, built from a large number of generic PCs. It may surprise you to know that Alta Vista is still around; you know it as Yahoo! Search and it is now based on the same distributed architecture as Google.

It simply isn't possible any longer to build or run systems the size we need with a centralized architecture. You can't beat the parallelism, configuration flexibility and fault tolerance that come from plugging components together with IP networks, and the cost advantages of building them from large numbers of identical low-cost consumer components. So we know the architecture of the future infrastructure for communicating between and preserving the output of scholars will be distributed.

The question now is how far up the system hierarchy the underlying distributed architecture is visible, and how this relates to the organizational structure in which the system exists. Contrast two services, both provided by the same organization using the same distributed computing infrastructure:

Google Search is an enormously valuable service delivered with minimal controversy because the content on which it is based is distributed. A competitor, such as Bing, does not have to negotiate with Google for access to the content and can thus compete on its merits.

Google Books is an enormously valuable service delivered with maximal controversy because the content on which it is based is centralized. Google quite naturally wants to be the only supplier of access to the content which it has paid to digitize, but the effect is to prevent the emergence of competitors and to allow Google to exact monopoly terms from authors and readers.

Since the architecture of the system delivering the service is going to be distributed, a centralized service is favored only for organizational and business reasons. The overwhelming reason is to retain ownership of the content in order to control access to it. Imagine an alternative Google Books, where the content was held at the libraries that contributed it. Technically, the system would work just the same, organizationally it would be more difficult but far less controversial.

One of the papers that gained a well-deserved place in last year's SOSP workshop described FAWN, the Fast Array of Wimpy Nodes. It described a system consisting of a large number of very cheap, low-power nodes each containing some flash memory and the kind of system-on-a-chip found in consumer products like home routers. It compared a network of these to the PC-and-disk based systems companies like Google and Facebook currently use. The FAWN network could answer the same kinds of queries that current systems do at the same speed while consuming two orders of magnitude less power. You very rarely see an engineering result two orders of magnitude better than the state of the art. I expect that the power reductions the use of FAWN-like systems provides will drive their rapid adoption in data centers. After all, eliminating half the 3-year cost of ownership is a big enough deal to be disruptive.

"FAWN couples low-power embedded CPUs to small amounts of local flash storage, and balances computation and I/O capabilities to enable efficient, massively parallel access to data. ... FAWN clusters can handle roughly 350 key-value queries per Joule of energy – two orders of magnitude more than a disk-based system"

This is a table that helps explain what is going on. It plots the time it would take to read the entire content of a state-of-the-art disk against the year. It makes it clear that, although disks have been getting bigger rapidly, they haven't been getting correspondingly faster. There's a fundamental reason for this - the data rate depends on the inverse of the diameter of a bit, but the capacity depends on the inverse of the area of a bit.
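
The scaling argument can be checked with a few lines of Python. This is a simplified sketch that assumes rotation speed and platter geometry stay fixed, so capacity scales with the inverse square of the bit diameter and data rate with its inverse. The starting figures are hypothetical round numbers, not measurements of any real drive.

```python
def scale_disk(capacity_bytes, rate_bytes_per_s, shrink):
    """Scale a disk when each bit's diameter shrinks by the factor `shrink`.

    shrink=2.0 means bits half as wide. Capacity depends on bit area,
    data rate only on how many bits pass under the head per second.
    """
    new_capacity = capacity_bytes * shrink ** 2  # inverse of bit area
    new_rate = rate_bytes_per_s * shrink         # inverse of bit diameter
    return new_capacity, new_rate


def full_read_seconds(capacity_bytes, rate_bytes_per_s):
    """Time to read the entire contents of the disk once."""
    return capacity_bytes / rate_bytes_per_s


if __name__ == "__main__":
    capacity, rate = 1e12, 1e8            # hypothetical 1 TB disk at 100 MB/s
    print(full_read_seconds(capacity, rate))        # 10000.0
    capacity2, rate2 = scale_disk(capacity, rate, shrink=2.0)
    print(full_read_seconds(capacity2, rate2))      # 20000.0
```

Halving the bit diameter quadruples the capacity but only doubles the data rate, so the time to read the whole disk doubles. Under these assumptions every generation of bigger disks takes longer to read from end to end.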

The reason that FAWN-like systems can out-perform traditional PCs with conventional hard disks is that the bandwidth between the data and the CPU is so high and the amount of data per CPU so small that it can all be examined in a very short time.

There is another reason to expect radical change. Ever since Clayton Christensen published The Innovator's Dilemma it has been common knowledge that disk drive cost per byte halves every two years. In fact, what has been happening for some time is that the capacity at constant cost doubles every two years, which isn't quite the same thing. As long as this exponential continues, the economics of digital preservation look good. If you can afford to keep data for say 10 years, a simplistic analysis says you can afford to keep it indefinitely.
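
The simplistic analysis is just a geometric series, sketched here in Python. It assumes the cost of storing a fixed amount of data halves every two-year period forever and ignores every other cost; the unit cost of 1.0 and the function names are placeholders for illustration.

```python
def storage_cost(first_period_cost, periods, decay=0.5):
    """Total cost of storing data for `periods` two-year periods.

    Each period costs `decay` times the previous one, so the total is
    the geometric series c * (1 + decay + decay**2 + ...).
    """
    return first_period_cost * (1 - decay ** periods) / (1 - decay)


def perpetual_cost(first_period_cost, decay=0.5):
    """Limit of the same series as the number of periods goes to infinity."""
    return first_period_cost / (1 - decay)


if __name__ == "__main__":
    ten_years = storage_cost(1.0, periods=5)   # five two-year periods
    forever = perpetual_cost(1.0)
    print(ten_years, forever, ten_years / forever)
    # 1.9375 2.0 0.96875
```

Ten years of storage already accounts for about 97% of the cost of storing the data forever, which is why the exponential makes the economics look so good. The whole result rests on the halving continuing.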

Alas, exponential curves can be deceiving. Moore's Law has continued to deliver smaller and smaller transistors. However, a few years ago it effectively ceased delivering faster and faster CPU clock rates. It turned out that, from a business perspective, there were more important things to spend the extra transistors on than making a single CPU faster. Like putting multiple CPUs on a chip.

Something similar is about to happen to hard disks (PDF). The technology (called HAMR) is in place to deliver the 2013 disk generation, i.e. a consumer 3.5" drive holding 8TB. But the business case for building it is weak. Laptops and now netbooks are destroying the market for the desktop boxes that 3.5" drives go into. And very few consumers fill up the 2009 2TB disk generation, so what value does having an 8TB drive add? Let alone the problem of how to back up an 8TB drive on your desk! What is likely to happen, indeed is already happening, is that the consumer market will transition rather quickly to 2.5" drives. This will eliminate the high-capacity $100 3.5" drive, since it will no longer be produced in consumer quantities. Consumers will still buy $100 drives, but they will be 2.5" and have perhaps 1/3 the capacity. The $/byte curve will at best flatten, and more likely go up for a while. The problem this poses is that large-scale disk farms are currently built from consumer 3.5" drives. The existing players in the cloud market have bet heavily on the exponential cost decrease continuing; if they're wrong it will be disruptive.

FAWN-like systems, built from very cheap chips, with no moving parts, taking up very little space and power, and communicating with the outside world only through carefully specified network protocols, are interesting for publishing and preserving scholarly content expressed as services. They encapsulate the service and its data in a durable form. They use only TCP/IP and Ethernet to communicate to the outside world; these are the most stable interfaces we know for reasons that Metcalfe's Law explains. And they are inherently easy to virtualize, breaking the link between the survival of the physical medium and of the service.

The reason paper is still the medium of choice for archival storage is that it survives benign neglect. Fifteen years ago Jeff Rothenberg drew attention to the fact that, up till now, hardware and software have been very high maintenance. But now it looks like the mainstream platform for delivering services will switch to something that will tolerate benign neglect very well.

There are concerns about the use of flash memory in these systems. Flash has many characteristics that are attractive for archival use, including physical robustness, extremely high density, and zero power for data maintenance. The two obvious concerns are the limited write lifetime and the high cost of flash.

The limited write lifetime of flash should not be an issue for archival use, since archival data is write-once. The question is, what is known about long-term data retention in flash memory? The answer is, surprisingly little. Even the issue of a suitable definition and measurement technique for the "unrecoverable bit error rate" (UBER) is the subject of research.

The research suggests but does not yet prove that if flash memories are written and read at appropriate intervals for archival use they will retain data with very high reliability for a long time, probably many decades. However, it is important to note that it is only in recent years that detailed studies of disk failures in field use have become available, and they show that theoretical predictions of data reliability for disks were wildly optimistic (because the failure modes are very complex). We will need to wait for similar studies of the reliability of large-scale flash memory in field use.

It is true that flash memory is still much, much more expensive per byte than magnetic media, disk and tape. But detailed studies of the costs of long-term storage at the San Diego Supercomputer Center show that the media represents only about 1/3 of the total cost. Power and cooling is another big contributor, as is the need to replace the hardware frequently. FAWN-like systems will last much longer than disk and will need much less power and cooling while they are doing it. The cost penalty for flash is much less than at first appears; because the running costs are such a large proportion of the total, paying more up front to buy hardware that reduces them is sensible.
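
A back-of-the-envelope Python sketch shows how the trade-off works. All the prices, service lives and running costs below are invented for illustration and are not taken from the San Diego study; the disk figures are merely chosen so that hardware comes to one third of the total, matching the proportion quoted above.

```python
import math


def total_cost(hardware_cost, service_life_years, annual_running_cost,
               horizon_years):
    """Cost of keeping data for `horizon_years`.

    Hardware is bought again each time it reaches the end of its service
    life; power, cooling and similar costs are paid every year.
    """
    purchases = math.ceil(horizon_years / service_life_years)
    return purchases * hardware_cost + annual_running_cost * horizon_years


if __name__ == "__main__":
    # Hypothetical disk system: cheap hardware, replaced every 4 years,
    # expensive to power and cool.
    disk = total_cost(hardware_cost=100, service_life_years=4,
                      annual_running_cost=50, horizon_years=12)
    # Hypothetical flash system: 4x the purchase price, but it lasts the
    # whole 12 years and costs far less to run.
    flash = total_cost(hardware_cost=400, service_life_years=12,
                       annual_running_cost=10, horizon_years=12)
    print(disk, flash)   # 900 520
```

In this made-up case the system that costs four times as much to buy is still much cheaper over twelve years, because running costs and replacements dominate the total.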

The fact that flash has been so successful in so many uses despite the cost premium has spurred innovation. In the near future there will be a wave of new non-volatile memory technologies striving to undercut flash, with names like Phase Change Memory and Memristors. If (and it's a big if - anyone remember bubble memory?) one of these technologies displaces flash it will do so by being a lot cheaper per byte. Using the winning technology in the low-cost, low-power FAWN-like architecture will probably undercut even tape while providing much lower access latency.

Have I made the case for the possibility of radical change?

Even in the absence of radical technological change, all three of the intermediaries (large publishers, libraries and archives) on which the conservative case rests have very shaky business models.

Radical technological change in both the mechanism of communication, and the forms in which scholars want to communicate, is at least plausible.

If I'm right, and radical change in scholarly communication is in prospect, what can we do to shape that change in beneficial ways?

I already pointed to HighWire Press as an example of the scholars seizing an opportunity and changing the market. Another is the sequencing of the human genome. Craig Venter and Celera hoped to extract vast profits from controlling the human genome database. At the urging of scholars, the Wellcome Trust backed an open competitor, the Sanger Centre. They succeeded in keeping the scientific data free for scholars to use. Of course, few fields have the resources of the Wellcome Trust behind them. But few fields have data as expensive to generate as the first human genome sequence.

"... the Sanger Centre and its sister labs moved fast enough to ... deny Celera the chance of any meaningful monopoly. ... Subscribing to Celera's database became an option, not a necessity; and not enough people took the option up. Having lost ... $750M, the company took a hard look and declared that it was shifting out of the genome database business. ... Because of what the Wellcome Trust and the Sanger Center did, the history of the world is permanently altered." Francis Spufford "Backroom Boys: The Secret Return of the British Boffin", 2003

Many fields have data and services that large institutions, whether for-profit or not-for-profit, would like to control so as to capture their value. Ithaka's attempt to control the database of academic papers with JSTOR and Portico is an example of this phenomenon.

Universities have an interest in both making the output of their scholars accessible, and in preserving it for the long term. They should not want to hand the output of their scholars to others so that they can profit (or not-for-profit) from it.

Various proposals are floating around for Universities to club together to rent cloud computing or cloud storage to scholars. Trying to compete with Amazon in this will fail because (a) Amazon will be cheaper and (b) Amazon will be better at marketing to the scholars, who in the end make the decision as to where to deploy their service. Worse, when it comes to services rather than one-off computations, it is self-defeating because the problem is rental itself, not the price of the rental. Scholars should not be publishing anything on rented infrastructure, in case they stop paying the rent, or on infrastructure they themselves own and maintain, in case they stop doing so. But they have to be persuaded, not just mandated, to do the right thing.

The only way for Universities to win a competition with Amazon for scholars' business is to provide the service to the scholars free. Amazon can't compete with free. But just to be sure, it would be good to impose special overhead charges on payments to cloud providers, and to discourage scholars from running their own service platforms by throttling incoming connections that don't go to the University service platform. Net neutrality doesn't apply to Universities as ISPs.

The cost of providing the platform will be higher than buying the equivalent service from Amazon, but it won't be a lot higher especially if Universities collaborate to provide it.

What Universities get for the extra cost is the permanence they need. The permanence comes from the fact that the University already has its hands on the data and the services in which it is wrapped, instantiated in highly robust and preservable hardware. Thus, no ingest costs and very low preservation costs. With the model of Amazon and a separate archiving service, as well as paying Amazon, Universities have to pay the archiving service, and pay the ingest costs. When these extra costs are taken into account, because the ingest costs dominate, it is likely that Amazon would be more expensive.

But there is a technological discontinuity coming, with the FAWN-like architecture. This is a discontinuity for two reasons:

These nodes do not run conventional application programs because they are too small, at the most have very stripped-down operating systems, and in many cases (though not FAWN) are not Intel architecture but use ARM or other embedded processors.

Their programming model is different from the current model of application programming; it is massively parallel and fault-tolerant. Scaling up to Web service size can't be done with a programming model that excludes the possibility of failures in the infrastructure - this is another lesson from Google.

These changes break the current "rent-a-PC" model of simple cloud computing. This discontinuity is a HighWire-like opportunity. Much of the software is already in place with things like Hadoop.

It may be the case that change in scholarly communication will come slowly, if at all. Powerful interests will try to ensure that. However, I don't think a survey of scholars is good evidence for this. It assumes that the system responds to the desires of scholars rather than the business models of large publishers. The business models of publishers, libraries and archives are in trouble, in ways that are analogous to the problems of other media businesses. The technology underlying these businesses is about to change. If radical change happens, it provides an opportunity for Universities to act to improve the system.

The current business models both for journals and data publishing, curation and archiving are too expensive, too front-loaded, and too dependent on grants paying for things funders don't value to be successful. We need three things:

A radical cost reduction. The biggest cost is ingest, so that is what needs to be eliminated.

A radical delay of cost from creation and ingest time to access time.

A radical marketing strategy. It needs to compete with Amazon, so the only one that works is free.
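One way to picture the second of these requirements is a store that does nothing expensive at deposit time and performs the costly work only when an item is first read. This is a sketch of the general idea under my own assumptions, not a description of any existing archive; `LazyArchive` and `describe` are hypothetical names, and `describe` is a trivial stand-in for whatever curation work would really be needed.

```python
class LazyArchive:
    """Keeps deposits cheap by deferring processing until first access."""

    def __init__(self, expensive_processing):
        self._process = expensive_processing
        self._raw = {}          # bytes exactly as deposited
        self._processed = {}    # results of processing, filled on demand
        self.processing_runs = 0

    def deposit(self, key, raw_bytes):
        # Creation-time cost: store the bytes and nothing more.
        self._raw[key] = raw_bytes

    def access(self, key):
        # Access-time cost: process once, then reuse the cached result.
        if key not in self._processed:
            self._processed[key] = self._process(self._raw[key])
            self.processing_runs += 1
        return self._processed[key]


def describe(raw_bytes):
    """Hypothetical stand-in for costly curation work."""
    return {
        "size": len(raw_bytes),
        "preview": raw_bytes[:8].decode("ascii", "replace"),
    }


archive = LazyArchive(describe)
archive.deposit("a", b"hello world")
archive.deposit("b", b"never read")

first = archive.access("a")
second = archive.access("a")
print(first, archive.processing_runs)
```

Item "b" is never read, so no processing cost is ever incurred for it. Content that nobody accesses costs only its storage.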

10 comments:

Kevin Smith has a thoughtful commentary on this post at his blog. His thoughts on the effect of dynamic content on peer review, which I omitted to discuss in the talk, are similar to those I discussed in this blog's first post.

Andrew Orlowski at The Register examines Dell's recent settlement of the SEC's charges of fraudulent accounting between 2001 and 2006. This forms a fascinating postscript to the contrast I drew in this talk between Dell's and Sun's business models in the early 90s. Note that in the later 90s, as Dell grew, it suffered exactly the problem described by the Warren Buffett quote; it tried alternate channels such as Dell shops and found them all less efficient than its original vision. But, as the settlement shows, Dell eventually found an even better business model - extorting money from Intel.

In preparing for this talk I somehow missed Michael Nielsen making the same points even better than I did, and a year earlier. And I had even made a note to blog about it last September, so it wasn't like I hadn't paid attention.

I'm not the only one pointing out that scholarly publishers are in deep trouble. Rick Anderson makes the same case, and comes up with four strategies for them to pursue. Two of them (accept no growth, somehow generate more growth) won't work. Nor, as I pointed out in the talk, will the idea of switching to services. The idea of marketing around the libraries, also suggested by Joseph Esposito, deserves consideration. But it has the same probably fatal flaw as Pay-Per-View with a visible pay gate, in that it makes the reason for authors to post their own work to their own websites hard to ignore. Expecting the people who are supplying you with content to pay you to read it seems risky.

And speaking of switching to services, Elsevier continues to move slowly in that direction. But does anyone think these small steps will deliver "organic growth" to a corporation the size of Reed Elsevier in the time-frame Wall St. needs?

In the talk I reported C-MU research showing that FAWN systems could match conventional systems' performance on key/value queries. Yale researchers are now describing techniques by which these systems could match the performance of conventional database systems. This makes this architecture even more attractive.

Yet more evidence that the scope for publishers to create "organic growth" in the University market is limited comes from Ian Ayres, pointing out that the push-back against extortionate textbook prices has some new legal teeth.

Now even Arthur Sulzberger Jr. has agreed with Marc Andreessen that the New York Times has to turn off the presses. He just doesn't want to do it now. In fact he doesn't want to do it until it is too late.

In that post Henry Blodget estimates that the NYT web site currently brings in about $150M/yr and that the newsroom costs about $200M/yr. If a paywall contributed another $100M/yr, they could afford to spend $100M/yr on the newsroom. So a big restructuring is in prospect.

It would be good if this looming downsizing of the newsroom kept the actual journalists and got rid of the stenographers who simply enable their anonymous sources to avoid accountability for their propaganda (see Miller, Judith).

The coming downsizing of the publishing industry isn't limited to academic publishing and newspapers. Via Slashdot, here is the sound of approaching doom for book publishers. John Locke self-publishes 99-cent crime novels for the Kindle. He currently holds top spot in the Amazon top 100 and has 6 books in the top 40. He gets 35c per download and is selling at the rate of over $1800/day. Who needs a publisher? Kevin Kelly agrees that 99-cent book downloads are the future.