Computational Complexity and other fun stuff in math and computer science from Lance Fortnow and Bill Gasarch

Wednesday, September 17, 2008

Opening up the ACM Digital Library

(Guest Post by Kamal Jain)

Opening Up the ACM Digital Library: An Alternate Method of Payment for the ACM Portal

The primary objective of the ACM digital library (ACM DL) is to make ACM
scientific content accessible. It is currently funded by various methods
including subscription fees and some commercial deals, such as referral
business. The subscription fee hinders broad access to the content.
I've been thinking about how we can make the portal freely available.
If the ACM DL is free and open, our scientific research will make more of
a contribution to society and human well-being, the first moral imperative
listed in the
ACM Code of Ethics.
Consider the contribution of Wikipedia to our society based on its being free and open.

Whether we can achieve an open ACM portal and open up other scientific content
is a question within the subject of
Creative capitalism.
It is a complex subject, and one could perhaps write a
dissertation on it with a chapter on free access to social content, i.e.,
content whose primary goal is to benefit society.
I have a
longer post
on this topic, focusing on the opportunity to open up the ACM portal.
The paper avoids technical complexity and is easy to follow.

Feel free to give feedback in the comment section of this blog. If you prefer,
you may also email me (firstname_lastinitial @ microsoft.com).

27 comments:

Reading your argument, my main objection is that you fail to consider the content dynamics in the "new Internet age".

With a large fraction of the authors posting their articles for free on the Web, how does the value of the ACM content evolve? There is a large body of content that is not accessible elsewhere (except libraries, of course). But the fraction of this legacy content is constantly decreasing. On the other hand, they keep adding content (new papers), but the fraction of people who post papers online also increases.

So I am fairly convinced that the ACM content is dropping in value over time. On the other hand, ACM has a large, relatively fixed cost from the bureaucracy. (Even if ACM is "our" academic society, we must not associate with it too closely and fail to see its faults.)

Thus, one might conclude that the business model is doomed to fail, due to fundamentals.

A particular way in which this failure can happen is if the search engine always lists the link to the author's website higher, to avoid paying for the ACM content. Then the cost per view becomes unreasonable, and the search engine drops the deal. This is quite a different Nash equilibrium than the one you envision, where every search engine wants to make the deal.

Moving away a bit, no matter what we do with the legacy content, we have to make sure that all new content is freely available, so that 20 years from now we won't have this kind of dilemma. I would propose a centralized Wiki-style site where everybody can post papers or links to papers.

MIP, thanks for the feedback. Whether the ACM portal is valuable or not is an orthogonal discussion, and I agree it is an important one. At the moment, there are reasons to believe that ACM will add value for the foreseeable future by providing quality, reliability, and longevity for the content. Individuals are not good at maintaining libraries, and the total cost that individuals pay to maintain their own libraries is perhaps far more than the centralized cost of maintaining a single library, which we know will stand the test of time and quality. If the papers are freely available, then authors could post a link on their webpages to the official copy of the paper at ACM, and therefore even this loophole can be easily closed. (As I said, I am avoiding technicalities in the paper.)

Regarding a centralized wiki-style site, I know people, including myself, who have discussed the need for it, its benefits, and its cost. One primary problem is that even such a site requires money, both for development and for implementation. Wikipedia collects donations, but not every site can compete for donations. In other words, one can say that ACM is already a centralized site, which you argue we can make cheaper and extend with wiki-like tools and features. Of course, if we can decrease the cost and improve the functionality, then that's always an obvious thing to do. The assumption I am making is that all such efficiency gains are also pursued.

So aside from the durability / implementation / distributed cost concerns of a public wiki (which are valid concerns, obviously), I don't understand how you address my other critique.

Is your solution to convince people to voluntarily post a link to ACM instead of their own version? You would need this to keep the value of ACM's assets high, and allow it to negotiate with the search engines.

I agree with you that hitting the ACM login page from home is horribly annoying, but already most people get most papers from Google, not from ACM. arXiv is but one of the many challenges to the ACM journals, reducing their assets.

I thought I had answered your question, at least implicitly. Explicitly, there are multiple answers, as follows:

1. In essence, what you are saying is that authors' websites compete with the ACM portal, the AAAI website, or MathSciNet. I agree, and isn't that a good thing? But these portals still provide enough value that some people and institutions pay for them, even in the presence of the free content you point out. If we allow an alternate payment method for these portals, won't we increase their value proposition? And why should we restrict the value proposition of these portals to fortunate people only?

2. There is network value in a central compilation. You get a single version. Isn't that the reason we go to a search engine in the first place, even when we could type amazon.com directly in the address bar? Authors maintaining their own little libraries is an inferior and expensive (in effort) solution. Some authors pay this effort cost because they want to make their content free and are forced to invest their time. Other authors may not have the time or the will to incur this effort, yet they too would like to provide their content for free. If the content is reliably accessible from one place, then it is not only less effort for the author but also better quality for the readers.

3. Authors would prefer to link to a free, externally managed copy because that is more convenient and satisfies their desire to make their work easily accessible. Currently authors are required to display/embed a copyright notice. The new equilibrium will be better for the authors, readers, and the community; e.g., it may require either a link to the official copy or a usage token embedded in the local copy.

4. Modernization of the scientific portal, e.g., including wiki features, requires both money and structure. For instance, in the old world we have references only to past work. In the internet age we could easily have references to future work; e.g., when a conjecture is settled, we want a reference to appear automatically in the original paper. I have spent a considerable amount of my own and other people's time on conceiving a collaboration portal, but the buck stops at the need for money.

5. Tuomas Sandholm reportedly gave the following opinion: “AAAI (and other scientific societies) are expert at providing high-quality peer review and organizing energetic scientific communities of people producing content and receiving reviews, feedback, and certification by the society. The people providing web access, e.g., via search engines (or other portals) are expert at providing the access.” In essence, what I understand from Tuomas's opinion is that it is good for us if we let the experts do the work. We do not grow our own food, do we?

"ACM adds value for foreseeable future by providing the quality, reliability, and longevity of the content."

Quality? The quality of ACM content is provided by community volunteers (authors, editors, referees, program committees, etc.), not by the publishing arm of ACM. The same research community freely provides the same quality controls to ACM, IEEE, SIAM, Springer, Elsevier, etc.

Reliability and longevity? ACM provides reliable and permanent electronic storage (at least until the next time their business model changes). But I'm sure Google or Yahoo or Microsoft would be willing to donate the (for them) minimal storage, bandwidth, and computational resources required to maintain the Digital Library. Other organizations like wikipedia, arxiv.org, and archive.org use massive amounts of storage, paid entirely by donations and/or grants, almost certainly at lower cost than ACM's bureaucracy.

What ACM does provide is a common *interface* (in the form of some really crappy latex style files, some really crappy submission web pages, and a good but overworked publishing subcontractor) for authors to publish their community-filtered research. They execute the social contract by which the papers at their conferences are polished and distributed. That same social contract could instead be executed by volunteers, as it is for many open-access journals already, with minimal additional cost.

I'm not disagreeing with your main point: We need new economic models that support reliable, consistent, wide, and free distribution of our work, whether that's through existing publishers or not. But let's not exaggerate the publishers' contributions.

Jeffe, the ACM portal is just a proxy for every scientific portal out there. There is a lot of high-quality “social” content behind login walls, e.g., MathSciNet.

Whether ACM is bureaucratic or not is a moot point, and the current ACM officers could provide their insights on it. Doing the best possible job of making scientific content freely accessible requires financial resources.

The ACM DL provides a service well beyond the content itself. The references are disambiguated and there are consistent bibliographic entries for material that ACM itself does not publish. (Compare the quality of this metadata to that of Citeseer, for example.) The cost of putting the raw data online is small relative to the cost of producing the metadata. If you couldn't convince ACM to make both free, then you could imagine an intermediate model of making the raw data free and charging for the high-quality metadata.

On the other hand, I end up consulting DBLP much more than the DL for the definitive information about referenced papers, so maybe there isn't enough incentive for people to cover ACM's costs if the base content is free.

I have an alternate proposal, which is a slight modification of your option 1, the exclusive contract.

Instead of the search engine paying for the content, it provides the same service that the ACM DL currently provides, which is, at a high level, organizing and indexing the content. This way we, the scientific community, generate the content and provide it exclusively to the search engine, which in return organizes it (to our satisfaction) and provides free access. This makes sense because this is what search engines do: organize information. In fact, Google already does this to some extent with Google Scholar. Google Books is another interesting example. With it, Google has shown that it is willing to provide an extra service (scanning books) in order to be able to index the content. Will a search engine be willing to provide publishing services (for standardized looks, etc.) in order to be able to index premium content?

There are some concerns with this. One, as you said, is exclusivity. But there is a bigger concern, which is lock-in. What if the search engine degrades its service over time, or does not keep up its part of the deal? Then we might have to pay huge switching costs.

A bit tangential, but my experience with ACM is that it is far from a wasteful bureaucracy: they do a great deal with limited resources, providing more value for less cost than comparable institutions (read: IEEE). So IMO it would be hard to replicate in a substantially more efficient form as a wiki, etc.

Unfortunately, as I described in the paper, direct advertisement revenue is not sufficient for ACM DL or other scientific publications.

If you treat the search engine and the content (i.e., the rest of the web) as a complementary pair of services, and advertisement revenue as payment for these services, then coincidentally the current revenue share between these two components agrees with an old economics theorem, which states that (under simplifying assumptions) the two components in a pair of complementary services get 50% of the revenue each. To my knowledge, about half of internet advertisement revenue goes to a handful of search engines and the other half to the rest of the web. The theorem also states that this is often not the best outcome for the users of the pair of complementary services.

One way to get around this theorem is for the two components in a pair of complementary services to write up a contract. That is essentially what I am proposing. Search and content complement each other, and we as users benefit if the revenue we generate is distributed optimally (optimized from the user's point of view). That way we could ensure the continued growth of open web content.

- I wonder how much ad revenue would increase if the content were free, since both the number of users and the value per user would increase. Not enough to pay for everything, based on the paper -- but it might be a significant amount that would be usefully combined with market segmentation.

- Market segmentation. Offer two services, Better ($$$) and Worse (free). Institutions will pay for the better service but everyone still gets some access. An example is JSTOR's moving wall which in effect provides access for articles that are more than N years old, where N depends on the journal. This example is imperfect because JSTOR is not free so it's really Better ($$$) and Worse ($), but maybe it could be free with ads (JSTOR doesn't have ads as far as I can see). Perhaps with a carefully designed solution, very few institutions would stop subscribing to the Better service, while at the same time, ad revenue would increase. But are there better ways to segment the market? Maybe in the Worse service, there is a cover sheet on the paper including some ads.

- Users who are more focused might be less likely to be distracted by ads, as the paper suggests, but that also seems like an opportunity. Ads could be targeted better to those focused interests, e.g. by advertising books on relevant technical topics via Amazon's referral program. I searched for "algorithms" and "routing" on the DL -- two topics that certainly have books available -- and the Google ads are software-related but otherwise irrelevant. Also, technical users' eyeballs are likely more desirable since the humans attached to them may have more money to spend.

- Lower costs instead of higher revenue. You said you've advocated wiki features but the implementation cost is too high. But one could also use a wiki to lower costs rather than just add features, by having the wiki users, rather than ACM employees, catalogue/index/cross-reference the papers. But I have no idea what fraction of the ACM's costs this would avoid. And it may be a pretty radical departure from the current system.

Brighten, non-search ad revenue is (roughly) 2 orders of magnitude lower than search ad revenue. Roughly speaking, the ad revenue earned on all the search pages is about the same as that earned on the rest of the web pages, but people spend a lot of their time on the web.

Unless ACM gets very aggressive with ads, like putting ads in the papers themselves, my guess (which could be wrong) is that ACM won't raise enough money without significantly cutting costs.

For the time being, assume that ACM content is free. Wouldn't you think that it would increase the number of searches, say on Google Scholar? Wouldn't that mean that people who use Google Scholar would also use more of Google, where the advertisement revenue is 2 orders of magnitude higher?

Why couldn't we rightfully ask the search engines to share the incremental revenue they get from us?

In my library analogy, if you have more books in a library, then that increases the use of the library index. If the index builder is making money in proportion to the usage, then normal economic arguments suggest that the library should get a share of that money and use it to improve the quality of its book services (e.g., providing a larger number of free books!).

I understand your reasoning regarding getting search engines to share the wealth. It's a very interesting idea. It has some disadvantages too, so I was trying to think of alternatives.

In those alternatives, I was not suggesting that ACM rely only on ad revenue. For the market segmentation approach, the question I suppose would be: if you add to the current first-class (paid) service a second-class (free) service, does the additional ad revenue from the free service offset the lost revenue from the (hopefully few) subscribers who choose to drop the first-class service?

Secondarily, this approach could be augmented by (1) increasing ad revenue with ads that are more targeted than standard Google ads, like ads for CS reference books and textbooks; (2) lowering costs by leveraging a wiki.

I'm not sure whether the alternatives are better than the approach you suggest of pulling money from search engines. It would be interesting if you have an opinion on this.

Dave, that's a nice innovation from Yahoo Search. It shows that when there is competition in the market, the status quo is never sufficient. I wish Google would also start doing similar things, that is, provide more on top of its status quo value.

Even though the economics of entertainment content and academic content are different, the inspiration could be the same. Search engines hypocritically claim that they send traffic to other websites, whereas in many cases it is the other way round. If Google News or Live News removes NYT from its index, then in the short term it will hurt NYT more than the search engine, but eventually, when people find out that NYT is not being indexed by a search engine, they will use alternate methods to reach NYT. So if a website has desirable content, then the website is actually sending traffic to the search engines. ACM DL is no different. The only problem is that ACM DL is so small compared to the big search engines that they do not mind taking the traffic for free. But on the flip side, if a small favor is done in the reverse direction, the search engines have developed methods to get monetary credit for these small favors (the long tail of advertisements). In an ideal world, the favors should be free either in both directions or in neither. In the existing setup the favors are free only in one direction.

Brighten, I agree that it may be possible to cut costs by decreasing ACM DL services and using more volunteer hours. But the lowest cost and lowest level of service may not be the optimal tradeoff point on the quality vs. price curve. So that's not what I am proposing.

Consider the following hypothetical example. Suppose I come up with a magic technology so that Microsoft could produce the same amount of software with 20% fewer engineers. It does not mean that Microsoft should keep its existing agenda and save 20% on engineering expenses. What Microsoft's investors would want is for Microsoft to take on at least 25% more projects and provide even more software to the world. The extra 25% (1/0.8 - 1) is a point on the current trade-off curve, but since a Microsoft that is 20% more efficient faces lower capital risk, it would make sense to increase the total engineering resources and do more than 25% additional activity.
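The 25% figure in this example can be checked with a small sketch. The function name is made up for illustration, and the sketch assumes that output scales linearly with headcount, which is a simplification and not something the example itself states.

```python
def extra_output_at_same_headcount(engineer_savings: float) -> float:
    """Fractional increase in output when headcount is held constant.

    engineer_savings is the fraction of engineers no longer needed to
    produce the same output (0.2 for the 20% in the example).
    Assumption for this sketch: output scales linearly with headcount.
    """
    if not 0 <= engineer_savings < 1:
        raise ValueError("engineer_savings must be in [0, 1)")
    # Each engineer now produces 1 / (1 - savings) times as much.
    return 1 / (1 - engineer_savings) - 1


# The 20% case from the example: 1/0.8 - 1.
print(f"{extra_output_at_same_headcount(0.2):.0%}")
```

Keeping the same headcount gives 1/0.8 - 1 = 25% more output; the argument above is that the efficiency gain justifies going beyond that point.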

Microsoft measures its output in terms of revenue. A public service measures its output in other ways, i.e., by how much it contributes to public life, but the basic math and economics should remain the same. If a public service becomes more efficient, then its capital allocation should increase. In other words, if you ignore the business models and optimize for the social welfare of the world, a more efficient system (under simplified assumptions) would get a higher allocation.

So I am not against making ACM DL more efficient per dollar. But if it becomes more efficient, that does not necessarily mean that it spends less; it may also mean that it broadens its agenda. After all, the internet opens new opportunities for everyone with every kind of agenda.

Hi Kamal, I agree with your point that cutting costs can be orthogonal to generating more revenue.

But what about the market segmentation approach? This approach doesn't cut costs. In fact, it would increase costs slightly (e.g. paying for bandwidth to serve more users).

It also doesn't decrease ACM DL services; it would add a new (free) service in addition to the current (paid) service. I guess you could look at it in a different way, that the additional free service is inferior to the paid service. The key seems to be to design the mechanism such that the decrease in service is small (to promote open access to scientific work), but at the same time, most people who buy the paid service still do so even with the free service available.

One possibility would be the Moving Wall approach which I mentioned above. Actually, wait, we are computer scientists so we should call it a Sliding Window. ;-) Here the two market segments are (1) those that want to ensure access to all articles as soon as they are available, and (2) those that are willing to wait a few months or years.
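As a rough sketch of how such a sliding window could be checked for a single article: the function name and the 365-day window below are hypothetical, since the actual window length N would be chosen per journal by the publisher.

```python
from datetime import date


def is_free_access(published: date, today: date, window_days: int) -> bool:
    """Return True once an article has aged past the paid window.

    window_days is the hypothetical length N of the sliding window:
    articles newer than this are served only to subscribers.
    """
    return (today - published).days >= window_days


# Hypothetical one-year window, evaluated on the date of this post.
today = date(2008, 9, 17)
print(is_free_access(date(2007, 1, 1), today, window_days=365))  # older article
print(is_free_access(date(2008, 6, 1), today, window_days=365))  # recent article
```

With these made-up dates the first article falls outside the window and would be free, while the second is still inside it and would require a subscription.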

I'm not sure Sliding Window is the optimal approach, but it seems workable in terms of cost/revenue, and also a strict improvement over the present situation. Moreover, I believe this model actually has been used in a different way for CS research: corporations sponsor university research and get early access to results that may be useful to the company; but eventually the results are published and everyone has free access.

Another possibility that a lot of web sites use is segmenting users based on how much they value their own time. In the paid service you get your papers instantly. In the free service you can get any paper you want, but you have to wait 30 seconds to download it. I know there are free file storage sites that do this.

Or here's a strange idea: the site is free after business hours.

I could go on, but coming up with reasonable ideas probably requires knowledge of where ACM DL gets most of its revenue from. University subscriptions? Corporate research lab subscriptions? Individual articles being purchased for $10 or $20 or whatever the fee is?

Brighten: I also feel the lack of knowledge about ACM's cash flow. I tried to find ACM's financial statement, and I am not sure if it is available publicly. It is a public institution, so it should put its accounts online. We should be able to see ACM's cash flow. Transparency in ACM's cash flow is even more important than the agenda of this blog post.

First, as a mathematical theorem, we should realize that a market segmentation which creates a free service decreases the overall revenue. Market segmentation creates more revenue by gaining X low-paying customers at the expense of cannibalizing Y high-paying customers, such that the X customers generate more money than the Y customers did. If the X customers generate zero, then as a mathematical fact this implies a decrease in revenue.
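The accounting behind this claim can be written out as a short sketch. All names and numbers here are made up for illustration; no actual ACM subscriber counts or prices are known in this discussion.

```python
def segmentation_revenue_change(
    new_low_payers: int,
    revenue_per_low_payer: float,
    lost_high_payers: int,
    revenue_per_high_payer: float,
) -> float:
    """Net change in revenue after adding a lower tier of service.

    new_low_payers is X, the customers gained by the new tier.
    lost_high_payers is Y, the customers who drop the expensive tier.
    """
    gained = new_low_payers * revenue_per_low_payer
    lost = lost_high_payers * revenue_per_high_payer
    return gained - lost


# Hypothetical numbers: a free tier (zero revenue per new user) that
# attracts 10,000 readers and costs 50 subscriptions at 100 each.
print(segmentation_revenue_change(10_000, 0.0, 50, 100.0))  # -5000.0
```

With zero revenue per free user the change can never be positive, however large X is. Any per-user ad revenue would enter through revenue_per_low_payer, which is where the break-even question raised earlier in the thread comes in.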

So it does require ACM to cut its costs and possibly show ads to the increased traffic of readers. I am fine with that, as long as ACM can cut costs without affecting the quality of its service. Since I do not have ACM's cash-flow statements, I can't say for sure that ACM has bureaucratic costs to cut without cutting its level of service.

Regarding your specific market segmentations, I think these are sound. The sliding window is used by publishers of entertainment content. For example, a movie comes first to theaters and after some time comes out on DVD.

Essentially you are suggesting that we not open the entire ACM DL but open its historic archives, in effect limiting the ACM DL subscription to the recent past only. In the near term it should work. In the long term, when institutions optimize their budgets, they will find the ACM DL subscription a bit more expensive.

Ideally, I would like to make the entire ACM DL free and open, but for starters I am willing to accept the historic archives being free and open.

The recent articles I want to be free and open so that ACM DL can add additional services to them. I feel our collaboration technology is outdated. Recent articles could act as seeds for a collaboration portal. You read a recent article, some new ideas germinate in your mind, and you should be able to associate those ideas with the article so that like-minded collaborators, or those with complementary skills, can find each other. And of course you get some official credit by putting the idea on an official collaboration portal, which is easily discoverable by people interested in that particular article/theorem/conjecture/subject etc.

I like your idea of a portal that is free after business hours. Long-distance companies have used this idea successfully, partly for managing congestion but primarily for market segmentation. During the daytime people talk business, so they have higher utility for a phone call, but during offtime people chat with friends.

Similarly, during business hours, our respective institutions should be able to sponsor our activity, i.e., sort of high utility time. But it allows free offtime access. So those who are not fortunate could still access the article late in the evening or early morning. I have two minor issues.

Business hours are defined locally. Could somebody use a proxy in another timezone to misreport the time?

The second concern is whether people are willing to download articles in the evening and use them the next day. How often does the situation arise where we would like to have an article immediately?

Another segmentation is that viewing the article is free. You can't save or print it. You could simply view it during offtime. I think in such a case institutions would subscribe to the ACM DL because the employees have a natural need/excuse of keeping offtime as family time.

I see the general direction where Mark Cuban is going with that article. I mostly agree with the inspiration, though his suggested implementation is more oriented towards search engines and mine towards social welfare. For example, I neither suggest nor favor that content be pulled from a search index; on the contrary, I want Google to open its book-scanning content to other search engines too. I, and in my impression Google too, want content to be open for indexing even if it is 100% financed by a commercial entity. But in the book-scanning case, only a small fraction is financed by Google -- Google is paying only for the format conversion, from paper to electronic, but the real seed is the content itself. Still, Google keeps the content exclusive. Maybe Google's definition of a search engine is Google.

Google Books content and scientific content have a lot in common. The underlying content is mostly free. Formatting, re-formatting, logistical support, etc. are expensive. In one case, since Google as an owner keeps the content exclusive to one search engine, this is basically equivalent to an internal transfer of credit from the search engine business to the book-scanning business. If Google Print were a separate company, the Google search engine would either be paying Google Print for this exclusivity or Google Print would not be exclusive. So in essence, if two complementary businesses are owned by a single entity, then we do not necessarily see the transfer of payment, but it is there.

Google News does not have ads, and Marissa Mayer, an executive at Google, once reported that Google makes about $100 million from the Google News traffic (http://bigtech.blogs.fortune.cnn.com/2008/07/22/whats-google-news-worth-100-million/).

Externally we do not see this transfer of payment. But both Google books content and ACM DL create positive value for the respective search engines.