A central question for many people involved in Open Access is whether it can, or will, save money. Most analyses suggest that a fully OA environment is cheaper (or at worst similar in cost) for institutions (see below for the catch that every analysis that says costs will rise misses). But for research-intensive institutions in particular, taking the lead by investing in a transition to Open Access while also covering the costs of existing subscriptions could be expensive. At the same time real concerns are emerging about some traditional publishers successfully driving costs higher. How can countries and institutions invest in creating an Open Access environment that serves their needs and brings costs down without spending too much on the transition?

This issue has been getting attention recently due to a press release from the team negotiating a new contract for the Elsevier subscription for the Netherlands. The release is interesting for a whole range of reasons – which can be the subject of later posts – but here I want to focus on the financial implications of a full scale cancellation.

There are essentially two ways to realise the potential savings of an OA environment, or equally to minimise the costs of transition. The first is to negotiate with subscription publishers for a direct rebate where Open Access payments are made. The UK, with RCUK and Wellcome money backing the transition, has led on negotiations with the Royal Society of Chemistry and Institute of Physics both offering some form of direct rebate – essentially what an institution pays in APCs to a publisher gets taken off the subscription costs. This is presumably what the Netherlands was seeking to put into the Elsevier agreement.

The second approach is more radical. Cut subscriptions, rely on repositories or personal requests for access in the short term, and directly use the liberated budgets to support Open Access. The second approach gives an institution much more flexibility but obviously at the risk of reduced access and potential researcher anger. No doubt researchers across the Netherlands are currently receiving messages from Elsevier concerned about their potential loss of access and keen to make the case for continued subscriptions.

Let's look at these two options from a purely financial perspective. There are a number of problems with access to data but with some available data and educated guesses we can work through the figures.

The bottom line

The details of the calculations and sources of numbers are given below. The Netherlands publishes around 40,000 articles a year, about 10,000 of those with Elsevier. For those articles published free to read from the Netherlands today the average fee across all “pure OA” and hybrid articles is €1,087. Two different estimates suggest that of those Open Access papers with a Dutch affiliation, around 60% are billed to a Dutch address – an important proportion. The overall Dutch subscription costs are around €34M of which Elsevier likely have a share of around €7M.

If the Netherlands converted unilaterally to Open Access funded through APCs at current rates the overall cost would be €26M (40,000 x 60% x €1,087) a saving of around €8M compared to the current subscriptions of €34M. The question is how quickly could the Netherlands reduce its subscription payments?
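The arithmetic behind that headline figure can be written out as a short sketch. The numbers are the ones quoted above; the function and variable names are my own and purely illustrative.

```python
def apc_cost(articles, billed_share, average_apc):
    """Total APC bill: articles x share billed to the country x average fee."""
    return articles * billed_share * average_apc


NL_ARTICLES = 40_000        # articles per year with a Netherlands affiliation
BILLED_SHARE = 0.60         # proportion actually billed to a Dutch address
AVERAGE_APC = 1_087         # euros, average across pure OA and hybrid articles
SUBSCRIPTIONS = 34_000_000  # euros, approximate total Dutch subscription spend

oa_cost = apc_cost(NL_ARTICLES, BILLED_SHARE, AVERAGE_APC)
saving = SUBSCRIPTIONS - oa_cost

print(f"Unilateral OA cost: €{oa_cost / 1e6:.1f}M")
print(f"Saving against subscriptions: €{saving / 1e6:.1f}M")
```

The result is only as good as the three inputs, which is why the sources for each are discussed at the end of the post.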

Cancelling the Elsevier subscription would liberate enough money to cover the costs of publishing all those papers currently published with Elsevier (10,000 x 60% x €1,087 = €6.5M). But how much would it cost to publish them with Elsevier? For an average charge by Elsevier we could use the average paid for Elsevier papers from Wellcome Trust funds (€3,100). From this we can calculate how much it would cost to publish those papers OA with Elsevier (6,000 x €3,100 = €18.6M). Alternatively if we use the average paid by RCUK, which represents a set of private deals and discounts (€1,600), we end up with a figure (€9.6M) that is still higher than the roughly €7M saving. It’s likely that if the subscription were cancelled the fees would be at the high end of this range, but if the subscription were retained bigger discounts would be applied (but of course, they’d still be paying the subscription then).

So the irony is that cancelling the Elsevier subscription would liberate enough money to make accessible all those articles the Netherlands currently publishes with Elsevier, but not if they were published by Elsevier. In a pure OA world at Elsevier prices the Netherlands would pay over €74M (at the Wellcome rate).
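To make the comparison concrete, here is a minimal sketch that runs the same calculation at the three APC levels discussed above and checks each against the roughly €7M a cancellation would free up. The labels and variable names are mine; the figures are those quoted in this post.

```python
ELSEVIER_ARTICLES = 10_000  # NL-affiliated articles published with Elsevier
ALL_ARTICLES = 40_000       # all NL-affiliated articles
BILLED_SHARE = 0.60         # proportion billed to a Dutch address
LIBERATED = 7_000_000       # euros, estimated Elsevier share of subscriptions

apc_rates = {
    "current NL average": 1_087,
    "RCUK average for Elsevier": 1_600,
    "Wellcome average for Elsevier": 3_100,
}

billed_articles = ELSEVIER_ARTICLES * BILLED_SHARE
costs = {label: billed_articles * apc for label, apc in apc_rates.items()}

for label, cost in costs.items():
    verdict = "covered" if cost <= LIBERATED else "not covered"
    print(f"{label}: €{cost / 1e6:.1f}M ({verdict} by €{LIBERATED / 1e6:.0f}M)")

# The whole national output at the Wellcome rate for Elsevier papers
pure_oa_at_wellcome_rate = (
    ALL_ARTICLES * BILLED_SHARE * apc_rates["Wellcome average for Elsevier"]
)
print(f"All NL papers at the Wellcome rate: €{pure_oa_at_wellcome_rate / 1e6:.1f}M")
```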

Managing the negotiation

The negotiators could use this analysis to define the OA price point they need to negotiate with Elsevier to find a break even. The problem with this is that the money will remain tied to one supplier – and a very expensive supplier at that. The negotiators should look in detail at RCUK APC reports to see how good a deal can be done if that’s the route they want to take. Elsevier can cross subsidise from their subscription revenue to reduce the apparent cost of APCs – something new players in the space cannot do. From a short term perspective this can probably even be made to look financially attractive. But in the long term Elsevier cannot afford to charge APCs at the level where the Netherlands would break even without a substantial drop in revenue.
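As a rough illustration of what that break-even price point looks like, the sketch below divides the estimated Elsevier subscription by the number of Elsevier papers likely to be billed to the Netherlands. The break-even figure is derived here from the numbers in this post rather than taken from any source, and the function names are my own.

```python
def break_even_apc(subscription_spend, articles, billed_share):
    """Average APC at which the APC bill equals the cancelled subscription."""
    return subscription_spend / (articles * billed_share)


def discount_needed(reference_apc, target_apc):
    """Fractional cut required to bring a reference APC down to the target."""
    return max(0.0, 1 - target_apc / reference_apc)


# Roughly €7M of subscription spend, 10,000 papers, 60% billed to the Netherlands
target = break_even_apc(7_000_000, 10_000, 0.60)
print(f"Break-even APC: €{target:,.0f}")

for label, apc in {"RCUK average": 1_600, "Wellcome average": 3_100}.items():
    cut = discount_needed(apc, target)
    print(f"{label} (€{apc:,}) would need a {cut:.0%} cut to break even")
```

On these assumptions the break-even price sits well below even the discounted RCUK average, which is the gap any negotiated deal would have to close.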

The reality is that if the Netherlands wants to use the leverage that their resources provide they should cancel the subscription and liberate the funding. Those resources can be used to shape the future scholarly communications market. This analysis is highly sensitive to the average cost of APCs paid. The Netherlands, with the resources available to it, has the leverage to shape the market. They could choose to spend that money so as to reduce APCs by favouring lower cost suppliers. This will help to realise the potential savings that an Open Access environment could bring. They could use liberated resources to fund APCs. Alternatively they could support new publishing ventures or platforms for low cost publication. All of these are possible – all of these would have a massive boost from €7M. None of them are possible without cutting subscriptions.

Full Disclosure Comment: Obviously PLOS would stand to benefit from an expansion in the availability of APC funding in the Netherlands.

Data sources and details of the calculations

Wouter Gerritsma has published a very useful blog post that contains a series of key figures for the Netherlands (although as you’ll see below I disagree with its main conclusion). In particular he gives figures of total NL subscriptions of about €34M, an average APC for free to read articles of €1,087 (or €1,220 if you only include accessible articles where an APC was paid) and a total number of NL articles of around 40,000.

It’s hard to know what the Netherlands subscription payment to Elsevier is but we can make a rough guess. If we combine figures from FOI requests to UK universities with SCONUL figures for “Total Information Provision” from the same universities then (if we restrict our view to research intensives similar to the VSNU institutions) we get a rough figure of 20% for Elsevier market share. Applied to the €34M total this gives us a figure of around €6.8M. It seems that WoS and Scopus suggest that Elsevier publish around 25-30% of Netherlands articles (let’s say 10,000).

We can calculate the cost of unilaterally taking the Netherlands to an Open Access footing based on these figures but we need one more – the 40,000 articles that Gerritsma quotes are all those with a Netherlands affiliation. We need to know how many of those would actually be billed to the Netherlands. Gerritsma has done an analysis looking at those articles where either all affiliations are to Netherlands addresses or where the corresponding author has a Netherlands address. He arrives at a figure of 67% as the “billed proportion” giving a total cost of €27.7M, substantially less than the subscription payments.

Within PLOS we also have data on this proportion because we have the location of actual billing addresses as well as author affiliations for papers we publish. Looking at our records the proportion of Netherlands affiliated papers actually billed to the Netherlands is 60-62%. This gives a slightly lower figure of €26M for unilateral OA for the whole Netherlands corpus. This proportion is actually very high compared to other countries and institutions. Many calculations of projected costs fail to take this proportion into account which is why claims are often made that Open Access would be more expensive. Generally this billed proportion is around 40-50%. In no case where I have done the calculation using an empirically determined billing proportion has the cost of OA exceeded that of current subscriptions.
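The sensitivity to the billed proportion is easy to see in a small sketch. It reuses the article count, average APC and subscription total quoted above. Applying the 40-50% range seen elsewhere to the Dutch figures is purely illustrative, and the scenario labels are mine.

```python
ARTICLES = 40_000           # NL-affiliated articles per year
AVERAGE_APC = 1_087         # euros
SUBSCRIPTIONS = 34_000_000  # euros


def projected_cost(billed_share):
    """APC bill if only `billed_share` of affiliated papers is paid for locally."""
    return ARTICLES * billed_share * AVERAGE_APC


scenarios = {
    "every affiliated paper billed (proportion ignored)": 1.00,
    "PLOS billing data, upper end": 0.62,
    "PLOS billing data, lower end": 0.60,
    "typical elsewhere, upper end (illustrative)": 0.50,
    "typical elsewhere, lower end (illustrative)": 0.40,
}

for label, share in scenarios.items():
    cost = projected_cost(share)
    comparison = "above" if cost > SUBSCRIPTIONS else "below"
    print(f"{label}: €{cost / 1e6:.1f}M, {comparison} current subscriptions")

# Billed proportion at which APC costs would equal current subscriptions
break_even_share = SUBSCRIPTIONS / (ARTICLES * AVERAGE_APC)
print(f"Break-even billed proportion: {break_even_share:.0%}")
```

Ignoring the billed proportion altogether is the only scenario here in which Open Access comes out more expensive than the current subscriptions.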

So cancelling the Elsevier subscription won’t save enough money to cover OA for all Netherlands papers. Does it provide enough to cover OA for those papers Elsevier currently publishes? We can use average figures from per article returns for Wellcome Trust (£2,448 or €3,100) or RCUK (£1,268 or €1,600) funded articles to get an average APC for Elsevier articles. The differences in these figures are due to package deals so they represent a range from list price to discounts that the Netherlands may be able to negotiate.

If we take those articles that Elsevier currently publishes with a Netherlands affiliation (10,000 as per above) and we multiply that by the proportion likely to be billed to the Netherlands (60%) then we get 6,000 articles that need to be paid for. At the Wellcome Trust prices this would cost €18.6M to publish with Elsevier (6,000 x €3,100) or at the RCUK prices €9.6M (6,000 x €1,600). By contrast if those articles were published elsewhere cancelling would liberate enough to pay if we use the average current APC calculated by Gerritsma (6,000 x €1,087 = €6.5M) and within the margin of error if we use the higher figure based only on the average where a fee was paid (6,000 x €1,220 = €7.3M).

Last week, along with a number of PLOS folks, I attended the 1AM Meeting (for “First (UK) Altmetrics Conference”) in London. The meeting was very interesting with a lot of technical progress being made and interest from potential users of the various metrics and indicators that are emerging. Bubbling along underneath all this there is an outstanding question that although it appeared in different contexts remains unanswered. What do these various indicators mean? Or perhaps more sharply what are they useful for?

This question was probably most sharply raised in the online discussion by Professor David Colquhoun. David is a consistent and trenchant critic of all research assessment metrics. I agree with many of his criticisms of the use and analysis of all sorts of research metrics, while disagreeing with his overall position of rejecting all measures, all the time. Primarily this comes down to us being interested in different questions. David’s challenge can perhaps be summed up in his tweet: “I’d appreciate an example of a question that CAN be answered by metrics”. Here I will give some examples of questions that can (or could in the future) be answered, with a focus on non-traditional indicators.

Provide evidence that…

Much of the data we have is sparse. That is, the absence of an indicator cannot reliably be taken to mean an absence of activity. For example a lack of Mendeley bookmarks may not mean that a paper is not being saved by researchers, just that those who do are not using Mendeley to do it. A lack of tweets about an article does not mean it is not being discussed. But we can use the data that does exist to show that some activity is occurring. Some examples might include:

Provide evidence that relevant communities are aware of a specific paper. I identified the fact that this paper was mentioned by crisis centres, sexual health organisations and discrimination support groups in South Africa when I was looking for UCT papers with South African twitter activity using Altmetric.com.

Provide evidence that this relatively under cited paper is having a research impact. There is a certain kind of research article, often a method description or a position paper that is influential without being (apparently) heavily cited. For instance this article has a respectable 14,000 views and 116 Mendeley bookmarks but a relatively (for the number of views) small number of WoS citations (19) compared to say this article which is similar in age and number of views but has many more citations.

Provide evidence of public interest in… A lot of the very top articles by views or social media mentions are of ephemeral (or prurient) interest, the usual trilogy of sex, drugs, and rock and roll. However dig a little deeper and a wide range of articles surface, often not highly cited but clearly of wider interest. This article for instance has high page views and Facebook activity amongst papers with a Harvard affiliation but is about neither sex, drugs nor rock and roll. Unfortunately, because this is Facebook data we can’t see who is talking about it, which limits our ability to say which publics are involved, something that could be quite interesting.
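The asymmetry in all of these examples, where presence of a signal is evidence but absence is not, can be sketched in a few lines. The views, bookmarks and citations are the figures quoted for the under-cited paper above; the tweet and Facebook entries are hypothetical placeholders added to show the two kinds of "nothing".

```python
def describe(count):
    """Interpret one indicator without reading silence as inactivity."""
    if count is None:
        return "not measured: says nothing either way"
    if count == 0:
        return "nothing recorded: not evidence of absence"
    return f"positive evidence of activity ({count:,} events)"


indicators = {
    "views": 14_000,
    "mendeley_bookmarks": 116,
    "wos_citations": 19,
    "tweets": 0,                # hypothetical: source checked, nothing found
    "facebook_mentions": None,  # hypothetical: source not checked at all
}

# Only sources that recorded something can support a "provide evidence that" claim
evidence = {source: count for source, count in indicators.items() if count}

for source, count in indicators.items():
    print(f"{source}: {describe(count)}")
```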

Compare…

Comparisons using social media or download statistics need real care. As noted above the data are sparse so it is important that comparisons are fair. Also comparisons need to be on the basis of something that the data can actually tell you (“which article is discussed more by this online community” not “which article is discussed more”).

Compare the extent to which these articles are discussed by this online patient group. Or possibly specific online communities in general. Here the online communities might be a proxy of a broader community or there might be a specific interest in knowing whether the dissemination strategy reaches this community. It is clear that in the longer term social media will be a substantial pathway for research to reach a wide range of audiences. Understanding which communities are discussing what research will help us to optimise the communication.

Compare the readership of these articles in these countries. One thing that most data sources are very weak on at the moment is demographics but in principle the data is there. Are these articles that deal with diseases of specific areas actually being viewed by readers in those areas? If not, why not? Do they have internet access, could lay summaries improve dissemination, are they going to secondary online sources instead?

Compare the communities discussing these articles online. Is most conversation driven by science communicators or by researchers? Are policy makers, or those who influence them, involved? What about practitioner communities? These comparisons require care and simple counting rarely provides useful information. But understanding which people within which networks are driving conversations can give insight into who is aware of the work and whether it is reaching target audiences.

What flavour is it…

Priem, Piwowar and Hemminger (2012) in what remains in my mind one of the most thoughtful analyses of the PLOS Article Level Metrics dataset used principal component analysis to define different “flavours of impact” based on the way different combinations of signals seemed to point to different kinds of interest. Many of the above use cases are variants on this theme – what kind of article is this? Is it a policy piece, of public interest? Is it of interest to a niche research community or does it have wider public implications? Is it being used in education or in health practice? And to what extent are these different kinds of use independent from each other?

I’ve been frustrated for a while with the idea that correlating one set of numbers with another could ever tell us anything useful. It’s important to realise that these data are proxies of things we don’t truly understand. They are signals of the flow of information down paths that we haven’t mapped. To me this is the most exciting possibility and one we are only just starting to explore. What can these signals tell us about the underlying pathways down which information flows? How do different combinations of signals tell us about who is using that information now, and how they might be applying it in the future? Correlation analysis can’t tell us this, but more sophisticated approaches might. And with that information in hand we could truly design scholarly communication systems to maximise their reach, value and efficiency.

Some final thoughts

These applications largely relate to communication of research to non-traditional audiences. And none of them directly tell us about the “importance” of any given piece of research. The data is also limited and sparse, meaning that comparisons should be viewed with care. But the data is becoming more complete and patterns are emerging that may let us determine not so much whether one piece of work is “better” than another but what kind of work it is – who is finding it useful, what kinds of pathways is the information flowing down?

Fundamentally there is a gulf between the idea of some sort of linear ranking of “quality” – whatever that might mean – and the qualities of a piece of work. “Better” makes no sense at all in isolation. It’s only useful if we say “better at…” or “better for…”. Counting anything in isolation makes no sense, whether it’s citations, tweets or distance from Harvard Yard. Using data to help us understand how work is being, and could be, used does make sense. Building models, critiquing them – checking the data and testing those models to destruction will help us to build better communications systems.

Gathering evidence, to build and improve models, to apply in the real world. It’s what we scholars do after all.

We know that those Open Access policies that work are the ones that have teeth. Both institutional and funder policies work better when tied to reporting requirements. The success of the University of Liege in filling its repository is in large part due to the fact that works not in the repository do not count for annual reviews. Both the NIH and Wellcome policies have seen substantial jumps in the proportion of articles reaching the repository when grantees’ final payments or ability to apply for new grants were withheld until issues were corrected.

Each of these steps is difficult or impossible in our current data environment. Each of them could be radically improved with some small steps in policy design and metadata provision, alongside the wider release of data on funded outputs.

Identifying relevant outputs

It may seem strange but it remains the case that the hardest step in auditing policy implementation is the first: identifying which outputs are subject to the policy. There is no comprehensive public database of research outputs. Crossref and Pubmed come closest to providing this information but both have substantial weaknesses. Pubmed only covers a subset of the literature, missing most of the social sciences and virtually all the humanities. Crossref does a better job of covering a wider range of disciplines.

Affiliation and funder are the two key signifiers of policy requirements. Pubmed only provides affiliation for corresponding authors, and Crossref metadata currently has very few entries with author affiliations. Crossref’s Fundref project is gradually adding funder information but is currently limited in coverage. Pubmed only has funder information for Pubmed partners. Private data sources such as Web of Knowledge and Scopus can provide some of this data but are also incomplete and cannot be publicly audited.

Funders and institutions rarely provide any public-facing list of their outputs. RCUK is probably the leader in this space with Gateway to Research providing an API that allows querying via institution, funder, grant or person. GtR is a good system but is reliant on author reporting. It therefore takes some years for outputs to be registered. In principle the SHARE notification system could go some way to addressing this updating issue but to manage the process of keeping records updated at scale will require standards development. Pubmed and Europe PubMed Central provide the most up to date public information linking outputs to funding currently available but as noted above have disciplinary gaps and weaknesses in terms of affiliation information.

Identifiers for research outputs are crucial here. Pretty much any large scale tool for identifying and auditing the implementation of any scholarly communications policy will need to pull data from multiple sources. To do this at scale requires that we can cross-reference outputs and compare data across these sources. Unique identifiers such as DOIs, ISBNs and Handles make a huge difference to accuracy. Without them, many outputs will simply be missed or the data will be too messy to handle. Disciplines that have not adopted identifiers will therefore be systematically under represented and under reported.
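As a minimal sketch of why identifiers matter for this kind of cross-referencing, the code below merges records from two lists on a normalised DOI and sets aside anything without one. The record layout, field names and example DOI are invented for illustration and do not reflect any particular service's format.

```python
import re


def normalise_doi(raw):
    """Reduce common DOI spellings to a bare lower-case DOI, or None if absent."""
    if not raw:
        return None
    match = re.search(r"10\.\d{4,9}/\S+", raw.strip())
    return match.group(0).lower() if match else None


def merge_by_doi(*sources):
    """Combine records that share a DOI; collect records that cannot be matched."""
    merged, unmatched = {}, []
    for records in sources:
        for record in records:
            doi = normalise_doi(record.get("doi"))
            if doi is None:
                unmatched.append(record)
                continue
            fields = {k: v for k, v in record.items() if k != "doi" and v is not None}
            merged.setdefault(doi, {}).update(fields)
    return merged, unmatched


# Hypothetical records from a funder list and a repository export
funder_list = [{"doi": "https://doi.org/10.1234/Example.1", "funder": "Funder A"}]
repository_list = [
    {"doi": "doi:10.1234/example.1", "repository_copy": True},
    {"doi": None, "title": "A monograph without an identifier"},
]

merged, unmatched = merge_by_doi(funder_list, repository_list)
print(merged)
print(f"{len(unmatched)} record(s) could not be cross-referenced")
```

The unmatched list is the point: every output that ends up there is one that an automated audit will either miss or have to match by title, with all the messiness that implies.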

Identifying accessible copies

Assuming we can create a list of relevant outputs it might seem simple to test whether it is possible to find accessible copies. A quick Google Scholar search should suffice. And this will work for one, or ten or perhaps a hundred outputs. But if we are to track implementation across a funder, a large institution or a country we will be dealing with tens or hundreds of thousands, or millions of outputs. Manual checks will be very labour intensive (as the poor souls preparing returns from UK universities for the RCUK review can currently attest).

Check the publisher copy

As noted above a substantial proportion of the scholarly literature does not have a unique ID. This means finding even the ‘official’ copy can be challenging. Where an ID is available it should be straightforward to reach the publisher copy but determining whether this is ‘accessible’ is not trivial. While many publishers will mark accessible outputs in some way this is done inconsistently across publishers. Currently this requires outputs to be checked manually, an approach that will not scale. Consistent metadata is required to make it possible to check accessibility status via machine. This is gradually improving for journal articles but books, with a wider range of mixed models for Open Access volumes, will remain a challenge for some time.

Find a repository copy

If the publisher copy can’t be found or isn’t accessible then it is important to find a copy in a repository… somewhere. Google might be indexing the repository, but does not provide an API. This means each article needs to be checked by hand. Aggregators like CORE, BASE and OpenAIRE might be pulling from the repository in question, providing a search mechanism that scales. But many repositories do not provide information in the right form for aggregation.

While there is a standard metadata format provided by repository systems, OAI-PMH, it is applied very differently by different repositories. Many repositories do not record publisher identifiers such as DOIs and titles are frequently different from the publisher version making it difficult to efficiently search the records. More importantly OAI-PMH is a harvesting protocol. It is not designed for querying a repository and identifying whether it holds resources relating to specific outputs.
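To illustrate the difference between harvesting and querying, here is a rough sketch using only the Python standard library. It builds an OAI-PMH `ListRecords` request and then scans a harvested response for a DOI, which is the only option the protocol leaves you with. The repository address, record identifiers and DOI are invented, and the sample response is a cut-down illustration rather than real repository output.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a harvesting request; a resumption token replaces all other arguments."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


def records_mentioning(xml_text, doi):
    """Scan one harvested page for records whose dc:identifier contains the DOI."""
    root = ET.fromstring(xml_text)
    hits = []
    for record in root.iterfind(".//oai:record", NS):
        identifiers = [el.text or "" for el in record.iterfind(".//dc:identifier", NS)]
        if any(doi.lower() in value.lower() for value in identifiers):
            hits.append(
                record.findtext("oai:header/oai:identifier", default="", namespaces=NS)
            )
    return hits


# Hypothetical, heavily simplified ListRecords response
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repo.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>First example record</dc:title>
          <dc:identifier>https://doi.org/10.1234/example.1</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <header><identifier>oai:repo.example.org:2</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Second example record, no DOI recorded</dc:title>
          <dc:identifier>http://repo.example.org/handle/2</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("https://repo.example.org/oai"))
print(records_mentioning(SAMPLE, "10.1234/example.1"))
```

There is no request that asks "do you hold this DOI?". Answering that question means harvesting every page of every repository and searching locally, and it fails silently for the second record above, where no DOI was recorded in the first place.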

Even if we do find a record in a repository it does not necessarily mean that a full text copy of the output has been deposited, nor that it is actually available. Institutional repositories are very inconsistent in the way they index articles and in the metadata they provide. CORE harvests files so if a file is available it is generally full text. OpenAIRE provides metadata on whether an output is available. Both have limitations in coverage and neither has appropriate infrastructure funding for the long term future.

Determining output compliance

Once accessible copies of outputs have been identified it remains to be determined whether all the policy requirements have been met. Requirements fall into two broad categories: the time of availability (i.e. any embargo on public access to the document) and licensing requirements. Neither of these can currently be tested at scale.

Embargos and availability

Most access policies require that outputs made available via repositories are made public within a specified period after publication. Precise wording of the policy often varies on this point. Most policies specify that the output must be available after some acceptable embargo period but differ on when the output should be deposited (on acceptance, on publication, before the embargo ends).

The metadata provided by most OAI-PMH feeds does not provide sufficient information to determine whether a full text copy is available. Where any information is provided on copies that are deposited but not accessible this is not provided in a consistent form. OpenAIRE specifies requirements for repository metadata that define whether a full text copy is currently embargoed but only a subset of repositories are currently OpenAIRE compliant.

Overall it is currently not possible to comprehensively survey repositories to determine whether a full text copy has been deposited, whether it is available to read, and if not when it will be. Confusion created by policy wording on when any acceptable embargo period commences is also not helpful. If it starts on the date of publication is that the date of release online or the formal date of publication (which can be months or years later)? If it is the date of acceptance where is this recorded? What does acceptance even mean if we are talking about a monograph?
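The effect of that ambiguity is easy to show. The sketch below computes the end of a six month embargo from three candidate start dates for the same output. The dates and the six month period are hypothetical, and the helper names are mine.

```python
import calendar
from datetime import date


def add_months(start, months):
    """Return the date `months` after `start`, clamped to the month's last day."""
    index = start.month - 1 + months
    year = start.year + index // 12
    month = index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def embargo_end_dates(candidate_starts, embargo_months):
    """Embargo end for each possible reading of 'when the embargo starts'."""
    return {
        label: add_months(start, embargo_months)
        for label, start in candidate_starts.items()
    }


# Hypothetical dates for a single article
candidate_starts = {
    "acceptance": date(2014, 3, 10),
    "online release": date(2014, 4, 2),
    "formal issue date": date(2015, 1, 31),
}

ends = embargo_end_dates(candidate_starts, embargo_months=6)
for label, end in ends.items():
    print(f"Embargo counted from {label}: must be public by {end.isoformat()}")
```

The same article is either compliant or nearly a year overdue depending on which date the policy meant, which is why the start of the embargo needs to be stated in terms of a date that is actually recorded somewhere.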

The work on RIOXX metadata standards will address many aspects of this and illustrates the need for consistent metadata profiles to enable automated auditing. The challenges also illustrate the need for standardising policy language and for expressing policy requirements in measurable terms.

Publisher site licensing

If an output is made available via a journal then there are often requirements associated with this. For RCUK, Wellcome Trust and FWF where an APC has been paid the article must be published under a CC BY license. The experience of implementation has been patchy with many traditional publishers doing a fairly poor job of providing the correct licenses. This means that each and every output needs to be checked.

As with repository auditing this is a challenge. Different publishers, and different outputs from the same publisher, have differing and inconsistent ways of expressing license statements. Some journal publishers even manage to express licenses inconsistently on the same article.

To address this PLOS funded a tool, built by Cottage Labs, which aims to check, for individual journal articles, what the license is. It does this by following the DOI to the article page, reading the HTML and checking for known license statements for that website. The tool provides an API that allows a user to query a thousand IDs at a time for available license information. This approach is limited. It only works when we have an identifier (DOI or PMID). It is focussed on journal articles. It breaks when publishers change their website design. It can only recognize known statements.
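The core of the approach, and its main weakness, can be shown with a simplified sketch. This is not the code of the tool itself: the list of known statements, the function names and the example HTML are all my own assumptions, and the step of following the DOI to fetch the page is left out.

```python
# Hypothetical table of statements the checker knows how to recognise
KNOWN_STATEMENTS = {
    "creativecommons.org/licenses/by/4.0": "CC BY 4.0",
    "creativecommons.org/licenses/by/3.0": "CC BY 3.0",
    "creativecommons.org/licenses/by-nc/4.0": "CC BY-NC 4.0",
    "creativecommons.org/licenses/by-nc-nd/4.0": "CC BY-NC-ND 4.0",
}


def detect_licenses(html):
    """Return every known license statement found in an article page."""
    text = html.lower()
    found = {name for fragment, name in KNOWN_STATEMENTS.items() if fragment in text}
    return sorted(found)


def batches(identifiers, size=1000):
    """Split a list of identifiers into chunks for bulk lookups."""
    for start in range(0, len(identifiers), size):
        yield identifiers[start:start + size]


clear_page = '<a rel="license" href="http://creativecommons.org/licenses/by/4.0/">CC BY</a>'
vague_page = "<p>This is an open access article.</p>"
mixed_page = clear_page + '<p>See http://creativecommons.org/licenses/by-nc/4.0/</p>'

print(detect_licenses(clear_page))  # one recognised license
print(detect_licenses(vague_page))  # nothing recognisable
print(detect_licenses(mixed_page))  # two statements on one page
```

An empty result does not mean the article is unlicensed, only that the page said nothing the checker recognises. More than one result is the inconsistent case described above, where the same article carries conflicting statements.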

But worst of all it depends on publishers actually making the license clear. While some traditional publishers (NPG and Oxford University Press deserve credit here) do a good job of this many do not. Taylor and Francis place a license statement only in the PDF (and in a context which makes it hard to detect which license applies). Springer sometimes do and sometimes don’t make the license statement available on the abstract page, sometimes only on the article page. Elsevier’s API (which we have to use because they make article pages difficult to parse) is not always consistent with the human readable license on the article. And the American Chemical Society create a link on the article with the text “CC BY” which links to a page which isn’t really the Creative Commons Attribution license but an ‘enhanced’ version with more limitations.

The NISO Accessibility and Licensing Information Working Group (full disclosure: I am a co-chair of this group) has proposed a metadata framework which could address these issues by providing a standardised way of expressing licenses – while not restricting the ability of publishers to choose which license to apply. Crossref is already offering a means for publishers to bulk upload license references for existing DOIs. This needs to be expanded across all publishers if we are to effectively monitor implementation of policies.

Policy Design

As we move from the politics of policy development to the (social) engineering problem of policy implementation our needs are changing. It is no longer enough to simply state aspirations, we need to be able to test performance. At the moment this is being done via manual and ad hoc processes. This is both inefficient and not scalable. At the same time, with the right information environment it should be possible not just to monitor our implementation of Open Access but to monitor it continuously, in real time.

The majority of public access policies to date have been designed as human readable documents. Little thought has gone into how the policy goals translate into auditable requirements. As a result the burden of monitoring implementation is going up. In many cases there are no mechanisms to monitor implementation at all.

As we move from aspirational policies to the details of implementation we need efficient means of generating data that help us to understand what works, and what does not. To do this we need to move towards requirements that are auditable at scale, that work from sustainably public datasets using consistent metadata formats.

Policies are necessarily political documents. The devil is in the details of implementation. For clarity and consistency it would be valuable to develop formal requirements documents, alongside policy expressions that provide explicit detail on how implementation will be monitored. None of the infrastructure required is terribly difficult to build and much of it is already in place. What is required is coordination and a commitment to standardising the flow of information between all the stakeholders involved.
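As a very rough sketch of what a formal requirements document might reduce to, the code below expresses a policy as two measurable values and audits individual outputs against it. The class and field names, the 183 day embargo and the example outputs are all hypothetical; a real policy would carry more conditions than this.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Policy:
    required_license_if_apc_paid: str  # e.g. "CC BY"
    max_embargo_days: int              # measured from a stated publication date


@dataclass
class Output:
    identifier: str
    published: date
    apc_paid: bool
    license: Optional[str] = None
    repository_available_from: Optional[date] = None


def audit(output, policy):
    """Return a list of problems; an empty list means none were detected."""
    problems = []
    if output.apc_paid:
        if output.license != policy.required_license_if_apc_paid:
            problems.append(
                f"license is {output.license!r}, "
                f"expected {policy.required_license_if_apc_paid!r}"
            )
    elif output.repository_available_from is None:
        problems.append("no accessible repository copy found")
    elif (output.repository_available_from - output.published).days > policy.max_embargo_days:
        problems.append("repository copy available later than the permitted embargo")
    return problems


policy = Policy(required_license_if_apc_paid="CC BY", max_embargo_days=183)
outputs = [
    Output("example-1", date(2014, 1, 1), apc_paid=True, license="CC BY"),
    Output("example-2", date(2014, 1, 1), apc_paid=True, license="CC BY-NC"),
    Output("example-3", date(2014, 1, 1), apc_paid=False,
           repository_available_from=date(2014, 9, 1)),
    Output("example-4", date(2014, 1, 1), apc_paid=False),
]

report = {output.identifier: audit(output, policy) for output in outputs}
for identifier, problems in report.items():
    print(identifier, "->", problems or "no problems detected")
```

The check itself is trivial. Everything hard lies in reliably populating the fields of each output, which is exactly what consistent metadata and identifiers would provide.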

Recommendations

Identification of Relevant Outputs: Policy design should include mechanisms for identifying and publicly listing outputs that are subject to the policy. The use of community standard persistent and unique identifiers should be strongly recommended. Further work is needed on creating community mechanisms that identify author affiliations and funding sources across the scholarly literature.

Discovery of Accessible Versions: Policy design should express compliance requirements for repositories and journals in terms of metadata standards that enable aggregation and consistent harvesting. The infrastructure to enable this harvesting should be seen as a core part of the public investment in scholarly communications.

Auditing Policy Implementation: Policy requirements should be expressed in terms of metadata requirements that allow for automated implementation monitoring. RIOXX and ALI proposals represent a step towards enabling automated auditing but further work, testing and refinement will be required to make this work at scale.

Since our coalition of over 50 signatories first released our letter to the STM Association calling on them to withdraw their new model licenses there has been overwhelming support. We’ve added new signatories daily to now reach 85. The most recent additions are publisher-oriented - GigaScience Journal and UC University Press - the latter notable as being a publisher with a strong history in the social sciences and humanities. See the letter itself for the full list of what is now a very wide ranging group of signatories.

Many signatories have also blogged their own perspective. A full list of the posts and media coverage we know about is below but in this post I wanted to pick out one aspect that is particularly important. While PLOS (and many of those of us associated with it) are vociferous supporters of CC BY as the right license for scholarly work, many of the signatories to the letter choose to use other Creative Commons licenses, including some of the more restrictive variants. See for instance the ACRL Post on their use of CC BY-NC or the Wikimedia Foundation post that emphasises CC BY-SA.

This is important because it shows that even while we disagree on important issues of principle around which licenses to use, we all agree that we should work within a single framework. This means that we can have the important discussions on those principles and know that until we resolve them we are as compatible from a legal perspective as possible. And it means that if and when we do resolve those issues it is possible to shift from one CC license to another with as few unexpected side effects as possible.

The thing that most disappointed me about the STM response to our letter is the way it mistakenly equates the use of Creative Commons licenses with the use of the CC BY license specifically. STM should be showing leadership through educating its members on the range of CC licenses.

What the growing list of signatories, coming from a wide range of perspectives, and the coverage below show is that there is plenty of space for a diversity of opinions on business models and user rights within a single interoperable framework of Creative Commons licenses.

We thought it might be useful to get some data on just how many CC licensed peer reviewed articles are out there. This turns out to be a non-trivial exercise but I think it’s feasible to come up with a reasonable lower bound. The too-long didn’t-read version: there are at least 1.2M CC licensed scholarly articles in the wild, with over 720,000 of them being licensed CC BY.

Our first port of call is the Directory of Open Access Journals. The DOAJ, alongside its listing of journals, also offers the opportunity to provide article metadata, including the default license for the journal. At the search page there is an option to limit the search to articles, and if you then click on the licenses selector tab you can get the number of articles registered under different CC licenses. When I looked this gave 547,138 CC BY licensed articles, 311,956 CC BY-NC articles and so on, for a total of just over one million CC licensed articles.

However this isn’t a complete representation of the picture. A number of large publishers (including [cough] PLOS) don’t deposit article level metadata with DOAJ. So 1M is undercounting. For some publishers of pure OA journals we have data from OASPA up to 2013 on CC BY licensed articles. The big contributors missing from the DOAJ dataset are Springer Open, PLOS, OUP and MDPI. These publishers contribute a further 144,203 articles up to the end of 2013, bringing our total to over 1.1M. I can add the 22k articles published by PLOS and 3,463 published by SpringerOpen in 2014 to this total (but not those from Biomed Central which are included in the DOAJ numbers).

There are some further gaps. NPG’s Scientific Reports uses CC licenses (5,793 articles according to Pubmed) and Nature Communications uses CC licenses for its free-to-read papers (I obtained a total of 1,026 free to read articles from this data set). Nature Communications illustrates a big gap in our knowledge. We know that there is substantial uptake of CC licenses by big publishers including Wiley, Taylor and Francis, Sage, OUP, and Elsevier for their hybrid offerings but we have limited information on the scale of that at the moment. The sources and quality of information are likely to improve substantially by the end of the year but at the moment the best I can do is guess that these might amount to a few tens of thousands but not yet hundreds. I’m therefore leaving them out of my current estimates.

Some caveats – clearly I’m missing a range of articles here, particularly from smaller publishers that have journals not registered with DOAJ. But if I’m going to claim this is a reasonable lower bound I also need to ensure I’m not double counting. A search for all the publishers for which I’ve added articles above and beyond those in DOAJ gives zero results except for a search for PLOS (650) and Springer (99). I’m also missing a substantial number of papers from Springer. They recently announced reaching 200,000 OA papers with CC licenses from various imprints including Biomed Central and Springer Open. The totals in my numbers are 136,895 papers from Biomed Central (via DOAJ) and 18,375 for Springer Open (based on the OASPA data and a search for 2014). Therefore there are another ~45k papers I’m missing. Similarly for OUP I’m missing maybe another 10,000 papers in journals like Nucleic Acids Research that are now mostly CC licensed.

One criticism of these figures might be the fact that the DOAJ does contain some journals that are currently being removed as they do not meet the stricter quality conditions being imposed. Am I therefore including dodgy journals in my figures? The counter argument is that those publishers that think about licensing and provision of article metadata tend to be the most reliable. The fact that the data is there at all is a good indicator of a serious publisher. Overall I think the balance of clear undercounting above vs the risk of these potential issues contributing significant numbers of articles is approximately a wash.

Overall the total figures come out to slightly over 1.2M articles with CC licenses. Of these at least 724,000 use the CC BY license. You can of course take the links I’ve given and check my maths. The data can also be split out by year, and although that gets more messy with missing data it looks like around 200k CC licensed articles were released in 2012 and 2013 making them a substantial proportion of the whole literature.
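
For anyone who does want to check the maths, here is a minimal sketch of the tally using only figures quoted in this post. Two things in it are my assumptions rather than anything stated above: the DOAJ total is only given as "just over one million", so the sketch uses 1,000,000 as a floor, and it counts the Scientific Reports and Nature Communications articles as CC licensed but not specifically CC BY. It also subtracts the PLOS and Springer articles found in DOAJ in full, to stay on the conservative side.

```python
# Figures quoted in the post above.
DOAJ_CC_BY = 547_138
SUPPLEMENTS_CC_BY = {
    "OASPA publishers missing from DOAJ, to end of 2013": 144_203,
    "PLOS, 2014 (quoted as 22k)": 22_000,
    "SpringerOpen, 2014": 3_463,
}
SUPPLEMENTS_OTHER_CC = {
    "Scientific Reports": 5_793,
    "Nature Communications, free to read": 1_026,
}
# PLOS (650) and Springer (99) articles that turned up in the DOAJ search,
# subtracted in full so that nothing is counted twice.
DOAJ_OVERLAP = 650 + 99

# Assumption: the post only says "just over one million", so use a floor.
DOAJ_ALL_CC_FLOOR = 1_000_000


def lower_bound(doaj_count, *supplements):
    """DOAJ count plus the supplementary counts, minus the known overlap."""
    added = sum(sum(group.values()) for group in supplements)
    return doaj_count + added - DOAJ_OVERLAP


cc_by = lower_bound(DOAJ_CC_BY, SUPPLEMENTS_CC_BY)
all_cc = lower_bound(DOAJ_ALL_CC_FLOOR, SUPPLEMENTS_CC_BY, SUPPLEMENTS_OTHER_CC)

# Springer's announced 200,000 papers versus the two Springer counts above.
springer_gap = 200_000 - 136_895 - 18_375

print(f"CC BY lower bound:  {cc_by:,}")
print(f"All CC lower bound: {all_cc:,}")
print(f"Springer papers not counted: {springer_gap:,}")
```

Under these assumptions the sketch lands a little below the headline figures, which is what you would expect from rounding the DOAJ total down and from being strict about which articles count as CC BY. The Springer gap comes out at 44,730, which is the ~45k mentioned above.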

Last Thursday we published a letter with 57 other organisations calling on the International Association of Scientific, Technical and Medical Publishers to withdraw their model licences and work within the Creative Commons framework.

Opposition to the STM model licences is also coming from other sources, most notably the Wellcome Trust. Chris Bird (Senior Legal Counsel) and Robert Kiley (Head of Digital Services) outlined in a post on Friday why they think the STM licences are not helpful:

“Put simply, we see no value in these new licences, and believe that if a publisher wishes to restrict how content can be used (excluding Wellcome funded, OA papers which must always be published under the CC-BY licence), the existing Creative Commons licences (e.g. CC-BY-NC and CC-BY-NC-ND) are more than adequate.”

The Wikimania meeting is the annual jamboree of the Wikimedia movement. The sessions cover museums, pop culture, politics, technology, communities and tools. Two thousand people have descended on the Barbican Centre in London to talk not just about Wikipedia (or more properly the Wikipedias in various languages) but a myriad of other projects that use the platforms or infrastructure the foundation stewards or take inspiration from the successes of this movement. The energy and the optimism here are infectious. The people around me are showing in session after session what happens when you give motivated people access to information resources and platforms to work with them.

From the perspective of academia, or of scholarly publishing it is easy, even traditional, to be dismissive of these efforts. There is perhaps no more pejorative term in the academic lexicon than ‘amateur’. This is a serious mistake. The community here are a knowledge creation and curation community – the most successful such community of the digital age.

There is much that they can teach us about managing information at scale and making it accessible and usable. The infrastructure they are building could be an important contribution to our own information platforms. There are tools and systems I have seen demonstrated here, many of them built by those ‘amateurs’, which far outstrip the capabilities we have in the academic information ecosystem. And we don’t come to the table empty handed – we have experience and knowledge of curation and validation at different scales, on how to manage review when appropriate experts are rare, on handling conflicts of interest and the ethical conduct of information gathering.

But we are just one contributor to a rich tapestry of resources, just one piece in a puzzle. One of the things I find most disappointing about the STM Association response to yesterday’s letter is the way it perpetuates the idea that it makes sense to keep scholarly publishing somehow separate from the rest of the web. The idea that “Creative Commons Licenses…are not specifically designed for academic and scholarly publishing” aside from being a misrepresentation (a subject for another post) makes very little sense unless you insist on the idea that scholarly work needs to be kept separate from the rest of the world’s knowledge.

Now don’t get me wrong – scholarly knowledge is special. It is special because of the validation and assessment processes it goes through. But the containers it sits in. They’re not special. The business models that provide those containers aren’t particularly special. But most importantly the ways in which that knowledge could be used by a motivated community aren’t any different from that of other knowledge resources. And if we don’t make it easy to use our content then it will simply be passed over for other more accessible, more easily useable materials.

This community, this massive, engaged and motivated community are our natural allies in knowledge creation, dissemination, research engagement and ultimately justifying public research funding. We disengage from them at our peril. And we don’t get to dictate the terms of that engagement because they are bigger and more important than us. But if we choose to engage then the benefits to both our communities could be enormous.

It is comfortable to be the big fish in the small pond – to put up barriers and say “but we are different, we’re special” – but if we want to make a difference we should choose to swim actively in the main stream. Because that’s what this community is. The main stream of information and knowledge dissemination in the digital age.

As the 10th Wikimania Conference in London this week bears witness, millions of people in different locations and jurisdictions are using open tools, open software and Open Access content to interact with different types of information. Wikipedia alone receives 21 billion hits and is being added to at a rate of 30,000,000 words each month. But it is not just Wikipedia that is being adopted enthusiastically, there are online educational resources becoming available all over the world. This thirst for knowledge – and for knowledge creation – comes from every sector of society and every corner of the globe and represents an unparalleled opportunity for the academic literature. It’s an opportunity to open up scholarly content to a much wider community, for it to reach audiences where those audiences are based and to tap into the cultures that will facilitate its reuse. It’s an opportunity to democratize the scholarly literature – to enable the many rather than the few.

But this is not something the scholarly community can do alone. Such a vision requires an infrastructure that enables people and computers to talk to each other wherever they are based. It requires platforms, services and communities that work together regardless of their geographic location or legal jurisdiction. We can’t do this alone, but the academic community can make it easier. And ensuring that the licenses controlling scholarly content enable use and reuse is one part of that. The Creative Commons Licenses already provide a common legal framework to ensure that copyright owners can let others share and integrate their work with other human knowledge. They are being used not just for Wikipedia and Wikimedia but for education and policy and for music and images. People are adopting them as a global standard because they are widely understood, straightforward to implement and machine readable. They are adopting them because they work.

The model open access licenses recently released by the STM are not a global standard. And unlike the more liberal Creative Commons licences, even the most liberal STM licences restrict some form of commercial or derivative reuse. No STM-licensed work can be used on Wikipedia.

But worse than that, they are incompatible. The STM licences are legally complex, with confusing and undefined terminology. They are claimed to interoperate with the Creative Commons licences, but the restrictions they impose mean that they are barely compatible with the most restrictive Creative Commons licence (which permits neither commercial nor derivative reuse). Consequently, authors who wish to create new works licensed under a Creative Commons Attribution Licence (or indeed any other public licence) will not be able to use content from work published under any of the STM licences.

PLOS and other Open Access publishers such as Hindawi, funders such as the Wellcome Trust and bodies such as the World Health Organization all favor a Creative Commons license that permits liberal reuse, including unrestricted text and data mining, while ensuring that an author’s work is properly attributed (CC BY). Signing the letter published today calling for the withdrawal of the STM model licenses is not an endorsement of the more restrictive Creative Commons licenses; it is an endorsement of the Creative Commons framework, of a global standard that is already established.

Collectively we have an opportunity to open up research content to the wider world. Some will not want to, or be able to, move as fast, but we should at least adopt a common legal framework. The Creative Commons licenses are not perfect but they have been shown to work and have been applied to over a billion objects from hundreds of millions of creators. They provide the flexibility for a wide range of options from the restricted to the fully open. But above all they provide a framework that we can all work within that will make it easier to connect with the wider world of the web.

Thank you for the opportunity to respond to your call for evidence. PLOS has been at the forefront of experimenting with and advocating for new modes of research assessment for a decade. Recent developments such as DORA and your own enquiry suggest that the time is appropriate for a substantial consideration of our approaches and tools for research assessment.

Our ability to track the use of research through online interactions has increased at an unprecedented rate, providing new forms of data that might be used to inform resource allocation. At the same time the research community has profound misgivings about the “metrication” of research evaluation, as demonstrated by submissions to your enquiry by e.g. David Colquhoun or by Meera Sabaratnam and Paul Kirby (although see also a response by Steve Fuller). These disparate strands do not however need to be in direct opposition.

As a research community we are experienced in working with imperfect and limited evidence. Neither extreme, uncritical adoption of data nor wholesale rejection of potentially useful evidence, should be countenanced. Rather we should use all the critical faculties that we bring to research itself to gather and critique evidence that is relevant to the question at hand. We would argue that the usefulness of any given indicator or proxy, whether qualitative or quantitative, depends on the question or decision at hand.

In establishing the value of any given indicator or proxy for assisting in answering a specific question we should therefore bring a critical scholarly perspective to the quality of data, the appropriateness of any analysis framework or model as well as to how the question is framed. Such considerations may draw on approaches from the quantitative sciences, social sciences or the humanities or ideally a combination of all of them. And in doing so they must adhere to scholarly standards of transparency and data availability.

In summary, therefore, we will argue in answers to the questions you pose that there are many new (and old) sources of data that will be valuable in providing quantitative and qualitative evidence in supporting evaluative and resource allocation decisions associated with research assessment. The application of this data and its analysis to date has been both naive and limited by issues of access to underlying data and proprietary control. To enable a rich critical analysis requires that we work to ensure that data is openly available, that its analysis is transparent and reproducible, and that its production and use is subject to full scholarly critique.

Yours truly,
Cameron Neylon
Advocacy Director
PLOS

Summary of Submission

The increasing availability of data on the use and impact of research outputs as a result of the movement of scholarship online offers an unprecedented opportunity to support evidence-based decision-making in research resource allocation decisions.

The use of quantitative or metrics-based assessment across the whole research enterprise (e.g. in a future REF) is premature, because both our access to data and our understanding of its quality and the tools for its analysis are limited. In addition, it is unclear whether any unique quality of research influence or impact is sufficiently general to be measured.

To support the improvement of data quality, sophisticated and appropriate analysis and scholarly critique of the analysis and application of data, it is crucial that the underlying usage data used to support decision making be open.

To gain acceptance of the use of this evidence in resource allocation decisions, it is crucial that the various stakeholder communities be engaged in a discussion of the quality, analysis and application of such data. Such a discussion must be underpinned by transparent approaches and systems that support the community engagement that will lead to trust.

HEFCE should take a global leadership position in supporting the creation of a future data and analysis environment in which a wide range of indicators acting as proxies for many diverse forms of research impact (in its broadest sense) are openly available for community analysis, use and critique. HEFCE is well placed, alongside other key stakeholders to support pilots and community development towards trusted community observatories of the research enterprise.

Seven months ago, after little sleep, I boarded a plane to Berlin to attend a conference and launch a project I’d been working tirelessly on for five months. That project was the Open Access Button, a browser plug-in which visualises when paywalls stop people reading research. Since the launch, which was covered in the Guardian and Scientific American and got the attention of EU science ministers, the project has continued to progress. As the co-founder, normally I’d now go on to talk all about it. Today is different though: I’m going to briefly tell the story of the conferences which launched, grew and gave birth to the Button, and why we, as a community, should support a new one, OpenCon 2014, which will do the same for many other ideas.

The Berlin 11 Satellite Conference for Students and Early Stage Researchers, which brought together more than 70 participants from 35 countries (and was webcast to many more around the world) to engage on Open Access, was the stage for the Button’s launch. We launched the Button on stage with a timed social media push (Thunderclap) which reached over 800,000 people. Without this platform we’d never have been able to obtain the level of publicity or move the project forward at the pace we have since.

The story of instrumental conferences goes back further though. Months before our launch we met with organisational leaders from across the globe at the Right to Research Coalition general assembly. This was the first time we were truly able to talk about the Button with our peers. We sought feedback, buy-in and help moving the project forwards – all of which we got in spades. An afternoon training session then used the Button as a case study, and the ideas from student leaders all fed into what we did.

The final conference worth highlighting is the one where it all began. While attending a conference of the International Federation of Medical Students, my co-founder (David Carroll) and I got talking to Nick Shockey, Director of the Right to Research Coalition. Prior to that conversation, David and I knew no alternative to the system of publishing that frustrated us both. After it, well, the Open Access Button was born.

These three events provided us with a launching venue and a place to develop our ideas, raised our awareness and inspired us to act. In between each are hundreds of hours of work, but these were each transformative points in our journey. We’re not alone in this experience though; at each event we were just one of many projects doing the same. I’m now, along with a student team from across the globe, working to make a conference which will do this for many others.

OpenCon 2014 is a unique Student and Early Career Researcher Conference on Open Access, Open Education and Open Data. On November 15-17 in Washington, DC, the event will bring together attendees from across the world to learn, develop critical skills, and return home ready to catalyze action toward a more open system for sharing the world’s information, from scholarly research to educational materials to digital data.

OpenCon 2014’s three day program will begin with two days of conference-style keynotes, panels, and interactive workshops, from leaders in the Open Access, Open Education and Open Data movements and participants who have led successful projects. The final day will be a half-day of advocacy training followed by the opportunity for in-person meetings with relevant policymakers, ranging from members of the U.S. Congress to representatives from national embassies and NGOs. Participants will arrive with plans of action or projects they’d like to take forwards and leave with a deeper understanding of the conference’s three issue areas, stronger skills in organizing projects, and connections with policymakers and prominent leaders.

Plans this ambitious, though, come with a price tag. To help support the travel of students from across the globe, feed them, provide them with the vital lifeblood of conferences (coffee) and put on the best conference possible, we need the support of the Open Access, Open Education and Open Data movements. There is a huge variety of sponsorship opportunities, each with its own unique benefits, which can be found here, but equally we appreciate the help of anyone in drawing attention to the event or, of course, attending.

Author:
Joe McArthur
Assistant Director at the Right to Research Coalition
Co-founder of the Open Access Button
Joe@righttoresearch.org
@mcarthur_joe

The content of guest posts is always the view of the authors and not the position of the PLOS Opens Blog or PLOS.

About PLOS Opens

The PLOS Opens blog provides news and views on the ongoing transformation of research communication. We talk about open access, policy, and approaches to open research. Posts will cover evidence and data, opinions and critical analysis from the PLOS Advocacy Team, other PLOS staff and invited guests.

The PLOS Advocacy Team

Catriona MacCallum studied evolutionary biology at Edinburgh. She joined PLOS in July 2003 as a launch editor of PLOS Biology and was also involved in the development of the Community Journals and PLOS ONE. As part of the advocacy team, she focuses on EU policy. She is also a Consulting Editor on PLOS ONE and a member of the Board of OASPA. On Twitter @catmacOA

Cameron Neylon is a biophysicist who has always worked in interdisciplinary areas and has become a dedicated advocate of open research practice and improved data management. As PLOS Advocacy Director, he plays a key role in shaping the organization's Open Access organizing, educational and outreach activities. As a respected leader in the Open Access movement, he participates in legislative and policy initiatives around the world. Cameron joined PLOS in 2012. On Twitter @CameronNeylon

Donna Okubo joined PLOS in late 2004 and has more than 15 years of non-profit membership and fundraising management experience. As a part of the Advocacy team, she coordinates educational and outreach activities and legislative initiatives with individuals and organizations across the broad Open Access community.