Still, the definition and scope of anything new are always hazy, and as such my thoughts on the matter are going to be pretty unrefined, so please let me think aloud.

But why library analytics? Libraries have always collected data and analysed them (hopefully), so what's new this time around?

In many ways, interest in library analytics can be seen to arise from a confluence of many factors both from within and outside the academic libraries. Here are some reasons why.

Trend 1: Rising interest in big data, data science and AI in general

I don't like to say that what we libraries deal in is really big data (probably the biggest data sets we deal with are EZproxy logs, which can be manageable depending on the size of your institution), but we are increasingly told that data scientists are sexy, and we are seeing more and more data mining, machine learning, deep learning and all that being used to generate insights and aid decision making.

In case you think these are pie in the sky projects, IBM Watson is already threatening to replace law librarians, and I've read of libraries starting projects to use IBM Watson at reference desks.

Academic libraries are unlikely to draw hard core data scientists as employees, but we are usually blessed to be situated near pockets of talent and research scientists who can collaborate with the library.

As universities start offering courses focusing on analytics and data science, you will get hordes of students looking for clients to practice on, and the academic library is a very natural target.

Trend 2: Library systems are becoming more open and more capable at analytics

Recently, I saw someone tweeting that Jim Tallman, the CEO of Innovative Interfaces, had declared that libraries are 8-10 years behind other industries in analytics.

Well, if we are, a big culprit is the integrated library system (ILS) that libraries have been using for decades. I haven't had much experience poking at the back-end of systems like Millennium (owned by Innovative), but I've always been told that report generation is pretty much a pain beyond fixed standard reports.

As a sidenote, I always enjoy watching conventionally trained IT people come into the library industry and then hear them rant about ILS. :)

In any case, with the rise of open library service platforms like Alma and Sierra (though someone told me that all Sierra does is basically add SQL access, but that's a big improvement), more and more data can be easily uncovered and exposed.

A good example is Ex Libris's Alma analytics system. Unlike in the old days where most library systems were black boxes and you had great difficulty generating all but the most simple reports, systems like Alma and other Library Service Platforms of its class, are built almost ground up to support analytics.

You don't even have to be a hard core IT person to drill into the data, though you can still use SQL commands if you want.

With Alma you can access COUNTER usage statistics uploaded with Ustat (eventually Ustat is to be absorbed into Alma) using Alma analytics. Add Primo Analytics, Google Analytics or similar tools that most universities use, and a big part of the digital footprint of users is captured.

Alma analytics - COUNTER usage of Journals from one Platform

Want to generate users and the number of loans by school made in Alma? A couple of clicks and you have it.

Unfortunately there still seems to be no easy way to track usage of electronic resources by individual users, as COUNTER statistics are not granular enough. The only way is to mine EZproxy logs, which can get complicated, particularly if you are interested in downloads and not just sessions.
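To give a feel for what mining such logs involves, here is a minimal Python sketch. It assumes the log uses an NCSA-style layout (host, ident, username, timestamp, request, status, bytes) and that the username field is populated; your own LogFormat directive may well differ. The sample lines, the `looks_like_download` heuristic and the function names are all made up for illustration, and real vendor platforms need far more careful rules to tell a full-text download from an ordinary page view.

```python
import re
from collections import Counter

# Assumed layout: host ident user [time] "METHOD url PROTOCOL" status bytes
# Adjust this pattern if your EZproxy LogFormat directive differs.
LINE_RE = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)$'
)


def looks_like_download(url):
    """Crude, hypothetical heuristic: URLs that mention a PDF count as downloads."""
    path = url.split("?", 1)[0].lower()
    return path.endswith(".pdf") or "/pdf/" in path


def count_downloads(lines):
    """Return a Counter of successful PDF-like requests per username."""
    downloads = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed layout
        if match["status"] != "200" or match["user"] == "-":
            continue  # ignore failed requests and anonymous lines
        if looks_like_download(match["url"]):
            downloads[match["user"]] += 1
    return downloads


# Invented sample lines, not real log data.
SAMPLE_LOG = [
    '10.0.0.1 - alice [07/Nov/2016:10:00:00 +0800] "GET http://www.example.com:80/article/123 HTTP/1.1" 200 5120',
    '10.0.0.1 - alice [07/Nov/2016:10:00:05 +0800] "GET http://www.example.com:80/article/123.pdf HTTP/1.1" 200 734003',
    '10.0.0.2 - bob [07/Nov/2016:10:01:00 +0800] "GET http://www.example.com:80/pdf/456?download=true HTTP/1.1" 200 912345',
    '10.0.0.2 - bob [07/Nov/2016:10:01:30 +0800] "GET http://www.example.com:80/pdf/789 HTTP/1.1" 403 312',
]

if __name__ == "__main__":
    for user, total in sorted(count_downloads(SAMPLE_LOG).items()):
        print(user, total)
```

Even this toy version shows where the complication comes from: sessions are easy to count, but deciding what counts as a "download" needs a rule per platform.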

This is still early days of course, but things will only get better with open APIs etc.

A common item on top trends lists for academic libraries in recent years (whether lists by ACRL or the Horizon reports) is assessment and/or showing value, and library analytics has the potential to allow academic libraries to do so.

Both assessment (understanding in order to improve or make decisions) and advocacy (showing value) require data and analytics.

For me, the most stereotypical way for an academic library to show value would be to run correlations showing that high usage of library services is strongly correlated with good grades (GPA).
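As a rough illustration of what "running a correlation" means here, the Python sketch below computes a Pearson correlation coefficient between a usage measure and GPA. The numbers are invented for the example and the `pearson_r` helper is my own; a real study would use actual (and properly protected) student data and would need to remember that correlation only measures linear association, not cause.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / sqrt(var_x * var_y)


# Made-up illustrative numbers, not real student data.
loans_per_student = [0, 2, 5, 8, 12, 20]
gpa_per_student = [2.4, 2.9, 3.0, 3.2, 3.5, 3.6]

if __name__ == "__main__":
    r = pearson_r(loans_per_student, gpa_per_student)
    print(f"Pearson r between loans and GPA: {r:.2f}")
```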

So for example Nottingham Trent University provides all students with an engagement dashboard allowing them to benchmark themselves against others. Sources used to make up the engagement score include access to learning management systems and use of library and university buildings.

On the academic library side, we increasingly focus on the challenges of collecting, curating, managing and storing research data. There are rising fields like GIS and Digital Humanities that put the spotlight on data. We no longer focus just on open access for articles, but also on open data, if not open science.

While library analytics is a separate job from that of librarians involved in research data management, there is synergy to be had between the two job functions as both deal with data. Both jobs require skills in handling large data sets, protecting sensitive data, data visualization etc.

For example, the person doing library analytics can act as a client for the research data management librarian to practice on when producing reports and research papers. In return, the latter can gain experience handling relatively large datasets by doing analytics projects.

But what does library analytics entail? Here are some common types of activities that might fall under that umbrella.

Assisting with operational aspects of decision making.

Traditionally a large part of this involves collection development and evaluation.

In many institutions like mine it involves using Alma analytics, EZproxy logs, Google Analytics, gate counts and other systems that track user behavior.

This in many ways isn't anything new, though these days there are typically more of such systems to use and products are starting to compete on the quality of analytics available.

This type of activity can be opportunistic, ad hoc and in some libraries siloed within individual library areas.

Implementation and operational aspects of library dashboard projects

An increasingly hot trend: many libraries are starting to pull all their data together from diverse systems into one central dashboard using systems like QlikView and Tableau, or free JavaScript libraries like D3.js.

Typically such dashboards can be set up for public view or, more commonly, for internal users (usually within the library, ideally institution wide), but the main characteristic is that they go beyond showing data from one library system or function (so for example an Alma dashboard or a Google Analytics dashboard doesn't quite qualify as a library dashboard the way I define it here).

Remember I mentioned above that library systems are becoming more "open" with APIs? This helps to keep dashboards up to date without much manual work.

Setting up the dashboard is relatively straightforward technically speaking; more important is sustaining it. What data should we present? How should we visualize the data? Is the data presented useful to decision makers? How can we tell? Which levels of decision makers are we targeting? Should the data be made public?

This type of activity breaks down barriers between library functions though it can still be siloed in the sense that it is just the work of a University Library separate from the rest of the University.

Implementation or involvement in correlation studies, impact studies for value of libraries.

Such studies could be one-off studies, in which case arguably the value is much less compared to an approach like the University of Wollongong's Library Cube, where a data warehouse is set up to provide dynamic, up-to-date data that people can explore.

Predictive analytics/learning analytics

Studies that show the impact of library services on student success are well and good, but the next step beyond that, I believe, is getting involved in predictive analytics or learning analytics, which will help people (whether students, lecturers or librarians) use the data to improve their own performance.

I've already mentioned Nottingham Trent University's engagement scores, where students can log into the learning management system to look at how well they do compared to their peers.

The dashboard also is able to tell them things like "Historically 80% of people who scored XYZ in engagement scores get Y results".
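A statement of that kind is, at its simplest, a historical proportion: of past students in a given engagement band, what share ended up with each result? The Python sketch below shows that calculation on invented records. The band names, outcomes and the `outcome_rates` function are hypothetical, and I am not suggesting this is how Nottingham Trent's dashboard actually computes its figures.

```python
from collections import defaultdict


def outcome_rates(records):
    """Given (engagement_band, outcome) pairs, return {band: {outcome: share}}."""
    counts = defaultdict(lambda: defaultdict(int))
    for band, outcome in records:
        counts[band][outcome] += 1
    rates = {}
    for band, outcomes in counts.items():
        total = sum(outcomes.values())
        rates[band] = {outcome: n / total for outcome, n in outcomes.items()}
    return rates


# Invented historical records: (engagement band, final result).
HISTORY = (
    [("high", "good")] * 8
    + [("high", "poor")] * 2
    + [("low", "good")] * 3
    + [("low", "poor")] * 7
)

if __name__ == "__main__":
    for band, shares in outcome_rates(HISTORY).items():
        for outcome, share in shares.items():
            print(f"Historically {share:.0%} of '{band}' engagement students got '{outcome}' results")
```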

This type of analytics I believe is going to be the most impactful of all.

Hierarchy of analytics use in libraries

I propose that the activities I list above are listed in increasing levels of capability and perhaps impact.

It goes from

Level 1 - Any analysis done is library function specific. Typically this is ad hoc analytics, but there might be dashboard systems created for only one specific area (e.g. a collection dashboard for Alma or a web dashboard for Google Analytics).

Level 2 - A centralised library wide dashboard is created covering most functional areas in the library

Many academic libraries are at Level 1 or 2 and a few leaders are at level 3 or even level 4.

Analytics requires deep collaboration

This way of looking at things, I think, misses an important element. I believe that as you move up the levels, silos increasingly get broken down and collaboration increases.

For instance, while you can easily do analytics for specific library functions in a siloed way (level 1), building a library dashboard that covers library-wide areas breaks down the silos between library functions (level 2).

In fact, there are two ways to reach level 2.

Firstly, libraries can go their own way and implement a solution specific to just their library. Even better is if there is a University wide platform that the University is pushing for and the library is just one among various departments implementing dashboards.

The reason why the latter is better is if there is a University wide push for dashboards, the next stage is much easier to achieve because data is already on the University dashboard and University wide there is already familiarity with thinking about and handling of data.

Similarly at level 3, where you show value and run correlation studies and assessment studies, you could do it in two ways. You could request one-off access to student data (in particular, you need cooperation for many student outcome variables like GPA, though there can be publicly accessible data like class of degree and Honours lists) or, if there is already a University wide push towards a common dashboard platform, you could connect the data together, creating a data warehouse. The latter is more desirable of course.

By the time you reach level 4, it would be almost impossible for the library to go it alone.

Conclusion

Obviously I've presented a rosy picture of library analytics. But as always new emerging areas in libraries tend to be at the mercy of the hype cycle. Though conditions seem to be ripe for a focus on library analytics, it's unclear the best way to organize the library to push for it.

Should the library appoint one person whose sole responsibility is analytics? But beware of the Co-ordinator syndrome! Should it be a team? A standing committee? A taskforce? An intergroup? It's unclear.

Monday, November 7, 2016

Recently, a researcher I was talking to remarked to me that University staff can be jumpy around copyright questions and some would immediately duck for cover the moment they heard the word "copyright". I'm not that bad, but as an academic librarian my knowledge of copyright is not as good as I want it to be.

But last month, I attended a great engagement session at my library by the Intellectual Property Office of Singapore (IPOS) and the Ministry of Law, where the speakers gave a great talk on copyright in Singapore and addressed some of these proposed changes. They managed to concisely summarize the copyright law in Singapore, the current situation (the irony of how the copyright law in Singapore pretty much copied the Australian one, which itself is based on the UK's, was not lost on the speaker) and the rationale for change.

Given that understanding basic copyright is going to be increasingly one of the fundamental skill sets needed by academic librarians, I benefited a great deal from attending.

Like in the UK law, I believe the proposed change will also disallow restriction of text data mining via contract.

Why is this proposed change important?

One of the most common issues we face today is that more and more researchers are starting to do text data mining (TDM) on content in our subscribed databases, whether in newspaper databases (e.g. Factiva), journals (e.g. ScienceDirect) or other resources.

Many researchers, I find, aren't quite aware that for the most part, when the library signs an agreement for access, the rights granted exclude TDM (or do not state TDM as an allowed use).

Most databases we subscribe to also have a system to detect "mass downloads", and as such any TDM is most likely going to be detected (though I believe some researchers may try to bypass this by scripting human-like behavior).

Businesses are never ones to forgo a revenue opportunity, and many databases require that we pay an additional fee, known to be expensive, on top to allow TDM.

As text data mining can be done more easily via an API than by scraping data, another approach is to offer a guide to the APIs that can be used. One example is MIT's libguide:

http://libguides.mit.edu/apis

The proposed law would have two effects. Firstly, the status of researchers doing data mining of the open web was always hazy. In theory, if you mine, say, reviews on Blogger and use them for your research, I understand the content owners of the blog could possibly sue you for copyright infringement. The proposed changes clarify this and allow TDM (but not mere aggregation) of such data.

More interestingly, for data that researchers have legitimate access to, aka subscribed databases, there is no longer any distinction between reading an article and doing text data mining. And such a right cannot be excluded by contract by the vendors.

The data/position paper set out by the Ministry of Law/IPOS here is a great read, and it points out that if such a change comes into effect, it is likely that vendors who already charge for TDM will "price in" the cost of TDM because they can no longer exclude these rights.

Will the exception disadvantage libraries whose users won't do TDM?

There was an interesting Q&A afterwards mostly centering around the TDM exemption.

One of the more obvious points made was, is it necessarily desirable to put in these exemptions when it will lead to vendors "pricing-in" TDM rights for database packages automatically? While the bigger Universities and institutions would probably have staff that would do TDM, the smaller institutions would be unfairly affected resulting in higher prices for no benefit. Why not allow each institution to negotiate with vendors and allow exclusion of TDM depending on each institution's need?

I am sympathetic to this viewpoint.

But my current gut feel is that overall this will be beneficial.

Let me try out this line of argument.

Libraries tend to be in a far weaker negotiating position than the vendors (due to the fact that a lot of vendor material is unique), and what often happens is that under current law many libraries will simply play it safe and pay only for basic read access but not TDM, because it's very hard to predict who will want to do TDM, even for big Universities. Some librarians will even refuse on principle to pay for TDM.

So vendors will not be sure at first how much they are losing by not charging for TDM, as whatever they are getting now is probably less than true demand.

The proposed changes package everything into one, and that turns the negotiation into a game of chicken. The vendor might want to price things as high as possible and even recapture all the possible TDM revenue, but there is a need to compromise (anchored around current prices that exclude TDM) or they will end up earning nothing.

That should put a cap on overly exorbitant price increases, at least initially (though in future periods they might be able to properly estimate the real TDM demand and price accordingly). I suspect the net effect is that while prices will go up, overall a lot more TDM will occur, and if the intent is to encourage TDM and TDM generates sufficient benefits, that is a win.

But this is a wild guess.

I'm also wondering: once the law forbids vendors from preventing TDM after libraries have paid for lawful access to the database, can they say "Okay, you can now do TDM but only via method A (probably an API) and not via scraping or running a script to do automatic downloads via the usual human-facing interface"? This seems to suggest no.

It would be great if we could learn from the UK experience and I started asking around my usual international network of librarians but came up empty.

One librarian pointed out to me that even though the law was passed in 2014, given subscription cycles of 1 year or more, and research lag time, any such research is probably still in the works!

Still, I ask readers of my blog: if you work in the UK as an academic librarian, what was your experience like? Did you find that prices of the databases most often targeted for data mining started to rise even faster? Did the sales people reference the change in law as a reason? If you are a researcher in the UK who has done TDM under this law, what was your experience like?

Even anecdotes would be nice. You can comment below or send me emails privately if you like and I will preserve any anonymity.

What law are the contracts signed under?

Another, more damaging point that was brought up was this: when libraries sign contracts with database vendors, which jurisdiction's law will the contract be under? If the contract is under US law (fairly common?), the changes in the Copyright Act would have no sway over a breach of contract claim, effectively making the exception toothless.

I'm not a lawyer, so I do not know what would happen if a library were sued for breach of contract outside Singapore and damages were awarded against it.

Other comments and questions

The Q&A was a good exchange of opinions and views between the speakers and the audience (made up of faculty and librarians). Topics covered included open access (Gold open access is usually frowned upon by librarians in Asia, which I think is quite different compared to the West), copyright for MOOCs and more.

One interesting point made by the speaker was that he was a bit surprised to see that while there was organization on the author/creator side, with organizations like The Copyright Licensing and Administration Society of Singapore Limited (CLASS) and the Composers and Authors Society of Singapore (COMPASS) representing authors' rights, there wasn't such a group on the user side.

He suggested perhaps the Universities in Singapore band together to negotiate collectively on some agreed core content? Is this what we call a library consortium?

Then again Singapore is a really small market, so who knows perhaps the law would make little difference and vendors might just let it go?

Saturday, October 29, 2016

Despite writing a bit more on open access and repositories in the last few years, I find the issues incredibly deep and nuanced and I am always thinking and learning about them. As this is open access week, here are 5 new thoughts that occurred to me recently.

They probably seem obvious to many open access specialists but I set them out here anyway in case they are not obvious to others.

1. There are multiple goals for institutional repositories and supporting open access by accumulating full text of published output is just one goal.
I suspect that, like many librarians, I first heard of institutional repositories (IRs) in the context of open access. In particular, we were told to aim to support Green OA by getting copies of published output by faculty (the final published version if possible, if not the postprint or preprint). But in fact, looking back at the beginning of IRs and Open Access, things were not so straightforward.

a) “to serve as tangible indicators of a university's quality and to demonstrate the scientific, societal, and economic relevance of its research activities, thus increasing the institution's visibility, status, and public value” (Crow 2002)

All these goals are not mutually exclusive with the mission of supporting open access by accumulating published scholarly output but they are not necessarily complementary either.

For example, one can showcase the university output by merely depositing metadata without free full text, something that is occurring in many Institutional Repositories today that are filled with metadata of the scholarly output of their researchers with precious little full text.

Similarly, systems like Converis or Pure, or systems like Vivo that showcase institutional and research expertise, do not necessarily need to support open access.

It also seems that at the time Clifford envisioned an alternative route for IRs: to focus on collecting non-traditional scholarly outputs, including grey literature, instead of collecting published scholarly output. Following that vision, today most University IRs collect electronic theses and dissertations at the very least, others collect learning objects and Open Education resources, and many are beginning to collect datasets.

2. Self archiving can differ in terms of timing and purpose, and there are multiple views on how high rates of self archiving will eventually impact the scholarly communication system

Even if you agree the goal of IRs is to collect deposits of published scholarly output, there are still more nuances to why you are doing so and what your ultimate aims are.

At what stage are the papers deposited?

As a librarian with little disciplinary connections, I never gave much thought to subject repositories and focused more on institutional ones.

Most researchers who submit to subject repositories do so primarily with the goal of getting feedback, and this also speeds up scientific communication. While many papers in subject repositories are deposited and immediately submitted to journals for consideration, many are put up in a rawer form and are replaced by new versions many times before finally being submitted for publication, and many don't end up being submitted to any journal at all, hence making the term "preprint server" a bit misleading. All this is discipline specific of course.

Contrast this with IRs, where researchers rarely put up copies of their papers until the paper is accepted for publication or, more likely, already published. The goal here is to provide the scholarly poor with access to published or near-published scholarly output, and the carrot for researchers is the citation advantage of open access papers.

However as the papers in the IR are placed much later in the research cycle, they generally are already in finalised form and nothing much happens to them.

As Dorothea Salo's memorable paper Innkeeper at the Roach Motel states “[The institutional repository] is like a roach motel. Data goes in, but it doesn’t come out.” This line might also refer to point #4 below....

I am told that there really isn't any functional obstacle to IRs accepting preprints (in the sense of papers that are going through peer review but haven't been accepted yet, or haven't even been submitted for consideration for publication), but in actual fact this seldom occurs (though I'm sure there are examples, perhaps with say CRIS systems).

Two views of Green Open access

The motivation and final end game for self archiving in IRs also differ among people.

Even if one agrees IRs should only collect post prints (or the final published version if allowed) and the main aim is to provide access to published scholarly material, what is the ultimate goal or vision here?

Some would envision green open access thriving alongside the traditional publishing system today and for all time. In this view, green open access is not a threat to traditional publishing, and a status quo would result in which there is green open access self archiving in IRs while libraries continue to subscribe to journals as usual. Proponents point to the effect (or lack thereof) of high rates of self archiving in high energy physics on subscriptions in that area.

Another view doesn't see self archiving as being just for the sake of access; its proponents actually aim to eventually disrupt the current scholarly system. They believe that when "universal green OA" is achieved, we can leverage a favorable transition (in terms of costs/prices) to Gold open access (because the post-print version offers an alternative to the final published version). Without achieving universal green OA, flipping to Gold OA leads to "fool's Gold", and even if open access is achieved it comes at a very high cost.

This is of course the Stevan Harnad view. It's usually paired with the idea of an immediate deposit/optional access mandate, where all researchers need to deposit their paper at the moment of acceptance. In response to critics who say that publishers will not sit back and allow Green OA to prevail if it really catches on, and will start imposing embargoes, Harnad suggests countering that with a "Request a copy" button on such embargoed items.

I'm not qualified to assess the merits of these arguments, but it does seem to me that these two camps are essentially in conflict: one camp is telling publishers that they are under no threat from green open access and that no disruption is likely in the future, while the Harnad camp is trumpeting loudly what it intends to do once Green OA becomes dominant.

There is an even more radical purpose to collecting papers in repositories. If you read Crow's The Case for Institutional Repositories: A SPARC Position Paper, he actually suggests a far more radical idea than just collecting post-prints that have been published by publishers and being happy with the status quo, or even the Harnad idea of eventually flipping to Gold OA on favourable terms.

The future he suggests actually involves competing with traditional publishers. In such a model, researchers would submit papers into IRs and reviewers would review them as usual, but the key thing is that everything would be done through the repository, and universities and researchers could "take back" the scholarly publication system from traditional publishers.

3. Many of the disadvantages of local institutional repositories vs more centralised subject repositories or academic social networks like ResearchGate hinge on the lack of network effects due to poor interoperability

As I noted in a talk recently, academic social networks like ResearchGate are not new, and there was a flood of them in 2007-2009, including now defunct attempts by Elsevier and Nature.

Yet it is only in recent years that ResearchGate and Academia.edu seem to have become dominant.

The major reason why this is happening only in the last 2 years or so is that the field of competition has now narrowed to two major systems left standing, ResearchGate and Academia.edu (if you count Mendeley that's a third), and network effects are starting to dominate.

While it is true that if you consider the "denominator" of subject repositories (all scholarly output from a specific subject) or of say ResearchGate (all scholarly output?), they aren't necessarily doing better than institutional repositories (all scholarly output of that institution), in absolute terms the material centralised repositories have dwarfs that of most individual Institutional repositories.

As more papers appear in ResearchGate or a subject repository, network effects kick in. More people will visit the site to search, any social aspects and functionality (which ResearchGate has a ton of) will become even more useful, and even statistics become more useful.

How so? Put your paper in an IR like DSpace, and even if you have the most innovative developer working on it, with the most interesting statistics, you are still limited to benchmarking your papers against the pitiful number of papers (by the standards of centralised repositories) in your isolated institutional repository.

Put it on SSRN, or ResearchGate and you can compare yourself easily with tons more researchers, papers or institutions.

Above shows ranking of university departments in the field of Accounting.

In this way, the hosted network of repositories on Bepress Digital Commons actually seems the way to go compared to isolated DSpace repositories, because one can do the same types of comparison on the Digital Commons Network, which aggregates all the data across the various repositories using Digital Commons.

So my institution is currently on Bepress Digital commons and faculty put their papers on it.

So in the above example, I can see how well Faculty from the School of Accountancy here are doing versus various peers in the same field who also put their papers on their IR. Happily I can report, the dean of the accountancy school here is one of September's most popular authors in terms of downloads.

4. Interoperability among repositories is the only way to make network effects matter less

My meagre understanding of OAI-PMH is that it was indeed designed to ensure all repositories could work together. The idea was that individual repositories could host papers but others could build services that sat on top of them all, harvesting and aggregating all the output into one service.

I know it's fashionable to bash OAI-PMH these days and I would not like to jump on the bandwagon.

Still, it strikes me that a protocol that works only on metadata was, in hindsight, a mistake. Perhaps it was understandable to assume that all records in IRs would have full text, as the model back then was arXiv, which was full text. But as mentioned above, there were in fact multiple goals and objectives for IRs, and many became filled with metadata-only records due to this.

This made it really painful for aggregators when they tried to pull all the records together from various IRs using OAI-PMH, as they couldn't tell for sure whether there was full text or not. This is the main reason why systems like BASE can't tell with 100% certainty that a record they harvested has full text (I understand there can be rough algorithmic methods to try to guess if there is full text attached), and it's also the same reason why many libraries running a web scale discovery service can't tell if a record they have from their own IR has full text or not. (For the same reason, they also don't turn on other IRs that are available in their discovery index.)
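To make the problem concrete, here is a minimal Python sketch of the kind of rough guess an aggregator might make from harvested Dublin Core metadata. It works on a small, invented OAI-PMH ListRecords response (the repository URLs and identifiers are made up) and simply checks whether a record mentions a PDF in dc:format or dc:identifier. This is my own illustrative heuristic, not the method BASE or any discovery vendor actually uses, and it will clearly misjudge many real records.

```python
import xml.etree.ElementTree as ET

# Standard namespaces for OAI-PMH responses and Dublin Core elements.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented sample response with one PDF-bearing record and one metadata-only record.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Paper with a PDF</dc:title>
          <dc:format>application/pdf</dc:format>
          <dc:identifier>http://repository.example.org/bitstream/1/paper.pdf</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <header><identifier>oai:example.org:2</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Metadata only</dc:title>
          <dc:identifier>http://repository.example.org/handle/2</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def guess_full_text(record):
    """Rough guess: does this record's Dublin Core metadata hint at a PDF?"""
    formats = [el.text or "" for el in record.iterfind(".//dc:format", NS)]
    identifiers = [el.text or "" for el in record.iterfind(".//dc:identifier", NS)]
    has_pdf_format = any("pdf" in value.lower() for value in formats)
    has_pdf_link = any(value.lower().endswith(".pdf") for value in identifiers)
    return has_pdf_format or has_pdf_link


def classify(xml_text):
    """Map each record's OAI identifier to the full-text guess."""
    root = ET.fromstring(xml_text)
    result = {}
    for record in root.iterfind(".//oai:record", NS):
        oai_id = record.findtext("oai:header/oai:identifier", default="", namespaces=NS)
        result[oai_id] = guess_full_text(record)
    return result


if __name__ == "__main__":
    for oai_id, has_text in classify(SAMPLE_RESPONSE).items():
        print(oai_id, "probably full text" if has_text else "probably metadata only")
```

The fact that the aggregator has to guess at all, instead of reading a dedicated field, is exactly the weakness I am complaining about.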

In truth, making repositories work together involves a host of issues, from having standardized metadata (including subject, content type etc.) so that aggregators like BASE or CORE can offer better searching, browsing and slicing features, to ensuring that full text can easily "flow" from one repository to another and ensuring usage statistics are standardized (or can be combined?).

In fact, there are protocols like OAI-ORE and SWORD (Simple Web-service Offering Repository Deposit) that try to solve some of these problems. For example, SWORD allows one to deposit to multiple repositories at the same time and to do a repository-to-repository deposit, but I am unsure how well supported these protocols are in practice.

If individual repositories are to thrive, these issues need to be solved, allowing easy flow and aggregation of metadata, full text and perhaps usage statistics, allowing them to counter the network and size effects of big centralised repositories.

5. There seems to be a move towards integration across the full research cycle and/or into author workflows.
The pitch we have always made to researchers is this: give us your full text, we will put it online and you will gain the benefits (e.g. more visibility, the satisfaction of knowing you are helping science progress, or that you are pushing back against commercial publishers). But sadly that doesn't seem to be enough to motivate most of them.

So what can we do?

Integration with University Research management systems from and to repositories

Firstly, we can tell them we are going to reuse all the data they are already giving us. Among other things, we can use their data to populate CV/resume systems like VIVO. Since all the data is already there, we can use it for performance assessment at the individual, department and university levels by combining the data with citation metrics.

We can make it easier on the other end too. Instead of getting researchers to enter metadata manually, we can pull them into our systems using Scopus, Web of Science, ORCID or other systems that allow us to pull in researchers by institution.

What I describe above is indeed the idea behind a class of software currently known as CRIS (Current Research Information Systems) or RIMS (Research Information Management Systems). It is basically a faculty/research management workflow system that can track the whole life cycle of research, often including things typically done by other systems such as grants management, and that integrates with other institutional systems like HR or Finance systems.

The three main systems out there are Pure, Converis and Symplectic Elements. The point to notice is that these systems are not mainly about supporting open access, though that can be one of their functions.

For example, while Converis's publication module accepts publication full text, this full text isn't necessarily publicly available online if you do not get the Research portal module (this isn't mandatory). In the case of Symplectic, I understand it doesn't even have a public-facing component, but there are integrations with IRs like DSpace available.

But we can have more integrations than this.

Integration with Publisher systems to repositories

How about considering an integration between a publisher and an IR system? Sounds impossible?

The University of Florida has a tie-up with Elsevier where, using the ScienceDirect API, metadata from ScienceDirect will automatically populate their IR with articles from their institution. Unfortunately, the links on the IR will point to articles on the ScienceDirect platform. While a few will be open access, most will not be.

If you have ever tried to get a researcher to find the right version of the paper for depositing into IRs, you know how much of a game changer this will be.

Logically it makes so much sense: the publishers have the postprints already in their publication/manuscript submission systems, so why not give them to IRs? Well, the obvious reason is that we don't believe publishers would want to do that, as it's not in their best interest. Yet ...........

Integration with Publisher systems from repositories

Besides an integration from post-print to IR, the logical counterpart would be an integration from pre-print to publisher submission systems, and pre-prints often sit in subject repositories.

Indeed this is happening, as PLOS has announced a link between their submission system and bioRxiv.

In the same vein, the earlier mentioned overlay journals can be said to embody the same idea.

Integration with reference managers?

What other types of integration could occur? Another obvious one would be from Reference managers.

Elsevier happens to own Mendeley, so an obvious route would be people collaborating via Mendeley groups and, with a click, pushing the paper to a journal submission system.

ProQuest, which now owns a pretty big part of the author workflow, including various discovery services and reference managers like Flow and RefWorks, could do something similar. For example, I remember some vague talk about integrating Flow, their reference manager, with depositing theses into ETD systems.

Will a day come when I can push my paper from my reference manager or preprint server to the journal submission system and, when accepted, the post-print seamlessly goes into my IR of choice, and from my IR the data flows further into other systems for populating my CV profile and/or expert system?

I doubt it, but I can dream.

A 100% friction-less world?

Conclusion

This post has been a whirlwind of different ideas and thoughts, reflecting my still evolving thoughts on open access and repositories. I welcome any discussions, corrections of any misconceptions or errors in my post.

The one great strength of Institutional Repositories

On the GOAL mailing list, it was pointed out that the distributed nature of institutional repositories, which are owned by individual universities, was a great defense against monopolistic take-overs, as no single commercial entity could buy up all the institutional repositories in the world. No one could do with IRs what Elsevier did in purchasing SSRN, taking a big slice of the OA market (in certain disciplines) in one blow.

A response to that by a certain Eric F. Van de Velde caught my eye. He basically outlined why he thought institutional repositories would fail and why subject repositories or even commercial based sites like ResearchGate were winning out.

It resonated with me because I was coming to the same conclusion.

Last month, I found he had expanded his short reply into a post provocatively entitled "Let IR RIP".

How provocative? It begins "The Institutional Repository (IR) is obsolete. Its flawed foundation cannot be repaired. The IR must be phased out and replaced with viable alternatives."

Eric, as he explains, was an early believer in and advocate for the future of institutional repositories (going way back to 1999). This is someone who has managed and knows IRs and was hoping that they could eventually disrupt the scholarly communication system. Such a person now thinks IRs are a "dead end".

I don't have even a tenth of his experience in this field, but as a humble librarian working on the ground, I must concur with his points.

It seems to me that no matter how we librarians try, most researchers don't seem to have half the enthusiasm (assuming they had any in the first place) for depositing full text in institutional repositories that they have for subject repositories or even social networking sites like ResearchGate.

Why is this so? You should really read his post, but here's my rambling take from a librarian's point of view.

1. Institutional affiliations will change and control is lost when it happens.

Many faculty will move at least once in their career (twice if you include their time as a PhD student), so they have little incentive to learn how to use or manage their own local IR systems.

Compare this to someone who invests in setting up his profile and/or deposits in ResearchGate or SSRN. This is something they will own and control throughout their career no matter where they go.

ORCID helps solve part of this problem, but even in an ideal world where you update in ORCID and it pushes to various profiles, the full text has to exist somewhere.

And if you upload it to an IR, the moment you leave, you lose control of everything there. Some progressive IRs include public statistics like downloads and views of your papers, which is all well and good (especially if you are smart enough to create metadata records in multiple venues but link back to your IR copy) until you leave the institution and can't bring those statistics over to aggregate with your future papers.

Why would someone devote so much time to something they may not fully own? Compare this to someone setting up an SSRN/ResearchGate profile, where all the work you do and all the statistics you accumulate in terms of downloads etc will forever be with you, centralized in one place.

SSRN Statistics

Incidentally that's also why I suspect implementing the "copy request button" idea on institutional repositories tends to not work so well.

STORRE: Stirling Online Research Repository

For those of you who are unaware, the idea here is that you can (legally?) circumvent an embargo by adding the "copy request button". Just list the record (with no full text) on the repository, and a visitor to the metadata-only record can click on a "Copy request" button to instantly request a copy from the author. You as the author get the email, and you can either reply with the file or, in some systems, simply give approval and the file will be released automatically to the requester.

This idea works very well in theory, but in practice, when you leave an institution it is likely the IR will continue to list your old invalid email!

Since I started my profile in ResearchGate, I've gotten requests for theses and papers written when I was an undergraduate and later as a library masters student.

I would not have seen these requests if I relied on my old institution's IR "Copy request" buttons!

2. Lack of consistency across IRs

Though most University IRs use one of a relatively small set of common software packages such as Digital Commons, DSpace and EPrints, they can differ quite greatly depending on the customization and feature set, and this can be very off-putting to the researcher.

It's not just surface usability and features. Because there are no standards for metadata, content etc, it becomes, as Eric says, "a mishmash of formats" when you try to search across them using aggregator systems like CORE, BASE etc. Each IR will have its own system of classifying research, subjects, fields used etc. This is also something familiar to those of us who have tried to include IR contents in discovery services and found to our dismay that we often have to turn them off.

A researcher who wants to use the IR when he switches institutions will have to struggle with all this and why would he when he could use something more familiar that he has been using since his grad school days....

3. Subject/Discipline affiliations are stable while institution affiliations are not.

Subject Repositories have the advantage of greater familiarity to scholars and can have systems custom built for each researcher's community.

4. IRs generally lag behind in terms of features and sophistication

Not every institution is a rich top Tier 1 University that is capable of investing time and money to provide a useful and usable IR that can compete with the best in the commercial world.

For example, there's a belief (which I think might be justified but I have no evidence) floating around that it's better to put your outputs in ResearchGate, Academia.edu than in IRs because the former two have greater visibility in Google.

I'm no expert, but I find systems like ResearchGate and Academia.edu are just more usable. I've deposited to DSpace and Digital Commons systems before, and it easily takes me 30 minutes to get through a deposit, and I'm a librarian!

ResearchGate and company are also more aggressive in encouraging deposits, for example if I list a metadata only record, it will often check Sherpa Romeo automatically for me and encourage me to deposit when it's allowed.

Maybe there are DSpace, EPrints etc systems out there with such features, but the few I have used don't seem to do that. (CRIS systems do that I believe?)

While many find ResearchGate and Academia.edu annoying and intrusive, I think you can see they try to work on human psychology to encourage desired behaviors to deposit through gamification techniques or just evoking old fashioned human curiosity.

For example, ResearchGate can tell you who viewed your record and who downloaded and read your paper (if they were signed on while doing so), and you can even respond to such information by asking the identified readers for a review!

Not everyone thinks such features are a positive (privacy!), but the point here is that they are innovating much more quickly and IRs, at least the average IRs, are lagging. Often I feel it is akin to library vendors talking about bringing "Social features" into catalogues in 2012 and expecting us librarians to cheer.

Others, such as Dorothea Salo in Innkeeper at the Roach Motel, have long pointed out the many shortcomings of IR software like DSpace. Under the section "Institutional repository software", she lists a depressing inventory of problems with IRs.

These include poor UX, lack of tracking statistics, siloed repositories that lack interoperability, the lack of batch upload and download tools, and the inability to support document versioning (something subject repositories do decently well), which means faculty won't use IRs, not even for the final version.

Add outdated protocols like OAI-PMH (which Google Scholar ignores) and the reality that most IRs are a mix of full text and metadata-only records, rather than 100% full text as envisioned, and IRs have had an uphill task.

Most of the above was written back in 2007, and I'm unsure if much has changed since then.

5. IRs lack mass

When was the last time you went specifically to the IR homepage to do something besides deposit a paper?

How about the last time you decided to go to your IR homepage to search for a topic?

IRs just simply don't have enough central mass (one institution's output is insignificant even if it was all full-text) to be worth visiting to browse and search compared to say a typical Subject repository.

As such, the most common way for a user to end up on an IR page, or more likely just a PDF download, is via Google Scholar.

Is this a problem? In a way it is because the lack of reasons for authors to visit the IRs means that any possible social networking effect is not present and as the saying goes out of sight, out of mind.

Conclusion

I would like to say here that I fully respect the efforts and achievements of my colleagues and librarians around the world who directly manage IRs. It can't be an easy task, particularly since many may be labouring under what The Loon calls the coordinator syndrome (though hopefully this problem has diminished over the years given that scholarly communication jobs are better understood; see also the tongue-in-cheek "How to Scuttle a Scholarly Communication Initiative").

The point here is that while some IRs have achieved some success, e.g. MIT hitting 44% of total output deposited (and consider that MIT is an early pioneer and leader of the open access movement), many have attracted only the most minimal amount of deposits.

Perhaps this is purely anecdotal, but my impression is while you can find researchers who put their papers on Subject repositories/Social networking researcher sites AND institution repositories (aka researchers who just crave visibility and are willing to juggle multiple profiles and sites) or those who just put in the former only, it's rare to find those who only put things in the IR and nowhere else.

Various studies (e.g. this and this) are starting to show that more free full text resides on sites like ResearchGate than in institutional repositories.

This doesn't augur well.

I'm not saying though it's not possible to coerce researchers to deposit into IRs.

For example, an Immediate-Deposit/Optional-Access model like that adopted by the University of Liège seems to achieve much success by making researchers deposit all their papers on publication, whether or not they can be released as open access immediately or at all. This, coupled with an understanding that papers not submitted to the IR will not be considered for performance purposes, seems sufficient to produce high rates of compliance.

However, doing so goes against the wishes of researchers, who seem naturally not to favor open access via IRs and, it seems to me, would rather do it via subject repositories, ResearchGate or even gold OA (if money is available).

A lot of the problems I suggested for IRs can have solutions; more standardization of IRs would be one. More resources poured into UX work to understand the needs and motivations of researchers is another. Librarians can either push or pull full text to/from subject repositories on behalf of authors (via SWORD), or perhaps work out a way to aggregate statistics across repositories. I've read that COUNTER is working on standardising download counts, but I wonder if one could have an ORCID-like system that aggregates such COUNTER statistics for all papers registered to you?
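Just to think aloud about what such an aggregation might look like, here is a toy sketch. It assumes every repository could report downloads keyed by an author identifier and a DOI; the report rows, identifiers and field names below are all invented, and this is not a real COUNTER or ORCID data format.

```python
from collections import defaultdict

# Invented download reports from three venues; every value here is a placeholder.
REPORTS = [
    {"repository": "Old institution IR", "author_id": "0000-0000-0000-0001",
     "doi": "10.1234/example.1", "downloads": 120},
    {"repository": "New institution IR", "author_id": "0000-0000-0000-0001",
     "doi": "10.1234/example.2", "downloads": 45},
    {"repository": "Subject repository", "author_id": "0000-0000-0000-0001",
     "doi": "10.1234/example.1", "downloads": 300},
    {"repository": "Subject repository", "author_id": "0000-0000-0000-0002",
     "doi": "10.1234/example.3", "downloads": 10},
]


def downloads_by_author(reports):
    """Sum downloads per author across every repository that reported them."""
    totals = defaultdict(int)
    for row in reports:
        totals[row["author_id"]] += row["downloads"]
    return dict(totals)


def downloads_by_paper(reports, author_id):
    """Sum downloads per DOI for one author, regardless of where the copy lives."""
    totals = defaultdict(int)
    for row in reports:
        if row["author_id"] == author_id:
            totals[row["doi"]] += row["downloads"]
    return dict(totals)


print(downloads_by_author(REPORTS))
print(downloads_by_paper(REPORTS, "0000-0000-0000-0001"))
```

The arithmetic is trivial. The hard part is getting every repository to count downloads the same way and to key them on the same identifiers, which is exactly the standardization problem.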

But one wonders: perhaps this is a space librarians should cede if other methods work better.

With the rise of solutions like SocArXiv, bioRxiv and engrXiv, perhaps institutions should start running or sharing responsibility for aggregation of output at higher levels, such as via subject repositories or even national repositories?

Of course, we all agree "solutions" like ResearchGate and Academia.edu are not solutions at all because they are owned by commercial entities and might disappear at any moment.

But is it possible to have both the advantage of scale and centralization and yet be immune if not resistant to take-overs by commercial entities? Can subject repositories be the solution?

In any case let me end off with Eric's words.

"The IR is not equivalent with Green Open Access. The IR is only one possible implementation of Green OA. With the IR at a dead end, Green OA must pivot towards alternatives that have viable paths forward: personal repositories, disciplinary repositories, social networks, and innovative combinations of all three."

What do you think? Are institutional repositories a dead end? Or are they needed as part of the ecosystem alongside subject repositories? I am frankly unsure myself.

Additional note: As I write this, there is some discussion about the idea of retiring IRs for CRIS. The idea seems to be that instead of running two systems that barely talk to one another, one should opt for an all-in-one system. There is grave suspicion among some about such a move because of the entities who own the software. How this factors into my arguments above I am still mulling over. On a personal note, I will be taking a month off my usual blogging schedule and will resume in Oct 2016.

Friday, July 8, 2016

2016 seems to be the year Sci-Hub has broken out into popular consciousness. The service that provides access to academic papers for free, often dubbed "The Napster" of academic papers by the media, is having its moment in the sun.

To me, though, the most interesting bit was finding out how much usage of Sci-Hub seems to be by people (either researchers or academics) who have access to academic library services.

In Science's "Who's downloading pirated papers? Everyone", John Bohannon in the section "Need or convenience?" suggests "Many U.S. Sci-Hub users seem to congregate near universities that have good journal access."

The % of usage from each country within University IP ranges varies but it is surprisingly high for some countries like Australia (where just below 20% of Sci-Hub usage comes from University IP ranges).

We can't tell if users with access to academic libraries are using Sci-Hub because their library doesn't provide immediate access and they are too lazy to wait for document delivery, or, worse, because they just find it easier to use Sci-Hub than to fiddle with library access!

(As an aside, this is why it is truly a bone-headed move by publishers to suggest Universities introduce more barriers like two factor authentication to access articles. That's going to drive even more people away!)

But I'll bet one reason most users don't use library subscriptions to access articles is that our systems generally don't make it easy to access articles if users don't start searching via library systems (discovery services, databases etc). Roger C. Schonfeld's "Meeting Researchers Where They Start: Streamlining Access to Scholarly Resources" is a great recent exploration of these issues, and is unusual because it comes from a publisher and is hence useful for explaining the issues to other publishers, since it comes from one of their own. (Most librarians working in this space are aware of these issues.)

This method is lightweight, works on most browsers including many mobile ones (though the initial setup can be tricky), and you can do some fancier tricks to track usage, but essentially this idea has been around for years. (As a sidenote, the earliest mention I can find of this idea is in 2005 by Tony Hirst of the Open University, UK.)
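For the curious, the core of the trick is nothing more than sticking the library's proxy prefix in front of the address of the page you are on. Here is a rough sketch of that logic in Python (a real bookmarklet would be a line of JavaScript). The `ezproxy.example.edu` host is a placeholder, and I am assuming an EZproxy-style `login?url=` prefix, so check what your own library uses.

```python
# Placeholder prefix; your library's proxy host and login path will differ.
PROXY_PREFIX = "https://ezproxy.example.edu/login?url="


def add_proxy(url: str) -> str:
    """Return the same page address routed through the library proxy."""
    if url.startswith(PROXY_PREFIX):
        # Already proxied, so leave it alone rather than double-wrapping.
        return url
    return PROXY_PREFIX + url


print(add_proxy("https://www.example.com/article/1"))
```

Note that nothing here knows whether the library actually subscribes to the site, which is exactly why the trick sometimes fails.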

A quick search in Google or YouTube will find hundreds of academic libraries that mention or offer a variation of this idea, though I highly suspect for many it's an experiment someone set up and quickly forgot without popularizing much (with some exceptions).

2. UU Easy Access Chrome extension - An improved proxy bookmarklet in the form of a Chrome extension

Firstly, users did not understand why the proxy bookmarklet would occasionally fail. Part of it was that they would proxy pages where it made no logical sense (for example trying it on Scribd, institutional repositories, free abstracting and indexing sites) because they were taught "whenever you are asked to pay for something, click the button". They loved it when it worked but were bewildered when it didn't.

Failure could also occur for certain resources where the subdomain or even domain was slightly different depending on the country or institution you were from (e.g. Lexis Nexis sites).

Secondly, occasionally the library would have access to the full text of an item via another source, but users would land on a different site where proxying would lead to an error.

A very common scenario would be someone landing on a publisher site via Google when the library has access via an aggregator like ProQuest or EBSCO. Users would happily click on the proxy bookmarklet, fail and give up, thinking the library didn't have access.

While some institutions might see fewer such failures (e.g. the bookmarklet tends to work more often at bigger institutions that have "everything" and subscribe mostly through publishers rather than aggregators), in general failures can lead to a lot of confusion, and users might lose confidence in the tool after failing many times and not knowing why.

The next idea, from Utrecht University, avoids the first issue and provides what I consider the next step in the evolution of the proxy bookmarklet idea.

Their solution is UU Easy Access - a Chrome extension, currently in beta.

The Chrome extension avoids the first problem described above, where users are confused about when they can add the proxy, by natively including a list of domains that can be proxied. When you land on such a page, it will recognise the page and invite you to proxy it.

You can also click on the extension button to try to proxy any site, but it will check against the list of allowed domains and display an informative message if the site isn't allowed to be proxied.

This is much better than a system that makes you log in and then issues a typically cryptic message like "You are trying to access a resource that the Library Proxy Service has not been configured to work."

I've found users sometimes interpret this message as saying the library just needs to configure things and they will then be able to access the item they want.
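The check the extension does can be sketched in a few lines. This is only my guess at the general shape of the logic, not UU Easy Access's actual code; the domain list, proxy prefix and message wording are all placeholders.

```python
from urllib.parse import urlsplit

# Placeholder values; a real extension would ship the library's own list and prefix.
PROXIABLE_DOMAINS = {"sciencedirect.com", "jstor.org"}
PROXY_PREFIX = "https://ezproxy.example.edu/login?url="
NOT_ALLOWED_MESSAGE = "This site is not one the library can provide access to via the proxy."


def is_proxiable(url: str) -> bool:
    """True if the page's host is a listed domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in PROXIABLE_DOMAINS)


def proxy_or_explain(url: str) -> str:
    """Return the proxied address, or a plain-language message if not allowed."""
    if is_proxiable(url):
        return PROXY_PREFIX + url
    return NOT_ALLOWED_MESSAGE


print(proxy_or_explain("https://www.sciencedirect.com/science/article/pii/X"))
print(proxy_or_explain("https://www.scribd.com/doc/1"))
```

Checking before sending the user to a login page means the user gets the explanation up front instead of after entering a password.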

Still, installing a proxy bookmarklet is also somewhat clunky compared to installing an extension and less savvy users might not be able to follow the instructions on their own.

Currently UU Easy Access only has a Chrome extension and does not yet support Firefox.

3. LibX - A browser plugin to aid library access

Both methods #1 and #2 above are unable to deal with the fact that a user may have access to full text via a source other than the site they are on. In such a case, adding the proxy will still fail.

LibX, a project licensed under the Mozilla Public License, can occasionally work around the issue.

Some of the nice features it has include

Function to proxy any page you are on (same as the bookmarklet)

Support autolinking for supported identifiers such as ISBNs, ISSNs, DOIs,

autocues that show availability of items on book vendor sites like Amazon, Barnes & Noble and

So, for example, if a page has an embedded identifier like a DOI or PMID, it will be hyperlinked such that when you click on it, you will be sent to your library's link resolver and redirected to the appropriate copy that you can access, wherever it is.
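As a rough illustration of the autolinking idea (not LibX's actual implementation), here is a sketch that spots DOIs in a piece of text and wraps them in links to a link resolver. The resolver address is a placeholder, the DOI pattern is a deliberately loose approximation, and I am assuming the resolver accepts an OpenURL-style `rft_id=info:doi/...` parameter.

```python
import re
from urllib.parse import quote

# Placeholder address; substitute your own library's link resolver.
RESOLVER_BASE = "https://resolver.example.edu/openurl"

# Loose approximation of a DOI: "10.", a registrant code, a slash, then a suffix.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')


def resolver_link(doi: str) -> str:
    """Build an OpenURL-style link asking the resolver to find this DOI."""
    return RESOLVER_BASE + "?url_ver=Z39.88-2004&rft_id=" + quote("info:doi/" + doi, safe="")


def autolink_dois(text: str) -> str:
    """Wrap each DOI found in the text in a hyperlink to the link resolver."""
    def replace(match):
        raw = match.group(0)
        doi = raw.rstrip(".,;)")      # drop sentence punctuation caught by the pattern
        trailing = raw[len(doi):]
        return '<a href="' + resolver_link(doi) + '">' + doi + "</a>" + trailing
    return DOI_PATTERN.sub(replace, text)


print(autolink_dois("See doi:10.1000/xyz123."))
```

The resolver, not the page you happen to be on, then decides which copy you are entitled to, which is what lets this approach get around the aggregator problem.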

Most would agree that Google Scholar is probably one of the easiest ways to find free full text: just pop the article title into Google Scholar and see if there is any PDF or HTML link at the side of the results. With its huge index (thanks to permissions from many vendors to crawl full text), unbeatable web crawling and the ability to recognise "scholarly work", it is capable of finding free articles wherever they lurk on the web and is not restricted to simply finding free PDFs on scholarly sites or institutional repositories.

Add the ability to see if your institution has access to a subscribed version via the presence of a link resolver link (as most academic libraries support Google's Library Links program), and Google Scholar is the ultimate full text finder.

Never used Google Scholar before? Below is an example of a result

Highlighted in yellow is the free full text, "Find it@SMU Library" - provides full text via the library link resolver

But what happens if you don't start from Google Scholar and land on a page that is asking you to pay and you are too lazy to open another tab and search for the article in Google Scholar? Use the Google Scholar button released by Google last year instead.

On any page, you can click on the Google Scholar button extension and it will attempt to figure out the article title you are looking for, run the search in Google Scholar in the background and display

a) the free full text (if any)
b) the link resolver link (if your library has a copy of the article)

If the title detection isn't working or if you want to check for other articles, say in the references, you can highlight the title and click on the button.

A secondary function is the ability to create citations similar to the "cite" function in Google Scholar.

1. The citation option supports over 900 styles compared to just a handful in Google Scholar button

2. Ability to block non-scholarly sites for a period (for self control)

3. More sharing options to not just reference managers but also to Facebook etc

4. Many more I probably missed out.

Here's what it looks like

I'm really impressed by the variety of functions; the main criticism I can make is that, with its very complicated interface, it might be overkill for many users.

For example in the above example, under the Full text check, you see 8 options!

The official site says "The green icons are non-PDF full texts that Lazy Scholar is highly confident are 100% free, whereas the yellow icon means that Lazy Scholar is moderately confident that it is a free full text".

The EZ icon next to it allows you to add the proxy string to the URL (like the bookmarklet) and the icon with books is the link resolver link scraped from Google Scholar.

Offhand, I would say it would be cleaner just to offer, say, the top 3 options (including the link resolver option) and hide the rest under a dropdown menu.

Still, it's crazy impressive for a personal project by someone who has no ties to any libraries. The variety of sources/APIs he pulls from and uses is seriously amazing.

Many are well known, such as Altmetric.com and Google Scholar, but some are lesser-known comment and annotation systems like PubPeer, Hypothes.is etc, or even, dare I say, pretty obscure ones like DOAI (Digital Open Access Identifier), which tries to resolve to a free version of a paper.

Conclusion

Can we ever make our systems to access articles truly 100% seamless and frictionless? Even within-campus or with VPN (off campus), users can still find it tough to determine if we have access to full text via alternative venues.

Anyone know of other useful tricks or tools that can help?

Perhaps this is one of the other attractions of open access: in a world where open access is dominant, we need not waste time and effort creating these workarounds to make access friendly.