more on aggregating article metadata

In a post on the Pegasus Librarian blog about the EBSCO land grabs, a very interesting discussion ensued on, among other things, what libraries could or should be doing to serve our users while maintaining control of what we’re doing and its costs, instead of entrusting our fate to monopolizing ‘big deal’ vendors who lock us in with no escape route.

One person brainstormed the idea of cooperative cataloging (à la that facilitated by putative and actual cooperatives like OCLC and its more numerous predecessors). Here’s my contribution to the discussion:

Good conversation. Here’s a hypothesis:

We probably don’t need to create a cooperative metadata creation initiative for article-level metadata, because that metadata (of varying quality, but my hypothesis is “good enough”) is ALREADY out there in the digital world. It’s already been created: pretty much every publisher these days has electronic metadata for the articles it publishes. We just need to _collect_ it. And in many cases, we don’t even need a special business relationship or license to collect it, as the metadata is already being shared open access. That doesn’t mean that collecting and aggregating it in a useful way is cheap or easy. It is a non-trivial project that could benefit from some cooperative economies-of-scale action, but it’s not a ‘cataloging’ or metadata _generation_ project exactly.

Consider the JournalTOCs service. Many, many publishers these days provide RSS feeds with metadata for their recent publications. By consuming these feeds and storing what it gets over time, JournalTOCs is building a giant database of article metadata. It only goes back as far as when they started collecting it, but that’s still pretty good. My impression, however, is that JournalTOCs is looking for a way to monetize this at a profit, rather than provide it on a cooperative cost-sharing basis.
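To make the "consume the feeds and store what you get over time" idea concrete, here is a minimal sketch in Python. It is not how JournalTOCs actually works. The sample feed, the table layout, and the function names `open_store` and `harvest` are all assumptions made up for illustration, and a real harvester would also have to fetch feeds over the network and cope with the many RSS and Atom variants publishers use.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for a publisher's RSS 2.0 table-of-contents feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Journal of Examples</title>
    <item>
      <title>An Example Article</title>
      <link>http://example.org/articles/1</link>
      <guid>http://example.org/articles/1</guid>
    </item>
    <item>
      <title>Another Example Article</title>
      <link>http://example.org/articles/2</link>
      <guid>http://example.org/articles/2</guid>
    </item>
  </channel>
</rss>"""


def open_store(path=":memory:"):
    """Open (or create) the article store; the schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "guid TEXT PRIMARY KEY, journal TEXT, title TEXT, link TEXT)"
    )
    return conn


def harvest(conn, feed_xml):
    """Store every item in the feed and return how many were new."""
    channel = ET.fromstring(feed_xml).find("channel")
    journal = channel.findtext("title", default="")
    added = 0
    for item in channel.findall("item"):
        link = item.findtext("link", default="")
        # Fall back to the link when a feed omits <guid>.
        guid = item.findtext("guid") or link
        title = item.findtext("title", default="")
        cur = conn.execute(
            "INSERT OR IGNORE INTO articles (guid, journal, title, link) "
            "VALUES (?, ?, ?, ?)",
            (guid, journal, title, link),
        )
        added += cur.rowcount  # 1 if inserted, 0 if the guid was already stored
    conn.commit()
    return added
```

Running `harvest` on a schedule against each feed only adds items not seen before, so the store grows from whatever day you start collecting, which is the "Year Zero" limitation in practice.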

Or consider OAIster, as I think Dorothea mentioned. It is also a giant collection of article-level metadata, although also not necessarily going back very far historically. (Historical articles are more valuable in some fields than others; however, as with JournalTOCs, as the years march on, the Year Zero at which this kind of metadata collection starts recedes further into the past.) However, it is also something that, while it originated as a community-benefit project at a university, has been transferred to OCLC, a vendor that many of us think generally _acts_ much like other vendors looking to monopolize and monetize for greatest profit, despite its stated mission and organizational structure of cooperatively sharing on a cost-recovery basis. (Hint: An entity which acted like its primary mission was cooperatively sharing on a cost-recovery basis would be EAGER to share its metadata with all comers, if it resulted in overall reduced costs to the members in their businesses, even if some of those costs were to shift to other entities. Can you imagine OCLC doing such a thing on purpose?)

There are definite possibilities for building freely shared stores of article-level metadata that are not hobbled by vendor lock-in, but instead shared as diversely and widely as possible to facilitate library technology experimentation and innovation. Definite possibilities, but still not cheap or easy: it will require investment in both R&D and actual implementation. This is exactly the sort of thing that could benefit from a cooperative project funded on a cost-recovery model, to achieve economies of scale, but still under our own control and direction rather than being locked in to a monopolizing vendor.

But libraries do not seem organizationally competent to or capable of coordinating their efforts in such a fashion anymore on large scale projects. In fact, we’re sending our projects in the other direction, with OAIster becoming a monetized for-profit ‘product’ instead.

One notable exception is HathiTrust, which seems to be having some success at a cooperative cost-recovery model. Ironically, HathiTrust comes from the same institution that gave up OAIster. Although it makes sense that any given institution can only spearhead one thing, I wish they had worked harder to find better hands to entrust OAIster to. (If HathiTrust itself were a service offered by OCLC, I think we can all be confident it would cost, oh, four or five times as much as membership in HT actually does.) And HathiTrust of course has its own handicaps in being reliant on so much data from Google, a vendor trying to enforce its own kind of vendor lock-in through its agreements for use of its scans; the data isn’t really unencumbered.

At the end of the day, somehow, the technical effort needs to be funded.

You are quite right when you say “But libraries do not seem organizationally competent to or capable of coordinating their efforts in such a fashion anymore on large scale projects” – I had experience of exactly that with Intute.

We (JournalTOCs) have also had contact with Mendeley. Originally, CrossRef were going to get more involved. This hasn’t so far worked out. ResearchGate and Academia.edu also have similar initiatives. There’s also the very interesting Knowledge for All Universal Citation Index: A Proposal for the Global Library Community http://library.upei.ca/k4all which we’ve had contact with.

Mendeley could kick in at least 35M records now, and possibly more to come. It’s also worth noting that Mendeley is working on a process whereby an author can use Mendeley to facilitate one-click deposition of material into the author’s local IR.

Data in the Mendeley catalog is licensed CC-BY (http://dev.mendeley.com/docs/license), so while there is a string, we’d hope that simply mentioning that you got some data from us isn’t too much of a burden.

That sounds pretty excellent, thanks Mr. Gunn, very good to know. If only I had the time/resources to do something with it! But I might eventually, or else I hope somebody else does.

Not about Mendeley in particular, but I’ve been unclear on CC-BY applied to data like this in general. Let’s say I get some data from someone licensed CC-BY. I put it in a database and mix and match and shuffle it around with data from other sources, such that I no longer can even tell what came from where. Is that okay? Now, if I were to redistribute my mashed up database, would I just need to credit Mendeley (and possibly a bunch of other CC-BY-licensed providers mashed in there) when distributing the database, and also tell people that the mashed aggregate database is licensed CC-BY and that the complete list of CC-BY contributors needs to be kept intact? What if I’m not redistributing the database, but just providing an application backed by the database: does my application need to provide credit to that complete list too? On the footer of every single page backed by that data? Just the home page?

I should make a blog post just about this issue; I’ve been confused about the actual practical effects of CC-BY on data in a database in general. But if you want to share Mendeley’s interpretation of what they’d expect here, I would be curious. I’m not really sure what providers of CC-BY-licensed data _expect_ in these situations. I hope they don’t expect that, for every individual element of data, I keep track of exactly who I got it from. Constantly tracking where each element came from as I merge and mash would make it a lot harder and more expensive to have a database that combines data from multiple sources.
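For a sense of what even coarse source tracking involves, here is a rough Python sketch of a merge that remembers which providers contributed to each record, and can produce one combined credit line. It takes no position on what CC-BY legally requires. The `MergedRecord` type, the DOI-keyed matching, the "first provider wins" rule, and the credit wording are all assumptions for illustration; tracking every individual field's origin would need considerably more bookkeeping than this.

```python
from dataclasses import dataclass, field


@dataclass
class MergedRecord:
    doi: str
    fields: dict = field(default_factory=dict)
    sources: set = field(default_factory=set)  # providers that contributed


def merge(store, source_name, records):
    """Fold one provider's records into the store, noting who supplied each."""
    for rec in records:
        key = rec["doi"].lower()  # assume DOIs match case-insensitively
        merged = store.setdefault(key, MergedRecord(doi=key))
        # First provider to supply a field wins; later ones only fill gaps.
        for name, value in rec.items():
            if name != "doi":
                merged.fields.setdefault(name, value)
        merged.sources.add(source_name)
    return store


def attribution_line(store):
    """Build a single credit line naming every contributing provider."""
    names = sorted({s for rec in store.values() for s in rec.sources})
    return "Includes data from: " + ", ".join(names)
```

Record-level tracking like this is cheap, but note what it loses: once two providers' fields are merged into one record, the sketch can say who contributed to the record, not who supplied which value.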

Yes, that’s precisely why the Creative Commons is now recommending CC0 as the preferred license to maximize reuse.

However, I might suggest from a practical point of view that a statement that data was obtained from Mendeley, placed on the page where you display results or on the about page of your application, would be fairly lightweight and doable.

Actually, I heard a rumor that CC is about to change their position and find ways to support attribution and other restrictions. If so, I think that’s ill-conceived and doomed to failure.

But good to hear what Mendeley would expect… or what you think Mendeley would expect, don’t know if that’s an official answer. I guess if I was then re-distributing my mashed together database…. well, honestly what I’d say is “Some portions are claimed as copyrighted and licensed CC-BY by Mendeley [and X, Y, and Z]. I am not quite sure what that means for you. I make no claims to this data, and it’s fine with ME if you do whatever you want with it.”

If that rumor is true, you have better sources than I. I spoke to someone who’s in a position to know just a few days ago and was told that CC0 is the way they’re going.

As far as what you can say, all you need to say is “Some data provided by Mendeley”, and provide a link to us. Certainly don’t say that the data is copyrighted by Mendeley, as that’s not true and I wouldn’t want anyone to think that.

Ah, right, the question is if I’m _redistributing_ the data, if someone else is downloading my data (portions of which hypothetically came from Mendeley) — what do I need to tell them about their license?

The only way you can license something under CC-BY, as I understand it, is if you own copyright on it (or have been assigned rights to license by the copyright holder). Thus the mention of a claim of copyright.

But that’s my confusion about the requirements of CC-BY with regard to redistribution. If I have, say, an ordinary textual article in English that I got from someone else under CC-BY, then I know what to do: I can tell others, sure, you can use this article, it’s licensed by the original author under CC-BY, so you can use it under those terms. If I have some data that was licensed to me under CC-BY, and I combine it with a bunch of other data and mix and match in my database, can I tell others they are allowed to use my database? Under what restrictions? I don’t really understand how it would work, and don’t want to try to interpret either the laws of it or the intent of the original claimed licensor (Mendeley in this hypothetical example), so I’d probably just throw up my hands and say “It’s fine with ME if you use this data, but some of it came from Mendeley, who told me it was licensed under CC-BY. I can’t tell anymore which parts came from Mendeley and which didn’t, and I can’t really tell you what you’re allowed to do with the parts that did come from Mendeley, so maybe just talk to them. And to the other [hypothetical] 3 or 4 sources that other parts of the data came from.”

In reality, this would probably discourage anyone from using my database, being unsure of what the legality of using it is.

Yes, that’s really the problem of CC-BY and data, as opposed to articles. It really is a grey area, but it would be a tragedy if someone said they couldn’t use your resource because they couldn’t or wouldn’t want to put a “Some data provided by Mendeley” link somewhere. The whole point of us making this collection is to spread the data as far and wide as possible and to make it as useful as possible, not to exert control. Mendeley just wants to be given some credit, like any author of a paper would be given.

The license used is under consideration, but I haven’t heard of any actual cases similar to this where the CC-BY license (which is acceptable under the Panton Principles and has an escape clause for non-copyrightable facts) has caused someone real problems in practice. If you have any case studies of this, please send them along.

I’ll talk to the others at Mendeley, make them aware of your project, and see what they think.

This is all just hypothetical, I don’t actually have a project yet! I don’t mean to be picking on Mendeley, it’s AWESOME you guys are sharing your data, seriously, thanks. It just provides a case to talk about the general issue.

An escape clause for non-copyrightable facts? So… if someone thinks all the citation data from Mendeley is a non-copyrightable fact… you’d expect them to justifiably not be bound by the BY license?

The Panton Principles actually say that “Creative Commons licenses (apart from CCZero)… are NOT appropriate for data and their use is STRONGLY discouraged.” They also say “it is STRONGLY recommended that data, especially where publicly funded, be explicitly placed in the public domain.” The principles don’t mention attribution-required licenses at all, either pro or con. So, yeah, I guess it’s “acceptable”, since the principles are mainly recommendations, not hard and fast rules to judge something on. But the CC license is discouraged, and public domain is encouraged.

I’ve got absolutely no problem crediting Mendeley in an ‘about’ page or even a page footer, in any (just hypothetical at this point!) app I created using Mendeley data. I’d do it just because Mendeley asked me to, it wouldn’t take a legal obligation. My problem is just that the legal obligations in a license like this make things really confusing for further multi-generation downstream use — re-uses of re-uses of uses, where every stage mixes together data from multiple sources as well as provides original enhancement. That’s the real awesome promise of open data, the multi-generational remixing to create awesome stuff, rather than just taking someone’s database and keeping it as an integral unit. But to really get that going, people need to be SURE they’ve got the legal rights to do it, and attribution (or even more so non-commercial, certainly) licenses make things really confusing, when applied to re-mixed data.

“So… if someone thinks all the citation data from Mendeley is a non-copyrightable fact… you’d expect them to justifiably not be bound by the BY license?”

Yes, that would be the implication, and I’ve heard there may be some policy developments on this front, actually.

“That’s the real awesome promise of open data, the multi-generational remixing to create awesome stuff, rather than just taking someone’s database and keeping it as an integral unit.”

I totally agree. That’s the whole point and purpose of the API – you can get the stuff piecemeal. Anyways, the license is basically just to keep away the jerks, not honest folks like yourself, so if you ever move this beyond a hypothetical, I hope it doesn’t present a barrier. If you find that it does, in practice, then consider it up for reconsideration.

Ah, yes. I should have specified in the beginning that the statement I heard was also referring to scientific data, like gene sequences and such. Still a bit murky exactly what that means, but as long as you’re not being a jerk, probably easier to ask for forgiveness than permission.