An obvious place to start looking for duplicates is a personal bibliography, but I’m not a regular user of Mendeley. However, I do have a collection of stuff on citeulike. Thankfully, citeulike users can synchronise their data with their Mendeley accounts, which saves re-entering publications (again). So I pulled my data from citeulike, entered my citeulike username into the importer and Bingo! I’m a Mendeley user – nice and easy. Looking closely at some of the papers, it’s easy to spot quite a few duplicates, and this problem hinges on the thorny issue of identity. Any given paper can be identified in different ways, and these identities need to be resolved. For example, all the identifiers below use different ways to identify the same paper:

Part of what all reference management software does is resolve multiple identities to the same thing. Database people call this normalisation – and it can be tricky to do with web data. In citeulike, for example, all of these different IDs are ultimately recognised and normalised to the same unique thing: citeulike.org/article/3467077, which has been saved by 297 users. Citeulike, which currently stands at just over 4 million articles, is doing a reasonable job of detecting and merging duplicates, although it’s not perfect – it’s a hard problem. How does Mendeley compare on the same problem?
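As a rough illustration of what that normalisation involves (the identifier strings and the mapping below are invented for this sketch, not citeulike’s actual code), the core idea is reducing many equivalent identifier strings to one canonical key:

```python
# Minimal sketch of identifier normalisation: many IDs, one canonical record.
# The identifiers and prefix list below are illustrative assumptions.

def normalise(identifier: str) -> str:
    """Reduce an identifier string to a canonical key."""
    id_ = identifier.strip().lower()
    # Strip common DOI prefixes so the same DOI matches however it was entered.
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if id_.startswith(prefix):
            id_ = id_[len(prefix):]
            break
    return id_

# Three ways a user might enter the "same" paper:
ids = [
    "doi:10.1371/journal.pcbi.1000204",
    "http://dx.doi.org/10.1371/journal.pcbi.1000204",
    "10.1371/JOURNAL.PCBI.1000204",
]

canonical = {normalise(i) for i in ids}
assert len(canonical) == 1  # all three resolve to one record
```

Real reference managers have to do this for DOIs, PubMed IDs, arXiv IDs, URLs and free-text titles at once, which is where it gets hard.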

This particular paper may be an extreme case, but this kind of redundant duplication is certainly not uncommon. A quick search for some other papers reveals that many have at least one duplicate in Mendeley – which means there is room for improvement. But popular papers like Error bars in experimental biology (saved by 894 users) don’t seem to have any duplicates at all – maybe that’s why they appear to be so popular?

So how many unique papers are there in Mendeley? It depends on how many duplicates there are, and that’s quite difficult to calculate accurately. Some papers have zero duplicates, others have as many as seven. So Mendeley might have as few as ~20 million unique documents, or it might have as many as ~30 million – who knows? But it’s probably not as many as 36 million. Well, at least not just yet anyway…

Mendeley have been called out on this problem numerous times over the past several years and have always dodged it. One must wonder about their other figures as well. The company claims to have 500,000 users, but the API seems to reveal some problems with those numbers. http://pipes.yahoo.com/pipes/pipe.info?_id=6a5ba92b83b777964009f8eaf341ad9f only shows a handful of people from institutions like Harvard (19), Princeton (21), and Yale (2). Something doesn’t smell right. Reminds one of Soviet industrial and agricultural output figures.

Your recent blog post “How many journal articles have been published (ever)?” also makes me uneasy about M’s claimed article figures. Do they (or we) really think they have even 10% of the cumulative scholarly output, let alone 70%, which is what 35M of 50M would of course be?

Another way to look at this is with (my possibly faulty) math, and even using M’s own figures it doesn’t add up. 36,109,930 papers / 487,953 users = 74. That means that, on average, every single M user has 74 entirely unique papers, held by no one else. It would be surprising if each user had on average 74 papers at all in their libraries, but 74 unique ones seems highly implausible.
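The back-of-the-envelope division above is easy to check, using the figures quoted in this comment:

```python
# Sanity check of the papers-per-user arithmetic from the comment above.
papers = 36_109_930
users = 487_953

papers_per_user = papers / users
assert round(papers_per_user) == 74  # ~74 papers per user on average
```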

Every database has to deal with issues of duplicate content, but as far as duplicate papers go, we’re currently collapsing the duplicates into canonical papers and have this issue mostly solved, as you’ll see over the next few weeks. The stats we currently deliver on the web page are as complete as we can technically make them right now, and shouldn’t be that far off from the “true” number. Because the documents continue to come in at an exponential rate, time will substantiate the picture we’re painting. Likewise, when I search for Harvard at Mendeley, I get 90 results. Some of these will be mentions of Harvard in other places on their profile, but it’s quite easy to see that it’s higher than the 19 returned by the pipe, and remember that most users don’t yet list their institution on their profile, so the real figure is some multiple of that number.

The issues raised by BWG are important ones, but they’ve been asked and answered before (by the same people, even!) For those who may not be familiar with the conversation, just google “BWG Mendeley”.

Duncan, if anything is unsatisfactory to you about the answers I’ve given, please let me know. As far as I know, there’s no hard data on how many unique papers a researcher has in their reference manager (or filing cabinet), but it doesn’t seem unreasonable to me that it would be 74 or 100. We should do a better job making usage stats available, but I’d especially caution against trying to extrapolate in too global a fashion from the small but rapidly growing sample of users that Mendeley represents.

Every database has some level of duplicates. Facebook, for example, has inconsistencies in the number of users they report, but no one really complains about this, because the story is that they’re far and away the largest of a service whose most valuable attribute is the number of people who use it. Likewise, Mendeley has built a research catalog that will soon surpass the Web of Knowledge, and we’re letting anyone, man or machine, query it for free.

Hi William, thanks for your comments. First up, I’ve nothing against Mendeley, I think the move to make this data more open and queryable is a good and exciting one.

I’m disputing the 36 million nearly-as-big-as-Scopus-and-WoK claim because I don’t think the data currently supports this based on that example paper. If we assume the average number of duplicates is 1, this halves the size of Mendeley in terms of unique publications.
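The reasoning behind “an average of 1 duplicate halves the size” can be sketched numerically (the catalogue total is the claimed figure; the duplicate rates are assumptions for illustration):

```python
# Sketch: how the estimated number of unique documents shrinks as the
# assumed average number of duplicates per paper grows.
total = 36_000_000  # claimed catalogue size

def unique_estimate(total: int, avg_duplicates: float) -> float:
    # Each unique paper appears on average (1 + avg_duplicates) times,
    # so dividing the raw count by that factor estimates the unique count.
    return total / (1 + avg_duplicates)

assert unique_estimate(total, 1.0) == 18_000_000       # 1 duplicate each halves it
assert round(unique_estimate(total, 0.2)) == 30_000_000  # a milder 0.2 rate
```

The point is just that the headline number is very sensitive to the duplicate rate, which is exactly the quantity nobody outside Mendeley can measure.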

I look forward to seeing the new de-duplicated Mendeley, and I’d expect the figure of 36 million to go down a bit – by how much will be interesting to see. As for duplication being common, yes it’s a problem for everyone (citeulike suffers too, but it currently does a much better job than Mendeley IMHO).

Databases like Scopus and WoK presumably spend a lot of time and money manually and automatically recognising and removing duplicates, and I haven’t found any duplicates of individual *papers* in their databases yet. It’s different for individual *author* duplicates of course, but that’s a different story…

Duncan – The number on the site is correct. The average number of duplicates is not 1, but rather some small fraction of that, as duplicates aren’t that common to begin with and some deduplication has already been applied to the results. Duplicates are understandably enriched among the popular papers, such as yours, and it’s harder to go from 6 duplicates to 1 canonical document than from 2 to 1, because the variability is higher. However, even if the number were larger than 1, that still wouldn’t make news, IMO, because there’s no other crowd-sourced open research catalog even close to compiling what we’ve now released for free.

That said, it’s understandable that people will try to come up with their own back-of-the-envelope calculations where our data is sparse, and we should do a better job of defining what the numbers mean.

I’ve not done a comprehensive survey, but based on a few sample papers from my own library it seems duplication is fairly common. Here are two more examples: paper a and paper b.

I can’t see anything unusual about these papers: they are all in PubMed and have DOIs (as many other scientific papers do) – so it seems quite likely that this is a widespread problem in Mendeley, not just a quirk of my own library.

Duncan – Google’s index lags our catalog by some degree, so the extra link was probably indexed before our internal de-duplication became active. There are also many pages in our catalog that will not have yet appeared in Google’s index, so I would recommend against using this means of searching Mendeley Web, at least for now. As you can see from the same search at Mendeley.com, there are only two results reported, not the three you see in Google’s index. Two isn’t one, of course, but from what we can tell, the average number of duplicates is much less than one. We should know more about this in a few weeks.

Having made this observation about duplicates, how would you recommend it guide our development efforts? Should we stop accepting new documents until we have this sorted? Should we halt development on the citation style editor or the API and focus solely on deduplication? Should we not brag about ourselves to the media?

Here’s the thing: Mendeley is about 30 people, all gathered together in one room, trying to disrupt the stagnant, old publishing infrastructure which we all complain about. At the end of a long day, we need to believe that what we’re doing really can change the world, even if it’s just to explain to our spouses why we’re putting in another late night. Does that make sense?

Mr. Gunn… you write that “Duplicates are understandably enriched among the popular papers, such as yours, and it’s harder to go from 6 duplicates to 1 canonical document than from 2 to one, because the variability is higher.” … but I am pretty sure my papers are not that popular in the given example… still, I see quite a few duplications.

More importantly, please let us know how we can manually report duplication… I could not find out how to do this on the website. Why not crowd-source that part of the database too? You’ll probably find that people are eager to report such issues, as merging duplicates in one’s own set of publications will raise the readermeter.org statistics…

Just had a discussion with the developers here, and they agree crowdsourcing would be a great way to handle this problem. In fact, they already had a prototype of this under development. Does anyone have clever ideas for how to prevent abusive uses of a crowdsourced duplicate detection approach?

@Mr Gunn… the crowd-sourcing could be restricted to the identification of duplicates, which could then be curated by people at Mendeley… (e.g. with priority to paying customers)… I guess many incorrect merges could be easily detected… many of the fields must have a high similarity, and fields like DOI must be identical…

DOI, actually, isn’t the infallible indicator I used to think, because many people mis-enter the DOI manually to look up the metadata, so while the metadata and the DOI are consistent, the attached PDF is a different paper. If we could only use the DOIs extracted from papers, that would go some way toward handling the issue, but there remain so many things that don’t have DOIs that we really need a better solution.
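One common way to combine the two signals discussed here – an extracted DOI when available, a normalised title otherwise – looks roughly like this (a sketch with invented records, not Mendeley’s actual matching code):

```python
import re

def dedup_key(record: dict) -> str:
    """Build a dedup key: trust the DOI when present, else fall back
    to a normalised title (invented heuristic for illustration)."""
    if record.get("doi"):
        return "doi:" + record["doi"].lower()
    # Fallback: lower-case the title and collapse punctuation/whitespace,
    # so trivial formatting differences don't create separate records.
    title = re.sub(r"[^a-z0-9]+", " ", record["title"].lower()).strip()
    return "title:" + title

records = [
    {"title": "Error Bars in Experimental Biology", "doi": None},
    {"title": "Error bars in experimental biology.", "doi": None},
]
assert dedup_key(records[0]) == dedup_key(records[1])  # the two collapse to one
```

The weakness is exactly the one raised above: if a mis-entered DOI disagrees with the attached PDF, the DOI branch confidently merges the wrong papers, so the DOI alone can’t be the whole answer.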

The abuse potential via disgruntled academics trying to bury competitors is distant, but we still want to architect the system such that we can more easily handle problems when they arise. Probably a good start would be to allow Mendeley users to flag items that can’t be automatically identified, and do the cleanup manually on the back-end for those.
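A toy sketch of that flag-and-review idea (all names and the threshold are invented): users flag suspected duplicate pairs, and only pairs flagged by several independent users are escalated for manual back-end cleanup, which blunts abuse by any single account.

```python
from collections import Counter

# (doc_a, doc_b) -> number of distinct flag events for the pair
flags = Counter()

def flag_duplicate(doc_a: str, doc_b: str) -> None:
    # Sort the pair so (a, b) and (b, a) count as the same report.
    flags[tuple(sorted((doc_a, doc_b)))] += 1

def review_queue(min_flags: int = 3) -> list:
    # Pairs flagged by several users go to manual back-end cleanup;
    # a threshold makes one disgruntled flagger harmless.
    return [pair for pair, n in flags.items() if n >= min_flags]

for _ in range(3):
    flag_duplicate("doc42", "doc7")
flag_duplicate("doc1", "doc2")  # a single flag stays below the threshold

assert review_queue() == [("doc42", "doc7")]
```

A real system would also need to count distinct *users* rather than raw events, but the shape of the safeguard is the same.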

Mendeley (and other groups of people) are doing a great job to “disrupt the stagnant, old publishing infrastructure”… many people want this (including me), in fact it’s surprising it’s taking such a long time to happen. So here’s my wish list for Mendeley:

1. Sort out duplication. To my mind this is a serious problem that needs fixing – it will never be 100% perfect, but it could be much, much better. If Mendeley were as good as citeulike at doing merges, that would be a start.

2. Better statistics, which means recognising that size isn’t everything – quality matters too. I’d like to see more realistic statistics on how many unique documents there are in Mendeley. It would also be useful to know how much overlap there is between Mendeley, Scopus, WoK, PubMed etc. When Mendeley hits 40 million documents, will it really completely subsume them all?

Agreed on all points, Duncan. Deduplication is being sorted now, at least for documents which can be matched via title or that have an identifier, and further work is prioritized.

The stats are something where we have to balance usefulness against completeness. We’d like to get out the message that we’re growing, but releasing numbers often brings about a sort of horse-race mentality which I think is really distracting, would take a large amount of time and effort to maintain, and may invite comparisons between sets of numbers that aren’t really equivalent, such as between Mendeley and other sites that don’t address the same populations, have the same business model, etc. The numbers are for press releases, but the value of Mendeley is how much it improves your personal workflow.