As every librarian knows, there are three main sources of citation data. The three citation indexes (in increasing order of size) are Web of Science, Scopus and Google Scholar.

However, they are not the only sources, and recently I noticed studies showing that two other sources, ResearchGate and Microsoft Academic Search, are getting large enough to be worth considering.

Could they possibly complement Google Scholar to serve as alternatives to paid indexes?

ResearchGate

While Mendeley offers “readers” as a statistic (basically the number of people who have a paper in their reference library), their citation data comes directly from Scopus.

Mendeley’s citations are supplied by Scopus

In contrast, ResearchGate, perhaps the biggest social networking site for academics, provides its own citation metrics.

But how good is this citation index? A recent study in Scientometrics (OA version) examining 86 Information Science and Library Science journals from January 2016 to March 2017 found that while ResearchGate found fewer citations than Google Scholar, it generally found more citations than paid indexes like Web of Science or Scopus.

Microsoft Academic Search

As you will see later, many bibliometrics experts are very excited by this service. But why?

Microsoft describes it thus in a 2015 paper: “At the core of MAS is a heterogeneous entity graph comprised of six types of entities that model the scholarly activities: field of study, author, institution, paper, venue, and event. In addition to obtaining these entities from the publisher feeds as in the previous effort, we in this version include data mining results from the Web index and an in-house knowledge base from Bing, a major commercial search engine.”

In short, it seems to offer the best of both worlds, using Google Scholar-style crawling technology combined with publisher-based metadata feeds to build large indexes, with an attention to metadata fields similar to what you get from library-type databases. On top of that, unlike the well-known difficulty of extracting data from Google Scholar, you can do so easily in Microsoft Academic Search via an API or by downloading the Microsoft Academic Graph (MAG).
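To make the extraction point concrete, here is a minimal sketch of how a query against the Academic Knowledge API's "evaluate" method looks. The endpoint URL, the `expr`/`attributes` parameters, and the attribute codes follow Microsoft's documentation of the time, but treat them as assumptions; the sample response below is made-up illustrative data, not real results, and a real call needs your own free subscription key.

```python
import json

# Endpoint and parameter names as documented for the Academic Knowledge API
# (assumed from documentation of the era; verify before use).
API_URL = "https://api.labs.cognitive.microsoft.com/academic/v1.0/evaluate"

def build_query(author_name, attributes="Ti,Y,CC,Id"):
    """Build query-string parameters for an author search.

    Attribute codes per the API docs: Ti = title, Y = year,
    CC = citation count, Id = internal paper id.
    """
    return {
        "expr": f"Composite(AA.AuN='{author_name}')",
        "attributes": attributes,
        "count": 10,
    }

# A response resembling the API's JSON shape (illustrative sample only):
sample_response = json.loads("""
{"entities": [
  {"Ti": "example paper one", "Y": 2015, "CC": 42, "Id": 111},
  {"Ti": "example paper two", "Y": 2016, "CC": 7,  "Id": 222}
]}
""")

def total_citations(response):
    """Sum the CC (citation count) attribute over the returned papers."""
    return sum(paper.get("CC", 0) for paper in response["entities"])

print(total_citations(sample_response))  # 49
```

Compare this with Google Scholar, where there is no API at all and any large-scale extraction means brittle screen-scraping.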

I’ve been playing around with the service, setting up my own profile, kicking the tires, etc.

I’m not ready to do a full review yet, but it does look promising despite the bugs.

Some preliminary things I noticed

Search-wise, the number of results generated seems closer to what you would see searching in library databases. For example, a search for the terms web scale discovery gets you around 111 results, but you get 2.6 million in Google Scholar! It seems unlikely that the Microsoft Academic Search index is smaller by such a degree (see later), so it is probably because it does not search and match within full text (for whatever technical reason).

And this was confirmed via Twitter:

@aarontay We default to semantic search over full text, so less but more accurate results. If semantic fails we do fall back to full text though

The other major difference between it and Google Scholar is that Google Scholar shows [citation] results, or items that are not indexed, while Microsoft Academic Search does not.

[Citation] results in Google Scholar

All this perhaps explains why you get fewer results compared to Google Scholar.

The size of Microsoft Academic search versus the rest

Microsoft claimed an index of 83 million publication records in 2015, and by 2016 this had risen to 140 million publication records, 40 million authors and 60,000 journal titles. As estimates for the size of Google Scholar’s index typically fall into the 100+ million range (it’s notoriously hard to get any hard facts on the size of Google Scholar), Microsoft is now seemingly in the same ballpark, and significantly bigger than Scopus and Web of Science, which are perhaps 60%–70% of its size.

But that’s what is claimed; what does the research by Harzing and others show?

Harzing, of course, is well known as the author of the free “Publish or Perish” tool, the only tool allowed by Google to extract citation data from Google Scholar. She has now added support for Microsoft Academic Search in version 5.0.

For example, her blog post studying coverage of her own works finds that practically all her publications that are indexed in Scopus (all but 2) and Web of Science (all but 1) are also in Microsoft Academic Search. On top of that, Microsoft Academic Search finds 30 and 43 more of her works than Scopus and Web of Science respectively. Google Scholar still dominates Microsoft Academic Search though.

In her more comprehensive study of 145 Associate Professors and Full Professors at the University of Melbourne, she examines citation counts to the works of these authors.

In general, Scopus and Web of Science detect slightly more citations than Microsoft Academic Search in the Life Sciences (11% more) and Sciences (7% more), and pretty much tie for Engineering, but Microsoft Academic Search beats the other two handily in the Humanities (170% of Scopus) and Social Sciences (145% of Scopus). Google Scholar clearly dominates all, as usual.

Harzing goes on to explain that Microsoft Academic Search uses machine learning to drop citations that it can’t verify are true cites, and attempts to correct for this to “estimate ‘true’ citation counts”. This leads to the following comparison.

When looking at this estimated true citation count (MA ECC), Microsoft Academic Search actually finds more citations than Google Scholar in the Life Sciences and just barely loses out in Science and Engineering. But Google Scholar continues to dominate in the Social Sciences and particularly the Humanities. This is probably due to the impact of Google Books for book-related items.

Completeness of metadata fields

But I suspect the interest in Microsoft Academic Search is not purely based on the size of the index. After all, Google Scholar still seems to have the edge in size.

Rather, the interest lies in the fact that the service is now sufficiently big, combined with the richness of its metadata and the ease of extracting the data, both areas where Google Scholar is extremely poor. The only officially licensed tool for Google Scholar, Publish or Perish, is often unreliable and cannot be used for large-scale extraction, for example.

Similarly, Citation Analysis with Microsoft Academic goes in depth to assess the suitability of Microsoft Academic Search as a bibliometric tool in terms of the completeness of its metadata fields and the ease of access to data for extraction.

In general, the results are positive: it’s far easier to extract and manipulate data than with Google Scholar, whether through the API or by downloading the Microsoft Academic Graph, which has much richer and more structured data available. Even something like having internal Microsoft Academic Search-assigned IDs for “papers, references, authors, affiliations, fields of study, journals and venues” is very helpful.
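A small sketch of why those stable internal IDs matter: with an ID key you can join papers to journals and aggregate citation counts directly, with no fuzzy matching on journal names or titles. The records below are hypothetical rows modeled loosely on MAG-style tables (the real Microsoft Academic Graph ships as large tab-separated dumps; the field names and values here are made up for illustration).

```python
from collections import defaultdict

# Hypothetical paper records keyed by internal IDs, in the spirit of MAG
# (illustrative data only, not real MAG rows or column names).
papers = [
    {"PaperId": 1, "JournalId": 10, "CitationCount": 5},
    {"PaperId": 2, "JournalId": 10, "CitationCount": 3},
    {"PaperId": 3, "JournalId": 20, "CitationCount": 8},
]
journals = {10: "Journal A", 20: "Journal B"}

def citations_by_journal(papers, journals):
    """Aggregate citation counts per journal by joining on JournalId."""
    totals = defaultdict(int)
    for paper in papers:
        totals[journals[paper["JournalId"]]] += paper["CitationCount"]
    return dict(totals)

print(citations_by_journal(papers, journals))
# {'Journal A': 8, 'Journal B': 8}
```

This kind of join is exactly what Google Scholar's scraped, ID-less output makes painful, and it is what makes large-scale bibliometric work on the graph feasible.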

It’s not a perfect tool though. For example, examining the available attributes, the authors find that there is no document type attribute (which makes metrics that normalize by document type hard to compute), nor the very obvious DOI attribute (a strange omission). While there is a subject classification, “field of study”, it’s dynamically generated and far too specific (50,000 fields of study?).

Of course, ordinary users who just want to calculate their citation counts can use Harzing’s Publish or Perish v5 and above, with a free API key from Microsoft, to mine Microsoft Academic Search.

Conclusion

Currently, our citation sources consist of either paid services like Scopus and Web of Science, or free-to-access services run by commercial companies, namely Google Scholar and Microsoft Academic Search. Neither situation is ideal.

The OpenCitations Project is probably the best solution, but as of writing there is no study I know of quantifying the size of its index.

Still, one wonders if this might be the beginning of the end for paid citation indexes. Use of Scopus and Web of Science as discovery tools has greatly declined in recent years, and much of their value now lies in generating citation metrics.

As open access continues to march on, more and more content will be freely available. This will free up citations and references to be mined as well (albeit not always in structured format), so citation indexes will have to compete on data quality, feature sets and ease of use.

Players with strengths in handling and cleaning large datasets (e.g. Google) will of course have a big edge here. Traditional companies that serve libraries and academia may not be able to match this, but they do have strengths in terms of a better understanding of academics, so it’s going to be interesting to watch.