March 5, 2013

Why Google isn’t good enough for academic search

People often ask: why all the fuss about Search for academic papers? Google does a fine job, we can find everything we need, what’s the problem?

I gave an answer to this in a comment on Mike Taylor’s blog and it got a bit of Twitter pickup, so I am reposting my comment here for this audience. Summary: no one can build on the results!

Google isn’t an acceptable answer to Searching across academic papers (toll access, green OA, gold OA, whatever) because it doesn’t support a way for people to digest the search results, add value, and apply the results in new and innovative ways. Google search results can only be used on Google’s website manually, or embedded as-is in other websites.

Neither Google nor Google Scholar offers an API (for love or money, as far as I can tell; point me to it if I am wrong) that would let us do a Google search and then sort, filter, or enhance the results to add value and use them in research and in scholarly tools.

Totally unacceptable as a search solution for the scholarly literature. Think of the opportunity cost to research and research tools, and all the things that better research tools facilitate.

It doesn’t have to be this way. Search results can be openly available for reuse (see the search APIs and API terms of use for PLOS, PMC, etc).
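To make "sort/filter/enhance" concrete, here is a minimal Python sketch of what reusable results allow. It assumes a Solr-style search API that returns JSON records; the endpoint URL, the field names (`id`, `journal`, `publication_date`) and the sample records are illustrative assumptions, not something taken from any provider's documentation, so check the API docs and terms of use before relying on them.

```python
from urllib.parse import urlencode

# Assumption: a Solr-style search endpoint; verify against the provider's docs.
SEARCH_URL = "https://api.plos.org/search"


def build_query_url(query, fields, rows=50):
    """Build a search URL asking for specific fields as JSON."""
    params = {"q": query, "fl": ",".join(fields), "rows": rows, "wt": "json"}
    return SEARCH_URL + "?" + urlencode(params)


def filter_and_sort(docs, journal=None, since=None):
    """Keep records from one journal published on or after `since`, newest first.

    `docs` is a list of dicts, as you would get from the parsed JSON response.
    Dates are ISO 8601 strings, so plain string comparison orders them correctly.
    """
    kept = [
        d for d in docs
        if (journal is None or d.get("journal") == journal)
        and (since is None or d.get("publication_date", "") >= since)
    ]
    return sorted(kept, key=lambda d: d.get("publication_date", ""), reverse=True)


# Made-up placeholder records standing in for a real API response.
sample_docs = [
    {"id": "example-1", "journal": "Journal A", "publication_date": "2012-06-01T00:00:00Z"},
    {"id": "example-2", "journal": "Journal B", "publication_date": "2011-03-10T00:00:00Z"},
    {"id": "example-3", "journal": "Journal A", "publication_date": "2009-11-20T00:00:00Z"},
    {"id": "example-4", "journal": "Journal A", "publication_date": "2013-01-15T00:00:00Z"},
]

if __name__ == "__main__":
    print(build_query_url("title:sauropod", ["id", "journal", "publication_date"]))
    for doc in filter_and_sort(sample_docs, journal="Journal A", since="2010-01-01"):
        print(doc["id"], doc["publication_date"])
```

The point is not these particular filters. Once results arrive as data rather than as a web page, anyone can write a few lines like this and build something new on top.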

Also: Google (and other text-indexing search engines) are just not that good at non-trivial searches. If I already know what paper I want, Google is good at finding it by title. But if all I know is that I want late 19th-century monographs on Morrison-formation sauropods, it doesn’t know where to start. All the metadata needed to handle such searches is generally available for academic papers, but general-purpose search engines don’t know what to do with it.

Right. “Opaque” is a much worse problem than “incomplete”. It isn’t just that not everything is there, it’s that you can’t tell what’s there and what isn’t in any more systematic way than manually probing with searches.

None of this is to criticise Google, of course: it does an amazing job, and it’s even more amazing that its general-purpose approach works as well as it does on academic papers. But it’s not nearly enough, and it would be awful if people’s Google-acclimatisation led them to accept the level of its functionality as defining what’s possible.

It may not be sufficient, but it is the only place where one can search for a paper and come up with a small, independent, low-visibility journal next to the publishing giants, with no differentiation between the two. It has levelled the playing field for small journals, especially those from developing countries. I am so grateful for the service and what it has done for the democratization of knowledge that I am inclined to forgive the transgression. That said, I do agree with your overall sentiment and, in a perfect world, the inclusiveness and the openness would coexist.

I asked Google Scholar again recently if they’d changed their policies regarding the tracking of datasets/data DOIs in light of Thomson-Reuters now doing it and the recent letter in Nature, and they replied: “There has been no change at our end regarding indexing datasets”. Another wasted opportunity from them really, and whilst Thomson-Reuters get a lot of flak, they at least have put their money where their mouth is and bothered to make a data citation index.

Hello everyone – This is a very relevant and well-stated position. I wanted to know if anyone was planning on attending #btPDF2 (http://www.force11.org/beyondthepdf2), as these issues will be front and center. Also, if you’re unable to make it, the event will be live streamed.

Dear Heather, this is exactly the point I have been trying to make for the last 2 years. A recent paper in which I discuss the issues with Google Scholar and Microsoft Academic Search has been published in D-Lib: “CORE: Three Access Levels to Underpin Open Access.” http://www.dlib.org/dlib/november12/knoth/11knoth.html

Citing a section from the paper: “So, what is it that Google Scholar, Microsoft Academic Search and the mentioned cross-repository search systems are missing? What makes them insufficient for becoming the backbone of OA technical infrastructure? To answer this question, one should consider the services they provide on top of the aggregated content at the three access levels, identified in the Introduction, and think about how these services can contribute to the implementation of the infrastructure for connected repositories. Table 1 below shows the support provided by academic search engines at these access levels. As we can see, these systems provide only very limited support for those wanting to build new tools on top of them, for those who need flexible access to the indexed content and consequently also for those who need to use the content for analytical purposes. In addition, they do not distinguish between Open Access and subscription based content, which makes them unsuitable for realising the above mentioned vision of connected OARs.”

Thanks for the reference, Petr! I’ve been surprised there have been so few other people talking about this…. I guess we are all too dispersed to know of each other? Anyway, very glad to hear it and make the connections.