Google unveils search engine for open data

September 06, 2018

The tool, called Google Dataset Search, should help researchers to find the data they need more easily.

Davide Castelvecchi

Google has unveiled a search engine to help researchers locate online data that is freely available for use. The company launched the service on 5 September, saying that it is aimed at “scientists, data journalists, data geeks, or anyone else”.

Dataset Search, now available alongside Google’s other specialized search engines, such as those for news and images — as well as Google Scholar and Google Books — locates files and databases on the basis of how their owners have classified them. It does not read the content of the files themselves in the way search engines do for web pages.

Experts say that it fills a gap and could contribute significantly to the success of the open-data movement, which aims to make data openly available for use and re-use.

Government agencies, scientific publishers, research institutions and even individual researchers maintain thousands of open-data repositories around the world, containing millions of data sets.

But researchers who want to know what types of data are available, or who hope to locate data they know already exist, often have to rely on word of mouth, says Natasha Noy, a computer scientist at Google AI in Mountain View, California.

This problem is especially serious for early-career researchers who are not already “plugged” into a network of professional connections, Noy says. It’s also a downside for those who do cross-disciplinary research — for example, an epidemiologist who needs access to climate data that could be relevant to the spread of a virus.

Classified search

Noy and her Google colleague Dan Brickley first described a strategy for solving that problem in a blogpost in January 2017.

Typical search engines work in two main stages. The first is to index the available pages by continuously trawling the Internet. The second is to rank those indexed pages, so that when a user enters search terms, the engine can provide results in order of relevance.
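The two stages can be sketched in a few lines of Python. This is a toy illustration only, not how Google or any production engine works: the pages, the URLs and the scoring rule (counting occurrences of the query terms) are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical toy "web": page URL -> page text (illustrative only).
PAGES = {
    "a.example/climate": "climate data climate records temperature",
    "b.example/virus": "virus spread data epidemiology",
    "c.example/ocean": "ocean temperature data",
}


def build_index(pages):
    """Stage 1: map each term to the pages that contain it, with counts."""
    index = defaultdict(Counter)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term][url] += 1
    return index


def search(index, query):
    """Stage 2: rank pages by how often the query terms occur in them."""
    scores = Counter()
    for term in query.lower().split():
        # .get avoids creating empty entries for unknown terms.
        scores.update(index.get(term, Counter()))
    # Highest score first; ties are broken alphabetically by URL.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [url for url, _ in ranked]


index = build_index(PAGES)
print(search(index, "climate temperature"))
```

Real engines replace the simple count with far more elaborate relevance signals, but the split between indexing and ranking is the same.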

To help search engines index existing data sets, Noy and Brickley wrote, data owners should ‘tag’ them using the standardized vocabulary of Schema.org, an initiative that Brickley manages and that was founded by Google and three other search-engine giants (Microsoft, Yahoo and Yandex). The Google team also developed a special algorithm for ranking data sets in search results.
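As a rough illustration of what such tagging can look like, the Python sketch below builds a description of a data set using the Schema.org "Dataset" type and prints it as JSON-LD, a format commonly embedded in web pages for structured data. The property names are real Schema.org terms, but the helper function, the example values and the example.org URLs are hypothetical, and the article does not say which properties Google's tool relies on.

```python
import json


def dataset_jsonld(name, description, url, keywords, license_url,
                   download_url, encoding_format):
    """Build a Schema.org Dataset description as a plain dictionary.

    This helper is hypothetical; only the property names come from
    the Schema.org vocabulary.
    """
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "keywords": list(keywords),
        "license": license_url,
        "distribution": [
            {
                "@type": "DataDownload",
                "encodingFormat": encoding_format,
                "contentUrl": download_url,
            }
        ],
    }


# Made-up example values; none of these refer to a real data set.
record = dataset_jsonld(
    name="Example sea-surface temperature records",
    description="Illustrative daily sea-surface temperature readings.",
    url="https://example.org/datasets/sst",
    keywords=["ocean", "temperature", "climate"],
    license_url="https://creativecommons.org/publicdomain/zero/1.0/",
    download_url="https://example.org/datasets/sst.csv",
    encoding_format="text/csv",
)

markup = json.dumps(record, indent=2)
print(markup)
```

A search engine that indexes metadata of this kind can match a query against the name, description and keywords without ever opening the data file itself.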

Given Google’s dominance in web searching, news that the company was moving into the data ecosystem quickly prompted major players to fall in line and standardize their metadata, says Mark Hahnel, chief executive of the data-sharing company Figshare in London. (Figshare is operated by the Holtzbrinck Publishing Group, which also has a majority share in Nature’s publisher.)

“By November, all the universities we’re working for had their stuff marked up,” Hahnel says. “I think this is a game changer for open data in the academia.”

Funding agencies sometimes mandate that research data must be made available, and they are going to reach their ultimate goals only if the information is effectively recoverable, he says. “It legitimizes what the funders have been trying to do.”

Agency partnerships

An early supporter of Google’s experiment was the US National Oceanic and Atmospheric Administration (NOAA). The agency’s remit ranges from fisheries to the Sun’s corona, and its archives contain nearly 70,000 data sets — including ship logs from the 1800s. The trove adds up to more than 35 petabytes, comparable to the content of 35,000 typical hard drives.

Google’s tool will help NOAA to meet its open-data mission, says NOAA Chief Data Officer Edward Kearns in Asheville, North Carolina. “We want to explore new ways to make those data available to others,” Kearns adds.

For Dataset Search to work, the collaboration of data owners was crucial. Although the system might become more sophisticated in the future, Google currently has no plans to read or analyse the data themselves, as it does with web pages or images. “A search tool like this one is only as good as the metadata that data publishers are willing to provide,” Noy says.

Like Google Scholar, Dataset Search currently offers no application programming interface (API) for automated querying, although the company says that it might add that functionality in the future.

Noy says that as researchers begin to use Dataset Search, Google will watch how they interact with it and use that information to improve the search results. The company has no current plans to monetize the service, she says.

As Dataset Search evolves, it might also become integrated with Google Scholar, so that search results on a particular study could link to relevant data sets.