A new kind of search

This article appeared in the February/March issue of Position magazine. It was written by Darren Mottolini, Dr Ivana Ivánová, and Tristan Reed. Ivana will present a plenary on Spatial Knowledge Infrastructures at #Locate19 at 9:10am on Wednesday April 10.

Linking datasets with structured metadata may remake the web as we know it.

Organising the world’s information is a formidable task, and for those who manage data, the work of publishing records and keeping them current and useful is never-ending. In late 2018, Google launched a new online search engine aimed specifically at the challenge of finding and accessing the right dataset among the ever-growing and increasingly fragmented array of online dataset repositories. A formidable undertaking indeed.

We have all experienced not being able to find the right data for our needs. The time spent searching, only to find that the data must be processed further to make it fit for purpose, costs business and government alike. Often, simply knowing the right person shortcuts the effort: you get your hands on the right data up front, without having to dig through a dormant data repository whose outdated metadata includes a disclaimer to beware of data inconsistencies. The cost of finding the right dataset and determining its usefulness is one we all bear, as noted by NBN Co, which purchased the Geocoded National Address File (GNAF) only to find that it did not specifically meet its engineering needs, even though it did meet other needs.

The Google approach

Google Dataset Search applies the same principles as Google Search (which are also used for Google Scholar). As long as data repositories include the right structure of metadata tags, Dataset Search will index that metadata for discovery.

The purpose of Google Dataset Search is to improve the discovery of datasets from sectors such as life sciences, social sciences, civics and government. Google plans to do this by ensuring publishers provide structured metadata, meaning each dataset must include supporting information that describes it. Using structured metadata to allow machine-to-machine linkage is an area of research FrontierSI (formerly the CRC for Spatial Information) has been engaged in for the past seven years.

While Google Dataset Search is a great advancement, the reality is that this new application only dips into what can be achieved through structured metadata and machine linkage. Although it is still early, Google’s work indirectly demonstrates that the research undertaken by FrontierSI has practical merit, while recognising there is more work to be done. Searching for the right dataset through smarter use of structured metadata is only the low-hanging fruit in optimising machine-to-machine links.

Improved search should let users pose ‘natural language’ phrases such as ‘what is the grain production output within the wheatbelt’ and not only reach the right dataset but, in the future, receive the right answer. We see this as the next level of research that Australia is well-positioned to lead. Let’s explore what is standing in our way, not only to improve which datasets we find but to help generate the answers we need.

Structured metadata?

You may be asking yourself: ‘Don’t we already have metadata standards?’ Searching for spatial datasets in the geo-information domain relies on the existence of dedicated catalogues (including metadata catalogues, geoportals or clearinghouses) and complex, standards-compliant metadata, such as ISO 19115.

Metadata is a structured collection of information fully describing the spatial resource, and includes information about the creator of the dataset, its spatial and temporal reference system, content, quality and constraints on its use. The ISO standard recommends a minimal metadata set which should serve for data discovery and identification, yet despite having a complex and exhaustive metadata standard, there are persistent and well-known problems with spatial data discovery.

Spatial metadata is scarce or, if available, not well maintained. Two major problems are responsible:

First, the use of standards is not mandatory, and even if mandated (e.g. by national or corporate Spatial Data Infrastructure ‘SDI’ policy), the standard does not specify a minimum metadata requirement. As such, it is frequently up to data producers to decide how much metadata and what information to provide.

Second, metadata is provided in specialised jargon, understandable only by geo-information professionals and often only those from the same specialised area as the producer.

To add further difficulty, searching for spatial resources relies on prior knowledge of these dedicated catalogues and where they can be accessed. Currently, using a mainstream search engine for this requires intricate, advanced knowledge of how to craft query strings that guide the engine to a specified data catalogue location. Once there, the search engine must further interact with the data catalogue’s system (for example, through an OGC Catalogue Service for the Web request: http://www.opengeospatial.org/standards/cat) to identify the right dataset based on the original query string.

There have been prior attempts to harmonise search for spatial datasets in dedicated catalogues with mainstream search engines. One such example is OpenSearch for GEO (http://www.opengeospatial.org/standards/opensearchgeo). However, searching for the right dataset that is fit for the user’s desired purpose continues to be a challenge in the geospatial domain.

An initiative within INSPIRE, the European SDI, to align geospatial metadata standards with the web Data Catalog Vocabulary (DCAT) demonstrates the desire to expose currently ‘invisible’ data repositories to the web and aligns with recent developments in mainstream search engines, such as Google Dataset Search.

Google has recommended the use of RDF models and DCAT vocabularies to set up and design structured metadata for published data, but what does this mean? RDF stands for “Resource Description Framework”, a metadata model used as a general method for expressing conceptual descriptions or models of information implemented in web resources. It is a knowledge management technique founded on the idea of describing resources in the form of a triple, consisting of a subject, a predicate and an object. In the example below, the subject is a property, the predicate expresses the relationship “isLocatedOn”, and the object is a street. The expression reads “a property is located on a street”: subject, predicate, object.

Fig 1. Simple RDF expression demonstrating an inferred link.
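The same triple can be sketched in a few lines of Python. This is only an illustration: the `http://example.org/` base address and the `Property` and `Street` identifiers are assumptions made for this sketch, and the output follows the simple one-statement-per-line N-Triples style of writing RDF.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF-style statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical identifiers that mirror the example in the text.
statement = Triple("Property", "isLocatedOn", "Street")


def to_ntriples(triple: Triple, base: str = "http://example.org/") -> str:
    """Write the triple as a single N-Triples-style line.

    The base address is a placeholder, not a real vocabulary.
    """
    parts = (f"<{base}{term}>" for term in triple)
    return " ".join(parts) + " ."


print(to_ntriples(statement))
```

Each term becomes a web address, which is what lets statements published by different organisations refer to the same ‘thing’.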

DCAT is an RDF vocabulary designed to facilitate interoperability between data catalogues published on the Web. DCAT does not make any assumptions about the format of the datasets contained within a catalogue but provides a standardised method of expressing the structure of a data catalogue and the metadata records of datasets within it.
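As a rough sketch of what a DCAT-style metadata record might look like, the Python below builds one record and serialises it as JSON-LD. The property names (`dct:title`, `dct:description`, `dcat:keyword`, `dcat:distribution`, `dcat:downloadURL`) come from the DCAT and Dublin Core vocabularies; the dataset itself, its description and the `example.org` address are invented for illustration.

```python
import json

# A hypothetical dataset record. Only the vocabulary terms are real;
# the title, description, keywords and URL are placeholders.
record = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/",
    },
    "@type": "dcat:Dataset",
    "dct:title": "Tree canopy cover (illustrative record)",
    "dct:description": "Placeholder description of a spatial dataset.",
    "dcat:keyword": ["tree canopy", "vegetation"],
    "dcat:distribution": [
        {
            "@type": "dcat:Distribution",
            "dcat:downloadURL": {"@id": "https://example.org/data/tree-canopy.csv"},
        }
    ],
}

# Serialise to JSON-LD text that could be embedded in a dataset's web page.
serialised = json.dumps(record, indent=2)
print(serialised)
```

Note that the record says nothing about the internal format of the data file; it only describes the dataset and where a copy can be obtained.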

How does it all work?

A collection of RDF statements intrinsically represents a directed graph data model. Such a model is suited to generating knowledge, both from understanding how data is linked and from using machine inference to ‘fill in the gaps’. Taking the example of a mining haulage machine and the interlinked assets and components required to build it, you can quickly map a graph data model by linking statements as follows:

Fig 2. Represented Graph Data Model.

Once the statements are linked, data analytics software (such as Apache Hadoop) can infer other linkages without anyone having to hard-code them into applications or build new look-up tables. In practice, RDF data is often stored in ordinary relational databases or in native RDF representations, which gives publishers a mechanism to start building and publishing their own RDF statements.
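To make the idea of inferring linkages concrete, here is a minimal sketch in plain Python rather than an analytics platform. The component names and the `hasComponent` relationship are hypothetical, and the sketch assumes that relationship is transitive: if a machine has an engine and the engine has a fuel pump, the machine has a fuel pump.

```python
# Hypothetical statements about a mining haulage machine.
triples = {
    ("HaulTruck", "hasComponent", "Engine"),
    ("Engine", "hasComponent", "FuelPump"),
    ("HaulTruck", "operatesAt", "MineSite"),
}


def infer_transitive(statements, predicate):
    """Return the statements plus every link implied by chaining `predicate`.

    Repeats until no new statement can be added (a simple fixed point).
    """
    inferred = set(statements)
    changed = True
    while changed:
        changed = False
        links = [(s, o) for (s, p, o) in inferred if p == predicate]
        for s1, o1 in links:
            for s2, o2 in links:
                candidate = (s1, predicate, o2)
                if o1 == s2 and candidate not in inferred:
                    inferred.add(candidate)
                    changed = True
    return inferred


expanded = infer_transitive(triples, "hasComponent")
# The truck-to-fuel-pump link was never stated, only inferred.
print(("HaulTruck", "hasComponent", "FuelPump") in expanded)
```

The link between the truck and the fuel pump never had to be written down or stored in a look-up table; it follows from the two statements that were published.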

So why structure metadata this way?

When searching for datasets to meet the requirements of a solution, alternative names matter. In a cadastre, for example, a property may also be called a lot, a land parcel, a land boundary, a property boundary, a title boundary, or several other possible terms, and these additional descriptions become important for a search engine. However, in most cases search engines do not consider that one ‘thing’ may be called many different things by different groups of people. If we were to describe a dataset as containing information on ‘tree canopy’, a user querying the search engine with a more general term such as ‘vegetation’ would not find it, and so would remain unaware of a dataset that may meet their requirements simply because it was described using other terms.

Google Dataset Search is currently a leader in this regard. For example, queries for ‘bore hole’ and ‘borehole’ yield effectively the same results. This contrasts with other dataset search engines in use, such as CKAN, which ignores all records containing ‘borehole’ if the search query is ‘bore hole’, and vice versa.

Through expressing metadata in a structured RDF format, vocabularies can be linked to elements of the metadata to ‘expand’ or broaden the content. For example, existing or expert-generated vocabularies describing alternative representations for ‘bore hole’, ‘cadastre’ or ‘tree canopy’ could be used to automatically expand the keywords listed in the metadata records for the cases discussed above.
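Keyword expansion of this kind can be sketched as follows. The small vocabulary here is hand-written for illustration and stands in for an existing or expert-generated vocabulary; the function names are likewise assumptions of this sketch.

```python
# A toy vocabulary mapping a keyword to alternative or broader terms.
# In practice this would come from a published, linked vocabulary.
VOCABULARY = {
    "tree canopy": {"vegetation"},
    "borehole": {"bore hole"},
    "property": {"lot", "land parcel", "title boundary"},
}


def expand_keywords(keywords):
    """Return the record's keywords plus any linked vocabulary terms."""
    expanded = set()
    for keyword in keywords:
        key = keyword.lower()
        expanded.add(key)
        expanded |= VOCABULARY.get(key, set())
    return expanded


def matches(query, keywords):
    """True if the query equals a keyword or one of its expansions."""
    return query.lower() in expand_keywords(keywords)


# A record tagged only 'tree canopy' is now found by 'vegetation'.
print(matches("vegetation", ["tree canopy"]))
```

The metadata author still writes only ‘tree canopy’; the broader and alternative terms are supplied by the linked vocabulary rather than typed into every record.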

Spatial data also intrinsically carries extra context, whether implied by the location of the data itself or given by the geographic extent the data covers, which may be specifically described in a metadata record. By applying the principles of RDF ‘triples’ to create context in the published metadata, dataset search engine results can be tailored for the end user by looking at the spatial relevance or suitability of a dataset.

One example would be describing a dataset’s extent as being ‘Northam’, a town in the Wheatbelt region of Western Australia. Using RDF-compliant vocabularies, a user can query a search engine with a phrase such as ‘in the Wheatbelt’ and find that dataset. As such, a user looking to compare data from a set of related geographic areas needs only a single search query, rather than the many currently required.
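The Northam example can be sketched with a handful of containment statements. The dataset names are invented, and the `WITHIN` table is a stand-in for a place-name vocabulary; only the geography (Northam in the Wheatbelt, the Wheatbelt in Western Australia) comes from the text.

```python
# Containment statements: each place 'is within' a larger region.
WITHIN = {
    "Northam": "Wheatbelt",
    "Wheatbelt": "Western Australia",
    "Perth": "Western Australia",
}

# Hypothetical datasets, each described only by its own extent.
DATASETS = {
    "Northam grain production (hypothetical)": "Northam",
    "Perth tree canopy (hypothetical)": "Perth",
}


def regions_containing(place):
    """Walk up the containment chain and list every enclosing region."""
    chain = []
    seen = set()
    while place in WITHIN and place not in seen:
        seen.add(place)
        place = WITHIN[place]
        chain.append(place)
    return chain


def find_datasets(region, datasets):
    """Return datasets whose extent is the region or lies inside it."""
    return [
        name
        for name, extent in datasets.items()
        if extent == region or region in regions_containing(extent)
    ]


print(find_datasets("Wheatbelt", DATASETS))
```

A query for ‘Wheatbelt’ finds the Northam dataset even though its metadata never mentions the Wheatbelt, and a query for ‘Western Australia’ finds both datasets.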

The Spatial Infrastructures program of FrontierSI has been at the forefront of research in this area for the past several years, and new applications such as Google Dataset Search show ongoing promise that we are on the right path. For now, FrontierSI is continuing to improve how spatial metadata can better leverage the “web of data”. Supporting Australian data publishers to ready their data for Google Dataset Search is part of that ongoing process.