The aim was to improve access to BGS's resources with an enhanced search tool that can cut through the tangle of complex geoscience terminology, and to improve navigation between information resources.

Why should we do this?

BGS is custodian of a wealth of textual information, including the important observations and interpretations that accompanied BGS’s traditional core product, the hardcopy 2D geological map. Our digital-era information products make it easy to view interpretations of the spatial extent of geological properties, but they are hard to discover, difficult for non-experts to understand, and divorced from the documented evidence they are based on. These provenance links are increasingly important when information is used in decision making.

So how do we make a search "semantically intelligent"?

Well, if you do a regular web search or directory search for content, all you are matching is a sequence of letters. The search engine doesn't know that other terms mean the same thing, or are related to your search intent in some way, or that the term you are using has different meanings in different eras or contexts – all of which happen a lot in the long history of geoscience terminology. All of these problems mean that the user has to filter out "false positives" in the search results and often run repeated searches using the alternative terms they know about, potentially missing valid results, especially if they don’t know the experts’ terms.

But what if the experts who knew all about the terminology and all the relationships between the terms and the things (we call them concepts) had already captured all of that knowledge (we call that an ontology)? And even better, what if that was available in a machine-readable form and the search engine could use it with some algorithms that made it behave as if it knew what your search term meant? That is what we mean here by semantically intelligent.
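To make that concrete, here is a minimal sketch of ontology-driven query expansion in Python. The toy ontology below is invented for illustration (the terms and their relationships are not the real BGS vocabularies): each concept lists synonyms and narrower concepts, and expanding a query term pulls in all of them so a literal-match engine behaves as if it understood the concept.

```python
# Toy SKOS-style ontology: concept -> synonyms and narrower concepts.
# All entries are invented for illustration, not real BGS vocabulary.
TOY_ONTOLOGY = {
    "jurassic": {
        "synonyms": ["jurassic period", "jurassic system"],
        "narrower": ["lias", "oolite"],
    },
    "lias": {"synonyms": ["lias group"], "narrower": []},
}

def expand_query(term, ontology):
    """Return the term plus every synonym and narrower term the ontology
    knows about, recursing down the hierarchy."""
    term = term.lower()
    expanded = {term}
    entry = ontology.get(term)
    if entry:
        expanded.update(entry["synonyms"])
        # Recurse so e.g. "jurassic" also finds documents about "lias group".
        for narrower in entry["narrower"]:
            expanded |= expand_query(narrower, ontology)
    return expanded

print(sorted(expand_query("jurassic", TOY_ONTOLOGY)))
```

A search for "jurassic" would then be run against the whole expanded set of terms, rather than the single string the user typed.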

The "spatially intelligent" bit means we also want the search engine to understand the various ways that locations are mentioned in unstructured text so that a user can find all information relevant to a chosen location no matter how it is represented.

To push things along further, BGS have been supporting PhD research (with Robert Gordon University) into these topics, and our student Ike was able to join us at the hackathon and contribute algorithms he has written to use the ontologies.

So what happened during the hackathon?

Planning

With handy post-it notes on a flip chart, we identified six separate components to work on in parallel:

A – document indexing for the search engine

B – application to run the search service as a simple search

C – extending the search service to use the ontologies

D – web interface for search

E – coordinate-to-placename converter

F – links to the search from Groundhog Web virtual cross section and borehole viewer (as an example, just because that’s an application that I develop anyway, so I know the codebase)

Indexing

On task A, Rachel, Marcus and Ike worked together to install Elasticsearch (open-source search engine software) and used it to create an index of plain text terms in a set of BGS publications – no ontologies involved yet. After a few false starts this was then left to cook overnight.

On task B, Marcus and Ike implemented Ike’s existing search application as an Elasticsearch client to run a simple text search.
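The real index was built with Elasticsearch, but the core idea of tasks A and B can be sketched as a plain Python inverted index with a literal-match search over it (the document snippets below are invented for illustration):

```python
import re
from collections import defaultdict

# Invented stand-ins for the indexed BGS publications.
DOCS = {
    1: "Boreholes in the Vale of York proved Triassic sandstone.",
    2: "The memoir describes Jurassic limestone near Whitby.",
}

def build_index(docs):
    """Map each lower-cased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z]+", text.lower()):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Literal term matching only -- no ontology involved yet."""
    return sorted(index.get(query.lower(), set()))

index = build_index(DOCS)
print(search(index, "Jurassic"))
```

An ontology-aware version would simply run `search` once per term in the expanded query set and merge the results.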

Indexing complete

Search service running

Search interface working

On task D, Gemma created a web front end to launch the search, display the results, highlight relevant terms in the PDF documents and capture user evaluation of the search results – useful for Ike’s PhD. There was a “high five at 11.25” moment on the second day when the search was working properly and the web interface submitted searches, showed results and highlighted the found terms in the PDF documents.
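The term-highlighting step can be sketched in a few lines of Python: wrap each matched query term in `<mark>` tags, case-insensitively (the sentence and terms here are invented; real PDF highlighting is more involved, and this substring match would also hit terms embedded in longer words):

```python
import re

def highlight(text, terms):
    """Wrap each occurrence of any query term in <mark> tags,
    preserving the original capitalisation of the match."""
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    return pattern.sub(lambda m: f"<mark>{m.group(0)}</mark>", text)

print(highlight("Jurassic limestone overlies the Lias.", ["jurassic", "lias"]))
```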

Tim McCormick concentrating

On task E, Tim started adapting (and then found it was quicker to write from scratch – hacking isn’t always the best option for a quick win, then!) a few database functions in PL/SQL to create a gazetteer translation and expansion tool. This meant querying a corporate spatial database of OS administrative placenames, map sheet names and some geological feature names to convert a coordinate location to a list of placenames, and to expand a single placename to a list of all co-located placenames. This performs a similar function to Ike’s semantic comparison algorithm, but in geographic space. Marcus created a small web service to act as an interface to Tim’s function.
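The gazetteer logic itself is simple to sketch. The hackathon version was PL/SQL against a corporate OS placename database; the Python below uses invented placenames with bounding-box extents, and expands a placename via its extent's centre point as a simplification:

```python
# Invented gazetteer records: name -> bounding box in British National Grid
# (min_easting, min_northing, max_easting, max_northing).
GAZETTEER = [
    ("Vale of York", (430000, 440000, 470000, 490000)),
    ("Selby",        (455000, 460000, 465000, 470000)),
    ("Sheet 71",     (440000, 450000, 480000, 480000)),
]

def placenames_at(easting, northing):
    """Translate a coordinate to every placename whose extent contains it."""
    return [name for name, (x0, y0, x1, y1) in GAZETTEER
            if x0 <= easting <= x1 and y0 <= northing <= y1]

def expand_placename(name):
    """Expand one placename to all co-located placenames, by querying
    the centre of its own extent (a deliberate simplification)."""
    for n, (x0, y0, x1, y1) in GAZETTEER:
        if n == name:
            return placenames_at((x0 + x1) / 2, (y0 + y1) / 2)
    return []
```

So a coordinate inside Selby would resolve to the town, the enclosing Vale of York, and the map sheet covering it, and a search could then be run against all three names.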

Search links added to virtual borehole viewer

On task F, Agelos and I tried (and failed) to get the BGS Groundhog Web code to compile locally (a new PC was missing some vital configuration that we couldn’t pin down), so in the end I captured an example page from its output and hardcoded new links to Gemma’s search form and to Marcus’s interface to Tim’s gazetteer tool. Going great at 12.08 – even with a bit of cheating!
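A hardcoded context-sensitive link of that kind is just a URL with pre-populated query parameters. As a sketch (the base URL and parameter names here are invented, not the real application's):

```python
from urllib.parse import urlencode

def search_link(base_url, placename, term):
    """Build a link that opens the search form pre-populated with a
    placename and a geological term (parameter names are hypothetical)."""
    return base_url + "?" + urlencode({"place": placename, "q": term})

print(search_link("https://example.bgs.ac.uk/search", "Vale of York", "Triassic"))
```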

Existing BGS Lexicon entry for Vale of York Formation

On task C, Ike continued indexing the document collection using concepts in the geological timescale ontology (BGS Chronostratigraphy), and adapted the search engine to use that ontology in his query expansion, semantic comparison and relevance ranking algorithms. Agelos also developed some web-scraping scripts to pick up BGS Lexicon terms from structured web pages so that search links could be applied that way if we wanted.
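One common way to do semantic comparison over a hierarchy like a geological timescale is to measure the edge distance between two concepts through their closest common ancestor, then rank results by that distance. The sketch below uses a toy child-to-parent hierarchy loosely modelled on real chronostratigraphic names; it is an illustration of the general technique, not Ike's actual algorithm:

```python
# Toy child -> parent hierarchy, loosely based on the geological timescale.
PARENT = {
    "Hettangian": "Lower Jurassic",
    "Sinemurian": "Lower Jurassic",
    "Lower Jurassic": "Jurassic",
    "Upper Jurassic": "Jurassic",
    "Jurassic": "Mesozoic",
    "Triassic": "Mesozoic",
}

def ancestors(concept):
    """Path from a concept up to the root of the hierarchy."""
    path = [concept]
    while concept in PARENT:
        concept = PARENT[concept]
        path.append(concept)
    return path

def distance(a, b):
    """Edge count through the closest common ancestor (0 = same concept)."""
    pa, pb = ancestors(a), ancestors(b)
    for i, c in enumerate(pa):
        if c in pb:
            return i + pb.index(c)
    return float("inf")

def rank(results, query_concept):
    """Order (doc_id, concept) pairs by semantic closeness to the query."""
    return sorted(results, key=lambda r: distance(query_concept, r[1]))
```

With this, a query tagged "Hettangian" would rank a "Sinemurian" document (two edges away, via Lower Jurassic) above a "Triassic" one (four edges away, via Mesozoic).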

Going like a dream at 13.17! Tasks A, B, D, E and F were complete and ready to demonstrate at the final presentation at 2pm. Task C was always going to be the tricky bit, so we weren’t too worried that it wasn’t finished yet.

At the close of the hackathon we were able to demonstrate the web front end to the search and show it running to retrieve text-matched results from the indexed document collection. We also demonstrated a Groundhog virtual cross section from the Vale of York 3D model with new context sensitive “Search publications” links from the legend of model layers that open the search form pre-populated with the placenames and geological time or formation name term relevant to that part of the cross section. Just after the hackathon, and just a little too late to demonstrate, Ike managed to plug in the full semantic comparison algorithm using the Chronostratigraphy ontology, completing Task C.

The hackathon judging panel were impressed at how much we achieved and were excited about the possibilities and the way it could help users discover and navigate through our wealth of resources. We all enjoyed working in a new environment and with a different team of people than usual – despite the extreme heat on those days! On a personal level, this team effort brought together various strands of work that I have been pursuing – sometimes on the sidelines – for a number of years, so it was really satisfying to finally have something to show. Huge thanks to all the great team members.

What happens next?

We would like to build on the work we did to:

implement a more robust version on our intranet for staff to assess

add further ontologies to the search tool

use third party online data sources or APIs for some of the gazetteer translation and expansion service rather than having to maintain our own copies of OS data

index documents by location by geoparsing for recognisable coordinates, or proxies for locations such as borehole registration numbers

provide a similar application on the BGS external website to search publications, showing snippets of documents and links to the BGS shop if the publication is not open access

eventually implement a single point of entry to search and navigate through all BGS website resources