The idea was simple: identify location names based on unstructured project documents and turn them into geocodes so that they could, for example, be plotted on a map. Like every data science project, our first step was to find some reliable input data upon which to build. We found a treasure trove of potential sources embedded in the document links included in IATI activity files. Within these, evaluation documents seemed to be the best potential source of location data, so we created a library of these files to quickly sample from.

Data extraction

Documents in hand, we now had to try to extract the relevant location names and other useful information. Unfortunately, the types of documents that organisations attach to IATI activities are extremely diverse and come with a wide variety of structures and file formats. To begin to make sense of them, we first converted them into raw, uniform text. We then used a technique called named entity recognition to distill the messy raw text into the clean lists of organisations and locations hidden inside.

We now had a list of sub-national locations; some relevant, some not. Our next step was to convert these names into geocodes. Much of our exploration at this stage focused on whether these results could be integrated with Development Gateway’s Open Aid Data Geocoder, which can assign precise geospatial data based on the location names in accordance with the IATI schema.

Several practical hurdles remain before an approach like this could be used in production, but by the end of this short exercise we had discovered a simple way of automating IATI data creation by scraping text from project documents, extracting location data and then converting these results into geocodes.

What next?

If we are to take this idea further, our next step would be to identify the earliest instance in which subnational locations appear in available project documentation. Since evaluation documents are only available at the end of a project’s lifecycle, they are not an ideal choice for completing forward-looking IATI data, although they could be used to enhance data already on the IATI registry. We will also need to test ways to implement this approach at scale before integrating it into donor publication processes.

In any case, what we have learned so far is that location data is central to improved coordination and, with this automated extraction and geolocation process, it is well worth the effort to explore how we can improve its publication and availability.