Linking registers

Not that long ago on this blog we introduced registers as authoritative lists you can trust, and began to describe the characteristics of a register. In this post we delve into one of those characteristics. We'll explain why registers have links; how links help users of data; and discuss some of the implications of linking for an organisation operating a register.

Food hygiene ratings data is just one of a number of ways the Food Standards Agency (FSA) use their expertise and influence to help people trust that the food they eat is safe and honest.

Making the data visible helps users assess a single place to eat, but it also raises some interesting questions. How does a rating relate to its geographic location? Do food poisoning outbreaks happen in areas with poor hygiene ratings? Does the quality of the premises, or the size of the business, make a poor rating more or less likely? And does a restaurant failing its food inspection mean it’s about to fail as a business?

Anyone can come up with other interesting questions, but many of us have needs for understanding the data beyond idle curiosity. Testing hypotheses with data can lead to better government policy, and help ministers to target investment more effectively.

You need a map and directions

Currently, testing hypotheses like these is quite difficult. You’ll need to conduct some research:

Firstly, you need to know how government is organised; you need to know about the role of the Food Standards Agency, the Valuation Office Agency, the Land Registry, Companies House, local authorities, the Office for National Statistics and other agencies.

Then you need to know what data each of these organisations holds. You can find some of this information on GOV.UK and data.gov.uk.

Some of the data you need will be available through online tools, or periodically published as open data on data.gov.uk or GOV.UK, but you will still need to understand how to download or obtain the data which may be different in each case. Often the data isn’t readily available and you’ll need to enlist someone’s help, contacting the agency directly via their website or call-centre.

If you are lucky, the data will be available in a convenient format such as a single file of Comma-Separated Values, but a lot of data is still published in document formats such as PDF, which can be difficult to process. Regardless of how it is formatted, you still need to process the data from each source differently: each dataset has its own shape, and its own names and types for data items.

Then you need to understand whether your use is within the licensing terms of the data, which can be quite difficult, particularly with data not published under the Open Government Licence.

Finally, before you can test your hypothesis, you’ll have to work out how to link the data together. And this is the hardest part. Often you’ll have to do some clerical work. You’ll need your eyes to manually match addresses in different datasets, your brain to relate the name of a food business to the operating company, and your fingers to fix any errors and create the links between the data.

Better data makes data better

I have learnt a new word since joining government: “nugatory”. It’s the name given to boring, pointless activity. The nugatory work of discovering, cleaning and linking data is repeated each time you want to ask a different question, and repeated by many others asking similar questions all across government.

Of course there’s a better way of linking data, and that’s to create links in the data at source, where and when it’s made. This is where registers can help.

Registers offer authoritative lists of allowed values for a field. Using registers helps designers build drop-downs, selectors and other widgets for forms.

A local-authority register will enable the Food Standards Agency to replace their own list with an official one, and a country register will enable Companies House to replace their country of origin with the authoritative list of countries, both of which currently appear as free text fields in forms.
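As a rough illustration of how a form might use a register instead of a free text field, here is a minimal Python sketch. The register contents, the COUNTRY_REGISTER name and both function names are assumptions made up for this example; a real service would load the list from the register's publisher rather than hard-code it.

```python
# Hypothetical sample of a country register: code -> name.
# These entries are illustrative only, not the real register.
COUNTRY_REGISTER = {
    "GB": "United Kingdom",
    "FR": "France",
    "DE": "Germany",
}


def dropdown_options(register):
    """Return (code, name) pairs sorted by name, ready for a form selector."""
    return sorted(register.items(), key=lambda item: item[1])


def validate_country(value, register):
    """Accept only a code that appears in the register; reject free text."""
    code = value.strip().upper()
    if code not in register:
        raise ValueError(f"{value!r} is not in the country register")
    return code


if __name__ == "__main__":
    for code, name in dropdown_options(COUNTRY_REGISTER):
        print(code, name)
    print(validate_country(" gb ", COUNTRY_REGISTER))
```

The form stores the code rather than whatever the user typed, so every record that refers to a country refers to it in the same way.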

Links simplify the data

Using links rather than text simplifies the data an organisation needs to hold. Using a company number allows an organisation to rely upon Companies House to provide details which may change, such as the company name, its registered office address and the names of its directors, and to track whether the company is still active.
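To make the idea concrete, here is a small Python sketch in which a food premises record holds only a company number and looks up the current details when they are needed. The in-memory COMPANY_REGISTER, the record fields and the resolve_company helper are all hypothetical; a real service would query the register operated by Companies House instead.

```python
# Hypothetical in-memory stand-in for a company register, keyed by company number.
COMPANY_REGISTER = {
    "9876": {"name": "Example Catering Ltd", "active": True},
}

# The premises record holds only the link, not a copy of the company details.
premises = {"premises": "Example Cafe", "rating": 4, "company": "9876"}


def resolve_company(record, register):
    """Follow the record's company number to the register's current entry."""
    return register[record["company"]]


before = resolve_company(premises, COMPANY_REGISTER)["name"]

# The register's custodian renames the company; the premises record is untouched.
COMPANY_REGISTER["9876"]["name"] = "Example Catering Group Ltd"
after = resolve_company(premises, COMPANY_REGISTER)["name"]

if __name__ == "__main__":
    print(before)
    print(after)
```

Because the premises record never copied the name, there is nothing to correct when the company changes it.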

A link can also connect a register to one of a number of different registers. For example the business running a food premises could be identified by a CURIE (Compact URI), allowing it to be a school in the Department for Education register “school:1234”, a company in the Companies House register “company:9876” or a charity registered by the Charity Commission “charity:5678”.
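A prefixed identifier such as “school:1234” can be split into the register it belongs to and the record within that register. The Python sketch below shows one way this might work. The REGISTER_PREFIXES mapping and its URLs are placeholders invented for this example, not real endpoints, and the function names are our own.

```python
# Hypothetical mapping from prefix to a register's base URL.
# These URLs are placeholders, not real endpoints.
REGISTER_PREFIXES = {
    "school": "https://school.register.example/record/",
    "company": "https://company.register.example/record/",
    "charity": "https://charity.register.example/record/",
}


def parse_curie(curie):
    """Split 'prefix:reference' into its two parts, rejecting malformed input."""
    prefix, separator, reference = curie.partition(":")
    if not separator or not prefix or not reference:
        raise ValueError(f"{curie!r} is not of the form prefix:reference")
    return prefix, reference


def expand_curie(curie, prefixes=REGISTER_PREFIXES):
    """Turn a prefixed identifier into a full URL using the prefix mapping."""
    prefix, reference = parse_curie(curie)
    if prefix not in prefixes:
        raise ValueError(f"unknown register prefix {prefix!r}")
    return prefixes[prefix] + reference


if __name__ == "__main__":
    for link in ("school:1234", "company:9876", "charity:5678"):
        print(link, "->", expand_curie(link))
```

A single field can then point at any of the three registers, and the prefix tells a consumer which one to ask.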

Links demand trust across organisations

Links simplify the data a single organisation needs to maintain, but they need agreements to share data names and data types with another organisation. A good way to scale agreements across a lot of registers is for us to identify and use standards.

Technical standards for data are necessary for linking, but links also depend upon trust between different organisations.

To use company numbers rather than company descriptions, the Food Standards Agency need to be able to trust that the Companies House number will be stable and continue to identify the same company, and that the company data they need will continue to be open and available to them.

Similarly, before we can use country codes rather than accept free text, Companies House need to be able to trust that the Foreign & Commonwealth Office will continue to keep the list of countries recognised by the United Kingdom up to date.

We need data standards so that we can trust that the data in a register will be available for as long as it is referenced in open data, and kept relevant for as long as it is needed to operate services.

Meeting data standards and maintaining the trust placed in links are just two of the responsibilities we should expect from the custodian of a register.

Nice article. One thing that got me thinking, particularly in light of your example questions to ask of the FSA data, is that the geographical location itself is a register. And it's a particularly useful one because so many other datasets have a link to this same register. The FSA data is already geocoded with a location (which conforms to the Standards Hub proposal for exchange of point data https://standards.data.gov.uk/challenge/exchange-location-point-information).
As an example, I harvested the ~450,000 records from the API and, using the location information in the data, averaged them in 2 different ways - by local authority and by an arbitrary 10km hex bin. I didn't perform any follow-up analysis (I'll leave that for others), but it initially looks like urban areas have lower scores.... I also had a quick look for a dataset of food poisoning outbreaks but couldn't find anything...
So links to local authority, postcode, ONS census output area, area of local football team allegiance and pretty much anything else you can think of are already in the data, via the location.
That said, I completely agree that the links to non-geographic registers you propose would improve the data a lot. As would publishing it via a queryable JSON-based API or even better an open-standard appropriate to this type of data (such as GeoJSON). I still applaud the FSA though for updating it daily rather than pushing a static copy to data.gov.uk and forgetting about it.

Hi Adam, we've looked at a few use-cases which involve registering geographic locations, including points and boundaries. We've also discovered that a number of datasets held by different organisations can be linked together using a common identifier for the business premises, which is proving to be very useful for analysis. I'll aim to present these findings in a follow-up post.