Practical Metadata and Standards for Clinical Research

Light, A Tunnel and another Grand Départ

As quite a few of you will know, I follow professional road cycling. Even with the dark side of the sport and its various doping scandals, I love cycling: I love riding myself and I love watching the sport, and no race comes bigger than the Tour de France. The 2016 edition will start at Mont-St-Michel on Saturday (tomorrow as I write this) and I look forward to three weeks of working while keeping an eye on the events as they unfold on the roads of France.

Unfortunately I will not be making my way to France this year. This time last year I was packing up my motorhome for 10 days or so of following the Tour. I returned from that trip, sat down at my desk and pondered an opportunity. Over the previous months I had developed some simple tools and written a few blog posts, but now I had to make a decision. On the 23rd July 2015 I decided to take a calculated risk. I was fed up with managing standards using Microsoft Excel; I felt that there had to be a better way, an easier way that would bring benefit. So, as I sat there on that Thursday, I decided to develop a more robust tool to manage standards, one that also looked to remove the silos that bedevil our industry. The tool is now called Glandon, named after a mountain climb that formed part of the route that day back in 2015.

It is now nearly a year on, another start of the Tour is looming and I find myself in the middle of a two-week development sprint. Glandon has come a long way, my work is starting to pay dividends and there is light at the end of the tunnel.

The current sprint is aimed at adding better support for SDTM Models and Domain classes. This will allow Implementation Guide (IG) domains to reference the correct class models and link the IG definitions to the Class definitions. With these improved definitions Glandon will be able to detect changes between versions and perform the normal change management functions, ease the construction of user-defined custom domains and ensure that such custom domains are well structured. This sprint will be followed by another two-week period in which I plan to upgrade the existing sponsor domain capabilities and build on top of the IG and Class definitions to allow comprehensive domain specifications to be built within the system.

During the previous sprint, when I was working on some Form functionality, I was asked a question about a particular change to the CDISC terminology. In the December 2015 release of the terminology we have an entry for Blood Urea Nitrogen (C61019) with both a Test Name ‘Blood Urea Nitrogen’ and an associated Test Code ‘BUN’ from code lists C67154 and C65047 respectively. These have been deleted in the March 2016 release. A search of the new version seems to suggest they have been replaced by ‘Urea Nitrogen’ & ‘UREAN’ (C-Code C125949), with the specimen pre-coordination being removed.
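The kind of change described here can be found mechanically by comparing two releases. What follows is only a minimal sketch in Python, not how Glandon does it: the two dictionaries hold just the items mentioned in this post rather than the real terminology files, the name `diff_releases` is made up for illustration, and placing the new items in the same two code lists is my assumption.

```python
# Illustrative subset only: the items named in this post, keyed by
# (code list C-code, item C-code). Not the full CDISC terminology.
DEC_2015 = {
    ("C67154", "C61019"): "Blood Urea Nitrogen",  # Test Name
    ("C65047", "C61019"): "BUN",                  # Test Code
}

# Assumption: the replacement items sit in the same two code lists.
MAR_2016 = {
    ("C67154", "C125949"): "Urea Nitrogen",
    ("C65047", "C125949"): "UREAN",
}


def diff_releases(old, new):
    """Return the keys deleted from, added to, and changed between releases."""
    deleted = sorted(old.keys() - new.keys())
    added = sorted(new.keys() - old.keys())
    changed = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return deleted, added, changed


if __name__ == "__main__":
    deleted, added, changed = diff_releases(DEC_2015, MAR_2016)
    for code_list, item in deleted:
        print("deleted", code_list, item, DEC_2015[(code_list, item)])
    for code_list, item in added:
        print("added  ", code_list, item, MAR_2016[(code_list, item)])
```

Note that a comparison like this reports a deletion and an addition; deciding that one item replaces the other still needs a human, or extra information, to make the connection.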

As I looked it struck me that this change provided a nice simple scenario to show impact analysis and the power of having linked metadata. To this end I set up the following:

A second form that used the ‘Blood Urea Nitrogen’ test name directly. This reflects the current method of building forms using individual questions rather than using a concept.

Impact analysis for deletion of Blood Urea Nitrogen C61019

Having got these definitions entered I then performed the impact analysis. What impact does changing my CDISC terminology version from the December 2015 to the March 2016 release have? A few mouse clicks give the grid you see in the image to the left. Its seven rows show that the Biomedical Concept, the two forms and the LB domain are impacted by the deletion of the code list items. You will also note the ‘via’ column, which indicates the route by which something is impacted (e.g. the LB domain is impacted because the Biomedical Concept is associated with it).

A ‘graph’ on the impact analysis results.

As you will note this is a rectangular view and it suffers from that classic issue of the first few columns repeating (e.g. BUN three times). We see this everywhere in our data and our brains cope, but it is a sure sign that our world is not rectangular. Our metadata, however, is linked. The table is merely a presentation; within the system the results are actually a graph, a set of nodes and edges (links) connecting the pieces of information. So Glandon allows the results to be viewed as a graph and we get what you see in the image to the right. Here we see the code list items in green. They directly impact the Biomedical Concept in red. The concept has an onward impact onto a form in light blue and a domain in the darker blue. The Test Name also has an impact on the second form.

So looping back to the title of this post and the light at the end of the tunnel. The vision for linked metadata begins to emerge. Glandon, under the normal looking displays of tables of terminology and domains, is built on the semantic web and it is built around one integrated model. The model may be divided into sections but each section is linked to the others as required. As a result we can see our metadata in its natural form and see the links. The graph above is a nice step towards the vision of metadata across the lifecycle. In this case we are seeing the silos of forms and SDTM being bridged, albeit simply at the moment.

The upgraded vision diagram

In my previous post I presented a diagram that has been one of those pictures that have guided me through the work performed to date. Having discussed it with a few people over the last few weeks I have upgraded it to clarify a few points, but also to strengthen the notion of the linked metadata model taking us from protocol to submission. Remember, and I had a discussion on this last week at the PhUSE European CSS meeting in Basel, metadata is only data, so we are talking linked data here.

I have removed the duplication of symbols in the data layer (the model of our data) showing, hopefully, more clearly the link from protocol through to results. I have also tried to indicate the parts of the model that contribute to the presentation of the data (CDASH, SDTM, ADaM etc) by the boxes with the grey background between the data and presentation layers.

We start to see the link between the impact graph and the graph on the vision slide. And they should be the same. Really what we have is one large graph. When we ask a question all we are doing is picking a start point within the graph (a node) and then navigating the links and nodes to get our answer. An impact assessment is simply picking one node and asking “if I change this, what is connected and thus impacted?” We follow one or more links to other nodes and then repeat as we move further outwards. As we add to the model, be it protocol or analysis, our questions can become wider and more encompassing.
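To make the idea concrete, here is a minimal sketch in Python of an impact assessment as a walk over a graph. It is not Glandon's implementation (Glandon is built on the semantic web, whereas this uses a plain dictionary), and the node names and the `impact` function are made up, modelled on the Blood Urea Nitrogen example earlier in this post.

```python
from collections import deque

# Hypothetical "is used by" links modelled on the example in this post.
USED_BY = {
    "BUN (test code)": ["Biomedical Concept"],
    "Blood Urea Nitrogen (test name)": ["Biomedical Concept", "Form 2"],
    "Biomedical Concept": ["Form 1", "LB domain"],
}


def impact(start, used_by):
    """Walk outwards from start, breadth first.

    Returns one (start, impacted, via) row per impacted node, where
    'via' is the node through which the impacted node was reached.
    """
    rows = []
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for dependant in used_by.get(node, []):
            if dependant in seen:
                continue  # already reached by another route
            seen.add(dependant)
            rows.append((start, dependant, node))
            queue.append(dependant)
    return rows


if __name__ == "__main__":
    for item in ("BUN (test code)", "Blood Urea Nitrogen (test name)"):
        for row in impact(item, USED_BY):
            print(row)
```

Printing the rows gives the rectangular view, with the start node repeated down the first column, while the dictionary underneath is the graph. With these made-up links the two deleted items produce seven rows between them, the same shape as the grid shown earlier.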

A big step comes if we can connect the definitional linked data (our metadata) with the captured data, ideally in a linked data form. Then we could ask questions such as which studies use BUN as a submission value that I may need to change?

The impact graph is the light at the end of the tunnel, the first glimpse of a new and better world. The next few sprints will be spent ensuring that light gets brighter while, of course, enjoying another three weeks of the Tour de France if only via the medium of television.