Daniel's collected open source and open data musings as a web developer, project manager and architect in Adelaide.

Friday, November 02, 2012

How to put a semantically enabled autocomplete control into your applications

One of the most common application design patterns is the lookup table - some piece of business data has been given a description, and possibly a code or identifier.

When creating new data, a user often needs to select a code or identifier for a piece of information. This is usually done with a dropdown or, if there are many entries, an autocomplete control.

This works well - some people will just store the key/value pairs in hashes in their code, while others will make sure they are published into their relational data store.

Where it starts to fall down is in multiple applications working together - who can agree on the meaning of a code?
Your code of CASE_NIGHTMARE_GREEN is applied by a user and treated by one application as the coming of Cthulhu, but after an ETL, CSV export or web services message, the next application treats it as something different. Users who are not up to date with the latest Lovecraftian spy thrillers start to misinterpret the data and apply it to anything involving green suitcases in horrible colours.

How do you fix this?

The next logical step is often to add a description, so that a UI can explain the term, but in a non-service-oriented environment that description is trapped in your datastore.

This won't work in a multi-vendor scenario, at least not unless you want to share your DB with them.

Another approach is the Code Table service - a service that has a focus on only retrieving data about a given input identifier.

I've seen this done in at least one SOA, and it's not a terrible pattern - but each vendor still has to stand up their own code table services, and there's a lot of repetition.

What else can you do?

Soon it becomes obvious that you want a decent way to find a code and the related data, but you also want to support aliases - my CASE_NIGHTMARE_GREEN is your WALK_IN_THE_PARK.

This gets tricky quickly, as one-to-one mappings are difficult - either a collection of vendors pulls together and standardises on a list and the mappings, or no one really collaborates and fragile mapping code is introduced.

By this point, fear of change often sets in, as the interfaces between parties are fragile, or pushing changes through the consortium of vendors becomes a nightmare of project management and communication.

If you haven't had to roll out minor enhancements to a standard with a number of other parties who just aren't quite interested, take my word for it - it's painful.

All is not lost - there is another way, and it's simple.

What's the way forward?

My recommendation here is to push your codes into a triplestore. It doesn't fix everything, but it becomes trivial to relate information to the code - aliases, for example, or descriptions.
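To make "relating information to the code" concrete, here is a minimal sketch of the idea in plain Python. It is not a real triplestore, just a set of (subject, predicate, object) tuples with a wildcard match. The `code:` identifier and the predicate names `label`, `description` and `alias` are made up for illustration.

```python
# A toy in-memory "triplestore": every fact is a (subject, predicate, object) tuple.
# The identifier and predicate names are hypothetical, chosen for this example only.
TRIPLES = {
    ("code:CASE_NIGHTMARE_GREEN", "label", "CASE_NIGHTMARE_GREEN"),
    ("code:CASE_NIGHTMARE_GREEN", "description", "The coming of Cthulhu"),
    ("code:CASE_NIGHTMARE_GREEN", "alias", "WALK_IN_THE_PARK"),
}


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


def add_alias(triples, subject, alias):
    """Relating a new alias to a code is just one more triple."""
    triples.add((subject, "alias", alias))
```

Calling `match(TRIPLES, obj="WALK_IN_THE_PARK")` finds the code an alias belongs to, and `match(TRIPLES, subject="code:CASE_NIGHTMARE_GREEN")` returns everything known about that code.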

A triplestore is a database of subject/predicate/object statements, typically exposed as a RESTful service that lets you execute queries - if you can deal with MongoDB or MySQL, you should be able to comprehend what's going on.
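As a rough sketch of what "executing a query" over HTTP looks like, the snippet below builds a SPARQL Protocol GET request using only the Python standard library. The endpoint URL is hypothetical, and the use of `rdfs:label` for code names is an assumption about how the data is modelled. Nothing is sent over the network until `urlopen` is called.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint; substitute the SPARQL endpoint of your own triplestore.
ENDPOINT = "http://localhost:3030/codes/sparql"

# Assumes code names are stored as rdfs:label values.
QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?code ?label WHERE { ?code rdfs:label ?label } LIMIT 10
"""


def build_request(endpoint, query):
    """Build a GET request that carries the query in the 'query' parameter."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    headers = {"Accept": "application/sparql-results+json"}
    return urllib.request.Request(url, headers=headers)


request = build_request(ENDPOINT, QUERY)

# To run it against a live endpoint:
# import json
# with urllib.request.urlopen(request) as response:
#     results = json.load(response)
```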

Here's what Wikipedia has to say about SNOMED, if you haven't heard of it.

SNOMED CT Concepts are representational units that categorize all the things that characterize health care processes and need to be recorded therein. In 2011, SNOMED CT includes more than 311,000 concepts, which are uniquely identified by a concept ID, i.e. the concept 22298006 refers to Myocardial infarction. All SNOMED CT concepts are organized into acyclic taxonomic (is-a) hierarchies; for example, Viral pneumonia IS-A Infectious pneumonia IS-A Pneumonia IS-A Lung disease. Concepts may have multiple parents, for example Infectious pneumonia is also a child of Infectious disease. The taxonomic structure allows data to be recorded and later accessed at different levels of aggregation. SNOMED CT concepts are linked by approximately 1,360,000 links, called relationships

That's one big code table, and you can see it's grown beyond just code/name pairing to include more data.
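The is-a hierarchy described in that quote can be sketched in a few lines of Python. This uses only the concept names from the quote, keyed by name rather than by real SNOMED CT concept IDs, so treat it as an illustration of the structure and not as SNOMED data.

```python
# Child -> parents, using only the examples named in the quote above.
# A concept may have more than one parent.
PARENTS = {
    "Viral pneumonia": ["Infectious pneumonia"],
    "Infectious pneumonia": ["Pneumonia", "Infectious disease"],
    "Pneumonia": ["Lung disease"],
}


def ancestors(concept, parents=PARENTS):
    """Collect every concept reachable by following is-a links upwards."""
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_a(concept, candidate, parents=PARENTS):
    """True if concept is the candidate itself or one of its descendants."""
    return concept == candidate or candidate in ancestors(concept, parents)
```

This is what "different levels of aggregation" buys you: data recorded as Viral pneumonia can later be counted under Lung disease or Infectious disease, because `is_a` is true for both.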

One thing the Freebase folks and a few others have highlighted is a common problem: given a bunch of user input, locate the object or identifier related to that term.

The moment you have an autocomplete control like these, it instantly kicks your application from "user is entering data into a text field" into "user is describing a semantic object, and I can grab all of the information about it that is relevant to my user".
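The shift from "text in a field" to "an identified object" can be sketched as follows. The data lives in a plain dictionary here; in practice it would come from the triplestore. The `code:` and `snomed:` identifier prefixes and the field names are assumptions for the example.

```python
# Hypothetical identifiers and field names; only the names and the concept ID
# 22298006 come from the text above.
CONCEPTS = {
    "code:CASE_NIGHTMARE_GREEN": {
        "label": "CASE_NIGHTMARE_GREEN",
        "aliases": ["WALK_IN_THE_PARK"],
        "description": "The coming of Cthulhu",
    },
    "snomed:22298006": {
        "label": "Myocardial infarction",
        "aliases": [],
        "description": "",
    },
}


def autocomplete(term, concepts=CONCEPTS, limit=10):
    """Return (identifier, matched name) pairs whose label or alias starts with term."""
    needle = term.strip().lower()
    if not needle:
        return []
    hits = []
    for identifier, data in concepts.items():
        for name in [data["label"], *data["aliases"]]:
            if name.lower().startswith(needle):
                hits.append((identifier, name))
                break  # one hit per concept is enough
    return sorted(hits)[:limit]
```

The user types "walk", the control hands back the identifier `code:CASE_NIGHTMARE_GREEN`, and the application can then look up the description and anything else attached to that identifier.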

Where can I learn more about SPARQL?

Step 1, learn Turtle. If you can comprehend YAML, you should feel fairly comfortable.

Step 2, I'd try SPARQL by Example. There's a good chance that if you are thinking of an SQL concept you want, such as LIKE matching, there's a SPARQL equivalent (FILTER with regex).
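For instance, an SQL prefix search such as `WHERE label LIKE 'case%'` maps roughly onto a `FILTER regex(...)` clause. The sketch below builds such a query as a string in Python. It assumes the names are stored as `rdfs:label`, and the two escaping helpers are simplified illustrations rather than a complete sanitiser.

```python
# Characters that have special meaning in the regex and need a backslash.
REGEX_SPECIALS = set("\\|.?*+(){}[]^$-")


def escape_regex(text):
    """Backslash-escape regex metacharacters so user input matches literally."""
    return "".join("\\" + ch if ch in REGEX_SPECIALS else ch for ch in text)


def escape_sparql_string(text):
    """Escape backslashes, double quotes and newlines for a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def build_autocomplete_query(term, limit=10):
    """Build a case-insensitive 'label starts with term' query."""
    pattern = escape_sparql_string("^" + escape_regex(term))
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?code ?label WHERE {\n"
        "  ?code rdfs:label ?label .\n"
        f'  FILTER regex(str(?label), "{pattern}", "i")\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )
```

The `"i"` flag makes the match case-insensitive, and the leading `^` anchors it to the start of the label, which is usually what an autocomplete control wants.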

Luckily, 95% of what you learned with Turtle is simply reused by SPARQL - it introduces variables, WHERE clauses, graphs, filters, and a few other things... but that's really all that's new.

Where to from here?

If you were to deploy this internally within an organisation, your service is pretty much good to go. You may want to look at Graph Access Control to add in some security, and the related SparqlServer/Update APIs.

Was this easy enough?

In comparison to the other approaches I have seen, it's fairly good.

It's trivial to put a front-end on your triplestore. You can roll your own with a minimum of fuss, or use things like https://github.com/kurtjx/SNORQL to provide an 'expert user' ability to inspect your data.

Adding or removing aliases is trivial - there's no schema to migrate or anything else troublesome, and you can add extra data at the drop of a hat, even if it's unrelated to your core set.
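As a sketch of how little is involved, adding or removing an alias can be a single SPARQL Update statement. The helpers below only build the statement text. The `http://example.org/codes/` URI is hypothetical, and using `skos:altLabel` for aliases is one reasonable modelling choice, not the only one.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"


def escape_sparql_string(text):
    """Escape backslashes, double quotes and newlines for a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def _alias_statement(keyword, code_uri, alias):
    # Reject characters that are not allowed inside <...> in SPARQL.
    if any(ch in code_uri for ch in '<>"{}|^`\\ '):
        raise ValueError("invalid character in URI: " + repr(code_uri))
    literal = escape_sparql_string(alias)
    return (
        f"PREFIX skos: <{SKOS}>\n"
        f"{keyword} DATA {{\n"
        f'  <{code_uri}> skos:altLabel "{literal}" .\n'
        "}"
    )


def build_add_alias(code_uri, alias):
    """SPARQL Update text that attaches one more alias to a code."""
    return _alias_statement("INSERT", code_uri, alias)


def build_remove_alias(code_uri, alias):
    """SPARQL Update text that detaches an alias from a code."""
    return _alias_statement("DELETE", code_uri, alias)
```

For example, `build_add_alias("http://example.org/codes/CASE_NIGHTMARE_GREEN", "WALK_IN_THE_PARK")` produces an `INSERT DATA` statement that can be sent to the store's update endpoint. No table has to be altered first.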