Building for the Cloud: SciBite’s Architecture Principles

At SciBite, we’ve created a range of powerful biomedical text solutions including TERMite for live semantic annotation, TExpress for text-mining and DOCStore for semantic document search. What’s more, our tools are specifically designed to be deployed in a wide range of different circumstances. Read on to find out more...

Our vision

We have many users who run TERMite via Python, Java, Pipeline Pilot or its web user interface, then manipulate the results in tools such as Excel and Spotfire. Conversely, TERMite is also deployed as a semantic middleware layer: text stored in an existing application flows through the TERMite service, which enriches it with biomedical ontologies, vastly enhancing scientific search and retrieval. When architecting our solutions, uppermost in our minds is the fact that we can’t predict where or how the software will be integrated, so we need to stay as flexible as possible. Fortunately, many emerging technologies in the current wave of informatics innovation lend themselves to exactly this, and offer us and our customers the opportunity to rapidly deliver real value from semantic approaches.

Portability

All of our software is based on Java and runs almost anywhere on a tried and tested, scalable platform. This underpins one of TERMite’s most popular advantages: customers can harness the power of a big server where one is available, but can also run exactly the same software on a laptop when away from the corporate network.

Cloud ready

A frequent question from new customers concerns cloud deployments such as Amazon Web Services, virtual machines and related modern deployment tools such as Docker. We’ve made sure our products support all of these deployment modes. This cloud-ready strategy makes it easy to spin instances up and down in line with demand. Further, the VM/Docker options eliminate IT departments’ concerns over bespoke system configurations, reducing costs to the business. In short, we’re passionate that our products support the most modern mechanisms of software deployment, saving time (and money) when deploying within a customer’s organisation.
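As a sketch of what "spinning up an instance" looks like in practice, a containerised deployment can be started and stopped with a couple of Docker commands. The image name, port and volume path below are illustrative placeholders, not SciBite's published artefacts:

```shell
# Start an annotation service in one step (image name, port and
# vocabulary path are placeholders for this example).
docker run -d \
  --name termite \
  -p 9090:9090 \
  -v /data/vocabs:/opt/termite/vocabs \
  example/termite:latest

# Scale with demand: remove the instance when it is no longer needed.
docker stop termite && docker rm termite
```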

Swagger API spec

The innovation doesn’t stop there. All of our major products support the Swagger/OpenAPI specification. For those unfamiliar with it, Swagger is a way of describing an API in a machine-readable form. It’s fast becoming an absolute “must have” for API providers, as it makes it much easier for downstream clients to understand and interact with an API. In fact, it’s so good that tooling such as the Swagger Editor can automatically generate an API client for any common programming language without you having to write a single line of code!
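To give a flavour of what such a description looks like, here is a small Swagger 2.0 fragment for a hypothetical text-annotation endpoint. The path, parameters and responses are invented for illustration and are not SciBite's actual specification:

```yaml
swagger: "2.0"
info:
  title: Annotation API (illustrative example)
  version: "1.0"
paths:
  /api/annotate:
    post:
      summary: Annotate a chunk of biomedical text
      consumes: [text/plain]
      produces: [application/json]
      parameters:
        - in: body
          name: text
          required: true
          schema:
            type: string
      responses:
        200:
          description: Entities found in the submitted text
```

Feeding a document like this into the Swagger Editor is all it takes to generate a working client library.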

Now we can truly say that we have TERMite and DOCStore clients for any programming language. Add to this our Python toolkit, integration with graph systems such as Neo4j, RDF and triple stores, and many, many more third-party tools and partners, and we believe we have technical options to meet any use case.

TERMite the API

A major architectural decision concerns the storage of semantically indexed data. Just to recap, TERMite is a stateless API: you pass it a chunk of biomedical text and it passes back a set of detailed annotations and relationships between the genes, drugs, adverse events, companies, procedures and many other key entities in the space. As a simple RESTful API there is one step, sending the text, and that’s it: no pre-configuration or set-up is required. Its speed makes it an ideal candidate for large-scale text mining, but also enables services such as semantic auto-complete, which helps data-entry systems leverage controlled vocabularies without encumbering the user. While TERMite is very powerful, it does not store semantically annotated documents for later retrieval. This makes it difficult to support use cases such as “index this (large) document set, then give me a nice user interface for search and retrieval”. The document set might be the whole of Medline, a large patent corpus or millions of internal project documents.
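The one-step call pattern described above can be sketched in Python. The endpoint path and the shape of the response below are assumptions made for this example, not SciBite's documented API; a stub stands in for the HTTP call so the sketch is self-contained:

```python
import json

def annotate(text):
    """Stand-in for POSTing `text` to a running TERMite instance.

    A real client would make a single HTTP call, e.g.
        requests.post("http://localhost:9090/termite", data=text)
    Here we return a canned response of the general shape assumed for
    this example: a list of entity hits, each with a type, a preferred
    name and a character offset.
    """
    return json.loads("""
    {
      "entities": [
        {"type": "GENE", "name": "TP53", "start": 0},
        {"type": "DRUG", "name": "imatinib", "start": 24}
      ]
    }
    """)

# One step: send the text, get structured annotations back.
hits = annotate("TP53 is inhibited after imatinib treatment")
genes = [e["name"] for e in hits["entities"] if e["type"] == "GENE"]
```

Because the service is stateless, nothing needs to be configured or cleaned up between calls, which is what makes patterns like semantic auto-complete practical.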

DOCStore Semantic Search

To address this, we created DOCStore: essentially a database and interface for documents semantically indexed by TERMite. This raises the question of where to actually store the data. Should it be a NoSQL database, a relational system, a triple store? We chose Elasticsearch. Elasticsearch is simply the best solution for indexing large sets of documents. It is horizontally scalable over commodity hardware (meaning it’s very easy to expand to whatever number of documents you need to store), it’s built on the acclaimed open-source Lucene engine, and it has a thriving user community. Elasticsearch is at home on AWS/cloud/Docker-style platforms, so it sits perfectly alongside our own tools.

The only drawback to Elasticsearch is that it doesn’t understand science. It has no idea what a gene or a drug is, preventing semantic-style queries such as “show me all drugs in documents that mention X”. This is where DOCStore’s semantic indexing adaptor comes in: it bridges the worlds of TERMite and Elasticsearch, allowing semantic-style queries to work using core Elastic/Lucene syntax. The combination of the tried and trusted scalability of Elasticsearch and the power of TERMite means that search and retrieval over large biomedical document sets is now within reach of anyone.

Combined with our adoption of cutting-edge deployment techniques, this means we can have a full system up and running in no time at all. Indeed, some of you may have seen our stand at Bio-IT World in Boston, where we had a fully semantically indexed version of Medline running on a chain of $50 Raspberry Pi computers! How many other systems do you know that can scale to millions of documents, run on simple hardware, leverage the most powerful semantics in the industry and be installed from scratch with a single “docker run” command?
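To make the bridging idea concrete, here is a sketch of how a semantic query such as “all drugs in documents mentioning BRCA1” might be expressed as an Elasticsearch query body over TERMite-enriched documents. The field names (text, entities.type, entities.id) are invented for illustration; DOCStore’s real mapping may well differ:

```python
def drugs_in_documents_mentioning(term):
    """Build an Elasticsearch-style query body as a plain dict.

    The document fields used here are assumptions for this sketch:
    full text in `text`, with TERMite annotations stored as nested
    entity records carrying a `type` and a vocabulary `id`.
    """
    return {
        "query": {
            "bool": {
                "must": [
                    # Match the free-text term in the document body.
                    {"match": {"text": term}},
                    # Restrict to documents containing DRUG entities.
                    {"term": {"entities.type": "DRUG"}},
                ]
            }
        },
        # Aggregate the drug identifiers found across matching docs.
        "aggs": {
            "drugs": {"terms": {"field": "entities.id", "size": 50}}
        },
        "size": 0,  # We want the aggregation, not the raw documents.
    }

query = drugs_in_documents_mentioning("BRCA1")
```

The point is that once TERMite has written entity annotations into the index, a “semantic” question reduces to ordinary Elastic/Lucene syntax, so all of Elasticsearch’s scalability comes for free.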

Article info

Posted: 17/05/16

Author: Neal Dunkinson

Category:

Technology

Related articles

The Evolution of Data

Over the last 50 years, how we collect and play music has changed dramatically, from physical copies on vinyl through to electronic MP3s. Each new technology often requires a new device and format to play, yet it is still essentially just music.