Data lakes and cloud computing

Data types used are advancing from the simple text formats of old, writes Paul Denny-Gouldson

For research and development (R&D) organisations, the rise of instrument and process automation is driving a phenomenal increase in the amount, variety and complexity of scientific data being gathered. All of this data needs to be made available so it can be integrated into projects and new scientific approaches, both now and in the future. The requirement for data to be usable has been growing over the past decade and is now reaching a critical point.

Instrument data is driving new science and, as organisations move to large image-based and high-density data structures to support their work (e.g. phenotypic screening), the data types used are advancing from the simple text formats of old. To ensure these new data types are (re)usable in R&D and consumable by existing and emerging technologies such as artificial intelligence (AI) and machine learning, the data has to be accessible, clean and adequately tagged with metadata. These high-value 'data lakes' can become silted up and quickly turn into swamps if data is not properly tagged with all relevant contextual information: projects, tested molecules, results, downstream use, conclusions, derived data, related data and so on.
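The contextual tagging described above can be sketched as a minimal metadata record attached to each dataset. The field names and values here are illustrative assumptions, not a standard schema:

```python
from dataclasses import dataclass, field

@dataclass
class DatasetRecord:
    """A raw scientific dataset plus the contextual metadata that keeps it findable."""
    uri: str                                      # location of the raw data, e.g. an object-store key
    project: str                                  # project the data belongs to
    assay: str                                    # experiment type, e.g. "phenotypic screen"
    molecules: list = field(default_factory=list) # tested molecule identifiers
    tags: dict = field(default_factory=dict)      # free-form contextual metadata

# A hypothetical high-content imaging result, tagged at capture time.
record = DatasetRecord(
    uri="s3://lake/plates/plate_0042.tiff",
    project="oncology-kras",
    assay="phenotypic screen",
    molecules=["CHEMBL25"],
    tags={"instrument": "high-content imager", "conclusion": "active at 10uM"},
)

# A dataset without this context is what silts a lake into a swamp:
# the bytes survive, but nobody can find or reuse them.
print(record.project, record.tags["instrument"])
```

The point is not the exact fields but that the context travels with the data, so downstream indexing and enrichment have something to work from.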

Designing data lakes and keeping them in good health requires constant work and effort, but cloud storage services such as Amazon S3 and adaptive indexing technologies (NoSQL databases, triple stores) will help. While some people think of a data lake, or even data itself, as a static picture once it has been captured, in reality data needs to be continually enriched and augmented with new learnings. Informatics organisations often consider the data to be the record, and in some cases it is, but it does not have to be cast in stone and merely 'stored'. Intellectual property (IP) records can be captured and kept in dedicated systems, while the working data lives in more flexible data structures and is 'put to work'.

Enrichment is a hot topic in the pharma informatics domain. We have seen the emergence of many tools that all essentially do the same thing: make data more consumable or discoverable by scientists and computers. Semantic enrichment, or natural language processing, has been around for many years and has shown clear benefits, particularly in the healthcare domain, where it is used to extract and normalise data from clinical trials.
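As a rough illustration of the normalisation step, here is a minimal dictionary-based sketch. Real enrichment pipelines use curated ontologies and trained models; the synonym map below is invented for the example:

```python
import re

# Illustrative synonym map; production systems draw on curated vocabularies
# (e.g. MeSH or ChEBI) rather than a hand-written dictionary.
SYNONYMS = {
    "acetylsalicylic acid": "aspirin",
    "asa": "aspirin",
    "heart attack": "myocardial infarction",
}

def normalise(text: str) -> str:
    """Lower-case the text and map known synonyms onto canonical terms."""
    out = text.lower()
    # Replace longer phrases first so "heart attack" is handled as a unit.
    for term in sorted(SYNONYMS, key=len, reverse=True):
        out = re.sub(rf"\b{re.escape(term)}\b", SYNONYMS[term], out)
    return out

print(normalise("Patient given ASA after heart attack"))
# -> "patient given aspirin after myocardial infarction"
```

Once free text is mapped onto canonical terms like this, records from different trials or instruments can be aggregated and compared on equal footing.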

In pharma R&D, the enrichment approach is gaining traction with the prevalence of new technologies and commercial offerings. Ontological, taxonomical and semantic tagging are set to become mainstream as the technology and application integration become easier and vendors deploy their tools in the cloud.

A corporate data lake must be defined and viewed as the place to go to find, search, interrogate and aggregate data, making it easier for data scientists to investigate and build data sets for their work. Find and search are two separate concepts here: finding is when you know what you are looking for; searching is when you don't, and want to explore the data.
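The find/search distinction can be made concrete with a toy index. The records and field names here are invented for illustration:

```python
# A handful of hypothetical experiment records in the lake's index.
records = [
    {"id": "exp-001", "project": "oncology-kras", "assay": "phenotypic screen", "hit": True},
    {"id": "exp-002", "project": "oncology-kras", "assay": "binding assay",     "hit": False},
    {"id": "exp-003", "project": "cardio-pcsk9",  "assay": "phenotypic screen", "hit": True},
]

# "Find": you know exactly what you want -- a direct key lookup.
by_id = {r["id"]: r for r in records}
print(by_id["exp-002"]["assay"])   # binding assay

# "Search": you don't know what exists -- explore by filtering on facets.
hits = [r["id"] for r in records if r["assay"] == "phenotypic screen" and r["hit"]]
print(hits)                        # ['exp-001', 'exp-003']
```

A lake that supports only one of these modes serves only half of its users: the key lookup answers a known question instantly, while the faceted filter is how a data scientist discovers what is worth asking about.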

A data lake must be integrated into all systems that are part of the data lifecycle (crudely: creation, capture, analysis and reporting), so that all aspects of the R&D data landscape can be consumed, leveraged, re-indexed and continually enriched. A data lake should not be viewed as a regulatory or IP store; it needs to be a living ecosystem of data and indices that adapts to the needs of the science and the business.

Pharma is looking to become much more data-driven. But first, data must be discoverable by scientists, data scientists and the applications they use. These data jockeys need access to vast quantities of highly curated data to do their jobs, and data lakes are likely the best answer.

AI and related tools such as deep learning, augmented intelligence and machine learning all need a similar set of inputs to data scientists: lots of well-annotated data. Adding more tags and metadata to a data set sits at the heart of what a true data lake should be, and the impact could be far-reaching. The data volumes are huge, and this leads to a couple of issues: where should this data be stored, and how can it be made searchable? This is where the cloud helps.

Whilst searching is often discussed in a macro sense – Google-type searching for example – the questions that scientists want to answer are not always ‘keyword’ or phrase based. Scientific questions are far more intricate and need more than just typical text indices: they require fact-based searching and relationship-based searching too.

This requirement means data must be treated as a living organism and structured in a way that can handle tricky questions. Each of the 'index' types needs to be aware of the others, so that you can jump between concepts, while also remaining easily updatable when new data types are introduced.
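The 'concept jumping' idea can be sketched with a tiny in-memory triple store, the kind of structure the triple-store technologies mentioned earlier provide at scale. All entities and relations below are invented for illustration:

```python
from collections import defaultdict

# A tiny in-memory triple store: (subject, predicate, object) facts.
triples = [
    ("CHEMBL25", "tested_in", "exp-001"),
    ("exp-001",  "part_of",   "oncology-kras"),
    ("exp-001",  "produced",  "plate_0042.tiff"),
    ("CHEMBL25", "inhibits",  "COX-2"),
]

# Index facts by subject so each entity is aware of its neighbours.
by_subject = defaultdict(list)
for s, p, o in triples:
    by_subject[s].append((p, o))

def hop(entity, *predicates):
    """Jump between concepts by following a chain of relations."""
    frontier = {entity}
    for pred in predicates:
        frontier = {o for e in frontier for (p, o) in by_subject[e] if p == pred}
    return frontier

# Relationship-based question: which projects has CHEMBL25 been tested in?
print(hop("CHEMBL25", "tested_in", "part_of"))   # {'oncology-kras'}
```

Because new facts are just appended triples that are indexed on insert, introducing a new data type means adding new predicates rather than redesigning a schema, which is what keeps this style of index easily updatable.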

This is not easy, but rapid progress is being made through the deployment and use of cloud storage, semantic enrichment, alternative data structures, data provisioning, data ingestion, analysis tools and AI. All of these technologies have a part to play, and their level of use depends on the questions being asked of the data. The cloud is the best way to leverage them in a cost-effective and consumable manner; vendors just need to make sure their applications are prepared.
