Medical documents like electronic health records (EHRs), clinical trial reports, drug experiment studies, medical journals and notes hold valuable knowledge about patients, diseases and drugs that can support new drug and disease research. But most often this information is captured manually as free-form text and needs a human expert to interpret it. The knowledge is also frequently buried inside large PDF or Word documents containing raw text, charts and tables, limiting the value that can be obtained from it. As the volume of such documents grows, extracting this knowledge only becomes more difficult.

In the drug discovery space, a quick search on past co-occurrences of symptoms and chemicals in drugs can give pharmaceutical researchers valuable insights. But doing a “Control + F” keyword search across documents is extremely time-consuming. This is not simply a problem of getting back to the right document, but of finding the right paragraph, table or chart inside a 200-page document with non-standard headings and varying writing styles.

Natural Language Processing

Persistent worked with a major pharmaceutical company to develop a solution for executing knowledge-driven searches across multiple drug experimentation documents, extracting insights in seconds instead of minutes or even hours. The first step was to use natural language processing (NLP) techniques to extract raw text from the documents and build an easily searchable index on Elasticsearch, with metadata extracted from tables and figures added to the index. A domain expert could now do a simple search for relevant keywords and get the closest matching text. Although this helped reduce search time, it still took considerable human effort to read the returned raw text and understand the insights. The next step was to figure out how to extract structure from the raw text.
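The indexing step might look like the minimal sketch below, using the official elasticsearch Python client against a local cluster. The index name, field names and example values are all illustrative, not the actual schema used in the project.

```python
# Minimal sketch of the search-index step, assuming the elasticsearch-py
# client and an Elasticsearch cluster at localhost:9200. The index name
# "pharma_docs" and the fields "text", "source", "page" and "section"
# are illustrative placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index one passage of raw text extracted from a document, with provenance
# metadata so search results can be traced back to the source file.
es.index(
    index="pharma_docs",
    document={
        "text": "Ibuprofen works by reducing hormones that cause inflammation and pain.",
        "source": "study_042.pdf",  # hypothetical file name
        "page": 17,
        "section": "Results",
    },
)

# A domain expert's keyword query: a full-text match on the extracted text.
hits = es.search(index="pharma_docs", query={"match": {"text": "inflammation dosage"}})
for hit in hits["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["source"], hit["_source"]["page"])
```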

Traditionally, NLP methods have relied on rule-based pattern matching and bag-of-words (BOW) type models, where sentence structure is not considered and importance is given to individual words. The BOW approach typically ignores stop words like ‘a’, ‘the’ and ‘of’, which are in fact important to understanding the meaning of a sentence. An improved approach is to use word embeddings like word2vec and GloVe, in which words are represented as numeric vectors so that similarities between words can be calculated. When such models are trained on a pharma text corpus, related terms such as diseases and chemicals form clusters in the vector space. The BOW and embedding approaches improve the keyword search engine, but there is still room for improvement.
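As a sketch of the embedding idea, the snippet below trains a word2vec model with gensim on a toy corpus; a real system would train on a full pharma text corpus, where drug and symptom terms tend to cluster.

```python
# Illustrative word-embedding sketch using gensim's Word2Vec (gensim 4.x API).
# The three-sentence toy corpus stands in for a real pharma text corpus.
from gensim.models import Word2Vec

corpus = [
    ["ibuprofen", "reduces", "inflammation", "and", "pain"],
    ["aspirin", "reduces", "fever", "and", "pain"],
    ["paracetamol", "relieves", "fever", "and", "headache"],
]

model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, epochs=50)

# Each word is now a numeric vector; cosine similarity measures relatedness.
print(model.wv.similarity("ibuprofen", "aspirin"))
print(model.wv.most_similar("pain", topn=3))
```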

Deep Learning Techniques

Next, we looked at state-of-the-art deep learning techniques that treat sentences as sequences of words, consider all the words, and try to learn patterns from them. Understanding sentence structure can offer key insights about terms and allow “entities” to be extracted without hard-coding them. So for a sentence like “Ibuprofen works by reducing hormones that cause inflammation and pain in the body”, a sequence-based learning model can predict that inflammation and pain are symptoms from the way they are used in the sentence, without storing a hard-coded vocabulary as in the BOW approach. This is the power that deep learning brings to NLP.
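One way such sequence-based entity prediction can be run today is with a Hugging Face transformers token-classification pipeline. In the sketch below the model name is a placeholder: substitute any checkpoint fine-tuned for biomedical named-entity recognition (for example, one trained on a chemical/disease corpus).

```python
# Sketch of sequence-based entity prediction with the Hugging Face
# transformers pipeline. The model id below is a placeholder, not a real
# checkpoint name; any biomedical NER model would slot in here.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="some-biomedical-ner-checkpoint",  # placeholder hub id
    aggregation_strategy="simple",  # merge word pieces into whole entities
)

sentence = "Ibuprofen works by reducing hormones that cause inflammation and pain in the body"

# The model labels each token from its context in the sequence, so
# "inflammation" and "pain" can be tagged from usage alone, without a
# hard-coded vocabulary as in the bag-of-words approach.
for entity in ner(sentence):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```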

The next step was to build deep learning models that can predict entities like DRUG, CHEMICAL and SYMPTOM from raw text sentences and create a database of these entities. We developed a reference architecture for an approach called OAVE (Object-Attribute-Value-Evidence). The object is the entity we identify, such as a CHEMICAL; the attribute is a property of that entity, such as DOSAGE; and the value is, for example, 200mg. The raw text, along with the PDF or Word document and the line number where the information was found, is then captured as evidence. The OAVE paradigm helps extract structured information from raw text without hard-coded rules, and these structured OAVE entities can then power an effective intent-based search and question-answering system.
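A minimal sketch of how an OAVE record might be represented is shown below; the field names simply follow the Object-Attribute-Value-Evidence description above, and the example values are illustrative rather than taken from the actual system.

```python
# Hypothetical representation of one OAVE record as a Python dataclass.
from dataclasses import dataclass

@dataclass
class OAVERecord:
    obj: str              # the entity identified, e.g. a CHEMICAL
    attribute: str        # a property of that entity, e.g. DOSAGE
    value: str            # the attribute's value, e.g. "200mg"
    evidence_text: str    # the raw sentence the triple was extracted from
    evidence_source: str  # the source PDF or Word document
    evidence_line: int    # line number where the information was found

record = OAVERecord(
    obj="ibuprofen",
    attribute="DOSAGE",
    value="200mg",
    evidence_text="Patients received ibuprofen 200mg twice daily.",
    evidence_source="study_042.pdf",  # hypothetical file name
    evidence_line=312,
)
```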

Doing More with Less

The major challenge in building any deep learning project is the availability of labelled data. To reach acceptable accuracy, such projects typically require on the order of hundreds of thousands of marked entities covering a diverse portfolio of items to discover; the more entity types to discover, the more labelled data is needed. The difficulty in creating labelled data is that it requires domain experts’ time, which is expensive. An increasingly popular way to mitigate this is generative pretraining: using unlabelled raw text to learn language patterns in an unsupervised manner. The pretrained model then needs much less labelled data to learn from and quickly achieves high accuracy. By pretraining and then fine-tuning the model on limited labelled data, we reduced the labelled data needed by almost a factor of three. This approach is increasingly being used to extract knowledge from unstructured text while limiting the labelled data needed to build models. Although we applied it to healthcare text, it can easily be applied to other domains such as banking, insurance, intellectual property, legal and more.
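A hedged sketch of the pretrain-then-fine-tune recipe follows, using Hugging Face transformers and PyTorch. The checkpoint id is a placeholder for a generatively pretrained language model, and the single toy example stands in for the small labelled dataset a real project would loop over.

```python
# Sketch of fine-tuning a pretrained model for entity tagging. The hub id
# is a placeholder; num_labels=7 is illustrative (e.g. B/I tags for DRUG,
# CHEMICAL and SYMPTOM plus an "O" tag).
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

checkpoint = "a-pretrained-biomedical-language-model"  # placeholder hub id
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# The pretrained encoder already knows general language patterns learned
# from unlabelled text; only the small classification head starts fresh.
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=7)

# One toy training example with dummy all-"O" token labels; a real project
# would use a few thousand labelled sentences, not hundreds of thousands.
batch = tokenizer("Ibuprofen reduces pain", return_tensors="pt")
labels = torch.zeros_like(batch["input_ids"])

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
outputs = model(**batch, labels=labels)  # cross-entropy over token labels
outputs.loss.backward()
optimizer.step()
```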

(The author is the Innovation and R&D Architect at Persistent Systems Ltd.)

