Scientists at Cambridge University are developing a computer system that can read vast amounts of scientific literature, make rapid connections between facts and develop hypotheses.

Cancer research aided by natural text analysis of scientific literature

Cambridge University says most biomedical scientists can't manage to keep on top of reading all of the publications in their field, let alone an adjacent field. Cambridge points out that the US National Library of Medicine's biomedical bibliographic database now lists over 19 million records and adds up to 4,000 new records daily.

It says that for a prolific field such as cancer research, the number of publications could quickly become unmanageable and important hypothesis-generating evidence may be missed.

With these problems in mind, the university is now developing a computer system to help scientists in the biomedical field.

To be useful, says Cambridge, such a system would need to trawl through the literature in the same way that a scientist would. It would need to read literature to uncover new knowledge, evaluate the quality of the information, look for patterns and connections between facts, and then generate hypotheses to test.

Not only would such a program speed up the progress of scientific discovery but, with the capacity to consider vast numbers of factors, it might even discover information that could be missed by the human brain, said Cambridge.

Dr Anna Korhonen and a team of researchers in Cambridge's Natural Language and Information Processing Group are aiming to develop systems that can understand written language in the same way that humans do. One of the projects Korhonen is involved in has recently developed a method of "text mining" the area of cancer risk assessment of chemicals, one of the most literature-dependent areas of biomedicine.

Every year, thousands of new chemicals are developed, any one of which might pose a potential risk to human health. Korhonen said: "The first stage of any risk assessment is a literature review, which is a major bottleneck as there could be tens of thousands of articles for a single chemical. Performed manually it's expensive and because of the rising number of publications it's becoming too challenging to manage."

Her team has developed the CRAB tool in collaboration with Professor Ulla Stenius' group at the Institute of Environmental Medicine at Sweden's Karolinska Institutet.

The CRAB approach involves developing programs that can analyse natural language texts, despite their complexity, inconsistency and ambiguity. The CRAB technology is billed as the first text-mining tool aimed at aiding literature reviews in chemical risk assessments.

At the press of a button a profile is rapidly built for any particular chemical using all of the available literature, describing highly specific patterns of connections between chemicals and toxicity.
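
As a rough illustration of what such a profile could look like, the sketch below counts how many abstracts mention a chemical alongside a term from a given evidence category. It is only a toy under stated assumptions: the article does not describe CRAB's internals, so the category names, the keyword lists and the `build_profile` function are all hypothetical, and simple keyword matching stands in for real natural language analysis.

```python
from collections import Counter

# Hypothetical evidence categories and keywords, chosen for illustration only.
# The article does not describe the taxonomy CRAB actually uses.
EVIDENCE_TERMS = {
    "genotoxic": ["mutation", "dna damage", "genotoxic"],
    "cell proliferation": ["proliferation", "hyperplasia"],
    "oxidative stress": ["oxidative stress", "reactive oxygen"],
}


def build_profile(chemical, abstracts):
    """Count, per category, the abstracts that mention both the chemical
    and at least one keyword from that category."""
    profile = Counter()
    name = chemical.lower()
    for text in abstracts:
        lowered = text.lower()
        if name not in lowered:
            continue  # abstract is not about this chemical
        for category, terms in EVIDENCE_TERMS.items():
            if any(term in lowered for term in terms):
                profile[category] += 1
    return dict(profile)
```

Calling `build_profile("styrene", abstracts)` on a list of abstract strings returns a dictionary mapping each matched category to a count of supporting abstracts. A real system would need to handle synonyms, negation and ambiguity, which is where the language analysis described above comes in.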

The tool will soon be available for end-users via an online web interface. But research into improving text mining will continue, said Korhonen, with one of the biggest current challenges being the development of adaptive technology that can be ported easily between different text types, tasks and scientific fields.

"Although still under development the system can be used to make connections that would be difficult to find, even if it had been possible to read all the documents," said Korhonen. "In a recent experiment we studied a group of chemicals with unknown modes of action and used the CRAB tool to suggest a new hypothesis that might explain their male-specific carcinogenicity in the pancreas."

The Cambridge development comes as IBM is making a play to help manage patient care with its Watson (https://www.computerworlduk.com/news/applications/3313622/ibms-dr-watson-seeks-to-deliver-automated-healthcare/) data analytics computing platform. Watson is designed to understand natural language in unstructured data and is now being applied to medical diagnostics.

Watson's capabilities allow patients' medical histories to be overlaid with their symptoms and their family histories of illness, in the hope of helping clinicians reach a more accurate diagnosis.