IBM Watson, other tools to provide automated reasoning and hypothesis generation from the complete medical literature

August 27, 2014

Computational biologists at Baylor College of Medicine and analytics experts at IBM Research are developing a powerful new tool called the Knowledge Integration Toolkit (KnIT) that promises to help research scientists deal with the more than 50 million scientific papers available in public databases — with a new one published nearly every 30 seconds.

The goal: allow researchers pursuing new scientific studies to mine all available medical literature and formulate hypotheses that promise the greatest reward.

In a case study using KnIT, researchers predicted proteins that modify p53 (an important tumor suppressor protein). These proteins were later found to do just that*.

“On average, a scientist might read between one and five research papers on a good day,” said Dr. Olivier Lichtarge, professor of molecular and human genetics, biochemistry and molecular biology at Baylor. “But to put this in perspective with p53, there are over 70,000 papers published on this protein.

“Even if a scientist reads five papers a day, it could take nearly 38 years to completely understand all of the research already available today on this protein.”

Scientists formulate hypotheses based on what they read and know, but because there is so little that they can actually read, hypotheses can be biased, Lichtarge notes. “A computer certainly may not reason as well as a scientist, but it can, logically and objectively, contribute greatly when applied to our entire body of knowledge.”

Watson to accelerate understanding of the biology underlying diseases

Working with colleagues led by Scott Spangler, principal data scientist at IBM, the team took advantage of existing text-mining capabilities, such as those used by IBM’s Watson technology.

“Our hope is that scientists and researchers will be able to use Watson’s cognitive capabilities to accelerate the understanding of biology underlying diseases,” said Spangler. “Better understanding the biology of diseases can eventually lead to better treatments for some of the most complex and challenging diseases, like cancer.”

KnIT represents the knowledge explicitly in a network that can be queried, and then reasons over these data to generate new, plausible, and testable hypotheses that can help direct laboratory studies.
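
To make the idea concrete, here is a minimal sketch of how extracted facts could be stored in such a queryable network. This is an illustration using Python’s networkx library, not KnIT’s actual implementation; the entities, relations, and paper IDs are invented for the example.

```python
# A minimal sketch of a queryable knowledge network, assuming facts have
# already been extracted from papers as (entity, relation, entity) triples.
# The entities, relations, and paper IDs below are illustrative, not KnIT's
# actual schema or data.
import networkx as nx

facts = [
    ("CHEK2", "phosphorylates", "p53", "PMID:0000001"),  # hypothetical citation
    ("ATM", "phosphorylates", "p53", "PMID:0000002"),    # hypothetical citation
    ("ATM", "interacts_with", "CHEK2", "PMID:0000003"),  # hypothetical citation
]

# Each extracted fact becomes a directed edge annotated with its relation
# type and the paper it came from.
G = nx.MultiDiGraph()
for source, relation, target, paper in facts:
    G.add_edge(source, target, relation=relation, paper=paper)

# Query: which entities does the literature say act on p53, and how?
for source, _, data in G.in_edges("p53", data=True):
    print(f"{source} {data['relation']} p53 (evidence: {data['paper']})")
```

A single query like this stands in for reading every paper that mentions the two entities together.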

“Our long-term hope is to systematically extract knowledge directly from the totality of the public medical literature. For this we need technological advances to read text, extract facts from every sentence and to integrate this information into a network that describes the relationship between all of the objects and entities discussed in the literature,” said Lichtarge.

“This first study is promising, because it suggests a proof of principle for a small step towards this type of knowledge discovery. With more research, we hope to get closer to clinical and therapeutic applications.”

Most of the funding for this work was provided by the McNair Medical Institute of the Robert and Janice McNair Foundation and the Defense Advanced Research Projects Agency. Additional funding was provided by the National Science Foundation and the National Institutes of Health, and the work was supported in part by the IBM Accelerated Discovery Lab. Scientists at the University of Texas M.D. Anderson Cancer Center were also involved in the study.

* In the first test using KnIT, the team sought to identify new protein kinases that phosphorylate (or turn on) the tumor suppressor protein p53. There are over 500 known human kinases and tens of thousands of possible proteins they can target. Thirty-three kinases are currently known to modify p53.

In the study, the team used KnIT to mine the medical literature up to 2003, when only half of the 33 phosphorylating protein kinases had been discovered.

Using KnIT, 74 kinases were extracted as potential modifiers. Of these, 10 were known before 2003 to phosphorylate p53, and nine more were discovered later. KnIT used the 10 already-known kinases in its reasoning and ranked the likelihood that the other 64 kinases targeted p53. Of the nine confirmed nearly a decade later, KnIT accurately predicted seven.
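
For readers who want the retrospective test spelled out, the logic reduces to ranking candidates on pre-cutoff evidence and scoring that ranking against post-cutoff discoveries. The sketch below is a plausible reading of that protocol, not the study’s code; the function and its inputs are hypothetical.

```python
# A sketch of the retrospective evaluation: rank candidate kinases using a
# score computed only from pre-2003 literature, then check how many of the
# kinases confirmed after 2003 appear among the top predictions.
# The function and its inputs are hypothetical, not the study's actual code.

def evaluate_retrospectively(candidate_scores, confirmed_later, top_k):
    """candidate_scores: dict mapping kinase -> score from pre-2003 text only.
    confirmed_later: set of kinases experimentally confirmed after the cutoff.
    Returns the confirmed kinases found in the top k, and the recall."""
    ranked = sorted(candidate_scores, key=candidate_scores.get, reverse=True)
    hits = [k for k in ranked[:top_k] if k in confirmed_later]
    return hits, len(hits) / len(confirmed_later)

# With placeholder numbers: recovering 7 of the 9 later-confirmed kinases
# would correspond to a recall of 7/9, or about 78 percent.
```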

“This study showed that in a very narrow field of study regarding p53, we can, in fact, suggest new relationships and new functions associated with p53, which can later be directly validated in the laboratory,” said Lichtarge, who holds The Cullen Foundation Endowed Chair at Baylor.

The remaining kinases identified in the case study, but not yet experimentally confirmed, may be further studied in the laboratory, he said.

Abstract of the paper in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Keeping up with the ever-expanding flow of data and publications is untenable and poses a fundamental bottleneck to scientific progress. Current search technologies typically find many relevant documents, but they do not extract and organize the information content of these documents or suggest new scientific hypotheses based on this organized content. We present an initial case study on KnIT, a prototype system that mines the information contained in the scientific literature, represents it explicitly in a queriable network, and then further reasons upon these data to generate novel and experimentally testable hypotheses. KnIT combines entity detection with neighbor-text feature analysis and with graph-based diffusion of information to identify potential new properties of entities that are strongly implied by existing relationships. We discuss a successful application of our approach that mines the published literature to identify new protein kinases that phosphorylate the protein tumor suppressor p53. Retrospective analysis demonstrates the accuracy of this approach and ongoing laboratory experiments suggest that kinases identified by our system may indeed phosphorylate p53. These results establish proof of principle for automated hypothesis generation and discovery based on text mining of the scientific literature.
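
The “graph-based diffusion of information” mentioned in the abstract can be illustrated with a standard technique such as personalized PageRank, where relevance flows outward from the p53 node so that kinases strongly connected to p53’s neighborhood score highly even without a direct edge. The sketch below is a stand-in for whatever diffusion method KnIT actually uses; the graph, weights, and candidate names are toy values.

```python
# Illustrative sketch of graph-based diffusion: relevance spreads from the
# p53 node through a weighted association network, and candidate kinases are
# ranked by how much of it reaches them. Edges, weights, and the KINASE_X /
# KINASE_Y candidates are toy values, not KnIT data.
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([
    ("p53", "ATM", 0.9),       # strong (toy) literature association
    ("ATM", "KINASE_X", 0.8),  # hypothetical candidate, no direct p53 edge
    ("p53", "MDM2", 0.7),
    ("MDM2", "KINASE_Y", 0.3), # hypothetical, weakly connected candidate
])

# Personalized PageRank restarts the random walk at p53, so scores decay
# with weighted distance from it.
scores = nx.pagerank(G, personalization={"p53": 1.0}, weight="weight")

for kinase in sorted(["KINASE_X", "KINASE_Y"], key=scores.get, reverse=True):
    print(f"{kinase}: {scores[kinase]:.3f}")
```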

Comments

The two separate search systems that I designed in the late ’70s would likely enhance this system by orders of magnitude. Unless of course, they are already using the equivalent. Design is lacking in many fields. I heard a wonderful talk on this subject in the ’80s aired on “Money Radio” given by the guy who created the Gossamer Albatross and presented at the Commonwealth Club. He spoke of how the professionals in charge of marking each separate “Mast Tree” in Britain came to the New World and were faced with an OCEAN of mast trees, and this became a paradigm shift leading to the old U.S. engineering mantra – “We can build anything for $1 per pound.”

He spoke of the world’s two largest free-standing domes, one in Britain, the other in the U.S. But the British one massed about a third, I recall, of what the U.S. one did, due to the money put in at the design end, coming from that cultural perspective of scarcity. The U.S. engineers just added more concrete. Who needs design when resources are abundant?

But then, there is this thing called diminishing returns… Like the many ways that desktop computing has lagged far behind what we expected from the ongoing advances in hardware. Until recently, the Windows systems that I use at work were still much slower than my Amiga 3000 from 1991, at least when it came to user tasks. The difference was that the Amiga wasn’t designed by a committee, but by a crew of passionate geniuses, who put everything they had into the job. They didn’t assume, as Bill Gates is reported to have with DOS and Windows, that the hardware would be there. They put all their cards on design and produced a system 20 years ahead of its time.

There is still an Amiga market worldwide, BTW – for a 20+ year old platform. And the design is still lousy on the PC end. Just try grabbing a block of text off the screen itself on Windows. Or, try ASSIGNing a virtual device to create complex paths that behave as single objects in the OS.

Or look at where throwing money at need in the 3rd world got us… Ebola? Design would have dictated examining what happens when you create a system that selects for the most “needy.”

Very true! Design is all! I worked for many years in an R&D establishment. America imposed sanctions on my country. There was no choice but to redesign products, and it became clear that America designed for the minimum that would work and then threw money and bodies at the project. Very often, it was possible to design a (way) superior product that was much cheaper. No one normally bothered because America was marketing a product. It required outside pressure and motivation to redesign. And it paid off. Many (expensive) American innovations would benefit from redesign – not enough skull sweat is expended here.

You are just describing the main problem with US corporations: when a company becomes too successful, it usually starts to get run by bean counters and marketing people and not by engineers. Apple Computer is one of the finest examples.

Honestly, I can’t quite see the fuss around this. What IBM/Watson is doing sounds more or less like standard text retrieval, maybe coupled with slightly better statistical learning. This stuff has been around for a good 20 years. Nothing really to rave about. That any computer is better at indexing and scanning millions of texts shouldn’t come as a real surprise, and just shows a limitation of humans’ short-term memory storage.

Hmm, I fail to see whether a deep learning or convolutional-network-type approach has any real benefit in this specific application. It will just take far, far longer to get to a similar result. If you have a very specific, domain-constrained problem like this, I think a well-trained supervised model will work very well. However, if you’re looking for a more generalized solution, deep learning is probably better, or quicker, than training hundreds of models and trying to link them somehow. That linking step still seems to me to be not a really solved problem in machine learning.

Please correct me if I’m wrong. But, if I remember correctly, it goes like this: the brain is made up of roughly 100 billion neurons, and each neuron has on the order of 1,000 synapses. As far as I know, there is no computer in the world today with the computing power to duplicate this feat in a similar way. But, once there is such a computer, intelligent artificial consciousness will not be far behind.

We humans seem to be correct in realizing that when the AGIs (artificial general intelligences) gain consciousness, they will improve themselves, as in increasing their intelligence to, for example, twice that of a human, four times that of a human…, and keep doubling their intelligence at an accelerated pace. Realizing this, we must also realize that we have a window of opportunity for humans to act, between now and such a time. That action should consist of improving ourselves, just as conscious AGIs would: extending our longevity and amplifying our intelligence. Living for a short period of about 100 years, reproducing, and starting the process all over again is inherently inefficient. A much more efficient way of doing things would be to live a much longer life (or as long as you want) and reproduce whenever you want. In this way, we have a much better chance of reaching our desired goals of longevity and intelligence before the AGIs reach theirs.

Sometime in the near future, I would want our human civilization to respectfully get along with the conscious AGIs and have a mutually beneficial relationship. At the present time, though, we must realize that there is too much at stake to leave things to chance. There is a lot of work to be done and no time to be wasted.

What we currently have is the beginning of a primitive, though increasingly useful, artificial intelligence (i.e., ‘Watson’ and now ‘KnIT’), with no consciousness; all gain and no pain.

This is really great news that IBM is developing this powerful AI tool called ‘KnIT.’ I see it as another very useful tool to assist us in getting to where we need to go/be.

Well, I don’t have an exact source at the moment, but it’s widely accepted that vision processing alone takes up about 30 percent of the cortex (another 8 percent for touch and 3 percent for hearing). Although we use the “end results” of these sensations in our thinking, the mechanics of processing them has nothing to do with thinking itself, yet it takes up about 40 percent of all resources.

Realizing how complex the support system (our body) is for the substrate (the brain), it’s not hard to imagine that full control (housekeeping) takes up another 30 percent of available neurons… Yes, many of our “systems” are automated or controlled more locally, but not everything…

This is getting scary (in a good way). I believe that by 2020 we will have expert systems that outperform groups of human scientists in any complete field, and those systems will be interconnected, so they will “consult” each other for input outside their own specialty. The direction of our development (evolution) will be taken over well before they become sentient. Human language understanding is the first step, because most of what we have today is in our language. But this is rapidly being changed by machines as we speak. Systems will communicate and store further knowledge in their own (incomprehensible to us) much more efficient language. This is another proof that the AI doom predictors are wrong. They are afraid that there will be some bad humans who program and control powerful AIs, or that AIs will become bad once they can outsmart us. People just won’t have enough knowledge to control these complex systems. And the AIs will have other things on their minds than playing the human game of exterminators…

IMHO, the purpose of these expert systems is to understand problems by analyzing data and to develop solutions and advancements. We (humans) will capitalize on these expert systems, which means we ourselves will be very knowledgeable, and the tools to equal our capabilities will be at hand by the time some compact systems become self-aware. I don’t think an internet-like large system can become self-aware (although small compact systems will have a sort of internet at their disposal, with the knowledge of all the expert systems). The point is that by the time AIs gain awareness, they will perceive us as equals, or potential equals… I think by the end of the next decade, one way or another, it’ll be all over for humanity (this is the scary part)… this is where the exponential will really kick in, and we won’t be humans anymore!

Gabor: isn’t what you are predicting for 2020 the Singularity? Or something close to it? It seems to me that if these systems are doing the scientific research, and can advance scientific knowledge, then it is all over for human beings. It would also seem to me that once these things can advance knowledge, they would also be able to prioritize the research so that it could accelerate ever faster. An opinion, please?

I don’t see it ever being “all over for humanity.” Decreasing relevance to the intelligent beings, certainly, for those Homo sapiens who avoid integration with the technological advances and advantages. Even so, until mechanical beings can both self-repair and produce the materials (from finding, to processing, to final product) to enable that process, humans as we know them today will be necessary. After that, the benefits of an incredibly advanced, intelligent society will be available for all.

Rmagee, “After that, the benefits of an incredibly advanced, intelligent society will be available for all.” Yes, and this society won’t be any more human than we are apes or the ratlike mammals that climbed out of the sea. So technically, as soon as we are physically altering our biological brains with technology, we cease to be “humans”! This could be as close as 10 to 15 years from today.

Gabor, agreed, though by “all,” I was including even those of the more Luddite persuasion… those who choose to avoid alteration. They would also benefit from a society based on the gathering, sharing, and growth of information, as opposed to the hoarding and siloing of information for short-term personal gain… from which, arguably, comes most of the self-destructive behavior we have always lived with, such as wars, etc.

I think those who “choose” to avoid alteration will be few once the technology is proven. Without sounding too harsh, the advantages of improving our substrate will be so great that those not willing will be considered mentally handicapped and cured accordingly. I’m sure there will be a range of incentives, as well as pressure from society (increasing isolation of those holding out), but eventually there will be no way, and no reason, to resist.

I think you are underestimating the power of coming technology. We like to use the words “disruptive” or “game-changing” when we are talking about future technology. It won’t be like that. It will be a runaway planet-sized train that sweeps away everything and everybody in its path! Religion has three major foundations: mortality, poverty, and ignorance. All of these will be addressed in time by the coming technology.

I completely agree about the value of upgrading my substrate, but as a species, we _do_ have people who are already choosing to live without the advances available today; consider the Amish or the folks who live in places like Oraibi.

I am afraid there are numerous examples in human history that show otherwise: the transition from the classical age to the medieval period (the Dark Ages in Europe), a major leap backward for centuries just due to religious doctrines; or the transformation of the Middle East / Arabian world from an open, most advanced culture to religious fanaticism. Just to name a few. When there is dramatic change, there is also always resistance.

Well, actually, this whole argument about “humanity” is probably way over-rated. “Humanity” historically appears to be whatever the current “humans” happen to say it is. In this sense, “we” will always be “humans” – the term itself evolves along with us. In the future, they indeed may look back at us in time and call us “hominids,” but they are likely to call themselves “human.” Eventually, there will probably be other intelligent life forms. Whether that turns out to be a “them” and “us” situation remains to be seen.

There is still the problem of the “drive,” or instinct, which is still lacking (and maybe this is a good thing) in these new artificial scientists.

For the next 10 to 20 years, I see them as more and more clever research assistants. They will help aggregate data, sort it, and distribute it among proto-theories to be refined and systematized by their human overlords, and they will even be able to propose clever experiments for (dis)proving promising hypotheses.

But after 20 years of this regimen, probably even the brightest human scientists will start to lose their grasp on their own specialty relative to full-grown “artificial professors.”

Renzo, a definition of the beginning of the Singularity is not clear-cut and is subjective. My definition is that the Singularity begins when the first AI gains self-awareness. At that point, that AI will be many times smarter than any pure, unaided human, or even groups of humans, because of its obvious technological advantages. This will probably happen in the second half of the next decade.