Enhance Drug Discovery Methods Using Deep Learning

Published: December 4, 2018


The same Intel® architecture-based hardware systems used in the lab for everyday computational tasks can be harnessed to perform deep learning research and neural network training to automate drug discovery.

“An enormous figure looms over scientists searching for new drugs: the estimated US$2.6 billion price tag of developing a treatment. A lot of that effectively goes down the drain, because it includes money spent on the nine out of ten candidate therapies that fail somewhere between Phase 1 trials and regulatory approval. Few people in the field doubt the need to do things differently.” 1

— Nic Fleming,
Nature: International Journal of Science

Challenge

Inefficiencies in traditional techniques for identifying promising drug treatments have slowed the process of discovery and added substantially to research costs. New approaches are needed to accelerate discovery and reduce costs.

Solution

An innovative method for training the multiscale convolutional neural network (CNN) topology on a distributed CPU architecture gives researchers a valuable tool for discovering promising drugs. Within this domain, data generation and capture are highly automated, making it possible to implement scalable analytic solutions efficiently on the same computing hardware used in the lab for other computational tasks.

Background and Project History

Kyle Ambert, a senior deep learning data scientist at Intel, has been on a quest for much of his career to discover and refine more effective solutions for performing life science analytics. While working on his PhD at Oregon Health & Science University, Kyle focused on developing machine learning systems for helping researchers in the neurosciences. One keen area of interest for him was natural language processing, which led him to address the challenge of building machines that can analyze and extract useful patterns from scientific literature.

“When I joined Intel, I was naturally drawn to the work that we were doing to solve computational problems in the life sciences. Two years ago, I joined our deep learning group and a main focus was on understanding how image classification systems can be optimized to run on Intel® architecture-based hardware platforms. One of my colleagues introduced me to our collaborator’s computational research team, who challenged my team to take a deep learning topology they already use and optimize it for running on their Intel® Xeon® processor-based cluster. The goal was to make it possible to process more images per day than they were currently able to do. At the time, I believe it was taking 11 hours for them to train their model. All told, our work led to a drastic improvement—our eight-machine [Intel] Xeon processor-based cluster trains in 31 minutes.”

“A small collection of commonly available datasets guides the artificial intelligence community’s understanding of optimal image classification with deep learning, and images in these tend to be relatively small with respect to number of pixels,” Kyle said, “and, in terms of content, simple. One of the more frequently used collections, for instance, contains 256 x 256 images belonging to one of thousands of possible categories. One image, for example, depicts an airplane, the next a dog, the next a car, and so on.”

Kyle noted that while image collections such as this facilitate training systems for carrying out many important tasks, the information obtained from these types of images doesn’t often translate well to pharmaceutical research, which primarily relies on image data acquired with microscopes.

Image capture devices in use in much of the pharmaceutical industry generally produce large images—often at a resolution of 1024 x 1280 or above—depicting complex results that are usually best understood by human annotators. “Rather than depicting a single object of interest,” Kyle said, “high-content images in this domain generally depict multiple cells of potentially differing phenotypes. Rather than simply identifying the presence or absence of a particular cell type, a given task may require identifying a certain number of cells or an interaction between two cells of different phenotypes. In my experience, these are the types of images common to the life sciences. A CT scan depicts a complex snapshot of the human body. An MRI might show enlarged ventricles along with a brain tumor. Teaching a machine to understand biological images potentially requires re-evaluating what we understand about using deep learning methods for image classification.”

A recent Intel collaboration with a major pharmaceutical firm began in April 2017, focusing on the application of deep learning techniques to analyze high-content images. Optimization enhancements to the analytical process began in fall of the same year with plans to release the findings to the community in November 2018.

“Intel technology is everywhere and, because of that, it can sometimes open doors for collaboration that would otherwise be difficult to open.”
— Kyle Ambert, senior deep learning data scientist, Intel

Enabling Technologies

Intel® Xeon® Scalable processor technology proved extremely important to the collaborative research being conducted. The computational demands of working on hundreds or thousands of microscopy images—which often contain millions of pixels each—within a deep convolutional neural network model can require tremendous amounts of time. Using deep neural network acceleration techniques, the research team was able to process images in less time while simultaneously gaining improved insights into the image characteristics relevant to the learning process.
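As a rough illustration of the kind of convolutional model involved, the sketch below builds a small Keras CNN in TensorFlow. The input resolution and the 12-way output layer are placeholders chosen for illustration; they are not the team's actual multiscale topology, which was trained on far larger microscopy images.

```python
import tensorflow as tf

# Minimal CNN sketch; input size and class count are illustrative
# placeholders, not the multiscale topology described in the article.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(256, 256, 3)),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(12, activation="softmax"),
])

# Run one dummy image through the network to get class probabilities.
probs = model(tf.zeros([1, 256, 256, 3]))
```

Scaling such a model to megapixel microscopy images multiplies the activation memory per layer, which is where the large memory capacity of the cluster nodes becomes relevant.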

The team employed an eight-machine cluster of two-socket Intel® Xeon® Gold 6148 processor-based systems (40 cores per machine) running at 2.4 GHz with 192 GB of memory available for image processing (see 8 Node Cluster Configuration Details on the last page for more information).

This system enabled the team to handle over 120 3.9-megapixel images each second, using images from the Broad Bioimage Benchmark Collection* 021 (BBBC021) for training. The result was a more than 20-fold improvement in processing a dataset of 10,000 images.

“The large memory capabilities of Intel Xeon Scalable processors enable us to train deep learning workloads with a memory footprint beyond what other technologies would be able to accommodate,” Kyle said. The system configuration developed by the team also featured a high-speed fabric interconnect—Intel® Omni-Path Host Fabric Interface (Intel® OP HFI)—and Intel® Solid State Drives. On the software side, Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN), Intel® Advanced Vector Extensions 512 (Intel® AVX-512), and the TensorFlow* optimizations were all important to the results.
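CPU-side tuning for MKL-accelerated TensorFlow is commonly done through OpenMP environment variables set before the framework is imported. The values below are illustrative for a two-socket, 40-core node; they are not the team's published configuration.

```python
import os

# Illustrative OpenMP/MKL tuning; values are examples for a 40-core node,
# not the settings used in the study. Set these before importing TensorFlow.
os.environ["OMP_NUM_THREADS"] = "40"   # one OpenMP thread per physical core
os.environ["KMP_BLOCKTIME"] = "1"      # threads sleep quickly after parallel regions
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"  # pin threads to cores
```

Pinning threads to cores and matching the thread count to physical cores typically reduces cache thrashing in convolution-heavy workloads.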

“Next,” Kyle said, “I’m really interested in examining deep learning-based methods for unsupervised classification workloads. I don’t think the current approach of using supervised machine learning is scalable to the diversity of problems and dynamics in real-time data.”

Ongoing opportunities for discovery

Intel engagements with leading organizations in the medical community generate insights into advances and help develop AI techniques that can be applied to a broad spectrum of applications. Kyle noted, “We directly engage with our target industry for this very reason. The workload we studied for this project is common to the drug discovery process used by every company, so we imagine others will be interested in our results as well.”

“Besides addressing the industry problem in question,” Kyle continued, “we also contributed to the field’s understanding of how to scale out training on clusters of CPUs with large data.”

To validate the methodology in use, Kyle thinks that it is very important to be aware of the assumptions that go into using a statistical model or a particular machine learning library and to continually question why something is done a certain way. This process of maintaining awareness and re-evaluating the methods being employed during discovery can reveal hidden biases or flaws in the logic behind the operations.

To those interested in furthering their knowledge on the latest artificial intelligence advances and successes, ai.intel.com provides news of research breakthroughs, development guidelines, educational content, and programming libraries.

Figure 3. Artificial intelligence is reshaping the way we investigate human health issues and medicine.

“Unsupervised deep learning methods—that may be applied to unlabeled microscopy images— hold the promise of revealing novel insights for cellular biology and ultimately drug discovery. This will be the focus of continuing efforts in the future.” 2
- Intel Newsroom

TensorFlow* for High-Performance Computing

TensorFlow, an open-source framework for numerical computation, includes specific features for implementing large-scale machine learning processes. Originally released by Google in November 2015, TensorFlow initially performed slowly on CPU platforms. Following Intel optimizations for running TensorFlow on Intel® Xeon® processor-based platforms, substantial performance improvements have been realized. TensorFlow is well suited to a range of AI applications, including image recognition, language recognition, and object detection and localization.

Python* is the primary interface for TensorFlow, with support for NumPy. It gives developers a means to create dataflow graphs describing how data moves through a collection of nodes. Each node corresponds to an individual mathematical operation, and the connections between nodes represent multidimensional data arrays, called tensors. Python makes it possible to easily couple together the high-level abstractions being expressed. Tensors are exposed as Python objects within TensorFlow, and each TensorFlow application is essentially a Python application. Working with these abstractions makes building a machine learning implementation much easier, allowing developers to focus on the logical constructs of a program without having to deal with lower-level algorithms or implementation details.
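A minimal example of these ideas: each operation below is a node in the dataflow graph, and the values flowing between nodes are tensors. (This uses TensorFlow 2's eager-by-default API; TensorFlow 1.x would wrap the same computation in an explicit graph and session.)

```python
import tensorflow as tf

# Two constant tensors feed a matrix-multiply node; the edges between
# operations carry tensors.
a = tf.constant([[1.0, 2.0]])    # shape (1, 2)
b = tf.constant([[3.0], [4.0]])  # shape (2, 1)
c = tf.matmul(a, b)              # 1*3 + 2*4 = 11
```

The same three-node structure scales to the thousands of operations in a full CNN without the developer managing any of the underlying kernels.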

The TensorFlow machine learning framework simplifies the acquisition of data, training of models, and predictive operations. The structures used in TensorFlow are well suited to CNN models. Intel offers guidance on setting threading models for CNN implementations and performance guidance for using TensorFlow with Intel® MKL. The optimizations that Intel has created for TensorFlow give developers a performance boost for processor-intensive operations in machine learning and can significantly reduce times for training and inference operations.
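In current TensorFlow, the threading model is exposed through `tf.config.threading`. The values below are illustrative (for example, one socket's worth of intra-op threads) and must be set before any operations execute; they are not the settings used in the study.

```python
import tensorflow as tf

# Configure op-level parallelism before running any ops; the numbers
# here are illustrative for a 20-core socket, not tuned values.
tf.config.threading.set_intra_op_parallelism_threads(20)  # threads inside one op
tf.config.threading.set_inter_op_parallelism_threads(2)   # ops run concurrently
```

Intra-op threads parallelize a single convolution or matrix multiply, while inter-op threads let independent graph nodes run side by side; balancing the two is the core of CNN threading guidance.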

Figure 4. Image recognition within a TensorFlow* structure.

AI Is Expanding the Boundaries of Drug Discovery

Through the design of specialized chips and enhancements to existing architectures, research, educational outreach, and industry partnerships, Intel is accelerating the progress of AI to solve difficult challenges in medicine, manufacturing, agriculture, scientific research, robotics, and other industry sectors. Intel works closely with policymakers, educational institutions, and enterprises of all kinds to uncover and advance solutions that address major challenges in the sciences.

Rethinking the drug discovery paradigm

Detecting patterns in large volumes of data is one of the key strengths of deep learning methodologies, and this capability is drawing many startups into research efforts that focus on using AI to accelerate drug discovery. One example is Berg, a biotechnology company outside of Boston, Massachusetts, that pioneered a technique for identifying cancer mechanisms, using an AI platform to generate and analyze massive volumes of patient data and zero in on the relevant characteristics of diseased cells. The research team modeled diseased human cells, monitoring lipid, metabolite, enzyme, and protein profiles while changing sugar and oxygen levels at the cellular level. Tests on over 1,000 human cell samples, some healthy and others cancerous, have opened pathways for identifying treatment methods based on the biological origins of disease.

Berg’s co-founder and chief executive, Niven Narain, said, “We are turning the drug-discovery paradigm upside down by using patient-driven biology and data to derive more-predictive hypotheses, rather than the traditional trial-and-error approach.”3

“From autonomous cars that will save thousands of lives, to data analytics programs that may finally discover a cure for cancer, to machines that give voice to those who can’t speak, AI will be known as one of the most revolutionary innovations of mankind.”4