Johns Hopkins is consistently ranked among the top universities worldwide, and many of its schools and departments are rated best in the world. We host some of the richest and most interesting scientific data anywhere, and collaboration is one of JHU’s strong suits.

Our excellent cross-departmental community of machine learning faculty works on both general-purpose ML and more domain-specific ML methods. Here we survey some of our machine learning focus areas (explore individual faculty pages for more details).

General-Purpose ML Methods

Several of us, especially in Computer Science and Applied Mathematics and Statistics, focus on developing fundamental ML methods that cut across applications; examples of these methods appear throughout the focus areas below.

Natural Language Processing

Humans produce vast quantities of natural language—the primary medium of public knowledge, private records, and interpersonal communication. We are attacking the many statistical modeling and machine learning challenges needed to interact meaningfully with such data.

Large-scale inference: From large corpora, we extract factual knowledge and patterns of human communicative behavior. We are also reconstructing the grammatical and lexical structure of multiple languages at once. These difficult global inference problems require informed models and new semi-supervised and unsupervised techniques, including nonparametric Bayesian methods.

Signal processing: On auditory data, we focus on learning intermediate representations that allow us to extract acoustic events such as phonemes from the auditory stream. We draw on manifold learning, information geometry, physical modeling, and the neuroscience of perception.

Data scarcity: Humans talk about many things, in many languages and dialects and styles. Faced with a practical task—one particular type of language data and something to predict from it—one rarely has enough representative data to train millions of parameters. We seek general solutions to data scarcity, involving domain adaptation, multi-task learning, active learning, and crowdsourcing.

Computational Biology

The pace of biological discovery has accelerated thanks to “next-generation” technologies, whose rate of data acquisition is increasing even faster than Moore’s law. Terabyte and petabyte datasets generated at JHU are leading to a new set of problems requiring machine learning solutions.

Scalable learning: As data sets increase in size, scalable learning becomes increasingly important. Algorithms such as Markov chain Monte Carlo, which may perform well for small data sets, become intractably slow for realistic problems. For example, at JHU, we are attempting to analyze whole-genome data to reveal how stem cells differentiate and specialize and how new tissues are formed — bridging spatial scales from nucleotides to genes to cells to tissues to organisms.
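The cost issue can be seen in a minimal sketch (a toy example with simulated data, not one of the genomics analyses above): every Metropolis–Hastings step must evaluate the likelihood over the entire data set, so per-iteration cost grows linearly with the number of observations.

```python
import numpy as np

def log_likelihood(mu, data):
    # Gaussian log-likelihood with unit variance; note the full pass
    # over the data: O(n) per call, the bottleneck at genome scale.
    return -0.5 * np.sum((data - mu) ** 2)

def metropolis(data, n_steps=5000, step=0.02, seed=0):
    """Random-walk Metropolis sampling of a single mean parameter."""
    rng = np.random.default_rng(seed)
    mu = 0.0
    ll = log_likelihood(mu, data)
    samples = np.empty(n_steps)
    for t in range(n_steps):
        proposal = mu + step * rng.normal()
        ll_prop = log_likelihood(proposal, data)  # touches every data point
        if np.log(rng.uniform()) < ll_prop - ll:  # accept/reject step
            mu, ll = proposal, ll_prop
        samples[t] = mu
    return samples

rng = np.random.default_rng(1)
data = rng.normal(2.0, 1.0, size=10_000)
samples = metropolis(data)
```

Each of the 5,000 iterations rereads all 10,000 observations; multiplying n by a million multiplies the run time the same way, which is what motivates subsampling-based and variational alternatives.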

Small-sample learning: Many biological data sets measure a very large number of features on a much smaller set of biological samples. Standard statistical techniques fail in this regime. A full human genome sequence, for example, contains in principle over 3 billion features accounting for genetic variation at each nucleotide, yet many studies enroll only hundreds to thousands of individuals. Working with scientists at the School of Medicine, we are discovering how rare mutations lead to cancer predisposition, heart disease, neuropsychiatric disorders, and other medical conditions.
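The p >> n regime can be illustrated with a toy sketch (simulated data, not one of the genomic studies above): an L1-penalized regression can recover a handful of relevant features out of thousands in a setting where ordinary least squares is hopelessly underdetermined.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 2000                     # far fewer samples than features
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = [3.0, -2.0, 4.0, -1.0, 2.0]   # only 5 features matter
y = X @ beta_true + 0.1 * rng.normal(size=n)

def lasso_ista(X, y, lam, n_iter=1000):
    """L1-penalized least squares via iterative soft-thresholding (ISTA)."""
    L = np.linalg.norm(X, 2) ** 2    # Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = beta - X.T @ (X @ beta - y) / L                       # gradient step
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # shrinkage
    return beta

beta_hat = lasso_ista(X, y, lam=20.0)
support = np.flatnonzero(np.abs(beta_hat) > 0.05)
```

With 100 samples there is no hope of estimating 2,000 unrestricted coefficients, but the sparsity penalty concentrates the fit on the few truly informative features.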

Heterogeneous learning: Many classic inference problems are framed in terms of a data matrix and, for supervised problems, a vector to predict. Biological data sets often involve multiple types of data and require fundamentally different statistical models. Machine learning methods developed for homogeneous data are typically hard to extend to this case. At JHU, we are combining imaging with personal genetics, and combining data generated from metabolite profiling, mRNA sequencing, DNA sequencing, protein assays, and even natural language processing of the scientific literature to create a moving picture of the cell.

Network science: Biological data sets often have natural representations as networks, with genes and proteins as vertices and specific interaction types as edges. While many machine learning methods have been developed for graph analysis, biological networks pose special challenges. Edges can be noisy, with varying experimental quality or biochemical strength, and can be time-dependent or condition-specific. Furthermore, while most social network data sets consider only a single interaction modality such as phone calls or emails, biological data analysis often requires integration of multi-modal or multi-scale data.

Computer Vision

Humans effortlessly distinguish among a remarkable variety of objects, actions and interactions in complex, cluttered scenes. They can recognize subtle behaviors among several people interacting with each other and with everyday objects. In contrast, automatically interpreting such scenes has proved surprisingly resistant to decades of research.

Automatic interpretation of images and videos raises serious challenges for statistical inference and learning. The difficulties are especially pronounced when the objective is to uncover a complex statistical dependency structure within a large number of variables and yet the number of samples available for learning is relatively limited. At JHU, we are developing algorithms, often based on graphical models, for:

interpreting images efficiently

reducing the dimensionality of high-dimensional datasets

learning from very few examples

clustering data living in multiple subspaces and manifolds

discovering relationships in high-dimensional datasets
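One ingredient above, dimensionality reduction, in its simplest linear form (a sketch on synthetic data, using PCA via the SVD rather than any particular JHU method):

```python
import numpy as np

def pca(X, k):
    """Project the rows of X onto their top-k principal components."""
    Xc = X - X.mean(axis=0)                    # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                       # k-dimensional coordinates

rng = np.random.default_rng(0)
# 500 points lying near a 3-dimensional subspace of R^100
latent = rng.normal(size=(500, 3))
X = latent @ rng.normal(size=(3, 100)) + 0.01 * rng.normal(size=(500, 100))
Y = pca(X, 3)

# Fraction of total variance captured by the first 3 components
S = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
explained = (S[:3] ** 2).sum() / (S ** 2).sum()
```

Methods for data living on several subspaces or curved manifolds generalize this picture, replacing the single global subspace with a union of local models.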

We apply these techniques to key challenges in computer vision, such as:

detecting faces and deformable objects (e.g. cats) in photographs

recognizing object categories in photographs

segmenting and tracking moving objects in video

recognizing dynamic textures (water, smoke, fire) in video

recognizing human activities in video

modeling and recognizing skill in surgical motion and video data

JHU is a major player in computer vision, focusing on foundational research in the field. The Center for Imaging Science serves to coordinate related research, education, and outreach across several JHU departments. The Vision Sciences Group at the Homewood campus brings together the study of machine vision and biological vision.

Robotics

Robots must act in the physical world to accomplish a specific task or objective. Typically, their actions are governed by information acquired from multiple sources and at various time scales. How does the robot decide what data to acquire, how to relate it to the task at hand, and what actions to take? Some examples of machine learning in JHU’s Sensor-Based Robotics research:

Modeling and mining human action: We can now record humans performing tasks unobtrusively and at scale. For example, any of the 270,000 surgeries performed annually with the da Vinci surgical robot can now be recorded. We use ML techniques to process these complex motion signals into quasi-grammatical structures that can then be exploited to augment or automate tasks.

Data to information: Video cameras now produce far more data than any human can realistically process. For example, the so-called Pill Cam, which is swallowed by a patient, produces 50,000 images over eight hours, all of which must be reviewed diagnostically. We develop ML techniques to automate such assessments of images.

Learning and control: As devices become more complex and diverse, it becomes impractical to hand-specify control policies for every task. We instead use machine learning to learn control from examples, or to discover a policy that achieves a particular objective.
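A minimal sketch of learning control from examples (behavioral cloning with a linear policy on simulated demonstrations; the expert gain below is invented purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
K_expert = np.array([[2.0, 0.5]])   # the demonstrator's (unknown) feedback gain

# Recorded demonstrations: states visited and the actions the expert took
states = rng.normal(size=(200, 2))
actions = states @ -K_expert.T + 0.01 * rng.normal(size=(200, 1))

# Behavioral cloning: fit a linear policy u = x @ W by least squares
W, *_ = np.linalg.lstsq(states, actions, rcond=None)

# The learned policy can now act in states that were never demonstrated
u = np.array([1.0, -1.0]) @ W
```

Here the least-squares fit recovers the expert's gain almost exactly; real systems need richer policy classes and care about states the expert never visited.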

Healthcare

Modern healthcare is being transformed by new and growing electronic resources, with hospitals generating terabytes of imaging, diagnostic, monitoring, and treatment data. Machine learning is central to utilizing these rapidly expanding datasets: combing through data across patients, clinics, and hospitals to uncover more effective treatments and practices that increase the quality and longevity of human life. At Johns Hopkins, machine learning researchers are partnering with healthcare professionals to tackle high-impact problems such as:

Are there sub-populations that respond more effectively to a treatment?

What observations by physicians can serve as early warning signs for the onset of chronic illness?

Can we aid medical decision making in high-risk breast, ovarian, prostate, and colorectal cancer families based on computational analyses of protein evolution and structure?

How can randomized clinical trials be optimized for better estimates of treatment effects?

What patterns can we discover from large-scale DCE-MRI and fMRI data?

These problems require novel machine learning algorithms tailored to the challenges of large-scale medical data. We seek to build accurate and reliable algorithms upon which doctors can base decisions, from the level of individual patients to widespread public policy.

The new Center for Personalized Cancer Medicine and Center for Population Health Information Technology (CPHIT) both aim to systematically improve patient care through learning algorithms that run on massive datasets of genetic data or electronic medical records.

Neuroscience

Neuroscientific paradigms at JHU range widely: e.g., molecular genetics, in vivo calcium imaging, multi-electrode array recording, and magnetic resonance imaging (MRI). The resulting complex datasets often have millions of dimensions and may help to unravel many longstanding scientific questions. Some of our machine learning goals:

Explanatory modeling: A fundamental goal is to learn explanatory, causal, high-dimensional graphical models of whole-brain response over time. That is, what is the joint distribution over brain activity (spike trains, fMRI, or EEG) and its external visual, auditory, tactile, and motor correlates?

Random graph models: JHU hosts many connectome datasets, including the world’s largest: 10 TB of electron microscopy on brain slices. We are inferring spatial connectivity graphs from the raw images and building stochastic models of the graph structure.

Network Analysis

Historically, machine learning has focused on data points that are assumed to be sampled independently or exchangeably from some distribution. These assumptions are inappropriate for network data such as social networks, transportation networks, and biological neural networks. A number of basic network-based questions are actively being investigated by members of the ML@JHU community, including:

Graph invariants: What graph invariants are most powerful for various hypothesis testing and anomaly detection tasks? Can observed vertex and edge attributes lead to more powerful tests by considering graph structure jointly with graph content?
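As a concrete example of a simple invariant (a toy sketch: the triangle count is one classical statistic for such tests, computed here directly from the adjacency matrix and compared against an Erdos-Renyi null of matched density):

```python
import numpy as np

def triangle_count(A):
    """Triangles in a simple undirected graph: closed 3-walks / 6."""
    return int(round(np.trace(A @ A @ A) / 6))

def er_graph(n, p, rng):
    """Erdos-Renyi null model with a chosen edge density."""
    U = np.triu(rng.random((n, n)) < p, k=1).astype(int)
    return U + U.T

K4 = 1 - np.eye(4, dtype=int)          # complete graph on 4 vertices
rng = np.random.default_rng(0)
null_counts = [triangle_count(er_graph(4, 0.5, rng)) for _ in range(200)]
```

An observed graph whose triangle count falls far in the tail of the null distribution is evidence of structure beyond its edge density.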

Graph embeddings: Graphs are inherently high-dimensional non-Euclidean objects. How can we embed graphs in low-dimensional Euclidean (or near-Euclidean) spaces for pattern recognition and visualization? How can we choose the optimal dimension for a specific task?
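One widely used construction, sketched here on a simulated two-block stochastic block model (an illustrative toy, not a claim about any specific JHU pipeline), embeds each vertex using a truncated SVD of the adjacency matrix:

```python
import numpy as np

def adjacency_spectral_embedding(A, d):
    """Map each vertex to a point in R^d via a rank-d SVD of A."""
    U, S, _ = np.linalg.svd(A)
    return U[:, :d] * np.sqrt(S[:d])

# Two-community graph: dense within communities, sparse between them
rng = np.random.default_rng(0)
n = 100
P = np.full((n, n), 0.05)
P[:50, :50] = P[50:, 50:] = 0.5
A = np.triu(rng.random((n, n)) < P, k=1).astype(float)
A = A + A.T

X = adjacency_spectral_embedding(A, 2)
centroid_gap = np.linalg.norm(X[:50].mean(axis=0) - X[50:].mean(axis=0))
```

In the embedded space the two communities form two separated point clouds, so ordinary Euclidean tools (k-means, Gaussian mixtures) apply to what began as a non-Euclidean object.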

Semi-supervised classification using graph structure: When attributes of some nodes or edges are observed, can we predict attributes of other nodes or edges in the same network? For example, given a few examples of fraudulent actors in a social network, communication network, or financial network, can we find more like them? Given a noisy data set from high-throughput biological assays, can we discard spurious edges and nominate missing edges for testing?
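A sketch of the idea on a toy graph (label propagation with clamped observations; the six-node graph and the ±1 labels are invented for illustration):

```python
import numpy as np

def propagate_labels(A, labels, n_iter=100, alpha=0.9):
    """Semi-supervised label propagation: diffuse known labels along edges.

    A      : n x n adjacency matrix
    labels : length-n array, +1 / -1 for labeled nodes, 0 for unlabeled
    """
    D_inv = 1.0 / np.maximum(A.sum(axis=1), 1)
    W = A * D_inv[:, None]                     # row-normalized transitions
    f = labels.astype(float)
    for _ in range(n_iter):
        f = alpha * (W @ f)                    # average over neighbors
        f[labels != 0] = labels[labels != 0]   # clamp observed labels
    return np.sign(f)

# Two 3-cliques joined by a single edge; one labeled node per clique
A = np.zeros((6, 6), dtype=int)
for i in range(3):
    for j in range(3):
        if i != j:
            A[i, j] = A[i + 3, j + 3] = 1
A[2, 3] = A[3, 2] = 1
labels = np.array([1, 0, 0, 0, 0, -1])
pred = propagate_labels(A, labels)
```

Two labeled vertices are enough here: the graph structure itself carries each class to every other vertex in its community.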