Machine Learning Department Research

Below is a sampling of active ML research projects and labs. Additional research projects are described on the home pages of individual faculty.

The AUTON Lab

Our main research concerns data structures and algorithms that make interesting statistical and learning approaches tractable on large volumes of data. We are very interested in the underlying computer science, mathematics, and statistics, and in practical applications of our work. We collaborate closely with food safety analysts, public health agencies, nuclear safety experts, managers of fleets of equipment, social networkers, astrophysicists, biologists, drug companies, exploration companies and roboticists. www.autonlab.org

Cell Organizer

A team led by Bob Murphy, Department Head for Computational Biology and a faculty member in the Machine Learning department, is combining image-derived modeling methods with active learning to build a continuously updating, comprehensive model of protein localization. Obtaining a complete picture of the localization of all proteins in cells and how it changes under various conditions is an important but daunting task, given that there are on the order of a hundred cell types in the human body, tens of thousands of proteins expressed in each cell type, and over a million conditions (which include the presence of potential drugs or disease-causing mutations). Automated microscopy can help by allowing large numbers of images to be acquired rapidly, but even with automation it will not be possible to directly determine the localization of all proteins in all cell types under all conditions. http://murphylab.cbi.cmu.edu/CellOrganizer/

Databases Group

The databases group at Carnegie Mellon University focuses on high performance database architectures, multimedia, and data mining. We participate in a number of cross-disciplinary efforts, and closely collaborate with a number of other groups at CMU. http://www.db.cs.cmu.edu/db-site/

Querendipity

Working scientists need to track an enormous amount of information. In addition to following the scientific literature, which is currently growing at a rate of a million articles a year, biologists need to know when new high-throughput experimental results have been obtained that might impact their work. The model traditionally used in biology to solve this problem is the creation of a manually curated community database of experimental results and literature. The Querendipity project aims to create a new model for managing and distributing scientific data. Querendipity is a personalized adaptive information system that works by loosely integrating data of many sorts (including unstructured text) into a single structure that can be queried using "schema-free similarity queries", which are similar to keyword queries but allow queries to structured data with few text annotations as well as to text. http://www.cs.cmu.edu/~wcohen/querendipity/

Read the Web

Can computers learn to read? We think so. "Read the Web" is a research project that attempts to create a computer system that learns over time to read the web. Since January 2010, our computer system called NELL (Never-Ending Language Learner) has been running continuously, attempting to perform two tasks each day: First, it attempts to "read," or extract facts from text found in hundreds of millions of web pages (e.g., playsInstrument(George_Harrison, guitar)). Second, it attempts to improve its reading competence, so that tomorrow it can extract more facts from the web, more accurately. http://rtw.ml.cmu.edu/rtw/
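As an illustration of the notation in the example fact above, the following minimal sketch parses a string of the form relation(subject, object) into a structured record. The names Fact and parse_fact and the regular expression are hypothetical, chosen only for this illustration; they are not taken from NELL and do not describe how NELL stores or extracts facts.

```python
import re
from typing import NamedTuple


class Fact(NamedTuple):
    """A binary fact: relation(subject, obj). Hypothetical structure for illustration."""
    relation: str
    subject: str
    obj: str


# Assumed pattern for the "relation(arg1, arg2)" notation shown in the text.
# It accepts only word characters (letters, digits, underscore) in each slot.
_FACT_RE = re.compile(r"^\s*(\w+)\(\s*(\w+)\s*,\s*(\w+)\s*\)\s*$")


def parse_fact(text: str) -> Fact:
    """Parse a string such as 'playsInstrument(George_Harrison, guitar)'."""
    match = _FACT_RE.match(text)
    if match is None:
        raise ValueError(f"not a binary fact: {text!r}")
    return Fact(*match.groups())


if __name__ == "__main__":
    fact = parse_fact("playsInstrument(George_Harrison, guitar)")
    print(fact.relation, fact.subject, fact.obj)
```

A record like this makes it straightforward to group extracted facts by relation or by subject, which is one plausible way to organize a growing collection of facts.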

SELECT Lab

Our main long-term research goal is developing efficient algorithms and methods for designing, analyzing, and controlling complex real-world systems. To achieve this goal, our research spans the entire spectrum from theoretical foundations to real-world applications. http://www.select.cs.cmu.edu/

Systems Biology Group

Our group develops computational methods for understanding the dynamics, interactions and conservation of complex biological systems. As new high-throughput biological data sources become available, they hold the promise of revolutionizing molecular biology by providing a large-scale view of cellular activity. However, each type of data is noisy, contains many missing values and measures only a single aspect of cellular activity. Our computational focus is on methods for large-scale data integration. We rely primarily on machine learning and statistical methods. Most of our work is carried out in close collaboration with experimentalists. Many of the computational tools we develop are available and widely used.