Tuesday, March 14, 2006

Genetic programming (GP) is a computational discovery tool that is inspired by Darwinian evolution and natural selection. We have applied GP and related algorithms to a wide variety of genetic problems including modeling epistasis and biochemical pathways. The GP Bibliography maintained by Dr. Bill Langdon is an important resource for GP publications. Many of these papers can't be found on PubMed. Our list of GP papers in the bibliography can be found here.

Our newest paper will be published and presented as part of the Genetic Programming Theory and Practice (GPTP IV) workshop at the Center for the Study of Complex Systems in Ann Arbor in May. Here is the title and abstract. A preprint will be available upon request in a few weeks.

Human genetics is undergoing an information explosion. The availability of chip-based technology facilitates the measurement of thousands of DNA sequence variations from across the human genome. The challenge is to sift through these high-dimensional datasets to identify combinations of interacting DNA sequence variations that are predictive of common diseases. The goal of this study is to develop and evaluate a genetic programming (GP) approach to attribute selection and classification in this domain. We simulated genetic datasets of varying size in which the disease model consists of two interacting DNA sequence variations that exhibit no independent effects on class (i.e., epistasis). We show that GP is no better than a simple random search when classification accuracy is used as the fitness function. We then show that including pre-processed estimates of attribute quality using Tuned ReliefF (TuRF) in a multi-objective fitness function that also includes accuracy significantly improves the performance of GP over that of random search. This study demonstrates that GP may be a useful computational discovery tool in this domain. This study raises important questions about the general utility of GP for these types of problems, the importance of data pre-processing, the ideal functional form of the fitness function, and the importance of expert knowledge. We anticipate this study will provide an important baseline for future studies investigating the usefulness of GP as a general computational discovery tool for large-scale genetic studies.
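
To make the abstract's central point concrete, here is a minimal Python sketch. It is not the code or the simulation model from the paper. It assumes a simple XOR-style penetrance model, a lookup-table classifier, and a weighted-sum fitness. The function names, the 0.5 weight, and the quality scores are all hypothetical placeholders. In the paper the quality scores come from TuRF, which is not implemented here.

```python
import random


def simulate_epistatic_dataset(n_samples, n_snps, seed=0):
    """Return (rows, labels); SNPs 0 and 1 form the interacting pair (assumed model)."""
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n_samples):
        # Genotypes 0/1/2 drawn with frequencies 0.25/0.5/0.25 (allele frequency 0.5).
        row = [rng.choice([0, 1, 1, 2]) for _ in range(n_snps)]
        # XOR-style penetrance: class 1 when exactly one of the two SNPs is heterozygous.
        # Each SNP on its own then carries no information about the class.
        labels.append((row[0] + row[1]) % 2)
        rows.append(row)
    return rows, labels


def lookup_accuracy(rows, labels, snps):
    """Training accuracy of a majority-vote table over the chosen SNPs."""
    counts = {}
    for row, y in zip(rows, labels):
        key = tuple(row[i] for i in snps)
        counts.setdefault(key, [0, 0])[y] += 1
    return sum(max(c) for c in counts.values()) / len(labels)


def multiobjective_fitness(accuracy, quality_scores, snps, weight=0.5):
    """Weighted sum of accuracy and mean attribute quality (both assumed in [0, 1])."""
    mean_quality = sum(quality_scores[i] for i in snps) / len(snps)
    return weight * accuracy + (1 - weight) * mean_quality


if __name__ == "__main__":
    rows, labels = simulate_epistatic_dataset(2000, 10, seed=1)
    print("SNP 0 alone:   ", lookup_accuracy(rows, labels, (0,)))
    print("SNP 0 + noise: ", lookup_accuracy(rows, labels, (0, 5)))
    print("SNPs 0 and 1:  ", lookup_accuracy(rows, labels, (0, 1)))
```

Under this assumed model, any candidate that contains only one of the two functional SNPs scores close to chance, so accuracy alone gives a search no partial credit to climb. Adding a per-attribute quality term, as in multiobjective_fitness, rewards candidates that include a well-scored SNP even when their accuracy is still near 0.5.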

About Me

Edward Rose Professor of Informatics,
Director of the Institute for Biomedical Informatics, Director of the Division of Informatics in the Department of Biostatistics and Epidemiology,
Senior Associate Dean for Informatics,
The Perelman School of Medicine,
University of Pennsylvania