Bioinformatics Toolbox

Genetic Algorithm Search for Features in Mass Spectrometry Data

This example shows how to use the Global Optimization Toolbox with the Bioinformatics Toolbox™ to optimize the search for features to classify mass spectrometry (SELDI) data.

Introduction

Genetic algorithms optimize search results for problems with large data sets. You can use the MATLAB® genetic algorithm function to solve these problems in Bioinformatics. Genetic algorithms have been applied to phylogenetic tree building, gene expression and mass spectrometry data analysis, and many other areas of Bioinformatics that have large and computationally expensive problems. This example searches for optimal features (peaks) in mass spectrometry data. We will look for specific peaks in the data that distinguish cancer patients from control patients.

Global Optimization Toolbox

First familiarize yourself with the Global Optimization Toolbox. The documentation describes how a genetic algorithm works and how to use it in MATLAB. To access the documentation, use the doc command.

A genetic algorithm requires an objective function, also known as the fitness function, which describes the phenomenon that we want to optimize. In this example, the genetic algorithm machinery tests small subsets of M/Z values using the fitness function and then determines which M/Z values get passed on to or removed from each subsequent generation. The fitness function biogafit is passed to the genetic algorithm solver using a function handle. In this example, biogafit maximizes the separability of two classes by using a linear combination of 1) the a-posteriori probability and 2) the empirical error rate of a linear classifier (classify). You can create your own fitness function to try different classifiers or alternative methods for assessing the performance of the classifiers.

Users can change how the optimization is performed by the genetic algorithm by creating custom functions for crossover, fitness scaling, mutation, selection, and population creation. In this example you will use the biogacreate function written for this example to create initial random data points from the mass spectrometry data. The function header requires specific input parameters as specified by the GA documentation. There is a default creation function in the toolbox for creating initial populations of data points.

type biogacreate

function pop = biogacreate(GenomeLength,~,options,Y,id)
%BIOGACREATE Population creation function for MSGADEMO
%
% This function creates a population matrix with dimensions of
% options.PopulationSize rows by the number of independent variables
% (GenomeLength) columns. These values are integers that correspond to
% randomly selected rows of the mass spectrometry data Y. Each row of the
% population matrix is a random sample of row indices of the mass spec
% data.
% Note: This function's input arguments are required by the GA function.
% See GA documentation for further detail.
% Copyright 2005-2013 The MathWorks, Inc.
pop = zeros(options.PopulationSize,GenomeLength);
npop = numel(pop);
ranked_features = rankfeatures(Y,id,'NumberOfIndices',npop,'NWeighting',.5);
pop(:) = randsample(ranked_features,npop);

Set Genetic Algorithm Options

The GA function uses an options structure to hold the algorithm parameters that it uses when performing a minimization with a genetic algorithm. The gaoptimset function will create this options structure. For the purposes of this example, the genetic algorithm will run only for 50 generations. However, you may set 'Generations' to a larger value.

Use ga to start the genetic algorithm function. 100 groups of 20 datapoints each will evolve over 50 generations. Selection, crossover, and mutation events generate a new population in every generation.

% initialize the random generators to the same state used to generate the% published example
rng('default')
nVars = 12; % set the number of desired features
FitnessFcn = {@biogafit,Y,id}; % set the fitness function
feat = ga(FitnessFcn,nVars,options); % call the Genetic Algorithm
feat = round(feat);
Significant_Masses = MZ(feat)
cp = classperf(classify(Y(feat,:)',Y(feat,:)',id),id);
cp.CorrectRate