Gene Expression Profile Analysis

This example shows a number of ways to look for patterns in gene expression profiles. This example uses data from the microarray study of gene expression in yeast by DeRisi, et al. 1997.

Exploring the Data Set

The authors used DNA microarrays to study temporal gene expression of almost all genes in Saccharomyces cerevisiae during the metabolic shift from fermentation to respiration. Expression levels were measured at seven time points during the diauxic shift. The full data set can be downloaded from the Gene Expression Omnibus website, http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE28.

Start by loading the data into MATLAB®. The MAT-file yeastdata.mat contains the VALUE data or LOG_RAT2N_MEAN, or log2 of ratio of CH2DN_MEAN and CH1DN_MEAN from the seven time steps in the experiment, the names of the genes, and an array of the times at which the expression levels were measured.

load yeastdata.mat

To get an idea of the size of the data you can use numel(genes) to show how many genes there are in the data set.

numel(genes)

ans =
6400

genes is a cell array of the gene names. You can access the entries using MATLAB cell array indexing:

genes{15}

ans =
YAL054C

This indicates that the 15th row of the variable yeastvalues contains expression levels for the ORF YAL054C. You can use the web command to access information about this ORF in the Saccharomyces Genome Database (SGD).

Filtering the Genes

The data set is quite large and a lot of the information corresponds to genes that do not show any interesting changes during the experiment. To make it easier to find the interesting genes, the first thing to do is to reduce the size of the data set by removing genes with expression profiles that do not show anything of interest. There are 6400 expression profiles. You can use a number of techniques to reduce this to some subset that contains the most significant genes.

If you look through the gene list you will see several spots marked as 'EMPTY'. These are empty spots on the array, and while they might have data associated with them, for the purposes of this example, you can consider these points to be noise. These points can be found using the strcmp function and removed from the data set with indexing commands.

In the yeastvalues data you will also see several places where the expression level is marked as NaN. This indicates that no data was collected for this spot at the particular time step. One approach to dealing with these missing values would be to impute them using the mean or median of data for the particular gene over time. This example uses a less rigorous approach of simply throwing away the data for any genes where one or more expression level was not measured.

The function isnan is used to identify the genes with missing data and indexing commands are used to remove the genes with missing data.

If you were to plot the expression profiles of all the remaining profiles, you would see that most profiles are flat and not significantly different from the others. This flat data is obviously of use as it indicates that the genes associated with these profiles are not significantly affected by the diauxic shift; however, in this example, you are interested in the genes with large changes in expression accompanying the diauxic shift. You can use filtering functions in the Bioinformatics Toolbox™ to remove genes with various types of profiles that do not provide useful information about genes affected by the metabolic change.

You can use the genevarfilter function to filter out genes with small variance over time. The function returns a logical array of the same size as the variable genes with ones corresponding to rows of yeastvalues with variance greater than the 10th percentile and zeros corresponding to those below the threshold.

mask = genevarfilter(yeastvalues);
% Use the mask as an index into the values to remove the filtered genes.
yeastvalues = yeastvalues(mask,:);
genes = genes(mask);
numel(genes)

ans =
5648

The function genelowvalfilter removes genes that have very low absolute expression values. Note that the gene filter functions can also automatically calculate the filtered data and names.

Cluster Analysis

Now that you have a manageable list of genes, you can look for relationships between the profiles using some different clustering techniques from the Statistics and Machine Learning Toolbox™. For hierarchical clustering, the function pdist calculates the pairwise distances between profiles and linkage creates the hierarchical cluster tree.

The Statistics and Machine Learning Toolbox also has a K-means clustering function. Again, sixteen clusters are found, but because the algorithm is different these will not necessarily be the same clusters as those found by hierarchical clustering.

% Initialize the state of the random number generator to ensure that the% figures generated by these command match the figures in the HTML version% of this example.
rng('default');
[cidx, ctrs] = kmeans(yeastvalues, 16, 'dist','corr', 'rep',5,...'disp','final');
figure
for c = 1:16
subplot(4,4,c);
plot(times,yeastvalues((cidx == c),:)');
axis tightend
suptitle('K-Means Clustering of Profiles');

Principal Component Analysis

Principal-component analysis(PCA) is a useful technique that can be used to reduce the dimensionality of large data sets, such as those from microarray analysis. PCA can also be used to find signals in noisy data. The function mapcaplot is used to create a plot of the principal components of a data set. Try dragging the mouse around in one of the figure windows. You will see that the selected elements are highlighted in the other window. This allows you to look at multiple dimensions at once. Notice that the scatter plot of the scores of the first two principal components shows that there are two distinct regions. This is not unexpected as the filtering process removed many of the genes with low variance or low information. These genes would have appeared in the middle of the scatter plot.

mapcaplot(yeastvalues,genes)

mapcaplot calculates the principal components and creates scatter plots of the results. If you want to look at the values of the principal components, the pca function in the Statistics and Machine Learning Toolbox is used to calculate the principal components of a data set.

[pc, zscores, pcvars] = pca(yeastvalues);

The first output, pc, is a matrix of the principal components of the yeastvalues data. The first column of the matrix is the first principal component, the second column is the second principal component and so on. The second output, zscores, are the principal component scores. That is, the representation of yeastvalues in the principal component space. The third output, pcvars, contains the principal component variances.

The values of pcvars give a measure of how much of the variance of the data is accounted for by each of the seven principal components. It is clear that the first principal component accounts for a majority of the variance in the model. You can see the exact percentage of the variance accounted for by each component using this command:

pcvars./sum(pcvars) * 100

ans =
79.8316
9.5858
4.0781
2.6486
2.1723
0.9747
0.7089

This shows that almost 90% of the variance is accounted for by the first two principal components. You can use the cumsum command to see the cumulative sum of the variances.

cumsum(pcvars./sum(pcvars) * 100)

ans =
79.8316
89.4174
93.4955
96.1441
98.3164
99.2911
100.0000

If you want to have more control over the plotting of the principal components, you can use the scatter function.

An alternative way to create a scatter plot is with the function gscatter from the Statistics and Machine Learning Toolbox. gscatter creates a grouped scatter plot where points from each group have a different color or marker. You can use clusterdata, or any other clustering function, to group the points.

The function gname from the Statistics and Machine Learning Toolbox can be used to identify genes on a scatter plot. You can select as many points as you like on the scatter plot. Hit Enter when you have finished selecting points.

gname(genes) % Hit enter when you have finished selecting genes.

Self-Organizing Maps

If you have the Neural Network Toolbox™, you can use a self-organizing map (SOM) to cluster the data.

% Check to see if the Neural Network Toolbox is installedif ~exist(which('newsom'),'file')
disp(sprintf(['The Self-Organizing Map section of this example\n'...'requires the Neural Network Toolbox.']));
returnend

The newsom function creates a new SOM network object. This example will generate a SOM using the first two principal components.

P = zscores(:,1:2)';
net = newsom(P,[4 4]);

Train the network using the default parameters.

net = train(net,P);

Use plotsom to display the network over a scatter plot of the data. Note that the SOM algorithm uses random starting points so the results will vary from run to run.