A shiny app for clustering using genes in an ExpressionSet

Introduction

We looked at the code for dfHclust in some detail. It makes
significant use of universal aspects of data.frame instances:

rows can represent the objects to be clustered

columns represent the features of the objects used to compare and group them

rownames can be used to label the objects

colnames can be used to name the features
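
As a minimal illustration of these four roles, the base-R sketch below clusters a small made-up data.frame. The object `toy` and its values are invented for this example and are not taken from dfHclust itself.

```r
# Toy data: 4 objects (rows) described by 2 features (columns).
toy <- data.frame(
  height = c(1.0, 1.1, 5.0, 5.2),
  weight = c(2.0, 2.1, 9.0, 9.3),
  row.names = c("a", "b", "c", "d")
)

# Distances are computed between rows, i.e. between objects.
d <- dist(toy)

# Hierarchical clustering; the rownames become the leaf labels.
hc <- hclust(d)

# Cut the tree into two groups; the result is named by rowname.
groups <- cutree(hc, k = 2)
```

Because the labels travel with the data.frame, `plot(hc)` would show a dendrogram whose leaves are named by the rownames, with no extra bookkeeping.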

These naming conventions make it easy for us to think substantively about
what is being done in the app.

When we move to the context of genome-scale data, things are a bit different. Let’s focus on expression array applications:

the volume of data is often very substantial

assuming the objects of interest are the individual biological samples that have been arrayed, the labeling of objects can be complex

assuming we use an ExpressionSet to manage the data, both rownames and colnames will tend to use unilluminating vocabularies such as probe set and sample identifiers

This last concern is hard to defeat in general but for the tissuesGeneExpression data there is a simple solution.
The first concern, about data volume, can be managed nicely with ExpressionSet instances through bracket-based subsetting.

A simple app

Let’s write a function that transforms any ExpressionSet into a data.frame and then runs dfHclust. This will be our “app” for interactively clustering data in expression experiments!
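
One way such a function might look is sketched below. This is a sketch under assumptions, not necessarily the definition used for the app: the helper name `esToDf` is hypothetical, and `dfHclust` is assumed to be available on the search path already.

```r
library(Biobase)

# Hypothetical helper: convert an ExpressionSet to a data.frame with
# samples as rows and features (e.g. genes) as columns.
esToDf <- function(es) {
  emat <- t(exprs(es))               # exprs() is features x samples
  rownames(emat) <- sampleNames(es)  # label the objects by sample name
  data.frame(emat)
}

# Sketch of the wrapper: convert, then hand off to dfHclust.
# dfHclust is assumed to be defined elsewhere; it is only looked up
# when esHclust is actually called.
esHclust <- function(es) {
  dfHclust(esToDf(es))
}
```

Note that `data.frame()` applies its default name checking, so feature names that are not syntactically valid R names (for example, probe set identifiers beginning with a digit) will be altered in the resulting column names.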

Here’s code that runs the app on a small selection of genes and samples:

set.seed(1234)
esHclust(tgeES[1:50, sample(1:ncol(tgeES), size=40)])

Exploring cluster analysis for tissue differentiation

Which genes are important for distinguishing the various tissues
available in this dataset? Can the analytical tools introduced
so far help us to identify and understand them? Neither
question is precisely posed, and a rigorous answer would require
a full understanding of the experimental designs underlying the
tissue expression sets. However, the following code can help identify
genes whose expression is statistically unlikely to be constant
over all the tissues present. We’ll collect moderated F tests for the
gene-specific null hypotheses that mean expression is constant
over all tissues.
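
A minimal sketch of how moderated F tests of this kind can be obtained with the limma package follows. It uses simulated data rather than the tissue data, and the names `mat`, `tissue`, and `top5` are invented for illustration; the same calls would apply to an ExpressionSet with a tissue factor in its sample-level data.

```r
library(limma)

set.seed(1)
# Simulated data (not the tissue data): 100 genes x 12 samples,
# three hypothetical tissues with four samples each.
tissue <- factor(rep(c("t1", "t2", "t3"), each = 4))
mat <- matrix(rnorm(100 * 12), nrow = 100,
              dimnames = list(paste0("g", 1:100), paste0("s", 1:12)))
# Shift the first five genes upward in tissue t2 only.
mat[1:5, tissue == "t2"] <- mat[1:5, tissue == "t2"] + 6

# One-way layout: an intercept plus one coefficient for each
# non-reference tissue.
design <- model.matrix(~ tissue)
fit <- eBayes(lmFit(mat, design))

# Requesting the two non-intercept coefficients together gives, for
# each gene, a moderated F test of the hypothesis that both are zero,
# i.e. that mean expression is constant over the three tissues.
tt <- topTable(fit, coef = 2:3, number = 5)
top5 <- rownames(tt)
```

Selecting the non-intercept coefficients explicitly keeps the test focused on differences among tissues rather than on whether overall expression is nonzero.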

This is all very informal, using only default values for heatmap presentations.
We’ll continue in this vein. In my inspection of the heatmap of means,
I considered the following five genes to be a potential signature for
tissue differentiation:

sig5=c("IL26","ZNF674","UBC.1","C7orf25.1","RPS13")

In the exercises we’ll see whether this is at all satisfactory.

A machine-learning approach to assessing discriminatory capacity

This section does not directly address visualization, but it is brief and
useful for thinking about what we can do with visualization. The
MLInterfaces package makes it easy to use various R-based
statistical learning tools with ExpressionSet instances. We’ll
use random forests to get a measure of effectiveness of the
sig50 defined above.
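
The sketch below shows the general shape of such an assessment on simulated data. Everything specific in it is an assumption made for illustration: the ExpressionSet `es`, the phenotype variable `Tissue`, and the three-gene signature `sig` are invented and are not the tissue data or the signature discussed here.

```r
library(Biobase)
library(MLInterfaces)

set.seed(1234)
# Simulated ExpressionSet (not the tissue data): 20 genes x 30 samples,
# three hypothetical tissues with ten samples each.
tissue <- factor(rep(c("t1", "t2", "t3"), each = 10))
mat <- matrix(rnorm(20 * 30), nrow = 20,
              dimnames = list(paste0("g", 1:20), paste0("s", 1:30)))
# Give each of the first three genes a signal in one tissue.
mat[1, tissue == "t1"] <- mat[1, tissue == "t1"] + 4
mat[2, tissue == "t2"] <- mat[2, tissue == "t2"] + 4
mat[3, tissue == "t3"] <- mat[3, tissue == "t3"] + 4

pd <- data.frame(Tissue = tissue, row.names = colnames(mat))
es <- ExpressionSet(assayData = mat, phenoData = AnnotatedDataFrame(pd))

sig <- c("g1", "g2", "g3")  # hypothetical signature

# Fit a random forest using only the signature genes. "NOTEST" means
# no samples are held out, so assessment relies on the out-of-bag
# estimates reported by the random forest itself.
rf <- MLearn(Tissue ~ ., es[sig, ], randomForestI, xvalSpec("NOTEST"))

# The underlying randomForest fit, including its confusion matrix.
conf <- RObject(rf)$confusion
```

Swapping `sig` for a different set of feature names, or `randomForestI` for another learner schema, changes the signature or the learning procedure without altering the rest of the call.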

Here, using the defaults, we see that the 50-gene signature does a
reasonable job of sorting out the tissues. With simple modifications
to the single line of code using MLearn above you can assess
effects of changing the signature or the learning procedure.