2 Multivariate analysis

This is a classic microarray experiment. Microarrays consist of ‘probesets’ that interrogate genes for their level of expression. In the experiment we’re looking at, there are 12625 probesets measured on each of the 128 samples. The raw expression levels estimated by microarray assays require considerable pre-processing; the data we’ll work with has already been pre-processed.

2.1 Input and setup

The data is stored in ‘comma-separated value’ format, with each probeset occupying a line, and the expression value for each sample in that probeset separated by a comma. Input the data using read_csv(). The sample identifiers are present in the first column.
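As a minimal sketch of the input step, the following writes a tiny made-up file and reads it back with read_csv(). The file contents, the sample names S1 to S3, the probeset names, and the column name Gene are all assumptions for illustration; the real file is much larger and its path depends on where you saved it.

```r
library(readr)

# Write a small, invented CSV so the example runs on its own.
# In practice `fname` would point at the course's expression file.
fname <- tempfile(fileext = ".csv")
writeLines(c(
    "Gene,S1,S2,S3",
    "probe_a,5.1,6.2,5.8",
    "probe_b,7.3,7.0,7.9"
), fname)

# One row per probeset, one column per sample, plus an identifier column
exprs <- read_csv(fname)
dim(exprs)
```

Checking dim() straight after input is a cheap way to confirm that the number of rows and columns matches what you expect from the experiment description.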

2.2 Cleaning and Exploration

The expression data is presented in what is sometimes called ‘wide’ format; a different format is ‘tall’, where Sample and Gene group the single observation Expression. Use tidyr::gather() to gather the columns of the wide format into two columns representing the tall format, excluding the Gene column from the gather operation.

exprs <- exprs %>% gather("Sample", "Expression", -Gene)
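To see what the gather step does, here is a small sketch on an invented two-gene, two-sample table. The values and the names wide, tall, S1, and S2 are made up; only the gather() call mirrors the one used on the real data.

```r
library(dplyr)
library(tidyr)

# Invented 'wide' data: one row per gene, one column per sample
wide <- tibble(
    Gene = c("probe_a", "probe_b"),
    S1 = c(5.1, 7.3),
    S2 = c(6.2, 7.0)
)

# Gather every column except Gene into Sample / Expression pairs
tall <- wide %>% gather("Sample", "Expression", -Gene)
tall
```

Each of the four expression values now occupies its own row, identified by the Gene and Sample it came from.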

Explore the data a little, e.g., a summary and histogram of the expression values, and a histogram of average expression values of each gene.
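One possible way to do this exploration is sketched here on invented tall-format data. The values are made up, and base R hist() is just one choice of plotting function; the column names Gene, Sample, and Expression follow the tall format described in this section.

```r
library(dplyr)

# Invented tall-format data: 3 genes measured on 4 samples
exprs <- tibble(
    Gene = rep(c("probe_a", "probe_b", "probe_c"), times = 4),
    Sample = rep(c("S1", "S2", "S3", "S4"), each = 3),
    Expression = c(5.1, 7.3, 3.2, 6.2, 7.0, 3.5,
                   5.8, 7.9, 3.1, 5.5, 7.4, 3.4)
)

# Overall distribution of expression values
summary(exprs$Expression)
hist(exprs$Expression, main = "All expression values", xlab = "Expression")

# Average expression of each gene across samples
gene_means <- exprs %>%
    group_by(Gene) %>%
    summarize(mean_expression = mean(Expression))
hist(gene_means$mean_expression,
     main = "Average expression per gene", xlab = "Mean expression")
```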

For subsequent analysis, we also want to simplify the ‘B or T’ cell type classification by keeping only the first character of the BT column:

pdata <- pdata %>% mutate(B_or_T = factor(substr(BT, 1, 1)))
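A small sketch shows the effect of this step. The four samples and their BT values are invented for illustration; the mutate() call is the same as the one applied to the real phenotype data.

```r
library(dplyr)

# Invented phenotype data with detailed cell type labels
pdata <- tibble(
    Sample = c("S1", "S2", "S3", "S4"),
    BT = c("B2", "B", "T1", "T3")
)

# Keep only the first character, so "B2" becomes "B" and "T3" becomes "T"
pdata <- pdata %>% mutate(B_or_T = factor(substr(BT, 1, 1)))
table(pdata$B_or_T)
```

Tabulating the new column is a quick check that only the two expected levels remain.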

2.3 Unsupervised machine learning – multi-dimensional scaling

We’d like to reduce high-dimensional data to lower dimension for visualization. To do so, we need the dist()ance between samples. From ?dist, the input can be a data.frame where rows represent Sample and columns represent Expression values. Use spread() to create appropriate data from exprs, and pipe the result to dist()ance.
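The reshaping and distance calculation might look like the sketch below, again on invented data. Two steps here are assumptions rather than part of the instructions above: the Sample column is moved into row names so that dist() sees only numeric values, and base R cmdscale() is used as one way to carry out the multi-dimensional scaling named in the section heading.

```r
library(dplyr)
library(tidyr)

# Invented tall-format data: 3 genes measured on 4 samples
exprs <- tibble(
    Gene = rep(c("probe_a", "probe_b", "probe_c"), times = 4),
    Sample = rep(c("S1", "S2", "S3", "S4"), each = 3),
    Expression = c(5.1, 7.3, 3.2, 6.2, 7.0, 3.5,
                   5.8, 7.9, 3.1, 5.5, 7.4, 3.4)
)

# One row per Sample, one column per Gene
input <- exprs %>% spread(Gene, Expression)

# Keep sample identifiers as row names and pass only numbers to dist()
samples <- input$Sample
input <- input %>% select(-Sample) %>% as.matrix()
rownames(input) <- samples

# Euclidean distance between samples, then classical MDS to 2 dimensions
d <- dist(input)
mds <- cmdscale(d, k = 2)
mds
```

The rows of mds are the samples and the two columns are coordinates that can be plotted against each other.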