Tutorials and Thoughts on Data Science

Compare Transformations & Batch Effects in Omics Data

While analysing high dimensional data, e.g. from Omics (Genomics, Transcriptomics, Proteomics etc.) – we are essentially measuring multiple response variables (i.e. genes, proteins, metabolites etc.) in multiple samples, resulting in a matrix X with r variables and n samples. The data capture can lead to multiple batches or groups in the data – a batch can be anything from a simple grouping like Male vs Female to Lanes on a sequencer, or Wells in a capture plate. Statistically this means that: Is the covariate or batch ignorable or should it be account for in our model.

The X matrix (data matrix) usually has to be normalised, which means the data has to undergo some sort of transformation to account for certain technical differences due to these batches e.g. like different samples being sequenced to different depths in an Next Generation Sequencer. This kind of systematic effect which is due to technical variation can be ‘adjusted’ and the samples are scaled/transformed to make them identically distributed. There are different methods of normalisation based on the data capture technology; but this post has two aims:

A simple workflow for diagnostic checks for your normalisation: we normalise the same data using two different methods, and check which method may be more appropriate.

Introducing the concepts of object oriented programming in R, with S4 objects.

1. Compare two Normalisation Methods:

In genomics, we are capturing many variables, and many of these variables are inter-dependent (e.g. co-expressed genes) and on average most of these variables are assumed to be identically distributed between samples (this however depends on how different each sample is from the other – e.g. comparing types of apples or apples vs oranges). In the example below we are using some gene expression data that has been normalised using two different methods:

Method 1: We used house keeping genes to normalise the samples, even though these samples are different cell types, hence the house keeping genes may act differently in each sub-type.

Method 2: We use the common highly expressed genes across all samples, and look for a common factor among these cells, and normalise using that criteria.

Note: We are not arguing which method is correct, both are essentially mathematically and biologically sound – but which one is suitable for our sample types.

The data in the plots below has been normalised and log transformed (in positive space) and zeros represent missing data. The samples have been coloured on our batch of interest i.e. the machine ID.

The figure above is a box plot of the data with the medians highlighted as a bar. We can see that in both the methods, the samples from the second machine (071216) has more missing data. Could this be because good quality samples were run first and bad quality later? Otherwise comparing the two methods, I don’t think there is any strong reason to select one over the other; if I had to choose one I would lean towards method 2 as the medians overall are slightly more equal?

We assume the each sample is normally distributed on a log scale; this is a very big assumption, but a convenient one for diagnostic checks. Besides even if the data is not normally distributed, the mean is normally distributed, due to the number of data points (we have about 530 or so variables in each sample). We are plotting the posterior mean for each sample (see here for some details), along with the 95% high density interval (HDI) for each mean. The mean for S_4 and its HDI does not overlap with e.g. S_14 in method 1, while in method 2 all the HDIs can overlap. There is not a clear choice here, but I would prefer the normalisation in method 2.

As we have used a normal model for the average of each sample, and the normal distribution has a second parameter (the scale – sigma, standard deviation), we compare the posterior sigma for these samples. In this case the standard deviation and the HDI in each sample are more comparable in method 2, I would consider this as strong evidence for the second method.

The figures show the posterior proportion of the data that is missing after a normalisation done using each method (The reason this can be different, as some normalisation methods may include background signal removal as well). However there does not appear to be a clear winner in this case. However the batch id 071216 has more missing data.

This figure shows the first 2 components from the Principal Component Analysis of the data matrix X. Looking at the distribution of the points along the two axes – visually I can see that the data is more clumped on the left (method 1) while is looks slightly more random on the right (comparatively of course – method 2). This plot also adds another piece of evidence towards using method 2 rather than method 1.

All these pieces of evidence for or against one method should be taken together, along with considering the types of samples being compared and the nuances of the data capture technology. In this case most of the evidence (although not very strong individually) collectively favours method 2. Furthermore, I would consider including the machine ID as a covariate in the model, or use a hierarchical model with the machine ID as a random effect in the model.

2. Object Oriented Programming in R – S4 Objects:

Before I talk about some of my experience with S4 objects, these two sites are good starting points for learning more about this, site1, site2. I liked the structure of C++ and working on larger projects in R can get messy if the programs/scripts are not structured. I will put some code in here for the class CDiagnosticPlots.

Class Declaration:

I define the class name, and the slots i.e. variables that this object will hold. Generally if I want the class to grow in the future, I will keep a slot like lData of type list, that I can append to for future use.

The constructor can perform various checks, before creating the object using the new function. It will return the object of the type CDiagnosticPlots which can be used in your program.

Generic Functions:

Generic functions are very useful and the function with the same name can be called for different object types. This is an example from one of my other repositories here. I define two different classes but both use the plotting function with the same name.