Strand CTO: Big data brings big challenges

Dr Ramesh Hariharan, chief technology officer and founder at Strand Life Sciences, speaks about how the advances made in genomics have also brought about the need for researchers to handle big data.

Dr Ramesh Hariharan, CTO and founder, Strand Life Sciences, India

Our ability to make simultaneous measurements on several tens of thousands of genes, or even on the entire DNA in a biological system, has improved many-fold over the last several years. The cost of making these measurements has also been falling steadily. This has enabled researchers to generate candidate genes and mutations of interest far more quickly than the earlier gene-by-gene methods allowed. It has also added one more dimension to the research toolbox: the need to handle big data.

Big data could mean different things in different contexts. In a research lab doing molecular biology today, big data typically means a raw data file of about 5 GB for every sample run. These files have to be processed by specialized algorithms before they shrink to tens or hundreds of MB, a scale amenable to exploration and discovery. The algorithms are in a semi-mature state at the moment: developers across the world have built and made available open source and commercial tools that allow researchers to carry out this processing.

While many research groups in India use the available tools effectively to answer their research questions, few have contributed new tools and algorithms for use by the community at large. Much of our effort at Strand Life Sciences goes towards this goal, in a commercial setting, through our GeneSpring and Avadis NGS products, and I do know of a few academic initiatives in this direction as well. This needs to change in the future through the multi-disciplinary involvement of computer scientists, statisticians and biologists, under key umbrella initiatives that could be undertaken in India towards understanding the Indian population.

Note that the definition above takes the perspective of a single research lab generating up to a few hundred GB of data a year, which is the typical case in India. But there is a case to be made for aggregating and pooling data across labs. In essence, we are talking about the creation of central genomic repositories with modern interfaces that are easily accessible and usable by researchers at large. This is where the real problem of big data hits: data sizes can reach TB (terabyte) and PB (petabyte) scales. There are of course ways to reduce these sizes, and several groups across the world are building such infrastructure, including our group at Strand. However, researchers in India haven't had an impressive history of pooling data or creating central bioinformatics resources.
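To make the jump in scale concrete, here is a rough back-of-envelope sketch in Python. The figure of a few hundred GB per lab per year comes from the text above; the number of labs and the number of years are purely illustrative assumptions, not actual counts, and the function names are made up for this example.

```python
# Back-of-envelope estimate of pooled genomic data volume.
# Decimal units are assumed: 1 TB = 1,000 GB, 1 PB = 1,000,000 GB.

GB_PER_TB = 1_000
GB_PER_PB = 1_000_000


def pooled_size_gb(labs: int, gb_per_lab_per_year: float, years: int) -> float:
    """Total raw data (in GB) if every lab pooled its yearly output."""
    return labs * gb_per_lab_per_year * years


def human_readable(size_gb: float) -> str:
    """Format a size given in GB using the largest fitting unit."""
    if size_gb >= GB_PER_PB:
        return f"{size_gb / GB_PER_PB:.1f} PB"
    if size_gb >= GB_PER_TB:
        return f"{size_gb / GB_PER_TB:.1f} TB"
    return f"{size_gb:.1f} GB"


# From the article: a single lab generates up to a few hundred GB a year.
single_lab = pooled_size_gb(labs=1, gb_per_lab_per_year=300, years=1)

# Hypothetical repository: 200 labs pooling data for 5 years
# (illustrative numbers only, not a survey of real labs).
repository = pooled_size_gb(labs=200, gb_per_lab_per_year=300, years=5)

print(human_readable(single_lab))   # 300.0 GB
print(human_readable(repository))   # 300.0 TB
```

Under these assumed numbers a pooled repository is about a thousand times larger than one lab's yearly output, which is why storage and access design matter far more for central resources than for individual labs.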

There is a third definition of big data that might confront us in the future. A whole industry is working on bringing down the cost of whole genome sequencing, which currently stands at about $1,000. If and when the cost falls by another factor of two to three, it may become feasible for individuals to get their genomes sequenced proactively, regardless of immediate medical need. There are several examples of genomic variants that warn individuals to monitor aspects of their health more aggressively. In summary, biological and medical research will be driven by big data to a very large extent in the future.