Abstract

We take two large-scale, data-intensive problems from biology. One is a study of EST (Expressed Sequence Tag) assembly with half a million mRNA sequences. The other is the analysis of gene sequence data (35339 Alu sequences). These test cases can scale to state-of-the-art problems such as the clustering of a million sequences. We look at initial processing (calculation of Smith-Waterman dissimilarities and CAP3 assembly), clustering, and multidimensional scaling. We present performance results on multicore clusters and note that currently different technologies are optimized for different steps.

1. Introduction

We abstract many approaches as a mixture of pipelined and parallel (good MPI performance) systems, linked by a pervasive storage system. We believe that much data analysis can be performed in a computing style where data is read from one file system, analyzed by one or more tools, and written back to a database or file system. An important feature of the MapReduce-style approaches is explicit support for data parallelism, which is needed in our applications.

2. CAP3 Analysis

We have applied three cloud technologies, namely Hadoop [1], DryadLINQ [2], and CGL-MapReduce [3], to implement the sequence assembly program CAP3 [4], which performs the dominant part of the analysis of mRNA sequences into DNA. Given a set of gene sequences, it performs several major assembly steps such as computation of overlaps, construction of contigs, construction of multiple sequence alignments, and generation of consensus sequences. The program reads a collection of gene sequences from an input file (FASTA file format) and writes its output to several output files and to the standard output, as shown below. The input data is contained in a collection of files, each of which needs to be processed by the CAP3 program separately.

Input.fsa -> Cap3.exe -> Stdout + other output files

The “pleasingly parallel” nature of the application makes it extremely easy to implement using technologies such as Hadoop, CGL-MapReduce, and Dryad. In the two MapReduce implementations, we used a “map-only” operation to perform the entire data analysis, whereas in DryadLINQ we use a single “Select” query on the set of input data files, as sketched below.
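As an illustration of this map-only pattern (a sketch on our part, not the paper's Hadoop, DryadLINQ, or CGL-MapReduce code), the following Python fragment runs one CAP3 invocation per input file using a local process pool; the cap3 binary name and the input/*.fsa layout are assumptions.

```python
# Hedged sketch: map-only execution of CAP3 over a set of FASTA files.
# Assumes a `cap3` executable on PATH and inputs under input/*.fsa.
import glob
import subprocess
from multiprocessing import Pool

def run_cap3(fasta_path: str) -> str:
    """Map task: run CAP3 on one FASTA file; CAP3 writes its own output files."""
    result = subprocess.run(["cap3", fasta_path], capture_output=True, text=True)
    # CAP3 emits contigs etc. beside the input file; stdout carries the report.
    return f"{fasta_path}: exit {result.returncode}"

if __name__ == "__main__":
    files = sorted(glob.glob("input/*.fsa"))  # one independent input per map task
    with Pool() as pool:                      # "map-only": no reduce step needed
        for line in pool.imap_unordered(run_cap3, files):
            print(line)
```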

Fig. 1: Performance of different implementations of CAP3

Fig. 2: Scalability of different implementations of CAP3

Figs. 1 and 2 compare the performance and scalability of the three cloud technologies on the CAP3 program. Both graphs show that all three runtimes work almost equally well for CAP3, and we would expect them to behave in the same way for similar applications with simple parallel topologies. The support for handling large data sets, the concept of moving computation to the data, and the better quality of services provided by cloud technologies such as Hadoop, DryadLINQ, and CGL-MapReduce make them a favorable choice of technologies for solving such problems.

3. Alu Sequencing Applications

Alus represent the largest repeat family in the human genome, with about 1 million copies of Alu sequences. Alu clustering can be viewed as a test of the capacity of computational infrastructures because it is of great biological interest, and of a scale representative of other large applications such as the automated protein family classification for the few million proteins predicted from large metagenomics projects.

3.1. Smith-Waterman Dissimilarities

Fig. 3: Performance of Alu gene alignments versus parallel pattern

In the initial pairwise alignment of Alu sequences, we used an open source version of the Smith-Waterman–Gotoh algorithm, applied to the Alu sample. This uses an approach with no vectors, just pairwise dissimilarities [5], as sketched below.
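As a toy illustration of the dissimilarity computation (not the optimized open source implementation used in the paper), the following sketch computes a Smith-Waterman local alignment score with linear gap penalties and normalizes it into a dissimilarity; the scoring parameters and the normalization are our assumptions.

```python
# Hedged sketch: Smith-Waterman score -> pairwise dissimilarity.
def smith_waterman(a: str, b: str, match=5, mismatch=-4, gap=-8) -> int:
    """Best local alignment score between a and b (linear gap penalty)."""
    cols = len(b) + 1
    prev = [0] * cols  # dynamic-programming row for the previous character of a
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

def dissimilarity(a: str, b: str) -> float:
    """Map the score to [0, 1] so identical sequences give 0 (assumed scaling)."""
    self_score = min(smith_waterman(a, a), smith_waterman(b, b))
    return 1.0 - smith_waterman(a, b) / max(self_score, 1)
```

Filling the full 35339 x 35339 dissimilarity matrix is the pleasingly parallel step: each block of (i, j) pairs can be computed independently.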

3.3. Multidimensional Scaling (MDS)

Given dissimilarities D(i,j), MDS finds the best set of vectors x_i in any chosen dimension d minimizing

$$\sum_{i,j} \mathrm{weight}(i,j)\,\bigl(D(i,j) - |x_i - x_j|^n\bigr)^2 \qquad (1)$$

The weight is chosen to reflect the importance of a point or to fit smaller distances more precisely than larger ones.
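For example (our illustration, not necessarily the paper's setting), a Sammon-style choice weight(i,j) = 1/D(i,j) fits small distances more accurately, whereas uniform weights treat all pairs equally.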

We have previously reported results using Expectation Maximization, but here we use a different technique, exploiting that (1) is “just” χ² so that one can use very reliable nonlinear optimizers to solve it, as in the sketch below. We support general choices for the weight(i,j) and n, and the solution is fully parallel over the unknowns x_i.
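A minimal, non-parallel sketch of this approach (assuming NumPy/SciPy, uniform weights, and n = 1; the mds function below is our illustration, not the production service):

```python
# Hedged sketch: solve Eq. (1) with a general nonlinear least-squares optimizer.
import numpy as np
from scipy.optimize import least_squares

def mds(D: np.ndarray, d: int = 3, seed: int = 0) -> np.ndarray:
    """Embed N points in R^d so Euclidean distances fit D (residuals of Eq. 1)."""
    N = D.shape[0]
    iu = np.triu_indices(N, k=1)  # visit each pair (i, j) once

    def residuals(flat: np.ndarray) -> np.ndarray:
        x = flat.reshape(N, d)
        dist = np.linalg.norm(x[iu[0]] - x[iu[1]], axis=1)
        return D[iu] - dist  # weight(i,j) = 1 and n = 1 in this sketch

    x0 = np.random.default_rng(seed).normal(size=N * d)
    fit = least_squares(residuals, x0)  # a reliable nonlinear optimizer
    return fit.x.reshape(N, d)
```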

All our MDS services feed their results directly to a powerful Point Visualizer. The excellent parallel performance of MDS will be reported. Note that the total time for all three steps on the full Tempest system is about 6 hours; clearly, getting to a million sequences is not unrealistic and would take around a week on a 1024-node cluster.

All capabilities discussed in this paper will be made available as cloud or TeraGrid services over the next 3-12 months [6].

6. References

[1] Apache Hadoop, http://hadoop.apache.org/core/

[2] Y. Yu et al., “DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language,” OSDI 2008.