it for user interactions. For this purpose, the command above with its arguments must be specified as follows:

mapred:
  jar: CloudBurst.jar
  params: $reference $reads $output $min_read_len $max_read_len $k $allowdifferences $filteralignment 240 48 24 24 128 16
  inputs:
    - id: reference
      description: Reference Genome
      type: hdfs-file
      makeAbsolute: false
    - id: reads
      description: Reads
      type: hdfs-file
      makeAbsolute: false
    - id: min_read_len
      description: min length of reads
      type: number
      value: 36
    - id: max_read_len
      description: max length of reads
      type: number
      value: 36
    - id: k
      description: mismatches
      type: number
      value: 3
    - id: allowdifferences
      description: Allow Differences
      type: list
      values:
        0: mismatches only
        1: indels as well
      value: 0
    - id: filteralignment
      description: Filter Alignments
      type: list
      values:
        0: all alignments
        1: only report unambiguous best alignment
      value: 1
  outputs:
    - id: output
      description: Output Folder
      type: hdfs-folder
      download: true
      mergeOutput: false
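To illustrate how the `$id` placeholders in `params` relate to the declared inputs and outputs, the following minimal Python sketch substitutes user-supplied values into the parameter string. It is not Cloudgene's actual implementation: the function name `resolve_params` and the example paths are hypothetical, and only the placeholder names and default values are taken from the manifest above.

```python
from string import Template

# Parameter string as given in the manifest above.
PARAMS = ("$reference $reads $output $min_read_len $max_read_len $k "
          "$allowdifferences $filteralignment 240 48 24 24 128 16")


def resolve_params(params, values):
    """Replace every $id placeholder with its value and split into arguments.

    Hypothetical helper for illustration only. Template.substitute raises
    KeyError when a placeholder has no value. Splitting on whitespace
    assumes that no value contains spaces.
    """
    return Template(params).substitute(values).split()


# Defaults from the manifest; the three paths are made-up examples.
values = {
    "reference": "data/reference.br",
    "reads": "data/reads.br",
    "output": "results",
    "min_read_len": 36,
    "max_read_len": 36,
    "k": 3,
    "allowdifferences": 0,
    "filteralignment": 1,
}

args = resolve_params(PARAMS, values)
print(" ".join(["hadoop", "jar", "CloudBurst.jar"] + args))
```

Each `id` in the `inputs` and `outputs` lists therefore corresponds to exactly one placeholder, while the trailing constants (240 48 24 24 128 16) are passed through unchanged.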

Crossbow

Crossbow [2] is a scalable software pipeline for whole genome resequencing analysis. It combines Bowtie, an ultrafast and memory-efficient short read aligner, and SoapSNP, an accurate genotyper. These tools are combined in an automatic, parallel pipeline.

In order to integrate Crossbow into Cloudgene, a configuration file to start a Hadoop cluster on Amazon EC2 has to be created using the CloudBioLinux image (ami-31bc7758) with open Hadoop ports 80, 50030 and 50070. The install.sh init-script downloads and installs all required software (e.g. sratoolkit). The corresponding YAML has the following structure:

name: Crossbow
category: Genetics
version: 1.1.2
website: http://bowtie-bio.sourceforge.net/crossbow
author: Ben Langmead et al.
cluster:
  image: us-east-1/ami-31bc7758
  type: m1.large,m1.xlarge
  ports: 80,50030,50070
  user: ubuntu
  service: hadoop
  installMapred: true
  initScript: install.sh
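The `cluster` section packs several values into single comma- or slash-separated fields. As a rough sketch of how such fields could be interpreted (the function `parse_cluster` and its result keys are hypothetical and do not describe Cloudgene's own parser), the values shown above can be split as follows:

```python
def parse_cluster(cluster):
    """Split the compact fields of a cluster section into structured values.

    Hypothetical helper for illustration only.
    """
    # "us-east-1/ami-31bc7758" -> region and image identifier
    region, _, ami = cluster["image"].partition("/")
    return {
        "region": region,
        "ami": ami,
        # Selectable EC2 instance types
        "instance_types": [t.strip() for t in cluster["type"].split(",")],
        # str() also covers a single port that was read as a number
        "ports": [int(p) for p in str(cluster["ports"]).split(",")],
    }


# Values copied from the Crossbow configuration above.
crossbow_cluster = {
    "image": "us-east-1/ami-31bc7758",
    "type": "m1.large,m1.xlarge",
    "ports": "80,50030,50070",
}

parsed = parse_cluster(crossbow_cluster)
print(parsed)
```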

For Crossbow, a web interface has already been made available by the authors. Nevertheless, by integrating these programs into Cloudgene, users still benefit from (1) a standardized way to import/export data, (2) a system which keeps track of all previously executed workflows including the complete configuration set-up (input/output parameters, execution times, results) and (3) the possibility to concatenate different MapReduce jobs into pipelines. In this example, Cloudgene's concatenation functionality (specified as "steps" in the manifest file) has been used to execute several computation steps of Crossbow. This is done by defining the output directory of step x (e.g. step 1: Pre-processing) as the input directory for step x+1 (e.g. step 2: Alignment) in the manifest file. Even though the newly created workflow consists of several steps in the manifest file, the user can start it as one job.
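The chaining rule described above (the output directory of step x becomes the input directory of step x+1) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the manifest syntax or Cloudgene's implementation: the `Step` class, the `chain` function, the directory layout and the third step name are hypothetical. Only the first two step names come from the text.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One computation step of a workflow (hypothetical structure)."""
    name: str
    input_dir: str = ""
    output_dir: str = ""


def chain(steps, first_input, workdir):
    """Wire the steps so that each one reads what the previous one wrote."""
    previous_output = first_input
    for number, step in enumerate(steps, start=1):
        step.input_dir = previous_output
        step.output_dir = f"{workdir}/step{number}"  # assumed layout
        previous_output = step.output_dir
    return steps


# "Genotyping" is an assumed third step, added only to show a longer chain.
workflow = chain(
    [Step("Pre-processing"), Step("Alignment"), Step("Genotyping")],
    first_input="reads",
    workdir="job-1",
)

for step in workflow:
    print(f"{step.name}: {step.input_dir} -> {step.output_dir}")
```

Because the wiring is derived from the step order alone, the user only has to provide the first input and can start the whole chain as a single job.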

HaploGrep is a reliable algorithm implemented in a web application to determine the haplogroup affiliation of thousands of mitochondrial DNA (mtDNA) profiles genotyped for the entire mtDNA or any part of it. As HaploGrep provides its own web interface, we do not need to install Cloudgene-MapRed. Since it does not use the Hadoop service either, we noted this option in the configuration as well. HaploGrep listens on ports 80 (http) and 443 (https); therefore, these ports are marked in the YAML configuration. The configuration file for Cloudgene with all requirements looks as follows:

name: Haplogrep
description: Haplogrep
category: Genetics
cluster:
  image: us-east-1/ami-da0cf8b3
  type: m1.large,m1.xlarge
  ports: 80
  creationOnly: false
  service: hadoop
  installMapred: false

After the cluster setup is finalized, Cloudgene returns a web address which points to the installed instance of HaploGrep.