Inputs and Command-line options

The following is a detailed description of the options used to control the MapSplice script:

Usage:

python bin/mapsplice_segments.py MapSplice.cfg

or

python bin/mapsplice_segments.py [inputs|options] MapSplice.cfg

or

python bin/mapsplice_segments.py [inputs|options]

Inputs and output:

-u/--reads-file <string>

A comma-separated list (no blank spaces) of FASTA or FASTQ read files (include the path).

Notes:

- For paired-end reads, the order should be as follows: reads1_end1,reads1_end2,reads2_end1,reads2_end2...

- For two ends from the same read, the read names should be in the following format: read_base_name/1 and read_base_name/2

- The read_base_name should be the same for the two ends

Format constraint: Read names after @ or > should not contain a blank space or tab
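The ordering and naming rules above can be checked before a run. The following is a minimal sketch, not part of MapSplice; the function names are hypothetical, and rejecting blank spaces inside paths is an assumption made here because the list itself must contain none.

```python
def build_reads_argument(pairs):
    """Join (end1, end2) file paths in the order expected by -u/--reads-file."""
    paths = []
    for end1, end2 in pairs:
        for path in (end1, end2):
            # Assumption: a comma or blank space in a path would make the list ambiguous.
            if "," in path or " " in path:
                raise ValueError("path contains a comma or blank space: %r" % path)
        paths.extend([end1, end2])
    return ",".join(paths)


def is_valid_read_name(name):
    """A read name (the text after @ or >) must not contain a blank space or tab."""
    return " " not in name and "\t" not in name


def is_same_pair(name1, name2):
    """True if the names are read_base_name/1 and read_base_name/2 with the same base."""
    return (len(name1) > 2
            and name1.endswith("/1")
            and name2.endswith("/2")
            and name1[:-2] == name2[:-2])
```

For example, build_reads_argument([("a_1.fq", "a_2.fq"), ("b_1.fq", "b_2.fq")]) gives the string a_1.fq,a_2.fq,b_1.fq,b_2.fq, which can be passed to -u.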

-c/--chromosome-files-dir <string>

The directory containing the sequence files corresponding to the reference genome (in FASTA format)

- One chromosome per file

- The chromosome name after '>' should not contain a tab or a blank space

- The chromosome name should be the same as the basename of the chromosome file

- The suffix of the chromosome file name should be 'fa'

- e.g. If the chromosome name after '>' is 'chr1', then the file name should be 'chr1.fa'
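A chromosome file can be checked against these constraints with a few lines of Python. This is only a sketch under the rules listed above; check_chromosome_file is a hypothetical helper, not a MapSplice function.

```python
import os


def check_chromosome_file(path):
    """Return a list of problems found in one chromosome FASTA file."""
    problems = []
    base, ext = os.path.splitext(os.path.basename(path))
    if ext != ".fa":
        problems.append("file suffix is not 'fa'")

    # Collect every header line (the text after '>').
    headers = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                headers.append(line[1:].rstrip("\r\n"))

    if len(headers) != 1:
        problems.append("expected one chromosome per file, found %d" % len(headers))
    for name in headers:
        if " " in name or "\t" in name:
            problems.append("chromosome name %r contains a blank space or tab" % name)
        if name != base:
            problems.append("chromosome name %r differs from file basename %r" % (name, base))
    return problems
```

An empty list means the file passed all checks; calling the function on every file in the -c directory gives a quick report before MapSplice is started.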

-B/--Bowtieidx <string>

The path and basename of index to be searched by Bowtie.

- e.g. if the index file name is index.1.ebwt, then the base name is index

- If the index does not exist, it will be built with bowtie-build from the reference genome indicated by option -c

(An index only needs to be built once, and pre-built indexes of various reference genomes can be downloaded from Bowtie's page.)

However, use caution when downloading a pre-indexed genome: know what you are downloading, and be sure the Bowtie index is consistent with the chromosome files specified with the -c option.
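To illustrate what "path and basename" means for -B, the sketch below strips the suffix of a Bowtie index file to recover the base name. It assumes the Bowtie 1 naming scheme, in which index files end in .1.ebwt to .4.ebwt, .rev.1.ebwt and .rev.2.ebwt; the function name is hypothetical.

```python
import re

# Assumption: Bowtie 1 index files are named <base>.N.ebwt or <base>.rev.N.ebwt.
_EBWT_SUFFIX = re.compile(r"(\.rev)?\.[1-4]\.ebwt$")


def bowtie_base_name(index_file):
    """Return the value to pass to -B for a given Bowtie index file path."""
    base, replaced = _EBWT_SUFFIX.subn("", index_file)
    if replaced == 0:
        raise ValueError("not a Bowtie index file name: %r" % index_file)
    return base
```

For example, bowtie_base_name("index.1.ebwt") returns "index", and a path such as /data/hg19.rev.2.ebwt yields /data/hg19.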

-o/--output-dir <string>

The name of the directory in which MapSplice will write its output. The default is "mapsplice_out/" under the current directory MapSplice is run in.

-t/--avoid-regions <string> (optional)

Regions to avoid (i.e. mask) while searching for alignments

- gff format required

- e.g. ~/examples/islands.gff

-T/--interested-regions <string> (optional)

Regions of interest while searching for alignments

- gff format required

-M/--sam-file <string> (optional)

A comma-separated list (no blank spaces) of SAM files (including path)

- Only supports single-end reads

- If this value is specified, then the reads_file option will not be used

- The unmapped reads in the SAM files will be converted into FASTQ format to be used as input reads

--bam <string> (optional)

A comma-separated list (no blank spaces) of BAM files (including path)

- Only supports single-end reads

- If this value is specified, then the reads_file option will not be used

- The unmapped reads in the BAM files will be converted into FASTQ format to be used as input reads

--filter-fusion-by-repeat <string> (optional)

Filter out a fusion junction if its donor sequence and acceptor sequence appear repeatedly

- blat needs to be installed on the system, and a chromosome index in blat format needs to be provided

- e.g. human index in blat format: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit

- The output is "fusion_remap_junction.unique.chr_seq.extracted.repeat_filtered"

Basic options:

Basic options are options that are suggested to be specified for MapSplice to run correctly

-L/--seglen <int>

Length of read segments

- Suggested to be in the range [18,25]; if the segment is too short, it will be mapped everywhere

- Segment length should not be longer than half of the read length

- Segment length should not be longer than 25

- If the read length can't be divided evenly, the read sequence will be truncated at the end for now (e.g. a segment length of 25 for a 60 bp read will use segments of nucleotides 1-25 and 26-50)
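The segment length rules and the truncation behaviour can be worked through with a short sketch. The two helpers below are hypothetical and only restate the constraints given for -L; they are not taken from the MapSplice source.

```python
def check_seglen(read_length, seglen):
    """Return a list of messages for a segment length that breaks the stated rules."""
    messages = []
    if seglen > 25:
        messages.append("segment length should not be longer than 25")
    if seglen > read_length / 2:
        messages.append("segment length should not be longer than half of the read length")
    if seglen < 18:
        messages.append("segment length below the suggested range [18,25]")
    return messages


def segment_ranges(read_length, seglen):
    """Return 1-based inclusive (start, end) ranges; leftover bases at the end are dropped."""
    count = read_length // seglen
    return [(i * seglen + 1, (i + 1) * seglen) for i in range(count)]
```

With a 60 bp read and a segment length of 25, segment_ranges(60, 25) returns [(1, 25), (26, 50)], matching the example above: the last 10 nucleotides are not used.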

-Q/--reads-format <string>

Format of input reads, fa OR fq

--pairend

Whether the input reads are paired-end or single-end. Needs to be specified for paired-end reads
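As an illustration of how the inputs and basic options fit together, the sketch below assembles an argument list from the option names documented above. The helper name and all file paths in the example are hypothetical; the function only builds the list and does not run MapSplice.

```python
def build_mapsplice_command(reads, chrom_dir, bowtie_index, output_dir,
                            seglen, reads_format, paired_end=False):
    """Build an argument list for mapsplice_segments.py from the basic options."""
    if reads_format not in ("fa", "fq"):
        raise ValueError("reads format must be 'fa' or 'fq'")
    command = [
        "python", "bin/mapsplice_segments.py",
        "--reads-file", ",".join(reads),       # comma separated, no blank space
        "--chromosome-files-dir", chrom_dir,
        "--Bowtieidx", bowtie_index,
        "--output-dir", output_dir,
        "--seglen", str(seglen),
        "--reads-format", reads_format,
    ]
    if paired_end:
        command.append("--pairend")            # needed for paired-end reads
    return command
```

The resulting list could then be handed to subprocess.run(command, check=True) from the MapSplice installation directory.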

Advanced options:

-E/--segment-mismatches <int>

The maximum number of mismatches (Hamming distance) that are allowed in an unspliced aligned read and segment. The default is 1. Must be in range [0-3]
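For readers unfamiliar with the term, the Hamming distance between two equal-length sequences is the number of positions at which they differ. The sketch below shows the count and a check against the -E limit; both function names are hypothetical and the code is not taken from MapSplice.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(1 for x, y in zip(a, b) if x != y)


def within_mismatch_limit(segment, reference, max_mismatches=1):
    """True if the segment matches the reference with at most max_mismatches mismatches."""
    if not 0 <= max_mismatches <= 3:
        raise ValueError("max_mismatches must be in range [0-3]")
    return hamming_distance(segment, reference) <= max_mismatches
```

With the default of 1, a segment ACGTACGT aligned against ACGTTCGT (one differing base) is accepted, while a segment with two differing bases is not.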

--non-canonical | --semi-canonical

Whether or not the semi-canonical and non-canonical junctions should be outputted

The maximum small indel length (default is 3, suggested to be in [0-3])

--min-missed-seg <int>

An option to output incomplete alignments.

- The minimal number of segments contained in alignment.

- e.g. If the read length is 75bp and segment_length is 25, then setting min_missed_seg to 1 will output 50bp alignments if there are no 75bp alignments for the corresponding reads

- The default is to output alignments of full read length
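The arithmetic behind the example can be sketched as follows. This reading is inferred from the worked example only (a 75bp read, 25bp segments, and a value of 1 giving 50bp alignments), so treating the value as the number of segments allowed to be missing, and 0 as the default, are assumptions; the function name is hypothetical.

```python
def shortest_reported_alignment(read_length, seglen, min_missed_seg=0):
    """Shortest alignment length reported, assuming the value counts missing segments."""
    total_segments = read_length // seglen
    if not 0 <= min_missed_seg < total_segments:
        raise ValueError("min_missed_seg must be between 0 and the segment count minus 1")
    return (total_segments - min_missed_seg) * seglen
```

Under this reading, shortest_reported_alignment(75, 25, 1) gives 50 and shortest_reported_alignment(75, 25) gives 75, the full read length.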

--search-whole-chromosome

If specified, search up to the maximum intron length away in both exonic and non-exonic regions.

- Exonic region: a region to which segments were mapped during segment mapping

- Normally MapSplice will only search up to the maximum intron length away in exonic regions for fractions (i.e. small exons < segment length) of a spliced segment

- This enables MapSplice to find spliced alignments in small exons (< segment length) at the head and tail across the chromosome, but will increase running time

--map-segments-directly

- If specified, MapSplice will try to find both spliced and unspliced alignments of a read and select the best alignment (this will increase running time)

- If not specified, MapSplice will first try to find unspliced alignments of a read; only if none are found will it try to find spliced alignments for the read

Whether or not fusion junctions should be output

- Reads not aligned as normal unspliced or spliced alignments are considered fusion candidates

- The outputs are "fusion.junction" and "fusion_junction.unique" if full-running is not turned on

- The output is "fusion_remap_junction.unique.chr_seq.extracted" if full-running is turned on

--cluster

Whether or not to use paired-end reads to generate cluster regions for fusion read mappings

- Uses paired-end reads to find fusion alignments with a single anchored method

- e.g. use 2x50 paired reads and a 25bp segment length to find fusion alignments

- Only valid for paired-end reads with full running and do_fusion turned on (set full_running = yes and do_fusion = yes)