Documentation

SiPhy

The SiPhy software package comprises several analysis modules and utility programs. All are Java programs executed from the command line. The primary executable file, bin/scaler_0_5.jar, includes the following modules (the -task parameter identifies modules by task number). The general methods are described
below and their full description, inputs and outputs are described in the appropriate sections:

Task

Description

1

Slide a fixed window to testimate local rate of vatiation [ω in Equation (10) of Garber, et al.].

10

Estimates local rate of variation for a set of fixed regions (such as exons).

Multiple Alignment File

Multiple alignment files for well-studied phylogenies can be found on public websites such as the UCSC Genome Bioinformatics Site (http://genome.ucsc.edu).

Missing data and alignment gaps:

Mulitple sequence alignments of genome sequence allow for missing (or inserted) sequence in one or several species to align as "gaps". Such gaps are abundant and may be the result evolutionary insertion and deletions events that, either by random drift or due to selective pressure, become fixed in a species, or they may be due to artifacts of the multiple alignment process. Further, genome assemblies vary in their completenes, and contain in many "gapped" regions of missing sequence. In this version, SiPhy treats both missing data and alignment gaps by ignoring the species with missing or gapped alignments from the computations.

Model File

The model file specifies the parameters of the probabilistic model describing the evolutionary process: A matrix describing the rate at which each nucleotide mutates into another, the equilibrium distribution of bases, and the phylogentic tree topology connecting the species under study.

SiPhy modules estimate the local divergence of different statistics from the neutral "average" evolutionary model provided by the model file. The modules work with the output of two programs commonly used to generate such models:

Phast. The output file format of this program is used as the input model file for SiPhy.

PAML. SiPhy provides a conversion utility that converts the output file format of this program to the Phast model file format.

Two types of genomic sequence are understood to have evolved under no or very weak selection: (1) 4D-sites, the four-fold degenerate third codon position of protein-coding genes, and (2) Ancestral Repeats (AR), mobile elements that are ancestral to all sequences under study. Several software programs, such as Phast and PAML, estimate the average substitution rate from the extracted alignment crresponding to these sequences. The resulting model file includes the scaled phylogenetic tree with edge length proportional to the estimates average substitutions per site, a rate matrix Q representing the rate of substitutions between the four nucleotides, and a stationary distribution.

Following is an example model file. The identifiers in the TREE field of the model file must match identifiers in the multiple alignment file.

Estimating local rate of variation (ω)

SiPhy provides two ways to estimate the local rate of variation [ω in Equation (10) of Garber, et al.]: by sliding a fixed-size window or by providing a set of annotations in BED format. To estimate variation for the whole genome or a generic distribution of constraint in the alignment, use a fixed-size window. To estimate variation for specific genomic areas, use annotations. The result is a tab-delimited output file with the estimated rate of variation for each window or element. The paper discusses only the fixed-size window method.

Sliding a fixed-size window

Estimate the local rate of variation [ω in Equation (10) of Garber, et al.] by sliding a fixed-size window.

This command generated the data describbed in Garber et al. for Encode region ENm002. Both the input and results of the command are available in the data directory.

Estimating varying size elements

Estimate the local rate of variation [ω in Equation (10) of Garber, et al.] providing a set of annotations in BED format. The paper discusses how to estimate the local rate of variation using fixed-size windows; it does not discuss this method.

-in: BED file annotating elements in the multiple alignment file. SiPhy estimates the local rate of variation for these elements. This tasks does not support reading from standard input, you must provide file.

This command computes the change in substitutuion rate (ω) for each of the three exons of the TUG1 non-coding RNA. The model file and output can be found in the data directory. The multiple alignment for chromosome 22 can be downloaded from UCSC genome browser: http://hgdownload.cse.ucsc.edu/goldenPath/hg18/multiz44way/maf/chr22.maf.gz . The -ignore flag used here restricts the program to eutherian mammals rather than using all vertebrate species in the alignment.

This command will estimate the stationary distribution (π) and its log-odds ratio score for each column in the alignment. Both input files and the output of this command can be found in the data directory.

Integrating site-wise substitution patterns log-odds score in windows

After estimating substitution patterns per column as described above, use this method to sum the resulting scores to estimate substitution patterns across a larger region or a fixed-sized window.

Command:

java -jar bin/scaler_0.5.jar -task 11

Parameters:

-in: π estimation file (the result of site-wise base substitution pattern estimation). This task does not support standard input, you must provide a file name

As describbed in Garber, et al.] Siphy can define conserved elements whose bases are evolving in unexpected subsitituion patterns using a hidden Markov model. Siphy supports two commands, the first to find the Viterbi path, the second to estimate the posterior probability that a base is evolving under the biased substitution pattern model rather than the neutral model. Both commands have the same parameters and only differ in their task number.

Computes the viterbipath and posterior probabilities using the π HMM on the encode region ENm002. The input and output used in this example can be downloaded from the data directory.

Scoring motif conservation

Estimates a conservation score for a given sequence motif using a simple generalization of the substitution pattern estimation. The algorithm is similar to the one described by Moses, et al. in MONKEY (Genome Biology 2004, 5:R98). This method was not discussed in Garber, et al. but was used in Guttman, et al. (Nature 2009, 458:223-227) to rank lincRNAs by their potential to be regulated by known transcription binding sites based on occurrence of conserved binding sites.

SiPhy estimates the conservation score by sliding a fixed-size window equal in length to the motif. You can analyze a single genomic interval by specifying the start and end coordinates of that interval (-start and -end parameters) or analyze a set of coordinates by specifying a set of annotations in a BED file (-in parameter). The input multiple alignment files must be in the MAF file format.

Command:

java -jar bin/scaler_0.5.jar -task 16

Parameters:

-pwm: Required. Position weight matrix (pwm) of the sequence motif in pwm file format. See p53.pwm for a sample input file.

-indir: Input directory that contains a MAF file for each chromosome. Alignment files are assumed to have a .maf extension, if they have a different extension use the -mafSuffix to specify it. If omitted, working dir is used?

-outdir: Output directory. If omitted, SiPhy writes output files to the working directory.

-seedMinScore: For long motifs, provide a minimum reference score. SiPhy computes a reference score for motif tuples then analyzes only those motifs that score higher than the specified minimum reference score. Scores depend on the motif information content. For example, a minSeedScore of 0 would seed on tuples that are equally likely under both the neutral and PWM models, a positive score indicates that the tuple is more likely under the PWM specified base frequencies.

-ignore: Comma separated species to ignore, if the model's phylogenetic tree contains species not desired in the analysis.

-priorOmega: For a large interval, adjust for local substitution rate variation by specifying the ω obtained as described above. If omitted, ω is set to 1 (neutral).

-outprefix: To append a used defined prefix to the generated output file name.

-mafSuffix: By default the command assume MAF files withing the indir have a .maf extension. Use this parameter to sepcify a different file extension.

Result:

The ouput is one BED file per motif (the file name being the motif name as it appears in the pwm input file) with the locations that pass the minimum seed score, the orientation of the entry reflects the best orientation of the motif match. The score is sum of the log-odds ratio of the likelihood that each column of the element was generated by the evolutionary model with stationary distribution (π) equals to the corresponding column in the motif or by the null model.

This command will scan the TUG1 non-coding RNA loci for conserved instances of the transfac P53 motif. The -minSeedScore = 2 improves performance by scanning the reference first, only running the maximum likelihood estimation for windows where the log-odds ratio of the likelihood of the reference (hg18) bases being generated by the motif vs the neutral model specified is grater than 2. All input files, except for the directory with the genome wide alignments can be found in the data directory, to run this command it is sufficient to create a multiz44way directory and download the chr22 alignment from the UCSC browser: http://hgdownload.cse.ucsc.edu/goldenPath/hg18/multiz44way/maf/chr22.maf.gz the -ignore flag restricts the program to eutherian mammals rather than using all vertebrate species in the alignment.