Saguaro (Genome-Wide) is a program to detect signatures of selection
within populations, strains, or species. It takes SNPs or
nucleotides as input, and creates statistical local phylogenies for
each region in the genome. Saguaro was developed at the Science for
Life Laboratory, Department of Medical Biochemistry and
Microbiology, Uppsala University, and the Broad Institute of MIT and
Harvard.

Background
When species or populations diverge, their genomes accumulate
differences from each other, so that the overall phylogeny follows
ancestry and can be recognized by computing pairwise genomic distances
genome-wide. However, several evolutionary forces act on very local
genomic regions, so that these regions appear to violate the
dominant phylogeny, e.g. parallel evolution that sweeps certain
haplotypes to fixation independently in different populations,
driven by the need to adapt to similar environments.
Saguaro implements an algorithm that sets out to identify and
pinpoint such regions, without the need for any a priori hypotheses:
a Hidden Markov Model and a Neural Network, applied in an
interleaved fashion, will hypothesize local "phylogenies",
especially when they occur several times over in the genome, and
report them for further biological analysis.
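
To make the idea of a local region disagreeing with the genome-wide
picture concrete, here is a minimal Python sketch that computes pairwise
mismatch fractions over a whole toy alignment and over one window. It is
only an illustration of pairwise genomic distances, not Saguaro's
HMM/neural-network algorithm; the function name, the handling of gaps and
N, and the toy sequences are all assumptions made for this example.

```python
from itertools import combinations


def pairwise_distances(seqs):
    """Fraction of mismatching positions for every pair of aligned sequences.

    seqs maps a sample name to its aligned sequence. Positions where either
    sequence has a gap ('-') or an 'N' are ignored (an assumption of this
    sketch).
    """
    dist = {}
    for a, b in combinations(sorted(seqs), 2):
        compared = 0
        mismatches = 0
        for x, y in zip(seqs[a], seqs[b]):
            if x in "-N" or y in "-N":
                continue
            compared += 1
            if x != y:
                mismatches += 1
        dist[(a, b)] = mismatches / compared if compared else float("nan")
    return dist


# Toy alignment (hypothetical data, not taken from the Saguaro sample set).
alignment = {
    "pop1": "ACGTACGTAC",
    "pop2": "ACGTACGTAT",
    "pop3": "ACGAACGAAT",
}

# Distances over the full alignment versus over one local window.
genome_wide = pairwise_distances(alignment)
local = pairwise_distances({name: s[3:8] for name, s in alignment.items()})
```

In this toy data pop1 and pop2 differ at one position overall but are
identical inside the window, which is the kind of local deviation from
the overall distances that the text describes.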

Supported platforms
Saguaro is written in C++ and requires Linux and the GNU gcc
compiler (we tested it with gcc versions 4.2, 4.4, and 4.6). The
amount of RAM required depends on the data set. Please note
that even the small sample data set included in the download
will take up several GB, since the data conversion is optimized to
process large amounts of data quickly. The sample data does run on a
MacBook Pro with 4GB of RAM, but only if no other programs are
loaded into memory, and even then it will start swapping.

Installing Saguaro
Download the source code from here. To
compile the executables, type

-i<string> : input multiple alignment file (MAF format)
-o<string> : binary feature output files
-n<string> : names of the genomes to be extracted (must match MAF)
-nosame<bool> : skip positions in which all calls are the same (def=0)
-m<int> : minimum coverage (def=2)
-c<string> : name of the genome in which the coordinates will be reported

Note that you need to provide a plain text file to option -n that
contains all the organisms to be analyzed, with one entry per line
(see sample data below). The option -c defines which genome to use
as the coordinate system, and this organism needs to be in the list
of names. If organisms are either fairly closely related (more bases
match than mismatch across all organisms), or the experiment
normalizes conserved regions, we recommend the -nosame option.
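
The requirements on the names file and the effect of the filtering
options can be sketched in Python as follows. This is not the
converter's code: the function names are hypothetical, the genome names
are placeholders, and the reading of "minimum coverage" as the number of
genomes with a call at a position is an assumption of this sketch.

```python
def read_names(text):
    """Parse a names file: one genome name per line, blank lines ignored."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def check_reference(names, reference):
    """The coordinate genome (-c) has to be one of the listed names (-n)."""
    if reference not in names:
        raise ValueError(f"{reference!r} is not in the list of names")


def keep_column(calls, nosame=False, min_coverage=2):
    """Decide whether one alignment column would be kept.

    calls holds one character per genome; '-' or 'N' means "no call".
    Coverage is assumed here to be the number of genomes with a call.
    """
    called = [c.upper() for c in calls if c not in "-Nn"]
    if len(called) < min_coverage:
        return False  # too few genomes have a call at this position
    if nosame and len(set(called)) == 1:
        return False  # every call is identical, so the column is skipped
    return True


# Placeholder genome names; a real file must use the names found in the MAF.
names = read_names("human\nchimp\nmouse\n")
check_reference(names, "human")
```

With these definitions, keep_column("AAA") keeps the column, while
keep_column("AAA", nosame=True) and keep_column("A--") both drop it.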

and all entries in the multi-fasta file will be converted to Saguaro
format. Note that the multi-fasta file needs to contain all the gaps
to build a consistent multi-alignment, and further, that the
multi-alignment positions will be used as the coordinate system to
report results.
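
Because the alignment positions serve as coordinates, every entry in the
multi-fasta file has to have the same gapped length. A quick way to check
this before conversion is sketched below in Python; the function names
are hypothetical and the parser is deliberately minimal (it only handles
plain '>' headers followed by sequence lines).

```python
def read_fasta(text):
    """Parse multi-FASTA text into a dict of name -> gapped sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def alignment_length(records):
    """Return the common alignment length, or raise if entries disagree."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError(f"entries differ in length: {sorted(lengths)}")
    return lengths.pop()


# Toy input (hypothetical): both entries are 5 columns long, gap included.
example = ">a\nAC-GT\n>b\nACGGT\n"
n_columns = alignment_length(read_fasta(example))
```

If the gaps have been stripped from any entry, alignment_length raises a
ValueError instead of returning a length.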

Running Saguaro
Once the data has been converted into features, run Saguaro. The options are:

where either a single feature file will be processed via the -f
option, or a list of files (one line per file name) is supplied via
the -l option. The option -iter controls both the number of
iterations and how many hypotheses will be output.
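
When there are many feature files, the list file for -l (one file name
per line) can be generated rather than typed. A minimal Python sketch,
assuming the feature files sit in one directory; the function name, the
".feat" extension, and the paths in the comments are placeholders, not
names used by Saguaro.

```python
from pathlib import Path


def write_feature_list(feature_dir, list_path, pattern="*"):
    """Write one feature file path per line and return the listed paths."""
    files = sorted(p for p in Path(feature_dir).glob(pattern) if p.is_file())
    Path(list_path).write_text("".join(f"{p}\n" for p in files))
    return files


# Example call with placeholder paths:
# write_feature_list("features", "features.list", pattern="*.feat")
```

Writing the list outside the feature directory, or using a specific
pattern, keeps the list file from ending up in its own listing on a
second run.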

Output
The final result is written to the file LocalTrees.out in the output
directory. After a file header, it lists a phylogeny for each
genomic location, including coordinates and a distance matrix best
describing this region, e.g.:

In addition, the file saguaro.cactus lists all hypotheses that have
been generated during the run and fit the individual regions best
genome-wide. Note that Saguaro is not forced to assign genomic
regions to all hypotheses, so some might not be used to classify
local regions.

Example data set
For example data, see the script test_Saguaro that is distributed
with the software. The sample data is part of the 29 mammalian
genomes alignment and maps to a part of human chromosome 6.