In the following example, we are going to process raw read data from UCE
enrichments performed against several divergent taxa so that you can get a feel
for how a typical analysis goes. I’m also going to use several tricks that I
did not cover in the UCE Processing for Phylogenomics section.

Usually, we want a count of the actual number of reads in a given sequence file
for a given species. As mentioned in the UCE Processing for Phylogenomics section, we can
do this several ways. We’ll use tools from unix, because they are fast. The
next line of code will count the lines in each R1 file (which should be equal to
the reads in the R2 file) and divide that number by 4 to get the number of
sequence reads.
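That line-count trick can be sketched as follows. The demo directory and filename below are placeholders (with real data, point the glob at your own untrimmed R1 files); we make a tiny two-read file here so the loop has something to count:

```shell
# make a tiny two-read demo R1 file; with real data, point the glob
# at your raw-read directory instead
mkdir -p demo-raw
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n' | gzip > demo-raw/demo_R1.fastq.gz

# each fastq record is 4 lines, so reads = lines / 4
for i in demo-raw/*_R1.fastq.gz; do
    echo "$i: $(zcat "$i" | wc -l | awk '{print $1/4}')"
done
# -> demo-raw/demo_R1.fastq.gz: 2
```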

The data you just downloaded are actual, raw, untrimmed fastq data. This means
they contain adapter contamination and low quality bases. We need to remove
these - which you can do several ways. We’ll use another program that I
wrote (illumiprocessor) because it allows us to trim many different indexed
adapters from individual-specific fastq files - something that is a pain to do
by hand. That said, you can certainly trim your reads however you would like.
See the illumiprocessor website for instructions on installing the program.

To use this program, we will create a configuration file that we will use to
inform the program about which adapters are in which READ1 and READ2 files. The
data we are trimming, here, are from TruSeq v3 libraries, but the indexes are 10
nucleotides long. We will set up the trimming file with these parameters, but
please see the illumiprocessor documentation for other options.

# this is the section where you list the adapters you used. the asterisk
# will be replaced with the appropriate index for the sample.
[adapters]
i7:AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC*ATCTCGTATGCCGTCTTCTGCTTG
i5:AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
# this is the list of indexes we used
[tag sequences]
BFIDT-166:GGAGCTATGG
BFIDT-016:GGCGAAGGTT
BFIDT-045:TTCTCCTTCA
BFIDT-011:CTACAACGGC
# this is how each index maps to each set of reads
[tag map]
Alligator_mississippiensis_GGAGCTATGG:BFIDT-166
Anolis_carolinensis_GGCGAAGGTT:BFIDT-016
Gallus_gallus_TTCTCCTTCA:BFIDT-045
Mus_musculus_CTACAACGGC:BFIDT-011
# we want to rename our read files something a bit more nice - so we will
# rename Alligator_mississippiensis_GGAGCTATGG to alligator_mississippiensis
[names]
Alligator_mississippiensis_GGAGCTATGG:alligator_mississippiensis
Anolis_carolinensis_GGCGAAGGTT:anolis_carolinensis
Gallus_gallus_TTCTCCTTCA:gallus_gallus
Mus_musculus_CTACAACGGC:mus_musculus

I create this file in a directory above the one holding my reads, so the
structure looks like:

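With the configuration file in place, the trimming run itself is a single illumiprocessor call. This is a sketch: the config filename (illumiprocessor.conf), the raw-read directory name, and the core count are assumptions you should adapt to your setup.

```shell
# run from the directory above raw-fastq/; guarded so the sketch is
# safe to paste before illumiprocessor is installed
if command -v illumiprocessor >/dev/null; then
    illumiprocessor \
        --input raw-fastq/ \
        --output clean-fastq \
        --config illumiprocessor.conf \
        --cores 12
fi
```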
Notice that the program has created a log file showing what it did, and it has
also created a new directory holding the clean data that has the name clean-
fastq (what you told it to name the directory). Within that new directory,
there are taxon-specific folders for the cleaned reads. More specifically, your
directory structure should look similar to the following (I’ve collapsed the
list of raw-reads):

The -> in the raw-reads directory above means there are symlinks to the
files. I have removed the file paths and replaced them with <PATH> so that the
figure will fit on a page.

The really important information is in the split-adapter-quality-trimmed
directory - which now holds our reads that have had adapter-contamination and
low-quality bases removed. Within this split-adapter-quality-trimmed
directory, the READ1 and READ2 files hold reads that remain in a pair (the
reads are in the same consecutive order in each file). The READ-singleton
file holds READ1 reads OR READ2 reads that lost their “mate” or “paired-
read” because of trimming or removal.

You might want to get some idea of what effect the trimming has on read counts
and overall read lengths. There are certainly other (better) tools out there to
do this (like FastQC), but you can get a reasonable idea of how good your reads
are by running the following, which will output a CSV listing of read stats by
sample:

# move to the directory holding our cleaned reads
cd clean-fastq/

# run this script against all directories of reads
for i in *; do
    phyluce_assembly_get_fastq_lengths --input $i/split-adapter-quality-trimmed/ --csv;
done

The output you see should look like this:

# sample,reads,total bp,mean length, 95 CI length,min,max,median
All files in dir with alligator_mississippiensis-READ1.fastq.gz,3279362,294890805,89.9232243955,0.01003996813,40,100,100.0
All files in dir with anolis_carolinensis-READ2.fastq.gz,3456457,314839345,91.0873026917,0.00799863728974,40,100,100.0
All files in dir with gallus_gallus-READ2.fastq.gz,749026,159690692,213.197795537,0.0588973605567,40,251,250.0
All files in dir with mus_musculus-READ-singleton.fastq.gz,2332785,211828511,90.8049867433,0.0102813002698,40,100,100.0

phyluce has a lot of options for assembly - you can use velvet, abyss, or
trinity. For this tutorial, we are going to use trinity, because I believe it
works best for most purposes. The helper programs for the other assemblers use
the same config file, too, so you can easily experiment with all of the
assemblers.

To run an assembly, we need to create another configuration file. The
assembly configuration file looks like the following, assuming we want to
assemble all of our data from the organisms above:
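The assembly configuration file is a simple [samples] section mapping the name you want for each assembly to the directory of that taxon's cleaned reads. This is a sketch assuming the clean-fastq layout created above; replace <PATH> with the absolute path on your system:

```
[samples]
alligator_mississippiensis:<PATH>/clean-fastq/alligator_mississippiensis/split-adapter-quality-trimmed/
anolis_carolinensis:<PATH>/clean-fastq/anolis_carolinensis/split-adapter-quality-trimmed/
gallus_gallus:<PATH>/clean-fastq/gallus_gallus/split-adapter-quality-trimmed/
mus_musculus:<PATH>/clean-fastq/mus_musculus/split-adapter-quality-trimmed/
```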

If you want to change the names on the left hand side of the colon in the config
file, you can do so, but the PATHs on the right hand side need to point to our
“clean” UCE raw reads. If you have files in multiple locations, you can use
different PATHs on the right-hand side.

Attention

Although you can easily input new PATHs in this file, the
structure of the data below the PATH you use must be the same - meaning
that the structure and naming scheme for READ1, READ2, and READ-singleton
must be the same. Or, put another way, it must look like the following:

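Concretely, using the read-file names from this tutorial, each PATH on the right-hand side of the config is expected to contain the three trimmed read files directly (a sketch; only the READ1/READ2/READ-singleton naming matters):

```
<PATH>/
    alligator_mississippiensis-READ1.fastq.gz
    alligator_mississippiensis-READ2.fastq.gz
    alligator_mississippiensis-READ-singleton.fastq.gz
```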
Note that I am using 12 physical CPU cores to do this work. You
need to use the number of physical cores available on your machine. The
phyluce.conf file also assumes you have at least 8 GB of RAM on your
system, and it is better to have much more. If you use more CPU cores than
you have or you specify more RAM than you have, bad things will happen.

As the assembly proceeds, you should see output similar to the following:

Your species-specific assembly files are in the trinity-assemblies directory
nested within species-specific directories that correspond to the name you used
in the assembly.conf file (to the left of the colon). Each name is appended
with _trinity because that’s what trinity requires. There is a symlink
within each species-specific folder, so you know that the “contigs” you
assembled are in Trinity.fasta. trinity.log holds the log output from trinity.

There is also a contigs directory within this folder. The contigs directory
is the important one, because it contains symlinks to all of the species-
specific contigs. This means that you can treat this single folder as if it
contains all of your assembled contigs.

The process of read assembly often differs by operating system and sometimes
by OS version, and some of these differences are due to libraries that
underlie many of the assembly programs. Expect to see differences. You
should not expect for them to be huge.

Attention

If you see max-contig sizes around 16KB (for vertebrates), that
is commonly the entire or almost-entire mtDNA genome. You do not tend to
see entire mtDNA assemblies when the input DNA was extracted from a source
having few mitochondria (e.g. blood).

There are many, many other assembly QC steps you can run other than simply
looking at the stats of the assembled contigs. We will not go into those here.

Now that we’ve assembled our contigs from raw reads, it’s time to find those
contigs which are UCE loci and move aside those that are not. The directory
structure before we do this should look like the following:

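The matching step that produces these results is a single phyluce command. A sketch follows; the probe-set filename is an assumption, so substitute whichever probe FASTA you actually enriched with:

```shell
# guarded so the sketch is safe to paste before phyluce is installed
if command -v phyluce_assembly_match_contigs_to_probes >/dev/null; then
    phyluce_assembly_match_contigs_to_probes \
        --contigs trinity-assemblies/contigs \
        --probes uce-5k-probes.fasta \
        --output uce-search-results
fi
```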
These are the capture data for the alligator_mississippiensis sample. We
targeted 5k UCE loci in this sample and recovered roughly 4484 of those loci.
Before reaching that total of 4484 loci, we removed 45 and 44 loci from the data
set because they looked like duplicates (probes supposedly targeting different
loci hit the same contig or two supposedly different contigs hit probes
designed for a single UCE locus).

Question: Why is the count of UCE loci different by sample?

For these example data, we enriched some samples (alligator_mississippiensis
and gallus_gallus) for 5k UCE loci, while we enriched others
(anolis_carolinensis and mus_musculus) for 2.5k UCE loci. Additionally, the
2.5k UCE enrichments did not work very well (operator error).

The directory structure now looks like the following (everything collapsed but
the uce-search-results directory):

The search we just ran created lastz search result files for each taxon, and
stored summary results of these searches in the probe.matches.sqlite database
(see The probe.matches.sqlite database for more information on this database and its structure).

Now that we have located UCE loci, we need to determine which taxa we want in
our analysis, create a list of those taxa, and then generate a list of which
UCE loci we enriched in each taxon (the “data matrix configuration file”). We
will then use this list to extract FASTA data for each taxon for each UCE locus.

First, we need to decide which taxa we want in our “taxon set”. So, we create a
configuration file like so:
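Using the assembly names from this tutorial, a minimal version of that file with a single list named all looks like:

```
[all]
alligator_mississippiensis
anolis_carolinensis
gallus_gallus
mus_musculus
```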

These names need to match the assembly names we used. Here, we have just put
all 4 taxa in a list that we named all. However, we can adjust this list in
many ways (see Creating a data matrix configuration file).

Save this file as taxon-set.conf at the top level of our uce-tutorial
directory. The directory should look like this, now:

Lots of times we want to know individual statistics on UCE assemblies for a
given taxon. We can do that by exploding the monolithic fasta file into a file
of UCE loci that we have enriched by taxon, then running stats on those exploded
files. To do that, run the following:
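A sketch of those two steps follows; the monolithic FASTA filename (all-taxa-incomplete.fasta) is an assumption, so use the file produced for your taxon set:

```shell
# guarded so the sketch is safe to paste before phyluce is installed
if command -v phyluce_assembly_explode_get_fastas_file >/dev/null; then
    # explode the monolithic FASTA into one file of UCE contigs per taxon
    phyluce_assembly_explode_get_fastas_file \
        --input all-taxa-incomplete.fasta \
        --output exploded-fastas \
        --by-taxon
    # run summary stats on each exploded, per-taxon file
    for i in exploded-fastas/*.fasta; do
        phyluce_assembly_get_fasta_lengths --input "$i" --csv
    done
fi
```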

You have lots of options when aligning UCE loci. You can align the loci and use
those alignments with no trimming, you can edge-trim the alignments following
some algorithm, and you can end+internally trim alignments following some
algorithm. It’s hard to say what is best in all situations. When taxa are
“closely” related (< 30-50 MYA, perhaps), I think that edge-trimming alignments
is reasonable. When the taxa you are interested in span a wider range of
divergence times (> 50 MYA), you may want to think about internal trimming.

How you accomplish your edge- or internal trimming is also a decision you need
to make. In phyluce, we implement our edge-trimming algorithm by running the
alignment program “as-is” (i.e., without the --no-trim option). We do
internal trimming by turning off trimming using --no-trim, then passing the
resulting alignments (in FASTA format) to a parallel wrapper around Gblocks.

You also have a choice of aligner - mafft or muscle (or you can externally
align UCE loci using a tool like SATé, as well).

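For example, an edge-trimmed mafft run over this tutorial's four taxa might look like the following sketch; the input FASTA name, output directory name, and core count are assumptions:

```shell
# guarded so the sketch is safe to paste before phyluce is installed
if command -v phyluce_align_seqcap_align >/dev/null; then
    phyluce_align_seqcap_align \
        --fastas all-taxa-incomplete.fasta \
        --output mafft-nexus-edge-trimmed \
        --taxa 4 \
        --aligner mafft \
        --cores 12 \
        --incomplete-matrix \
        --log-path log
fi
```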
The . values that you see represent loci that were aligned and successfully
trimmed. Any X values that you see represent loci that were removed
because trimming reduced their length to effectively nothing.

Attention

The number of potential alignments dropped here is abnormally
large because our sample size is so small (n=4).

The current directory structure should look like (I’ve collapsed a number of
branches in the tree):

Note that there are only 2 sets of counts in the Data matrix
completeness section because (1) we dropped all loci having fewer than 3
taxa and (2) that leaves only two remaining options.

The most important data here are the number of loci we have and the number of
loci in data matrices of different completeness. The locus length stats are
also reasonably important, but they can also be misleading because edge-trimming
does not remove internal gaps that often inflate the length of alignments.

Now, let’s do the same thing, but run internal trimming on the resulting
alignments. We will do that by turning off trimming with --no-trim and
outputting FASTA-formatted alignments with --output-format fasta.

If you look at the alignments we currently have, you will notice that each
alignment contains the locus name along with the taxon name. This is not what
we want downstream, but it does enable us to ensure the correct data went into
each alignment. So, we need to clean our alignments. For the remainder of
this tutorial, we will work with the Gblocks trimmed alignments, so we will
clean those alignments:
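A sketch of the cleaning step, assuming the Gblocks-trimmed alignments live in a directory named as below (adjust both directory names to match yours):

```shell
# guarded so the sketch is safe to paste before phyluce is installed
if command -v phyluce_align_remove_locus_name_from_nexus_lines >/dev/null; then
    phyluce_align_remove_locus_name_from_nexus_lines \
        --alignments mafft-nexus-internal-trimmed-gblocks \
        --output mafft-nexus-internal-trimmed-gblocks-clean \
        --cores 12
fi
```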

For the most part, I analyze 75% and 95% complete matrices, where “completeness”
for the 75% matrix means that, in a study of 100 taxa (total), all alignments
will contain at least 75 of these 100 taxa. Similarly, for the 95% matrix, in a
study of 100 taxa, all alignments will contain 95 of these 100 taxa.

Attention

Notice that this metric for completeness does not pay attention
to which taxa are in which alignments - so the 75%, above, does not mean
that a given taxon will have data in 75 of 100 alignments.

To create a 75% data matrix, run the following. Notice that the integer
following --taxa is the total number of organisms in the study.

# make sure we are in the correct directory
cd uce-tutorial/taxon-sets/all

# the integer following --taxa is the number of TOTAL taxa
# and I use "75p" to denote the 75% complete matrix
phyluce_align_get_only_loci_with_min_taxa \
    --alignments mafft-nexus-internal-trimmed-gblocks-clean \
    --taxa 4 \
    --percent 0.75 \
    --output mafft-nexus-internal-trimmed-gblocks-clean-75p \
    --cores 12 \
    --log-path log

Now that we have our 75p data matrix completed, we can generate input files
for subsequent phylogenetic analysis. For the most part, I favor ExaBayes,
RAxML, and ExaML (usually in that order). Formatting our 75p data into
a phylip file for these programs is rather easy. To do that, run:
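A sketch of that formatting step follows. The script name below is from phyluce 1.x (newer releases call it phyluce_align_concatenate_alignments), and the output directory name is an assumption:

```shell
# guarded so the sketch is safe to paste before phyluce is installed
if command -v phyluce_align_format_nexus_files_for_raxml >/dev/null; then
    phyluce_align_format_nexus_files_for_raxml \
        --alignments mafft-nexus-internal-trimmed-gblocks-clean-75p \
        --output mafft-nexus-internal-trimmed-gblocks-clean-75p-raxml \
        --charsets
fi
```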

Notice that using the --charsets flag will output the charsets
as well as the concatenated PHYLIP file. You generally want these and the
cost for them is low. I would always run this option - even if you later
do not use them.

The current directory structure should look like (I’ve collapsed a number of
branches in the tree):

The above data are ready for analysis in RAxML. I usually run RAxML in two
pieces - first running the “best” tree search, then running the bootstrap
replicates, and I almost always run these analyses on the HPC.

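That two-step RAxML workflow can be sketched as below; the binary name, model choice, seeds, and replicate counts are all assumptions you should tune for your data and cluster:

```shell
# guarded so the sketch is safe to paste before RAxML is installed
if command -v raxmlHPC-PTHREADS-SSE3 >/dev/null; then
    # 1) search for the "best" ML tree (20 independent searches)
    raxmlHPC-PTHREADS-SSE3 -T 12 -m GTRGAMMA -N 20 -p 772374015 \
        -s mafft-nexus-internal-trimmed-gblocks-clean-75p.phylip -n best
    # 2) run 100 bootstrap replicates
    raxmlHPC-PTHREADS-SSE3 -T 12 -m GTRGAMMA -N 100 -b 444353738 -p 772374015 \
        -s mafft-nexus-internal-trimmed-gblocks-clean-75p.phylip -n bootreps
fi
```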
I have not included ExaBayes in the conda package for phyluce because it
generally requires MPI and should really be run on an HPC or a large local
machine. If you would like to run it on your machine, you’ll need to get it
installed on your own.

To run data in ExaBayes, you need the PHYLIP file you just created as well as
two other files in order to run the program - one giving partition information
and another giving configuration options. Create these.

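The partition file (aln.part in the command below) uses RAxML-style partition lines, one per charset; a sketch with hypothetical locus names and coordinates:

```
DNA, uce-10 = 1-541
DNA, uce-101 = 542-1112
DNA, uce-112 = 1113-1698
```

The config.nexus file holds the ExaBayes run settings (number of runs, generations, and so on); see the ExaBayes manual for its format.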
# make sure we are in the correct directory
cd uce-tutorial/taxon-sets/all/mafft-nexus-internal-trimmed-gblocks-clean-75p-exabayes

# get a random number
echo $RANDOM
5341

# ensure the files we created above are in this directory, then
mpirun -np 12 exabayes -f mafft-nexus-internal-trimmed-gblocks-clean-75p.phylip -n run1 -q aln.part -s 5341 -c config.nexus -R 4

Warning

Note that I am using 12 physical CPU cores here (-np 12). You
need to use the number of physical cores available on your machine.

Once those are finished running, we can summarize the posterior using the
consense and postProcParams programs: