Once we have assembled our fastq data (see Assembly), we need to process
those contigs to (a) determine which represent enriched UCE loci and (b)
remove any potential paralogs from the data set. Before we can do that, we
need to do a little preparatory work by downloading a FASTA file representing
the probe set that we used.

To identify which of the contigs we’ve assembled are UCE loci (and which UCE
loci they might be), we are going to match our assembled contigs to the probes
we used to enrich UCE loci. Before we do that, however, we need to download
a copy of the probe set we used for matching purposes.

Attention

We archive official probe sets at
https://github.com/faircloth-lab/uce-probe-sets/, but you need to be
careful about which one you grab - probe sets can be of different sizes
(e.g., 2,500 or 5,500 loci) and for different groups of taxa (e.g., amniotes,
fish).

To download a given probe set for phyluce, you first need to figure out which
probe set you used. Then, you can use a command like wget on the command line
(or navigate to the URL with your browser and save the file):

# to get the 2.5k, amniote probe set
wget https://raw.githubusercontent.com/faircloth-lab/uce-probe-sets/master/uce-2.5k-probe-set/uce-2.5k-probes.fasta
# to get the 5k, amniote probe set
wget https://raw.githubusercontent.com/faircloth-lab/uce-probe-sets/master/uce-5k-probe-set/uce-5k-probes.fasta

Once we’ve downloaded the probe set we used to enrich UCE loci, we need to find
which of our assembled contigs are the UCE loci that we enriched. During this
process, the code will also remove any contigs that appear to be duplicates,
whether as a result of assembly problems or of biological events.

The way that this process works is that phyluce aligns (using lastz) the
contigs you assembled to the probes you input, on a taxon-by-taxon (or
OTU-by-OTU) basis. Then, the code parses the alignment file to determine which
contigs matched which probes, and whether any probes from a single locus
matched multiple contigs or a single contig matched probes designed from
multiple UCE loci. Either of the latter two events suggests that the locus in
question is problematic.
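
In phyluce, this matching step is run with a command along the following lines
(paths here are placeholders, and the flag names follow phyluce 1.x - verify
them with phyluce_assembly_match_contigs_to_probes --help):

```shell
# Match assembled contigs to the UCE probes; paths below are placeholders.
phyluce_assembly_match_contigs_to_probes \
    --contigs $ASSEMBLY/contigs \
    --probes uce-5k-probes.fasta \
    --output /path/to/output
```

This writes one *.lastz file per taxon and the probe.matches.sqlite database
to the output directory, both of which are discussed below.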

Hint

ADVANCED: The default regular expression assumes probes in your
file are named according to uce-NNN_pN, where uce- is just a text
string, NNN is an integer value denoting each unique locus, _p is a
text string denoting a “probe” targeting locus NNN, and the trailing
N is an integer value denoting each unique probe targeting the same
locus.

If you are using a custom probe file, then you will either need to ensure
that your naming scheme conforms to this approach OR you will need to
input a different regular expression to convert the probe names to locus
names using the --regex flag.
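
As a purely illustrative sketch (plain sed, not phyluce’s internal code),
this is what collapsing the default uce-NNN_pN probe names down to locus
names looks like:

```shell
# Strip the _pN probe suffix so that all probes targeting the same
# locus collapse to a single locus name (illustration only).
printf 'uce-503_p1\nuce-503_p2\nuce-1005_p1\n' \
    | sed -E 's/^(uce-[0-9]+)_p[0-9]+$/\1/' \
    | sort -u
# -> uce-1005 and uce-503 (two unique loci)
```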

The *.log files for each operation are always printed to the
screen AND also written out to the $CWD (current working directory).
You can keep these files more orderly by specifying a log directory
($LOG) on the command line using the --log-path option.

The *.lastz files within the /path/to/output directory are basically for
reference and individual review (they are text files that you can open using a
text editor to view). The really important data from the lastz files are
summarized in the:

probe.matches.sqlite

database file. It’s probably a good idea to have some knowledge of how this
database is structured, since it’s basically what makes the next few steps work.
So, let’s go over the structure and contents of this database.

probe.matches.sqlite is a relational database that summarizes all
valid matches of contigs to UCE loci across the set of taxa that you fed it.
The database is created by and for a program named sqlite, which is a very
handy, portable SQL database. For more info on SQL and SQLITE, see this
sqlite-tutorial. I’ll briefly cover the database contents and use below.

Basically, what this indicates is that you enriched 9 of 10 targeted UCE loci
from genus_species1, 3 of 10 UCE loci in the list from genus_species2,
and 2 of 10 UCE loci from genus_species3. The locus name is given in the
uce column. Remember that we’ve limited the results to 10 rows for the sake
of making the results easy to view.

If we wanted to see only those loci that enriched in all species, we could run:
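
Assuming the matches table holds one column per taxon, with a value of 1 where
a locus was detected (as the results above suggest), a query along these lines
would do it:

```sql
SELECT uce FROM matches
    WHERE genus_species1 = 1
    AND genus_species2 = 1
    AND genus_species3 = 1;
```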

Basically, the matches table and this query are what we run to generate
complete (only loci enriched in all taxa) and incomplete (all loci
enriched in at least one taxon) matrices very easily and quickly (see
Creating a data matrix configuration file).

The match_map table shows us which species-specific contigs match which UCE
loci. Because each assembly program assigns an arbitrary designator to each
assembled contig, we need to map these arbitrary designators (which also differ
for each taxon/OTU) to the UCE loci to which they correspond. Because assembled
contigs are also not in any particular orientation relative to each other across
taxa/OTUs (i.e., they may be 5’ - 3’ or 3’ - 5’), the database also records the
orientation of each contig relative to the orientation of each probe in the
probes file.

Let’s take a quick look at the match_map table:

SELECT * FROM match_map LIMIT 10;

This query is similar to the one that we ran against matches and returns the
first 10 rows of the match_map table:

As stated above, these results show which assembled contigs “hit” particular UCE
loci. So, if we were to open the
$ASSEMBLY/contigs/genus_species1.contigs.fasta symlink, we would see that the
contig named node_1676 corresponds to UCE locus uce-503. Because contigs are
named arbitrarily during assembly, this same UCE locus is also found in
genus_species2, but there it is named node-243.

Each entry in the rows also provides the orientation for particular contigs
(-) or (+). This orientation is relative to the orientation of the UCE
probes/locus in the source genome (e.g., chicken for tetrapod probes).

We use this table to generate a FASTA file of UCE loci for alignment (see the
section on extracting FASTA data, below), after we’ve identified the loci we
want in a particular data set (see Creating a data matrix configuration file).
The code for this step also uses the associated
orientation data to ensure that all the sequence data have the same orientation
prior to alignment (some aligners will force alignment of all reads using the
given orientation rather than also trying the reverse complement and picking the
better alignment of the two).

Now that we know the taxa for which we’ve enriched UCE loci and which contigs
we’ve assembled match which UCE loci, we’re ready to generate some data
matrices.

The data matrix generation process consists of two distinct parts:

Getting locus counts and generating a taxon set

Extracting FASTA data from our $ASSEMBLY/contigs based on the taxon set

After we identify the UCE loci we enriched, but before we extract fasta data
from our $ASSEMBLY/contigs corresponding to those loci, we need to create a
data matrix configuration file that denotes (1) which taxa we want to include in
a given analysis and (2) which loci will be included with this taxon set.

The taxa included in the data matrix configuration file are determined by the
user - you input a list of the taxa you want in the analysis. The UCE loci
included in the data matrix configuration file are then determined by the
software, which compares the requested taxa to the UCE match results in
probe.matches.sqlite along with one of two flags that you pass: one requesting
a complete data matrix, the other requesting an incomplete data matrix.

complete matrix

A phylogenetic matrix (typically sequence data) in which there are no
missing data at any locus for any taxon/OTU.

incomplete matrix

A phylogenetic matrix (typically sequence data) in which data may be
missing from a given taxon or a given locus (or both).

During the creation of the data matrix configuration file you can also include
additional data from pre-existing UCE match databases and contigs (see the
section on outgroup data, below).

First, let’s generate a data matrix configuration file from only the current UCE
enrichments that will be complete - meaning that we will not include loci where
certain taxa have no data (either because the locus was not enriched for that
taxon or because it was removed during the filtering process for duplicate
loci).

To do this, you need to create a starting taxon-configuration file (a text-based
file) denoting the taxa we want in the data set. The taxon-configuration file
should look exactly like this (substitute in your taxon names):

[dataset1]
genus_species1
genus_species2
genus_species3

Let’s assume you save this file as datasets.conf. Now, to create the data
matrix configuration file from this taxon-configuration file, run:
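
A sketch of that command follows; the paths are placeholders and the flag
names follow phyluce 1.x (verify with
phyluce_assembly_get_match_counts --help):

```shell
# Create the complete-matrix data matrix configuration file;
# paths below are placeholders for your own locations.
phyluce_assembly_get_match_counts \
    --locus-db /path/to/output/probe.matches.sqlite \
    --taxon-list-config datasets.conf \
    --taxon-group 'dataset1' \
    --output /path/to/uce/taxon-set1/dataset1.conf
```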

This basically says that although we’ve detected a total of 1,314 UCE loci
across the 3 taxa in which we are interested, when we boil those down to a
complete matrix, the complete matrix is only going to contain 506 of those
1,314 UCE loci. We had to drop 428 loci because we did not detect them in
genus_species1 and another 380 loci because we did not detect them in
genus_species2.

The output written to the /path/to/uce/taxon-set1/dataset1.conf will look
something like:

Now, you might think that increasing the locus count is simply a matter of
removing genus_species1 from the list of taxa. This is not strictly true,
however, given the vagaries of hits and misses among taxa.
phyluce_assembly_get_match_counts
has several other options to help you determine which taxa may be causing
problems, but picking the best combination of taxa to give you the highest
number of loci is a reasonably hard optimization problem.

You may not always want a complete data matrix, or generating a complete matrix
may drop too many loci for your tastes (it often does). In that case, you can
easily generate an incomplete dataset using the following:
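
Under the same assumptions as above, the command is identical except for the
addition of the --incomplete-matrix flag:

```shell
# Same as the complete-matrix run, plus --incomplete-matrix;
# paths are placeholders.
phyluce_assembly_get_match_counts \
    --locus-db /path/to/output/probe.matches.sqlite \
    --taxon-list-config datasets.conf \
    --taxon-group 'dataset1' \
    --incomplete-matrix \
    --output /path/to/uce/taxon-set1/dataset1-incomplete.conf
```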

This will generate a dataset that includes any loci enriched across the taxa
in the datasets.conf file.

Note

You do not determine the “completeness” of the final data matrix
that you want to create during this stage - that happens later, after
alignment (see Finalize matrix completeness). As a result, we are aligning data
from any and all UCE loci having >= 3 taxa, which allows us to flexibly
select the level of incompleteness later, without having to re-run our
alignments.

In this way, you can get some idea of how different taxon-set memberships
affect the resulting data matrix configuration files prior to extracting the
relevant FASTA data from $ASSEMBLY/contigs - which is a reasonably slow
process.

You may want to include outgroup data from another source in your datasets.
These can be the pre-processed outgroup data files, but they need not be - the
additional data can also be contigs previously assembled from a different set
of taxa.

Hint

ADVANCED: If you want to include outgroup data from
genome-enabled taxa, we have already created several repositories
containing these data. We maintain these data under version control at:
https://github.com/faircloth-lab/uce-probe-sets. To download these data
and use them in your analyses, you can clone the data using git:

git clone https://github.com/faircloth-lab/uce-probe-sets

Then update your --taxon-list-config file and provide the proper paths
to the cloned data, as detailed below.

The first step of this process is to set up your --taxon-list-config slightly
differently - by indicating taxa from external data sources using asterisks:
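
For example, a taxon-configuration file in which the last two taxa come from
an external source might look like this (the taxon names are placeholders):

```
[dataset2]
genus_species1
genus_species2
genus_species3
genus_species4*
genus_species5*
```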

To keep all this extension from getting too crazy, I’ve limited the ability to
include external data to a single set. If you have lots of data from many
different enrichments, you’ll need to generate a contigs folder containing all
these various assemblies (or symlinks to them), then align the probes to these
data (see Match contigs to probes). Once you do that, you can extend your
current data set with all of these other data.

Once we have created the data matrix configuration file containing data for our
taxa of interest and those loci of interest, we need to extract the appropriate
FASTA sequences from each assembly representing the taxon/OTU of interest (e.g.
in $ASSEMBLY/contigs). This is a reasonably straightforward process that
differs only slightly based on whether you are extracting a complete matrix of
data, an incomplete matrix of data, and/or whether you are incorporating any
external data sources.
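
For an incomplete matrix, the extraction command looks something like the
following (file names are placeholders; flag names follow phyluce 1.x, so
verify with phyluce_assembly_get_fastas_from_match_counts --help):

```shell
# Extract FASTA data for an incomplete matrix; names are placeholders.
phyluce_assembly_get_fastas_from_match_counts \
    --contigs $ASSEMBLY/contigs \
    --locus-db /path/to/output/probe.matches.sqlite \
    --match-count-output dataset1-incomplete.conf \
    --incomplete-matrix dataset1.incomplete \
    --output dataset1-incomplete.fasta
```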

Note the addition of the --incomplete-matrix option. This
creates an output file that contains the names of the missing loci by
taxon/OTU. You can name this file anything you like. I tend to use
.incomplete as the extension so that it is clear what this file
contains.

When we’re incorporating external data, we need to pass the name of the external
database as well as the name of the external contigs. To generate a FASTA
file containing the sequence data from a complete data matrix configuration
that includes external data sources, run:
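
A sketch of that command, assuming the external matches live in a previously
created database and contigs folder (the --extend-* flag names follow
phyluce 1.x; verify them with --help):

```shell
# Extract FASTA data while pulling in external matches; paths are placeholders.
phyluce_assembly_get_fastas_from_match_counts \
    --contigs $ASSEMBLY/contigs \
    --locus-db /path/to/output/probe.matches.sqlite \
    --match-count-output dataset2.conf \
    --extend-locus-db /path/to/external/probe.matches.sqlite \
    --extend-locus-contigs /path/to/external/contigs \
    --output dataset2.fasta
```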

With all of that out of the way, things get much easier to deal with. Now, we
need to align our data across loci, and once we’re done with that, the remaining
operations we can run on the data are format-conversions, QC steps, matrix
trimming for completeness, and any number of other fun things.

Aligning the amount of data generated by enrichment approaches is reasonably
computationally intensive - so the alignment step goes fastest if you have a
multicore machine. You also have several alignment options available, although
I would suggest sticking with MAFFT.
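
A typical invocation looks something like this (the --taxa value is the number
of taxa in your data set; names and paths are placeholders, and the flags
follow phyluce 1.x):

```shell
# Align all loci in the extracted FASTA file with MAFFT.
phyluce_align_seqcap_align \
    --fasta dataset1.fasta \
    --output mafft-nexus \
    --taxa 3 \
    --aligner mafft \
    --cores 8
# add --incomplete-matrix if you extracted an incomplete data matrix
```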

Attention

The alignment process, as implemented by phyluce, includes
trimming steps that trim ragged edges and remove alignments that become too
short following trimming.

To turn trimming off and trim alignments using another approach, pass the
--no-trim option. There are also several more options related to
trimming that you can tweak. To view these, run
phyluce_align_seqcap_align --help.

If you pass more --cores than your machine has, you will
receive an error.

Note

Here, we are accepting the default, output alignment format (“nexus”).
To change that format to something else, pass the --output-format option
with a choice of {fasta,nexus,phylip,clustal,emboss,stockholm}.

For historical reasons, and also for users to ensure that the sequence data
aligned together are from the same loci, each sequence line in the alignment
file output by seqcap_align_2 contains the genus_species1 designator,
but the genus_species1 designator is also prepended with the locus name
(e.g. uce-1005_genus_species1). We need to remove these locus names if we plan
to concatenate the loci (e.g., for RAxML). More generally, it is a good idea
to remove locus names from sequence lines before running any analyses. To do
this, run:
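
A sketch of that command (directory names are placeholders; the flags follow
phyluce 1.x):

```shell
# Strip locus names from the taxon lines of each alignment.
phyluce_align_remove_locus_name_from_nexus_lines \
    --alignments mafft-nexus \
    --output mafft-nexus-clean \
    --cores 8
```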


After checking the resulting alignment summary stats and checking your
alignments for quality, you will generally want to cull the data set to reach
your desired level of completeness. That is easily done by running the
following, inputting the set of alignments just generated:
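
For example, to keep only those alignments containing data for at least 75% of
taxa (directory names are placeholders; the flags follow phyluce 1.x):

```shell
# Keep only alignments with data for >= 75% of the taxa in the set.
phyluce_align_get_only_loci_with_min_taxa \
    --alignments mafft-nexus-clean \
    --taxa 3 \
    --percent 0.75 \
    --output mafft-nexus-min-75-taxa \
    --cores 8
```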

Finally, you will need to add missing data designators for taxa missing from
each alignment of a given locus. This will basically allow you to generate
concatenated data sets and it may reduce error messages from other programs
about files having unequal numbers of taxa. To do this, run:
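
A sketch of that command, reusing the .incomplete file generated during FASTA
extraction (names are placeholders; the flags follow phyluce 1.x):

```shell
# Insert missing-data designators for taxa absent from each alignment.
phyluce_align_add_missing_data_designators \
    --alignments mafft-nexus-min-75-taxa \
    --output mafft-nexus-min-75-taxa-missing \
    --match-count-output dataset1-incomplete.conf \
    --incomplete-matrix dataset1.incomplete \
    --cores 8
```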

Many workflows for phylogenetics simply involve converting one alignment format
to another or changing something about the contents of a given alignment. We
use many of these manipulations in the next section (see Preparing alignment data for analysis),
as well.

To convert one alignment type (e.g., nexus) to another (e.g., fasta), we have a
relatively simple bit of code to achieve that process. You can greatly speed up
this processing step on a multicore machine with the --cores option:
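
For example, to convert a directory of nexus alignments to fasta (directory
names are placeholders; the flags follow phyluce 1.x):

```shell
# Convert every alignment in the input directory from nexus to fasta.
phyluce_align_convert_one_align_to_another \
    --alignments mafft-nexus-min-75-taxa \
    --output mafft-fasta \
    --input-format nexus \
    --output-format fasta \
    --cores 8
```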

PHYLIP, PhyML, and other programs like CloudForest require input files to be in
strict phylip format for analysis. Converting alignment files to this format
was discussed above, and is simply a matter of:

MrBayes is a little more challenging to run. This is largely due to the fact
that we usually estimate the substitution models for all loci, then we partition
loci by substitution model, concatenate the data, and format an appropriate
file to be input to MrBayes.

The tricky part of this process is estimating the locus-specific substitution
models. Generally speaking, I do this with CloudForest now, then I strip the
best-fitting substitution model from the CloudForest output, and input that
file to the program that creates a nexus file for MrBayes.

First, estimate the substitution models using cloudforest (this will also give
you genetrees for all loci, as a bonus). You will need your alignments in
strict phylip format:

In the above, genetrees is a keyword that tells CloudForest that you mean to
estimate genetrees (instead of bootstraps). Depending on the size of your
dataset (and computer), this may take some time. Once this is done:

CloudForest is a program written by Nick Crawford and myself that helps you
estimate genetrees and perform bootstrap replicates for very large datasets.
Data input to CloudForest should be in strict phylip format (see
PHYLIP/CloudForest). First, as above, run the genetree analysis on your data
(if you ran this above, you don’t need to run it again). This will estimate
the genetree for each locus in your dataset, using its best-fitting
substitution model:

We can also use RAxML to generate gene trees to turn into a species tree. To
keep the taxa names similar to what I run through CloudForest, I usually input
strict phylip-formatted files to these runs (see PHYLIP/CloudForest). Once
that’s done, you can generate genetrees with:

The number of --cores is the number of simultaneous trees to estimate, while
--threads is the number of threads to use for each tree. Although somewhat
counterintuitive, I’ve found that 1 --thread per locus, with many loci being
processed at once, is the fastest route to go.

Once that’s finished, you can generate bootstrap replicates for those same loci: