The first few tutorials have given you a feel for how to perform
phylogenetic/phylogeographic analyses using existing probe sets, new data, and
genomes. However, what if you are working on a set of organisms without a probe
set targeting conserved loci? How do you identify those loci and design baits
to target them? This tutorial shows you how to do that.

In the examples below, we’ll follow a UCE identification and probe design
workflow using data from Coleoptera (beetles). Although you can follow the entire
tutorial from beginning to end, I’ve also made the BAM files containing mapped
reads available, which lets you skip the computationally expensive steps of
performing read simulation and alignment.

You do not necessarily need to do this as part of this tutorial
for UCE identification and probe design - if you only want to follow the
steps for locus identification (skipping probe design and in-silico
testing), you can simply download the prepared FASTQ/BAM files from
figshare.

When you get genome sequences from NCBI, the FASTA headers of most
scaffolds/contigs contain a lot of extra cruft that can cause problems with some
of the steps in the UCE identification and probe design workflow. I usually
remove the extra material, keeping only the accession number information for
each contig/scaffold. To do that, I use a little python script that looks
like the following:
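
A minimal sketch of such a script (this is my reconstruction, not necessarily the exact original; the accession shown in the comment is made up) keeps everything in each FASTA header before the first space:

```python
import sys

def clean_fasta_headers(infile, outfile):
    """Keep only the text before the first space in each FASTA header line."""
    with open(infile) as inf, open(outfile, "w") as outf:
        for line in inf:
            if line.startswith(">"):
                # e.g. ">XY123.1 Anoplophora glabripennis scaffold_1" -> ">XY123.1"
                outf.write(line.split(" ")[0].rstrip() + "\n")
            else:
                outf.write(line)

if __name__ == "__main__" and len(sys.argv) == 3:
    clean_fasta_headers(sys.argv[1], sys.argv[2])
```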

For these genome assemblies, the important bits are in the FASTA header just
before the space in the name. The code above basically keeps this information
before the space and discards the remaining FASTA header information.

In the case of the genomes above, here are the commands I ran (note also that
this creates an output file with a different name from the input file):
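
Assuming the header-cleaning script described above is saved as, say, fasta_header_clean.py (a name I am inventing here), and that the raw NCBI download sits next to it, an invocation looks something like:

```shell
cd uce-coleoptera/genomes/triCas1

# raw NCBI download in, cleaned assembly out (note the different file names)
python ../../fasta_header_clean.py triCas1_genomic.fna triCas1.fasta
```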

Later during the workflow, we’re going to need to have our genomes in 2bit
format, which is a binary format for genomic data that is easier and faster to
work with than FASTA format. We’ll use a BASH script to convert all of the
sequences, above, to 2bit format:
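
A BASH loop of this flavor does the conversion, assuming the UCSC faToTwoBit utility is on your PATH and each genome lives in genomes/<name>/<name>.fasta (my assumed layout):

```shell
cd uce-coleoptera/genomes

# convert each cleaned assembly to 2bit format
for critter in *; do
    faToTwoBit $critter/$critter.fasta $critter/$critter.2bit;
done
```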

You do not necessarily need to take this step as part of the
tutorial - you can simply download prepared, simulated FASTQ files from
figshare.

In order to locate UCE loci across a selection of different genomes, we’re going
to align reads from each taxon, above, to a reference genome sequence (the
“base” genome sequence) using a permissive raw read aligner. You can use reads
from low-coverage genome scans, or you can use reads simulated from particular
genomes. In this tutorial, we’re going to take the latter approach and simulate
reads (without sequencing error) from the genomes that we will align to the base
genome. To accomplish this, we’ll use art,
which is a robust read simulator that is reasonably flexible.

Because we’re using simulated reads to locate UCE loci, we want to turn off the
built-in feature of art that adds some sequencing error to simulated reads.
This results in a general form of the call to art that looks like:
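
Roughly, the general form is the following (the long option names are from art_illumina’s help output; double-check them against your installed version):

```shell
art_illumina \
    --paired \
    --in genome.fasta \
    --out genome-pe100-reads \
    --len 100 \
    --fcov 2 \
    --mflen 200 \
    --sdev 150 \
    --insRate 0 --insRate2 0 --delRate 0 --delRate2 0 --qShift 100 --qShift2 100 --noALN
```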

This simulates reads from the --in genome.fasta that are 100 bp in length,
cover the genome randomly to roughly 2X, have an insert size of 200 bp, and have
an insert size standard deviation of 150. The last line turns off the
simulation of sequencing error in each of these reads and also turns off the
creation of an alignment file showing where the reads came from in the genome
sequence.

In the case of the Coleoptera genomes you downloaded, here are the commands I
ran to simulate the read data that we will use in the next step (note also that
this creates an output file with a different name from the input file). First,
create a directory to hold the simulated read data:
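
A sketch of those commands follows; the five exemplar abbreviations (agrpla1, anogla1, denpon1, lepdec1, onttau1) and the genomes/<name>/<name>.fasta layout are my guesses at the file naming, so adjust them to your own:

```shell
cd uce-coleoptera
mkdir -p reads && cd reads

for critter in agrpla1 anogla1 denpon1 lepdec1 onttau1; do
    echo "working on $critter";
    # simulate error-free 100 bp paired reads at ~2X coverage
    art_illumina --paired --in ../genomes/$critter/$critter.fasta \
        --out $critter-pe100-reads --len 100 --fcov 2 --mflen 200 --sdev 150 \
        --insRate 0 --insRate2 0 --delRate 0 --delRate2 0 \
        --qShift 100 --qShift2 100 --noALN;
done
```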

Now, you should see 2 read files for each taxon. We are going to “break” the
pairs by merging the read information together, then we are going to zip the
resulting file that contains the R1 and R2 data. We can accomplish this
pretty easily using a quick BASH script:
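
Something like the following works, given that art names the pair files <prefix>1.fq and <prefix>2.fq (taxon abbreviations again assumed):

```shell
cd uce-coleoptera/reads

for critter in agrpla1 anogla1 denpon1 lepdec1 onttau1; do
    echo "working on $critter";
    # merge the R1 and R2 reads into a single file, then compress it
    cat $critter-pe100-reads1.fq $critter-pe100-reads2.fq > $critter-pe100-reads.fq;
    rm $critter-pe100-reads1.fq $critter-pe100-reads2.fq;
    gzip $critter-pe100-reads.fq;
done
```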

Attention

Note that we’ve dropped menMol1 from the read simulation
process. This is largely because it is an outgroup to beetles. We’ll use
it later, when we’re performing in silico tests of the UCE bait set.

Now that we have read data representing each of our exemplar taxa, we need to
align these reads to the “base” genome sequence, in this case the genome sequence
of Tribolium castaneum (aka triCas1). We selected this assembly as the “base”
genome because of its age (i.e., it is better assembled) and its level of annotation.

We will perform the read alignments to triCas1 using the permissive read
aligner, stampy, which works well when aligning sequences to a divergent
reference sequence. However, before running the alignments, we need to prepare
the base genome. And, before we do that, let’s create a directory to work in:
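
The preparation amounts to building stampy’s genome (.stidx) and hash (.sthash) files for triCas1; a sketch (directory names are my assumptions):

```shell
cd uce-coleoptera
mkdir -p base && cd base
cp ../genomes/triCas1/triCas1.fasta ./

# build the stampy genome index, then the hash table
stampy.py --species="tribolium-castaneum" --assembly="triCas1" -G triCas1 triCas1.fasta
stampy.py -g triCas1 -H triCas1
```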

Now that we’ve prepared our base genome, we need to perform the actual alignment
of the simulated reads to the base genome. And, before we do that, let’s create
a directory to hold the resulting alignments:

Now, we need to perform the alignments on a taxon-by-taxon basis (and/or you can
run these in parallel using HPC). To do this easily (and on a local
computer) we can use a BASH script to run the alignments serially:
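
The script below is one way to run these serially; --substitutionrate=0.05 is what permits roughly 5% sequence divergence between the reads and the base genome (paths and taxon abbreviations are illustrative):

```shell
cd uce-coleoptera
mkdir -p alignments && cd alignments
export cores=16
export base=triCas1
export base_dir=$(pwd)

for critter in agrpla1 anogla1 denpon1 lepdec1 onttau1; do
    mkdir -p $base_dir/$critter && cd $base_dir/$critter;
    # map simulated reads, then convert the SAM stream to BAM
    stampy.py --maxbasequal 93 -g ../../base/$base -h ../../base/$base \
        --substitutionrate=0.05 -t $cores --insertsize=400 \
        -M ../../reads/$critter-pe100-reads.fq.gz \
        | samtools view -Sb - > $critter-to-$base.bam;
done
```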

Warning

Note that I am using 16 physical CPU cores ($cores) to do this
work. You need to use the number of physical cores available on your
machine.

This code basically loops over each exemplar genome, aligns its reads to the
base genome sequence, and converts the resulting output to BAM format (which
is a binary, compressed version of SAM format).

When the alignments have completed, your directory structure should look
something like:

Now, these BAM files are pretty large because they contain all mapped as well
as all unmapped reads. We want to remove those unmapped reads, which will also
reduce file size. We can do that using samtools view and a BASH script.
We’re also going to create a directory named all and symlink all of the
reduced BAM files to this directory.
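
A script along these lines filters each file with samtools view -F 4 (keep only mapped reads) and symlinks the results into all (taxon abbreviations assumed, as above):

```shell
cd uce-coleoptera/alignments
mkdir -p all

for critter in agrpla1 anogla1 denpon1 lepdec1 onttau1; do
    # -F 4 excludes unmapped reads; -b writes BAM output
    samtools view -h -F 4 -b $critter/$critter-to-triCas1.bam \
        > $critter/$critter-to-triCas1-MAPPING.bam;
    rm $critter/$critter-to-triCas1.bam;
    ln -s ../$critter/$critter-to-triCas1-MAPPING.bam all/;
done
```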

Warning

The script, as-written, removes the BAM files containing both
mapped and unmapped reads. If you don’t want to do this, remove the rm
$critter/$critter-to-triCas1.bam; line.

Basically, because we’ve now mapped simulated (or actual) sequence data from
several exemplar taxa to a closely-related “base” genome sequence, we’ve
essentially identified putatively orthologous loci shared between the exemplar
taxa and the base taxon. These conserved regions are where the simulated (or
actual) sequence data mapped with a sequence divergence of < 5%.

Now, we need to filter this large number of conserved regions to remove things
like repetitive regions, but also to find which loci are shared among all
exemplar taxa - not simply a single exemplar and the base taxon.

If you are starting the tutorial at this position after
downloading the *-MAPPING.bam files from figshare, then you will need to
create a directory to work in named uce-coleoptera and then place all of
the *-MAPPING.bam files in a subdirectory of uce-coleoptera named
alignments/all. Your resulting directory structure should look like:

In the steps above, we have generated BAM files representing reads that stampy
has mapped to the base genome. The reads that map align to putatively
conserved sequence regions (which we still need to filter), and these alignments
of mapped reads should reside in our alignments/all directory.

To begin the filtering process, we’re going to convert each BAM file to BED
format, which is an interval-based format that is easy and fast to manipulate
with a software suite known as bedtools. But before we do that, we are going
to create a bed directory to hold all of these BED format files.
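
The conversion itself can look like the following (the -bed12 option reports each read's blocks in a single line; drop it if you prefer plain BED):

```shell
cd uce-coleoptera
mkdir -p bed && cd bed

for i in ../alignments/all/*.bam; do
    echo $i;
    bedtools bamtobed -i $i -bed12 > $(basename $i).bed;
done
```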

Before moving forward with the merge command, below, we need to sort the
resulting BED files, which orders each line of data in the BED file by
chromosome/scaffold/contig and position along that chromosome/scaffold/contig.
Again, we can do this with some BASH scripting:
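
For example (file naming here follows the *.bam.bed convention from the conversion step):

```shell
cd uce-coleoptera/bed

for i in *.bam.bed; do
    echo $i;
    bedtools sort -i $i > ${i%.bed}.sort.bed;
done
```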

Because there may be small gaps between proximate regions of conservation
(which may result because we’re using data that are either from low-coverage,
simulated reads or low-coverage actual reads) we need to merge together
putative regions of conservation that are proximate. Luckily bedtools has
a handy tool to do that - the merge function.
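
By default, bedtools merge joins overlapping and book-ended intervals; you can pass -d with a small distance if you want to bridge slightly larger gaps. A sketch:

```shell
cd uce-coleoptera/bed

for i in *.bam.sort.bed; do
    echo $i;
    bedtools merge -i $i > ${i%.bed}.merge.bed;
done
```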

To get some idea of the total number of merged, putatively conserved regions in
each exemplar taxon that are shared with the base genome, we can simply loop
over the files and count the number of lines in each:
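
For example:

```shell
cd uce-coleoptera/bed

# one line per merged, putatively conserved region
for i in *.bam.sort.merge.bed; do
    wc -l $i;
done
```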

At this point, we’ve mapped reads to the base genome, kept those regions where
reads mapped, converted that to a BED, and merged intervals that are very close
to one another.

What we have not done is remove any putatively conserved intervals shared
between exemplar taxa and the base genome that are repetitive regions. To do
this, we’re going to run a python program over all of the BED files for each
exemplar taxon. This program will look at the intervals shared between the
exemplar taxon and the base genome, and it will remove intervals that are
shorter than 80 bp as well as intervals in which >25% of the base genome
sequence is masked.
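
In phyluce, the program that does this is (as best I recall - check the program’s --help for your version) phyluce_probe_strip_masked_loci_from_set; a sketch, assuming the genome layout used earlier:

```shell
cd uce-coleoptera/bed

for i in *.bam.sort.merge.bed; do
    phyluce_probe_strip_masked_loci_from_set \
        --bed $i \
        --twobit ../genomes/triCas1/triCas1.2bit \
        --output ${i%.bed}.strip.bed \
        --filter-mask 0.25 \
        --min-length 80;
done
```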

Up to this point, we’ve been processing each file on a taxon-by-taxon basis,
where each taxon had data aligned to the base genome. Now, we need to determine
which loci are conserved across taxa. To do that, we first need to prepare a
configuration file (named bed-files.conf) that gives the paths to each of our
*.bam.sort.merge.strip.bed files. That file needs to be in configuration file
format, like so:
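
For the five exemplars used here, bed-files.conf would look something like the following (taxon abbreviations are my assumed naming):

```
[beds]
agrpla1:agrpla1-to-triCas1-MAPPING.bam.sort.merge.strip.bed
anogla1:anogla1-to-triCas1-MAPPING.bam.sort.merge.strip.bed
denpon1:denpon1-to-triCas1-MAPPING.bam.sort.merge.strip.bed
lepdec1:lepdec1-to-triCas1-MAPPING.bam.sort.merge.strip.bed
onttau1:onttau1-to-triCas1-MAPPING.bam.sort.merge.strip.bed
```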

The [beds] line is the “header” line, and that is followed by each taxon name
(on the left) and the name of the BED file we want to process (on the right).
You should place this file in the bed directory. If you place it elsewhere,
you’ll need to use full paths on the right hand side.

Your directory structure should now look like (note new bed-files.conf)

Now, we’re going to run the following program, which creates a record of which
alignment intervals are shared among taxa. We need to pass the location of the
bed-files.conf file to this program, along with the name of our base genome, and
a name for the output database that will be created:
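
A sketch of the invocation (the database name is my choice; verify the program name against your phyluce install):

```shell
cd uce-coleoptera/bed

phyluce_probe_get_multi_merge_table \
    --conf bed-files.conf \
    --base-taxon triCas1 \
    --output coleoptera-to-triCas1.sqlite
```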

The program shows the results of inserting data for each exemplar taxon that
we’ve selected. If we take a look at the table contents (see The probe.matches.sqlite database
for more instructions on sqlite databases), we see something like the
following:

The first row of this table (which is limited to 10 rows of results by the query
although it is 60699 rows long) shows that for triCas1 contig GG695505.1,
agrpla1, anogla1, and onttau1 have reads that overlap at position 532 to
632.

Now that we have our table of results, we can run a quick query (using a
Python program) against the table to look at results, more generally. The
following code queries the database and writes out the number of loci shared by
the base taxon (triCas1) and 1, 2, 3, 4, and 5 (all) of the exemplar taxa that
we’ve aligned to the base genome. You need to give the program the path to the
database created above and the name of the base taxon:
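
Something along these lines (database name as assumed in the earlier step):

```shell
cd uce-coleoptera/bed

phyluce_probe_query_multi_merge_table \
    --db coleoptera-to-triCas1.sqlite \
    --base-taxon triCas1
```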

The output from this program basically says that, if we are interested in only
those loci found in all exemplar taxa that align to the base genome, there are
1,822 of those. Similarly, if we’re willing to be a little less strict about
things, there are 6,471 conserved loci that are shared by triCas1 and 4 of the
exemplar taxa.

Question: How conservative should I be?

Basically, the question boils down to “Should I select only the set of loci
shared by all exemplars and the base genome, or should I be more liberal?”.
It’s also a hard question to answer. In most cases, I’m pretty happy
selecting n-1 or n-2 where n is the total number of exemplar taxa. In
the example below, however, we’ve selected n as the “ideal”. This is
largely because we have so little information about coleopteran genomes - so
we want to be pretty darn sure these loci are found in most/all of them.

Now that we have a general sense of the number of conserved loci in each class
of sharing across exemplars (e.g. 5 (all), 4, 3, 2, 1), we need to extract those
loci that fall within one of these classes. In this case (and as noted in the
box, above), we’re going to output only those conserved loci that we’ve
identified as being shared between the base genome and all exemplars. We do
that with a slightly different Python script. This script takes the database
name, the base genome, the count of exemplar taxa shared across, and the name of
an output file as input. The output file will be BED formatted.
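
A sketch, reusing the same query program with a --specific-counts value of 5 (all exemplars) and writing a BED file named with the triCas1+5 prefix used later in the tutorial:

```shell
cd uce-coleoptera/bed

phyluce_probe_query_multi_merge_table \
    --db coleoptera-to-triCas1.sqlite \
    --base-taxon triCas1 \
    --output triCas1+5.bed \
    --specific-counts 5
```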

Now that we’ve identified conserved sequences shared among the base genome and
the exemplar taxa, we need to start designing baits to capture these loci. The
first step in this process is to extract FASTA sequences from the base genome
that correspond to the loci we’ve identified. We do that with a Python script
that takes as input the BED file we created, above, the 2bit-formatted base
genome, a length of sequence we want to extract (160 bp), and an output FASTA
filename.
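
A sketch of that extraction step (program name and --buffer-to flag as I remember them from phyluce; confirm with --help):

```shell
cd uce-coleoptera/bed

phyluce_probe_get_genome_sequences_from_bed \
    --bed triCas1+5.bed \
    --twobit ../genomes/triCas1/triCas1.2bit \
    --buffer-to 160 \
    --output triCas1+5.fasta
```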

Question: Why buffer to 160 bp?

We are extracting FASTA regions of 160 bp because that allows us to place 2
baits right in the center of this region at 3x tiling density which means
that standard 120 bp baits will overlap by 40 bp and have 80 bp to each side
(total length 160 bp).

Now that we’ve extracted the appropriate loci from the base genome, we need to
design bait sequences targeting these loci. For that, we use a different
Python script. This program takes as input the FASTA file we just created, and
some design-specific information (--probe-prefix, --design, --designer). The
design options (--tiling-density, --two-probes, --overlap) ensure that we
select two baits per locus with 3x tiling that overlap the middle of the
targeted locus. Finally, we remove (--masking, --remove-gc) potentially
problematic baits with >25% repeat content and GC content outside of the range
of 30-70% (i.e., GC < 30% or GC > 70%).
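
Putting those options together, a sketch (substitute your own values for the metadata options):

```shell
cd uce-coleoptera/bed

phyluce_probe_get_tiled_probes \
    --input triCas1+5.fasta \
    --probe-prefix "uce-" \
    --design coleoptera-v1 \
    --designer your-name-here \
    --tiling-density 3 \
    --two-probes \
    --overlap middle \
    --masking 0.25 \
    --remove-gc \
    --output triCas1+5.temp.probes
```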

Because we haven’t searched for duplicates among our loci, and because reducing
longer reads to shorter ones (e.g. designing baits from loci) can introduce
duplicate baits, we need to screen the resulting bait set for duplicates. To do
that, we follow a 2-stage process - first we align all probes to themselves,
then we use those alignments to remove potential duplicate baits/loci. First, we
run a lastz search of all baits against themselves. This program takes as input the
temp probes we just designed (as both --target and --query), relatively low
values for --identity and --coverage to make sure we identify as many
duplicates as possible, and the program writes these results to the --output
file:
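
A sketch (the output file name is my choice):

```shell
cd uce-coleoptera/bed

phyluce_probe_easy_lastz \
    --target triCas1+5.temp.probes \
    --query triCas1+5.temp.probes \
    --identity 50 \
    --coverage 50 \
    --output triCas1+5.temp.probes-TO-SELF-PROBES.lastz
```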

Now that we’ve run the alignments, we need to screen the alignments and remove
the duplicate baits from the bait set. This program takes as input the lastz
results from above and the temp-probe file, as well as the probe-prefix that we
used during probe design, above. The results are written to a file that is
equivalent to the probe file name + DUPE-SCREENED, so in this case the output
file is named triCas1+5.temp-DUPE-SCREENED.probes.
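
A sketch of that screening step (program name as I remember it from phyluce; the DUPE-SCREENED output name is derived automatically):

```shell
cd uce-coleoptera/bed

phyluce_probe_remove_duplicate_hits_from_probes_using_lastz \
    --fasta triCas1+5.temp.probes \
    --lastz triCas1+5.temp.probes-TO-SELF-PROBES.lastz \
    --probe-prefix=uce-
```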

Now that we have a duplicate-free (or putatively duplicate free) set of
temporary baits designed from conserved loci in the base genome, we’re going to
use some in-silico alignments to see if we can locate these loci in the several
exemplar genomes.

Attention

For the following analyses, you need genome assemblies for each
of the exemplar taxa, formatted as 2bit files.

We’ll use the results of these alignments to design a bait set that includes
baits designed from the base genome, but also from the exemplar taxa. This
should allow our bait set to work more consistently across broad groups of
organisms.

In terms of directory structure, things should look pretty similar to the
following:

Note that we have all the genomes in the genomes directory, in both FASTA and
2bit formats. We also have a new genome sequence in here - that of menMol1
(Mengenilla moldrzyki [twisted-wing parasites]), which represents the outgroup
to Coleoptera. We’re adding this taxon because it helps us bridge the base of
the tree - i.e., the divergence between the outgroup and the exemplar taxa that
we’re using to design probes.

Question: What exemplar taxa should I use for bait design?

This is a really hard question to answer. In old, divergent groups with few
genomic resources, the answer is usually “all the species” with genomic
data. Basically, you want to include exemplars that make the divergence
among baits targeting the same loci something >20% or so. That said, even
this number is a bit of a guess - no one has systematically tested how
“sticky” baits are when they are used to enrich loci across divergent
groups. We know they are pretty sticky and in certain cases can enrich loci
as much as 35%-40% divergent from the bait sequence. Generally speaking, I
try to include exemplar taxa during probe design that bridge the known
diversity of a given group... again, in many cases this is hard (or
impossible) to know given current data. So, you may have to take a bit of a
guess.

So, assuming that you have the appropriate 2bit files in
uce-coleoptera/genomes, we are going to align the temporary probes that we’ve
designed to the exemplar genomes, and we’re going to run these and subsequent
bait design steps in a new directory, named probe-design. So:

We need to align the temporary probe sequences to each genome, which we can do
using the following code. This takes as input our temporary probe file, the
list of genomes we want to align the probes against, the path to the genomes,
the minimum sequence identity to accept a match (on the low end of the spectrum
for this step), the number of compute cores to use, the name of an output
database to create, and the output directory in which to store the lastz results.
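
A sketch, assuming the genomes/<name>/<name>.2bit layout and my assumed file locations (the coleoptera-genome-lastz output directory matches the one referred to later in the tutorial):

```shell
cd uce-coleoptera
mkdir -p probe-design && cd probe-design

phyluce_probe_run_multiple_lastzs_sqlite \
    --probefile ../bed/triCas1+5.temp-DUPE-SCREENED.probes \
    --scaffoldlist agrpla1 anogla1 denpon1 lepdec1 menMol1 onttau1 triCas1 \
    --genome-base-path ../genomes \
    --identity 50 \
    --cores 16 \
    --db triCas1+5.sqlite \
    --output ../coleoptera-genome-lastz
```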

Warning

Note that I am using 16 physical CPU cores (--cores) to do this
work. You need to use the number of physical cores available on your
machine.

Based on the alignments of the temporary probe set to the exemplar genomes, we
need to extract FASTA data from each of the exemplar sequences so that we can
design baits targeting the conserved loci in each. This is pretty similar to
what we did earlier for the temporary probe set, except that now we’re running
the extraction across all the exemplar taxa.

Before we begin, we need to make a configuration file with all the genome
locations in it (again, as before):
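
Saved as, say, coleoptera-genome.conf (the file name, [scaffolds] header, and /home/you paths are my assumptions - point each entry at your actual 2bit files):

```
[scaffolds]
agrpla1:/home/you/uce-coleoptera/genomes/agrpla1/agrpla1.2bit
anogla1:/home/you/uce-coleoptera/genomes/anogla1/anogla1.2bit
denpon1:/home/you/uce-coleoptera/genomes/denpon1/denpon1.2bit
lepdec1:/home/you/uce-coleoptera/genomes/lepdec1/lepdec1.2bit
menMol1:/home/you/uce-coleoptera/genomes/menMol1/menMol1.2bit
onttau1:/home/you/uce-coleoptera/genomes/onttau1/onttau1.2bit
triCas1:/home/you/uce-coleoptera/genomes/triCas1/triCas1.2bit
```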

Using the configuration file, we need to extract the FASTA sequence that we
need from each exemplar taxon. Here, we’re buffering each locus to 180 bp to
give us a little more room to work with during the probe design step. The
program takes our config file as input, along with the folder of lastz results
created above. The --name-pattern argument allows us to match files in the
--lastz directory, --probes is how we buffer the sequence, and we pass the
name of an output directory to --output:
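
A sketch (the config file name and the --name-pattern string are my assumptions; check the actual lastz file names in your output directory):

```shell
cd uce-coleoptera

phyluce_probe_slice_sequence_from_genomes \
    --conf coleoptera-genome.conf \
    --lastz coleoptera-genome-lastz \
    --probes 180 \
    --name-pattern "triCas1+5.temp-DUPE-SCREENED.probes_v_{}.lastz.clean" \
    --output coleoptera-genome-fasta
```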

The FASTA files we just created are in uce-coleoptera/coleoptera-genome-fasta.
The output from the program that you see basically shows you how many UCE loci
we extracted from each of the exemplar genomes. As expected, the lowest number
we located and extracted are from the menMol1 (outgroup) genome.

As before, we want to determine which loci we are detecting consistently across
all of the exemplar taxa when doing these in-silico searches. To do that, we’ll
run another bit of python code. Here, we’re working in the
uce-coleoptera/coleoptera-genome-lastz directory. This program will create a
relational database that houses detections of loci in the exemplar taxa. It
takes, as input, the folder of FASTAs we just created, the base genome taxon,
and a name to use as the output database:
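
A sketch (the multifastas.sqlite database name is my choice):

```shell
cd uce-coleoptera

phyluce_probe_get_multi_fasta_table \
    --fastas coleoptera-genome-fasta \
    --output multifastas.sqlite \
    --base-taxon triCas1
```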

Which shows our detection of conserved loci in each of the exemplar taxa when we
search for them using the temporary probes that we designed from the base
genome. As before, we can get some idea of the distribution of hits among
exemplar taxa (e.g., are loci detected in “all”, n-1 taxa, n-2 taxa, etc.).

Again, we’ve got to make a decision here about how conservative we want to be
regarding baits that hit all/some taxa. We only get 386 loci that we detect
in all exemplars (including the base genome and the menMol1 outgroup). That
seems too strict (particularly because this total includes menMol1, which is
really divergent from our taxa of interest). We also have to keep in mind that
we can randomly fail to detect loci that are actually present, either by chance
or due to sequence divergences that are >50% (the value we used in our search).
In the end, I settled on loci we detected in ≥ 4 exemplar taxa. So we need to
extract those from the database and store them in triCas1+5-back-to-4.conf:
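
Something like the following, assuming the detection database is named multifastas.sqlite (my assumed name from the step above):

```shell
cd uce-coleoptera

phyluce_probe_query_multi_fasta_table \
    --db multifastas.sqlite \
    --base-taxon triCas1 \
    --output triCas1+5-back-to-4.conf \
    --specific-counts 4
```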

Now that we’ve settled on the set of loci we’ll try to enrich, we want to
design baits to target them. In contrast to the steps we took before to design
the temporary bait set, we’re using all of the exemplar genomes and the base
genome to design probes. This way, we’ll have a heterogeneous bait mix that
contains probes designed from each exemplar but targeting the same locus, which
should make the probe set we’re designing more “universal”.

To do this, we use a program similar to what we used before, except that this
program has been modified to design probes across many exemplar genomes
(instead of just one). As input, we give the program the name of the directory
holding all of our fastas and the name of the config file we created in the step
above. Then, as before, we need to add some metadata that will be incorporated
to the bait set design file, and we tell the program to tile at 3x density, use
a “middle” overlap, remove baits with >25% masking, and to design two probes
targeting each locus. Finally, we write this probe set to a file named
coleoptera-v1-master-probe-list.fasta.

Note that the number of baits that we’ve designed to target 1209 conserved loci
is quite high - this is because we’re including roughly 2 baits for 1209 loci
across 7 exemplar taxa (16926 is the theoretical maximum).

As before, we need to check our bait set for duplicate loci. This time, the
search is going to take longer, because of the larger number of baits. We’ll
align all the probes to themselves, then read in the alignments, and filter the
probe list to remove putative duplicates.

What we’ve created, above, is the master bait list that contains baits targeting
the conserved locus we identified. Because we’ve designed probes from multiple
exemplar taxa, the number of overall baits is high (as high as 14 baits
targeting each conserved locus). This bait set is ready for synthesis and
subsequent enrichment of these conserved loci shared among Coleoptera.

Sometimes we might not want to synthesize all of the baits for all of
the loci. For instance, we might be enriching loci from species that are nested
within the clade defined by (('Anoplophora glabripennis (Asian longhorned
beetle)':'Leptinotarsa decemlineata (Colorado potato beetle)')'Dendroctonus
ponderosae (mountain pine beetle)'), and because we’re only working with these
species, we might want to drop the baits targeting UCE loci in Agrilus
planipennis (emerald ash borer), Tribolium castaneum (red flour beetle),
Onthophagus taurus (taurus scarab), and Mengenilla moldrzyki (Strepsiptera).
This is actually pretty easy to do - we just need to subset the baits to
include those taxa that we do want. Given the example, above, we can run:

Now that we’ve designed our baits, it’s always good to run a sanity check on the
data - if we use the baits to collect data from a selection of available genomes
(or other genetic data), can we reconstruct a phylogeny that is sane, given what
we know about the specific taxa?

How we do that is outlined below. Many of the steps we’ve run before, so I’m
not going to explain these quite as much as I have previously. Several other of
the steps that we’re going to run are also outlined in Tutorial I: UCE Phylogenomics.

First, we need to make a directory to hold our in-silico test results:

Here (and if you have them), you may want to include all the genomic data you
have access to - particularly if you removed some taxa because they were closely
related to other taxa and designing probes from these closely related groups was
redundant. To run these alignments (you’ve seen this before):

Warning

Note that I am using 16 physical CPU cores (--cores) to do this
work. You need to use the number of physical cores available on your
machine.

Now, we need to extract fasta data for each of these loci. This is effectively
the same as what we’ve done before, but notice the use of --flank in place of
--probes. This tells the program that we want to extract larger chunks of
sequence, in this case 400 bp to each side of a given locus (if possible):
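
A sketch (the in-silico-lastz and in-silico-fasta directory names, the config file name, and the --name-pattern string are my assumptions - match them to the alignment step you just ran):

```shell
cd uce-coleoptera

phyluce_probe_slice_sequence_from_genomes \
    --conf coleoptera-genome.conf \
    --lastz in-silico-lastz \
    --output in-silico-fasta \
    --flank 400 \
    --name-pattern "coleoptera-v1-master-probe-list.fasta_v_{}.lastz.clean"
```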

In the step above, we essentially extracted FASTA data for each taxon, and wrote
those out into individual FASTA files. These are the equivalent of the assembled
contigs that we use in the standard phyluce pipeline, so now, we’re going to
use that workflow. Note that the filtering in
phyluce_assembly_match_contigs_to_probes is more strict than what we used
above to identify contigs.
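
A sketch of that matching step (input/output directory names are my assumptions):

```shell
cd uce-coleoptera

phyluce_assembly_match_contigs_to_probes \
    --contigs in-silico-fasta \
    --probes coleoptera-v1-master-probe-list.fasta \
    --output in-silico-lastz-probe-matches
```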

Now, we need to get the count of matches that we recovered to UCE loci in the
probe set, and extract all of the “good” loci to a monolithic FASTA (see
Tutorial I: UCE Phylogenomics if this is not making sense):
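
The two standard phyluce programs for this are sketched below; the taxon-set config file (insilico.conf, with an [all] group listing the taxa) and the directory names are my assumptions:

```shell
cd uce-coleoptera

# count matches per taxon against an [all] taxon group
phyluce_assembly_get_match_counts \
    --locus-db in-silico-lastz-probe-matches/probe.matches.sqlite \
    --taxon-list-config insilico.conf \
    --taxon-group all \
    --incomplete-matrix \
    --output insilico-incomplete.conf

# extract the matching loci to a monolithic FASTA
phyluce_assembly_get_fastas_from_match_counts \
    --contigs in-silico-fasta \
    --locus-db in-silico-lastz-probe-matches/probe.matches.sqlite \
    --match-count-output insilico-incomplete.conf \
    --incomplete-matrix insilico-incomplete.incomplete \
    --output insilico-incomplete.fasta
```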