Once your reads are clean, you’re ready to assemble. At the moment, you can use
velvet, ABySS, and Trinity for assembly. However, be aware that there is not
a conda install package for Trinity due to some difficulties in how that
package is structured.

Most of the assembly process is automated using code within phyluce,
specifically the following 3 scripts:

phyluce_assembly_assemblo_abyss

phyluce_assembly_assemblo_trinity

phyluce_assembly_assemblo_velvet

The code of each of the above programs always expects your input directories
to have the following structure (from the Quality Control section):

The assembly name on the left side of the colon can be whatever you want.
The path name on the right hand side of the colon must be a valid path to a
directory containing read data in a format similar to that described above.

Attention

Assembly names MUST be unique.
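Both requirements, unique assembly names and valid paths, can be checked before starting a long assembly run. The following Python sketch is not part of phyluce. It assumes the configuration file is INI-style, with name:path entries under a section header, and the helper name check_assembly_config is hypothetical.

```python
import configparser
import os


def check_assembly_config(config_path):
    """Return a list of (name, path) pairs, raising ValueError on problems.

    Hypothetical helper, not part of phyluce. Assumes an INI-style file
    whose entries look like `name:path` under a section header.
    """
    # strict=True makes configparser reject a name repeated within a section.
    parser = configparser.ConfigParser(strict=True, interpolation=None)
    # Keep names exactly as written instead of lowercasing them.
    parser.optionxform = str
    try:
        with open(config_path) as handle:
            parser.read_file(handle)
    except configparser.DuplicateOptionError as err:
        raise ValueError(f"assembly names must be unique: {err}") from err

    samples = []
    for section in parser.sections():
        for name, path in parser.items(section):
            # The right-hand side must point at an existing directory.
            if not os.path.isdir(path):
                raise ValueError(f"{name}: {path!r} is not a directory")
            samples.append((name, path))
    return samples
```

Calling check_assembly_config on your own configuration file returns the parsed name and path pairs, or raises ValueError describing the first problem it finds.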

Question: How do I name my samples/assemblies?

Naming samples is a contentious issue and is also a hard thing to deal with
using computer code. You should never have a problem if you name your
samples as follows, where the genus and specific epithet are separated by
an underscore, and multiple individuals of a given species are indicated
using a trailing integer value:

anas_platyrhynchos1

anas_carolinensis1

dendrocygna_bicolor1
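Names of this form can be checked mechanically. The sketch below is one possible way to do so in Python. The regular expression is an assumption based on the description above (lowercase genus, underscore, lowercase specific epithet, trailing integer), and is_safe_sample_name is a hypothetical helper, not a phyluce function.

```python
import re

# Assumed pattern: lowercase genus, underscore, lowercase specific epithet,
# then a trailing integer identifying the individual.
NAME_PATTERN = re.compile(r"[a-z]+_[a-z]+[0-9]+")


def is_safe_sample_name(name):
    """Return True if `name` looks like genus_epithet followed by an integer."""
    return NAME_PATTERN.fullmatch(name) is not None


names = ["anas_platyrhynchos1", "anas_carolinensis1", "dendrocygna_bicolor1"]
print(all(is_safe_sample_name(n) for n in names))   # True
print(is_safe_sample_name("Anas platyrhynchos/1"))  # False
```

The pattern is deliberately strict. Loosen it if your own scheme differs, for instance to allow the accession-number suffixes described next.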

You should also not have problems if you use a naming scheme that suffixes
the species binomial(s) with an accession number that is simply
formatted (e.g. no slashes, dashes, etc.):

Once your configuration file is created (best to use a decent text editor that
will not cause you grief), you are ready to start assembling your read data into
contigs that we will search for UCEs. The code to do this for the three helper
scripts is below (remember, we are using $HOME/anaconda/bin generically to
refer to your anaconda or miniconda install).

Following assembly, phyluce_assembly_assemblo_abyss modifies
the assemblies by replacing degenerate base codes with standard nucleotide
encodings. We do this because lastz, which we use to match contigs to
targeted UCE loci, is not compatible with degenerate IUPAC codes.

The phyluce_assembly_assemblo_abyss code makes these substitutions for every site having a
degenerate code by selecting the appropriate nucleotide encoding randomly.
The code also renames the ABySS assemblies using the velvet naming
convention. The modified contigs are then symlinked into
$ASSEMBLY/contigs. Unmodified contigs are available in
$ASSEMBLY/genus-species/out_k*-contigs.fa
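To illustrate the substitution step, here is a minimal Python sketch of randomly resolving degenerate codes. It is not the phyluce implementation. The function name resolve_degenerate is hypothetical, and leaving N unchanged is an assumption, since the text does not say how N is handled.

```python
import random

# Standard IUPAC degenerate nucleotide codes and the bases each one represents.
# N is deliberately left out (assumption: it is not resolved here).
IUPAC = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def resolve_degenerate(seq, rng=random):
    """Replace each degenerate code in `seq` with a random compatible base."""
    out = []
    for base in seq:
        choices = IUPAC.get(base.upper())
        if choices is None:
            # Standard bases (and anything unrecognized) pass through unchanged.
            out.append(base)
        else:
            pick = rng.choice(choices)
            # Preserve the case of the original character.
            out.append(pick if base.isupper() else pick.lower())
    return "".join(out)


# A seeded generator makes the random choices repeatable.
print(resolve_degenerate("ACGTRYACGT", random.Random(1)))
```

Because the choice is random, two runs over the same assembly can give different resolved sequences unless the generator is seeded.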

Generally, I would suggest that you use Trinity. In my hands, it produces
reasonable contig assemblies that are longer than the assemblies built by
either velvet or ABySS. There are some caveats, however. If you want the
most accurate assemblies possible, then it may be best to use ABySS.
This is because ABySS runs read-based error correction prior to assembly,
which results in more accurate contigs.

Question: For ABySS and velvet, what --kmer value do I use?

Also a hard question. Part of the reason it is hard is that we are trying
to assemble data of heterogeneous read depth (i.e., our reads are spread
across (mostly) UCE loci, but the depth of coverage of each locus is
variable due to capture efficiency). Longer kmer values can give you longer
(but fewer) contigs, while shorter kmer values produce shorter, more
abundant contigs. In most cases, your assemblies will be decent with a
kmer value around 55-65.