This is the advanced tutorial for the command line interface to ipyrad.
In this tutorial we will introduce two new methods that were not
used in the introductory tutorial, but which provide some exciting new
functionality. The first is branching, which is used to
efficiently assemble multiple data sets under a range of parameter settings,
and the second is reference mapping, which is a way to leverage information
from reference genomic data (e.g., full genome, transcriptome,
plastome, etc.) during assembly.

If you’ve already been through the introductory tutorial you’ll remember that
a typical ipyrad analysis runs through seven sequential steps to take data
from its raw state to finished output files of aligned data.
After finishing one assembly, it is common that we might want to create a
second assembly of our data under a different set of parameters;
say by changing the clust_threshold from 0.85 to 0.90, or changing
min_samples_locus from 4 to 20.

It would be wholly inefficient to restart from the beginning for each assembly
that uses different parameter settings. A better way is to re-use existing
data files and rerun only the steps downstream from where the parameter changes
take effect. Doing this by hand is a little tricky, since the user would need
to know which files to rename or move to keep existing result files and parameter
information from being overwritten and lost.

The motivation behind the branching assembly process in ipyrad is to simplify
this process. ipyrad does all of this renaming business for you, and creates
new named files in a way that retains records of the existing assemblies and
effectively re-uses existing data files.

At its core, branching creates a copy of an Assembly object (the object that is
saved as a .json file by ipyrad) such that the new Assembly inherits all of
the information from its parent Assembly, including filenames, samplenames,
and assembly statistics. The branching process requires a
new assembly_name, which is important so that all new files
created along this branch will be saved with a unique filename prefix.
We’ll show an example of a branching process below, but first we need to
describe reference mapping, since for our example we will be creating two
branches which are assembled using different
assembly methods.

ipyrad offers four assembly methods, three of which
can utilize a reference sequence file. The first method, called reference,
maps RAD sequences to a reference file to determine homology and discards all
sequences which do not match to it. The second method, denovo+reference,
uses the reference first to identify homology, but then the remaining unmatched
sequences are all dumped into the standard denovo ipyrad pipeline to be clustered.
In essence, the reference file is simply used to assist the denovo assembly, and
to add additional information. The final method, denovo-reference,
removes any reads which match to the reference and retains only non-matching
sequences to be used in a denovo analysis. In other words, it allows the use
of a reference sequence file as a filter to remove reads which match to it. You
can imagine how this would be useful for removing contaminants, plastome data,
symbiont-host data, or coding/non-coding regions.
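To make the keep/discard logic of the four methods concrete, here is a toy sketch. It does not reflect how ipyrad maps or clusters reads internally: the read names and the mapped/unmapped flags are made up, and the function kept_reads is hypothetical. It only shows which reads each assembly_method would pass on to assembly.

```shell
# Toy read table: read name, then whether it maps to the reference (made-up data).
reads='read1 mapped
read2 unmapped
read3 mapped
read4 unmapped'

# kept_reads METHOD: print the reads that METHOD would retain.
kept_reads() {
    printf '%s\n' "$reads" | awk -v method="$1" '
        # denovo ignores the reference, so every read goes to clustering
        method == "denovo"                               { print $1 }
        # reference keeps only the reads that map
        method == "reference" && $2 == "mapped"          { print $1 }
        # denovo+reference keeps everything: mapped reads use the
        # reference, the rest go to denovo clustering
        method == "denovo+reference"                     { print $1 }
        # denovo-reference keeps only the reads that do not map
        method == "denovo-reference" && $2 == "unmapped" { print $1 }
    '
}

for m in denovo reference denovo+reference denovo-reference; do
    echo "$m: $(kept_reads "$m" | paste -sd' ' -)"
done
```

Note that reference and denovo-reference split the reads into two complementary sets, while denovo+reference retains all of them.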

Let’s first download the example simulated data sets for ipyrad. Copy and paste
the code below into a terminal. This will create a new directory called
ipsimdata/ in your current directory containing all of the necessary files.

## The curl command needs a capital O, not a zero.
>>> curl -LkO https://github.com/dereneaton/ipyrad/raw/master/tests/ipsimdata.tar.gz
>>> tar -xvzf ipsimdata.tar.gz

If you look in the ipsimdata/ directory you’ll see there are a number of example
data sets. For this tutorial we’ll be using the rad_example data. Let’s
start by creating a new Assembly, and then we’ll edit the params file to
tell it how to find the input data files for this data set.

## creates a new Assembly named data1
>>> ipyrad -n data1

New file params-data1.txt created in /home/deren/Documents/ipyrad

As you can see, this created a new params file for our Assembly. We need to
edit this file since it contains only default values. Use any text editor to
open the params file params-data1.txt and enter the values
below for parameters 1, 2, and 3. All other parameters can be left at their
default values for now. This tells ipyrad that we are going to use the name
iptutorial as our project_dir (where output files will be created), and
that the input data and barcodes file are located in ipsimdata/.
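If you would rather script this edit than open a text editor, here is a minimal sketch. It assumes that each line of the params file holds a value followed by a comment of the form `## [N] [name]: ...`, and that the rad_example inputs are named rad_example_R1_.fastq.gz and rad_example_barcodes.txt; check both against your own files. The helper set_param is a hypothetical convenience, not part of ipyrad, and the demo runs on a stand-in file in a scratch directory.

```shell
# Work in a scratch directory so the demo does not touch a real project.
workdir=$(mktemp -d)
params="$workdir/params-data1.txt"

# Stand-in for the first lines of a params file (layout is an assumption).
cat > "$params" <<'EOF'
------- ipyrad params file ----------------------------------------------------
data1                          ## [0] [assembly_name]: Assembly name
./                             ## [1] [project_dir]: Project dir
                               ## [2] [raw_fastq_path]: Location of raw fastq files
                               ## [3] [barcodes_path]: Location of barcodes file
EOF

# set_param FILE NAME VALUE: replace the value on the line tagged [NAME],
# keeping the trailing "## ..." comment untouched.
set_param() {
    awk -v name="$2" -v value="$3" '
        index($0, "[" name "]:") {
            printf "%-30s %s\n", value, substr($0, index($0, "##"))
            next
        }
        { print }
    ' "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# Input file names below are assumed; substitute your own paths.
set_param "$params" project_dir    "./iptutorial"
set_param "$params" raw_fastq_path "./ipsimdata/rad_example_R1_.fastq.gz"
set_param "$params" barcodes_path  "./ipsimdata/rad_example_barcodes.txt"

cat "$params"
```

Matching on the parameter name rather than the line number keeps the helper working even if the numbering differs in your version of the params file.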

Inside iptutorial you’ll see that ipyrad has created two subdirectories
with names prefixed by the assembly_name data1. The other saved file is a
.json file, which you can look at with a text editor if you wish.
It’s used by ipyrad to store information about your Assembly.
In general, you should not mess with the .json file,
since editing it by hand could cause errors in your assembly.

For this example we will branch our Assembly before running step3 so that we can
see the results when the data are assembled with different assembly_methods. Our
existing assembly data1 is using the denovo method. Let’s create a branch
called data2 which will use reference assembly. First we need to run the
branch command, then we’ll edit the new params file to change the assembly_method
and add the reference sequence file.

## create a new branch of the Assembly 'data1'
>>> ipyrad -p params-data1.txt -b data2
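The two edits to the branched params file could also be made from the command line. This is a sketch under assumptions: the line layout (`value ## [N] [name]: ...`) is assumed, the reference path ./ipsimdata/rad_example_genome.fa is a placeholder for whichever reference file you actually use, and the demo works on a stand-in file rather than a real params-data2.txt.

```shell
workdir=$(mktemp -d)
params="$workdir/params-data2.txt"

# Stand-in for the relevant lines of the branched params file (layout assumed).
cat > "$params" <<'EOF'
data2                          ## [0] [assembly_name]: Assembly name
denovo                         ## [5] [assembly_method]: Assembly method
                               ## [6] [reference_sequence]: Location of reference sequence file
EOF

# Placeholder path to the reference; substitute the file you actually use.
ref="./ipsimdata/rad_example_genome.fa"

# On each tagged line, replace everything before the "##" with the new value.
sed -e "s|^[^#]*\(## \[[0-9]*\] \[assembly_method\]\)|reference                      \1|" \
    -e "s|^[^#]*\(## \[[0-9]*\] \[reference_sequence\]\)|$ref   \1|" \
    "$params" > "$params.tmp" && mv "$params.tmp" "$params"

cat "$params"
```

Only the two tagged lines change; the assembly_name line, and therefore the branch's file prefix, is left alone.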

Now let’s suppose we’re interested in the effect of missing data on our assemblies
and we want to assemble each data set with a different min_samples_locus
setting. Maybe at 4, 8, and 12 (ignore the fact that the example data set
has no missing data, and so this has no practical effect; see the empirical
example tutorial for a better example). It’s worth
noting that we can branch assemblies after an analysis has finished as well.
The only difference is that the new assembly will think that it has already
finished all of the steps, and so if we ask it to run them again it will instead
want to skip over them. You can override this behavior by passing the -f flag,
or --force, which tells ipyrad that you want it to run the step even though
it’s already finished it. The two assemblies we finished were both assembled at
the default value of 4 for min_samples_locus, so below I set up code to
branch and then run step7 on each of these assemblies with a new setting of 8 or 12.

## branch data1 to make min8 and min12 data sets
>>> ipyrad -p params-data1.txt -b data1-min8
>>> ipyrad -p params-data1.txt -b data1-min12

Now use a text editor to set “min_samples_locus” to the new value (8 or 12)
in the params file of each of these assemblies.

## branch data2 to make min8 and min12 data sets
>>> ipyrad -p params-data2.txt -b data2-min8
>>> ipyrad -p params-data2.txt -b data2-min12

Once again, use a text editor to set “min_samples_locus” to the new value (8 or 12)
for these assemblies. Then, we will run step7 to get the final data sets.
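With four branches to edit, the text-editor step can get tedious, so here is one way it might be scripted. The sketch assumes the branch names end in -min8 or -min12 as above, and that the params line has the form `value ## [N] [min_samples_locus]: ...`. It runs on stand-in files in a scratch directory, not on your real params files.

```shell
workdir=$(mktemp -d)

# Stand-in params files for the four branches (a single assumed line each).
for name in data1-min8 data1-min12 data2-min8 data2-min12; do
    printf '%-30s ## [21] [min_samples_locus]: Min # samples per locus for output\n' 4 \
        > "$workdir/params-$name.txt"
done

# The branch name encodes the target value: everything after "-min".
for f in "$workdir"/params-*-min*.txt; do
    value=${f##*-min}     # e.g. "8.txt"
    value=${value%.txt}   # e.g. "8"
    sed "s|^[^#]*\(## \[[0-9]*\] \[min_samples_locus\]\)|$value                              \1|" \
        "$f" > "$f.tmp" && mv "$f.tmp" "$f"
    echo "$(basename "$f"): min_samples_locus set to $value"
done
```

Once the values are set, step7 is run on each branch as usual.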

## run step7 using the new min_samples_locus settings
>>> ipyrad -p params-data1-min8.txt -s 7
>>> ipyrad -p params-data1-min12.txt -s 7
>>> ipyrad -p params-data2-min8.txt -s 7
>>> ipyrad -p params-data2-min12.txt -s 7

Now if we look in our project_dir iptutorial/ we see that the fastq/
and edits/ directories were created using just the first assembly data1,
while the clust/ and consens/ directories were created for both data1 and
data2, since both completed steps 3-6. Finally, you can see that each
assembly has its own outfiles/ directory with the results of step7.

## use ls -l to view inside the project directory as a list
>>> ls -l iptutorial/

In your working directory you will have one params file for each assembly
(six in this example), each recording the full set of parameters used.
This makes for a good reproducible workflow, and can be referenced later
as a reminder of the parameters used for each data set.

It’s also possible to create a branch with only a subset of samples from
the original assembly. You can do this by specifying a list of samples
to include following the new assembly name after the -b flag. For
example the command below will create a new branch called subdata
including only the 4 samples listed.

## Branch subset of Samples to a new Assembly by passing in
## sample names to include.
>>> ipyrad -p params-data1.txt -b subdata 1A_0 1B_0 1C_0 1D_0

If you want to select more than a handful of samples, it may be easier
to provide a text file instead. The file simply lists the sample names,
one per line.
Here is an example sample names file samples_to_keep.txt
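One way to create such a file from the terminal is shown below. It assumes the four sample names from the earlier subset example (1A_0, 1B_0, 1C_0, and 1D_0); swap in the names from your own data set.

```shell
# Write one sample name per line to samples_to_keep.txt
printf '%s\n' 1A_0 1B_0 1C_0 1D_0 > samples_to_keep.txt

# Show the result
cat samples_to_keep.txt
```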

And the command to do the branching:

## Branch subset of Samples by passing in a file with sample names
>>> ipyrad -p params-data1.txt -b subdata samples_to_keep.txt