Build a clade collection

In this tutorial, we will use MiGA alone to create a clade project that includes all the genomes available for a species in RefSeq, as well as any additional genomes you may have. If you want to explore a more manual approach using bash, see the Build a clade collection using BASH example. We will use Escherichia coli as the target species, but you can use any species (or any other taxon) you want.

0. Initialize the project

miga new -P E_coli -t clade

cd E_coli
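If you run this step from a script, you may want to avoid creating the project a second time. The following is a minimal sketch, assuming that an existing project directory means the project was already created. The helper name init_clade_project is hypothetical and not part of MiGA; the only MiGA call it makes is the miga new command shown above.

```bash
# Hypothetical helper (not part of MiGA): create the clade project only
# if its directory does not exist yet.
init_clade_project() {
  local project="$1"
  if [ -d "$project" ]; then
    # Assumption: an existing directory means the project was already created.
    echo "Project directory '$project' already exists, skipping"
    return 0
  fi
  # Same command as in the step above, with the project path as a parameter.
  miga new -P "$project" -t clade
}
```

A typical call would be init_clade_project E_coli && cd E_coli, which only changes directory if the project exists or was created successfully.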

1. Download publicly available genomes

The NCBI Genome database defines different stages of completeness, and you may want to include only some of them depending on your analysis. The stages (from highest to lowest quality) are:

Complete: Genomes including all replicons in the organism(s) sequenced.

Or you can set it globally as an environment variable before running miga:

export NCBI_API_KEY=ABCD123
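Because miga runs as a child process, the variable has to be exported, not just assigned in the current shell. The sketch below is one way to check this before starting a long download. The function name has_ncbi_api_key is hypothetical, and the check only confirms that a non-empty value is present in the environment, not that NCBI accepts the key.

```bash
# Hypothetical helper (not part of MiGA): succeed only if NCBI_API_KEY
# is exported with a non-empty value. printenv reads the environment
# passed to child processes, so unexported shell variables are not seen.
has_ncbi_api_key() {
  [ -n "$(printenv NCBI_API_KEY)" ]
}

if has_ncbi_api_key; then
  echo "NCBI_API_KEY is exported"
else
  echo "NCBI_API_KEY is not exported" >&2
fi
```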

2. Add your own genomes

If you have any unreleased genomes, you can simply add them to the same project so that they are processed together with the publicly available ones. You can initialize datasets at different points (see input data). For the purposes of this tutorial, we'll assume that you have raw coupled reads from two sequencing lanes (1 and 2) in gzipped FastQ files: