MiGA workflow

This is the general overview of the MiGA workflow:

[Diagram: The MiGA Workflow]

For each step, the analyses performed may rely on external software and produce one or more result files (indexed in a hash). Most steps also use utilities from the Enveomics Collection in addition to the software detailed below. Each step is described below, including its file keys and their descriptions. Some files are mandatory to continue with the analysis (marked with req), some can be gzipped during or after the analysis (marked with gz), and some are directories (marked with dir).
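To illustrate how the req and gz markers interact, the following is a minimal sketch of a completeness check for one result. The function name missing_required, the mapping of file keys to relative paths, and the rule that a ".gz" sibling counts as present are assumptions made for this example; they do not describe MiGA's internal representation.

```python
import os


def missing_required(files, required, base_dir):
    """Return the required file keys that cannot be found in base_dir.

    files:    hypothetical mapping of file key -> relative path.
    required: set of keys marked `req`.
    A file counts as present if it exists as-is or with a `.gz` suffix
    (an assumption mirroring the `gz` marker).
    """
    missing = []
    for key in sorted(required):
        rel = files.get(key)
        if rel is None:
            missing.append(key)
            continue
        path = os.path.join(base_dir, rel)
        if not (os.path.exists(path) or os.path.exists(path + ".gz")):
            missing.append(key)
    return missing
```

An empty return value means every required key resolves to a file, so the next step could proceed.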

Dataset Results

Raw Reads

MiGA never actually performs this step; instead, it serves as the entry point for raw-read input.

Supported file keys:

For single reads only

single (req, gz): FastQ file containing the raw reads.

For paired-end reads only

pair1 (req, gz): FastQ file containing the raw forward reads.

pair2 (req, gz): FastQ file containing the raw reverse reads.

Statistics:

For single reads only

reads: Total number of reads.

length_average: Average read length (in bp).

length_standard_deviation: Standard deviation of read length (in bp).

g_c_content: G+C content of all reads (in %).

For paired-end reads only

read_pairs: Total number of read pairs.

forward_length_average: Average forward read length (in bp).

forward_length_standard_deviation: Standard deviation of forward read length (in bp).

forward_g_c_content: G+C content of forward reads (in %).

reverse_length_average, reverse_length_standard_deviation, reverse_g_c_content: Same as above, for reverse reads.

MiGA symbol: raw_reads.
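The single-read statistics listed above can be sketched as follows. This is an illustration only: it assumes four-line FastQ records, uses the population standard deviation, and counts G+C over all bases; the source does not state which conventions MiGA applies. The function names are hypothetical.

```python
import gzip
import math


def fastq_sequences(path):
    """Yield the sequence line of every 4-line FastQ record (plain or gzipped)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:
                yield line.strip()


def read_stats(sequences):
    """Compute the four single-read statistics from an iterable of sequences."""
    lengths = []
    gc_bases = 0
    for seq in sequences:
        lengths.append(len(seq))
        gc_bases += sum(1 for base in seq.upper() if base in "GC")
    reads = len(lengths)
    total = sum(lengths)
    if reads == 0 or total == 0:
        raise ValueError("no bases to summarize")
    mean = total / reads
    # Population variance (divide by n); an assumption, see the note above.
    variance = sum((n - mean) ** 2 for n in lengths) / reads
    return {
        "reads": reads,
        "length_average": mean,
        "length_standard_deviation": math.sqrt(variance),
        "g_c_content": 100.0 * gc_bases / total,
    }
```

For paired-end input, the same function could be applied separately to the forward and reverse files to obtain the forward_ and reverse_ values.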

Trimmed Reads

This is part of Trimming & read quality in the above diagram. In this step, MiGA trims reads to a minimum Phred quality score of 20 (Q20) and a minimum length of 50 bp using SolexaQA++, and clips potential adapter contamination using Scythe (reapplying the length filter afterwards). If the reads are paired, only pairs in which both reads pass the filters are kept as pairs.

Supported file keys:

For single reads only

single (req, gz): FastQ file containing trimmed/clipped reads.

For paired-end reads only

pair1 (req, gz): FastQ file containing trimmed/clipped forward reads.

pair2 (req, gz): FastQ file containing trimmed/clipped reverse reads.

single (req, gz): FastQ file containing trimmed/clipped reads with only one sister passing quality control.

For either type

trimming_summary: Raw text file containing a summary of the trimmed sequences.

MiGA symbol: trimmed_reads.
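The quality and length filters described in this step can be sketched as below. This is a simplified illustration, not the SolexaQA++ implementation: it assumes Phred+33 encoding, keeps the longest contiguous run of bases with quality of at least 20, and omits adapter clipping entirely. All function names are hypothetical.

```python
def longest_good_segment(quals, min_q=20):
    """Return (start, end) of the longest run with quality >= min_q (end exclusive)."""
    best = (0, 0)
    start = None
    # The sentinel closes a run that reaches the end of the read.
    for i, q in enumerate(list(quals) + [float("-inf")]):
        if q >= min_q:
            if start is None:
                start = i
        elif start is not None:
            if i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best


def trim_read(seq, qual_string, min_q=20, min_len=50, offset=33):
    """Trim one read; return (sequence, qualities) or None if it is too short."""
    quals = [ord(c) - offset for c in qual_string]
    start, end = longest_good_segment(quals, min_q)
    if end - start < min_len:
        return None
    return seq[start:end], qual_string[start:end]


def route_pair(forward, reverse):
    """Decide where a pair goes, given the trim_read result of each sister."""
    if forward is not None and reverse is not None:
        return "paired"      # written to pair1 / pair2
    if forward is not None or reverse is not None:
        return "single"      # only one sister passed quality control
    return "discarded"
```

The route_pair helper mirrors the file keys above: complete pairs go to pair1 and pair2, while a surviving read whose sister failed goes to single.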

Read Quality

This is a quality-control step included as part of Trimming & read quality in the diagram above. In this step, MiGA generates quality reports of the trimmed/clipped reads using SolexaQA++ and FastQC.

Supported file keys:

solexaqa (dir): Folder containing the SolexaQA++ quality-control summaries.

fastqc (dir): Folder containing the FastQC quality-control analyses.

MiGA symbol: read_quality.

Trimmed FastA

This is the final step included in Trimming & read quality in the diagram above, in which MiGA generates FastA files with the trimmed/clipped reads.
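Converting FastQ records to FastA amounts to keeping the header and sequence lines and dropping the qualities. A minimal sketch, assuming four-line FastQ records; the function name is hypothetical and the code is not taken from MiGA:

```python
def fastq_to_fasta(fastq_lines):
    """Yield FastA lines from an iterable of FastQ lines (4 lines per record)."""
    for i, line in enumerate(fastq_lines):
        line = line.rstrip("\n")
        if i % 4 == 0:
            yield ">" + line[1:]  # replace the leading '@' with '>'
        elif i % 4 == 1:
            yield line            # sequence; the '+' and quality lines are skipped
```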

Essential Genes

In this step, MiGA uses HMM.essential.rb from the Enveomics Collection to identify a set of genes typically present in single copy in bacterial and archaeal genomes. The protein translations of those essential genes are extracted for other analyses within MiGA (e.g., hAAI in distances) or outside it (e.g., phylogeny or MLSA for diversity analyses). In addition, this step generates a report that can be used for quality control, including estimates of completeness and contamination (for genomes) and the median number of copies of single-copy genes (for metagenomes and viromes).
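The quality estimates mentioned above can be illustrated with a short sketch. It assumes that completeness is the percentage of essential genes detected at least once and that contamination is the percentage detected more than once; the exact formulas used by HMM.essential.rb are not given here, and the function names are hypothetical.

```python
from statistics import median


def quality_from_copies(copies):
    """Estimate completeness and contamination (in %) for a genome.

    copies: mapping of essential gene -> number of copies detected.
    """
    total = len(copies)
    if total == 0:
        raise ValueError("no essential genes in the collection")
    found = sum(1 for n in copies.values() if n >= 1)
    multiple = sum(1 for n in copies.values() if n >= 2)
    return {
        "completeness": 100.0 * found / total,
        "contamination": 100.0 * multiple / total,
    }


def median_copies(copies):
    """Median number of copies per essential gene (metagenomes and viromes)."""
    return median(copies.values())
```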

innominate: List of innominate taxa (groups without a name but containing lower-rank classifications) as raw text.

kronain: Raw-text list of taxa used as input for Krona.

krona: HTML output produced by Krona.

MiGA symbol: mytaxa.

MyTaxa Scan

This step is only supported for genomes (dataset types genome, popgenome, and scgenome), and it requires the (optional) MyTaxa requirements to be installed.

In this step, the genomes are scanned in windows of ten genes. For each window, the taxonomic distribution is determined using MyTaxa and compared against the distribution for the entire genome. This is a quality-control step for manual curation.
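The window scan can be sketched as follows. This is only an illustration of the idea: it assumes one taxon label per gene in genome order, non-overlapping windows, and total variation distance as the comparison measure. None of these choices is stated in the source, and the function names are hypothetical.

```python
from collections import Counter


def distribution(taxa):
    """Relative frequency of each taxon label in a list."""
    counts = Counter(taxa)
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


def scan_windows(gene_taxa, size=10):
    """Yield (window_start, distance) comparing each window to the whole genome.

    gene_taxa: taxon label assigned to each gene, in genome order.
    distance:  total variation distance (0 = identical, 1 = disjoint).
    """
    genome = distribution(gene_taxa)
    for start in range(0, len(gene_taxa), size):
        window = distribution(gene_taxa[start:start + size])
        taxa = set(genome) | set(window)
        dist = 0.5 * sum(
            abs(window.get(t, 0.0) - genome.get(t, 0.0)) for t in taxa
        )
        yield start, dist
```

Windows with a large distance would be the candidates for manual inspection.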

Supported file keys:

mytaxa (req): MyTaxa output.

report (req): PDF file containing the graphic report.

regions_archive (gz): Archived folder containing FastA files with the sequences of the genes in regions identified as abnormal.

nomytaxa: If it exists, MiGA assumes no support for MyTaxa modules, and none of the above files are required.

Deprecated file keys:

wintax: Taxonomic distribution of each window.

blast (gz): BLAST against the reference genomes database.

mytaxain (gz): Re-formatted BLAST used as input for MyTaxa.

regions (dir): Folder containing FastA files with the sequences of the genes in regions identified as abnormal.

gene_ids: List of genes per window.

region_ids: List of regions identified as abnormal.

MiGA symbol: mytaxa_scan.

Distances

This step is only supported for genomes (dataset types genome, popgenome, and scgenome). In this step, each dataset is compared against all other datasets in the project. If the dataset is a reference dataset, it is compared against all other reference datasets in the project. If it's a query dataset, it is compared iteratively against medoids. For more details on the strategy used in this step, see the manual section on distances.
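The iterative comparison of a query against medoids can be pictured as a descent through a hierarchy, so that the query is compared only with a few medoids per level instead of with every dataset. The sketch below assumes a hypothetical nested-dictionary hierarchy and a caller-supplied similarity function; it illustrates the idea and is not MiGA's implementation.

```python
def classify_query(root, similarity):
    """Follow the closest medoid at each level of a hypothetical hierarchy.

    root:       dict with a "children" list; each child is a dict with a
                "medoid" name and its own "children" list.
    similarity: function mapping a medoid name to its similarity to the
                query (for example, AAI in %).
    Returns the best hit per level as a list of (medoid, similarity).
    """
    path = []
    node = root
    while node["children"]:
        best = max(node["children"], key=lambda child: similarity(child["medoid"]))
        path.append((best["medoid"], similarity(best["medoid"])))
        node = best
    return path
```

The returned list corresponds loosely to the "best hits among medoids at different hierarchical levels" stored under the aai_medoids and ani_medoids keys.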

Supported file keys:

For reference datasets

haai_db (req): SQLite3 database containing hAAI values.

aai_db: SQLite3 database containing AAI values.

ani_db: SQLite3 database containing ANI values.

For query datasets

aai_medoids (req except for clades projects): Best hits among medoids at different hierarchical levels in the AAI indexing.

ani_medoids (req for clades projects): Best hits among medoids at different hierarchical levels in the ANI indexing.

haai_db (req): SQLite3 database containing hAAI values.

aai_db: SQLite3 database containing AAI values.

ani_db: SQLite3 database containing ANI values.

ref_tree: Newick file with the Bio-NJ tree including queried medoids and the query dataset.

ref_tree_pdf: PDF rendering of ref_tree.

intax: Raw text result of the taxonomy test against the reference genome.

MiGA symbol: distances.

Taxonomy

This step is only supported for genomes (dataset types genome, popgenome, and scgenome) that are reference datasets, in projects with a set reference project (:ref_project in metadata).

In this step, MiGA compares the genome against a reference project using the query search method, and imports the resulting taxonomy with p-value below 0.05 (or whichever value is set as :tax_pvalue in metadata).
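The p-value cutoff can be illustrated with a small filter. The input layout, a list of (rank, taxon, p-value) tuples, is an assumption made for this example and is not the format of the intax file; the function name is hypothetical.

```python
def accepted_taxonomy(tests, max_pvalue=0.05):
    """Keep the classifications whose p-value is below the cutoff.

    tests:      list of (rank, taxon, p_value) tuples (hypothetical layout).
    max_pvalue: plays the role of :tax_pvalue in the project metadata.
    """
    return [(rank, taxon) for rank, taxon, p_value in tests if p_value < max_pvalue]
```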

Supported file keys:

intax: Raw text result of the taxonomy test against the reference genome.

aai_medoids (req except for reference clades projects): Best hits among medoids at different hierarchical levels in the AAI indexing.

ani_medoids (req for reference clades projects): Best hits among medoids at different hierarchical levels in the ANI indexing.

haai_db (req): SQLite3 database containing hAAI values.

aai_db: SQLite3 database containing AAI values.

ani_db: SQLite3 database containing ANI values.

ref_tree: Newick file with the Bio-NJ tree including queried medoids and the query dataset.

ref_tree_pdf: PDF rendering of ref_tree.

MiGA symbol: taxonomy.

Stats

In this step, MiGA traces back all the results of the dataset and estimates summary statistics. In addition, it removes from the distances database any stored values involving datasets that are no longer registered in the project.

No supported file keys.

MiGA symbol: stats.

Project Results

Once all datasets have been pre-processed (i.e., once all the results above are available for all reference datasets), MiGA executes the following project-wide steps:

hAAI Distances

Consolidation of hAAI distances.

Supported file keys:

rdata (req): Pairwise values in a data.frame for R.

matrix (req): Pairwise values in a raw tab-delimited file.

log (req): List of datasets included in the matrix.

hist: Histogram of hAAI values as raw tab-delimited file.

MiGA symbol: haai_distances.

AAI Distances

Consolidation of AAI distances.

Supported file keys:

rdata (req): Pairwise values in a data.frame for R.

matrix (req): Pairwise values in a raw tab-delimited file.

log (req): List of datasets included in the matrix.

hist: Histogram of AAI values as raw tab-delimited file.

MiGA symbol: aai_distances.

ANI Distances

Consolidation of ANI distances.

Supported file keys:

rdata (req): Pairwise values in a data.frame for R.

matrix (req): Pairwise values in a raw tab-delimited file.

log (req): List of datasets included in the matrix.

hist: Histogram of ANI values as raw tab-delimited file.

MiGA symbol: ani_distances.
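The hist files of the three consolidation steps above hold a histogram of the pairwise values. A minimal sketch of such binning is shown below. It assumes rows parsed from a tab-delimited table with a header and a hypothetical column named value; the actual column layout of the matrix files is not described here.

```python
import csv
import io
from collections import Counter


def read_pairs(text):
    """Parse tab-delimited text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def histogram(rows, column="value", bin_width=1.0):
    """Count values per bin; returns sorted (bin_lower_edge, count) pairs."""
    counts = Counter()
    for row in rows:
        value = float(row[column])
        counts[int(value // bin_width) * bin_width] += 1
    return sorted(counts.items())
```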

Clade Finding

In this step, MiGA attempts to identify clades at the species level or above using a combination of ANI and AAI values. For genomes projects, MiGA generates AAI clades in this step. Clades are proposed at AAI > 90% and ANI > 95% using the Markov Clustering algorithm implemented in MCL. Most utilities for distance manipulation and for tree estimation and manipulation use the R packages Ape and Vegan.
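The role of the identity thresholds can be illustrated by grouping datasets connected by pairs above a cutoff. Note that the sketch below uses plain connected components (single linkage via union-find) as a stand-in; it is not the Markov Clustering algorithm that MCL implements, and the input layout and function name are hypothetical.

```python
def threshold_clusters(pairs, threshold):
    """Group items linked by pairs with identity above the threshold.

    pairs: iterable of (a, b, identity) with identity in %.
    Returns a sorted list of sorted clusters (connected components).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, identity in pairs:
        root_a, root_b = find(a), find(b)  # registers both items
        if identity > threshold and root_a != root_b:
            parent[root_a] = root_b

    clusters = {}
    for item in list(parent):
        clusters.setdefault(find(item), set()).add(item)
    return sorted(sorted(members) for members in clusters.values())
```

With ANI values and a threshold of 95, for example, two genomes end up in the same group whenever a chain of pairs above 95% connects them; MCL is generally more conservative than this when clusters are joined by only a few links.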

Subclades

In this step, MiGA attempts to identify clades below the species level using ANI values. MiGA generates ANI clades in this step. Most utilities for distance manipulation and for tree estimation and manipulation use the R packages Ape and Vegan.

Supported file keys:

report (req): PDF file including a graphic report for the clustering.

class_table (req): Tab-delimited file containing the classification of all datasets in ANI clusters.

class_tree (req): Newick file containing the classification of all datasets in ANI clusters as a dendrogram.

classif (req): Tab-delimited file containing the highest-level classification of each dataset, the medoid of the cluster, and the ANI against the corresponding medoid.

medoids (req): List of medoids per cluster.

ani_tree: Bio-NJ tree based on ANI distances in Newick format.

MiGA symbol: subclades.

OGS

In this step, MiGA generates orthology groups (OGS) using reciprocal best matches (RBM) between all pairs of datasets in the project. Groups are generated using MCL, with pairs weighted by bit score. Once the groups are computed, MiGA uses the OGS matrix to estimate summary and rarefied statistics.
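Reciprocal best matches between two datasets can be sketched as follows. The input layout, a list of (query, subject, bit score) hits per search direction, and the choice of the forward bit score as the pair weight are assumptions made for this example; the function names are hypothetical.

```python
def best_hits(hits):
    """Map each query to its highest-scoring (subject, bit_score)."""
    top = {}
    for query, subject, bit_score in hits:
        if query not in top or bit_score > top[query][1]:
            top[query] = (subject, bit_score)
    return top


def reciprocal_best_matches(hits_ab, hits_ba):
    """Return (a, b, bit_score) for pairs that are each other's best hit."""
    best_ab = best_hits(hits_ab)
    best_ba = best_hits(hits_ba)
    pairs = []
    for a, (b, score) in best_ab.items():
        if best_ba.get(b, (None,))[0] == a:
            pairs.append((a, b, score))
    return sorted(pairs)


def to_abc(pairs):
    """Format pairs as ABC lines: two labels and a weight, tab-separated."""
    return [f"{a}\t{b}\t{score}" for a, b, score in pairs]
```

Lines in this label-label-weight layout are what MCL reads when run with its --abc option.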

Supported file keys:

ogs (req): Matrix of orthology groups, as tab-delimited raw file.

stats (req): Summary statistics in JSON format.

abc (gz): When available, it includes all the individual RBM files in ABC format. This file is typically produced as an intermediate result and removed before finishing, but can be maintained using miga new -P . -m clean_ogs=false --update in the project folder using the