A resource of 192 inbred D. melanogaster lines, created from a single natural population, that have been both extensively sequenced and extensively phenotyped for quantitative traits. This resource will be a valuable tool for associating genotype with phenotype for quantitative traits.

The default toolkit configuration enables it to find and retrieve SRA runs by
accession. It also downloads (and caches) only the part of the data you really need. For example,
quality scores represent a majority of the data volume, and you may not need them if you
dump fasta only (versus fastq). Or, if you are looking at a particular gene, you may not
need reads aligned to other regions or not aligned at all. In the same way, if you use
GATK with SRA support enabled, you need only SRA run accessions to start your process.

fastq-dump
will dump reads in a number of "standard" fastq and fasta formats.

vdb-dump
is also capable of producing fasta and fastq (besides other formats). It dumps data much
faster than fastq-dump, but the ordering of reads may differ and it does not produce split-read
multi-file output.

Prefetch
tool will help you cache all data in advance if you plan to run data analysis in an environment where getting
data from NCBI at run time is not feasible.
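
As a rough illustration of the tools described above, the following Python sketch only builds the command lines for caching a run with prefetch and for dumping it as fasta with fastq-dump. The helper names and the accession are placeholders introduced here, and the sketch assumes the toolkit executables are installed under their usual names.

```python
import shutil
import subprocess


def prefetch_command(accession):
    """Argument list that caches one run locally with prefetch."""
    return ["prefetch", accession]


def fasta_dump_command(accession):
    """Argument list that dumps reads as fasta, so no quality scores are written.

    "--fasta 0" asks fastq-dump for fasta output without line wrapping.
    """
    return ["fastq-dump", "--fasta", "0", accession]


def run_if_available(command):
    """Run the command only when the tool is on PATH; return True if it ran."""
    if shutil.which(command[0]) is None:
        return False
    subprocess.run(command, check=True)
    return True


if __name__ == "__main__":
    accession = "SRR000001"  # placeholder accession, used for illustration only
    for command in (prefetch_command(accession), fasta_dump_command(accession)):
        print(" ".join(command))
```

Calling run_if_available(prefetch_command(accession)) would execute the step when the tool is installed; the sketch itself only prints the commands.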

To select a list of interesting SRA runs within the scope of an experiment, sample, or study, you may use the SRA Run Selector
either directly or from an Entrez search.

Strong signals

Results show distribution of reads mapping to specific taxonomy nodes as a percentage
of total reads within the analyzed run. In cases where a read maps to more than one related
taxonomy node, the read is reported as originating from the lowest shared taxonomic node.
So when a read maps to two species belonging to the same genus, it is reported as having
originated from their common genus. Under typical conditions where a single organism has
been sequenced, the expectations are that reads will map to several taxonomy nodes across
the organism’s lineage, and that the number of reads mapping to higher level nodes will
be more than those that map to terminal nodes.
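
The lowest-shared-node rule can be sketched as follows. This is a minimal illustration over a toy, heavily simplified taxonomy; the parent table and function names are assumptions made for the example and are not part of STAT.

```python
# Toy taxonomy: child -> parent. Simplified for illustration, not NCBI tax IDs.
PARENT = {
    "Escherichia coli": "Escherichia",
    "Escherichia albertii": "Escherichia",
    "Escherichia": "Enterobacteriaceae",
    "Salmonella enterica": "Salmonella",
    "Salmonella": "Enterobacteriaceae",
    "Enterobacteriaceae": "root",
}


def lineage(node):
    """Return the path from a node up to the root, starting with the node itself."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def lowest_shared_node(nodes):
    """Return the deepest node that lies on the lineage of every input node."""
    nodes = list(nodes)
    shared = set(lineage(nodes[0]))
    for node in nodes[1:]:
        shared &= set(lineage(node))
    # Walk up from the first node; the first shared ancestor is the lowest one.
    for candidate in lineage(nodes[0]):
        if candidate in shared:
            return candidate
    raise ValueError("nodes do not share a root")
```

With this table, a read matching both Escherichia species is reported at the genus, while a read matching an Escherichia and a Salmonella species is reported at the family.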

STAT results are proportional to the size of sequenced genomes. So given a mixed sample
containing several organisms at equal copy number, one expects proportionally more reads
to originate from the larger genomes. This means that the percentages reported by STAT
will reflect genome size and must be considered against the genomic complexity of the
sequenced sample.
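
The effect of genome size can be shown with a small calculation. The genome sizes below are hypothetical round numbers chosen only to keep the arithmetic simple, and the function name is an assumption of this sketch.

```python
def expected_read_fractions(genome_sizes):
    """Expected share of reads per organism at equal copy number.

    With equal copy number, each organism contributes sequence in proportion
    to its genome size, so the shares are the sizes normalized to sum to 1.
    """
    total = sum(genome_sizes.values())
    return {name: size / total for name, size in genome_sizes.items()}


# Hypothetical genome sizes in base pairs, for illustration only.
sample = {"organism_A": 1_000_000, "organism_B": 3_000_000}
fractions = expected_read_fractions(sample)
```

Here organism_B is expected to account for three quarters of the reads even though both organisms are present at the same copy number.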

Overview

The NCBI SRA Taxonomy Analysis Tool (STAT) calculates the taxonomic distribution of reads
from next generation sequencing runs. This analysis maps individual sequencing reads to a
taxonomic hierarchy and reports the taxonomic composition of reads within a sequencing run.

Method

STAT maps sequencing reads to a taxonomic hierarchy using a two-step strategy based
on exact query read matches to precomputed k-mer dictionary databases. In the first
pass, a small “coarse” reference dictionary database is used to identify organisms
matching a read set. In the second pass, organism-specific slices from a “fine” reference
dictionary database are used to compute the distribution of reads among the identified taxonomy
classes (species and higher-order taxonomy nodes). When multiple taxonomy nodes are mapped for
a single spot, we use the lowest non-ambiguous mapping.
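
A minimal sketch of the two-pass idea, under several assumptions: the dictionaries are plain Python dicts from k-mer to a taxon label, a toy k-mer length of 4 is used for readability, and reads that hit more than one taxon are simply skipped rather than resolved. All names and data here are hypothetical and do not reflect the STAT implementation.

```python
from collections import Counter

K = 4  # toy k-mer length so the example stays readable


def kmers(sequence, k=K):
    """Yield every k-mer of a sequence."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def coarse_pass(reads, coarse_dict):
    """First pass: collect the taxa whose coarse k-mers occur in any read."""
    found = set()
    for read in reads:
        for kmer in kmers(read):
            if kmer in coarse_dict:
                found.add(coarse_dict[kmer])
    return found


def fine_pass(reads, fine_slices, taxa):
    """Second pass: count reads per taxon using only the slices for found taxa."""
    lookup = {}
    for taxon in taxa:
        lookup.update(fine_slices.get(taxon, {}))
    counts = Counter()
    for read in reads:
        hits = {lookup[kmer] for kmer in kmers(read) if kmer in lookup}
        # A read hitting several taxa would need an ambiguity rule;
        # this sketch only counts reads with a single unambiguous hit.
        if len(hits) == 1:
            counts[hits.pop()] += 1
    return counts


# Hypothetical toy dictionaries and reads.
coarse_dict = {"ACGT": "taxon_A", "GGCC": "taxon_B"}
fine_slices = {
    "taxon_A": {"ACGT": "taxon_A", "CGTA": "taxon_A"},
    "taxon_B": {"GGCC": "taxon_B"},
    "taxon_C": {"TTTT": "taxon_C"},
}
reads = ["AACGTAA", "TTGGCCTT", "TTTTTTT"]
taxa = coarse_pass(reads, coarse_dict)
counts = fine_pass(reads, fine_slices, taxa)
```

Because taxon_C is never found in the coarse pass, its fine slice is not loaded and the third read stays unassigned, which is the saving the two-pass design aims for.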

STAT k-mer dictionaries are built using an iterative minhash
based approach against reference genomic databases. For every fixed-length segment of incoming reference
nucleotide sequence, the k-mer representing that segment is selected as the one with the minimum
FNV-1 hash value.
Several strategies were used to enhance the specificity and accuracy of STAT results.
Low complexity k-mers composed of >50% homo-polymer or dinucleotide repeats (e.g. AAAAAA or ACACACACACA)
were filtered from dictionaries, and discrete k-mers belonging to multiple taxonomic references
were “merged” at the lowest common taxonomic node shared between references. Finally, the specificity
of representative k-mers was determined by searching against the source reference genomic database.
When representative k-mers were found in multiple taxonomic reference nodes, they were merged at
the lowest common taxonomic node as above.
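
The minimum-hash selection step can be sketched as below. This assumes the 64-bit FNV-1 variant applied to the ASCII bytes of each k-mer and considers only k-mers lying wholly inside a segment; the actual encoding, hash width, and strand handling used by STAT are not stated here, so treat these choices as illustrative.

```python
FNV64_OFFSET_BASIS = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3
MASK64 = 0xFFFFFFFFFFFFFFFF


def fnv1_64(data):
    """64-bit FNV-1 hash: multiply by the prime, then XOR in each byte."""
    value = FNV64_OFFSET_BASIS
    for byte in data:
        value = (value * FNV64_PRIME) & MASK64
        value ^= byte
    return value


def kmer_hash(kmer):
    """Hash a k-mer through its ASCII bytes (an assumption of this sketch)."""
    return fnv1_64(kmer.encode("ascii"))


def representative_kmers(sequence, segment_length, k):
    """Pick, for each segment of at least k bases, the k-mer with the smallest hash."""
    chosen = []
    for start in range(0, len(sequence), segment_length):
        segment = sequence[start:start + segment_length]
        candidates = [segment[i:i + k] for i in range(len(segment) - k + 1)]
        if candidates:  # skip a trailing segment shorter than k
            chosen.append(min(candidates, key=kmer_hash))
    return chosen
```

For example, representative_kmers(genome, 64, 32) would keep one 32-mer per 64 nt segment of a string named genome.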

Genome references

The NCBI refseq_genomic database was supplemented
with the validated viral genome set (RefSeq neighbors)
and used as the source for k-mer creation in both “coarse” and “fine” sets.

Taxonomy hierarchy

Reference sequences were mapped to the taxonomy hierarchy using the NCBI taxonomy database. The database contained 48,180 taxonomy nodes in January, 2017.

Segment sizes and K-mer selection

K-mer dictionaries were built by computationally slicing reference genomes into sequential segments and selecting 32-mers to represent each segment.
The “coarse” k-mer dictionary uses variable segment lengths, proportional to genome size and ranging from 200 to 8000 nt. The “fine” k-mer dictionary
uses a constant 64 nt segment length for all genomes (for a 32-mer index this gives a 32x reduction in space and I/O, at the cost of expecting
at least one error-free 64-mer in every spot).

Sequence substring: one of the biological reads for a spot should contain the substring
Examples:
ATTGGA,
^ATTGGA,
ATTGGA$,
ATGDNNAT,
ATGGA&GCGC
The strings are case insensitive and belong to either the 2NA or the 4NA alphabet.
String length is limited to 29 characters in the 4NA alphabet
(includes IUPAC substitution codes) or 61 characters in the 2NA alphabet (ACGT only).
Strings may be combined with the boolean
operators & | ! (AND, OR, NOT).
See "SRA nucleotide search expressions" for more details.
The maximum size of a Run that can be searched is
1.1G.
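
To make the query syntax concrete, here is a small sketch that checks a single read against a search expression. It is not the SRA search implementation: it assumes that ^ and $ anchor a term to the start and end of a read, that the length limits apply to the bases without the anchors, and that & requires every term to match the same read. The | and ! operators are not handled, and the function names are made up for the example.

```python
import re

# IUPAC nucleotide codes (4NA alphabet) expanded to regular-expression classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def term_to_regex(term):
    """Translate one search term into a compiled regular expression."""
    term = term.upper()
    at_start = term.startswith("^")
    at_end = term.endswith("$")
    bases = term[1:] if at_start else term
    bases = bases[:-1] if at_end else bases
    if not bases or any(base not in IUPAC for base in bases):
        raise ValueError(f"not a nucleotide string: {term!r}")
    # 61 characters for ACGT-only (2NA) terms, 29 when IUPAC codes are present.
    limit = 61 if set(bases) <= set("ACGT") else 29
    if len(bases) > limit:
        raise ValueError(f"term longer than {limit} characters: {term!r}")
    body = "".join(IUPAC[base] for base in bases)
    return re.compile(("^" if at_start else "") + body + ("$" if at_end else ""))


def read_matches(read, expression):
    """True if the read satisfies every '&'-joined term of the expression."""
    read = read.upper()
    return all(term_to_regex(term).search(read) for term in expression.split("&"))
```

For instance, read_matches("ATGACGAT", "ATGDNNAT") holds because D accepts A, G, or T and N accepts any base.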