Dendritic cells (DCs) are antigen-sensing and antigen-presenting cells that are essential for effective immunity. Existing as a multi-subset population, divided by distinct developmental and functional characteristics [1,2], DC subsets play important and unique roles in responses to pathogens, vaccines and cancer therapies, as well as during immune-pathologies. Therefore, therapeutic manipulation of the DC compartment is an attractive strategy. However, our incomplete knowledge of the inter-relationship between DC subsets and how they develop from progenitors in the bone marrow (BM) has so far limited the realization of their therapeutic potential. DCs arise from a cascade of progenitors that gradually differentiate in the BM: first the macrophage DC progenitor (MDP), then the common DC progenitor (CDP), and lastly the Pre-DC, which will leave the BM to seed peripheral tissues before differentiating into mature DCs [3,4]. While the basic outline of this process is known, how subset commitment and development are regulated at the molecular level remains poorly understood. Here we reveal that the Pre-DC population in mice is heterogeneous, containing uncommitted Ly6c+/-Siglec-H+ cells as well as Ly6c+Siglec-H- and Ly6c-Siglec-H- sub-populations that are developmentally fated to become Th2/17-inducing CD11b+ DCs and Th1-inducing CD8a+ DCs, respectively. Using single cell analysis by microfluidic RNA sequencing, we found that DC subset imprinting occurred at the mRNA level from the CDP stage, revealing that subset fate is defined in the BM and not in peripheral tissues. Single cell transcriptome analysis allowed identification of the molecular checkpoints between progenitor stages and revealed new regulators of DC-poiesis, shedding light on the role of cell cycle control and specific transcription factors in DC lineage development.
These data advance our knowledge of the steady-state regulation of DC populations and open promising new avenues for investigation of the therapeutic potential of DC subset-specific targeting in vivo to improve vaccine-based and immunotherapeutic strategies.

Overall design: Single cell mRNA sequencing was used to investigate the transcriptomic relationships within the dendritic cell precursor compartment of the BM, as well as between single dendritic cell precursors.

SRA archive data

SRA archive data is normalized by the SRA load process and used by the SRA Toolkit to read and produce formats like FASTQ, SAM, etc.
The default toolkit configuration enables it to find and retrieve SRA runs by accession.
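
As a rough illustration of retrieval by accession, the sketch below builds the command lines for two SRA Toolkit programs, prefetch and fasterq-dump, from Python. The accession SRR000001 and the output directory are placeholders, and the helper names build_commands and fetch_run are hypothetical. The commands are executed only if the toolkit is installed and on the PATH.

```python
import shutil
import subprocess


def build_commands(accession, outdir="fastq"):
    """Return the argv lists for downloading a run and converting it to FASTQ."""
    return [
        ["prefetch", accession],                          # fetch the run by accession
        ["fasterq-dump", accession, "--outdir", outdir],  # write FASTQ files to outdir
    ]


def fetch_run(accession, outdir="fastq"):
    """Run the commands in order; raise if an SRA Toolkit program is missing."""
    for argv in build_commands(accession, outdir):
        if shutil.which(argv[0]) is None:
            raise RuntimeError(f"{argv[0]} not found; is the SRA Toolkit on PATH?")
        subprocess.run(argv, check=True)
```

Calling fetch_run("SRR000001") would download the placeholder run and write FASTQ files to the fastq directory, assuming network access and a default toolkit configuration.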

Public SRA files are now available from GCP and AWS cloud platforms as well as from NCBI.
Access to most data in the cloud requires a user account with the cloud service provider.
The user’s account will incur costs for cloud compute or to copy data outside of the specified cloud service region.

To support large-scale (hyper-parallel) data analyses, SRA data is now available at GCP and AWS, with a few caveats:

SRA data is copied to the cloud from NCBI. There may be a lag between availability from NCBI and from cloud service providers (CSPs).

To access public data, a user account with the cloud service provider is required. Your account will incur costs for cloud compute and/or for copying data
(either archival data or the results of your compute) outside of the specified cloud service region.

Distribution of protected data is signed by an NIH account and requires the user to operate in the same region as the data.

SRA has also begun to provide access to originally submitted source files:

Not all files have been validated by SRA.

Not all files have been copied to cloud locations (recovering them from the NCBI tape system takes time).

The volume of this type of data is much larger and it is used less often, so we will keep most of it
on tape or in "cold" storage in the cloud. As a result, the data may not be available instantly; restore
requests will be served on a first-come, first-served basis, and the cost of a restore may be charged to your
user account.

Taxonomy Analysis

Strong signals

Results show distribution of reads mapping to specific taxonomy nodes as a percentage of total reads within
the analyzed run. In cases where a read maps to more than one related taxonomy node, the read is reported as
originating from the lowest shared taxonomic node. So when a read maps to two species belonging to the same genus,
it is assigned at the genus level. Sequence reads from a single organism will map to several taxonomy nodes spanning the organism’s lineage.
The number of reads mapping to higher level nodes will typically be greater than those that map to terminal nodes.
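
The lowest-shared-node rule described above can be sketched in a few lines of Python. The taxonomy here is a toy child-to-parent table with made-up node names, and the function names are hypothetical; this is an illustration of the rule, not STAT's actual code.

```python
# Toy taxonomy (child -> parent); names and structure are illustrative only.
PARENT = {
    "species_A1": "genus_A",
    "species_A2": "genus_A",
    "genus_A": "family_X",
    "species_B1": "genus_B",
    "genus_B": "family_X",
    "family_X": "root",
}


def lineage(node):
    """Return the path from a node up to the root, starting with the node itself."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lowest_shared_node(nodes):
    """Return the lowest taxonomy node found in the lineage of every input node."""
    nodes = list(nodes)
    shared = set(lineage(nodes[0]))
    for node in nodes[1:]:
        shared &= set(lineage(node))
    # Walking up from the first node, the first shared node met is the lowest one.
    for candidate in lineage(nodes[0]):
        if candidate in shared:
            return candidate
    raise ValueError("nodes do not share a common ancestor")
```

With this table, a read that maps to both species_A1 and species_A2 is reported at genus_A, while a read that maps to species_A1 and species_B1 is reported at family_X.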

STAT results are proportional to the size of sequenced genomes. Given a mixed sample containing several organisms at equal copy number,
proportionally more reads originate from the larger genomes. This means that the percentages reported by STAT will reflect
genome size and must be considered against the genomic complexity of the sequenced sample.
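
A small worked example may make the genome-size effect concrete. The sketch below assumes reads are sampled uniformly along every genome, so the expected share of reads is proportional to genome size times copy number. The organism names and genome sizes are invented for illustration.

```python
def expected_read_fractions(genome_sizes, copy_numbers=None):
    """Expected fraction of reads per organism under uniform sampling.

    genome_sizes: mapping of organism name -> genome length in bases.
    copy_numbers: optional mapping of organism name -> genome copies (default 1 each).
    """
    if copy_numbers is None:
        copy_numbers = {name: 1 for name in genome_sizes}
    weights = {name: genome_sizes[name] * copy_numbers[name] for name in genome_sizes}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Two hypothetical organisms present at equal copy number.
sample = {"organism_A": 5_000_000, "organism_B": 1_000_000}
fractions = expected_read_fractions(sample)
```

Here organism_A is expected to contribute five sixths of the reads (about 83%) even though both organisms are equally abundant by copy number.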

Overview

The NCBI SRA Taxonomy Analysis Tool (STAT) calculates the taxonomic distribution of reads from next generation sequencing runs.
This analysis maps individual sequencing reads to a taxonomic hierarchy and reports the taxonomic composition of reads within a sequencing run.

Method

STAT maps sequencing reads to a taxonomic hierarchy using a two-step strategy based on exact query read matches to precomputed k-mer dictionary databases.
In the first pass, a small "coarse" reference dictionary database is used to identify organisms matching a read set.
In the second pass, organism-specific slices from a "fine" reference dictionary database are used to compute the distribution of reads among the identified taxonomy classes (species and higher-order taxonomy nodes).
When multiple taxonomy nodes are mapped for a single spot, the lowest non-ambiguous mapping is used.
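
The two-pass strategy can be sketched as follows. This is a simplified illustration under stated assumptions: the dictionaries are plain Python dicts keyed by k-mer, k is tiny so the toy data stays readable, and all names (coarse_pass, fine_pass, the organism and species labels) are hypothetical. Reads that hit more than one node are only flagged here; resolving them needs the taxonomy hierarchy.

```python
from collections import Counter


def kmers(read, k):
    """Return every overlapping k-mer of a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def coarse_pass(reads, coarse_dict, k):
    """Pass 1: collect organisms with at least one exact k-mer hit in the read set."""
    return {
        coarse_dict[kmer]
        for read in reads
        for kmer in kmers(read, k)
        if kmer in coarse_dict
    }


def fine_pass(reads, fine_slices, organisms, k):
    """Pass 2: count reads per taxonomy node using only slices for detected organisms."""
    fine_dict = {}
    for organism in organisms:
        fine_dict.update(fine_slices.get(organism, {}))
    counts = Counter()
    for read in reads:
        nodes = {fine_dict[kmer] for kmer in kmers(read, k) if kmer in fine_dict}
        if len(nodes) == 1:
            counts[nodes.pop()] += 1
        elif nodes:
            # Several nodes hit: a full version would resolve these via the taxonomy.
            counts["needs_lowest_shared_node"] += 1
    return counts
```

The point of the split is that the second pass only loads fine-dictionary slices for organisms that the first pass actually detected.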

STAT k-mer dictionaries are built using an iterative minhash-based
approach against reference genomic databases. For every fixed-length segment of incoming reference
nucleotide sequence, the k-mer representing that segment is selected as the one with the minimum
FNV-1 hash value.
Several strategies were used to enhance the specificity and accuracy of STAT results.
Low-complexity k-mers composed of >50% homopolymer or dinucleotide repeats (e.g. AAAAAA or ACACACACACA)
were filtered from the dictionaries, and discrete k-mers belonging to multiple taxonomic references
were "merged" at the lowest common taxonomic node shared between the references. Finally, the specificity
of representative k-mers was determined by searching against the source reference genomic database.
When representative k-mers were found in multiple taxonomic reference nodes, they were merged at
the lowest common taxonomic node, as above.
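
The selection step can be sketched as below. Several details are assumptions made for illustration and are not stated in the text: the 64-bit variant of the FNV-1 hash is used, the low-complexity test measures the longest homopolymer or period-2 run as a fraction of k-mer length, and low-complexity k-mers are dropped before the minimum is taken. All function names are hypothetical.

```python
FNV64_OFFSET = 0xCBF29CE484222325  # standard 64-bit FNV offset basis
FNV64_PRIME = 0x100000001B3        # standard 64-bit FNV prime


def fnv1_64(data: bytes) -> int:
    """64-bit FNV-1 hash: multiply by the prime, then XOR in each byte."""
    h = FNV64_OFFSET
    for byte in data:
        h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF
        h ^= byte
    return h


def longest_periodic_run(seq, period):
    """Length of the longest stretch in which seq[i] == seq[i - period]."""
    best = run = min(period, len(seq))
    for i in range(period, len(seq)):
        run = run + 1 if seq[i] == seq[i - period] else period
        best = max(best, run)
    return best


def is_low_complexity(kmer, max_fraction=0.5):
    """True if a homopolymer or dinucleotide repeat covers more than max_fraction."""
    longest = max(longest_periodic_run(kmer, 1), longest_periodic_run(kmer, 2))
    return longest > max_fraction * len(kmer)


def representative_kmers(reference, segment_len, k):
    """Pick, for each segment, the non-low-complexity k-mer with the minimum hash."""
    representatives = []
    for start in range(0, len(reference), segment_len):
        segment = reference[start:start + segment_len]
        candidates = [segment[i:i + k] for i in range(len(segment) - k + 1)]
        candidates = [c for c in candidates if not is_low_complexity(c)]
        if candidates:
            representatives.append(
                min(candidates, key=lambda c: fnv1_64(c.encode("ascii")))
            )
    return representatives
```

Because the minimum-hash choice is deterministic, the same reference segment always yields the same representative k-mer, which is what lets an exact match in a read point back to that segment.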

Genome references

The NCBI refseq_genomic database was supplemented
with the validated viral genome set (RefSeq neighbors)
and used as the source for k-mer creation in both "coarse" and "fine" sets.

Taxonomy hierarchy

Reference sequences were mapped to the taxonomy hierarchy using the NCBI
taxonomy database. The database contained 48,180 taxonomy nodes in January 2017.

Segment sizes and K-mer selection

K-mer dictionaries were built by computationally slicing reference genomes into sequential segments and selecting 32-mers to represent each segment.
The "coarse" k-mer dictionary uses variable segment lengths, proportional to genome size and ranging from 200 to 8000 nt. The "fine" k-mer dictionary
uses a constant 64 nt segment length for all genomes (for a 32-mer index this gives a 32x reduction in space, on the assumption that every spot contains at least one error-free 64-mer).
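
The arithmetic behind these segment sizes is sketched below for a hypothetical 4.6 Mb genome. The text does not say which coarse segment length a genome of a given size receives, so the sketch only evaluates the two stated bounds (200 and 8000 nt) alongside the fine 64 nt setting. The function names are illustrative.

```python
K = 32  # k-mer length stated in the text


def candidate_kmers_per_segment(segment_len, k=K):
    """Number of overlapping k-mers a single segment offers to choose from."""
    return max(segment_len - k + 1, 0)


def representatives_per_genome(genome_len, segment_len):
    """Number of whole segments, i.e. representative k-mers kept for the genome."""
    return genome_len // segment_len


genome_len = 4_600_000  # hypothetical genome length in nt

fine_count = representatives_per_genome(genome_len, 64)            # fine dictionary
coarse_count_short = representatives_per_genome(genome_len, 200)   # shortest coarse segment
coarse_count_long = representatives_per_genome(genome_len, 8000)   # longest coarse segment
```

For this genome the fine dictionary keeps 71,875 representative 32-mers, while a coarse dictionary would keep between 575 and 23,000 depending on the segment length. Each 64 nt segment offers 33 overlapping 32-mers, of which one is kept.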

Sequence substring: one of the biological reads for a spot should contain the substring
Examples:
ATTGGA,
^ATTGGA,
ATTGGA$,
ATGDNNAT,
ATGGA&GCGC
The strings are case insensitive and belong to either the 2NA or the 4NA alphabet.
String length is limited to 29 characters in the 4NA alphabet
(which includes IUPAC substitution codes) or 61 characters in the 2NA alphabet (ACGT only).
Strings may be combined with the boolean
operators & | ! (AND, OR, NOT).
See "SRA nucleotide search expressions" for more details.
The maximum size of a Run that can be searched is 1.1G.
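
To show how a single search term might be interpreted, the sketch below converts one term into a Python regular expression and tests it against the reads of a spot. It is an approximation for illustration only, not the SRA search implementation: boolean operators (& | !) are not handled, and the length limits are assumed to count bases only, not the ^ and $ anchors. The names term_to_regex and spot_matches are hypothetical.

```python
import re

# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
MAX_LEN_2NA = 61  # limit for ACGT-only strings
MAX_LEN_4NA = 29  # limit for strings that use IUPAC substitution codes


def term_to_regex(term):
    """Compile one search term (optionally anchored with ^ or $) to a regex."""
    term = term.upper()
    anchored_start = term.startswith("^")
    anchored_end = term.endswith("$")
    bases = term[1:] if anchored_start else term
    bases = bases[:-1] if anchored_end else bases
    if not bases or set(bases) - set(IUPAC):
        raise ValueError(f"unsupported search term: {term!r}")
    limit = MAX_LEN_2NA if set(bases) <= set("ACGT") else MAX_LEN_4NA
    if len(bases) > limit:
        raise ValueError(f"term longer than {limit} characters")
    body = "".join(b if len(IUPAC[b]) == 1 else f"[{IUPAC[b]}]" for b in bases)
    return re.compile(("^" if anchored_start else "") + body + ("$" if anchored_end else ""))


def spot_matches(term, reads):
    """True if at least one biological read of the spot contains the term."""
    pattern = term_to_regex(term)
    return any(pattern.search(read.upper()) for read in reads)
```

For example, spot_matches("ATGDNNAT", ["CCATGACGATCC"]) is true because D accepts A, G or T and each N accepts any base, while the same term does not match a read in which ATG is followed by C.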