Mammalian preimplantation development is a complex process involving dramatic changes in the transcriptional architecture. Through single-cell RNA-sequencing (RNA-seq), we report here a comprehensive analysis of transcriptome dynamics from oocyte to morula in both human and mouse embryos. Based on single nucleotide variants (SNVs) in blastomere mRNAs and paternal-specific SNPs, we identify novel stage-specific monoallelic expression patterns for a significant portion of polymorphic gene transcripts (25-53%). By weighted gene co-expression network analysis (WGCNA), we find that each developmental stage can be concisely delineated by a small number of functional modules of co-expressed genes. This result indicates a sequential order of transcriptional changes in pathways of cell cycle, gene regulation, translation, and metabolism in a step-wise fashion from cleavage to morula. Cross-species comparisons reveal that the majority of human stage-specific modules (7 out of 9) are remarkably preserved, only to diverge in developmental specificity and timing in mice. We further identify conserved key members (or hub genes) of the human and mouse networks. These genes represent novel candidates that are likely key players in driving mammalian preimplantation development. Collectively, we demonstrate that mammalian preimplantation development is orchestrated by evolutionarily conserved genetic programs that diverge in developmental timing. Our results provide a valuable resource to dissect gene regulatory mechanisms underlying progressive development of early mammalian embryos. Overall design: single-cell RNA-seq of human and mouse blastomeres

SRA archive data

SRA archive data is normalized by the SRA load process and used by the SRA Toolkit to read and produce formats like FASTQ, SAM, etc.
The default toolkit configuration enables it to find and retrieve SRA runs by accession.

Public SRA files are now available from GCP and AWS cloud platforms as well as from NCBI.
Access to most data in the cloud requires a user account with the cloud service provider.
The user’s account will incur costs for cloud compute or to copy data outside of the specified cloud service region.

To support large-scale (hyper-parallel) data analyses, SRA data is now available on GCP and AWS, with a few caveats:

SRA data is copied to the cloud from NCBI. There may be a lag between availability from NCBI and from the cloud service providers (CSPs).

To access public data, a user account with the cloud service provider is required. Your account will incur costs for cloud compute and/or to copy data
(either archival data or the results of your compute) outside of the specified cloud service region.

Distribution of protected data is signed by an NIH account and requires the user to operate in the same region as the data.

SRA has also begun to provide access to originally submitted source files:

Not all files have been validated by SRA.

Not all files have been copied to cloud locations (recovering them from the NCBI tape system takes time).

The volume of this type of data is much larger and it is not used as often, so we will keep most of it
on tape or in "cold" storage in the cloud. As a result, the data may not be available instantly; restore
requests will be served on a first-come, first-served basis, and the cost of the restore may be charged to your
user account.

Results show distribution of reads mapping to specific taxonomy nodes as a percentage of total reads within
the analyzed run. In cases where a read maps to more than one related taxonomy node, the read is reported as
originating from the lowest shared taxonomic node. So when a read maps to two species belonging to the same genus,
it is assigned at the genus level. Sequence reads from a single organism will map to several taxonomy nodes spanning the organism’s lineage.
The number of reads mapping to higher level nodes will typically be greater than those that map to terminal nodes.
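
The lowest-shared-node rule described above can be sketched in a few lines of Python. This is only an illustration, not STAT's implementation: the toy taxonomy in `PARENT` and the function names `lineage` and `lowest_shared_node` are hypothetical and chosen for the example.

```python
# Hypothetical toy taxonomy (child -> parent); illustrative only.
PARENT = {
    "Escherichia coli": "Escherichia",
    "Escherichia albertii": "Escherichia",
    "Escherichia": "Enterobacteriaceae",
    "Salmonella enterica": "Salmonella",
    "Salmonella": "Enterobacteriaceae",
    "Enterobacteriaceae": "root",
}

def lineage(node):
    """Return the path from a node up to the root, starting with the node."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path

def lowest_shared_node(nodes):
    """Return the deepest taxonomy node that appears in every lineage."""
    paths = [lineage(n) for n in nodes]
    shared = set(paths[0]).intersection(*paths[1:])
    # Walking the first lineage from leaf to root, the first shared
    # node encountered is the lowest one.
    for node in paths[0]:
        if node in shared:
            return node
    raise ValueError("lineages share no node")

# A read hitting two species of the same genus is reported at genus level.
genus_hit = lowest_shared_node(["Escherichia coli", "Escherichia albertii"])
```

In this toy tree, a read matching both Escherichia species resolves to the genus, while a read matching Escherichia coli and Salmonella enterica resolves one level higher, at the family.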

STAT results are proportional to the size of sequenced genomes. Given a mixed sample containing several organisms at equal copy number,
proportionally more reads originate from the larger genomes. This means that the percentages reported by STAT will reflect
genome size and must be considered against the genomic complexity of the sequenced sample.
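
As a rough worked example of the genome-size effect, the sketch below computes the expected share of reads per organism. It assumes reads are sampled uniformly across all bases in the sample; the organism names, genome sizes, and the function `expected_read_fractions` are hypothetical.

```python
def expected_read_fractions(genome_sizes, copy_numbers=None):
    """Expected fraction of reads per organism under uniform sampling of bases.

    genome_sizes: dict of organism name -> genome length in nt.
    copy_numbers: optional dict of organism name -> genome copies in the
                  sample (defaults to one copy each, i.e. equal copy number).
    """
    if copy_numbers is None:
        copy_numbers = {name: 1 for name in genome_sizes}
    # Each organism contributes (genome length x copies) bases to the pool.
    weights = {name: genome_sizes[name] * copy_numbers[name] for name in genome_sizes}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}

# Two hypothetical organisms at equal copy number: 5 Mb versus 50 kb genomes.
fractions = expected_read_fractions({"organism_A": 5_000_000, "organism_B": 50_000})
```

Under these assumptions, organism_A receives 5,000,000 / 5,050,000 of the reads (about 99%), even though both organisms are present at the same copy number.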

Overview

The NCBI SRA Taxonomy Analysis Tool (STAT) calculates the taxonomic distribution of reads from next generation sequencing runs.
This analysis maps individual sequencing reads to a taxonomic hierarchy and reports the taxonomic composition of reads within a sequencing run.

Method

STAT maps sequencing reads to a taxonomic hierarchy using a two-step strategy based on exact query read matches to precomputed k-mer dictionary databases.
In the first pass, a small "coarse" reference dictionary database is used to identify organisms matching a read set.
In the second pass, organism-specific slices from a "fine" reference dictionary database are used to compute the distribution of reads between identified taxonomy classes (species and higher order taxonomy nodes).
When multiple taxonomy nodes are mapped for a single spot, we use the lowest non-ambiguous mapping.

STAT k-mer dictionaries are built using an iterative minhash-based
approach against reference genomic databases. For every fixed-length segment of incoming reference
nucleotide sequence, the k-mer representing that segment is selected based on the minimum
FNV-1 hash value.
Several strategies were used to enhance the specificity and accuracy of STAT results.
Low complexity k-mers composed of >50% homo-polymer or dinucleotide repeats (e.g. AAAAAA or ACACACACACA)
were filtered from dictionaries, and discrete k-mers belonging to multiple taxonomic references
were "merged" at the lowest common taxonomic node shared between references. Finally, the specificity
of representative k-mers was determined by searching against the source reference genomic database.
When representative k-mers were found in multiple taxonomic reference nodes, they were merged at
the lowest common taxonomic node as above.

Genome references

The NCBI refseq_genomic database was supplemented
with the validated viral genome set (RefSeq neighbors)
and used as the source for k-mer creation in both "coarse" and "fine" sets.

Taxonomy hierarchy

Reference sequences were mapped to the taxonomy hierarchy using the NCBI
taxonomy database. The database contained 48,180 taxonomy nodes in January 2017.

Segment sizes and K-mer selection

K-mer dictionaries were built by computationally slicing reference genomes into sequential segments and selecting 32-mers to represent each segment.
The "coarse" k-mer dictionary uses variable segment lengths, proportional to genome size and ranging from 200 to 8000 nt. The "fine" k-mer dictionary
uses a constant 64 nt segment length for all genomes (for a 32-mer index this gives a 32x reduction in space, on the assumption that every spot contains at least one error-free 64-mer).

Sequence substring: one of the biological reads for a spot should contain the substring
Examples:
ATTGGA,
^ATTGGA,
ATTGGA$,
ATGDNNAT,
ATGGA&GCGC
The strings are case insensitive, and belong to either 2NA or 4NA alphabets.
String length is limited to 29 characters in the 4NA alphabet
(includes IUPAC substitution codes) or 61 characters in the 2NA alphabet (ACGT only).
Search is case insensitive and strings may be combined with boolean
operators & | ! (AND, OR, NOT)
See "SRA nucleotide search expressions" for more details.
Maximum size of a Run to be searched is
1.1G
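
To make the example expressions above more concrete, here is a minimal local sketch of how such a pattern could be tested against a single read. It is not the SRA search implementation. It assumes that `^` and `$` anchor the pattern to the start and end of a read, it supports only the `&` operator (not `|` or `!`), and it does not enforce the length limits. The IUPAC table is the standard nucleotide code set; the function names are hypothetical.

```python
import re

# Standard IUPAC nucleotide codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def term_to_regex(term):
    """Translate one search term (optionally anchored) into a compiled regex."""
    body = term.upper()
    prefix = "^" if body.startswith("^") else ""
    suffix = "$" if body.endswith("$") else ""
    body = body.lstrip("^").rstrip("$")
    return re.compile(prefix + "".join(IUPAC[ch] for ch in body) + suffix)

def read_matches(expression, read):
    """True if the read satisfies every '&'-joined term (AND only)."""
    read = read.upper()  # matching is case insensitive
    return all(term_to_regex(term).search(read) for term in expression.split("&"))

# D matches A, G, or T; N matches any base.
hit = read_matches("ATGDNNAT", "ccATGTCCATgg")
```

For the authoritative syntax and semantics, refer to "SRA nucleotide search expressions".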