Figure: An artistic representation of single-cell RNA sequencing. The
stars in the sky represent cells in a heterogeneous tissue. The projection of
the stars onto the river reveals relationships among them that are not apparent
by looking directly at the sky. Like the river, our Bayesian model, called scVI,
reveals relationships among cells.

The diversity of gene regulatory states in our body is one of the main reasons
why such an amazing array of biological functions can be encoded in a single
genome. Recent advances in microfluidics and sequencing technologies (such as
inDrops) enabled measurement of gene expression at the single-cell level and has
provided tremendous opportunities to unravel the underlying mechanisms of
relationships between individual genes and specific biological phenomena. These
experiments yield approximate measurements for mRNA counts of the entire
transcriptome (i.e around $d = 20,000$ protein-coding genes) and a large number
of cells $n$, which can vary from tens of thousands to a million cells. The
early computational methods to interpret this data relied on linear model and
empirical Bayes shrinkage approaches due to initially extremely low sample-size.
While current research focuses on providing more accurate models for this gene
expression data, most of the subsequent algorithms either exhibit prohibitive
scalability issues or remain limited to a unique downstream analysis task.
Consequently, common practices in the field still rely on ad-hoc preprocessing
pipelines and specific algorithmic procedures, which limits the capabilities of
capturing the underlying data generating process.

In this post, we propose to build up on the increased sample-size and recent
developments in Bayesian approximate inference to improve modeling complexity as
well as algorithmic scalability. Notably, we present our recent work on deep
generative models for single-cell transcriptomics, which addresses all the
mentioned limitations by formalizing biological questions into statistical
queries over a unique graphical model, tailored to single-cell RNA sequencing
(scRNA-seq) datasets. The resulting algorithmic inference procedure, which we
named Single-cell Variational Inference (scVI), is open-source and
scales to over a million cells.

The scRNA-seq Analysis Era

Previous RNA quantification technologies were only able to collect population
level data, thus biologists could not determine if the differences in their
experiments were due to changes within a single cell type, multiple cell types,
or simply changes in frequencies of cells. Researchers can now dissociate
heterogeneous tissues into single cells, each encapsulated within a droplet with
barcodes that allow for easy sequencing and demultiplexing. After read alignment
procedures, we get a data matrix $x_{ng}$ of counts for expression of gene $g$
in cell $n$.

The standard analysis of this data matrix can be broken down into a number of
sequential tasks. At a high level, we would like to classify cells present in
our sample, which is commonly approached by clustering problem an abstract
latent space. Looking more closely, we want to identify the characteristic
genes for each cell type. , usually achieved via hypothesis testing.
Importantly, for any of these analyses, we must decouple biological signal from
technical factors. In scRNA-seq, technical factors include sequencing depth
(i.e. the number of total transcripts captured in a cell) and batch-effects (i.e
discrepancies between biological replicates). These confounding effects make any
inference problem significantly harder and therefore cannot be addressed by an
off-the-shelf machine learning algorithm.

There is a rich literature in scRNA-seq data analysis that relies on very
diverse machine learning techniques such as nearest-neighbors algorithms (MNNs),
similarity metric learning via kernel methods (SIMLR), matrix factorization
(ZINB-WaVE), and Bayesian non-parametric modeling (BISCUIT). However, all these
approaches have limitations. Namely, one specific algorithm is built to address
a unique problem. Consequently, each algorithm can use its own underlying
hypothesis that may create statistical inconsistencies. Furthermore, these
methods do not scale to more than a few thousand samples — either because of
memory or time complexity — which is an issue for analyzing modern datasets.

Deep Generative Models of Stochastic Gene Expression

As a starting point to a suitable generative model, we identify the different
factors of variability over which we would like to condition the data generating
process for the mRNA counts $x_{ng}$. First, a cell-specific latent variable
$l_n$ represents discrepancies between the number of transcripts captured in
each cell. Second, a dataset-specific factor $s_n$ represents discrepancies
between sequencing protocols or experiments conditions. Third, a cell-specific
latent variable $z_n$ encodes the parameters that governs stochastic gene
expression of the cell. This variable is typically of small dimension and
embeds the cells for downstream analysis (such as clustering).

Notably, following seminal work by Sandrine Dudoit’s group (ZINB-WaVE) we model
the conditional distribution $p(x_{ng} | z_n, l_n, s_n)$ as zero-inflated
negative binomial. Unlike ZINB-WaVE, which relies on linear models, we use deep
neural networks to parametrize these conditional distributions. We refer to our
manuscript for a complete specification of our probabilistic model.

Because the marginal distribution $p(x_n | s_n)$ is not amenable to Bayesian
computations, we rely on auto-encoding variational Bayes to learn an approximate
posterior $q(z_n, l_n | x_n, s_n)$ (Figure 1) and fit our generative model. In
particular, scVI relies on stochastic optimization by sampling from the
variational posterior and from the datasets, which ensures a scalable inference
procedure. scVI can also be used for hypothesis testing by sampling from the
variational posterior and identify gene of interest between cell-types.

Figure 1: Overview of scVI. Given a gene-expression matrix with batch
annotations as input, scVI learns a non-linear embedding of the cells that can
be used for multiple analysis tasks. We draw here the computational trees
(neural networks) used to compute the embedding as well as the distribution of
gene expression.

We refer to our manuscript for an extensive benchmarking study on a set of seven
different datasets and five different tasks (which includes embedding and
hypothesis testing). scVI compares favorably on all the datasets and all the
tasks against state-of-the-art algorithms in each task.

A Gene Expression Atlas of the Mouse Brain from a Million Single-cell Measurements

scVI provides a readily applicable solution for large-scale inference. In
particular, we could fit the 1.3 million cells dataset from 10x Genomics in less
than 1 hour on a small subset of 720 genes and less than 6 hours on a bigger
subset of 10,000 genes. The generative model embeds the data via its posterior
$q(z_n | x_n, s_n)$ (Figure 2) and identifies important genes between cell-types
via hypothesis testing.

Figure 2: tSNE embedding of a subset of cells from the 1.3M cells dataset
based on scVI’s latent space.

Towards more Systematic Integration of Prior Biological Knowledge

From this proof of concept, we are now working on extending this work to
incorporate prior biological knowledge such as integrating several datasets,
using semi-supervised models to annotate the cells, adding hierarchical
structure for cell-type taxonomy, and adding sparsity priors for greater
interpretability of the neural networks weights.

We hope that our approach will inspire the computational biology community to
develop probabilistic-model-based tools that are more consistent and better
grounded than ad-hoc algorithmic procedures.
Please feel free to read our manuscript published in Nature Methods
along with its associated News and Views. We have released code here
through a GitHub project.