PCA is popular because it can be computationally efficient in many cases relevant to single-cell data, the axes it returns are interpretable (in that they are linear combinations of the biomolecules being measured), and it is possible to compute how much of the variance is captured in the top n PCs. Its interpretation is straightforward as well: It captures the directions of maximal variance. However, in PCA each PC must be orthogonal to the others, so it performs best on data with linear structure. Additionally, the variance in the data may not be its most informative characteristic. For these reasons, PCA is often a quick, first-pass visualization tool.
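As a minimal sketch of these properties in Python (scikit-learn and the synthetic matrix X are assumptions for illustration; a real analysis would use a measured cells-by-markers matrix):

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 1,000 cells measured on 30 biomolecules.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 30))

pca = PCA(n_components=10).fit(X)
coords = pca.transform(X)[:, :2]  # top-2 PC coordinates for a 2-D plot

# Each PC is a linear combination of the measured biomolecules,
# so its loadings are directly interpretable.
print(pca.components_[0])

# The variance captured by the top n PCs is available directly.
print(pca.explained_variance_ratio_[:2].sum())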

A more complex set of nonlinear approaches is also in use. One popular method, t-SNE (t-distributed stochastic neighbor embedding), performs a nonlinear embedding based on the Student's t-distribution and tends to preserve structure in mass cytometry data better than linear methods such as PCA [2]. This iterative algorithm assigns a (typically) two-dimensional position to each high-dimensional data point by optimizing an objective function in which points that are close in high-dimensional space are placed close to each other in 2-D, while other points are repelled. It is computationally burdensome for large numbers of cells, which can constrain its use in some applications.
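A hedged sketch of such an embedding, using scikit-learn's TSNE implementation on an assumed, synthetic cells-by-markers matrix X:

import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 30))  # hypothetical: 2,000 cells, 30 markers

# perplexity loosely sets the neighborhood size preserved in 2-D; the
# iterative optimization makes this much slower than PCA at scale.
embedding = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(X)
print(embedding.shape)  # (2000, 2): one 2-D position per cell

In practice the result is scatter-plotted, one point per cell, and inspected for structure.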

USING PROBABILISTIC MODELS TO INFER REGULATORY STRUCTURE

In biological systems, biomolecules interact in complex networks to effect a desired response, such as cell division, secretion of molecules to communicate with other cells, or regulation of cell metabolism. When biological networks malfunction, the result can be serious disease. A classic example is cancer, in which mutations affecting tumor suppressor or oncogenic proteins misregulate the biological network, which then instructs the cell to divide incessantly, resulting in a tumor, or to travel through the bloodstream, inducing metastases.

Resulting clusters are often sensitive to the particular choice of algorithm (k-means may yield a very different set of clusters than agglomerative approaches, for instance) as well as to the choice of distance metric. A metric such as Euclidean distance is magnitude sensitive, while rank-based methods such as Spearman correlation assess similarity of patterns and are magnitude independent.
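To make this sensitivity concrete, the sketch below (assuming a recent scikit-learn and SciPy; the data are synthetic) clusters the same cells with k-means under Euclidean distance and with agglomerative clustering under a Spearman-based distance:

import numpy as np
from scipy.stats import spearmanr
from sklearn.cluster import KMeans, AgglomerativeClustering

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))  # hypothetical: 300 cells, 20 markers

# k-means with Euclidean distance: magnitude sensitive.
km_labels = KMeans(n_clusters=5, n_init=10).fit_predict(X)

# Spearman correlation between cells: rank based, magnitude independent.
rho, _ = spearmanr(X, axis=1)   # (300, 300) cell-by-cell correlations
dist = 1.0 - rho                # turn similarity into a distance
agg_labels = AgglomerativeClustering(
    n_clusters=5, metric="precomputed", linkage="average").fit_predict(dist)

# The two label vectors will generally disagree.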

Cell populations tend to follow typical biological distributions, with many cells concentrated near the center of the population and density tapering off with distance from it. Because of this, they lend themselves well to separation via density-based approaches. Density-based methods, which are becoming more popular in the field, work by first estimating the local density, either by gridding the space or by using a nearest-neighbor or similar approach. They then separate populations along regions of low density. These methods are attractive partly because they mimic the human expert's process: Humans also look for dense concentrations of cells and naturally identify populations by separating along areas of low density.
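As a rough sketch of the idea, using DBSCAN (one widely known density-based algorithm, here as a stand-in rather than the specific tools used in cytometry) on synthetic data:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

# Two hypothetical dense populations in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(500, 2)),
               rng.normal(3.0, 0.3, size=(500, 2))])

# A nearest-neighbor density estimate: a small distance to the k-th
# neighbor means the point sits in a dense region.
knn_dist, _ = NearestNeighbors(n_neighbors=10).fit(X).kneighbors(X)
density = 1.0 / knn_dist[:, -1]

# DBSCAN separates along low-density regions; points in sparse areas
# receive the noise label -1.
labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(X)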

Just as a human must sometimes decide if a "bump" is really a different population or just a noisy corner of an existing population, so too must these methods handle this type of potential noise. Density-based methods may become computationally prohibitive as the number of dimensions grows large, or as the number of cells becomes large. Some density-based methods may work easily in two dimensions, but can become prohibitively expensive at three or more, and therefore must be applied to sequential sets of two dimensions each. Finally, both density-based methods and clustering suffer greatly from the curse of dimensionality. As the number of dimensions grows, the data become extremely sparse and the amount of data required to separate clusters of cells grows exponentially. Many rare and important biological stem cell subsets (sometimes tens of cells) are nearly impossible to identify statistically across all dimensions.
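A small numerical illustration of this sparsity (synthetic uniform data; exact numbers depend on the random draw): as dimension grows, nearest and farthest neighbors become nearly equidistant, eroding the density contrast these methods rely on.

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(500, d))
    dists = pdist(X)  # all pairwise Euclidean distances
    contrast = (dists.max() - dists.min()) / dists.mean()
    print(d, round(float(contrast), 3))  # shrinks as d grows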

While clustering and density-based approaches return distinct cell populations, dimensionality reduction approaches enable a low-dimensional visualization of high-dimensional data, which the biologist can then interpret with the standard (but powerful!) human pattern recognition system. Unlike other tools, these methods do not always try to capture all potential cell subsets, but instead strive to "squeeze" out extra dimensions and can return a user-defined number of dimensions (typically two, chosen for ease of visualization). One tried-and-true method is principal component analysis (PCA), which finds eigenvectors, or linear combinations of the original dimensions, termed "principal components" (PCs), along with a coefficient that indicates the importance of each such PC. The coefficients are the eigenvalues, and they indicate what proportion of the data's variance is captured by a particular PC. Typically, the top two PCs are used for a 2-D visualization, as they capture the two directions of maximal variance.
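A worked sketch of the computation just described, in plain NumPy (the correlated toy data are an assumption; np.linalg.eigh is appropriate because the covariance matrix is symmetric):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5)) @ rng.normal(size=(5, 5))  # correlated toy data

Xc = X - X.mean(axis=0)                   # center each dimension
cov = np.cov(Xc, rowvar=False)            # covariance of the measured dimensions
eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]         # sort PCs by importance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

var_ratio = eigvals / eigvals.sum()       # variance captured per PC
coords_2d = Xc @ eigvecs[:, :2]           # project cells onto the top two PCs
print(var_ratio[:2].sum())                # variance captured by the 2-D view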