
Introduction

t-distributed Stochastic Neighbor Embedding (t-SNE) is a highly
successful method for dimensionality reduction and visualization of
high-dimensional datasets. A popular implementation of t-SNE uses
the Barnes-Hut algorithm to approximate the gradient at each iteration
of gradient descent. We modified this implementation as follows:

Computation of the N-body Simulation: Instead of approximating the
N-body simulation using Barnes-Hut, we interpolate onto an equispaced
grid and use FFT to perform the convolution, dramatically reducing
the time to compute the gradient at each iteration of gradient
descent. See this
post for some intuition on how it works.
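For intuition, here is a minimal one-dimensional sketch of the idea (an
assumption-laden illustration, not the repository's code, which uses
polynomial interpolation and handles 2-D and 3-D embeddings): the point
"charges" are spread onto an equispaced grid, the Cauchy (Student-t)
kernel is applied to the whole grid with a single FFT-based convolution,
and the result is read back off the grid. The function name,
nearest-node spreading, and grid size below are illustrative only.

```python
import numpy as np

def fft_kernel_sums(y, phi, n_grid=256):
    """Approximate s_i = sum_j K(y_i - y_j) * phi_j for the Cauchy kernel
    K(d) = 1 / (1 + d^2) by spreading charges onto an equispaced grid,
    convolving with the kernel via FFT, and reading the result back.

    1-D sketch only; nearest-node spreading stands in for the polynomial
    interpolation used by the actual implementation.
    """
    y = np.asarray(y, dtype=float)
    phi = np.asarray(phi, dtype=float)
    lo, hi = y.min(), y.max()
    h = (hi - lo) / (n_grid - 1)

    # Spread each point's charge onto its nearest grid node.
    idx = np.clip(np.rint((y - lo) / h).astype(int), 0, n_grid - 1)
    charges = np.zeros(n_grid)
    np.add.at(charges, idx, phi)

    # Kernel values at grid displacements, embedded in a circulant of
    # size 2 * n_grid so the FFT computes a linear (not circular) sum.
    col = 1.0 / (1.0 + (h * np.arange(n_grid)) ** 2)
    c = np.concatenate([col, [0.0], col[1:][::-1]])
    padded = np.concatenate([charges, np.zeros(n_grid)])
    conv = np.fft.irfft(np.fft.rfft(c) * np.fft.rfft(padded), n=2 * n_grid)

    # Read the convolved field back off the grid at each point's node.
    return conv[:n_grid][idx]

# Example: the repulsive normalization terms Z_i = sum_j 1/(1 + (y_i - y_j)^2)
# y = np.random.randn(10_000)
# z = fft_kernel_sums(y, np.ones_like(y))
```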

Computation of Input Similarities: Instead of computing nearest
neighbors using vantage-point trees, we approximate nearest neighbors
using the Annoy library. The
neighbor lookups are multithreaded to take advantage of machines with
multiple cores. Using “near” neighbors as opposed to strictly
“nearest” neighbors is faster, but also has a smoothing effect, which
can be useful for embedding some datasets (see Linderman et al.
(2017)). If subtle detail is required
(e.g. in identifying small clusters), then use vantage-point trees
(which are also multithreaded in this implementation).
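As a sketch of the approach (not the repository's actual C++ code
path), approximate neighbors can be computed with Annoy's Python
bindings and the queries issued from a thread pool. The function name
and parameter values below are illustrative assumptions.

```python
from multiprocessing.pool import ThreadPool

import numpy as np
from annoy import AnnoyIndex

def approximate_neighbors(X, k=90, n_trees=50, n_threads=8):
    """Return the k approximate nearest neighbors of every row of X."""
    n, dim = X.shape
    index = AnnoyIndex(dim, "euclidean")
    for i in range(n):
        index.add_item(i, X[i])
    index.build(n_trees)  # more trees -> more accurate, slower to build

    def query(i):
        # Ask for one extra neighbor and drop the query point itself.
        ids, dists = index.get_nns_by_item(i, k + 1, include_distances=True)
        keep = [(j, d) for j, d in zip(ids, dists) if j != i][:k]
        nbr_ids, nbr_dists = zip(*keep)
        return list(nbr_ids), list(nbr_dists)

    # Thread pool for concurrent lookups (recent Annoy versions release
    # the GIL during queries); adjust n_threads to the machine.
    with ThreadPool(n_threads) as pool:
        results = pool.map(query, range(n))

    neighbors = np.array([r[0] for r in results])
    distances = np.array([r[1] for r in results])
    return neighbors, distances
```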

Early exaggeration: In Linderman and Steinerberger
(2017), we showed that
appropriately choosing the early exaggeration coefficient can lead to
improved embedding of swissrolls and other synthetic datasets.
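Concretely, exaggeration multiplies the attractive term of the gradient
by a coefficient. The exact O(N^2) sketch below (an illustration, not
the FFT-accelerated code) shows where the coefficient enters.

```python
import numpy as np

def tsne_gradient(P, Y, exaggeration=1.0):
    """Exact t-SNE gradient with the input similarities P scaled by an
    exaggeration coefficient:

        dC/dy_i = 4 * sum_j (exaggeration * p_ij - q_ij)
                        * (1 + ||y_i - y_j||^2)^(-1) * (y_i - y_j)

    O(N^2) reference version for illustration only.
    """
    diff = Y[:, None, :] - Y[None, :, :]                  # pairwise differences
    inv_dist = 1.0 / (1.0 + np.sum(diff ** 2, axis=-1))   # Student-t kernel
    np.fill_diagonal(inv_dist, 0.0)
    Q = inv_dist / inv_dist.sum()                         # low-dim similarities
    coeff = (exaggeration * P - Q) * inv_dist             # exaggeration scales attraction
    return 4.0 * np.sum(coeff[..., None] * diff, axis=1)
```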

Late exaggeration: Increasing the exaggeration coefficient late in
the optimization process (e.g. after 800 of 1000 iterations) can
improve separation of the clusters.
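Put together, a run might use a large coefficient for the first few
hundred iterations and a moderate one near the end. The coefficients
and switch points below are illustrative assumptions, not the values
used by this implementation.

```python
def exaggeration_coefficient(iteration, early_coeff=12.0, stop_early=250,
                             late_coeff=4.0, start_late=800):
    """Illustrative schedule: early exaggeration for the first iterations,
    no exaggeration in the middle, late exaggeration for the final stretch.
    The returned value multiplies the attractive term of the gradient
    (see the sketch above)."""
    if iteration < stop_early:
        return early_coeff
    if iteration >= start_late:
        return late_coeff
    return 1.0
```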