Abstract

Information distances like the Hellinger distance and the Jensen-Shannon
divergence have deep roots in information theory and machine learning. They are
used extensively in data analysis, especially when the objects being compared are
high-dimensional empirical probability distributions built from data. However, we lack the common tools needed to use information distances in applications efficiently
and at scale with any kind of provable guarantees. We cannot sketch these
distances easily, embed them in better-behaved spaces, or even reduce the
dimensionality of the space while maintaining the probability structure of the data.

In this paper, we build these tools for information distances, both for the Hellinger distance and the Jensen–Shannon divergence, as well as for related measures such as the χ² divergence. We first show that they can be sketched efficiently (i.e., up to multiplicative error in sublinear space) in the aggregate streaming model. This result is exponentially stronger than the known upper bounds for sketching these distances in the strict turnstile streaming model. Second, we show a finite-dimensional embedding result for the Jensen–Shannon and χ² divergences that preserves pairwise distances. Finally, we prove a dimensionality reduction result for the Hellinger, Jensen–Shannon, and χ² divergences that preserves the information geometry of the distributions (specifically, by retaining the simplex structure of the space). While our second result already implies that these divergences can be explicitly embedded in Euclidean space, retaining the simplex structure is important because it allows us to continue doing inference in the reduced space. In essence, we preserve not just the distance structure but the underlying geometry of the space.

The space of information distances includes many distances that are used
extensively in data analysis. These include the well-known Bregman divergences, the α-divergences, and the f-divergences. In this work we focus on a subclass of the f-divergences that admit embeddings into some (possibly infinite-dimensional) Hilbert space, with a specific emphasis on the JS divergence. These divergences are used in statistical tests and estimators (Beran, 1977), as well as in image analysis (Peter and Rangarajan, 2008),
computer vision (Huang et al., 2005; Mahmoudi and Sapiro, 2009), and text
analysis (Dhillon et al., 2003; Eiron and McCurley, 2003). The f-divergences were introduced by Csiszár (1967) and, in their most general form, include
measures such as the Hellinger, JS, and χ² divergences (here we consider a symmetrized variant of the χ² distance).

To work with the geometry of these divergences effectively, at scale and in high dimensions, we need algorithmic tools that provide provably high-quality approximate representations of the geometry.
The techniques of sketching, embedding, and dimensionality
reduction have evolved as ways of dealing with this problem.

A sketch for a set of points with respect to a property P is a function that maps the data to a small summary from which property P can be evaluated, albeit with some approximation error. Linear sketches are especially useful for estimating a derived property of a data stream in a fast and compact way.
Complementing sketching, embedding techniques are one-to-one mappings that transform a collection of points lying in one space X to another (presumably easier) space Y, while approximately preserving distances between points. Dimensionality reduction is a special kind of embedding that preserves the structure of the space while reducing its dimension. These embedding techniques can be used in an almost “plug-and-play” fashion to speed up many algorithms in data analysis: for example, near-neighbor search (and classification), clustering, and closest-pair computations.

Unfortunately, while these tools are well developed for norms like ℓ₁ and ℓ₂, we lack such tools for information distances. This is not just a theoretical concern: information distances are semantically better suited to many tasks in machine learning, and building the appropriate algorithmic toolkit to manipulate them efficiently would greatly expand the settings in which they can be used.

1.1 Our contributions

Sketching information divergences.

Guha, Indyk, and McGregor (2007) proved an impossibility result, showing that a large class of information divergences cannot be sketched in sublinear space, even if we allow constant factor approximations. This result holds in the strict turnstile streaming model, in which the coordinates of two points x, y ∈ Δ_d are increased incrementally and we wish to maintain an estimate of the divergence between them. They left open the question of whether these divergences can be sketched in the aggregate streaming model, where each element of the stream gives the i-th coordinate of x or y in its entirety, but the coordinates may appear in an arbitrary order. We answer this in the affirmative for two important information distances, namely the Jensen–Shannon and χ² divergences.

Theorem.
A set of points P under the Jensen–Shannon (JS) or χ² divergence can be deterministically embedded into O((d²/ε)·log(d/ε)) dimensions under ℓ₂² with ε additive error. The same space bound holds when sketching JS or χ² in the aggregate streaming model.

Corollary.
Assuming polynomial precision, an AMS sketch for Euclidean distance can reduce the dimension to O((1/ε²)·log(1/ε)·log d) for a (1+ε) multiplicative approximation in the aggregate streaming setting.

Theorem.
A set of points P under the JS or χ² divergence can be embedded into ℓ₂^d̄ with d̄ = O(n²d³/ε²) and (1+ε) multiplicative error.

For both techniques, applying the Euclidean JL Lemma can further reduce the dimension to O(log n/ε²) in the offline setting.

Dimensionality reduction.

We then turn to the more challenging case of performing dimensionality reduction for information distances, where
we wish to preserve not only the distances between pairs of points (distributions),
but also the underlying simplicial structure of the space, so that we can
continue to interpret coordinates in the new space as probabilities. This notion
of a structure-preserving dimensionality reduction is implicit when
dealing with normed spaces (since we always map a normed space to another), but
requires an explicit mapping when dealing with more structured
spaces. We prove an analog of the classical JL Lemma:

Theorem.
For the Jensen–Shannon, Hellinger, and χ² divergences, there exists a structure-preserving dimensionality reduction from the high-dimensional simplex Δ_d to a low-dimensional simplex Δ_k, where k = O((log n)/ε²).

The theorem extends to “well-behaved” f-divergences (see Section 3 for a precise definition). Moreover, the dimensionality reduction is constructive for any divergence with a finite-dimensional kernel (such as the Hellinger divergence), or with an infinite-dimensional kernel that can be sketched in finite space, as we show is feasible for the JS and χ² divergences.

Our techniques.

The unifying approach behind our three results (sketching, embedding into ℓ₂², and dimensionality reduction) is a careful analysis of the infinite-dimensional kernel of the information divergences. Quantizing and truncating the kernel yields the sketching result, while sampling repeatedly from it produces an embedding into ℓ₂². Finally, given such an embedding, we show how to perform dimensionality reduction by proving that each of the divergences admits a region of the simplex where it is similar to ℓ₂². To the best of our knowledge, this is the first result that explicitly uses the kernel representation of these information distances to build approximate geometric structures; while the existence of a kernel for the Jensen–Shannon distance was well known, this structure had never been exploited for algorithmic advantage.

The works by Fuglede and Topsøe (2004), and then by Vedaldi and Zisserman (2012)
study embeddings of information divergences into an infinite-dimensional Hilbert space by representing them as an integral along a one-dimensional curve in ℂ. Vedaldi and Zisserman give an explicit formulation of this kernel for the JS and χ² divergences, for which a discretization (by quantizing and truncating) yields an additive-error embedding into a finite-dimensional ℓ₂². However, they obtain neither quantitative bounds on the dimension of the target space nor multiplicative approximation guarantees.

In the realm of sketches, Guha, Indyk, and McGregor (2007) show that Ω(n) space (where n is the length of the stream) is required in the strict turnstile model, even for a constant factor multiplicative approximation. These bounds hold for a wide range of information divergences, including the JS, Hellinger, and χ² divergences. They show, however, that an additive error of ε can be achieved using O((1/ε³)·log n) space.
In contrast, one can indeed achieve a multiplicative approximation in the aggregate streaming model for information divergences that have a finite-dimensional embedding into ℓ₂². For instance, Guha et al. (2006) observe that for the Hellinger distance, which has a trivial such embedding, sketching is equivalent to sketching ℓ₂² and hence can be done up to a (1+ε)-multiplicative approximation in O((1/ε²)·log n) space. This immediately implies a constant factor approximation of the JS and χ² divergences in the same space, but prior to our work no (1+ε)-sketching result for the JS and χ² divergences was known in any streaming model.

Moving onto dimensionality reduction from simplex to simplex, in the only other work we are aware of, Kyng, Phillips, and Venkatasubramanian (2010) show a limited dimensionality reduction result for the Hellinger distance. Their
approach works by showing that if the input points lie in a specific region of
the simplex, then a standard random projection will keep the points on a
lower-dimensional simplex while preserving the distances
approximately. Unfortunately, this region is a small ball centered in the interior of
the simplex, whose radius shrinks further as the dimension grows. This is in sharp contrast to our work here, where the input points are unconstrained.

While it does not admit a kernel, the ℓ₁ distance is also an
f-divergence, and it is therefore natural to investigate its potential
connection with the measures we study here. For ℓ₁, it is well known that
significant dimensionality reduction is not possible: an embedding with
distortion 1+ε requires the points to be embedded in n^{1−O(1/log(1/ε))} dimensions, which is nearly linear. This result was
proved (and strengthened) in a series of
works (Andoni et al., 2011; Regev, 2012; Lee and Naor, 2004; Brinkman and Charikar, 2005).

The general literature of sketching and embeddability in normed spaces is too
extensive to be reviewed here: we point the reader to Andoni et al. (2014) for
a full discussion of results in this area. One of the most famous applications
of dimension reduction is the Johnson–Lindenstrauss(JL) Lemma, which states
that any set of n points in ℓ22 can be embedded into
O(lognε2) dimensions in the same space while
preserving pairwise distances to within (1±ε). This result has
become a core step in algorithms for near neighbor search (Ailon and Chazelle, 2006; Andoni and Indyk, 2006), speeding up clustering
algorithms (Boutsidis et al., 2015), and efficient approximation of
matrices (Clarkson and Woodruff, 2013), among many others.

Although sketching, embeddability, and dimensionality reduction are related
operations, they are not always equivalent. For example, even though ℓ₁ and
ℓ₂ have very different behavior under dimensionality reduction, they can
both be sketched to arbitrary accuracy in the turnstile model (and in fact any
ℓ_p norm with p ≤ 2 can be sketched using p-stable
distributions (Indyk, 2000)). In the offline
setting, Andoni et al. (2014) show that sketching and embedding of normed spaces are equivalent: for any finite-dimensional normed space X, a constant-distortion, constant-space sketching algorithm
for X exists if and only if there exists a linear embedding of X
into ℓ_{1−ε}.

In this section, we define precisely the class of information divergences that we work with, and the specific properties that allow us to obtain our sketching, embedding, and dimensionality reduction results.
In what follows, Δ_d denotes the d-simplex: Δ_d = {(x_1,…,x_d) ∣ ∑_i x_i = 1 and x_i ≥ 0 for all i}. Let [d] = {1,…,d}.
Definition (f-divergence).
Let p and q be two distributions on [d]. A convex function f : [0,∞) → ℝ such that f(1) = 0 gives rise to an f-divergence D_f : Δ_d × Δ_d → ℝ as

D_f(p,q) = ∑_{i=1}^{d} p_i·f(q_i/p_i),

where we define 0·f(0/0) = 0, a·f(0/a) = a·lim_{u→0} f(u), and 0·f(a/0) = a·lim_{u→∞} f(u)/u.
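As a concrete illustration, the following minimal Python sketch evaluates D_f directly from a generator f (the function name is ours, and it assumes strictly positive coordinates so that the boundary conventions above never arise):

```python
import numpy as np

def f_divergence(f, p, q):
    """Evaluate D_f(p, q) = sum_i p_i * f(q_i / p_i).

    Assumes all coordinates of p are strictly positive, so the conventions
    for 0*f(0/0), a*f(0/a), and 0*f(a/0) are not needed.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * f(q / p)))

# Example: f(t) = (sqrt(t) - 1)^2 recovers the Hellinger divergence.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
print(f_divergence(lambda t: (np.sqrt(t) - 1.0) ** 2, p, q),
      np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))   # the two values coincide
```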

Definition (Regular distance).
We call a distance function D : X × X → ℝ regular if there exists a feature map ϕ : X → V, where V is a (possibly infinite-dimensional) Hilbert space, such that

D(x,y) = ‖ϕ(x) − ϕ(y)‖²   for all x, y ∈ X.

The work of Fuglede and Topsøe (2004) establishes that JS is regular;
Vedaldi and Zisserman (2012) construct an explicit feature map for the JS kernel, namely ϕ(x) = (Ψ_x(ω))_{ω∈ℝ}, where Ψ_x : ℝ → ℂ is given by

Ψ_x(ω) = exp(iω ln x)·√( 2x·sech(πω) / ((ln 4)(1 + 4ω²)) ).

Hence, for x, y ≥ 0,

JS(x,y) = ‖ϕ(x) − ϕ(y)‖² = ∫_{−∞}^{+∞} ‖Ψ_x(ω) − Ψ_y(ω)‖² dω.

The “embedding” for a given distribution p ∈ Δ_d is then the concatenation of the functions ϕ(p_i), i.e., ϕ(p) = (ϕ_{p_1},…,ϕ_{p_d}).
Definition (Well-behaved divergence).
A well-behaved f-divergence is a regular f-divergence such that f(1) = 0, f′(1) = 0, f″(1) > 0, and f‴(1) exists.

In this paper, we will focus on the following well-behaved f-divergences.
Definition.
The Jensen–Shannon (JS), Hellinger, and χ² divergences between distributions p and q are defined as (all logarithms are base 2):

JS(p,q) = ∑_{i=1}^{d} [ p_i·log(2p_i/(p_i+q_i)) + q_i·log(2q_i/(p_i+q_i)) ],
He(p,q) = ∑_{i=1}^{d} (√p_i − √q_i)²,
χ²(p,q) = ∑_{i=1}^{d} (p_i − q_i)²/(p_i + q_i).
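These three divergences fit the definition of an f-divergence with the generators f(t) = t·log₂(2t/(1+t)) + log₂(2/(1+t)) (JS), f(t) = (√t − 1)² (Hellinger), and f(t) = (t−1)²/(t+1) (χ²); these expressions are our reconstruction from the coordinate-wise formulas above. The short Python check below verifies numerically that each generator satisfies the well-behavedness conditions f(1) = 0, f′(1) = 0, and f″(1) > 0:

```python
import numpy as np

# Candidate generators for D_f(p, q) = sum_i p_i * f(q_i / p_i)
# (our reconstruction from the coordinate-wise formulas above).
gens = {
    "JS":        lambda t: t * np.log2(2 * t / (1 + t)) + np.log2(2 / (1 + t)),
    "Hellinger": lambda t: (np.sqrt(t) - 1.0) ** 2,
    "chi^2":     lambda t: (t - 1.0) ** 2 / (t + 1.0),
}

h = 1e-4
for name, f in gens.items():
    val = f(1.0)                                        # f(1), should be 0
    d1 = (f(1 + h) - f(1 - h)) / (2 * h)                # f'(1), should be ~0
    d2 = (f(1 + h) - 2 * f(1.0) + f(1 - h)) / h ** 2    # f''(1), should be > 0
    print(f"{name:10s} f(1)={val:+.1e}  f'(1)={d1:+.1e}  f''(1)={d2:.3f}")
```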

We present two algorithms for embedding JS into ℓ₂². The first is deterministic and gives an additive-error approximation, whereas the second is
randomized but yields a multiplicative
approximation in an offline setting. The advantage of the first algorithm is that it can be
realized in the streaming model and, under the standard assumption of polynomial
precision of the streaming input, yields a (1+ε)-multiplicative approximation in this setting as well.

We derive some terms in the kernel representation of JS(x,y) that we will find convenient. First, the explicit formulation in Section 3 yields that for x, y ≥ 0:

JS(x,y) = ∫_{−∞}^{+∞} ‖ e^{iω ln x}·√(2x·sech(πω)/((ln 4)(1+4ω²))) − e^{iω ln y}·√(2y·sech(πω)/((ln 4)(1+4ω²))) ‖² dω
        = ∫_{−∞}^{+∞} κ(ω)·‖ √x·e^{iω ln x} − √y·e^{iω ln y} ‖² dω,   where κ(ω) = 2·sech(πω)/((ln 4)(1+4ω²)).

For convenience, we define h(x,y,ω) = ‖ √x·e^{iω ln x} − √y·e^{iω ln y} ‖², so that JS(p,q) = ∑_{i=1}^{d} f_J(p_i,q_i) where

f_J(x,y) = ∫_{−∞}^{∞} h(x,y,ω)·κ(ω) dω.

It is easy to verify that κ(ω) is a distribution, i.e., ∫_{−∞}^{∞} κ(ω) dω = 1.
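These identities can be checked numerically; the snippet below (ours) integrates h·κ on a fine grid and compares against the coordinate-wise JS term x·log₂(2x/(x+y)) + y·log₂(2y/(x+y)), and also confirms that κ integrates to one:

```python
import numpy as np

def kappa_js(w):
    # kappa(w) = 2 * sech(pi w) / (ln 4 * (1 + 4 w^2))
    return 2.0 / (np.cosh(np.pi * w) * np.log(4.0) * (1.0 + 4.0 * w ** 2))

def h(x, y, w):
    # |sqrt(x) e^{i w ln x} - sqrt(y) e^{i w ln y}|^2
    return np.abs(np.sqrt(x) * np.exp(1j * w * np.log(x))
                  - np.sqrt(y) * np.exp(1j * w * np.log(y))) ** 2

w = np.linspace(-10.0, 10.0, 200001)          # kappa decays like e^{-pi|w|}, so this range suffices
dw = w[1] - w[0]
x, y = 0.3, 0.05
lhs = np.sum(h(x, y, w) * kappa_js(w)) * dw   # f_J(x, y) via the kernel integral
rhs = x * np.log2(2 * x / (x + y)) + y * np.log2(2 * y / (x + y))
print(lhs, rhs)                               # agree up to discretization error
print(np.sum(kappa_js(w)) * dw)               # ~ 1: kappa is a distribution
```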

4.1 Deterministic embedding

We will produce an embedding ϕ(p) = (ϕ_{p_1},…,ϕ_{p_d}),
where each ϕ_{p_i} is an integral that we can discretize by
quantizing and truncating carefully.

Algorithm 4.1: Embed p ∈ Δ_d under JS into ℓ₂².
Input: p = {p_1,…,p_d}, where coordinates are ordered by arrival.
Output: a vector c_p of length O((d²/ε)·log(d/ε)).

  ℓ ← 1;  J ← ⌈(32d/ε)·ln(8d/ε)⌉
  for j ← −J to J:  ω_j ← j·ε/(32d)
  for i ← 1 to d:
    for j ← −J to J−1:
      a^p_ℓ ← √p_i · cos(ω_j·ln p_i) · √(∫_{ω_j}^{ω_{j+1}} κ(ω) dω)
      b^p_ℓ ← √p_i · sin(ω_j·ln p_i) · √(∫_{ω_j}^{ω_{j+1}} κ(ω) dω)
      ℓ ← ℓ + 1
  return a^p concatenated with b^p.
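A minimal Python rendering of Algorithm 4.1 follows; the function names are ours, and the per-bin integrals of κ are approximated with a trapezoid rule, which is an implementation shortcut rather than part of the analysis:

```python
import numpy as np

def kappa_js(w):
    return 2.0 / (np.cosh(np.pi * w) * np.log(4.0) * (1.0 + 4.0 * w ** 2))

def js_embed(p, eps):
    """Deterministic embedding of p in Delta_d under JS into l_2^2 (Algorithm 4.1).

    Returns a vector a_p with ||a_p - a_q||_2^2 = JS(p, q) +- eps.
    """
    p = np.asarray(p, dtype=float)
    d = len(p)
    step = eps / (32.0 * d)
    J = int(np.ceil((32.0 * d / eps) * np.log(8.0 * d / eps)))
    omega = step * np.arange(-J, J + 1)                   # omega_{-J}, ..., omega_J
    kap = kappa_js(omega)
    bin_w = np.sqrt(0.5 * (kap[:-1] + kap[1:]) * step)    # sqrt of the kappa mass of each bin
    logp = np.log(np.maximum(p, 1e-300))                  # a zero coordinate contributes zeros
    phase = np.outer(logp, omega[:-1])                    # omega_j * ln p_i for all i, j
    amp = np.sqrt(p)[:, None] * bin_w[None, :]
    return np.concatenate([(amp * np.cos(phase)).ravel(),
                           (amp * np.sin(phase)).ravel()])

def js_exact(p, q):
    # Coordinate-wise JS with base-2 logarithms; assumes strictly positive coordinates.
    m = p + q
    return float(np.sum(p * np.log2(2 * p / m) + q * np.log2(2 * q / m)))

p = np.array([0.6, 0.3, 0.1]); q = np.array([0.2, 0.3, 0.5])
print(np.sum((js_embed(p, 0.05) - js_embed(q, 0.05)) ** 2), js_exact(p, q))
```

The embedded vector has 4dJ entries, matching the O((d²/ε)·log(d/ε)) bound stated above.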

To analyze Algorithm 4.1, we first obtain bounds on the function h and its derivative.
Lemma. For 0 ≤ x, y ≤ 1, we have 0 ≤ h(x,y,ω) ≤ 2 and |∂h(x,y,ω)/∂ω| ≤ 16.
Proof.
Clearly h(x,y,ω) ≥ 0. Furthermore, since 0 ≤ x, y ≤ 1, we have

h(x,y,ω) ≤ ‖√x·e^{iω ln x}‖² + ‖√y·e^{iω ln y}‖² = x + y ≤ 2.

Next,

|∂h(x,y,ω)/∂ω| = | 2·(√x·cos(ω ln x) − √y·cos(ω ln y))·(−√x·sin(ω ln x)·ln x + √y·sin(ω ln y)·ln y)
                   + 2·(√x·sin(ω ln x) − √y·sin(ω ln y))·(√x·cos(ω ln x)·ln x − √y·cos(ω ln y)·ln y) |
               ≤ 2·|(√x + √y)·(√x·|ln x| + √y·|ln y|)| + 2·|(√x + √y)·(√x·|ln x| + √y·|ln y|)| ≤ 16,

where the last inequality follows since max_{0≤x≤1} |√x·ln x| < 1.
The next two steps approximate the infinite-dimensional continuous representation by a finite-dimensional discrete one, by appropriately truncating and quantizing the integral.
Lemma (Truncation). For t ≥ ln(4/ε),

f_J(x,y) ≥ ∫_{−t}^{t} h(x,y,ω)·κ(ω) dω ≥ f_J(x,y) − ε.

Proof.
The first inequality follows since h(x,y,ω) ≥ 0. For the second inequality, we use h(x,y,ω) ≤ 2: the discarded tails contribute at most 2·∫_{|ω|≥t} κ(ω) dω, which is at most ε for t ≥ ln(4/ε) since κ(ω) decays exponentially in |ω|.

Define ω_i = εi/16 for i ∈ {…,−2,−1,0,1,2,…} and
h̃(x,y,ω) = h(x,y,ω_i) where i = max{j ∣ ω_j ≤ ω}.

Lemma (Quantization). For any a, b,

∫_a^b h(x,y,ω)·κ(ω) dω = ∫_a^b h̃(x,y,ω)·κ(ω) dω ± ε.

Proof.
First note that

|h̃(x,y,ω) − h(x,y,ω)| ≤ (ε/16)·max_{x,y∈[0,1], ω} |∂h(x,y,ω)/∂ω| ≤ ε.

Hence,

| ∫_a^b h̃(x,y,ω)·κ(ω) dω − ∫_a^b h(x,y,ω)·κ(ω) dω | ≤ ∫_a^b ε·κ(ω) dω ≤ ε.

Given a real number z, define vectors v^z and u^z, indexed by i ∈ {−i*,…,−2,−1,0,1,2,…,i*} where i* = ⌈16·ε⁻¹·ln(4/ε)⌉, by

v^z_i = √z·cos(ω_i·ln z)·√(∫_{ω_i}^{ω_{i+1}} κ(ω) dω),   u^z_i = √z·sin(ω_i·ln z)·√(∫_{ω_i}^{ω_{i+1}} κ(ω) dω),

and note that

(v^x_i − v^y_i)² + (u^x_i − u^y_i)² = h(x,y,ω_i)·∫_{ω_i}^{ω_{i+1}} κ(ω) dω.

Therefore,

‖v^x − v^y‖₂² + ‖u^x − u^y‖₂²
  = ∫_{ω_{−i*}}^{ω_{i*+1}} h̃(x,y,ω)·κ(ω) dω
  = ∫_{ω_{−i*}}^{ω_{i*+1}} h(x,y,ω)·κ(ω) dω ± ε
  = ∫_{−∞}^{∞} h(x,y,ω)·κ(ω) dω ± 2ε  =  f_J(x,y) ± 2ε,

where the second equality follows from the Quantization Lemma and the third from the Truncation Lemma, since min(|ω_{−i*}|, ω_{i*+1}) ≥ ln(4/ε).

Define the vector a^p by concatenating v^{p_i} and u^{p_i} for i ∈ [d]. It then follows that

‖a^p − a^q‖₂² = JS(p,q) ± 2εd.

Hence we have reduced the problem of estimating JS(p,q) to ℓ₂ estimation. Rescaling ε ← ε/(2d) ensures that the additive error is ε, while the length of the vectors a^p and a^q is O((d²/ε)·log(d/ε)).
Theorem 4.1. Algorithm 4.1 embeds a set P of points under JS into O((d²/ε)·log(d/ε)) dimensions under ℓ₂² with ε additive error, independently of |P|.

Note that using the JL Lemma, the dimensionality of the target space can be further reduced to O(log|P|/ε²). Theorem 4.1, together with the AMS sketch of Alon et al. (1996) and the standard assumption of polynomial precision, immediately implies:
Corollary. There is an algorithm in the aggregate streaming model that approximates JS to within a (1+ε) multiplicative factor using O((1/ε²)·log(1/ε)·log d) space.

As noted earlier, this is the first algorithm in the aggregate streaming model to obtain a (1+ε)-multiplicative approximation to JS, in contrast with the linear space lower bounds for the same problem in the update (strict turnstile) streaming model.
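To see why this works in the aggregate stream, note that each coordinate p_i contributes its own block of entries to the embedded vector, so a linear sketch of that vector can be updated as the coordinates arrive in arbitrary order. The Python sketch below (ours) illustrates the pattern, with a plain random ±1 projection standing in for the AMS sketch; a real implementation would generate the signs from 4-wise independent hash functions rather than a seeded generator:

```python
import numpy as np

d, eps, k = 3, 0.2, 400                      # dimension, target error, sketch size (illustrative)
step = eps / (32.0 * d)
J = int(np.ceil((32.0 * d / eps) * np.log(8.0 * d / eps)))
omega = step * np.arange(-J, J + 1)
kap = 2.0 / (np.cosh(np.pi * omega) * np.log(4.0) * (1.0 + 4.0 * omega ** 2))
bin_w = np.sqrt(0.5 * (kap[:-1] + kap[1:]) * step)
block_len = 4 * J                            # cosine block + sine block per coordinate

def coord_block(x):
    """Embedding entries contributed by a single coordinate value x."""
    if x <= 0:
        return np.zeros(block_len)
    phase = omega[:-1] * np.log(x)
    return np.sqrt(x) * np.concatenate([np.cos(phase), np.sin(phase)]) * np.tile(bin_w, 2)

def signs_for(i):
    # +/-1 signs derived deterministically from the coordinate index, so the
    # projection never needs to be stored explicitly.
    return np.random.default_rng(i).choice([-1.0, 1.0], size=(k, block_len))

def stream_sketch(stream):
    """`stream` yields (i, value) pairs in arbitrary order (aggregate model)."""
    s = np.zeros(k)
    for i, x in stream:
        s += signs_for(i) @ coord_block(x)
    return s / np.sqrt(k)

sp = stream_sketch([(0, 0.6), (2, 0.1), (1, 0.3)])   # coordinates of p, out of order
sq = stream_sketch([(1, 0.3), (0, 0.2), (2, 0.5)])
print(np.sum((sp - sq) ** 2))   # ~ JS(p, q), up to embedding and sketching error
```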

4.2 Randomized embedding

In this section we show how to embed n points under JS into ℓ₂^d̄ with (1+ε) distortion, where d̄ = O(n²d³ε⁻²).

For fixed x, y ∈ [0,1], we first consider the random variable T
that takes the value h(x,y,ω), where ω is drawn from the density κ(ω).
(Recall that κ(·) is a distribution.) We compute the
first and second moments of T.

Theorem. E[T] = f_J(x,y) and var[T] ≤ 36·(f_J(x,y))².
Proof.
The expectation follows immediately from the definition:

E[T] = ∫_{−∞}^{∞} h(x,y,ω)·κ(ω) dω = f_J(x,y).

To bound the variance, it will be useful to consider the function f_H(x,y) = (√x − √y)², corresponding to the one-dimensional Hellinger distance, which is related to f_J(x,y) as follows. We now state two claims regarding f_H(x,y) and f_χ(x,y):

Claim 4.1. For all x, y ∈ [0,1], f_H(x,y) ≤ 2·f_J(x,y).

Proof.
Let f_χ(x,y) = (x−y)²/(x+y) correspond to the one-dimensional χ² distance. Then we have

f_χ(x,y)/f_H(x,y) = (x−y)²/((x+y)·(√x−√y)²) = (√x+√y)²/(x+y) = (x + y + 2√(xy))/(x+y) ≥ 1.

This shows that f_H(x,y) ≤ f_χ(x,y). To show f_χ(x,y) ≤ 2·f_J(x,y) we refer the reader to (Topsøe, 2000, Section 3). Combining these two relationships gives the claim.
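The chain f_H ≤ f_χ ≤ 2·f_J used here can also be confirmed numerically on a grid; the snippet below (ours) uses the base-2 form of f_J from Section 3:

```python
import numpy as np

g = np.linspace(1e-6, 1.0, 400)
X, Y = np.meshgrid(g, g)
fH = (np.sqrt(X) - np.sqrt(Y)) ** 2
fchi = (X - Y) ** 2 / (X + Y)
fJ = X * np.log2(2 * X / (X + Y)) + Y * np.log2(2 * Y / (X + Y))
print(np.all(fH <= fchi + 1e-12), np.all(fchi <= 2 * fJ + 1e-12))   # True True
```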

We then bound h(x,y,ω) in terms of f_H(x,y) as follows.

Claim 4.2. For all x, y ∈ [0,1] and ω ∈ ℝ,

h(x,y,ω) ≤ f_H(x,y)·(1 + 2|ω|)².

Proof.
Without loss of generality, assume x ≥ y. Then

√h(x,y,ω) = |√x·e^{iω ln x} − √y·e^{iω ln y}|
          ≤ |√x·e^{iω ln x} − √y·e^{iω ln x}| + |√y·e^{iω ln x} − √y·e^{iω ln y}|
          = |√x − √y| + √y·|e^{iω ln x} − e^{iω ln y}|
          = |√x − √y| + 2√y·|sin(ω·ln(x/y)/2)|
          ≤ √(f_H(x,y)) + 2√y·|ω·ln(√(x/y))|
          ≤ √(f_H(x,y)) + 2√y·|√(x/y) − 1|·|ω|
          = √(f_H(x,y)) + 2·√(f_H(x,y))·|ω|,

and hence h(x,y,ω) ≤ f_H(x,y)·(1 + 2|ω|)² as required.

These claims allow us to bound the variance:

var[T] ≤ E[T²] = ∫_{−∞}^{∞} (h(x,y,ω))²·κ(ω) dω
       ≤ (f_H(x,y))²·∫_{−∞}^{∞} (1 + 2|ω|)⁴·κ(ω) dω
       = (f_H(x,y))²·8.94 < 36·(f_J(x,y))²,

where the last inequality uses Claim 4.1.

This naturally gives rise to the following algorithm.
Algorithm: Embed p ∈ Δ_d under JS into ℓ₂² (randomized).
Input: p = {p_1,…,p_d}.
Output: a vector c_p of length O(n²d³ε⁻²).

  ℓ ← 1;  s ← ⌈36n²d²ε⁻²⌉
  for j ← 1 to s:  ω_j ← a draw from κ(ω)
  for i ← 1 to d:
    for j ← 1 to s:
      a^p_ℓ ← √(p_i/s) · cos(ω_j·ln p_i)
      b^p_ℓ ← √(p_i/s) · sin(ω_j·ln p_i)
      ℓ ← ℓ + 1
  return a^p concatenated with b^p.
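A compact Python sketch of this randomized embedding is given below. Sampling from κ is done by rejection from a Cauchy(0, 1/2) proposal, for which the acceptance probability works out to sech(πω); the sampling strategy and function names are ours:

```python
import numpy as np
rng = np.random.default_rng(1)

def sample_kappa_js(size):
    """Draw from kappa(w) = 2 sech(pi w) / (ln 4 (1 + 4 w^2)) by rejection:
    propose w ~ Cauchy(0, 1/2) and accept with probability sech(pi w)."""
    out = np.empty(0)
    while out.size < size:
        w = 0.5 * rng.standard_cauchy(size)
        keep = rng.random(size) < 1.0 / np.cosh(np.clip(np.pi * w, -700, 700))
        out = np.concatenate([out, w[keep]])
    return out[:size]

def js_embed_sampled(p, omegas):
    """Randomized embedding: one cosine and one sine entry per (coordinate, sample)
    pair, each scaled by sqrt(p_i / t)."""
    p = np.asarray(p, dtype=float)
    t = len(omegas)
    phase = np.outer(np.log(np.maximum(p, 1e-300)), omegas)   # d x t
    amp = np.sqrt(p / t)[:, None]
    return np.concatenate([(amp * np.cos(phase)).ravel(),
                           (amp * np.sin(phase)).ravel()])

p = np.array([0.6, 0.3, 0.1]); q = np.array([0.2, 0.3, 0.5])
omegas = sample_kappa_js(20000)            # the same samples are shared by all embedded points
print(np.sum((js_embed_sampled(p, omegas) - js_embed_sampled(q, omegas)) ** 2))
# ~ JS(p, q); the concentration improves as the number of samples grows.
```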
Let ω_1,…,ω_t be t independent samples drawn according to κ(ω). For any distribution p on [d], define vectors v^p, u^p ∈ ℝ^{td} where, for i ∈ [d] and j ∈ [t],

v^p_{i,j} = √(p_i/t)·cos(ω_j·ln p_i),   u^p_{i,j} = √(p_i/t)·sin(ω_j·ln p_i).

Let v^p_i be the concatenation of v^p_{i,j} and u^p_{i,j} over all j ∈ [t]. Then note that E[‖v^p_i − v^q_i‖₂²] = f_J(p_i,q_i) and var[‖v^p_i − v^q_i‖₂²] ≤ 36·(f_J(p_i,q_i))²/t.
Hence, for t = 36n²d²ε⁻², by an application of the Chebyshev bound,

Pr[ |‖v^p_i − v^q_i‖₂² − f_J(p_i,q_i)| ≥ ε·f_J(p_i,q_i) ] ≤ 36·ε⁻²/t = (nd)⁻².    (4.1)

By an application of the union bound over all coordinates i ∈ [d] and all pairs of points p, q ∈ P:

Pr[ ∃ i ∈ [d], p, q ∈ P : |‖v^p_i − v^q_i‖₂² − f_J(p_i,q_i)| ≥ ε·f_J(p_i,q_i) ] ≤ 1/d.

Hence, if v^p is the concatenation of v^p_i over all i ∈ [d], then with probability at least 1 − 1/d it holds for all p, q ∈ P that

(1−ε)·JS(p,q) ≤ ‖v^p − v^q‖₂² ≤ (1+ε)·JS(p,q).

The final length of the vectors is then td = 36n²d³ε⁻², and they approximately preserve the distance between every pair of points with probability at least 1 − 1/d. The dimension can be reduced further to O(log n/ε²) by simply applying the JL Lemma.
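A self-contained sketch of that final JL step (with random vectors standing in for the embedded point set, and an illustrative constant in the target dimension):

```python
import numpy as np
rng = np.random.default_rng(2)

n, D, eps = 50, 20000, 0.25
V = rng.normal(size=(n, D))                          # stand-in for the embedded points
k = int(np.ceil(8 * np.log(n) / eps ** 2))           # O(log n / eps^2); the constant 8 is illustrative
W = V @ (rng.normal(size=(D, k)) / np.sqrt(k))       # Gaussian JL projection

def pdist2(X):
    sq = np.sum(X ** 2, axis=1)
    return sq[:, None] + sq[None, :] - 2.0 * X @ X.T

orig, red = pdist2(V), pdist2(W)
mask = orig > 0
ratio = red[mask] / orig[mask]
print(ratio.min(), ratio.max())                      # concentrated in roughly [1 - eps, 1 + eps]
```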

We give here two algorithms for embedding the χ² divergence into ℓ₂². The computation and the resulting two algorithms are highly analogous to those of Section 4. First, the explicit formulation given by Vedaldi and Zisserman (2012) yields that for x, y ≥ 0:

χ²(x,y) = ∫_{−∞}^{+∞} ‖ e^{iω ln x}·√(x·sech(πω)) − e^{iω ln y}·√(y·sech(πω)) ‖² dω
        = ∫_{−∞}^{+∞} sech(πω)·‖ √x·e^{iω ln x} − √y·e^{iω ln y} ‖² dω.

For convenience, we now define

h(x,y,ω) = ‖ √x·e^{iω ln x} − √y·e^{iω ln y} ‖²

and κ_χ(ω) = sech(πω).

We can then write χ²(p,q) = ∑_{i=1}^{d} f_χ(p_i,q_i), where

f_χ(x,y) = ∫_{−∞}^{∞} h(x,y,ω)·κ_χ(ω) dω = (x−y)²/(x+y).

It is easy to verify that κ_χ(ω) is a distribution, i.e., ∫_{−∞}^{∞} κ_χ(ω) dω = 1.
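Both facts are easy to confirm numerically; the grid-based integration below is ours:

```python
import numpy as np

def h(x, y, w):
    return np.abs(np.sqrt(x) * np.exp(1j * w * np.log(x))
                  - np.sqrt(y) * np.exp(1j * w * np.log(y))) ** 2

w = np.linspace(-10.0, 10.0, 200001)
dw = w[1] - w[0]
kap = 1.0 / np.cosh(np.pi * w)                        # kappa_chi(w) = sech(pi w)
x, y = 0.3, 0.05
print(np.sum(h(x, y, w) * kap) * dw, (x - y) ** 2 / (x + y))   # the two agree
print(np.sum(kap) * dw)                                        # ~ 1
```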

5.1 Deterministic embedding

We will produce an embedding ϕ(p) = (ϕ_{p_1},…,ϕ_{p_d}),
where each ϕ_{p_i} is an integral that we discretize appropriately.

Algorithm 5.1: Embed p ∈ Δ_d under χ² into ℓ₂².
Input: p = {p_1,…,p_d}, where coordinates are ordered by arrival.
Output: a vector c_p of length O((d²/ε)·log(d/ε)).

  ℓ ← 1;  J ← ⌈(32d/ε)·ln(6d/ε)⌉
  for j ← −J to J:  ω_j ← j·ε/(32d)
  for i ← 1 to d:
    for j ← −J to J−1:
      a^p_ℓ ← √p_i · cos(ω_j·ln p_i) · √(∫_{ω_j}^{ω_{j+1}} κ_χ(ω) dω)
      b^p_ℓ ← √p_i · sin(ω_j·ln p_i) · √(∫_{ω_j}^{ω_{j+1}} κ_χ(ω) dω)
      ℓ ← ℓ + 1
  return a^p concatenated with b^p.
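A Python rendering of Algorithm 5.1 (ours) is nearly identical to the JS version; the only differences are the kernel κ_χ(ω) = sech(πω), whose bin integrals have the closed form (1/π)·arctan(sinh(πω)) evaluated at the bin edges, and the truncation point ln(6d/ε):

```python
import numpy as np

def chi2_embed(p, eps):
    """Deterministic embedding of p in Delta_d under chi^2 into l_2^2 (Algorithm 5.1).

    Returns a vector a_p with ||a_p - a_q||_2^2 = chi^2(p, q) +- eps.
    """
    p = np.asarray(p, dtype=float)
    d = len(p)
    step = eps / (32.0 * d)
    J = int(np.ceil((32.0 * d / eps) * np.log(6.0 * d / eps)))
    omega = step * np.arange(-J, J + 1)
    cdf = np.arctan(np.sinh(np.pi * omega)) / np.pi          # antiderivative of sech(pi w)
    bin_w = np.sqrt(np.maximum(np.diff(cdf), 0.0))           # sqrt of the kappa_chi mass of each bin
    phase = np.outer(np.log(np.maximum(p, 1e-300)), omega[:-1])
    amp = np.sqrt(p)[:, None] * bin_w[None, :]
    return np.concatenate([(amp * np.cos(phase)).ravel(),
                           (amp * np.sin(phase)).ravel()])

p = np.array([0.6, 0.3, 0.1]); q = np.array([0.2, 0.3, 0.5])
print(np.sum((chi2_embed(p, 0.05) - chi2_embed(q, 0.05)) ** 2),
      float(np.sum((p - q) ** 2 / (p + q))))                 # agree up to the additive error
```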

Lemma. For 0 ≤ x, y ≤ 1, we have 0 ≤ h(x,y,ω) ≤ 2 and |∂h(x,y,ω)/∂ω| ≤ 16.

As in Section 4, the next two steps analyze the truncation and quantization of the integral.

Lemma (Truncation). For t ≥ ln(3/ε),

f_χ(x,y) ≥ ∫_{−t}^{t} h(x,y,ω)·κ_χ(ω) dω ≥ f_χ(x,y) − ε.

Proof.
The first inequality follows since h(x,y,ω) ≥ 0. For the second inequality, we use h(x,y,ω) ≤ 2: the discarded tails contribute at most 2·∫_{|ω|≥t} κ_χ(ω) dω, which is at most ε for t ≥ ln(3/ε) since κ_χ(ω) = sech(πω) ≤ 2e^{−π|ω|}.

Define ω_i = εi/16 for i ∈ {…,−2,−1,0,1,2,…} and
h̃(x,y,ω) = h(x,y,ω_i) where i = max{j ∣ ω_j ≤ ω}. We recall the following lemma from Section 4:

Lemma (Quantization). For any a, b,

∫_a^b h(x,y,ω)·κ_χ(ω) dω = ∫_a^b h̃(x,y,ω)·κ_χ(ω) dω ± ε.

Given a real number z, define vectors v^z and u^z, indexed by i ∈ {−i*,…,−2,−1,0,1,2,…,i*} where i* = ⌈16·ε⁻¹·ln(3/ε)⌉, by

v^z_i = √z·cos(ω_i·ln z)·√(∫_{ω_i}^{ω_{i+1}} κ_χ(ω) dω),   u^z_i = √z·sin(ω_i·ln z)·√(∫_{ω_i}^{ω_{i+1}} κ_χ(ω) dω),

and note that

(v^x_i − v^y_i)² + (u^x_i − u^y_i)² = h(x,y,ω_i)·∫_{ω_i}^{ω_{i+1}} κ_χ(ω) dω.

Therefore,

‖v^x − v^y‖₂² + ‖u^x − u^y‖₂²
  = ∫_{ω_{−i*}}^{ω_{i*+1}} h̃(x,y,ω)·κ_χ(ω) dω
  = ∫_{ω_{−i*}}^{ω_{i*+1}} h(x,y,ω)·κ_χ(ω) dω ± ε
  = ∫_{−∞}^{∞} h(x,y,ω)·κ_χ(ω) dω ± 2ε  =  f_χ(x,y) ± 2ε,

where the second equality follows from the Quantization Lemma and the third from the Truncation Lemma, since min(|ω_{−i*}|, ω_{i*+1}) ≥ ln(3/ε).

Define the vector a^p by concatenating v^{p_i} and u^{p_i} for i ∈ [d]. It then follows that

‖a^p − a^q‖₂² = χ²(p,q) ± 2εd.

Hence we have reduced the problem of estimating χ²(p,q) to ℓ₂ estimation. Rescaling ε ← ε/(2d) ensures that the additive error is ε, while the length of the vectors a^p and a^q is O((d²/ε)·log(d/ε)).
Theorem. Algorithm 5.1 embeds a set P of points under χ² into O((d²/ε)·log(d/ε)) dimensions under ℓ₂² with ε additive error, independently of |P|.