Biased Range Trees

Vida Dujmović
John Howat
Pat Morin

Abstract

A data structure, called a biased range tree, is presented that
preprocesses a set S of n points in ℝ² and a query
distribution D for 2-sided orthogonal range counting queries. The
expected query time for this data structure, when queries are drawn
according to D, matches, to within a constant factor, that of the
optimal decision tree for S and D. The memory and preprocessing
requirements of the data structure are O(n log n).

Let S be a set of n points in ℝ² and let D be a probability
measure over ℝ². A 2-sided orthogonal range counting query
over S asks, for a query point q=(qx,qy), to report the number
of points (px,py)∈S such that px≥qx and py≥qy.
A 2-sided range counting query has distribution D if the
query point q is chosen from the probability measure D. If T is
a data structure for answering 2-sided range counting queries over S
then we denote by μD(T) the expected time, using T, to answer a
range query with distribution D. The current paper is concerned
with preprocessing the pair (S,D) to build a data structure T that
minimizes μD(T).
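
As a point of reference, the query semantics defined above can be written as a brute-force counter that inspects every point. This is only a minimal sketch for checking answers, not part of the data structure described in this paper; the function name `count_dominating` and the sample point set are our own illustrative choices.

```python
def count_dominating(points, q):
    """Count the points (px, py) with px >= qx and py >= qy by brute force."""
    qx, qy = q
    return sum(1 for (px, py) in points if px >= qx and py >= qy)


# Hypothetical sample data, used only for illustration.
S = [(1, 4), (2, 2), (3, 5), (5, 1)]
print(count_dominating(S, (2, 2)))
```

Any data structure T for the problem must return the same value as this function for every query point; only the time taken differs.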

1.1 Previous Work

The general topic of geometric range queries is a field that has seen
an enormous amount of activity in the last century. Results in this
field depend heavily on the types of objects the data structure stores
and on the shape of the query ranges. In this section we only mention
a few data structures for orthogonal range counting and semigroup
queries in 2 dimensions. The interested reader is directed to the
excellent, and easily accessible, survey by Agarwal and Erickson
[9].

Orthogonal range counting is a classic problem in computational
geometry. The 2- (and 3- and 4-) sided range counting problem can be
solved by Bentley’s range trees [3]. Range trees use
O(n log n) space and can be constructed in O(n log n) time.
Originally, range trees answered queries in O(log^2 n) time.
However, with the application of fractional cascading
[6, 11] the query time can be reduced to O(log n) without
increasing the space requirement by more than a constant factor.
Range trees can also answer more general semigroup queries in
which each point of S is assigned a weight from a commutative
semigroup and the goal is to report the weight of all points in the
query range [10, 15].

For 2-sided orthogonal range counting queries, Chazelle
[4, 5] proposes a data structure of size O(n) that can be
constructed in O(n log n) time and that can answer range counting
queries in O(log n) time. Unfortunately, this data structure is
not capable of answering semigroup queries in the same time bound.
For semigroup queries, Chazelle provides data structures with the
following requirements: (1) O(n) space and O(log^{2+ε} n)
query time, (2) O(n log log n) space and O(log^2 n log log n)
query time, and (3) O(n log^ε n) space and O(log^2 n)
query time.

Practical linear space data structures for range counting include
k-d trees [2], quad-trees [13], and their
variants. These structures are practical in the sense that they are
easy to implement and use only O(n) space. Unfortunately, neither
of these structures has a worst-case query time of log^{O(1)} n.
Thus, in terms of query time, k-d trees and quad-trees are nowhere
near competitive with range trees.

Despite the long history of data structures for orthogonal range
queries, range trees with fractional cascading are still the most
effective data structure for 2-sided orthogonal range queries in the
semigroup model. In particular, no data structure is currently known
that uses o(n log n) space and can answer 2-sided orthogonal range
queries in O(log n) time.

1.2 New Results

In the current paper we present a data structure, the biased
range tree, for 2-sided orthogonal range counting. Biased range
trees fit into the comparison tree model of computation, in
which all decisions made during a query are based on the result of
comparing either the x- or y-coordinate of the query point to some
precomputed values. Most data structures for orthogonal range
searching, including range trees, k-d trees and quadtrees, fit into
the comparison tree model. This model makes no assumptions about the
x- or y-coordinates of points other than that they each come from
some (possibly different) total order. This is particularly useful in
practice since it avoids the precision problems usually associated with
algebraic decisions and allows the mixing of different data types
(one for x-coordinates and one for y-coordinates) in one data
structure.

A biased range tree has size O(n log n), can be constructed in
O(n log n) time, and can answer range counting (or semigroup)
queries in O(μD(T∗)) expected time, where T∗ is any
comparison tree that answers range counting queries over S. In
particular, T∗ could be a comparison tree that minimizes
μD(T∗), implying that the expected query time of our data
structure is, to within a constant factor, as fast as the fastest
comparison-based data structure
for answering range counting queries over S. Moreover, the
worst-case search time of biased range trees is O(log n), matching
the worst-case performance of range trees.

Note that we do not place any restrictions on the comparison tree
T∗. Biased range trees, while requiring only O(nlogn) space,
are competitive with any comparison-based data structure. Thus, the
memory requirement of biased range trees is the same as that of range
trees but their expected query time can never be any worse.

The remainder of the paper is organized as follows. In
Section 2 we present background material that is used in
subsequent sections. In Section 3 we define biased
range trees. In Section 4 we prove that biased range trees
are optimal. In Section 5 we recap, summarize, and describe
directions for future work.

In this section we give definitions, notations, and background
that are prerequisites for subsequent sections.

Rectangles.

For the purposes of the current paper, a rectangle R(a,b,c,d) is defined as

R(a,b,c,d)={(x,y):a≤x≤b and c≤y≤d}.

We also allow unbounded rectangles by setting a,c=−∞ and/or
b,d=∞. Therefore, under this definition, rectangles can have
0, 1, 2, 3, or 4 sides. For a query point q=(qx,qy) we denote
by R(q) the query range R(qx,∞,qy,∞). A
horizontal strip is a rectangle of the form
R(−∞,∞,c,d) and a vertical strip is a rectangle of
the form R(a,b,−∞,∞).

Classification Problems and Classification Trees.

A classification problem over a domain 𝒟 is a
function P:𝒟↦{0,…,k−1}. The
special case in which k=2 is called a decision problem. A
d-ary classification tree is a full d-ary tree¹ in which each internal node v is labelled
with a function Pv:𝒟↦{0,…,d−1} and for
which each leaf ℓ is labelled with a value
in {0,…,k−1}. The search path of an input q
in a classification tree T starts at the root of T and, at each
internal node v, evaluates i=Pv(q) and proceeds to the ith
child of v. We denote by T(q) the label of the final (leaf) node
in the search path for q. We say that the classification tree T solves the classification problem P over the domain
𝒟 if, for every q∈𝒟, P(q)=T(q).

The particular type of classification trees we are concerned with are
comparison trees. These are binary classification trees in
which the function Pv at each node v compares either qx or
qy to a fixed value (that may depend on the point set S and the
distribution D). For the problem of 2-sided range counting over
S, the leaves of T are labelled with values in {0,…,|S|}
and T(q)=|R(q)∩S| for all q∈R2.
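
To make the comparison tree model concrete, the following minimal sketch evaluates a hand-built comparison tree and reports both the leaf label T(q) and the length of the search path. The tuple encoding `(axis, threshold, if_le, if_gt)` and the name `evaluate` are our own assumptions for illustration; the example tree solves 2-sided range counting for the single-point set S = {(1, 1)}.

```python
def evaluate(tree, q):
    """Follow the search path of q and return (leaf label, path length).

    A tree is either an int (a leaf label) or a tuple
    (axis, threshold, if_le, if_gt), where axis 0 compares qx and
    axis 1 compares qy against the fixed threshold.
    """
    steps = 0
    while not isinstance(tree, int):
        axis, threshold, if_le, if_gt = tree
        tree = if_le if q[axis] <= threshold else if_gt
        steps += 1
    return tree, steps


# For S = {(1, 1)} the count is 1 exactly when qx <= 1 and qy <= 1.
T = (0, 1, (1, 1, 1, 0), 0)
print(evaluate(T, (0, 0)))
print(evaluate(T, (2, 0)))
```

Queries with qx > 1 reach a leaf after one comparison, while all other queries need two, so the expected search time depends on how much probability the distribution places on each leaf region.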

Probability.

For a probability measure D and an event X, we denote by D|X the
distribution D conditioned on X. That is, the distribution where
the probability of an event Y is Pr(Y∣X)=Pr(Y∩X)/Pr(X).
The probability measures used in this paper are usually defined over
R2. We make no assumptions about how these measures are
represented, but we assume that an algorithm can, in constant time,
given a rectangle r, determine Pr(r).

For a classification tree T that solves a problem
P:𝒟↦{0,…,k−1} and a probability measure D
over 𝒟, the expected search time of T, denoted
by μD(T), is the
expected length of the search path for q when q is drawn at random
from 𝒟 according to D. Note that, for each leaf ℓ
of T there is a maximal subset r(ℓ)⊆𝒟 such
that the search path for any q∈r(ℓ) ends at ℓ. Thus, the
expected search time of T (under distribution D) can be written as

\[ \mu_D(T) = \sum_{\ell \in L(T)} \Pr(r(\ell)) \cdot d_T(\ell), \]

where L(T) denotes the leaves of T and dT(ℓ) denotes the
length of the path from the root of T to ℓ. When the tree T
is obvious based on context we will sometimes use the notation
d(ℓ) to denote dT(ℓ). Note that, for
comparison trees, the closure of r(ℓ) is always a rectangle. For
a node v in a tree, we will use the phrases depth of v and
level of v interchangeably and they both refer to d(v).

The following theorem is a restatement of (half of) Shannon’s
Fundamental Theorem for a Noiseless Channel [14, Theorem 9].

Theorem 1.

Let P:𝒟↦{0,…,k−1} be a classification
problem and let p∈𝒟 be selected from a distribution D such
that Pr{P(p)=i}=p_i, for 0≤i<k. Then, any
d-ary classification tree T that solves P has

\[ \mu_D(T) \ge \sum_{i=0}^{k-1} p_i \log_d(1/p_i). \tag{1} \]

In terms of range counting, Theorem 1 immediately implies that,
if pi is the probability that the query range contains i points
of S, then any binary decision tree T that does range counting has
μD(T) ≥ ∑_{i=0}^{n} p_i log(1/p_i). Unfortunately for us,
this lower bound is too weak and, in general, there is no decision
tree whose performance matches this obvious entropy lower bound.
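
The entropy bound of Theorem 1 is easy to evaluate numerically. The sketch below computes the right-hand side of (1) for a given list of outcome probabilities; the function name `entropy_lower_bound` and the sample probabilities are hypothetical and serve only as an illustration.

```python
import math


def entropy_lower_bound(probs, d=2):
    """Return sum of p_i * log_d(1/p_i), skipping outcomes with p_i = 0."""
    return sum(p * math.log2(1.0 / p) / math.log2(d) for p in probs if p > 0)


# Hypothetical probabilities that the query range contains 0, 1, 2, or 3 points.
count_probs = [0.5, 0.25, 0.125, 0.125]
print(entropy_lower_bound(count_probs))
```

As the surrounding text explains, this value is only a lower bound: a comparison tree for range counting may need to be considerably deeper than the entropy of the answer distribution suggests.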

A stronger lower bound on the cost of range searching can be obtained
by considering the arrangement A of 2n rays obtained by drawing
two rays originating at each point of S, one to the left and one
downwards (see Figure 1.a). This arrangement partitions
the plane into a set of faces F(A). If T is a comparison tree for
range counting in S, then there is no leaf ℓ of T such that
the interior of r(ℓ) intersects any edge of A since otherwise
there are query points q in the neighbourhood of this intersection
for which T(q)≠|R(q)∩S|. Therefore, by relabelling the leaves
of T with the faces of A, we obtain a data structure for
determining which face of A contains the query point q.
By Theorem 1, this implies that

\[ \mu_D(T) \ge \sum_{f \in F(A)} \Pr(f) \log(1/\Pr(f)). \]

Unfortunately, this bound is still not strong enough and, in general,
there is no decision tree T that matches this lower bound. To see
this, consider Figure 1.b, when the query point q is
uniformly distributed among the n+1 shaded circles. In this case,
q is always in the same face of A so the lower bound given above
is 0. Nevertheless, it is not hard to see that the leaves of
any decision tree T for range searching in S can be relabelled to
determine which of the n+1 circles contains q, so μD(T)≥log(n+1).

Figure 1: (a) The distribution of the query point q over the faces
of the arrangement A gives a lower bound on the cost of any
comparison tree for range counting in S. (b) The lower bound is
not always achievable by a comparison tree.

Biased Search Trees.

Biased search trees are a classic data structure for solving
the following 1-dimensional problem: Given an increasing sequence of
real numbers X=⟨x_0=−∞, x_1, x_2, …, x_n, x_{n+1}=∞⟩ and a probability distribution D over
ℝ, construct a binary search tree T=T(X,D) so that, for any
query value q drawn from D, one can quickly find the unique
interval [x_i, x_{i+1}) containing q. If p_i is the probability
that q∈[x_i, x_{i+1}) then the expected number of comparisons
performed while searching for q is given
by

\[ \mu_D(T) \le \sum_{i=1}^{n} p_i \log(1/p_i) + 1 \]

and the tree T can be constructed in O(n) time [12].
Clearly, by Theorem 1, the query time of this binary search
tree is optimal up to an additive constant term. Note that, by having
each node of T store the size of its subtree, a biased search tree
can count the number of elements of X in the interval
I(q)=[q,∞) without increasing the search time by more than a
constant factor. Thus, biased search trees are an optimal data
structure for 1-dimensional range counting.
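
The idea behind biased search trees can be illustrated with a simple weight-bisection rule: split the sequence of intervals where the two sides have probabilities as close to equal as possible, then recurse. The sketch below is our own simplification and is not the linear-time construction of [12]; the names `build_biased_tree`, `leaf_depths`, and `expected_depth` are hypothetical.

```python
def build_biased_tree(weights, lo=0, hi=None):
    """Build a leaf-oriented search tree over intervals lo..hi-1.

    Returns an int (interval index) for a leaf, or a tuple
    (split, left, right) where intervals < split go left.
    """
    if hi is None:
        hi = len(weights)
    if hi - lo == 1:
        return lo
    total = sum(weights[lo:hi])
    best_k, best_gap, prefix = lo + 1, float("inf"), 0.0
    for k in range(lo + 1, hi):
        prefix += weights[k - 1]
        gap = abs(prefix - total / 2)  # imbalance if we split before interval k
        if gap < best_gap:
            best_k, best_gap = k, gap
    left = build_biased_tree(weights, lo, best_k)
    right = build_biased_tree(weights, best_k, hi)
    return (best_k, left, right)


def leaf_depths(tree, depth=0, out=None):
    """Map each interval index to the depth of its leaf."""
    if out is None:
        out = {}
    if isinstance(tree, tuple):
        _, left, right = tree
        leaf_depths(left, depth + 1, out)
        leaf_depths(right, depth + 1, out)
    else:
        out[tree] = depth
    return out


def expected_depth(weights):
    """Expected number of comparisons when interval i has probability weights[i]."""
    depths = leaf_depths(build_biased_tree(weights))
    return sum(w * depths[i] for i, w in enumerate(weights))


print(expected_depth([0.5, 0.25, 0.125, 0.125]))
```

Intervals with high probability end up near the root, so frequent queries are answered with few comparisons, while rare intervals are pushed deeper.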

In this section we describe the biased range tree data structure,
which has three main parts: the backup tree, the primary tree, and a
set of catalogues that adorn the nodes of the primary tree.

3.1 The Backup Tree

In trying to achieve optimal query time, biased range trees will try
to quickly answer queries that are, in some sense, easy. In some
cases, a query is difficult and it cannot be answered in o(logn)
time. For these queries, a backup range tree that stores the
points of S and can answer any 2-sided range query in O(logn)
worst-case time is used. The preprocessing time and space
requirements of this backup tree are O(n log n) [8].

3.2 The Primary Tree

Like a range tree, a biased range tree is an augmented data structure
consisting of a primary tree whose nodes store secondary structures.
However, in a range tree the primary tree is a binary search tree that
discriminates based only on the x-coordinate of the query point q.
In order to achieve optimal expected query time, this turns out to be
insufficient, so instead biased range trees use a variation of a k-d
tree as the primary tree.

The primary tree is constructed in a top-down fashion. Each node v
of T is associated with a region r(v) whose closure is a
rectangle. The region associated with the root of T is all of
ℝ². We say that a node v is bad if its depth is at least
⌈log_2 n⌉ and r(v)∩S≠∅. A node v is
split if its depth is less than ⌈log_2 n⌉ and
r(v)∩S≠∅. The two children of a split node v are
associated with the two regions obtained by removing a vertical or
horizontal strip s(v) from r(v) depending on whether the depth of
v is even or odd, respectively. We call a node v at even distance
from the root a vertical node, otherwise we call v a
horizontal node.

Refer to Figure 2. For a vertical node v, we denote its
children by left(v) and right(v) and call them the left
child and right child of v, depending on which side of the
vertical strip (left or right) they are. For uniformity, we will also
call the children of a node v that is split with a horizontal strip
left(v) and right(v). The child below the strip is denoted by left(v) and
the child above the strip is denoted by right(v).
Similarly, the left and right
boundaries of a strip s(v) at a horizontal node v refer to the
bottom and top sides of s(v). Note that, with these conventions, if
the query point q is in r(left(v)) then R(q) intersects
r(right(v)). However, if q∈r(right(v)) then R(q) does not
intersect r(left(v)). Similarly, for a query point q∈s(v), the
query range R(q) intersects r(right(v)) but not r(left(v)).

Figure 2: The splitting of (a) a vertical node v and (b) a horizontal
node v.

All that remains is to define the strip s(v) for each node v. If
v is a leaf then we use the convention that s(v)=r(v). If v is
not a leaf then s(v)⊆r(v) is selected as a maximal strip
containing no point of r(v)∩S in its interior, that is closed on
its right side and open on its left side and such that each of the at
most two components of r(v)∖s(v) has probability at most
Pr(r(v))/2. Suppose v is a vertical node. Then let
r(v)_1,…,r(v)_k be a partitioning of r(v) into strips, in
left-to-right order, obtained by drawing a vertical line through each
of the k points in S∩r(v). We use the convention that each
strip is closed on its right side and open on its left side. Then
there is a unique strip s(v)=r(v)_i such that ∑_{j=1}^{i−1} Pr(r(v)_j) ≤ Pr(r(v))/2 and ∑_{j=i+1}^{k} Pr(r(v)_j) < Pr(r(v))/2. For a
horizontal node v, the definition of s(v) is analogous except we
use horizontal lines through each point of r(v)∩S.
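
The selection of the strip s(v) from the ordered strips can be sketched as a single scan over the strip probabilities. The code below is an illustrative sketch under the assumption that the probabilities of the strips are given as a list in left-to-right (or bottom-to-top) order; the function name `choose_strip` is ours, and indices are 0-based rather than 1-based as in the text.

```python
from fractions import Fraction


def choose_strip(strip_probs):
    """Return the index i such that the strips before i have total
    probability at most half of the total, and the strips after i
    have total probability strictly less than half of the total."""
    total = sum(strip_probs)
    half = total / 2
    before = 0
    for i, p in enumerate(strip_probs):
        after = total - before - p
        if before <= half and after < half:
            return i
        before += p
    raise ValueError("no strip found; the total probability must be positive")


probs = [Fraction(1, 10), Fraction(2, 10), Fraction(4, 10), Fraction(3, 10)]
print(choose_strip(probs))
```

Exact fractions are used here only to avoid floating-point ties at the one-half boundary; because the region on each side of the chosen strip carries at most half of the probability, the probability of a region at least halves with every level of the primary tree.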

Note that for a node v that is not a leaf, we use the convention
that s(v) contains its right side but not its left side and that
r(right(v)) and r(left(v)) are the two components of r(v)∖s(v). This implies that r(left(v)) and/or r(right(v)) may be
empty, in which case left(v), respectively, right(v) is a leaf of
T. With these definitions, for any point q∈R2 there is
exactly one vertex v of T such that q∈s(v).

The following two properties are easily derived from the definition of
T and are necessary to prove the optimality of biased range trees:

1. Any node v at depth i in T has Pr(s(v))≤Pr(r(v))≤1/2^i.

2. For any node v of T, if Pr(r(v))>0, then the closure of
r(v) contains at least one point of S.

Point 1 above follows immediately from the definition of s(v). Next
we explain the logic leading to Point 2. If r(v) contains a point
of S then so does the closure of r(v). If r(v)=∅, then
Pr(r(v))=0. Otherwise, r(v)≠∅ and r(v) has no
point of S in its interior. Then consider the parent w of v.
Since s(w) does not contain r(v) there must be a point of S on
the boundary of s(w) that is also on the boundary of r(v).
Therefore r(v) contains this point in its closure.

3.3 The Catalogues

The nodes of the tree T are augmented with additional data
structures called catalogues that hold subsets of S. Each
node v has two catalogues, Cx(v) and Cy(v) that store subsets
of S sorted by their x-, respectively, y-, coordinate.
Intuitively, Cx(v) stores points that are “above” r(v) and
Cy(v) stores points that are “to the right of” r(v). (Refer to
Figure 3.) More precisely, if v is a horizontal node,
then Cx(left(v))=(s(v)∪r(right(v)))∩S and
Cy(left(v))=∅. If v is a vertical node, then
Cy(left(v))=(s(v)∪r(right(v)))∩S and
Cx(left(v))=∅. For any node v that is the root of T or
a right child of its parent, Cx(v)=Cy(v)=∅.

Figure 3: The catalogues of (a) a horizontal node v and (b) a
vertical node v.

Consider any node v that is not a bad leaf and any point q∈s(v).
If v has a left child then let v1=left(v), otherwise, let
v1=v.
Let v1,…,vk denote the path from v1 to the root of T
(see Figure 4). Then the catalogues of v1,…,vk have
the following properties:

Figure 4: The area covered by catalogues on the path v to the root
of T. The × symbol shows the location of the query point q.

1. The points in the catalogues of v1,…,vk are above or to
the right of q. That is, for each 1≤i≤k, all points in
Cy(vi), respectively, Cx(vi) have their x-, respectively,
y-, coordinate greater than or equal to qx, respectively, qy.

2. All catalogues at nodes in v1,…,vk are disjoint. That
is, for each 1≤i<j≤k,
Cx(vi)∩Cx(vj)=∅,
Cy(vi)∩Cy(vj)=∅,
Cx(vi)∩Cy(vj)=∅, and
Cx(vj)∩Cy(vi)=∅.

3. The catalogues at nodes v1,…,vk contain all points in
the query range R(q). That is,

\[ R(q) \cap S \subseteq \bigcup_{i=1}^{k} \bigl( C_x(v_i) \cup C_y(v_i) \bigr). \]

Note that Properties 1, 2, and 3 above imply that determining |R(q)∩S| can be done by solving a sequence of 1-sided range queries in the
x- and y-catalogues of v1,…,vk. However, performing these queries
individually would take too long.
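
The reduction to 1-sided queries can be sketched directly. In the code below each x-catalogue is represented only by the sorted x-coordinates of its points (all of which are assumed to already satisfy the y-constraint), and each y-catalogue by the sorted y-coordinates of its points (all assumed to satisfy the x-constraint). The function name `count_from_catalogues` and this list representation are our own assumptions. This version performs an independent binary search per catalogue, which is exactly the slow approach that fractional cascading is meant to avoid.

```python
from bisect import bisect_left


def count_from_catalogues(x_catalogues, y_catalogues, q):
    """Sum 1-sided counts over disjoint catalogues on a root-to-node path.

    x_catalogues: sorted x-coordinate lists of points known to have y >= qy.
    y_catalogues: sorted y-coordinate lists of points known to have x >= qx.
    """
    qx, qy = q
    total = 0
    for xs in x_catalogues:
        total += len(xs) - bisect_left(xs, qx)  # points with x >= qx
    for ys in y_catalogues:
        total += len(ys) - bisect_left(ys, qy)  # points with y >= qy
    return total


print(count_from_catalogues([[1, 3, 5]], [[0, 2, 4], [7]], (2, 2)))
```

Because the catalogues are disjoint and together cover R(q)∩S, simply adding the per-catalogue counts gives the correct answer with no double counting.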

To speed up the process of navigating the catalogues of T,
fractional cascading [6] is used. Starting at the root of T and as long as v is
not a leaf, a fraction of the data in Cx(v) is cascaded into
Cx(right(v)) and Cx(left(v)). As well, a fraction of the data
in Cy(v) is cascaded into both Cy(right(v)) and Cy(left(v)).
Note that this
cascading is done only to speed up navigation between the catalogues
of T. Although fractional cascading introduces extra data into the
catalogues of T we will continue to use the notations Cx(v) and
Cy(v) to denote the set of points contained in the catalogues of
v before fractional cascading takes place.

Finally, each catalogue Cx(v) and Cy(v) is indexed by a biased
binary search tree Tx(v), respectively, Ty(v). If v is the
left child of its parent, then the weight of an interval (a,b] in
Tx(v), respectively, Ty(v) is given by the probability that
qx, respectively, qy, is in the interval (a,b] when q is
drawn according to the distribution D∣s(parent(v)). Otherwise
(v is not a left child), the weight of an interval is determined by
the distribution D∣s(v).

3.4 Construction Time and Space Requirements

The biased range tree data structure is now completely defined. The
structure consists of a backup tree, a primary tree, and the
catalogues of the primary tree. We now analyze the construction time
and space requirements of biased range trees.

The backup tree has size O(n log n) and can be constructed in
O(n log n) time [8, Theorem 5.11]. To construct the
primary tree quickly we presort the points of S by their x and y
coordinates. Since the primary tree has height O(log n), it is
then easily constructed in O(n log n) time. Ignoring any copies of
points created by fractional cascading, each point in S occurs in at
most 2 catalogues at each level of the primary tree. Thus, the total size
of all catalogues (before fractional cascading) is O(n log n) and
these catalogues can be constructed in O(n log n) time (because the
elements of S are presorted; see de Berg et al. [8, Section 5.3] for details). The fractional cascading
between catalogues does not increase the size of catalogues by more
than a constant factor since each catalogue is cascaded into only a
constant number of other catalogues [6].

In summary, given the point set S and access to the distribution
D, a biased range tree for (S,D) can be constructed in O(nlogn) time and requires O(nlogn) space.

3.5 The Query Algorithm

The algorithm to answer a 2-sided range query q=(qx,qy) proceeds
in three steps:

1. The algorithm navigates the tree T from top to bottom to
locate the unique node v such that q∈s(v). This step takes
O(dT(q)) time, where dT(q) is the depth of the node v. If v
is a bad leaf (so dT(q)≥log n) then the algorithm performs a
range query in O(log n) time using the backup range tree and the
query algorithm does not execute the next two steps.

2. If v has a left child then let u=left(v), otherwise let
u=v. The algorithm uses Tx(u) and Ty(u) to locate qx and
qy, respectively, in the catalogues Cx(u) and Cy(u),
respectively.

3. The algorithm walks back from u to the root of T, locating
q in the catalogues of all nodes on this path and computing the
results of the range counting query as it goes. Thanks to fractional
cascading, each step of this walk can be done in constant time, so the
overall time for this step is also O(dT(q)).

Observe that Steps 1 and 3 of the query algorithm each take
O(dT(q)) time. The time needed to accomplish Step 2 of the
algorithm depends on exactly what is in the catalogues Cx(u) and
Cy(u), and will be the first quantity we study in the next section.

In this section we show that the expected query time of biased range
trees is as good as the expected query time of any comparison tree.
The expected query time has two components. The first component is
the expected depth, dT(q), of the node v such that s(v)
contains q. The second component is the expected cost of locating
q in the catalogues of u (recall that u=left(v) or u=v if v
has no left child). We will show that each of these two components
is a lower bound on the expected cost of any decision tree for
two-sided range searching on S where queries come from distribution
D. In order to simplify notation in this section we will use the
convention Pr(v)=Pr(s(v)) is the probability that a search
terminates at node v of T.

4.1 The Catalogue Location Step

First we show that the expected cost of locating q in the two
catalogues, Cx(u) and Cy(u) is
a lower bound on the expected cost of any decision tree for answering
2-sided range queries in S. The intuition behind this proof is
that, in order to correctly answer range counting queries, any decision tree
for range counting must locate the x-coordinate of q
with respect to the x-coordinates of all points above q.
Similarly, it must locate the y-coordinate of q with respect to
the y-coordinates of all points to the right of q. The structure
of the catalogues ensures that biased range trees do this in the most
efficient manner possible.

Lemma 1.

Let S be a set of n points and let D be a probability measure
over R2.
Let T∗ be any decision tree for 2-sided range counting in S and let
C2(S,D) denote the expected cost of locating q in Step 2 of the
biased range tree query algorithm on the biased range tree T=T(S,D).
Then

μD(T∗)=Ω(C2(S,D)).

Proof.

We first observe that, by definition,

\[ C_2(S,D) = \sum_{v \in T} \Pr(v) \bigl( \mu_{D|s(v)}(T_x(u)) + \mu_{D|s(v)}(T_y(u)) \bigr). \]

Consider some node v of T. For a point q∈s(v), all of the
points in Tx(v) are points that may or may not be in the query
range R(q) depending on where exactly q is located within s(v).
This implies that, if T∗ correctly answers range queries for every
point q∈s(v) then it must determine the location of the
x-coordinate of q with respect to all points in Tx(v). More
precisely, the leaves of T∗ could be relabelled to obtain a
comparison tree that determines, for any q∈s(v), which interval
of Tx(v) contains qx. Since Tx(u) is a biased search tree
for the probability measure D∣s(v), this implies that

\[ \mu_{D|s(v)}(T^*) \ge \mu_{D|s(v)}(T_x(u)) - 1. \]

Similarly, the same argument applied to Ty(v) yields

\[ \mu_{D|s(v)}(T^*) \ge \mu_{D|s(v)}(T_y(u)) - 1. \]

We can now complete the proof with

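\begin{align*}
\mu_D(T^*) &= \sum_{v \in T} \Pr(v) \cdot \mu_{D|s(v)}(T^*) \\
&\ge \sum_{v \in T} \Pr(v) \cdot \max\{\mu_{D|s(v)}(T_x(u)), \mu_{D|s(v)}(T_y(u))\} - 1 \\
&\ge \sum_{v \in T} \tfrac{1}{2} \Pr(v) \cdot \bigl( \mu_{D|s(v)}(T_x(u)) + \mu_{D|s(v)}(T_y(u)) \bigr) - 1 \\
&= \tfrac{1}{2} \cdot C_2(S,D) - 1 = \Omega(C_2(S,D)).
\end{align*}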

∎

4.2 The Tree Searching Step

Next we bound the expected depth dT(q) of the node v of T such
that q∈s(v). We do this by showing that any decision tree T∗
for range counting in S must solve a set of point location problems
and that the expected depth of v is a lower bound on the complexity
of solving these problems.

We say that a set of rectangles is HV-independent if no
horizontal or vertical line intersects more than one rectangle in the
set. We say that a set {v1,…,vk} of nodes in T is
HV-independent if the set {r(v1),…,r(vk)} is
HV-independent.

Lemma 2.

Let S be a set of n points and let D be a probability measure
over R2.
Let T=T(S,D) be the biased range tree for (S,D) and label
each node of T white
or black, such that all white nodes are at distance at most i from
the root of T. Then, if T contains more than γ^i white nodes
then T contains an HV-independent set of white nodes of size
Ω((γ/√2)^i).

Proof.

Define a graph G=(V,E) whose vertices are the white nodes of T and
for which uv∈E if and only if there is a horizontal or vertical line that
intersects both r(u) and r(v). Note that an independent set of
vertices in G is an HV-independent set of white nodes in T. Thus,
it suffices to find a sufficiently large independent set in G.

A well-known result on k-d trees states that, for a k-d tree of
height i, any horizontal or vertical line intersects at most
2^{⌈i/2⌉} rectangles of the k-d tree
[8, Lemma 5.4]. Therefore, since T is a k-d
tree,² the
number of edges in G is at most |V|⋅2^{⌈i/2⌉}. This
implies that G has a vertex v of degree at most 2^{⌈i/2⌉+1}
and this is also true of any vertex-induced subgraph of G.

We can therefore obtain an independent set in G by repeatedly
selecting a vertex v of degree at most 2^{⌈i/2⌉+1}, adding v to the
independent set and deleting v and its neighbours from G. Since, at
each step we add one vertex to the independent set and delete at most
2^{⌈i/2⌉+1}+1 vertices from G, this produces an independent set of size
Ω(|V|/2^{i/2})=Ω((γ/√2)^i), as required.
∎
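
The greedy argument used in this proof is the standard minimum-degree heuristic for independent sets. The following sketch shows the procedure on an arbitrary undirected graph given as an adjacency dictionary; the function name `greedy_independent_set` and the example graph are our own, and the sketch does not build the graph G from a biased range tree.

```python
def greedy_independent_set(adjacency):
    """Repeatedly pick a minimum-degree vertex, add it to the independent
    set, and delete it together with all of its neighbours.

    adjacency: dict mapping each vertex to the set of its neighbours.
    """
    remaining = {v: set(nbrs) for v, nbrs in adjacency.items()}
    independent = []
    while remaining:
        v = min(remaining, key=lambda u: len(remaining[u]))
        independent.append(v)
        removed = remaining[v] | {v}
        for u in removed:
            remaining.pop(u, None)
        for nbrs in remaining.values():
            nbrs -= removed  # drop edges to deleted vertices
    return independent


# Hypothetical example: a path a - b - c - d.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
print(greedy_independent_set(path))
```

If every induced subgraph has a vertex of degree at most Δ, each round removes at most Δ+1 vertices, so the resulting independent set has size at least |V|/(Δ+1); this is the counting step used at the end of the proof.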

We can now provide the second piece of the lower bound.

Lemma 3.

Let S be a set of n points and let D be a probability measure
over R2.
Let T∗ be any comparison tree that does range counting over S. Let
C1(S,D) denote the expected depth of the node v of the biased
range tree T=T(S,D) such that q∈s(v). Then

μD(T∗)=Ω(C1(S,D)).

Proof.

Partition the nodes of T into groups G_1,G_2,… where G_i
contains all nodes v such that 1/2^i ≤ Pr(v) ≤ 1/2^{i−1}.
Observe that the nodes in group G_i occur in the first i levels of
T. Select constants γ and β with √2<γ<β<2 and define α=γ/√2. By
repeatedly applying Lemma 2, each group G_i can be
partitioned into groups G_{i,1},…,G_{i,t_i} where, for each 1≤j<t_i, G_{i,j} is an HV-independent set with |G_{i,j}|≥α^i. Furthermore, |G_{i,t_i}|≤γ^i. (Note that
G_{i,t_i} is not necessarily HV-independent.)

Consider some group G_{i,j} for 1≤j<t_i. Let ℓ be a
leaf of T∗ and observe that, because the nodes in G_{i,j} are
HV-independent and each one contains at least one point of S in its
closure, there are at most 4 nodes v in G_{i,j} such that
r(ℓ) intersects the closure of r(v). (Otherwise r(ℓ)
contains a point of S in its interior and therefore T∗ does not
solve the range counting problem for S.) Thus, by performing 2
additional comparisons, T∗ can be used to determine which node
v∈G_{i,j} (if any) contains the query point q in s(v).
However, G_{i,j} contains Ω(α^i) nodes and the search
path for q terminates at each of these with probability between
1/2^i and 1/2^{i−1}. Therefore, if we denote by D_{i,j} the
distribution D conditioned on the search path for q terminating in
one of the nodes in G_{i,j} then we have, by applying
Theorem 1,

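\begin{align*}
\mu_{D_{i,j}}(T^*) + 2 &\ge \sum_{v \in G_{i,j}} \Pr(v \mid G_{i,j}) \log(1/\Pr(v \mid G_{i,j})) \\
&\ge \sum_{v \in G_{i,j}} \Pr(v \mid G_{i,j}) \log(\Omega(\alpha^i)) \\
&\ge \log(\Omega(\alpha^i)) \\
&= i \log\alpha - O(1).
\end{align*}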

Putting this all together, we obtain

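\begin{align*}
\mu_D(T^*) &= \sum_{i=1}^{\infty} \sum_{j=1}^{t_i} \Pr(G_{i,j}) \, \mu_{D_{i,j}}(T^*) \\
&\ge \sum_{i=1}^{\infty} \sum_{j=1}^{t_i-1} \Pr(G_{i,j}) \, \mu_{D_{i,j}}(T^*) \\
&\ge \sum_{i=1}^{\infty} \sum_{j=1}^{t_i-1} \Pr(G_{i,j}) \, (i \log\alpha - O(1)) \\
&\ge (\log\alpha) \cdot \sum_{i=1}^{\infty} \sum_{j=1}^{t_i-1} \sum_{v \in G_{i,j}} \Pr(v) \cdot d(v) - O(1) \\
&= (\log\alpha) \cdot \sum_{v \in T} \Pr(v) \cdot d(v) - \sum_{i=1}^{\infty} \sum_{v \in G_{i,t_i}} \Pr(v) \cdot d(v) - O(1) \\
&\ge (\log\alpha) \cdot \sum_{v \in T} \Pr(v) \cdot d(v) - \sum_{i=1}^{\infty} i \cdot \Pr(G_{i,t_i}) - O(1) \\
&\ge (\log\alpha) \cdot \sum_{v \in T} \Pr(v) \cdot d(v) - \sum_{i=1}^{\lceil \log n \rceil} i \gamma^i / 2^{i-1} - O(1) \\
&\ge (\log\alpha) \cdot \sum_{v \in T} \Pr(v) \cdot d(v) - O(1) \\
&= \Omega(C_1(S,D)),
\end{align*}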

where the last inequality follows from the fact that γ/2<1.
∎

To get some idea of the constants involved in the proof of
Lemma 3, we can select γ=1.6, so that
α=1.6/√2≈1.13137085 and logα≈0.178071905 and the O(1) term is approximately 20. Thus, for this
choice of parameters, the depth in T is competitive with T∗ to
within a factor of 1/0.178071905≈5.615 and an additive
constant of 20. Alternatively, selecting γ=1.8 gives a
constant factor less than 3 and an additive term of approximately 90.
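
The multiplicative constants quoted above can be reproduced with a few lines of arithmetic. In the sketch below, `lemma3_constants` is a hypothetical helper name. The additive values quoted in the text (about 20 and 90) coincide with the closed form x/(1−x)² of the series ∑ i·x^i at x=γ/2; treating that series as the source of the additive term is our assumption, and constant factors in the actual derivation may differ.

```python
import math


def lemma3_constants(gamma):
    """Return (alpha, log2(alpha), 1/log2(alpha), series) for a given gamma.

    series is the closed form of sum_{i>=1} i * x**i with x = gamma / 2,
    which converges because gamma < 2.
    """
    alpha = gamma / math.sqrt(2)
    log_alpha = math.log2(alpha)
    factor = 1.0 / log_alpha
    x = gamma / 2.0
    series = x / (1.0 - x) ** 2
    return alpha, log_alpha, factor, series


for gamma in (1.6, 1.8):
    print(gamma, lemma3_constants(gamma))
```

The trade-off is visible in the two calls: moving γ closer to 2 improves the multiplicative factor 1/log α but makes the series, and hence the additive term, grow quickly.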

And now the main event:

Theorem 2.

Let S be a set of n points and let D be a probability measure
over R2.
Let T=T(S,D) be the biased range tree for S and D and
let T∗ be any decision
tree that answers range counting queries for S. Then

μD(T∗)=Ω(μD(T)).

Proof.

By the definition of C1 and C2, the expected cost of searching in
T is μD(T)=O(C1(S,D)+C2(S,D)). On the other hand, by
Lemma 3 and Lemma 1, μD(T∗)=Ω(max{C1(S,D),C2(S,D)})=Ω(C1(S,D)+C2(S,D))=Ω(μD(T)). This completes the proof.
∎

We have presented biased range trees, an optimal data structure for
2-sided orthogonal range counting queries when the point set S and
query distribution D are known in advance. The expected time required
to answer queries with a biased range tree, when the queries are
distributed according to D, is within a constant factor of that of any
decision tree for answering range queries over S. Like standard
range trees, biased range trees use O(n log n) space and can also
answer semigroup queries [10, 15].³
Although the analysis of
biased range trees is complicated, their implementation is not much
more complicated than that of standard range trees.

As a small optimization, the backup range tree data structure can be
eliminated from biased range trees. Instead, once the probability of
a node v drops below 1/n the node can be split by ignoring the
distribution D and simply splitting the points of r(v)∩S into
two sets of roughly equal size. This results in a tree of depth at
most 2(logn+1).

This work is just one of many possible results on
distribution-sensitive range searching. Several open problems
immediately arise.

Open Problem 1.

Note that a 4-sided orthogonal range counting query can be reduced to
4 2-sided orthogonal range counting queries using the principle of
inclusion-exclusion. Unfortunately, this reduction does not produce
an optimal distribution-sensitive data structure. To see this,
consider 4-sided queries consisting of unit squares whose bottom left
corner is uniformly distributed in the shaded region of
Figure 5. All such queries contain no points in the query
region and all such queries can be answered in O(1) time by simply
checking that all four corners of the square are to the left of the point
set. However, when we decompose these queries into four 2-sided
queries we obtain 2-sided queries that require Ω(logn) time
to be answered.
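
The inclusion-exclusion reduction mentioned here can be sketched as follows. To keep the boundary cases simple, the sketch assumes half-open query rectangles [x1, x2) × [y1, y2), and it uses a brute-force 2-sided counter as a stand-in for a real data structure; both function names are our own.

```python
def count_dominating(points, q):
    """Brute-force 2-sided count: points with px >= qx and py >= qy."""
    qx, qy = q
    return sum(1 for (px, py) in points if px >= qx and py >= qy)


def count_rectangle(points, x1, x2, y1, y2):
    """Count points with x1 <= px < x2 and y1 <= py < y2 using four
    2-sided queries and the principle of inclusion-exclusion."""
    c = lambda x, y: count_dominating(points, (x, y))
    return c(x1, y1) - c(x2, y1) - c(x1, y2) + c(x2, y2)


# Hypothetical sample data, used only for illustration.
S = [(1, 4), (2, 2), (3, 5), (5, 1)]
print(count_rectangle(S, 2, 4, 2, 6))
```

Each 4-sided query becomes four 2-sided queries at the corners of the rectangle, which is why a favourable distribution of 4-sided queries can turn into an unfavourable distribution of 2-sided ones.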

Figure 5: Decomposing a 4-sided query into four 2-sided queries can
produce a bad distribution of 2-sided queries.

Open Problem 2.

Biased range trees require that the point set S and the
distribution D be known in advance. Is
there a self-adapting version of biased range trees that, without
knowing D in advance, can answer m queries, each drawn
independently from D in O(nlogn+mμD(T∗)) expected time?

Open Problem 3.

Determine the worst-case or the average case constants associated with
2-dimensional orthogonal range searching for comparison-based data structures.
By applying the result of Adamy and Seidel [1] on point
location to the arrangement A described in Section 2 one
immediately obtains an O(n^2) space data structure that answers
queries using at most 2 log n + O(log log n) comparisons. Is there
an O(n log n) space structure with the same performance?

Open Problem 4.

A point q∈ℝ^d is maximal with respect to S⊆ℝ^d
if no point of S has every coordinate larger than the corresponding
coordinate of q. For d≥3, is there a distribution-sensitive
data structure for
testing if a query point q is maximal? For point sets in 2
dimensions, an orthogonal variant of the point-location techniques of Collette
et al. [7] seems to apply.

Open Problem 5.

Are there distribution-sensitive data structures for d-sided range
search in point sets in ℝ^d? The current fastest structures
for range search in point sets in ℝ^d that use near-linear space have
Θ(log^{d−1} n) query time. Is there a structure that uses
near-linear space and is optimal
when the point set S and the distribution D are known in advance?

Footnotes

1. A full d-ary tree is a rooted ordered tree in which each non-leaf node
has exactly d children.

2. Although T is not exactly a k-d tree as described
in Reference [8], the proof found there still holds.

3. That biased range trees can answer semigroup queries follows
from Properties 1–3 of the catalogues in Section 3.3.

References

[1] U. Adamy and R. Seidel.
On the exact worst case query complexity of planar point location.
In Proceedings of the Ninth Annual ACM-SIAM Symposium on
Discrete Algorithms, pages 609–618, 1998.