Abstract

This work deals with approximate string search in large spatial databases. Specifically, we investigate range queries
augmented with a string similarity search predicate in both Euclidean space and road networks. We dub this query the spatial
approximate string (SAS) query. In Euclidean space, we propose an approximate solution, the MHR-tree, which embeds min-wise
signatures into an R-tree. The min-wise signature for an index node u keeps a concise representation of the union of q-grams from
strings under the sub-tree of u. We analyze the pruning functionality of such signatures based on the set resemblance between
the query string and the q-grams from the sub-trees of index nodes. We also discuss how to estimate the selectivity of a SAS
query in Euclidean space, for which we present a novel adaptive algorithm to find balanced partitions using both the spatial and
string information stored in the tree. For queries on road networks, we propose a novel exact method, RSASSOL, which significantly
outperforms the baseline algorithm in practice. RSASSOL combines q-gram-based inverted lists with reference-node-based
pruning. Extensive experiments on large real data sets demonstrate the efficiency and effectiveness of our approaches.
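
The signature idea described above can be illustrated with a minimal sketch. It is not the MHR-tree itself: the hash construction (seeded BLAKE2b), the unpadded q-gram extraction, and all function names are assumptions made for illustration. It shows only the two properties the abstract relies on: the signature of a union of q-gram sets is the element-wise minimum of the individual signatures (so a parent node can summarize its children), and the fraction of matching positions estimates set resemblance.

```python
import hashlib


def qgrams(s, q=2):
    """Return the set of q-grams of s (no padding characters; an assumption)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def _hash(gram, seed):
    """Hypothetical seeded hash standing in for one min-wise permutation."""
    data = f"{seed}:{gram}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minwise_signature(grams, num_hashes=64):
    """One minimum hash value per seed; grams must be non-empty."""
    return [min(_hash(g, seed) for g in grams) for seed in range(num_hashes)]


def merge_signatures(sig_a, sig_b):
    """Signature of the union of two sets: the element-wise minimum."""
    return [min(a, b) for a, b in zip(sig_a, sig_b)]


def estimated_resemblance(sig_a, sig_b):
    """Fraction of agreeing positions, an estimate of |A ∩ B| / |A ∪ B|."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

For example, `merge_signatures(minwise_signature(qgrams("theatre")), minwise_signature(qgrams("theater")))` yields the same list as computing the signature of the combined q-gram set directly, which is what lets an index node keep one fixed-size signature for its whole sub-tree.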

Abstract

In online tracking, an observer S receives a sequence of
values, one per time instance, from a data source that is described
by a function f. A tracker T wants to continuously
maintain an approximation that is within an error threshold
of the value f(t) at any time instance t, with small communication
overhead. This problem was recently formalized
and studied, and a principled approach with optimal
competitive ratio was proposed. This work extends
the study of online tracking to a distributed setting, where
a tracker T wants to track a function f that is computed
from a set of functions {f1, . . . , fm} from m distributed observers
and respective data sources. This formulation finds
numerous important and natural applications, e.g., sensor
networks, distributed systems, measurement networks, and
pub-sub systems. We formalize this problem and present
effective online algorithms for various topologies of a distributed
system/network for different aggregate functions.
Experiments on large real data sets demonstrate the excellent
performance of our methods in practice.
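
To make the distributed setting concrete, here is a minimal baseline sketch, not one of the algorithms proposed in this work. It assumes the aggregate is a sum, that every observer reports directly to the tracker, and that the error threshold is split evenly across the m observers; the function name and the even split are assumptions for illustration.

```python
def track_sum(streams, delta):
    """Naive distributed tracking of a sum within error delta.

    streams: list of m equal-length value sequences, one per observer.
    Each observer resends its value only when it drifts by more than
    delta / m from the last value it sent, so the tracker's estimate
    of the sum is always within delta of the true sum.
    Returns (estimates per time instance, number of messages sent).
    """
    m = len(streams)
    local_threshold = delta / m
    last_sent = [None] * m
    estimates = []
    messages = 0
    for t in range(len(streams[0])):
        for i, stream in enumerate(streams):
            value = stream[t]
            if last_sent[i] is None or abs(value - last_sent[i]) > local_threshold:
                last_sent[i] = value  # observer i sends an update
                messages += 1
        estimates.append(sum(last_sent))
    return estimates, messages
```

The sketch only demonstrates the correctness requirement (bounded error at every time instance) and the cost being minimized (message count); it makes no claim about competitive ratio.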

Abstract

Histogram construction is a fundamental problem in data
management, and a good histogram supports numerous mining
operations. Recent work has extended histograms to
probabilistic data. However, constructing histograms
for probabilistic data can be extremely expensive, and existing
studies suffer from limited scalability. This work
designs novel approximation methods to construct scalable
histograms on probabilistic data. We show that our methods
provide constant approximations compared to the optimal
histograms produced by the state-of-the-art in the
worst case. We also extend our methods to parallel and
distributed settings so that they can run gracefully in a cluster
of commodity machines. We introduce novel synopses
to reduce communication cost when running our methods
in such settings. Extensive experiments on large real data
sets demonstrate the superb scalability and efficiency
achieved by our methods, when compared to the state-of-the-art
methods. They also achieve excellent approximation
quality in practice.

Abstract

Ranking temporal data has not been studied until recently [14],
even though ranking is an important operator (being promoted as a
first-class citizen) in database systems [8]. However, only the instant
top-k queries on temporal data were studied in [14], where
objects with the k highest scores at a query time instance t are to
be retrieved. The instant top-k definition clearly comes with limitations
(sensitive to outliers, difficult to choose a meaningful query
time t). A more flexible and general ranking operation is to rank
objects based on the aggregation of their scores in a query interval,
which we dub the aggregate top-k query on temporal data. For
example, return the top-10 weather stations having the highest average
temperature from 10/01/2010 to 10/07/2010; find the top-20
stocks having the largest total transaction volumes from 02/05/2011
to 02/07/2011. This work presents a comprehensive study of this
problem by designing both exact and approximate methods (with
approximation quality guarantees). We also provide theoretical
analysis on the construction cost, the index size, the update and the
query costs of each approach. Extensive experiments on large real
datasets clearly demonstrate the efficiency, the effectiveness, and
the scalability of our methods compared to the baseline methods.
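
The aggregate top-k definition above can be pinned down with a brute-force sketch. It scans every reading and uses no index, so it is a reference semantics rather than any of the methods studied here. The data layout (a dict mapping each object to a list of `(time, score)` readings) and the function name are assumptions for illustration.

```python
import heapq
from statistics import mean


def aggregate_top_k(series, t1, t2, k, agg=sum):
    """Return the k (aggregate, object) pairs with the highest aggregate
    score over readings whose time falls in the closed interval [t1, t2].

    series: dict mapping object id -> list of (time, score) pairs.
    agg: aggregation function applied to the scores, e.g. sum or mean.
    Objects with no reading in the interval are skipped.
    """
    scored = []
    for obj, readings in series.items():
        in_range = [score for (t, score) in readings if t1 <= t <= t2]
        if in_range:
            scored.append((agg(in_range), obj))
    return heapq.nlargest(k, scored)
```

Passing `agg=mean` corresponds to the average-temperature example, and the default `agg=sum` to the total-transaction-volume example.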

Abstract

In distributed data management, a primary concern is monitoring the distributed data and generating an alarm when a user-specified constraint is violated. A particularly useful instance is the threshold-based constraint, which is commonly known as the distributed threshold monitoring problem. This work extends this useful and fundamental study to distributed probabilistic data, which emerge in many applications where uncertainty naturally exists when massive amounts of data are produced at multiple sources in distributed, networked locations. Examples include distributed observing stations, large sensor fields, geographically separate scientific institutes/units, and many more. When dealing with probabilistic data, there are two thresholds involved: the score threshold and the probability threshold. One must monitor both simultaneously; as such, techniques developed for deterministic data are no longer directly applicable. This work presents a comprehensive study of this problem. Our algorithms significantly outperform the baseline method in terms of both the communication cost (number of messages and bytes) and the running time, as shown by an extensive experimental evaluation using several large real datasets.
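
The interplay of the two thresholds can be shown with a small centralized sketch. It assumes the monitored quantity is the sum of one independent discrete random variable per site, with a score threshold gamma and a probability threshold delta; these modelling choices and all names are assumptions for illustration, and shipping every distribution to one place is precisely the communication cost a distributed method tries to avoid.

```python
from collections import defaultdict


def convolve(pmf_a, pmf_b):
    """Distribution of A + B for independent discrete A and B.

    Each pmf is a dict mapping value -> probability.
    """
    out = defaultdict(float)
    for va, pa in pmf_a.items():
        for vb, pb in pmf_b.items():
            out[va + vb] += pa * pb
    return dict(out)


def exceed_probability(site_pmfs, gamma):
    """Pr[sum of all site values > gamma], computed exactly."""
    total = {0: 1.0}
    for pmf in site_pmfs:
        total = convolve(total, pmf)
    return sum(p for value, p in total.items() if value > gamma)


def should_alarm(site_pmfs, gamma, delta):
    """Raise an alarm when the score threshold gamma is exceeded
    with probability greater than delta."""
    return exceed_probability(site_pmfs, gamma) > delta
```

With two sites that each report 0 or 1 with equal probability, the sum exceeds 1 with probability 0.25, so an alarm is raised for delta = 0.2 but not for delta = 0.3.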

Abstract

GIS data usually consist of both spatial and textual information, where the spatial component represents the location of the object and the textual element contains a set of strings describing the object at that location. For GIS data situated on a road network, shortest path search is a basic operation. In practice, however, users are often interested in routing when certain constraints on the textual information have also been incorporated. This work complements the standard shortest path search with multiple keywords and an approximate string similarity function, where the goal is to find the shortest path that passes through at least one matching object per keyword; we dub this problem the multi-approximate-keyword routing (makr) query. We present both exact and approximate solutions. When the number κ of query keywords is small (e.g., κ ≤ 6), the exact solution works efficiently. However, when κ increases, it becomes increasingly expensive (especially on large GIS data). In this case, our approximate methods achieve superb query efficiency, excellent scalability, and high approximation quality, as indicated in our extensive experiments on large, real datasets (up to 2 million points on road networks with hundreds of thousands of nodes and edges). We also prove that one approximate method achieves a κ-approximation in the worst case.
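
As a small illustration of the matching step only, the sketch below finds, for each query keyword, the objects that carry an approximately matching string. It assumes edit distance with a threshold tau as the similarity function, which the abstract does not specify, and it does not perform any routing; the data layout and names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def matching_objects(objects, keywords, tau):
    """Map each keyword to the ids of objects having at least one
    string within edit distance tau of that keyword.

    objects: dict mapping object id -> list of descriptive strings.
    """
    return {
        kw: [oid for oid, strings in objects.items()
             if any(edit_distance(kw, s) <= tau for s in strings)]
        for kw in keywords
    }
```

A route answers the query only if it visits at least one object from each keyword's candidate list, so an empty list for any keyword means no valid route exists under this matching rule.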