
Random samples as synopses of datasets

Random samples are flexible general-purpose
synopses of data sets that are too massive to store in full,
manipulate, or transmit.
Answers to queries posed over
the original data can be quickly estimated from the sample, and the
same sample can be used for many types of queries.
My work on sampling aims to facilitate a more effective use of
samples as synopses by designing sampling schemes and estimators
for fundamental classes of queries.

Scalable design of sampling algorithms should depend on how the data
is presented, to avoid the time and
cost of aggregation. We distinguish between
data sets presented as key-value pairs (which can be streamed or distributed),
data sets where a vector of values is associated with each key
and these values are dispersed across servers or time,
and finally unaggregated data sets, where the data contains multiple
elements with the same key and the value of a key is the sum of
its elements' weights.
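
To make the distinction concrete, here is a minimal Python sketch (illustrative only, not from any of the papers) of the unaggregated presentation: the aggregated value of a key is the sum of the weights of all elements bearing that key.

    # A minimal sketch of the unaggregated presentation: the value of a key
    # is the sum of the weights of its elements.
    from collections import defaultdict

    def aggregate(elements):
        """elements: iterable of (key, weight) pairs, possibly with repeated keys."""
        totals = defaultdict(float)
        for key, weight in elements:
            totals[key] += weight
        return dict(totals)

    # Example: key 'a' appears in three elements; its aggregated value is 1+2+0.5.
    stream = [("a", 1.0), ("b", 2.0), ("a", 2.0), ("c", 0.5), ("a", 0.5)]
    print(aggregate(stream))   # {'a': 3.5, 'b': 2.0, 'c': 0.5}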

The value of our sample hinges on the performance of
our estimators. The sampling scheme should be geared to the
statistics (queries) we would like supported and estimators should
optimally use the information in the sample. I am also
interested in understanding the limits of a sampling scheme: characterizing the
queries we can estimate well and providing optimal estimators for a given query.

Structure-aware sampling

Data is typically structured and queries conform to the structure.
Sampling schemes, however, are oblivious to the structure.
We propose sampling schemes that are
structure-aware and allow for more accurate approximate range queries
while retaining the benefits of sample-based synopses. Part of the
work builds on the flexibility within the VarOpt
family of distributions
to gain structure-awareness.

Coordinated sampling is a method of sampling different
instances so
that the sampling of each instance is classic Poisson or
bottom-k but samples of different instances are
more similar when the instances are more similar.
Statisticians proposed and used coordination
as a method of controlling overlap in repeated
surveys.
Computer scientists, myself included, (re-)discovered coordination for data analysis
from samples: aggregates such as distinct counts and Jaccard Similarity
can be better estimated when samples are
coordinated. Coordination generalizes the technique popularly
known as MinHash sketches. Sample coordination is a form of
Locality
Sensitive Hashing (LSH). I first came across coordination when
observing that coordinated samples of all neighborhoods (or
reachability sets) in a graph can be computed much more
efficiently than independent samples. With time, I recognized the
potential of coordination for large-scale data analysis and opted to better
understand it. We presented tighter estimators for basic
queries, and more recently
proposed a model of monotone sampling, which generalizes
coordinated sampling, and provided a full characterization
of the queries we can successfully answer along with efficient and practical optimal
estimators.
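
As a rough illustration (a sketch of the general idea, not code from the papers), coordination can be obtained by sampling every instance with the same per-key random ranks: each sample is then an ordinary bottom-k sample, but samples of similar instances are similar. The key names and the salt below are hypothetical.

    # Coordinated bottom-k sampling: all instances use the SAME per-key ranks,
    # so each sample is a standard bottom-k sample, yet samples of similar
    # instances overlap heavily.
    import hashlib

    def rank(key, salt="shared-seed"):
        # Shared hash -> rank in [0,1); identical across all instances.
        h = hashlib.sha256((salt + key).encode()).hexdigest()
        return int(h, 16) / 16**64

    def bottom_k(keys, k):
        return set(sorted(keys, key=rank)[:k])

    A = {f"key{i}" for i in range(1000)}
    B = {f"key{i}" for i in range(50, 1050)}   # heavy overlap with A

    sA, sB = bottom_k(A, 20), bottom_k(B, 20)
    print(len(sA & sB))   # large overlap because the instances are similar;
                          # independent samples would overlap far less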

In particular, since in general there may not exist one estimator with minimum variance
for all possible data, we seek admissibility (Pareto variance optimality),
meaning that an estimator cannot be improved for one data set in the domain
without increasing variance for another.
We also propose the stricter
order-based optimality, where an order is
specified over the data domain and we look for estimators which
cannot be improved for one data set without increasing variance for
data sets that precede it in the order.

Get the most out of your sample: Optimal
unbiased estimators using partial information.
Edith Cohen and
Haim Kaplan.

Independent sampling was widely believed to be useless for basic
queries, including distinct count queries: information-theoretic lower bounds show that
we need to sample almost the full data set in order to obtain meaningful
estimates.
We show that when sampling with "reproducible randomization,"
these lower bounds do not apply, and we
obtain good estimators when only a small fraction of
the data is sampled, for distinct counts,
their generalization to the L1 norm of the coordinate-wise maxima,
and, in follow-up work, the Lp difference.
We also propose a principled method of deriving order-optimal
estimators.

What you can do with Coordinated Samples
Edith Cohen and
Haim Kaplan.

For coordinated sampling, which has simpler structure than
independent sampling, we were able to precisely characterize the queries for which
good estimators exist. We introduce a notion of variance
competitiveness, meaning that for any data, the expectation of
the square of the estimate is within a constant factor of
the minimum possible for that data. Moreover, we show that O(1) variance
competitiveness is always attainable when an estimator with finite
variances exists.

We define a model for monotone sampling and monotone
estimation:
Suppose we are interested in estimating a nonnegative
function f of the data from a "sample" of the data. We define a
"sampling scheme" here as a
function of the data and a (uniformly random) seed value
from [0,1].
Monotonicity of the sampling scheme means that the output provides
more information on the data when the seed value is lower.
Our goal is to obtain a good estimator for f.
We are interested in estimators that are unbiased (because in
applications we sum-aggregate over many estimation problems) and
nonnegative (because the function f is). We also seek
admissible and variance competitive estimators.
This formulation generalizes
coordinated sampling. Interestingly, the characterization we provided
here for this special case extends
to monotone sampling.
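
One concrete instance of a monotone sampling scheme (a minimal sketch under my own choice of scheme, not the paper's general framework) is threshold/PPS sampling of a single nonnegative value w: with seed u, the value is revealed when u <= min(1, w/T), so a lower seed can only reveal more. The Horvitz-Thompson estimate divides the revealed value by its inclusion probability; the threshold T below is a hypothetical parameter.

    # Threshold/PPS sampling as a monotone scheme, with the Horvitz-Thompson
    # estimate of f(w) = w. A lower seed can only reveal more information,
    # which is the monotonicity requirement.
    import random

    T = 10.0   # hypothetical sampling threshold

    def sample(w, u):
        """Return the revealed value, or None if w was not sampled."""
        return w if u <= min(1.0, w / T) else None

    def ht_estimate(outcome):
        if outcome is None:
            return 0.0
        w = outcome
        return w / min(1.0, w / T)   # = T when w < T, = w otherwise

    w = 3.0
    est = [ht_estimate(sample(w, random.random())) for _ in range(200000)]
    print(sum(est) / len(est))   # close to 3.0: the estimator is unbiased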

We make several contributions, aimed both at a better theoretical understanding of monotone
sampling and at providing practical and effective estimators to be
used as tools for data mining and analysis.
We explore the space of admissible and variance competitive estimators.
We show how to derive an order-optimal estimator for any specified
order on the data domain. This allows us to customize the estimator by choosing an
order which reflects patterns present in our data.
We present the L* estimator, which has a very compelling and natural set of properties:
The L* estimator is defined (and can be constructed) for any monotone estimation problem for
which an estimator with finite
variances exists. It is guaranteed to be (at most) 4-competitive. Moreover, it is the unique
admissible monotone estimator, meaning that the estimate we obtain is
non-decreasing with the information we have on the data. In that respect, it
improves over the classic Horvitz-Thompson estimator.
Lastly, we explore and tighten the gap between the worst-case upper and lower bounds of variance competitiveness.

A particularly important class of queries is Lp difference queries.
The query specifies a domain via a selection predicate over keys
and we are interested in the Lp difference on the values assumed
by these keys in the different instances.
When the data is sampled, we are interested in approximating the Lp
difference from the samples.
We derive Lp difference estimators that are applied to sampled
instances, considering both independent and coordinated
sampling: for independent sampling we extend
our technique, and for
coordinated sampling we use our monotone estimation framework.
We experimentally study these estimators on data sets with
different properties.
We demonstrate accurate estimates with only a
small fraction of items sampled. For coordinated samples, we consider
the L* estimator, which is optimized for small differences, the U*
estimator, which is optimized for large differences, and the
optimally competitive estimator, which minimizes the worst ratio
over the data domain. We demonstrate the benefit of an informed
choice of estimators when we have information on
data patterns.
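
For reference, here is what the query itself computes on full, unsampled data (a minimal sketch with hypothetical instances; the estimators above approximate this quantity from samples).

    # The Lp difference over keys selected by a predicate, computed exactly
    # from two full instances (the quantity that the estimators approximate
    # from samples).
    def lp_difference(inst1, inst2, select, p):
        keys = set(inst1) | set(inst2)
        return sum(abs(inst1.get(k, 0.0) - inst2.get(k, 0.0)) ** p
                   for k in keys if select(k))

    day1 = {"alice": 5.0, "bob": 2.0, "carol": 1.0}
    day2 = {"alice": 3.0, "bob": 2.0, "dave": 4.0}
    print(lp_difference(day1, day2, select=lambda k: True, p=1))   # 2+0+1+4 = 7.0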

Stream Sampling for Frequency Cap Statistics.
Edith Cohen.

Two common statistics over unaggregated data are distinct keys, which
is the number of active keys in a specified segment, and sum, which
is the sum of the frequencies of keys in the segment. These are two
special cases of frequency cap statistics, defined as
the sum of frequencies capped by a parameter T, which are
popular in online advertising platforms. We propose the first solution
for scalable and accurate estimation of general frequency cap
statistics. Our framework brings the benefits of approximate
distinct counters to general frequency cap statistics.
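
To illustrate the statistic itself (exact computation on a toy stream, not the paper's estimation framework): the cap-T statistic sums each key's frequency capped at T, so T=1 recovers the distinct count and a very large T recovers the plain sum.

    # Exact frequency cap statistic over an unaggregated stream.
    from collections import Counter

    def cap_statistic(stream, T):
        freq = Counter(stream)
        return sum(min(c, T) for c in freq.values())

    stream = ["u1", "u2", "u1", "u3", "u1", "u2"]   # e.g. impressions per user
    print(cap_statistic(stream, T=1))        # 3  (distinct keys)
    print(cap_statistic(stream, T=2))        # 5  (= min(3,2)+min(2,2)+min(1,2))
    print(cap_statistic(stream, T=10**9))    # 6  (plain sum of frequencies)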

The average distance from a node to all other nodes in a graph, or
from a query point in a metric space to a set of points, is a
fundamental quantity in data analysis. The inverse of the average
distance, known as the (classic) closeness centrality of a node, is a
popular importance measure in the study of social networks.
We develop novel structural
insights on the sparsifiability of the distance relation via weighted
sampling. Based on that, we present novel
algorithms with strong statistical guarantees for estimating
average distances from a query point/node or
between all pairs of points/nodes. Our results significantly improve over
previous work, which was mostly based on uniform sampling, in terms of computation and estimation quality tradeoffs.

Graphs with billions of edges, such as social and Web graphs, are
common and are a powerful source of information. The graph
structure, alone or combined with meta-data, whether a static snapshot or
evolving over time, encodes
information on centralities, interests, similarities,
influence, and communities. To effectively mine this information,
algorithms must be highly scalable and models should accurately
capture the real-world properties we are after.

I use a combination of algorithmic techniques, sketching and
estimation techniques, modeling, and data sets, to develop effective tools for
mining massive graphs.

Influence functions in a graph are defined for subsets of
nodes, according to graph structure. The aim is to find subsets of
seed nodes with (approximately) optimal tradeoff of size and
influence. We define a rich class of graph-based influence functions
which unifies and extends previous work. Influence is defined through
pairwise utility values derived from relations such as
reachability, distance, reverse rank. The utility of a seed set to a
node is then defined as a submodular aggregate of individual seed
utilities. Finally, the influence of the seed set is the sum over
nodes of its utility to the node. Previous work focused on
maximum aggregation only. More general submodular aggregation
captures fault tolerance of the seed set and, more broadly,
allows additional seeds to contribute value.
We present a meta-algorithm for approximate greedy maximization with
strong approximation quality guarantees and worst-case near-linear computation for all functions
in our class. Our meta-algorithm generalizes recent designs specific
for maximum aggregation and specific utility functions.
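
A small illustrative sketch of the model (hypothetical utility values, not the paper's meta-algorithm): influence is a sum over nodes of a submodular aggregate of the seeds' utilities to that node; maximum aggregation is the classic case, while a top-2 sum is one aggregate in which additional seeds still contribute value.

    # Influence as a sum over nodes of a submodular aggregate of seed utilities.
    def influence(seeds, utility, nodes, aggregate):
        return sum(aggregate([utility[s][j] for s in seeds]) for j in nodes)

    # max recovers the classic "best seed" aggregation; the top-2 sum is one
    # submodular aggregate in which additional seeds still add value.
    max_agg  = lambda vals: max(vals, default=0.0)
    top2_agg = lambda vals: sum(sorted(vals, reverse=True)[:2])

    nodes = ["a", "b", "c"]
    utility = {                      # hypothetical pairwise utility values
        "s1": {"a": 1.0, "b": 0.2, "c": 0.0},
        "s2": {"a": 0.8, "b": 0.9, "c": 0.1},
    }
    print(influence({"s1", "s2"}, utility, nodes, max_agg))    # 1.0+0.9+0.1 = 2.0
    print(influence({"s1", "s2"}, utility, nodes, top2_agg))   # 1.8+1.1+0.1 = 3.0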

Distances in a network capture relations between nodes
and are the basis of centrality, similarity, and influence measures.
Often, however, the relevance of a node u to a node
v is more precisely measured not by the magnitude of the distance, but
by the number of nodes that are closer to v than u. That is,
by the rank of u in an ordering of nodes by increasing distance
from v.
We identify and address fundamental challenges in rank-based
graph mining. We first consider single-source computation of reverse-ranks and
design a "Dijkstra-like" algorithm which computes
nodes in order of increasing approximate reverse rank while only
traversing edges adjacent to returned nodes.
We then define reverse-rank influence and
present a near-linear algorithm for
greedy approximate reverse-rank influence maximization. The design relies on
our single-source algorithm and on the SKIM
sketch-based approximate greedy maximization framework.
Our algorithms utilize near-linear
preprocessing of the network to compute all-distance sketches.
As a contribution of independent interest, we present a
novel algorithm for computing these sketches, which have many other
applications, on multi-core
architectures.
We complete this work by establishing the hardness of
computing exact reverse-ranks for a single source and
exact reverse-rank influence, which shows that, with
near-linear computation, the small relative errors we obtain are
the best we can currently hope for.
Finally, we conduct an experimental evaluation on graphs with tens of
millions of edges, demonstrating both scalability and accuracy.
This project is Eliav's M.Sc. thesis (Tel Aviv University).
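
For concreteness, a naive illustration of the reverse-rank definition (Dijkstra from the target node, not the "Dijkstra-like" single-source algorithm of the paper): the reverse rank of u with respect to v is u's position when all nodes are ordered by increasing distance from v.

    # Naive reverse-rank computation (definition only; not scalable).
    import heapq

    def dijkstra(graph, source):
        dist = {source: 0.0}
        heap = [(0.0, source)]
        while heap:
            d, x = heapq.heappop(heap)
            if d > dist.get(x, float("inf")):
                continue
            for y, w in graph[x]:
                if d + w < dist.get(y, float("inf")):
                    dist[y] = d + w
                    heapq.heappush(heap, (d + w, y))
        return dist

    def reverse_rank(graph, u, v):
        """Rank of u among all nodes ordered by increasing distance from v."""
        dist = dijkstra(graph, v)
        order = sorted(dist, key=dist.get)
        return order.index(u)          # 0 is v itself; ties broken arbitrarily

    graph = {                          # hypothetical undirected graph (adjacency lists)
        "v": [("a", 1.0), ("b", 5.0)],
        "a": [("v", 1.0), ("b", 1.0)],
        "b": [("a", 1.0), ("v", 5.0)],
    }
    print(reverse_rank(graph, "b", "v"))   # 2: only v itself and a are closer to v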

A premise at the heart of network analysis is that entities in a
network derive utilities from their connections. The
influence of a seed set S of nodes is defined as the sum over
nodes j of the utility of S to
j.
Distance-based utility, which is a decreasing function of the distance from S to
j, was explored in several successful research threads from social
network analysis and economics: Network formation games [Bloch and
Jackson 2007]; Reachability-based influence [Richardson and Domingos
2002; Kempe et al. 2003]; "threshold" influence [Gomez-Rodriguez
et al. 2011]; and closeness centrality [Bavelas 1948].
We formulate a model that unifies and extends this previous work and
address the two fundamental computational problems in this domain:
Influence oracles and influence maximization (IM). An
oracle performs some preprocessing, after which influence queries for
arbitrary seed sets can be efficiently computed. With IM, we seek a
set of nodes of a given size with maximum influence. Since the IM
problem is computationally hard, we instead seek a greedy
sequence of nodes, with each prefix having influence that is at
least 1-1/e of that of the optimal seed set of the same size. We
present the first highly scalable algorithms for both problems,
providing statistical guarantees on approximation quality and
near-linear worst-case bounds on the computation. We perform an
experimental evaluation which demonstrates the effectiveness of our
designs on networks with hundreds of millions of edges.

Propagation of contagion is a fundamental process in social,
biological, and physical networks. It is used to model the spread of information,
influence, or a viral infection. Diffusion patterns can be
specified by a probabilistic model, such as Independent Cascade (IC),
or captured by a set of representative traces.
Basic computational problems in the study of diffusion
are influence queries (determining the potency of a specified
seed set of nodes) and Influence
Maximization (identifying the most influential seed set
of a given size). Answering each influence query involves
many edge traversals, and does not scale when
there are many queries on very large graphs.
The gold standard for Influence Maximization is
the greedy algorithm, which iteratively adds to the seed set a node
maximizing the marginal gain in influence. Greedy has a guaranteed
approximation ratio of at least (1-1/e) and produces a sequence of nodes, with each prefix
having an approximation guarantee with respect to the same-size optimum.
Since Greedy does not scale well beyond
a few million edges, for larger inputs one must currently use either
heuristics or alternative algorithms designed for a pre-specified
small seed set size.
We propose a novel sketch-based design for influence computation.
Our greedy Sketch-based Influence Maximization (SKIM)
algorithm scales to graphs with billions of edges, with one to two orders of
magnitude speedup over the best greedy methods. It still has a guaranteed
approximation ratio, and in practice its quality nearly matches that
of exact greedy. We also present influence oracles, which use linear-time preprocessing to
generate a small sketch for each node, allowing the influence of any seed set to be
quickly answered from the sketches of its nodes.
In terms of techniques, this work utilizes my 1994 reachability sketching
technique for directed graphs and
efficient estimators. We extend reachability sketches to
be across multiple graphs or a probabilistic model, retaining the
same storage requirement and approximation guarantees. For influence maximization,
our efficient approximate greedy algorithm SKIM only computes sketches to the
point needed to determine the approximate maximum marginal gain with
sufficient confidence. It also updates the sketches after each
selection to be with respect to the residual problem, so that no
work is wasted.
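
For contrast with SKIM, here is a minimal sketch of the exact-greedy baseline over a set of representative traces (toy data, not the paper's sketch-based algorithm): the influence of a seed set is the average, over traces, of the number of nodes it reaches, and greedy repeatedly adds the node with the largest marginal gain.

    # Exact greedy influence maximization over representative traces
    # (the computation that SKIM accelerates with sketches).
    def reachable(trace, seeds):
        seen, stack = set(seeds), list(seeds)
        while stack:
            x = stack.pop()
            for y in trace.get(x, ()):
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        return seen

    def influence(traces, seeds):
        return sum(len(reachable(t, seeds)) for t in traces) / len(traces)

    def greedy(traces, nodes, size):
        seeds = []
        for _ in range(size):
            gain = lambda v: influence(traces, seeds + [v]) - influence(traces, seeds)
            seeds.append(max((v for v in nodes if v not in seeds), key=gain))
        return seeds

    # Two hypothetical traces (directed adjacency lists) over nodes 1..4.
    traces = [{1: [2, 3], 3: [4]}, {1: [2], 2: [3]}]
    print(greedy(traces, nodes=[1, 2, 3, 4], size=2))   # [1, 4]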

The All-Distances Sketches (ADS), which I
introduced back in 1994,
turn out to be a very effective tool for mining massive graphs.
A small-size ADS can be
efficiently computed for each node in the graph. In my original work
I used the ADS of each node to estimate neighborhood cardinalities (number
of nodes within a certain distance) and also showed how sketches of
different nodes can be used to estimate neighborhood relations.
We provide a unified view of several ADS definitions which evolved
over time and propose the novel HIP estimators. The HIP estimators
obtain tighter estimates of neighborhood cardinalities and closeness
centralities from the sketch of a node. They
also facilitate good estimates for a powerful natural class of
queries.
We also show how the HIP estimators can be applied to approximate
distinct counting on data streams, improving over state-of-the-art
practical algorithms.

Closeness centrality is an
importance measure of nodes in social and massive graphs which is based
on the distances from the node to all other nodes.
The classic definition, proposed by Bavelas (1950), Beauchamp (1965), and Sabidussi (1966), is (the inverse of) the average distance
to all other nodes. We propose the first highly scalable (near linear-time
processing and linear space overhead) algorithm for
estimating, within a small relative error, the classic closeness centralities
of all nodes in the graph. Our algorithm provides strong probabilistic
guarantees on the approximation quality for all nodes of any undirected graph, as well as for centrality computed with respect to round-trip distances in directed graphs.
For directed graphs, we also propose algorithms that approximate
generalizations of classic closeness centrality to outbound and
inbound centralities. These algorithms do not provide worst-case theoretical approximation guarantees, but perform well on the networks we tested.
We perform extensive experiments on large networks,
demonstrating high scalability and accuracy.
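
For reference, the classic measure in formula form (standard notation, not quoted from the paper): for a graph with n nodes,

    % Classic closeness centrality: the inverse of the average distance from v
    % to all other nodes.
    \[
      C(v) \;=\; \frac{n-1}{\sum_{u \ne v} d(v,u)} .
    \]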

Similarity estimation between nodes based on structural properties of graphs is a basic building block in the analysis of massive
networks, used for link prediction, product recommendations,
advertisement, attribute completion, and more. Similarity measures
that are based on global properties have better recall than local
ones but are harder to compute. We make several contributions
aiming for both accuracy and scalability: First, we define
closeness similarity, a natural measure that compares two nodes
based on the similarity of their relations to all other
nodes. Second, we show how the all-distances sketch (ADS) node
labels, which are efficient to compute, can support the estimation
of closeness similarity in logarithmic query time. ADSs can also
be used as shortest-path (SP) distance oracles, matching the performance of
dedicated techniques.
Third, we propose the randomized edge
lengths (REL) technique and define the corresponding REL distance,
which captures both path length and path multiplicity and therefore
improves over the SP distance as a similarity measure. The REL
distance can also be the basis of closeness similarity and can be
estimated using SP computation or using the ADS labels.

Size-Estimation Scheme and Applications

My size-estimation scheme is based on a simple but powerful technique: the
least-element (LE) method. This method is better known today as min-hash:
Suppose you have a subset of a universe of items. You (implicitly) assign a
random permutation (or random values) to all items. We refer to
the permutation rank or the random value assigned to an item as its
rank.
Now, the LE of a subset is the minimum rank of its members.
The LE is in expectation smaller when there are more elements in
the set. Therefore, it is possible to estimate the size of the set
from its LE rank. When LE ranks of different subsets are obtained
using the same random permutation, they are coordinated.
Coordination facilitates many applications. In particular, it means
that the LE ranks are mergeable: The LE of the
union of subsets is the minimum of the LEs of its members. The LE ranks
can also be used to estimate the similarity of sets. For example, the
probability that two sets have the same LE is equal to their
Jaccard similarity, which is the ratio of the intersection size to
the union size. Therefore, the Jaccard similarity of the sets can be estimated
from the LE similarity.
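
A minimal sketch of the LE (min-hash) method with shared random ranks (illustrative toy sets): it checks mergeability and that the LE-match probability approaches the Jaccard similarity.

    # LE (min-hash) with shared random ranks: mergeability, and LE-match
    # probability equal to the Jaccard similarity (checked empirically).
    import random

    def le(subset, ranks):
        return min(ranks[x] for x in subset)

    A, B = set(range(0, 60)), set(range(40, 100))   # |A&B|=20, |A|B|=100, Jaccard 0.2

    matches, trials = 0, 20000
    for _ in range(trials):
        ranks = {x: random.random() for x in A | B}   # shared random ranks
        assert le(A | B, ranks) == min(le(A, ranks), le(B, ranks))   # mergeable
        matches += (le(A, ranks) == le(B, ranks))
    print(matches / trials)   # close to the Jaccard similarity 0.2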

The accuracy of these size or similarity estimates can be enhanced by repeating
this k times (using k permutations), by taking the k smallest ranks
in a single permutation, or by hashing elements randomly to k
buckets and taking the LE of each of the k buckets. I refer to
these sketch flavors, respectively, as k-mins, bottom-k, or k-partition
sketches.
These three flavors
have different advantages, depending on the tradeoffs we want to
achieve. Asymptotically, the estimators with all three have the
same behavior, but bottom-k sketches carry the most information. In this
early work (1994-1997), I focused mostly on k-mins estimators (with a
mention of bottom-k sketches), studying cardinality and similarity estimation.
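
As a small illustration of the bottom-k flavor with uniform [0,1] ranks (a standard estimator, sketched here with simulated ranks): keeping the k smallest ranks, (k-1) divided by the k-th smallest rank is unbiased for the set size whenever the set has at least k elements.

    # Bottom-k cardinality estimation with uniform [0,1] ranks.
    import random

    def bottom_k_ranks(n, k):
        return sorted(random.random() for _ in range(n))[:k]

    def estimate_size(sketch):
        k = len(sketch)
        return (k - 1) / sketch[-1]     # tau_k = largest rank kept in the sketch

    n, k = 5000, 64
    est = [estimate_size(bottom_k_ranks(n, k)) for _ in range(300)]
    print(sum(est) / len(est))   # close to 5000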

The first application of the LE technique for size estimation I am aware of is
for approximate distinct counting [Flajolet and Martin 1985]. As for
similarity estimation,
LE sketches can be viewed as a special case of coordinated samples
[Brewer, Early, Joyce 1972].
A very well known application of the LE technique is for
near-duplicate detection of Web pages [Broder 1997].

I first applied the LE technique in a graph setting (see below): It
turned out that (coordinated) sketches of all neighborhoods and all
reachability sets of all nodes in a graph can be very efficiently
computed, with processing that is nearly linear in graph size.
This resulted in a powerful technique for analyzing very
large graphs.
Other early applications I explored (see below) are determining the optimal order
to perform a chain multiplication of sparse matrices and sketch-based
tracking of the state of distributed processes for roll-back recovery.

Parallel Algorithms: Shortest Paths and Maximum Flow

These algorithms, which I developed in the early 1990's, consider the
idealized PRAM model (Parallel Random Access Machine) of parallel
computation. The PRAM model assumes
we have as many parallel processors as we can use, and considers
the tradeoff between processing time and the total
"work" we use (which is the product of time and number of
processors). The principled exploration of the fundamental
chains of dependencies, however, is relevant for
the GPU, multi-core, and map-reduce architectures which dominate
massive computation today. Some of the structures I used also turn
out to be relevant for the design of sketching, streaming, and
dynamic sequential algorithms. Finally, the basic ideas are rather simple,
aiming not only for improved worst-case bounds but also for
simplicity and implementability.

Single-source shortest-paths (SSSP) computation seemed not to parallelize
efficiently. We show that for undirected graphs, if we allow a
slight fractional stretch (ratio of approximate distance to actual distance), we can perform s approximate
SSSP computations in polylogarithmic time
using work that is only a little more than performing s
sequential applications of Dijkstra's SSSP algorithm.
Our chief instrument, which is of independent
interest, is an efficient construction of hopsets. A hopset
for a fraction e and integer d is a small set of auxiliary edges E'
such that the shortest d-hop path in E+E' between any pair of nodes has length that is at most (1+e) times the distance between the nodes. Hopsets can
also be used as sparse spanners with fractional stretch
and small additive error.
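
The hopset condition can be restated in formula form (notation mine: epsilon for the fraction e, and the standard convention that an auxiliary edge (x,y) in E' is given length d_G(x,y)):

    % Hopset condition. G=(V,E) with shortest-path distances d_G; each auxiliary
    % edge (x,y) in E' has length d_G(x,y); d^{(d)} denotes the length of the
    % shortest path that uses at most d edges (hops).
    \[
      \forall u,v \in V:\quad
      d_G(u,v) \;\le\; d^{(d)}_{(V,\,E \cup E')}(u,v) \;\le\; (1+\epsilon)\, d_G(u,v) .
    \]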

Bottom-k sketches: better and more efficient estimation of aggregates.
E. Cohen and H. Kaplan.
In Proceedings of the ACM Sigmetrics/Performance Joint International Conference on Measurements and Modeling of Computer Systems (poster). 2007.

Packet Classification in large ISPs: Design and evaluation of
decision tree classifiers.
E. Cohen and C. Lund.
In Proceedings of the ACM Sigmetrics/Performance Joint International Conference on Measurements and Modeling of Computer Systems. 2005.

Spatially-Decaying Aggregation over a Network: Model and Algorithms.
E. Cohen and H. Kaplan.
In Proceedings of the ACM SIGMOD
International Conference on Management of Data. 2004.