In this paper, we study the computation of statistics over data that is stored
across a distributed set of nodes. In this setting, a data vector of size N is
distributed into L nodes such that each node holds a partial data in such a way
that the global data vector is the sum of the locally stored data vectors. The
goal is that each node transmits information to some global aggregator node,
which can then compute the desired statistical function. For this setting, there
are well-known lower bounds that show that even for simple aggregation functions
such as the max-value, the total amount of communication required is at least
linear in the size of the data N in the worst-case.
In this paper, we show that
these lower bounds can be beaten if we assume that the underlying data has
sparsity properties. Specifically, we devise a new algorithm for the distributed
outlier detection problem that is based on compressive-sensing and exploits the
sparse structure of the underlying data. We show both empirically on real
web-scale production data as well as theoretically that such use of compressive
sensing-based techniques can result in substantial improvements for distributed
computing problems. Specifically, we prove under some conjecture that the
algorithm can succeed in recovering outliers using only O(sc {\cdot} logN)
constant size messages, where c is a small constant and s is the sparsity of the
underlying data.