KDD 2006 Conference Review Session

Given $\QN$ nodes in a social network (say, authorship network),
how can we find the node/author that is the center-piece,
and has direct or indirect connections to all, or most of them?
For example, this node could be the common advisor,
or someone who started the research area that the $\QN$ nodes belong to.
Isomorphic scenarios appear in law enforcement
(find the master-mind criminal, connected to all current suspects),
gene regulatory networks (find the protein
that participates in pathways with all or most of the given $\QN$ proteins),
viral marketing and many more.
Connection subgraphs is an important first step, handling the case
of $\QN$=2 query nodes. Then, the connection subgraph algorithm
finds the $b$ intermediate nodes, that provide a good connection
between the two query nodes.
Here we generalize the challenge in multiple
dimensions: First, we allow more than two query nodes. Second, we
allow a whole family of queries, ranging from 'OR' to 'AND', with
'softAND' in-between. Finally, we design and compare a fast
approximation, and study the quality/speed trade-off.
We also present experiments on the DBLP dataset. The experiments
confirm that our proposed method naturally deals with multi-source
queries and that the resulting subgraphs agree with our intuition.
Wall-clock timing results on the DBLP dataset show that our
proposed approximation achieve good accuracy for about $6:1$
speedup.

How do we find patterns in author-keyword associations, evolving over
time? Or in DataCubes, with product-branch-customer sales information?
Matrix decompositions, like principal component analysis (PCA) and
variants, are invaluable tools for mining, dimensionality reduction,
feature selection, rule identification in numerous settings like streaming
data, text, graphs, social networks and many more.However, they have only
two orders, like author and keyword, in the above example. We propose to
envision such higher order data as tensors, and tap the vast literature on
the topic. However, these methods do not necessarily scale up, let alone
operate on semi-infinite streams. Thus, we introduce the dynamic tensor
analysis (DTA) method, and its variants. DTA provides a compact summary
for high-order and high-dimensional data, and it also reveals the hidden
correlations. Algorithmically,we designed DTA very carefully so that it is
(a) scalable, (b) space efficient (it does not need to store the past) and
(c) fully automatic with no need for user defined parameters. Moreover, we
propose STA, a streaming tensor analysis method, which provides a fast,
streaming approximation to DTA. We implemented all our methods, and
applied them in two real settings, namely, anomaly detection and multi-way
latent semantic indexing. We used two real, large datasets, one on network
flow data (100GB over 1 month) and one from DBLP (200MB over 25 years).
Our experiments show that our methods are fast, accurate and that they
find interesting patterns and outliers on the real datasets.