Highlights from ICDM 2015

Nov 23, 2015

This past week I had the opportunity to attend the 15th IEEE International Conference on
Data Mining, held in Atlantic City, NJ, November 14-17, 2015.
This was the first scientific conference I attended and we had a chance to present our
work on scalable graph similarity calculation. In this post I will try to point out some
of the more interesting work from the conference (based on some of the sessions I attended)
and summarize the keynotes. I’ve included links to the full-text papers whenever I could
find them.

Highlights from the sessions I attended:

Day 1

Applications 1

The first session I attended was named “Applications 1” and it included
a number of applications (surprise!) on a diverse set of domains. The session started
with some very solid work on “Modeling Adoption and Usage of Competing Products”,
where the authors create a model that can provide insight into the factors that drive
product adoption and frequency of use, which they evaluate at a large scale by looking
into the use of URL shorteners on Twitter.
In “Mining Indecisiveness in Customer Behaviors” the authors investigated how they could
reduce indecisiveness in users interacting with an online retail platform, by making use
of information about competing products. The end goal is to increase conversion of course,
but it would be interesting to see how such a system could be implemented in a way that
is fair to all retailers/brands.

Two short papers I should point out were “Personalized Grade Prediction: A Data Mining
Approach” and
“Sparse Hierarchical Tucker and its Application to Healthcare”.
The first paper deals with personalized early grade prediction for students using only
assignment/homework data, which could allow course instructors to identify students who might have
trouble in a course early on. Most importantly, it uses only their data from the specific
course, thereby avoiding any potential privacy pitfalls. In the second work, a new tensor
factorization method is proposed that is 18x more accurate and 7.5x faster than the current
state-of-the-art. While the application presented here is limited to healthcare, I hope
that it can prove a starting point for a more generalized approach, as tensor factorization
problems can surface in a wide variety of domains, so solving their scalability problems
could have an effect on a wide range of fields.

Mining Social Networks 1

The next session I attended was “Mining Social Networks 1”, where the best student paper,
“From Micro to Macro: Uncovering and Predicting Information Cascading Process with
Behavioral Dynamics” was presented among others.
Cascade prediction has applications
in areas like viral marketing and epidemic prevention, so it’s a problem of great interest
to industry as well as society. The work presented here uses a data-driven approach
to create a “Networked Weibull Regression” model and uses it to predict cascades
as they occur: it models the micro-level behavioral dynamics, which are then aggregated to predict
the macro cascading process.

They evaluate their method on a dataset from Weibo, one of the largest Twitter-style
services in China, and show that their method handily beats the current state of the art.
It’s a well-written work that deserves the praise it got; however, I would definitely be interested
in seeing it applied and evaluated on a different publicly available dataset (although they are
hard to come by in this domain), and in an extension of the method that predicts the cascades as they
happen in real time (shameless plug: Use Apache Flink for your real-time processing needs!).

Big Data 2

The last session I attended on Sunday was “Big Data 2”. The two regular papers from that
session were perhaps application specific but nonetheless provided some valuable insights.
The first, “Accelerating Exact Similarity Search on CPU-GPU Systems” dealt with the exact
kNN problem, and how it can be efficiently accelerated on GPU-equipped systems. Although
approximate kNN methods like LSH currently seem to be the standard in industry, the
authors mentioned that the techniques presented could be used in that context as well,
so this is definitely something to look forward to. The second regular paper, “Online Model
Evaluation in a Large-Scale Computational Advertising Platform”, provided a rare look into how
a large advertising platform like Turn evaluates its bid prediction models online, something that
a previous related paper from Google, “Ad Click Prediction: a View from the Trenches”, was missing.

Day 2

Network Mining 1

An interesting idea presented in “Network Mining 1” was “Absorbing random-walk centrality”,
where the authors presented a way to identify teams of central nodes in a graph. An application
for this measure could be for example: given a subgraph of Twitter that we know contains
a number of accounts about politics, find the important nodes that represent a diverse set
of political views. The authors show that this is an NP-hard problem, and the greedy algorithm
presented has a complexity of O(n^3), where n is the number of nodes, which makes it
inapplicable to large graphs. However, Personalized PageRank, which is more computationally
efficient, could be used as a heuristic.
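
To give a feel for the cheaper heuristic mentioned above, here is a minimal sketch of Personalized PageRank computed by power iteration. It is not the authors' code: the toy graph, the restart probability `alpha` and the number of iterations are my own assumptions, and it only scores nodes relative to a set of seed nodes (it assumes every node has at least one neighbour). Picking a diverse team from those scores is not shown.

```python
def personalized_pagerank(adj, seeds, alpha=0.15, iters=100):
    """Power iteration for Personalized PageRank.

    adj:   dict mapping each node to a list of its neighbours
           (assumed: every node has at least one neighbour).
    seeds: set of nodes the random walk restarts at.
    alpha: restart probability (illustrative value, not from the paper).
    """
    nodes = list(adj)
    restart = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    rank = dict(restart)
    for _ in range(iters):
        # With probability alpha jump back to a seed...
        nxt = {v: alpha * restart[v] for v in nodes}
        # ...otherwise move to a uniformly chosen neighbour.
        for u in nodes:
            share = (1.0 - alpha) * rank[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        rank = nxt
    return rank


# Hypothetical toy graph: a triangle (a, b, c) with a tail c - d - e.
graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}
scores = personalized_pagerank(graph, seeds={"a"})
ranking = sorted(scores, key=scores.get, reverse=True)
```

Each iteration touches every edge once, so the cost is proportional to the number of edges times the number of iterations, which is what makes this kind of heuristic attractive on large graphs.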

As an example, think of a correlation graph where we have words as nodes, and edges between words
are created by taking the conditional probability of a word appearing
within n words of another one. This can be easily extracted from a text corpus, and pairs like (Rooney, goal)
could have a high correlation score. What we want to do with our algorithm is to discover similarities
between items that go beyond simple correlation, and show characteristics such as exchangeability.
For example a pair (Rooney, Ronaldo) could be a good pair in this sense, as you could replace
Rooney with Ronaldo in a sentence and it should still make sense. The approach we presented is domain
agnostic, and as such is not limited to just text; we applied our algorithm to graphs of music artists and codons
as well. I will soon write up a more extensive summary of our work, including code and examples.
For now enjoy this nice visualization
of word relations and clusters that can be created using our method.
Note: it is best to download it and view it in a PDF viewer that allows a lot of zoom.
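
To make the input concrete, here is a rough sketch of how such a word correlation graph could be built from a tokenized corpus. This only covers the graph construction described above, not our similarity algorithm, and the function name, the window size and the toy sentence are made up for illustration.

```python
from collections import Counter, defaultdict


def cooccurrence_graph(tokens, window=3):
    """Return {a: {b: P(b appears within `window` words of a)}}.

    The probability is estimated as the fraction of occurrences of `a`
    that have at least one `b` within `window` positions.
    """
    occurrences = Counter(tokens)
    near = defaultdict(Counter)
    for i, a in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        # A set, so each occurrence of `a` counts a nearby word only once.
        context = set(tokens[lo:i] + tokens[i + 1:hi])
        context.discard(a)
        for b in context:
            near[a][b] += 1
    return {
        a: {b: count / occurrences[a] for b, count in nbrs.items()}
        for a, nbrs in near.items()
    }


# Hypothetical toy corpus.
tokens = "rooney scores a goal and ronaldo scores a goal".split()
graph = cooccurrence_graph(tokens, window=3)
rooney_to_goal = graph["rooney"]["goal"]   # edge weight P(goal near rooney)
goal_to_rooney = graph["goal"]["rooney"]   # edge weight P(rooney near goal)
```

Note that the resulting graph is directed, since the conditional probability is not symmetric. In this toy corpus (rooney, ronaldo) share the same neighbours without ever co-occurring, which is the kind of relation that plain correlation misses.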

Some impressive work for me from that session was “Efficient Graphlet Counting for Large Networks”.
Graphlets are small, connected, induced (i.e. the edges
in the graphlet correspond to those in the large graph) subgraphs of a large network, and can be used
for things like graph comparison and classification. The method presented here uses already proven
combinatorial arguments to reduce the number of graphlets one has to count for every edge, and
obtains the remaining counts in constant time. In a large study of over 300 networks the algorithm
is shown to be on average 460 times faster than the current state-of-the-art, allowing the largest
graphlet computations to date. I am always happy when I see established results used in a clever
way to solve new problems, especially when the results are so impressive.
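
As a small illustration of the "count a few, derive the rest" idea, here is a sketch for the simplest case of 3-node graphlets around a single edge. It is my own toy example and not the paper's algorithm (which handles larger graphlets): triangles are counted explicitly, and the number of induced 2-edge paths then follows from the node degrees in constant time.

```python
def edge_graphlet_counts(adj, u, v):
    """Count the 3-node graphlets that contain the edge (u, v).

    adj: dict mapping each node to the set of its neighbours
         (undirected graph without self-loops, and v in adj[u]).
    Returns (triangles, paths), where paths are induced 2-edge paths.
    """
    # Counted explicitly: common neighbours close a triangle with (u, v).
    triangles = len(adj[u] & adj[v])
    # Derived in O(1): every other neighbour of u or v that does not
    # close a triangle forms an induced path together with the edge.
    paths = (len(adj[u]) - 1) + (len(adj[v]) - 1) - 2 * triangles
    return triangles, paths


# Hypothetical toy graph: triangle 1-2-3 plus a pendant node 4 attached to 2.
adj = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2}, 4: {2}}
counts_12 = edge_graphlet_counts(adj, 1, 2)
counts_24 = edge_graphlet_counts(adj, 2, 4)
```

For the edge (1, 2) this gives one triangle (with node 3) and one path (1-2-4); for the edge (2, 4) it gives no triangles and two paths (1-2-4 and 3-2-4).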

Keynotes

Robert F. Engle

ICDM featured 3 keynotes this year. The first one was given by Robert F. Engle, winner of the
Nobel Memorial Prize in Economic Sciences in 2003. He presented a summary of some of his seminal
work on ARCH,
and presented some more recent work on financial volatility measurement through the V-lab
project. This keynote was quite math-heavy as a result, and I think many people in the audience did
not find it that interesting/relevant to their work, judging by the proportion of people around me
looking at their laptops.

Michael I. Jordan

The second keynote, and the most interesting for me, was given by M.I. Jordan, with the title “On
Computational Thinking, Inferential Thinking and ‘Big Data’”, a talk he has delivered in a couple
of other venues before, so (some of) the slides are available.
His keynote revolved around some of what he identified as central demands for learning and inference
and the tradeoffs between them; namely error bounds (“inferential quality”),
scaling/runtime/communication, and privacy. He identified the problem of the lack of an interface
between statistical theory and computational theory, which currently have an “oil/water”
relationship. In statistics, more data points are great as they reduce uncertainty, but they can be a cause
of problems in terms of computation, as we usually measure complexity as a function of the number of
data points. The approach suggested is to “treat computation, communication, and privacy as
constraints on statistical risk”.

In terms of privacy he mentioned how our inference problem basically has 3 components: the
population P, which we try to approximate with our sample S, which we then modify
according to our privacy concerns to get our final dataset Q, which we can query.
In dealing with privacy issues he mentioned differential privacy
as a good way to quantify the privacy loss for a query. This should allow us, given some privacy
concerns, to estimate the amount of data we need to achieve the same level of risk in our queries.
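
To illustrate what quantifying the privacy loss of a query can look like in practice, here is a minimal sketch of the Laplace mechanism for a counting query. This was not part of the talk; it is the textbook construction, and the function names and the toy data are my own.

```python
import random


def laplace_noise(scale, rng=random):
    """Draw one sample from a Laplace(0, scale) distribution.

    Uses the fact that the difference of two i.i.d. exponential
    variables with mean `scale` is Laplace-distributed.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=random):
    """Answer "how many records satisfy predicate?" with epsilon-DP.

    Adding or removing one record changes a count by at most 1
    (sensitivity 1), so Laplace noise with scale 1/epsilon is enough.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical toy data: ages of the individuals in the sample.
ages = [23, 35, 41, 52, 67, 29, 48]
noisy_answer = private_count(ages, lambda age: age >= 40, epsilon=0.5)
```

A smaller epsilon means a stronger privacy guarantee but noisier answers. Since the noise scale does not depend on the number of records, its relative effect shrinks as the dataset grows, which is one way to see how a privacy requirement translates into a need for more data.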

For the tradeoff between inferential quality and communication, common in distributed
learning settings, he proposed the use of a channel with certain communication constraints, as a
way to impose bitrate constraints. The proposed solution involves minimax risk with B-bounded
communication, which allows for optimal estimation under a communication constraint (see here
for the NIPS paper on the subject).

The last part of his talk was new (i.e. it is not in the linked slides) and concerned the tradeoff between
inference quality and computation resources. This part focused on efficient distributed bootstrap
processes, with the thesis being that such processes can be used to generate multiple realizations
of samples from the population that allow for the efficient estimation of parameters. The problem
with a frequentist approach in this case is that the communication cost of each resampling can be
prohibitively high for large datasets, e.g. ~632GB for a 1TB dataset, since a bootstrap resample
contains on average about 63.2% of the distinct data points (see here for why).
The proposed solution here is the “Bag of Little Bootstraps”,
in which one bootstraps many small subsets of the data and performs multiple computations on these
small samples. The results from these computations are then averaged to obtain an estimate of the
parameters of the population.
This means that in a distributed setting we would use only small subsets of the data to perform
our computation; in the 1TB example above, the resample size could for example be 4GB instead of
the 632GB required by the bootstrap.
Another interesting point made was that obtaining a confidence interval on a parameter, instead of
a point estimate as is usually done now, can not only be more useful, but could be done more
efficiently as well.
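
Here is a rough single-machine sketch of the two ideas above: where the ~63.2% figure comes from, and how the Bag of Little Bootstraps can estimate the standard error of a mean. It is a toy under my own assumptions (function names, parameter values, the mean as estimator), not the reference implementation, and in a real distributed setting each subset would live on a different worker.

```python
import random
import statistics
from collections import Counter


def expected_distinct_fraction(n):
    """Expected fraction of distinct points in a bootstrap resample of size n.

    Each point is missed with probability (1 - 1/n)^n, which tends to 1/e,
    so about 1 - 1/e = 63.2% of the points show up at least once.
    """
    return 1.0 - (1.0 - 1.0 / n) ** n


def bag_of_little_bootstraps(data, subset_size, n_subsets=5,
                             n_resamples=50, rng=random):
    """Estimate the standard error of the mean of `data` with BLB."""
    n = len(data)
    per_subset = []
    for _ in range(n_subsets):
        # A small subset, drawn without replacement.
        subset = rng.sample(data, subset_size)
        estimates = []
        for _ in range(n_resamples):
            # Resample n points from the subset, but store them as counts
            # per distinct point: only subset_size values are ever held.
            counts = Counter(rng.choices(range(subset_size), k=n))
            estimates.append(sum(subset[i] * c for i, c in counts.items()) / n)
        # Quality assessment (here: the standard error) for this subset.
        per_subset.append(statistics.stdev(estimates))
    # Average the assessments over all subsets.
    return statistics.mean(per_subset)


fraction = expected_distinct_fraction(10**6)

# Hypothetical toy data: 2000 draws from a standard normal.
rng = random.Random(42)
data = [rng.gauss(0.0, 1.0) for _ in range(2000)]
standard_error = bag_of_little_bootstraps(data, subset_size=100, rng=rng)
```

For this toy data the estimate should land close to the theoretical value of 1/sqrt(2000), roughly 0.022, even though every resample only ever touches 100 distinct data points.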

In closing, Jordan noted that there are many remaining conceptual and mathematical challenges in the
problem of ‘Big Data’, and that facing these will require a “rapprochement between computer science and
statistics” which would reshape both disciplines and might take decades to complete.

Lada Adamic

Unfortunately I had to skip Lada Adamic’s keynote, so I would really appreciate it if someone
could send me a summary that I can add here.

Venue/Organisation

The conference organization was mostly smooth and organizers and volunteers deserve a lot of credit for
the way that everything worked out. Sessions generally began and ended on time, the workshops and
tutorials were well organized and useful, and I particularly enjoyed the PhD forum.
One thing that I found unusual was the fact that even though the proceedings were handed out in
digital form (kudos for that), attendees had to choose between the conference and the workshop
proceedings. My guess is this happened because of licensing costs, but it would have been nice to have
access to both.

The conference this year took place at Bally’s casino/hotel in Atlantic City.
It was hard to avoid the grumbling from many of the participants about the choice of venue, especially
when one puts it next to last year’s venue in Shenzhen or next year’s in
Barcelona.

Truth be told, the venue was underwhelming, but I guess it was mostly the choice of Atlantic City
that had people irked; there was very little to do and see in the city unless you wanted to gamble.
Still, I was fortunate to meet a lot of cool people at the conference, so I’m looking forward to
attending next year’s edition in Barcelona!

There was a lot of other great work at the conference as well, but these were the presentations
I found most memorable.
So that’s all for now. If I’ve made a terrible mistake when describing your work, shoot me an email
and I’ll fix it ASAP.