We discuss the differences between Machine Learning, Data Mining, and Statistics; the importance of cost-sensitive classification; social network mining and identifying influencers; mining data streams; and more.

I first met Pedro at KDD-95 in Montreal, the first international conference on
Knowledge Discovery and Data Mining. His modest manner did not conceal his brilliance,
and he impressed me with his rare talent of being able to explain difficult technical issues quite clearly.
I quite enjoyed a dinner conversation we had then in a small Montreal restaurant,
but enjoyed our interaction even more when,
in my capacity as KDD Awards chair, I presented him with the
best paper award for KDD-98 in New York City.
When Pedro won the best paper award again the next year, at KDD-99, I half-jokingly asked him to give others a chance.

Pedro has won numerous awards since then for research in areas spanning Machine Learning, Data Mining, Artificial Intelligence, Statistics, and Databases, and his outstanding contributions were recognized by
the ACM SIGKDD 2014 Innovation Award, the highest award in the field - a Data Mining/Data Science "Nobel Prize".

Many of his award-winning research ideas are implemented in software, which Prof. Domingos has made freely available. These include the VFML toolkit and the Alchemy system, both discussed below.

Gregory Piatetsky: Q1. Congratulations on the KDD 2014 Innovation Award! The award recognizes your work in data mining related areas, but you actually have more publications in Machine Learning and AI journals and conferences. How do you see the relationship between the fields of Machine Learning, Data Mining/Data Science, and Statistics? Is there a significant difference in goals, methodologies, tools?

Pedro Domingos:
My work fits into all three, and I gave up long ago trying to parse
the differences between them. But from a practical point of view these
are different - if overlapping - communities, and it's good to have a
map of the territory. Machine learning, because of its origins in AI,
is more concerned with making the computer as autonomous as possible.

The great thing for me about the data mining/data science community is its diversity - there you'll find machine learners, statisticians, and database, algorithms, and systems folks, all working toward a common goal.

There's a lot of emphasis on scaling up, of course, but also on
dealing with real data in all its messiness. Statistics provides a lot
of the underpinnings of this whole enterprise. As a field, however,
it's also more conservative than computer science. So it's good to
have both a common core and different emphases in these fields.

PD:
The problem MetaCost addresses is that most classifiers assume all
errors are equally costly, but in reality this is seldom the case. Not
deleting a spam email will cost a fraction of a second of your
attention, but
deleting an email from your boss could cost you your job.
Now you could go and redesign every classifier from scratch to be
cost-sensitive, but that would take a long time. Instead, MetaCost is
something you can wrap around a standard classifier to make it
cost-sensitive.

What it does is relabel examples so that the frontier
between the classes becomes the optimal one according to the cost
measure. For example, perhaps an email is labeled as spam in the
training set, but it looks enough like a legitimate one that MetaCost
relabels it as such, just to be safe. Then it just runs the
cost-insensitive classifier to learn the frontiers according to the relabeled examples.
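A minimal Python sketch of that relabeling step may help (illustrative only: the actual MetaCost algorithm also obtains the class-probability estimates itself, by bagging the underlying classifier, before retraining on the relabeled data; the cost matrix and probabilities below are made up):

```python
def expected_cost(prob, cost, pred):
    # Expected cost of predicting class `pred` when the true class follows
    # the distribution `prob`: sum over j of P(j) * C(pred, j).
    return sum(prob[j] * cost[pred][j] for j in range(len(prob)))

def metacost_relabel(probs, cost):
    # Relabel every example with the class that minimizes its expected cost.
    n_classes = len(cost)
    return [min(range(n_classes), key=lambda k: expected_cost(p, cost, k))
            for p in probs]

# Cost matrix, cost[predicted][actual]: deleting legitimate mail (predicting
# spam when the email is real) is 50x worse than letting spam through.
cost = [[0, 1],    # predict "ham"  (0)
        [50, 0]]   # predict "spam" (1)

# Estimated P(ham), P(spam) for two training emails. The first looks 80%
# like spam but is still relabeled "ham" (0), because the risk of deleting
# a legitimate message dominates; the second is relabeled "spam" (1).
probs = [[0.2, 0.8], [0.01, 0.99]]
print(metacost_relabel(probs, cost))  # -> [0, 1]
```

Any off-the-shelf classifier trained on the relabeled examples then behaves cost-sensitively, without the learner itself ever seeing the cost matrix.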

There was a tremendous follow-up to MetaCost, and today cost-sensitive
classification is a large field, with many techniques that are much
more sophisticated than MetaCost. The bottom line is, you want to use
either a natively cost-sensitive learner or an algorithm like
MetaCost, or your system will be making a lot of costly mistakes.

GP: Q3. You were also a pioneer in social network mining
(see the paper Mining the Network Value of Customers, KDD 2001): you defined the influence maximization problem and proposed the first algorithms for it. Tell us about the key ideas in that work.
What do you think of current social influence rankers, like Klout, PeerIndex, and Kred?

PD:
The motivation for the work was that companies spend a lot of effort
trying to predict the value of a customer, because that determines how
much they're willing to spend to acquire and keep him. But they ignore
(or used to) what is often the single biggest component of that value:
the customer's influence on other customers. If I switch to a
different cell phone service provider, that makes my friends and
family more likely to switch as well, particularly if I'm a persuasive
guy and keep telling them how much better my new carrier is.

So the question I asked myself was: can we model the influence among
customers, mine these models from data, and then use the models to
choose the optimal customers to market to, in the sense that they'll
generate the highest expected return for least marketing expenditure?
We measured what we called the network value of a customer, which is
the expected number of other customers that that customer will
directly or indirectly convert, and found that it was often huge. For
example, in the "web of trust" of the Epinions review site, which we
mined, the network value of the top user was over 20,000. That was
pretty exciting! And today's social networks take it to a whole other level.
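To make "network value" concrete, here is a toy Monte Carlo sketch. The graph, names, and propagation probabilities are invented, and the simple cascade simulation below stands in for the richer probabilistic model the paper actually mined from data:

```python
import random

graph = {            # edge u -> v with the probability that u converts v
    "ana": [("bob", 0.9), ("cat", 0.9)],
    "bob": [("dan", 0.9)],
    "cat": [],
    "dan": [],
}

def cascade_size(graph, seed, rng):
    # One simulated cascade starting from `seed`; returns how many *other*
    # customers end up converted, directly or indirectly.
    converted, frontier = {seed}, [seed]
    while frontier:
        u = frontier.pop()
        for v, p in graph.get(u, []):
            if v not in converted and rng.random() < p:
                converted.add(v)
                frontier.append(v)
    return len(converted) - 1

def network_value(graph, seed, trials=20000, rng_seed=0):
    # Expected number of other customers this customer converts.
    rng = random.Random(rng_seed)
    return sum(cascade_size(graph, seed, rng) for _ in range(trials)) / trials

print(round(network_value(graph, "ana"), 1))  # ana converts bob, cat, often dan
print(network_value(graph, "cat"))            # cat influences no one: 0.0
```

Here ana's network value is roughly 0.9 + 0.9 + 0.81 ≈ 2.6, while cat's is zero, so a marketer with one free sample to give away should clearly target ana.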

I think the social influence ranking industry has a great future - the
power of traditional mass marketing and direct marketing is waning,
and increasingly social influence is what determines the fate of your
product. Having said that, the industry is still in its infancy - a
lot of the modeling is still quite ad hoc, as far as I can tell, and
misses a lot of important effects. But the better it gets, the more
widely it will be used, so watch this space.

GP: Q4. Another of your key contributions is in mining data streams, with the most recent result being VFML toolkit - one of the best open-source resources for stream mining. Tell us about key ideas in VFML and examples of applications.

PD:
The goal in VFML is to take any traditional batch learning algorithm -
which has random access to all the data, and typically runs in time
that's quadratic or worse in the size of the data - and easily turn it
into a stream mining algorithm - which has fixed memory, can only see
each sample once, and needs to run in time that's at worst linear in
the size of the data, but ideally independent of it. And we want
statistical guarantees that the model learned by the stream miner is
essentially equivalent to the batch one.

This may seem like a tall order at first, but is in fact quite feasible,
for a simple reason.

If you want to predict who will win the next presidential election, you
don't need to ask every single voter who they'll vote for; a sample of
a few thousand suffices, if you're willing to accept a little bit of
uncertainty.

VFML generalizes this idea to models with many parameters and many structural decisions: Which is the best attribute to test at this node in a decision tree? How many examples do I need to be pretty sure I picked the right one?

Once I do, I can move on and use the
remainder of the stream to pick the tests for that node's children,
etc. And I can aggregate all the uncertainties in the individual
decisions into the overall uncertainty that I've learned the same
model I would if I had used infinite time and data, and use just as
much data as needed to keep this uncertainty within limits set by the user.
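The statistical test at the heart of this can be sketched with the Hoeffding bound, which VFDT (the decision-tree learner in VFML) uses to decide when it has seen enough examples to pick a split; the gains, delta, and example counts below are illustrative:

```python
import math

def hoeffding_bound(value_range, delta, n):
    # With probability 1 - delta, the true mean of a variable with range
    # `value_range` is within this epsilon of its mean over n observations.
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def ready_to_split(gain_best, gain_second, value_range, delta, n):
    # Commit to the best attribute once its observed advantage over the
    # runner-up exceeds epsilon: with probability 1 - delta it would also
    # win given infinite data, so we can move on down the stream.
    return (gain_best - gain_second) > hoeffding_bound(value_range, delta, n)

# Information gain lies in [0, 1] for binary classes. A 0.10 advantage is
# not decisive after 200 examples, but it is after 2000.
print(ready_to_split(0.30, 0.20, 1.0, 1e-6, 200))   # False (epsilon ~ 0.19)
print(ready_to_split(0.30, 0.20, 1.0, 1e-6, 2000))  # True  (epsilon ~ 0.06)
```

Because the bound shrinks as 1/sqrt(n), the learner spends only as many examples on each decision as that decision's difficulty demands, which is what lets the overall model match the batch one within user-set uncertainty limits.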

GP: Q5. You also made key contributions to statistical multi-relational learning, and proposed Markov logic networks as a means to unify first-order logic and probabilistic graphical models. Can you summarize the key ideas and tell us about your open-source system Alchemy: Algorithms for statistical relational AI?

PD:
Most learning algorithms only look at a single relation - a single
table of data - and this is very limiting. Real databases have many
relations, and we want to mine straight from them. This requires a
pretty substantial rethink of how to do statistical learning, because
samples are no longer independent and identically distributed, due to
joins, etc. The solution in Markov logic networks is to use
first-order logic to express patterns among multiple relations - which
it can easily do - but treat logical formulas as features in a Markov
network or log-linear model, which sets up a probability distribution
over relational databases that we can then use to answer questions.
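As a rough sketch of that setup (the predicates, weight, and worlds below are invented, and Alchemy's actual language and learned weights differ), a world's unnormalized log-probability is the sum over formulas of weight times number of true groundings:

```python
import itertools

people = ["anna", "bob"]

# A world is the set of ground atoms that are true in it.
def smokes(world, x):     return ("Smokes", x) in world
def friends(world, x, y): return ("Friends", x, y) in world

def n_true_groundings(world):
    # Groundings of the formula: Friends(x, y) => (Smokes(x) <=> Smokes(y)).
    return sum(
        (not friends(world, x, y)) or (smokes(world, x) == smokes(world, y))
        for x, y in itertools.product(people, repeat=2)
    )

WEIGHT = 1.5   # assumed; Alchemy learns formula weights from data

def log_weight(world):
    # log of the world's unnormalized probability under the log-linear model
    return WEIGHT * n_true_groundings(world)

agree    = {("Friends", "anna", "bob"), ("Smokes", "anna"), ("Smokes", "bob")}
disagree = {("Friends", "anna", "bob"), ("Smokes", "anna")}

# The world where friends agree on smoking gets more probability mass,
# but the disagreeing world is still possible - the formula is soft.
print(log_weight(agree) > log_weight(disagree))  # True
```

Inference then answers queries by summing these weights over the worlds consistent with the evidence, which is what Alchemy's algorithms make tractable.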

Alchemy is an open-source implementation of the learning and inference
algorithms we've developed for Markov logic. The goal is to provide
for machine learning and data mining something akin to what the
relational model and SQL provide for databases.

GP: Q6. You won two best paper awards in a row, at the KDD-98 and KDD-99 conferences, but have had very few publications in KDD or data mining conferences/journals since 2004. What caused the shift?
(I was KDD-98 general chair and KDD-99 Best Paper awards chair, so I actually presented you with the awards, and probably said at KDD-99 something like "Pedro, give others a chance", but I was only joking!)

PD:
I remember that! Made me feel quite guilty. Choosing where to submit
my papers is always tough. I think everything I've published in the
last ten years could have creditably appeared in KDD or another data
mining venue. The reason I've published mostly in machine learning and
AI venues is that Markov logic networks draw heavily on ideas from
those fields, and so the audience in those communities is more primed
to like and follow the work. Also, because of the difficulty of the
issues involved, we had to at first focus on data sets that are fairly
small by KDD standards. But we're now rapidly scaling to larger and
messier data, so expect to see more KDD papers from my group in the
future.

BIO:Pedro Domingos is Professor of Computer Science and Engineering
at the University of Washington. His research interests are in machine
learning, artificial intelligence and data mining. He received a PhD in
Information and Computer Science from the University of California at
Irvine, and is the author or co-author of over 200 technical publications.

He is a member of the editorial board of the Machine Learning journal,
co-founder of the International Machine Learning Society, and past
associate editor of JAIR. He was program co-chair of KDD-2003 and SRL-2009,
and has served on numerous program committees. He is a winner of the
SIGKDD Innovation Award, the highest honor in the data mining field.
He is an AAAI Fellow, and received a Sloan Fellowship, an NSF CAREER Award,
a Fulbright Scholarship, an IBM Faculty Award, and best paper awards at
several leading conferences.