The
objective of this tutorial is to provide a gentle introduction to basic
information-theoretic concepts and to demonstrate how those concepts can be
applied in the context of machine learning. Information theory was originally
developed to describe engineered communication systems. Applying these ideas
in new contexts introduces several challenges. We will discuss some of the
main problems and potential solutions: picking the right measures, estimating
information quantities from limited data, and interpreting results.

We will
consider basic and ubiquitous quantities like mutual information (which is
nevertheless fraught with pitfalls in estimation and interpretation). We will
also explore the more exciting possibility of using information-theoretic
ideas as a principled theoretical foundation for machine learning. In this
vein we will consider different ways of decomposing information and notable
ideas such as InfoMax, ICA, and the information
bottleneck.

Background

The emergence of Information Theory as
a scientific discipline is commonly attributed to a 1948 landmark
paper by Claude Shannon where he laid down the basic principles of
data transmission through a noisy communication channel. In particular, Shannon's theory
tells us that the amount of information we can send through the noisy channel
is related to a quantity called "mutual information". Mutual
Information between two random variables (e.g., transmitted and received
messages) measures the average reduction in the uncertainty in one variable,
if we know the value of the other variable. This concept is illustrated using
the Venn diagram below: Here the yellow and light blue areas denote the uncertainty
in variables X and Y, respectively. Those uncertainties are quantified by the
corresponding entropies H(X) and H(Y). The mutual information then
corresponds to the area of the intersection. The noisy channel is a powerful
framework that has been found numerous applications in speech recognition,
machine translation, text summarization, and so on.

What does this have to do with
influence, human speech, or social media? This abstract framework is
remarkably flexible. What if the input is some statement made or tweeted by
Alice? Then the “noisy channel” consists of (e.g.) sound waves,
the ear drum, and the brain of Bob. Now Bob “outputs” some statement
and we can ask what the information capacity is of the link between Alice and
Bob.

More generally, in recent years information-theoretic concepts have been used
successfully to characterize processes in dynamic social networks and social
media. For instance, Ghosh et. al. used information-theoretic approach to
classification of user activity on Twitter [4]. In particular, they traced
the user activity connected with particular URL, and identified two features,
time-interval entropy, and user entropy. Using just these two features they
were able to categorize content based on the collective user response it
generates. Vet Steeg and Galstyan proposed to use predictability as a measure of influence between two social media users [1,2]. In particular, they introduced content transfer, an information-theoretic measure with a predictive interpretation that directly quantifies the strength of the effect of one
user's generated content on another's in a completely model-free way. Their xperiments with Twitter data showed that content transfer is able to capture non-trivial, predictive relationships even for pairs of users not linked in the follower or mention graph.

Scope of the
tutorial

We will begin with a survey on topics
such as random variables, entropy, mutual information, and conditional mutual
information, focusing on developing a deeper intuition for what these
quantities represent. After demonstrating common pitfalls, we will demonstrate
practical, state of the art methods for estimating entropic measures from
limited data samples. We will discuss various famous approaches to
information-theoretic learning like Infomax, ICA,
sparse coding, the information bottleneck, and CorEx.
Finally, we will show how these tools can be fruitfully applied to real-world
machine learning problems in complex systems like social media, finance,
psychometrics, biology, and more. Possible examples include discovering
meaningful relationships from social signals using transfer entropy [1, 2],
use of entropic measures for classifying temporal activity patterns of users
in social media, characterizing randomness in social interactions on Twitter
[5], and information-theoretic methods for community detection in social
networks [3].