3
Feature Selection?
NOT generative modeling!
– no assumptions about the source of the data
Extracting relevant structure from data
– functions of the data (statistics) that preserve information
Information about what?
Approximate Sufficient Statistics
Need a principle that is both general and precise.
– Good principles survive longer!

8
Mutual information
How much does X tell us about Y?
I(X;Y): a function of the joint probability distribution p(x,y)
– the minimal number of yes/no questions (bits) we need to ask about x in order to learn all we can about Y.
Uncertainty removed about X when we know Y:
I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)
[Figure: diagram with regions H(X|Y), I(X;Y), H(Y|X)]
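The identity on this slide can be checked numerically. The sketch below is illustrative only: the function names and the 2x2 joint table are assumptions, not part of the talk. It computes I(X;Y) as H(X) + H(Y) - H(X,Y), which equals H(X) - H(X|Y).

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits of a list of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X;Y) in bits for a table joint[x][y] = p(x, y)."""
    px = [sum(row) for row in joint]          # marginal p(x)
    py = [sum(col) for col in zip(*joint)]    # marginal p(y)
    pxy = [p for row in joint for p in row]   # flattened joint
    # I(X;Y) = H(X) + H(Y) - H(X,Y) = H(X) - H(X|Y)
    return entropy(px) + entropy(py) - entropy(pxy)

# Hypothetical example: binary X and Y that agree 90% of the time.
joint = [[0.45, 0.05],
         [0.05, 0.45]]
print(round(mutual_information(joint), 3))  # about 0.531 bits
```

For independent variables the same function returns 0, and for a noiseless binary copy it returns 1 bit.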

9
Relevant Coding
What are the questions that we need to ask about X in order to learn about Y?
Need to partition X into relevant domains, or clusters, between which we really need to distinguish...
[Figure: X partitioned into regions X|y1 and X|y2, with conditional distributions P(x|y1) and P(x|y2)]
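A small numerical sketch of what a "relevant" partition means. The joint table and the helper names are hypothetical, chosen for illustration. Merging x values that share the same p(y|x) keeps all the information about Y, while a partition that mixes them can lose it entirely.

```python
from math import log2

def mutual_information(joint):
    """I(A;B) in bits for a table joint[a][b] = p(a, b)."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    return sum(p * log2(p / (pa[i] * pb[j]))
               for i, row in enumerate(joint)
               for j, p in enumerate(row) if p > 0)

def merge_rows(joint, clusters):
    """Joint p(t, y) after mapping each x to the cluster t that contains it."""
    n_y = len(joint[0])
    return [[sum(joint[x][y] for x in cluster) for y in range(n_y)]
            for cluster in clusters]

# Hypothetical joint p(x, y): four x values, two y values.
# x0 and x1 share p(y|x) = (0.9, 0.1); x2 and x3 share p(y|x) = (0.2, 0.8).
joint = [[0.18, 0.02],
         [0.27, 0.03],
         [0.04, 0.16],
         [0.06, 0.24]]

relevant = merge_rows(joint, [[0, 1], [2, 3]])    # groups x by p(y|x)
irrelevant = merge_rows(joint, [[0, 2], [1, 3]])  # mixes the two kinds of x

print(mutual_information(joint))       # information in the full X
print(mutual_information(relevant))    # same value: nothing about Y is lost
print(mutual_information(irrelevant))  # essentially zero
```

Both partitions compress X from four values to two, so the cost in questions about X is the same; only the first one keeps the distinctions that matter for Y.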

19
The information plane
The optimal [...] for a given [...] is a concave function.
[Figure: curve separating the possible phase from the impossible one]

20
Regression as relevant encoding
Extracting relevant information is fundamental for many problems in learning:
Regression: knowing the parametric class, we can calculate p(X,Y) without sampling!
Gaussian noise
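One way to read "without sampling", written out as a sketch. The symbols f, θ, σ and the input density p(x) are assumptions introduced here for illustration, and the noise is assumed additive and independent of x; the slide's own formula is not preserved.

```latex
y = f(x;\theta) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0,\sigma^2)
\quad\Longrightarrow\quad
p(x,y) = p(x)\,\frac{1}{\sqrt{2\pi\sigma^2}}
         \exp\!\left(-\frac{\bigl(y - f(x;\theta)\bigr)^2}{2\sigma^2}\right)
```

Under these assumptions the joint distribution follows in closed form from the parametric class and the noise model, so no sample from p(X,Y) is needed to write it down.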

21
Manifold of relevance
The self-consistent equations: [...]
Assuming a continuous manifold for [...]
Coupled (local in [...]) eigenfunction equations, with β as an eigenvalue.

22
Generalization as relevant encoding
The two-sample problem: the probability that two samples come from one source.
Knowing the function class, we can estimate p(X,Y).
Convergence depends on the class complexity.

27
Sufficient Dimensionality Reduction (with Amir Globerson)
Exponential families have sufficient statistics.
Given a joint distribution, find an approximation of the exponential form: [...]
This can be done by alternating maximization of entropy under the constraints: [...]
The resulting functions are our relevant features at rank d.

28
Conclusions
There may be a single principle behind...
– Noise filtering
– time series prediction
– categorization and classification
– feature extraction
– supervised and unsupervised learning
– visual and auditory segmentation
– clustering
– self-organized representation
– ...