The Grad Factor

Friday, July 27, 2012

How biased are maximum entropy models?
Jakob H. Macke, Iain Murray, Peter E. Latham
They show that some of the common approaches to maximum entropy learning (subject to constraints on the data, such as moments) can severely under-estimate the entropy of the data. One might naively assume max-ent over-estimates the entropy of the data. Iain calls his paper a "health warning" for methodology he says he sees many neuroscientists use.

Unifying Non-Maximum Likelihood Learning Objectives with Minimum KL Contraction
Siwei Lyu
Looked like an interesting paper, but the author was MIA at the poster.

Statistical Tests for Optimization Efficiency
Levi Boyles, Anoop Korattikara, Deva Ramanan, Max Welling
The idea is that in a conjugate gradient (CG) optimization routine for learning parameters you can approximate the derivatives as long as they have the same sign as the true derivatives, i.e. you usually take steps in the right direction. If the objective is of the form J(theta) + sum_{i=1}^N f(x_i, y_i, theta), then you can randomly sub-sample the data when computing the objective and use a statistical test to limit the false positive rate: taking an optimization step in the wrong direction. It would be interesting to extend this to Gaussian process (GP) hyper-parameter optimization, where the objective contains a sum over all pairs of data points (if you convert the matrix operations to sums).
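
A minimal sketch of the flavour of the idea (not the authors' actual test), assuming data is a NumPy array of examples and per_example_grad is a hypothetical function returning one data point's contribution to the derivative: estimate the derivative from a random sub-sample and only take a step when a t-test is confident about its sign.

    import numpy as np
    from scipy import stats

    def confident_step_sign(per_example_grad, data, theta, m=200, alpha=0.05):
        # Estimate the derivative from a random sub-sample of the data and only
        # report a step direction if a t-test is confident about its sign.
        idx = np.random.choice(len(data), size=m, replace=False)
        g = np.array([per_example_grad(x, theta) for x in data[idx]])
        t_stat, p_value = stats.ttest_1samp(g, 0.0)   # H0: the true mean derivative is zero
        if p_value < alpha:
            return -np.sign(g.mean())                  # confident descent direction
        return 0.0                                     # undecided: use a larger sub-sample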

Probabilistic amplitude and frequency demodulation
Richard Turner, Maneesh Sahani
Rich extended some of the work on angular distributions with GPs that he gave a research talk on a while back. He provides a fully probabilistic interpretation of signal processing frequency analysis methods.

A Collaborative Mechanism for Crowdsourcing Prediction Problems
Jacob D. Abernethy, Rafael M. Frongillo
They describe a prediction market mechanism that would more efficiently combine information from participants in an ML competition. Instead of a winner-take-all approach like the Netflix competition, which ended up being a contest between a few giant ensembles, participants would make bets in a prediction market about how much their contribution would improve performance if integrated into a prediction system. This removes the need for participants to organize themselves into conglomerates, i.e. ensembles. Amos Storkey gave a similar talk at the workshops on using prediction market mechanisms for model combination. I really like this idea and it seems to be gaining some traction.

Variational Gaussian Process Dynamical Systems
Andreas C. Damianou, Michalis Titsias, Neil D. Lawrence
They do nonlinear state space modeling with a Gaussian process time series (GPTS) on the latent states and a GP-LVM-like model on the observations. This is similar to Turner et al. (2009), except there an autoregressive Gaussian process (ARGP) is used on the latent states. However, using a GPTS on the latent states makes it easier to apply variational methods to integrate out the pseudo-inputs. That, combined with automatic relevance determination (ARD) on the GPTS hyper-parameters, allows them to claim that you need not worry about picking the right latent dimension or number of pseudo-inputs: just select as large a number as you can handle computationally and the method will automatically ignore the excess dimensions/pseudo-inputs without over-fitting. This means they should be able to plot performance against the number of pseudo-inputs/latent dimensions and see it level off for sufficiently large values rather than degrade thereafter. It would be really cool if they could make the plots to illustrate that.
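
A toy illustration of the ARD behaviour described above (not the paper's variational model), assuming scikit-learn is available: fit a GP with an anisotropic RBF kernel to data where only the first input dimension matters, and check that the learned length-scale for the irrelevant dimension blows up, i.e. that dimension is effectively switched off.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(80, 2))          # the second input dimension is irrelevant
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(80)

    # One length-scale per input dimension is ARD: a huge learned length-scale
    # means that dimension is effectively ignored by the model.
    kernel = 1.0 * RBF(length_scale=[1.0, 1.0])
    gp = GaussianProcessRegressor(kernel=kernel, alpha=1e-2).fit(X, y)
    print(gp.kernel_.k2.length_scale)             # modest for dim 0, large for dim 1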

Bernhard Schölkopf gave a keynote talk on some of the work on causal inference he has been doing. The talk did not seem to separate the generative/discriminative model distinction from "causal and anti-causal learning". He claimed his work on MNIST was anti-causal while his later work on image restoration had been causal. It seems discriminative vs. generative would have been better terms for the approaches where the data and task contained no interventions and really didn't warrant worrying about causality. Even in the MNIST case it is not clear it was "anti-causal": did the human draw a particular image because of the digit label, or did a human labeler apply a certain label because of the image he found in the data set? If we drop the causal and anti-causal learning terminology, this issue becomes irrelevant.

Monday, December 27, 2010

Undirected Grad has a blog post about symbolic regression using a new piece of software called Eureka. I have tried it out and it is pretty effective at uncovering the latent function in the synthetic experiments I tried, such as y = logistic(x^2 + sin(x)).

It was the last year in Vancouver/Whistler, so luckily the snow conditions were good ;)

On the technical side:

Switched Latent Force Models for Movement Segmentation
Mauricio Alvarez, Jan Peters, Bernhard Schoelkopf, Neil Lawrence
They modeled an input/output system governed by a linear differential equation where the input was distributed according to a switching GP. They took advantage of the fact that the derivative of a function drawn from a GP is also GP distributed, as well as linearity properties, so the output of the system was also distributed according to a switching GP model. They used the model to segment human motion. I liked it since it is closely related to my ICML paper on GP change point models. They claimed the advantage of their method is that it enforces continuity in the time series across segment switches. Although this can easily be done in my setup, I am glad my paper got a citation ;)
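
As a reminder of the standard GP fact they exploit (textbook material, not specific to this paper): if f ~ GP(0, k), then f and its derivative f' are jointly Gaussian, with cov(f(x), f'(x')) = dk(x, x')/dx' and cov(f'(x), f'(x')) = d^2 k(x, x')/(dx dx'), so applying a linear differential operator to a GP yields another GP.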

Global seismic monitoring as probabilistic inference
Nimar Arora, Stuart Russell, Paul Kidwell, Erik Sudderth
They used graphical models to infer whether signals at seismic sensors are noise (from local events near a sensor) or come from genuine events such as earthquakes and nuclear tests, which should be picked up by multiple seismic sensors.

A Bayesian Approach to Concept Drift
Stephen Bach, Mark Maloof
This paper is also similar to the Adams & MacKay change point framework. They replaced the base model (UPM) with a discriminative classifier (such as Bayesian logistic regression). They admitted to fitting some of the hyper-parameters on the test set, which is cheating. However, they tried to justify it by saying that it is inappropriate to try to learn the frequency of concept drifts (change points) from training data. I don't think the argument is coherent.

Predicting Execution Time of Computer Programs Using Sparse Polynomial Regression
Ling Huang, Jinzhu Jia, Bin Yu, Byung-Gon Chun, Petros Maniatis, Mayur Naik
They did an analysis of programs to predict their execution time. The novelty of the paper is that they created features by "splicing" the program: they found small snippets of the program that could be executed quickly, and used the outputs of these snippets as features for LASSO regression with polynomial basis functions. Polynomial basis functions are sensible since the run-time of a program is usually approximately linear, quadratic, or cubic in some aspect of its input. I pointed them to Zoubin's polybayes.m demo as a way of selecting the order of a polynomial from data. Symbolic regression using Eureka might also be illuminating.
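
A minimal sketch of that style of model (not their system), assuming scikit-learn and a made-up feature matrix standing in for the snippet outputs: expand the features with polynomial terms and let the L1 penalty pick a sparse subset.

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LassoCV

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(200, 5))          # hypothetical snippet-output features
    y = 2.0 * X[:, 0] + 0.3 * X[:, 1] ** 2 + rng.standard_normal(200)

    # Degree-3 polynomial expansion of the features, then LASSO with a cross-validated
    # penalty; most of the expanded coefficients should come out exactly zero.
    model = make_pipeline(PolynomialFeatures(degree=3), LassoCV(cv=5))
    model.fit(X, y)
    print(np.count_nonzero(model.named_steps["lassocv"].coef_), "non-zero terms")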

Slice sampling covariance hyperparameters of latent Gaussian models
Iain Murray, Ryan Adams
Iain presented some tricks for transforming the sample space in GP classification to drastically improve the convergence of sampling GP hyper-parameters. Iain is a fan of re-parameterizing models into spaces that make sampling easier. He claims the naive sampling method gets stuck at an "entropic barrier". He says this is a third, often ignored, but common failure mode of MC methods; the other two are the sampling method getting stuck in one mode of the posterior and dimensions that are highly correlated.
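
For reference, a minimal sketch of plain one-dimensional slice sampling with step-out (the generic algorithm, not the paper's reparameterization), assuming log_post is a hypothetical unnormalized log-posterior over a single hyper-parameter:

    import numpy as np

    def slice_sample(log_post, x0, n_samples, w=1.0, seed=0):
        # Plain 1-D slice sampling with step-out for a scalar parameter.
        rng = np.random.default_rng(seed)
        samples, x = [], x0
        for _ in range(n_samples):
            log_y = log_post(x) + np.log(rng.uniform())    # height of the slice
            left = x - w * rng.uniform()                   # random bracket of width w around x
            right = left + w
            while log_post(left) > log_y:                  # step out until the bracket
                left -= w                                  # covers the slice
            while log_post(right) > log_y:
                right += w
            while True:                                    # sample within the bracket,
                x_new = rng.uniform(left, right)           # shrinking it on rejections
                if log_post(x_new) > log_y:
                    x = x_new
                    break
                if x_new < x:
                    left = x_new
                else:
                    right = x_new
            samples.append(x)
        return np.array(samples)

    # e.g. sample a log length-scale whose (hypothetical) posterior is standard normal:
    draws = slice_sample(lambda z: -0.5 * z ** 2, x0=0.0, n_samples=1000)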

Heavy-Tailed Process Priors for Selective Shrinkage
Fabian Wauthier, Michael Jordan
Fabian did GP classification while applying heavy-tailed noise to the latent GP before squashing the function through a sigmoid/probit. They claim GPC often gives over-confident predictions in sparsely sampled areas of the input space, and that this method alleviates the problem. Since the problem does not occur in synthetic data, I asked him which underlying model assumption he thought was being violated. He believes the root cause is that the stationarity assumption in most GP kernels is inappropriate in many cases.

Copula Processes
Andrew Wilson, Zoubin Ghahramani
It was nice to see that Andrew attracted quite a crowd at his poster.

At the workshops I liked:

Natively probabilistic computation: principles and applications
Vikash Mansinghka, Navia Systems
Vikash argued that his accelerated hardware could do millions of samples per second when Gibbs sampling an MRF (a 1000x improvement). The hardware restricted the flexibility of what kind of sampling you could do, but the loss from giving up that flexibility was compensated for many times over by the hardware acceleration. He argues that maybe the best approach is to use simple samplers on his accelerated hardware rather than sophisticated samplers in software.
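
For context, a minimal sketch (in software, obviously) of the kind of computation being accelerated: one sweep of Gibbs sampling for an Ising-style binary MRF on a grid, assuming a single coupling strength beta as the only parameter.

    import numpy as np

    def gibbs_sweep(spins, beta, rng):
        # One in-place Gibbs sweep over a {-1, +1} grid MRF with 4-neighbour coupling beta.
        n, m = spins.shape
        for i in range(n):
            for j in range(m):
                nb = (spins[(i - 1) % n, j] + spins[(i + 1) % n, j] +
                      spins[i, (j - 1) % m] + spins[i, (j + 1) % m])
                p_up = 1.0 / (1.0 + np.exp(-2.0 * beta * nb))  # P(spin_ij = +1 | neighbours)
                spins[i, j] = 1 if rng.uniform() < p_up else -1
        return spins

    rng = np.random.default_rng(0)
    spins = rng.choice([-1, 1], size=(64, 64))
    for _ in range(100):
        gibbs_sweep(spins, beta=0.4, rng=rng)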

There was talk about the prospect of moving to analog computation for sampling. A lot of energy is used in CPUs to make them completely deterministic with digital computation, but then in MC methods we artificially introduce randomness. Maybe it is better to do MC computations with analog hardware. However, Vikash said that we must limit the analog computation to very small accelerated units within a digital processor in order for it to be manageable. The analog element would require custom ICs, which requires more funding than he currently has. However, he has selectively reduced the bit precision of many of his computations, which he says can be done when the quantities are random. This saves chip real-estate and power.

Monday, September 20, 2010

To my knowledge this was the first machine learning conference to take place within the Arctic Circle (~68 N). The conference took place at the top of the gondola. The key highlight of the conference was the summer bobsled track from the conference center to the village. The food was mostly reindeer (in various forms) and berries ;)

On the technical side:

Kalman Filtering and Smoothing Solutions to Temporal Gaussian Process Regression Models
Simo Sarkka had a poster where he converted (almost arbitrary) stationary GP time series models into state space models. He then used the Kalman filter to do O(T) predictions, as opposed to O(T^3) for general GPs, or O(T^2) / O(T log T) with Toeplitz tricks if the time series is in discrete time. Simo's method works in continuous time as well.
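
A minimal sketch of the idea for the simplest case (not Sarkka's general construction): a GP with the Matern-1/2 / Ornstein-Uhlenbeck kernel k(t, t') = s^2 exp(-|t - t'| / l) is exactly a one-dimensional linear-Gaussian state space model, so the Kalman filter gives the GP's filtered predictions in O(T).

    import numpy as np

    def ou_kalman_filter(t, y, lengthscale, signal_var, noise_var):
        # O(T) filtering for a GP with kernel s^2 * exp(-|t - t'| / l) plus iid observation noise.
        m, P = 0.0, signal_var                 # stationary prior on the latent state
        means, variances = [], []
        for k in range(len(t)):
            if k > 0:                          # predict: exact OU transition between time stamps
                a = np.exp(-(t[k] - t[k - 1]) / lengthscale)
                m = a * m
                P = a * a * P + signal_var * (1.0 - a * a)
            S = P + noise_var                  # update with observation y[k]
            K = P / S
            m = m + K * (y[k] - m)
            P = (1.0 - K) * P
            means.append(m)
            variances.append(P)
        return np.array(means), np.array(variances)

    t = np.linspace(0, 10, 500)
    y = np.sin(t) + 0.3 * np.random.default_rng(0).standard_normal(t.size)
    mu, var = ou_kalman_filter(t, y, lengthscale=1.0, signal_var=1.0, noise_var=0.09)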

Recent directions in nonparametric Bayesian machine learning: Zoubin gave a lecture where he made an unapologetic advertisement for NP-Bayes.

Here are my highlights from CBMS: the non-parametric Bayes conference at UC Santa Cruz. It was organized more like a summer school, however.

The conference was dominated by Peter Muller, who gave 10 1.5-hour lectures on non-parametric Bayes. He talked mainly of Dirichlet processes and the generalizations of them: Pitman-Yor, Polya trees, etc. He presented a "graphical model of graphical models" demonstrating the connections between the related models. He went through each model and compared them by their predictive probability function (PPF), which is the one-step-ahead predictive distribution for the model. Notably absent from his unifying view was Gaussian processes.
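
For the Dirichlet process, the PPF is the Chinese restaurant process: the next point joins an existing cluster with probability proportional to its size, or starts a new one with probability proportional to the concentration alpha. A minimal sketch of sampling cluster assignments from it:

    import numpy as np

    def crp_assignments(n, alpha, seed=0):
        # Chinese restaurant process: the DP's one-step-ahead predictive over cluster labels.
        rng = np.random.default_rng(seed)
        counts, assignments = [], []
        for i in range(n):
            probs = np.array(counts + [alpha], dtype=float)
            probs /= i + alpha                       # P(existing table k) = n_k / (i + alpha)
            table = rng.choice(len(probs), p=probs)  # P(new table) = alpha / (i + alpha)
            if table == len(counts):
                counts.append(1)
            else:
                counts[table] += 1
            assignments.append(int(table))
        return assignments

    print(crp_assignments(20, alpha=1.0))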

Michael Jordan gave one lecture where he went through various NP-Bayes models he has worked with: LDA, IBPs, sticky HMMs, ... He didn't get too technical, but tried to give a high-level view of many models motivated by applications such as speaker diarization.

Wes Johnson gave one lecture giving examples of NP Bayes in biology.

Finally, Peter Hoff gave one lecture, "Alternative approaches to Bayesian nonparametrics". He gave some examples of how doing Bayesian inference with an unknown Gaussian can have a better predictive probability than using a DP-mixture when N is small. Sample sizes of N > 100 were referred to as "large" and N > 5000 as "huge".

Sunday, July 4, 2010

I recently submitted a paper to the Interdisciplinary Graduate Conference (IGC) 2010. I prepared a well-formatted 8-page LaTeX document. However, the conference was organized by humanities students who had never heard of LaTeX. They wanted a .doc file. I then had to go through the painful process of converting my LaTeX document into a Word one. After that painful experience, I could not resist writing a rant on the process. I have experienced both forms of writing: I used Word for every (lab) report in my undergrad. Once in my PhD program I was converted to LaTeX and have not looked back since.

Speed: Anything that requires a mouse and clicking through menus will be slower than something you can write out in a few keystrokes. This means that writing characters with accents, special symbols, and especially equations will be much faster in LaTeX.

MS Word (and, even worse, OpenOffice) can get sluggish when editing large documents with lots of equations and figures.

Security: stored in plain text

MS Word stores its files in a bloated binary form. If a file gets corrupted for whatever reason you could be locked out of your file and many hours of work. Likewise if some bug in MS Word is causing it to crash when opening your file. With plain text source files, if all else fails you can always open and edit the file in a simple text editor.

Separation of content and formatting

The plain-text style of LaTeX, using \section and \subsection etc., allows the writer to think about the logical flow of the document without worrying about superficial details such as font sizes and styles.

You know all your section and subsection headings are in the correct font size. This is much harder to check in MS Word.

Integrates well with SVN

Being plain text it is easier for SVN (and other revision control systems) to merge files being edited

It also takes up less space on the server storing revisions

It is also possible to use any diff tool to compare revisions

You also know the diff tool will show you ALL changes in the state of the file. There is no such guarantee when using features such as track changes in MS Word.

Control

MS Word often tries to outsmart you. It will automatically capitalize, automatically try to select whole sentences, automatically insert bullet points, and try to infer when you’re done with a sub/superscript when writing equations. Software that tries to out-smart you will often out-dumb you. It tries too hard to infer what you want and often gets it wrong.

A good example is the use of - vs. -- vs. ---. Microsoft assumes most users are not smart enough to infer which type of dash to use in which situation, so MS Word tries to figure it out automatically. It can be really annoying when it gets it wrong. With LaTeX you just write what you want in one to three keystrokes!

There are also the quote directions `` ''. In LaTeX they are specified manually, while in Word they are automatic, which can be annoying when Word infers them wrong.
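
For example, the standard LaTeX input conventions:

    X-ray telescope      % -   is just a hyphen
    pages 10--20         % --  gives an en dash for ranges
    wait---what?         % --- gives a long dash for a break in thought
    ``quoted text''      % `` and '' give correctly directed quotes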

Notion of state: every change in the file is visible. Nothing is hidden from you in a plain text source file.

There is no hidden meta information

Flexibility

It is easy to search for $x^2$ or \footnote in LaTeX; there is no easy way to do the analogous searches using CTRL-F in MS Word.

You can also search the file using regular expressions

Macros for a more semantic representation can be written in just one line with a few keystrokes

For example, I often define \field{} using \mathcal{} and then \R using \field{R}
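
Concretely, something along these lines in the preamble (a sketch of how such macros might be defined):

    \newcommand{\field}[1]{\mathcal{#1}}   % semantic wrapper for fields
    \newcommand{\R}{\field{R}}             % the reals: just type \R in equations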

In MS Word macros are often full blown VB scripts

They should be disabled anyway since they are a security risk

More easily scriptable

For instance, I have MATLAB code that exports a matrix of results to a LaTeX table

It would require a full-blown C++/VB program in Visual Studio using all sorts of crazy APIs (and therefore reading tons of documentation) to do the same thing in MS Word
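
The same sort of script is a few lines in any language that can write plain text; a rough Python equivalent of what that MATLAB script does (hypothetical row and column names):

    import numpy as np

    def matrix_to_latex(results, row_names, col_names):
        # Emit a simple LaTeX tabular from a matrix of results.
        lines = [r"\begin{tabular}{l" + "c" * results.shape[1] + "}",
                 " & " + " & ".join(col_names) + r" \\ \hline"]
        for name, row in zip(row_names, results):
            lines.append(name + " & " + " & ".join(f"{v:.3f}" for v in row) + r" \\")
        lines.append(r"\end{tabular}")
        return "\n".join(lines)

    print(matrix_to_latex(np.random.rand(2, 3),
                          ["method A", "method B"], ["RMSE", "MAE", "time"]))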

Speed-quality trade-off in formatting

Because MS Word has to reformat the document on every keystroke, it has to use inferior typesetting methods to prevent the GUI from becoming glacially slow

Since you only recompile after significant changes in LaTeX, it can afford to use more expensive type-setting algorithms (especially for equations) that might take 10 seconds to run.

Interoperable

Since things like BibTeX are also plain text, it is much easier for third parties to create applications such as JabRef. You don't get stuck using one particular reference manager. If you don't like one there are others to use instead. And if all else fails you can always edit it in notepad.

Therefore, there is no vendor lock-in: you can't get stuck, for backward-compatibility reasons, using one piece of software that may in the future become inferior to the alternatives.

Footnotes are a pain in Word, especially if you have multiple footnotes on the same page

Equation numbering is a pain in Word

It is possible to embed PDF figures in LaTeX, which allows for vector graphics and avoids file bloat

However, I believe new versions of Word allow for the insertion of EPS figures

Cost

LaTeX is free while MS Office can cost a few hundred dollars

There is OpenOffice, but that is even worse

I am not a "free-tard" so this is not my top concern

Advantages of MS Word:

MS Word has a grammar checker. To my knowledge, none of the LaTeX editors have a grammar checker.

Of course, the grammar checker should usually be taken with a grain of salt. However, it is good at catching typos such as interchanging it/is/if and they/then, which a spell checker will not find and which are easy to miss when proofreading.

The equation editor is better than it used to be. Writing an equation-heavy document in Word used to be almost impossible. It is now doable, but still much slower than in LaTeX.