Here are my highlights from NIPS 2011 in Granada, Spain:

How biased are maximum entropy models?

Jakob H. Macke, Iain Murray, Peter E. Latham

They show that some of the common approaches to maximum entropy learning (subject to constraints in the data like moments) can severely under-estimate the entropy of the data. One might naively assume max-ent over-estimates the entropy of the data. Iain calls his paper a "health warning" for methodology he says he sees many neuroscientists use.
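To make the direction of the bias concrete, here is a toy sketch (my own illustration, not the analysis from the paper). It assumes the simplest possible case: a single binary variable, where the max-ent model matching the sample mean is just a Bernoulli with that mean. Because entropy is concave in the mean, fitting to small samples gives a lower entropy on average than the true value. The function names and the choice of p = 0.3 are hypothetical.

```python
import math
import random


def bernoulli_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def mean_fitted_entropy(true_p, n_samples, n_trials, seed=0):
    """Average entropy of max-ent models fit to the mean of small samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        # The max-ent model constrained to match a binary sample mean
        # is Bernoulli(p_hat), so its entropy is bernoulli_entropy(p_hat).
        p_hat = sum(rng.random() < true_p for _ in range(n_samples)) / n_samples
        total += bernoulli_entropy(p_hat)
    return total / n_trials


true_entropy = bernoulli_entropy(0.3)
small_sample = mean_fitted_entropy(0.3, n_samples=20, n_trials=2000)
large_sample = mean_fitted_entropy(0.3, n_samples=2000, n_trials=200)
print(true_entropy, small_sample, large_sample)
```

With only 20 samples per fit the average fitted entropy falls below the true entropy, and the gap shrinks as the sample size grows.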

Unifying Non-Maximum Likelihood Learning Objectives with Minimum KL Contraction

Siwei Lyu

Looked like an interesting paper, but the author was MIA at the poster.

Statistical Tests for Optimization Efficiency

Levi Boyles, Anoop Korattikara, Deva Ramanan, Max Welling

The idea is that in a conjugate gradient (CG) optimization routine for learning parameters you can approximate the derivatives as long as they have the same sign as the true derivatives, i.e. you usually take steps in the right direction. If the objective is of the form

$$J(\theta) + \sum_{i=1}^{N} f(x_i, y_i, \theta)$$

then you can randomly sub-sample the data when computing the objective and use a statistical test to limit the false positive rate, i.e. the probability of taking an optimization step in the wrong direction. It would be interesting to extend this to Gaussian process (GP) hyper-parameter optimization, where the objective contains a sum over all pairs of data points (if you convert the matrix operations to sums).
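A minimal sketch of the sub-sampling idea, under my own assumptions rather than the test used in the paper: a one-dimensional parameter, per-example gradient terms, and a simple z-score check on the sub-sample mean. The helper names, the threshold of 1.645, and the batch-doubling schedule are all hypothetical choices for illustration.

```python
import math
import random
import statistics


def sign_is_reliable(per_example_grads, z_threshold=1.645):
    """Return (sign, reliable) for the mean of per-example gradient terms.

    The sign is treated as reliable when the mean is more than
    z_threshold standard errors away from zero.
    """
    n = len(per_example_grads)
    mean = sum(per_example_grads) / n
    std_err = statistics.stdev(per_example_grads) / math.sqrt(n)
    sign = 1 if mean > 0 else -1
    if std_err == 0.0:
        return sign, mean != 0.0
    return sign, abs(mean) / std_err > z_threshold


def subsampled_sign(grads, batch_size, rng, z_threshold=1.645):
    """Grow a random sub-sample until the gradient sign passes the check."""
    order = list(range(len(grads)))
    rng.shuffle(order)
    size = batch_size
    while True:
        size = min(size, len(grads))
        batch = [grads[i] for i in order[:size]]
        sign, reliable = sign_is_reliable(batch, z_threshold)
        # Stop when the sign is trusted, or when all the data has been used.
        if reliable or size == len(grads):
            return sign, size
        size *= 2


# Toy objective: sum_i 0.5 * (theta - x_i)^2, so each gradient term is theta - x_i.
rng = random.Random(0)
data = [rng.gauss(1.0, 1.0) for _ in range(10000)]
theta = 0.0
grads = [theta - x for x in data]
sign, used = subsampled_sign(grads, batch_size=16, rng=rng)
print(sign, used)
```

When the gradient is far from zero, a small sub-sample is enough to settle the sign; near the optimum the check fails more often and the sub-sample grows toward the full data set.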

Probabilistic amplitude and frequency demodulation

Richard Turner, Maneesh Sahani

Rich extended some of the work on angular distributions with GPs that he gave a research talk on a while back. He provides a fully probabilistic interpretation of frequency analysis methods from signal processing.

A Collaborative Mechanism for Crowdsourcing Prediction Problems

Jacob D. Abernethy, Rafael M. Frongillo

They describe a prediction market mechanism that would more efficiently combine information from participants in an ML competition. Instead of a winner-take-all approach like in the Netflix competition, which ended up being a competition between a few giant ensembles, participants would make bets in a prediction market about how much their contribution would improve the performance if integrated into a prediction system. This alleviates the need for participants to organize themselves into conglomerates, i.e. ensembles. Amos Storkey gave a similar talk at the workshops on using prediction market mechanisms for model combination. I really like this idea and it seems to be gaining some traction.

Variational Gaussian Process Dynamical Systems

Andreas C. Damianou, Michalis Titsias, Neil D. Lawrence

They do nonlinear state space modeling with a Gaussian process time series (GPTS) on the latent states and a GP-LVM-like model on the observations. This is similar to Turner et al. (2009), except there an autoregressive Gaussian process (ARGP) is used on the latent states. However, using a GPTS on the latent states makes it easier to apply variational methods to integrate out the pseudo-inputs. That, combined with some automatic relevance determination (ARD) on the GPTS hyper-parameters, allows them to claim that you need not worry about the right latent dimension or number of pseudo-inputs: just select as large a number as you can handle computationally and the method will automatically ignore the excess dimensions/pseudo-inputs without over-fitting. This means they should be able to plot the number of pseudo-inputs/latent dimensions against performance and see the performance level out for a sufficiently large number, without going down much thereafter. It would be really cool if they could make the plots to illustrate that.
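For readers unfamiliar with ARD, here is a generic sketch of how it lets a model ignore a dimension. This is a standard squared exponential kernel with one length-scale per input dimension, not the authors' code, and the exact parametrization used in the paper is an assumption on my part. A very large length-scale makes the kernel nearly insensitive to that dimension.

```python
import math


def ard_se_kernel(x, y, lengthscales, variance=1.0):
    """Squared exponential kernel with a separate length-scale per dimension."""
    sq_dist = sum(((a - b) / l) ** 2 for a, b, l in zip(x, y, lengthscales))
    return variance * math.exp(-0.5 * sq_dist)


x = [0.0, 0.0]
y = [0.5, 3.0]

# Kernel value using only the first (relevant) dimension.
relevant_only = ard_se_kernel(x[:1], y[:1], [1.0])

# Same points with a second dimension whose length-scale is huge:
# the large difference in that dimension barely changes the kernel value.
with_ignored_dim = ard_se_kernel(x, y, [1.0, 1e6])

print(relevant_only, with_ignored_dim)
```

If learning pushes the length-scales of the excess dimensions to large values, those dimensions drop out of the covariance, which is the mechanism behind the "pick a large number and let the method prune" claim.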

Bernhard Scholkopf gave a keynote talk on some of the work on causal inference he has been doing.

The talk did not seem to distinguish "causal and anti-causal learning" from the generative/discriminative model distinction. He claimed his work on MNIST was anti-causal while his later work on image restoration had been causal. It seems discriminative vs. generative would have been better terms for these approaches, since the data and task contained no interventions and really didn't warrant worrying about causality. Even in the MNIST case it is not clear it was "anti-causal": did the human draw a particular image because of the digit label, or did a human labeler apply a certain label because of the image he found in the data set? If we drop the causal and anti-causal learning terminology, this issue becomes irrelevant.

References:

**State-space inference and learning with Gaussian processes**. In Yee Whye Teh and Mike Titterington, editors, *13th International Conference on Artificial Intelligence and Statistics*, volume 9 of *W&CP*, pages 868-875, Chia Laguna, Sardinia, Italy, May 2010. Journal of Machine Learning Research.