Friday, July 27, 2012

NIPS 2011

Here are my high lights from NIPS 2011 in Granada, Spain:

How biased are maximum entropy models? 
Jakob H. Macke, Iain Murray, Peter E. Latham 
They show that some of the common approaches to maximum entropy learning (subject to constraints in the data like moments) can severely under-estimate the entropy of the data.  One might naively assume max-ent over-estimates the entropy of the data. Iain calls his paper a "health warning" for methodology he says he sees many neuroscientists use. 

Unifying Non-Maximum Likelihood Learning Objectives with Minimum KL Contraction 
Siwei Lyu 
Looked like an interesting paper but the author was MIA at the poster 

Statistical Tests for Optimization Efficiency 
Levi Boyles, Anoop Korattikara, Deva Ramanan, Max Welling /0.5
The idea is that in a conjugate gradient (CG) optimization routine for learning parameters you can approximate the derivatives as long as they have the same sign as the true derivatives, i.e. you usually take steps in the right direction.  If the objective is of the form 
J(theta) + sum_i=1^N f(x_i,y_i,theta) 
then you can randomly sub-sample the data when computing the objective and use a statistical test to limit the false positive rate: taking an optimization step in the wrong direction.  It would be interesting to extend this to Gaussian process (GP) hyper-parameter optimization where the objective contains a sum over all pairs of data points (if you convert the matrix operations to sums). 

Probabilistic amplitude and frequency demodulation 
Richard Turner, Maneesh Sahani 
Rich extended some of the work on angular distributions with GPs he gave a research talk on a while back.  He provides a fully probabilistic interpretation to signal processing frequency analysis methods. 

A Collaborative Mechanism for Crowdsourcing Prediction Problems 
Jacob D. Abernethy, Rafael M. Frongillo 
They describe a prediction market mechanism that would more efficiently combine information from participants in an ML competition.  Instead of a winner take all approach like in the NetFlix competition, which ended up being a competition between a few giant ensembles, participants would make bets in a prediction market about how much their contribution would improve the performance if integrated into a prediction system.  This alleviates the need for participants to organize themselves into conglomerates, i.e. ensembles.  Amos Storkey gave a similar talk at the workshops on using prediction market mechanisms for model combination.  I really like this idea and it seems to be gaining some traction. 

Variational Gaussian Process Dynamical Systems 
Andreas C. Damianou, Michalis Titsias, Neil D. Lawrence 
They do nonlinear state space modeling with a Gaussian process time series (GPTS) on the latent states and a GP-LVM like model on the observations. This is similar to Turner et. al. (2009) except there an autoregressive Gaussian process (ARGP) is used on the latent states.   However, using a GPTS on the latent states makes it easier to apply variational methods to integrate out the pseudo inputs.  That combined with some automatic relevance determination (ARD) on the GPTS hyper-parameters, allows them to claim that you need not bother worrying about the right latent dimension or number of pseudo-inputs: Just select as large of number as you can handle computationally and the method will automatically ignore the excess dimensions/pseudo-inputs without over-fitting.  This means they should be able to make a plot of pseudo-inputs/latent dimensions against performance and see the performance level out for a sufficiently large number of pseudo-inputs/latent dimensions and not go down much thereafter. It would be really cool if they could make the plots to illustrate that. 

Bernhard Scholkopf gave a key note talk on some of the work on causal inference he has been doing. 
The talk did not seem to distinguish the generative/discriminative model distinction with "causal and anti-causal learning".  He claimed his work on MNIST was anti-causal while his later work on image restoration had been causal.  It seems discriminative vs generative would have been better terms to apply to the approaches where the data and task contained no interventions and really didn't warrant worrying about causality.  Even in the MNIST case it is not clear it was "anti-causal": did the human draw a particular image because of the digit label, or did a human labeler apply a certain label because of the image he found in the data set?  If we drop the causal and anti-causal learning terminology, this issue becomes irrelevant. 


Ryan Turner, Marc Peter Deisenroth, and Carl Edward Rasmussen. State-space inference and learning with Gaussian processes. In Yee Whye Teh and Mike Titterington, editors, 13th International Conference on Artificial Intelligence and Statistics, volume 9 of W&CP, pages 868-875, Chia Laguna, Sardinia, Italy, May 2010. Journal of Machine Learning Research.

Monday, December 27, 2010


Undirected Grad has a blog post about symbolic regression using a new peice of software called Eureka. I have tried it out and it is pretty effective at uncovering the latent function from the synthetic experiments a tried, such as y = logistic(x^2 + sin(x)).

NIPS 2010 highlights

It was the last year in Vancouver/Whistler. So, luckily the snow conditions were good ;)

On the technical side:

Switched Latent Force Models for Movement Segmentation
Mauricio Alvarez, Jan Peters, Bernhard Schoelkopf, Neil Lawrence
They modeled an input/output system governed by a linear differential equation where the input was distributed according a switching GP. They took advantage of the fact that the derivative of a function from a GP is also GP distributed, as well as linearity properties. Therefore, the output of the system was also distributed according to a switching GP model. They used the model to segment human motion. I liked it since it was closely related to my ICML paper on GP change point models. They claimed the advantage of their method is that it enforced continuity in the time series across segment switches. Although, this can easily be done in my setup I am glad my paper got a citation ;)

Global seismic monitoring as probabilistic inference
Nimar Arora, Stuart Russell, Paul Kidwell, Erik Sudderth
They used graphical models to infer if earthquakes and other seismic events (e.g. nuclear tests) are noise (from local events near a seismic sensor) or from a genuine event, which should be noticed by multiple seismic sensors.

A Bayesian Approach to Concept Drift
Stephen Bach, Mark Maloof
This paper is also similar to the Adams & MacKay change point framework. They replaced the base model (UPM) with a discriminative classifier (such as Bayesian logistic regression). They admitted to fitting some of the hyper-parameters to the test, which is cheating. However, they tried to justify it by saying that it is inappropriate to try to learn the frequency of concept drifts (change points) from training data. I don't think the argument is coherent.

Predicting Execution Time of Computer Programs Using Sparse Polynomial Regression
Ling Huang, Jinzhu Jia, Bin Yu, Byung-Gon Chun, Petros Maniatis, Mayur Naik
They did an analysis of programs to predict their execution time. The novelty of the paper is that they created features by "splicing" the program; they found small snippets of the program that could be executed quickly. They used the output of these snippets as features for a LASSO regression with polynomial regression. Polynomial basis functions are sensible since the run-time of a program is usually approximately linear, quadratic, or cubic in some aspect of its input. I pointed them to Zoubin's polybayes.m demo as a way of selecting the order of a polynomial from data. Symbolic regression using Eureka might also be illuminating.

Slice sampling covariance hyperparameters of latent Gaussian models
Iain Murray, Ryan Adams
Iain presented a some tricks for transforming the sample space in GP classification to drastically improve the convergence of sampling GP hyper-parameters. Iain is a fan of re-parameterizing models to spaces that makes sampling easier. He claims the naive sampling method gets stuck in an "entropic barrier." He says this a third and often ignored, but common, failure mode of MC methods. The are other two are: the sampling method getting stuck in one mode of the posterior and dimensions that are highly correlated.

Heavy-Tailed Process Priors for Selective Shrinkage
Fabian Wauthier, Michael Jordan
Fabian did GP classification while applying heavy tail noise to the latent GP before squashing the function through a sigmoid/probit. They claim GPC often gives over confident predictions in sparsely sampled areas of the input space. This method claims to alleviate the problem. Since the problem does not occur in synthetic data I asked him what he thought was the underlying model assumption violated. He believes the root cause is the stationarity assumption in most GP kernels is inappropriate in many cases.

Copula Processes
Andrew Wilson, Zoubin Ghahramani
It was nice to see that Andrew attracted quite a crowd at his poster.

At the workshops I liked:
Natively probabilistic computation: principles and applications
Vikash Mansinghka, Navia Systems
Vikash argued that his accelerated hardware could do millions of samples per second in Gibbs sampling an MRF (1000x improvement). The hardware restricted the flexibility of what kind of sampling you could do. The loss in performance from lossing that flexibility was compensated for many times over by using the hardware acceleration. He argues that maybe the best approach is to use simple samplers and his accelerated hardware over sophisticated samplers in software.

There was talk about the prospect of moving to analog computation for sampling. A lot of energy is used in CPUs to make them completely deterministic with digital computation, but then in MC methods we artificially introduce randomness. Maybe it is better to do MC computations with analog. However, Vikash said that we must limit the analog computation to very small accelerated units within a digital processor in order for it to be manageable. The analog element would require custom ICs, which requires more funding than he currently has. However, he has selectively reduced the bit precision of many of his computations, which he says can be done when the quantities are random. This saves chip real-estate and power.

Tuesday, October 19, 2010

LaTeX A0 Summary

Following the well known latex-one-pager there is now the LaTeX A0 that combines 16 pages of LaTeX summary into poster.

Monday, September 20, 2010

MLSP 2010 highlights

Here are my highlights from MLSP.

To my knowledge this was the first machine learning conference to occur within the arctic circle ~ 68 N. The conference took place on top of the gondola. The key highlight of the conference was the summer bobsled track from the conference center to the village. The food was mostly raindeer (in various forms) and berries ;)

On the technical side:

Kalman Filtering and Smoothing Solutions to Temporal Gaussian Process
Regression Models:
Simo Sarka had a poster where he converted (almost arbitrary) stationary GP time series models into a state space model. He then used to Kalman filter to do O(T) predictions. As opposed to O(T^3) for general GPs and O(T^2) or O(TlogT) with Toeplitz tricks if the time series is in discrete time. Simo's method works in continuous time as well.

Recent directions in nonparametric Bayesian machine learning Zoubin gave a lecture were he made an unapologetic advertisement for NP-Bayes.

Tom M. Mitchell: Machine Learning for Studying Neural Representations of Word Meanings An interesting talk showing the cutting edge of machine learning applied to fMRI data.

PASS-GP: Predictive Active Set Selection for Gaussian Processes A new approach to sparse GPs involving selecting a subset of data points.

Archetypal Analysis for Machine Learning Mikkel's old NIPS pal an enthusiastic talk on "Archetypal Analysis", which most of the MLSP crowd was unfamiliar with.

CBMS highlights

Here are my highlights from CBMS: the non-parametric Bayes conference at UC Santa Cruz. It was organized more like a summer school, however.

The conference was dominated by Peter Muller, who gave 10 1.5 hour lectures on non-parametric Bayes. He talked mainly of Dirichlet processes and the generalizations to them: Pitman-Yor, Polya trees, ect. He presented a "graphical model of graphical models" demonstrating the connection between the related models. He went through each model and compared them by their predictive probability function (PPF), which is the one-step-ahead predictive distribution for the models. Notably absent from his unifying view was Gaussian processes.

Michael Jordan gave one lecture where he went through various models various NP Bayes models he has worked with: LDA, IBPs, sticky HMMs, ... He didn't get too technical, but tried to give a high level view of many models motivated by applications such as speaker diarization.

Wes Johnson gave one lecture giving examples of NP Bayes in biology.

Finally, Peter Hoff gave one lecture "Alternative approaches to Bayesian nonparametrics". He gave some examples of how doing Bayesian inference with an unknown Gaussian has a better predictive probability than using a DP-mixture for N <> 100 were referred to "large" and N > 5000 as "huge".

The slides are available here:

Sunday, July 4, 2010

Why LaTeX is superior to MS Word

I recently submitted a paper to Interdisciplinary Graduate Conference (IGC) 2010. I prepared well formatted 8 page LaTeX document. However, the conference was organized by humanities students who had never heard of LaTeX. They wanted a .doc file. I then had to go through the painful process of converting my LaTeX document into a Word one. After that painful experience, I could not resist writing a rant on the process. I have experienced both forms of writing: I used word for every (lab) report in my under-grad. Once in my PhD program I was converted to LaTeX and have not looked back since.

  1. Speed: Anything that requires a mouse and clicking through menus will be slower than one where you can write it out in a few key strokes. This means that writing characters with accents, special symbols, and especially equations will be much faster to write in LaTeX.

    1. MS Word (and even worse open office) can get sluggish when editing large documents with lots of equations and figures.

  2. Security: stored in plain text

    1. MS Word stores its files in a bloated binary form. If a file gets corrupted for whatever reason you could be locked out of your file and many hours of work. Likewise if some bug in MS Word is causing it to crash when opening your file. With plain text source files, if all else fails you can always open and edit the file in a simple text editor.

  3. Separation of context and formatting

    1. The plain text style of LaTeX simply using section and subsection etc allows the writer to simply think about the logical flow of the document without worrying about superficial details such as font sizes and styles

    2. You know all your sections and subsections headings are in the correct font size. This is much harder to check using MS Word.

  4. Integratable with SVN

    1. Being plain text it is easier for SVN (and other revision control systems) to merge files being edited

    2. It also takes up less space on the server storing revisions

    3. It is also possible to use any diff tool to compare revisions

      1. You also know the diff tool will show you ALL changes in the state of the file. There is no such guarantee when using features such as track changes in MS Word.

  5. Control

    1. MS Word often tries to outsmart you. It will automatically capitalize, automatically try to select whole sentences, automatically insert bullet points, and try to infer when you’re done with a sub/superscript when writing equations. Software that tries to out-smart you will often out-dumb you. It tries too hard to infer what you want and often gets it wrong.

      1. A good example is the use of - vs. -- vs. ---. Microsoft assumes most users are not smart enough to infer which type of dash to use in which situation. So MS Word tries to figure it out automatically. It can be really annoying when it gets it wrong. With LaTeX you just write what you want in 1 to 3 key strokes!

      2. There is also the quote directions `` ''. In LaTeX it is specified manually while in word it is automatic, and can be annoying if word infers it wrong.

    2. Notion of state: every change in the file is visible. Nothing is hidden from you in a plain text source file.

      1. There is no hidden meta information

  6. Flexibility

    1. It is easy to search for $x^2$ or \footnote in LaTeX the is no easy way to do the analogous searches using CTRL-F in MS Word

    2. You can also search the file using regular expressions

    3. Macros for a more semantic representation can be written in just 1 line with a few key strokes

      1. For example, I often define \field{} using \mathcal{} and then \R using \field{R}

    4. In MS Word macros are often full blown VB scripts

      1. They should be disabled any way since they are a security risk

  7. More easily scriptable

    1. For instance, I have MATLAB code that exports a matrix of results in MATLAB to a LaTeX table

    2. It would require a full blown C++/VB program in visual studio using all sorts of crazy APIs (and therefore reading tons of documentation) to do the same thing in MS Word

  8. Speed-quality trade-off in formatting

    1. Because MS Word has to reformat the document every keystroke it has to use inferior typesetting methods to prevent the GUI from becoming glacially slow

    2. Since you only recompile after significant changes in LaTeX, it can afford to use more expensive type-setting algorithms (especially for equations) that might take 10 seconds to run.

  9. Interoperable

    1. Since things like bibtex are also plain text it is much easier for third parties to create applications such as Jabref. You don't get stuck using one particular reference manager. If you don't like one there are others to use instead. And if all else fails you can always edit it in notepad.

    2. Therefore, there is no vendor lock-in and you can't get stuck using one piece of software for backward compatibility reasons that may in the future become inferior to the alternatives.

  10. Misc

    1. Equations always come out looking crappy

    2. Figure captions aren't proper; MS Word will let them cross page boundaries for instance

    3. Footnotes are a pain, especially if you have multiple footnotes on the same page

    4. Equation numbering is a pain in Word

    5. It is possible to embed pdf figures in LaTeX which allows for vector graphics and avoids file bloat

      1. However, I beleive new versions of word allow for the insertion of eps figures

  11. Cost

    1. LaTeX is free while MS Office can cost a few hundred dollars

      1. There is open office but that is even worse

    2. I am not a "free-tard" so this is not my top concern

Advantages of MS Word:

  1. MS Word has a grammar checker. To my knowledge, none of the LaTeX editors have a grammar checker.

    1. Of course, the grammar checker should usually be taken with a grain of salt. However, it is good at catching typos such as interchanging it/is/if and they/then, which a spell checker will not find and are easy to glance over when proof reading.

  2. The equation editor is better than it used to be. Writing an equation heavy document in word used to be almost impossible. However, it is now doable, but still much slower than LaTeX.