2016 saw the continued success of Deep Learning: superhuman Go playing, superhuman speech transcription, superhuman translation, superhuman lip reading, and maybe-superhuman driving ability (no public benchmarks for self-driving cars yet). These accomplishments are especially remarkable, considering that many AI researchers used to believe that solving the relevant problems required solving general intelligence itself.

The research topics that are in vogue right now (GANs, RL, generative models, unsupervised learning, robotics, alternative ways to train DNNs) are important problems to tackle before we can build general AI. However, they haven’t created significant commercial value in industry yet, in ways that couldn’t plausibly be substituted with traditional supervised learning.

Transfer learning, domain adaptation and semi-supervised learning alleviate the data-hungry requirements of deep learning, and are starting to work really well.

The volume and quality of published work in 2016 was staggering, and it’s hard to come away from the conference feeling like there are any ideas left to work on. Many papers and posters I saw at the conference had multiple groups converging on similar ideas.

GANs

I can’t imagine a higher compliment for Ian Goodfellow than Yann Lecun declaring Generative Adversarial Networks (GANs) to be “the biggest breakthrough in Machine Learning in the last 1-2 decades.”

For those of you who are unfamiliar with GANs, here’s a simple writeup explaining them. GANs were initially developed for generating images, but researchers are exploring their applications in reinforcement learning, domain adaptation, security ML, compression, and more. A more appropriate terminology would be “adversarial methods”, because the network we are trying to train need not be “generative” – it could be a deterministic classifier or policy instead.

Why are GANs so cool? Before they became popular, many researchers formulated learning as an optimization objective, i.e. “minimize a loss or maximize a reward”. GANs break away from that paradigm; instead, adversarial methods solve for a Nash equilibria between at least two “agents”, both of which are allowed to co-adapt to each other.

The research community cares about this because some loss functions are often difficult to specify directly. In many over-parameterized problems, the optimization process will “short-circuit” design flaws in the objective and yield a result that might minimize the objective, but isn’t quite what we really wanted.

For example, if we wanted to minimize some error for image compression/reconstruction, often what we find is that a naive choice of error metric (e.g. euclidean distance to the ground truth label) results in qualitatively bad results. The design flaw is that we don’t have good perceptual similarity metrics for images that are universally applicable for the space of all images. GANs use a second “adversarial” network learn an optimal implicit distance function (in theory).

Ian Goodfellow gave a very nice tutorial on GANs. The main challenge for GANs right now is preventing degenerate equilibria between the discriminator and generator. These issues often manifest themselves as mode collapse (the generator learns a single tight mode over a single sample) or the discriminator “overpowering” the generator. I’m not worried about this – I think with time, intuitive tricks and recipes from experienced researchers will make them easier to use. People used to think that only Geoff Hinton could train DNNs, and that ended up not being the case.

Here’s a quote from Maciej Cegłowski’s an eminently hilarious, bookmark-bar-tier blog post about AI existential alarmism:

With no way to define intelligence (except just pointing to ourselves), we don’t even know if it’s a quantity that can be maximized. For all we know, human-level intelligence could be a tradeoff. Maybe any entity significantly smarter than a human being would be crippled by existential despair, or spend all its time in Buddha-like contemplation.

Does intelligent life arise from optimization? Or is it an equilibrium? Or to quote Yann, “a bunch of hacks that just happen to work?” AI research is not ready to answer these questions yet, but I am optimistic that research on GANs will tease out what kinds of problems are better-suited for expressing as an equilibrium-finding vs. minimization problem.

Deep RL

People usually regard Reinforcement Learning (RL) as pertaining to playing video games and controlling robots, but RL methods are far more general than that – any ML problem with dependencies across space or time can benefit from RL techniques (i.e. sequential pixels). RL allows us to think about non-stationary data distributions as “environments” that respond to an agent’s behavior. RL is applied creatively to a lot of ML problems nowadays, such as translation.

Pieter Abbeel and John Schulman gave a very nice tutorial on policy gradient methods, and John gave a talk on the “nuts and bolts” of Deep RL research. The nuts and bolts talk is jam-packed with great advice that would have taken me years to figure out on my own, and I highly recommend it for anyone who is working on RL projects.

Curiously, a recent paper by Chelsea Finn showed that certain classes of inverse RL problems have a mathematically equivalent GAN formulation!

Someone asked the Deep RL panelists what some of the most challenging problems were in RL. Rich Sutton and Raia Hadsell mentioned learning of subtasks/subgoals, i.e. hierarchical representations of the task at hand. John Schulman mentioned an interesting comment about how the Bellman discount factor in the discrete MDP setting makes it difficult to plan more than >100 time steps into the future, even when r=0.99. Finally, Chelsea Finn discussed sample complexity: it’s annoying that we have to learn every RL task from scratch, and some kind of inter-task transfer learning is needed. After all, humans can quickly learn to play new games because we’ve played other games before.

Bayesian Deep Learning

I think one of the most promising areas of research right now is the marriage of deep neural nets with graphical models, referred to some as “Bayesian deep learning.” It’s probably the closest thing to a theoretical understanding of deep learning that generalizes to how neural nets are used in practice.

Kyle Cranmer gave a phenomenal talk on how likelihood-free inference methods were used to find evidence of the Higgs Boson. A lot of it went over my head, but what little I understood from it was that they used a variational model to approximate the simulation of a 11-story particle detector at atomic scale. Fig 1

There were also a whole bunch of interesting variational inference techniques presented at the conference:

Symmetrized Variational Inference – a way to avoid mode collapse with Gaussian VAEs

Rejection Sampling Variational inference – reparameterization trick for a rejection sampler, allowing us to train gamma, beta, Dirichlet and many other variational distributions! This is huge.

Robust Inference with Variational Bayes – improving the robustness of a variational posterior to choice of prior and likelihood.

ELBO surgery – ELBO is the most common variational lower bound used to train VAEs, and this paper proposes an alternative way to write the ELBO: split up the KL(q(z|x)|p(z)) term into an index-code mutual information term and marginal KL term KL(q(z)|p(z)). Experiments show that the latter term is actually quite large, which suggests that we should focus on improving priors. This work is quite important because it provides a easy-to-implement debugging tool for anyone training VAEs.

Operator Variational Inference, Normalizing Flows, and Real-NVP enable richer families of posterior approximations.

Variational lossy autoencoders

Open Research

There were a couple exciting software announcements at NIPS this year. First was OpenAI’s announcement of Universe, a software interface that bridges all kinds of software to OpenAI Gym. This means we can bring RL to interact with environments such as the full-featured Chrome Browser, or desktop applications like Photoshop. Think of this as “AppleScript on steroids”.

Deepmind released Deepmind Lab for FPS + 3D agent research in their Labyrinth environment, and earlier this year, Microsoft released Project Malmo for training AI agents in Minecraft. These are really interesting environments, and typically outside the capabilities of a grad student to implement on their own.

These are just a few of the high-profile software releases; many other tech companies and academic institutions are releasing benchmark software, deep learning frameworks, and data all the time.

People often cite 1) compute, 2) data, and 3) algorithms as the reason why DL is taking off, and I think we should add 4) the unprecedented accessibility of ML research. Tons of people are already using OpenAI gym for RL projects, and many research codes are being made public for other people to build on top of. The quality and scope of research benefits dramatically from students not having to write thousands of lines of support code, and instead focus on their research problem.

It’s great that large tech companies are making such great software available to the broader community, and this kind of work that is more impactful than any research paper I saw at the conference this year.

Talks and Papers that I Liked

I had nowhere enough mental capacity to see all the papers and talks at NIPs, but I did write down some talks, papers, and posters that I came away really inspired by. Here’s a non-exhaustive list.

Andrew Ng gave a practical tutorial on how to debug ML research projects. This is a must-watch for beginners in the field. Here is a video of a similar talk he gave at the Deep Learning summer school.

Fixed Points of Belief Propagation – the authors formulate the convergence condition of belief propagation (BP) as the solution of polynomial equations. This makes me wonder if these methods could be used to bridge nonlinear dynamical systems with statistical inference.

Deep Networks with Stochastic Depth – layer-wise dropout allows for training ResNets with >1000 layers.

Layer Norm – like batch norm, layer norm is used to prevent covariance shift between successive deep layers. The advantage is that layer norm does not introduce gradient variance from other minibatch elements.

Deep Successor Reinforcement Learning – I really liked this paper because it gets at model-based + model-free ideas. The motivation is that model-free methods like DQN take a long time to learn adequate low-level features because we are often trying to train a CNN to “see” using a single value scalar that is not directly related to image features. The idea is to learn “successor features” in an unsupervised manner (using rich supervision provided by model prediction error), and then separately learn a simple linear value function on top of these features. The paper is short and well-written.

The Intelligent Biosphere by Drew Purves (couldn’t find the talk slides). This talk was a bit of an oddball among the other mathy ones at NIPS (I thought the drawings were distracting and failed to capture the nuance of what Drew was saying), but I think the message is crucial for AI researchers who want to take a step back and think about what it means for a system to be “intelligent”. The basic idea is that “intelligence” extends beyond brains; something as simple as self-replicating RNA exhibit intelligent behavior at the evolutionary scale. The natural world is fractal, cyclic, and fuzzy, as opposed to the simplicity of our current simulated environments for benchmarking AI. The implication is that there may be challenges learning in these environments that don’t carry over from Atari games. One interesting comment Drew made was that in a biosphere, every organism is a resource to another organism. That is, learning and adaptation of each organism is not independent of other organisms (reminds me of GANs)! If you’re interested in these environment + intelligence ideas, similar arguments have been made in ethology and neuroethology texts.

Input Convex Neural Networks – the authors presented a very interesting follow-up workshop poster at the Deep RL workshop (I can’t find a paper), where they were able to solve continuous Q functions by parameterizing the Q-function to be convex with respect to the action vector of the agent (and non-convex with respect to the input state). This allows for efficient DQN policies in continuous action domains.

Using Fast Weights to Attend to the Recent Past – basically an alternative to LSTM that seems to work a lot better.

Phased LSTM – a way to learn an adaptive time scale for LSTM architectures, where the hidden state is updated at a learned frequency. This is ideal for learning from data with arbitrary sampling rates, such as from event-based vision sensors (which naturally solve some hard computer vision problems like aliasing and bandwidth constraints). It also happens that this Zurich lab happens to make a really awesome event-based camera and I’m really excited to see how they apply ML to it.

Residual Networks Behave Like Ensembles of Relatively Shallow Networks – the title says it all. I thought the experiments could have been a lot better, but regardless, this work could potentially have a huge impact in new ways to train very deep ResNets quickly.

Graph Convolutional Networks – generalization of convolutions to allow DNN processing on graph data structures (the kind with edges and vertices). This work is important for getting DL methods to learn approximate solutions to NP-hard graph problems.

Minimizing Quadratic Functions in Constant Time – I did a double-take when I saw the title of this poster. I don’t understand the math very well, but the algorithm is very simple: subsample a n×n matrix to a k×k matrix and solve it. A simple scaling of the resulting minimized value yields an approximation to the minimized objective of the original system! In problems where we need to branch-and-bound a large search space (an optimal portfolio), we might find the value of a solution more informative than the solution itself.

Predictions for 2017

SGD + backprop will still be the best method for training deep neural nets in 2017.

Really good transfer learning, domain adaptation, and/or multitask learning will be one of the major accomplishments in 2017 (this was suggested by Andrew Ng during his tutorial).

Ensemble methods (e.g. training 10 versions of a network and averaging the predictions) will squeeze out 2-5% performance on all kinds of ML tasks.

## Leave a Reply