Now in its 37th year, ICML (The International Conference on Machine Learning) is known for bringing cutting-edge research on all aspects of machine learning to the fore. This year, 1,088 papers have been accepted from 4,990 submissions. Here are a few interesting works to look out for at ICML 2020, which will be held from the 13th to the 18th of July.

Rethinking Batch Normalization for Meta-Learning

Meta-learning relies on deep networks, which makes batch normalization an essential component of meta-learning pipelines. However, there are several challenges that can render conventional batch normalization ineffective in this setting, giving rise to the need to rethink normalization. The authors evaluate a range of approaches to batch normalization for meta-learning scenarios and develop a novel approach, TaskNorm. Experiments demonstrate that the choice of batch normalization has a dramatic effect on both classification accuracy and training time for gradient-based and gradient-free meta-learning approaches alike, and that TaskNorm consistently improves performance.
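To make the idea concrete, here is a minimal NumPy sketch of how a TaskNorm-style layer might blend context-set statistics with per-instance statistics, with a blend weight that grows with context-set size. The function names and the exact blending rule are our illustration, not the paper's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tasknorm(x, context, scale=0.1, bias=0.0, eps=1e-5):
    """Normalize activations x (n, d) for one task by blending context-set
    moments with per-instance moments (a TaskNorm-style sketch)."""
    alpha = sigmoid(scale * len(context) + bias)    # blend weight grows with context size
    mu_c, var_c = context.mean(0), context.var(0)   # context-set moments
    mu_i = x.mean(1, keepdims=True)                 # per-instance fallback moments
    var_i = x.var(1, keepdims=True)
    mu = alpha * mu_c + (1 - alpha) * mu_i
    var = alpha * var_c + (1 - alpha) * var_i
    return (x - mu) / np.sqrt(var + eps)
```

With a large context set the layer behaves like transductive normalization over the context; with a tiny one it falls back towards instance-level statistics.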

Link to paper

Do RNN and LSTM have Long Memory?

This paper raises the question: do RNN and LSTM have long memory? The authors answer it in part by proving that, from a statistical perspective, RNN and LSTM do not have long memory. They introduce a new definition of long memory networks that requires the model weights to decay at a polynomial rate. Based on this definition, RNN and LSTM are converted into long memory networks, and their superiority in modelling the long-term dependence of various datasets is illustrated.
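The polynomial-decay criterion can be illustrated numerically (the constants below are ours, not the paper's): a vanilla RNN weights an input from k steps back by roughly lam**k, an exponential decay, whereas a long memory network must keep weights decaying no faster than a polynomial k**(-d).

```python
import numpy as np

# Coefficient on input x_{t-k} in the hidden state h_t:
# a vanilla RNN with recurrent weight 0.9 contributes 0.9**k, an
# exponential (short-memory) decay; a long-memory network keeps weights
# decaying only polynomially, like k**(-d).
k = np.arange(1, 1001)
exp_tail = 0.9 ** k       # short memory: tail mass vanishes almost immediately
poly_tail = k ** (-0.6)   # long memory: distant inputs keep non-trivial weight

exp_mass = exp_tail[500:].sum()    # influence of inputs > 500 steps back
poly_mass = poly_tail[500:].sum()
```

Inputs more than 500 steps in the past retain substantial total weight under polynomial decay, while their exponential counterpart is numerically negligible.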

Link to paper

Generative Pretraining from Pixels

Inspired by progress in unsupervised representation learning for natural language, the researchers at OpenAI examine whether similar models can learn useful representations for images. They train a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure. Despite training on low-resolution ImageNet without labels, they find that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification. 
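A minimal NumPy sketch of the data pipeline this describes: pixels are quantized to a small vocabulary and flattened in raster order, and the model is scored by the average negative log-likelihood of predicting each token from its predecessors. The helper names are ours; iGPT additionally uses a k-means colour palette and a Transformer to produce the per-step distributions.

```python
import numpy as np

def to_sequence(img, n_bins=16):
    """Quantize pixel intensities in [0, 1) to a small vocabulary and flatten
    the image in raster order, discarding the 2D structure (as in iGPT)."""
    tokens = np.clip((img * n_bins).astype(int), 0, n_bins - 1)
    return tokens.reshape(-1)  # (H*W,) token sequence for next-pixel prediction

def autoregressive_nll(seq, probs):
    """Average negative log-likelihood of each token, given per-step
    predictive distributions probs of shape (T, vocab)."""
    return -np.mean(np.log(probs[np.arange(len(seq)), seq]))
```

A model that predicts uniformly over the palette scores log(n_bins) nats per pixel; a trained autoregressive model should do much better.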

Link to paper

Improving the Gating Mechanism of RNNs

Gating mechanisms are widely used in neural network models, where they allow gradients to backpropagate more easily through depth or time. In this work, the authors present two modifications to the standard gating mechanism that require no additional hyperparameters and improve the learnability of the gates when they are close to saturation. They show that these simple modifications robustly improve the performance of recurrent models on image classification, language modelling, and reinforcement learning.
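One of the two modifications, uniform gate initialization, is easy to sketch (the interval bounds below are illustrative): instead of a constant bias, gate biases are sampled through the inverse sigmoid so that initial gate activations are spread across (0, 1) rather than clustered around 0.5.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def uniform_gate_bias(n, rng, lo=0.05, hi=0.95):
    """Sample gate biases via the inverse sigmoid (logit) so that the initial
    gate activations sigma(b) are spread uniformly over (lo, hi), a sketch of
    uniform gate initialization."""
    u = rng.uniform(lo, hi, size=n)
    return np.log(u / (1 - u))  # logit: sigma(bias) == u
```

Spreading the initial activations gives some gates long timescales from the start, which is where the benefit for long-range credit assignment comes from.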

Link to paper

What Can Learned Intrinsic Rewards Capture?

The objective of a reinforcement learning agent is to behave so as to maximise reward. In this paper, the authors instead consider the proposition that the reward function itself can be a good locus of learned knowledge. To investigate this, they propose a scalable meta-gradient framework for learning useful intrinsic reward functions across multiple lifetimes of experience, and show that it is feasible to capture knowledge about long-term exploration and exploitation in a reward function.
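The meta-gradient idea can be sketched on a toy problem (entirely our construction, far simpler than the paper's setup): an inner loop trains a two-armed bandit policy purely on a learned per-arm intrinsic reward, and an outer loop adjusts that intrinsic reward by finite-difference meta-gradients so that the trained policy maximizes extrinsic return.

```python
import numpy as np

EXT = np.array([0.2, 1.0])  # true extrinsic reward of each arm (arm 1 is better)

def lifetime_return(eta, steps=200, lr=0.5):
    """Train a softmax bandit policy on the *intrinsic* rewards eta only,
    then report the extrinsic return of the resulting policy."""
    logits = np.zeros(2)
    for _ in range(steps):
        p = np.exp(logits - logits.max()); p /= p.sum()
        grad = p * (eta - p @ eta)   # expected policy-gradient update, reward = eta
        logits += lr * grad
    p = np.exp(logits - logits.max()); p /= p.sum()
    return p @ EXT

def meta_step(eta, meta_lr=1.0, h=0.01):
    """One outer-loop update: finite-difference meta-gradient of the
    extrinsic lifetime return with respect to the intrinsic reward eta."""
    g = np.zeros_like(eta)
    for i in range(len(eta)):
        e = np.zeros_like(eta); e[i] = h
        g[i] = (lifetime_return(eta + e) - lifetime_return(eta - e)) / (2 * h)
    return eta + meta_lr * g
```

Starting from a zero intrinsic reward, a few meta steps push eta to favour the arm with the higher extrinsic payoff, even though the inner agent never sees extrinsic reward directly.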

Link to paper

Reverse-Engineering Deep ReLU Networks

This work investigates the commonly assumed notion that a neural network cannot be recovered from its outputs, since these depend on the network's parameters in a highly nonlinear way. The authors claim that, by observing only its outputs, one can identify the architecture, weights, and biases of an unknown deep ReLU network. By dissecting the set of region boundaries into components associated with particular neurons, the researchers show that it is possible to recover the weights of neurons and their arrangement within the network.
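The geometric idea is easy to demonstrate on a toy network (the probing scheme below is a simplified illustration, not the paper's algorithm): a ReLU network's output is piecewise linear, so probing along a line and detecting kinks in the output locates points on the region boundaries, each attributable to one neuron's preactivation crossing zero.

```python
import numpy as np

# A tiny 1-hidden-layer ReLU net with known weights (a stand-in for the
# unknown network being reverse-engineered).
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([-1.0, -2.0, -3.0])
v = np.array([1.0, 1.0, 1.0])

def net(x):
    return np.maximum(W @ x + b, 0.0) @ v

# Probe along the line x(t) = t * d. The output is piecewise linear in t;
# each kink (slope change, i.e. a spike in the second difference) marks a
# point where one neuron's preactivation crosses zero -- a piece of the
# region boundaries the paper dissects to recover weights.
d = np.array([1.0, 1.0])
ts = np.linspace(-10.0, 10.0, 20001)
ys = np.array([net(t * d) for t in ts])
kinks = ts[1:-1][np.abs(np.diff(ys, 2)) > 1e-6]   # detected boundary crossings
```

From many such boundary points one can fit each neuron's hyperplane, which is the starting point for recovering weights and biases up to the network's natural symmetries.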

Link to paper

A Free-Energy Principle for Representation Learning

This paper employs a formal connection of machine learning with thermodynamics to characterize the quality of learnt representations for transfer learning. It discusses how rate, distortion and classification loss of a model lie on a convex, so-called equilibrium surface. Dynamical processes are prescribed to traverse this surface under constraints. The authors demonstrate how this process can be used for transferring representations from a source dataset to a target dataset while keeping the classification loss constant. 
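In generic rate-distortion notation (ours, not necessarily the paper's), such an equilibrium surface arises by minimizing a multi-term objective and sweeping the multipliers:

```latex
% Rate R, distortion D, classification loss C; \lambda and \gamma trade them off.
F(\lambda, \gamma) = \min_{\theta} \; \bigl[ R(\theta) + \lambda \, D(\theta) + \gamma \, C(\theta) \bigr]
```

The optimal triples (R, D, C) traced out as the multipliers vary form the convex surface; holding C fixed while moving along it mirrors the constant-classification-loss transfer procedure described above.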

Link to paper

Deep Divergence Learning

This paper introduces deep Bregman divergences, which are based on learning and parameterizing functional Bregman divergences using neural networks. The authors describe a deep learning framework for learning general functional Bregman divergences and show in experiments that this method yields superior performance on benchmark datasets as compared to existing deep metric learning approaches. This work also includes discussion on novel applications, including a semi-supervised distributional clustering problem, and a new loss function for unsupervised data generation.
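The object being parameterized is easy to write down. Below is the standard pointwise Bregman divergence for a fixed convex generator phi; deep divergence learning would replace phi with a neural network (and the functional, distribution-level version in the paper adds further terms).

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """Bregman divergence D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>
    for a convex generator phi. Deep Bregman divergences learn phi with a
    neural network; here phi is fixed and supplied with its gradient."""
    return phi(x) - phi(y) - grad_phi(y) @ (x - y)

# With phi(x) = ||x||^2 the divergence reduces to squared Euclidean distance,
# recovering a familiar metric-learning loss as a special case.
phi = lambda x: x @ x
grad_phi = lambda x: 2 * x
```

Other choices of phi recover other classical divergences (e.g. KL for the negative entropy generator), which is what makes the learned family so general.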

Link to paper

Feature Quantization Improves GAN Training

This work proposes Feature Quantization (FQ) for the discriminator, to embed both true and fake data samples into a shared discrete space. The authors state that this method can be easily plugged into existing GAN models with little computational overhead in training. They apply FQ to BigGAN for image generation, StyleGAN for face synthesis, and U-GAT-IT for unsupervised image-to-image translation. Results show that FQ-GAN can improve Fréchet Inception Distance (FID) scores by a large margin on a variety of tasks, achieving new state-of-the-art performance.
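The core quantization step can be sketched as nearest-neighbour assignment to a shared codebook (a simplified illustration; the paper additionally maintains the dictionary with moving averages and passes gradients straight through the quantizer).

```python
import numpy as np

def quantize_features(h, codebook):
    """Map each discriminator feature vector in h (n, d) to its nearest
    codebook entry (k, d), embedding real and fake features in one shared
    discrete space -- the core idea of feature quantization."""
    d2 = ((h[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (n, k) squared distances
    idx = d2.argmin(1)                                          # nearest code per feature
    return codebook[idx], idx
```

Because real and fake samples compete for the same discrete codes, the discriminator's feature space stays consistent across the two distributions, which is the claimed source of the training benefit.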

Link to paper

LEEP: A New Measure to Evaluate Transferability of Learned Representations

The Log Expected Empirical Prediction (LEEP) is a new measure for evaluating the transferability of representations learned by classifiers. Even on small or imbalanced data, LEEP can predict the performance and convergence speed of both transfer and meta-transfer learning methods. The authors state that LEEP outperforms recently proposed transferability measures such as negative conditional entropy; when transferring from ImageNet to CIFAR100, it achieves up to 30% improvement over the best competing methods.
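LEEP is simple enough to sketch directly (variable names are ours): average, over target examples, the log of the expected empirical prediction obtained by pushing the source model's label distribution through an empirical conditional of target labels given source labels.

```python
import numpy as np

def leep(probs, y, n_target):
    """LEEP score from a source model's predictions on the target set.
    probs: (n, z) source-label probabilities for each target example;
    y: (n,) target labels. Higher (closer to 0) suggests better transfer."""
    n = len(y)
    joint = np.zeros((n_target, probs.shape[1]))   # empirical joint P(y, z)
    for i in range(n):
        joint[y[i]] += probs[i] / n
    cond = joint / joint.sum(0, keepdims=True)     # empirical conditional P(y | z)
    # Expected empirical prediction of y_i: sum_z P(y_i | z) * probs[i, z]
    return np.mean(np.log((probs @ cond.T)[np.arange(n), y]))
```

Being a log-likelihood, the score is at most 0: a source model whose predictions perfectly determine the target label scores 0, and less transferable representations score more negatively.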

Link to paper

Source: Sergei Ivanov

Check the full list of accepted papers here.