Top 10 Papers From ICLR 2020 Conference

The International Conference on Learning Representations (ICLR), which concluded last week, is one of the major AI conferences held every year. This year, ICLR went virtual because of the COVID-19 pandemic. Of the 2,594 papers submitted, 687 made it to ICLR 2020, a 26.5% acceptance rate.

Here are a few of the top papers at ICLR:


ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, larger models run into GPU/TPU memory limitations and longer training times. To address these problems, this work presents two parameter-reduction techniques that lower memory consumption and increase the training speed of BERT. The resulting models scale much better than the original BERT. The authors also use a self-supervised loss that focuses on modelling inter-sentence coherence and show that it consistently helps downstream tasks with multi-sentence inputs. As a result, the model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters than BERT-large.
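The summary above does not name the two techniques; in the ALBERT paper they are factorized embedding parameterization and cross-layer parameter sharing. The rough parameter count below is only a sketch of why both help. The sizes, the function names, and the 12 x H x H per-layer estimate (which ignores biases and layer norms) are illustrative assumptions, not figures from the article.

```python
def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Parameters in a single vocab_size x hidden_size embedding table."""
    return vocab_size * hidden_size


def factorized_embedding_params(vocab_size: int, embed_size: int, hidden_size: int) -> int:
    """Parameters when the table is split into V x E and E x H matrices."""
    return vocab_size * embed_size + embed_size * hidden_size


def encoder_params(per_layer_params: int, num_layers: int, share_layers: bool) -> int:
    """Encoder parameters with or without cross-layer parameter sharing."""
    return per_layer_params if share_layers else per_layer_params * num_layers


# Illustrative sizes (assumptions, not quoted from the article).
V, E, H, LAYERS = 30_000, 128, 1024, 24

full = embedding_params(V, H)
factorized = factorized_embedding_params(V, E, H)

# Rough per-layer estimate: 4*H*H for attention plus 8*H*H for the feed-forward block.
per_layer = 12 * H * H
unshared = encoder_params(per_layer, LAYERS, share_layers=False)
shared = encoder_params(per_layer, LAYERS, share_layers=True)

print(f"full embedding:       {full:,}")
print(f"factorized embedding: {factorized:,}")
print(f"encoder, unshared:    {unshared:,}")
print(f"encoder, shared:      {shared:,}")
```

With these assumed sizes, the embedding table shrinks from about 30.7 million to about 4 million parameters, and sharing one layer's weights across all layers divides the encoder count by the number of layers.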

Check the paper here.

Plug and Play Language Models

Plug and Play Language Models (PPLM) combine a pre-trained language model with one or more simple attribute classifiers that guide text generation without any further training of the language model. The attribute models consist of a user-specified bag of words or a single learned layer with 100,000 times fewer parameters than the language model. Model samples demonstrate control over a range of topics and sentiment styles, and extensive automated and human-annotated evaluations show attribute alignment and fluency.

Check the paper here.

Meta-Learning without Memorization

Meta-learning is famous for leveraging data from previous tasks to enable efficient learning of new tasks. However, most meta-learning algorithms require meta-training tasks to be mutually exclusive, such that no single model can solve all of the tasks at once. 

In this paper, the authors address this challenge by designing a meta-regularization objective using information theory that places precedence on data-driven adaptation. By doing so, this algorithm successfully uses data from non-mutually-exclusive tasks to efficiently adapt to novel tasks.

Check the paper here.

Reformer: The Efficient Transformer

This paper introduces two techniques to improve the efficiency of Transformers. First, the authors replace dot-product attention with one that uses locality-sensitive hashing, reducing its complexity from O(L^2) to O(L log L), where L is the sequence length. Second, they use reversible residual layers instead of the standard residuals, which allows activations to be stored only once during training rather than once per layer. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.
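Reversible residual layers save memory because a layer's inputs can be recomputed from its outputs instead of being stored. Here is a minimal sketch of that idea on scalars. The functions `f` and `g` are hypothetical stand-ins for the attention and feed-forward sublayers, and the layer count is arbitrary.

```python
import math
from typing import Tuple


def f(x: float) -> float:
    """Hypothetical stand-in for the attention sublayer."""
    return math.tanh(x)


def g(x: float) -> float:
    """Hypothetical stand-in for the feed-forward sublayer."""
    return 0.5 * x * x


def reversible_forward(x1: float, x2: float) -> Tuple[float, float]:
    """y1 = x1 + F(x2), y2 = x2 + G(y1)."""
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return y1, y2


def reversible_inverse(y1: float, y2: float) -> Tuple[float, float]:
    """Recover the inputs from the outputs by undoing each step in reverse order."""
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return x1, x2


NUM_LAYERS = 3
inputs = (0.3, -0.7)

# Forward pass: only the final pair of activations is kept.
state = inputs
for _ in range(NUM_LAYERS):
    state = reversible_forward(*state)
outputs = state

# Backward pass: earlier activations are recomputed layer by layer.
for _ in range(NUM_LAYERS):
    state = reversible_inverse(*state)
recovered = state

print("inputs:   ", inputs)
print("recovered:", recovered)
```

The recovered values match the inputs up to floating-point rounding, which is why intermediate activations do not need to be stored for backpropagation.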

Check the paper here.

Understanding the Effectiveness of MAML

Model Agnostic Meta-Learning (MAML) is a method that consists of two optimisation loops: the outer loop finds a meta-initialisation, from which the inner loop can efficiently learn new tasks. Despite MAML's popularity, it remains an open question whether its effectiveness comes from rapid learning in the inner loop or from reuse of features already present in the meta-initialisation. The authors investigated this question via ablation studies and analysis of the latent representations. The results show that feature reuse is the dominant factor, which led to the ANIL (Almost No Inner Loop) algorithm, a simplification of MAML where the inner loop is removed for all but the (task-specific) head of the underlying neural network.
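The difference between the two inner loops can be sketched on a toy model with one "body" weight and one "head" weight. This is only an illustration under assumptions: the model, the task data, the learning rate, and the function names are hypothetical, and the outer meta-optimisation loop is omitted entirely.

```python
from typing import List, Tuple


def predict(body: float, head: float, x: float) -> float:
    """Toy network: a one-weight feature extractor followed by a one-weight head."""
    return head * (body * x)


def inner_loop(body: float, head: float, data: List[Tuple[float, float]],
               lr: float, steps: int, adapt_body: bool) -> Tuple[float, float]:
    """Gradient descent on mean squared error for one task.

    adapt_body=True updates both weights (MAML-style inner loop);
    adapt_body=False updates only the head (ANIL-style inner loop).
    """
    for _ in range(steps):
        grad_body = 0.0
        grad_head = 0.0
        for x, y in data:
            feature = body * x
            error = head * feature - y
            grad_head += 2 * error * feature / len(data)
            grad_body += 2 * error * head * x / len(data)
        head -= lr * grad_head
        if adapt_body:
            body -= lr * grad_body
    return body, head


# Hypothetical new task: y = 3x, starting from a shared initialisation.
task = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
init_body, init_head = 1.0, 1.0

anil_body, anil_head = inner_loop(init_body, init_head, task, lr=0.02, steps=50, adapt_body=False)
full_body, full_head = inner_loop(init_body, init_head, task, lr=0.02, steps=50, adapt_body=True)

print("ANIL-style:", anil_body, anil_head, predict(anil_body, anil_head, 2.0))
print("MAML-style:", full_body, full_head, predict(full_body, full_head, 2.0))
```

Both variants fit the task here, but the ANIL-style loop leaves the feature extractor untouched, which mirrors the paper's finding that the features learned during meta-training are largely reused as they are.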

Check the paper here.

Your Classifier is Secretly an Energy-Based Model

This paper reinterprets a standard discriminative classifier as an energy-based model. In this setting, wrote the authors, the standard class probabilities can be easily computed, as can unnormalized values of p(x) and p(x|y). They demonstrated that energy-based training of the joint distribution improves calibration, robustness, and out-of-distribution detection while also enabling the models to generate samples rivalling the quality of recent GAN approaches. This work improves upon recently proposed techniques for scaling up the training of energy-based models and is the first to achieve performance rivalling the state-of-the-art in both generative and discriminative learning within one hybrid model.
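The reinterpretation can be sketched from the classifier's logits alone: the usual softmax gives p(y|x), while the negative log-sum-exp of the same logits serves as an energy for the input x. The snippet below is a minimal sketch of that relationship. The logit values are made up, and the function names are hypothetical.

```python
import math
from typing import List


def logsumexp(values: List[float]) -> float:
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def class_probabilities(logits: List[float]) -> List[float]:
    """Standard softmax: p(y | x)."""
    lse = logsumexp(logits)
    return [math.exp(v - lse) for v in logits]


def joint_energy(logits: List[float], y: int) -> float:
    """E(x, y) = -f(x)[y]; lower energy means higher unnormalized p(x, y)."""
    return -logits[y]


def input_energy(logits: List[float]) -> float:
    """E(x) = -logsumexp_y f(x)[y]; lower energy means higher unnormalized p(x)."""
    return -logsumexp(logits)


logits = [2.0, 0.5, -1.0]  # made-up classifier outputs f(x) for three classes
probs = class_probabilities(logits)

print("p(y|x):", [round(p, 4) for p in probs])
print("E(x):  ", round(input_energy(logits), 4))

# The softmax is recovered as exp(E(x) - E(x, y)), so nothing about the classifier changes.
for y, p in enumerate(probs):
    assert math.isclose(p, math.exp(input_energy(logits) - joint_energy(logits, y)))
```

The normalizing constant over all inputs is never computed, which is why only unnormalized densities are available and why training the joint model needs sampling-based methods.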

Check the paper here.

Lottery Tickets in Reinforcement Learning and NLP

The authors evaluated whether “winning ticket” initializations exist in two different domains: natural language processing (NLP) and reinforcement learning (RL). For NLP, they examined both recurrent LSTM models and large-scale Transformer models. For RL, they analysed a number of discrete-action space tasks. The results suggest that the lottery ticket hypothesis is not restricted to supervised learning of natural images, but rather represents a broader phenomenon in DNNs.
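For readers new to the term, a "winning ticket" is usually found by training a network, pruning its smallest-magnitude weights, and resetting the surviving weights to their initial values before retraining. Below is a minimal sketch of one pruning round on a flat list of weights. The weight values and function names are hypothetical, and the training step itself is not shown.

```python
from typing import List


def magnitude_mask(trained: List[float], prune_fraction: float) -> List[int]:
    """Return 1 for weights to keep and 0 for the smallest-magnitude weights to prune."""
    num_prune = int(len(trained) * prune_fraction)
    order = sorted(range(len(trained)), key=lambda i: abs(trained[i]))
    pruned = set(order[:num_prune])
    return [0 if i in pruned else 1 for i in range(len(trained))]


def winning_ticket(initial: List[float], mask: List[int]) -> List[float]:
    """Reset surviving weights to their initial values; pruned weights stay at zero."""
    return [w if m else 0.0 for w, m in zip(initial, mask)]


initial = [0.10, -0.40, 0.25, 0.05, -0.30, 0.20]   # hypothetical initial weights
trained = [0.02, -0.90, 0.60, 0.01, -0.05, 0.75]   # hypothetical weights after training

mask = magnitude_mask(trained, prune_fraction=0.5)
ticket = winning_ticket(initial, mask)

print("mask:  ", mask)
print("ticket:", ticket)
```

The mask is chosen from the trained weights, but the ticket reuses the original initialization; the hypothesis is that this sparse subnetwork can be trained to performance comparable to the full network.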

Check the paper here.

On Identifiability in Transformers

In this paper, the authors delve into the Transformer architecture by investigating two of its core components: self-attention and contextual embeddings. In particular, they examine the identifiability of attention weights and token embeddings, and the aggregation of context into hidden tokens. The authors demonstrate that, for sequences longer than the attention head dimension, attention weights are not identifiable. They also propose effective attention as a complementary tool for improving explanatory interpretations based on attention. Overall, this work shows that self-attention distributions are not directly interpretable and presents tools to better understand and further investigate Transformer models.
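The non-identifiability claim can be illustrated with a toy head: when the sequence is longer than the head dimension, some change to the attention weights leaves the head output untouched, so the output alone cannot tell the two attention distributions apart. The sketch below assumes a sequence of 4 tokens and a head dimension of 2; the value vectors and the perturbation are hand-picked for illustration and simplify the paper's setting.

```python
import math
from typing import List


def attend(weights: List[float], values: List[List[float]]) -> List[float]:
    """Weighted sum of value vectors: one row of attention applied to the values."""
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


# Four tokens, head dimension two (sequence length > head dimension).
values = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 0.0],
    [0.0, 1.0],
]

attention_a = [0.25, 0.25, 0.25, 0.25]

# Hand-picked direction that sums to zero and is orthogonal to every value column,
# so adding it keeps the weights a valid distribution and does not change the output.
null_direction = [1.0, 1.0, -1.0, -1.0]
attention_b = [a + 0.1 * n for a, n in zip(attention_a, null_direction)]

output_a = attend(attention_a, values)
output_b = attend(attention_b, values)

print("attention A:", attention_a, "->", output_a)
print("attention B:", attention_b, "->", output_b)
```

The two attention rows differ noticeably, yet they produce the same head output up to rounding, which is the sense in which raw attention weights are not uniquely determined.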

Check the paper here.

Neural Tangents

Neural Tangents is a library designed to enable research into infinite-width neural networks. It provides a high-level API for specifying complex and hierarchical neural network architectures. Neural Tangents provides tools to study gradient descent training dynamics of wide but finite networks in either function space or weight space.

Check the paper here.

Generalization through Memorization

The authors in this work introduce kNN-LM, which extends a pre-trained neural language model by linearly interpolating it with a k-nearest neighbors model. They show that this approach scales efficiently to larger training sets and allows for effective domain adaptation simply by varying the nearest neighbor datastore, without further training. The model is particularly helpful in predicting rare patterns, and the results strongly suggest that learning similarity between sequences of text is easier than predicting the next word.
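In the paper's nearest-neighbor language model (kNN-LM), the next-word distribution is a linear interpolation between the language model's prediction and a distribution built from the nearest neighbors retrieved from the datastore. Here is a minimal sketch of that final step; the distances, tokens, probabilities, and interpolation weight are made-up values, and the retrieval itself is not shown.

```python
import math
from typing import Dict, List, Tuple


def knn_distribution(neighbors: List[Tuple[float, str]]) -> Dict[str, float]:
    """Softmax over negative distances, summed per target token."""
    weights = [math.exp(-distance) for distance, _ in neighbors]
    total = sum(weights)
    probs: Dict[str, float] = {}
    for weight, (_, token) in zip(weights, neighbors):
        probs[token] = probs.get(token, 0.0) + weight / total
    return probs


def interpolate(p_lm: Dict[str, float], p_knn: Dict[str, float], lam: float) -> Dict[str, float]:
    """p(w) = lam * p_kNN(w) + (1 - lam) * p_LM(w)."""
    vocab = set(p_lm) | set(p_knn)
    return {t: lam * p_knn.get(t, 0.0) + (1.0 - lam) * p_lm.get(t, 0.0) for t in vocab}


# Made-up language model prediction for the next word.
p_lm = {"Hawaii": 0.2, "Illinois": 0.2, "Paris": 0.6}

# Made-up retrieved neighbors: (distance to the current context, word that followed it).
neighbors = [(1.0, "Hawaii"), (1.5, "Hawaii"), (3.0, "Illinois")]

p_knn = knn_distribution(neighbors)
p_final = interpolate(p_lm, p_knn, lam=0.25)

for token in sorted(p_final):
    print(f"{token:9s} LM={p_lm.get(token, 0.0):.3f}  final={p_final[token]:.3f}")
```

Because the datastore only feeds the kNN term, swapping it for one built from a different corpus changes the predictions without retraining the language model.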

Check the paper here.

There are other interesting works as well, but covering them is beyond the scope of this article. Please check the full list of papers here.

Ram Sagar
I have a master's degree in Robotics and I write about machine learning advancements.
