The International Conference on Learning Representations (ICLR), which concluded last week, is one of the major AI conferences held every year. This year, ICLR went fully virtual because of the demanding circumstances. Of the 2,594 submissions, 687 papers made it into ICLR 2020, a 26.5% acceptance rate.
Here are a few of the top papers at ICLR:
ALBERT: A Lite BERT
Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, it also runs into GPU/TPU memory limitations and longer training times. To address these problems, this work presents two parameter-reduction techniques that lower memory consumption and increase the training speed of BERT. The proposed methods lead to models that scale much better than the original BERT. The authors also use a self-supervised loss that focuses on modelling inter-sentence coherence and show that it consistently helps downstream tasks with multi-sentence inputs. As a result, the model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters than BERT-large.
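The released ALBERT checkpoints are straightforward to try out. Below is a minimal sketch that assumes the Hugging Face transformers package (a recent version of it), which is separate from the paper's own codebase.

```python
import torch
from transformers import AlbertModel, AlbertTokenizer

# Load a pretrained ALBERT checkpoint and run it like any BERT-style encoder.
tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
model = AlbertModel.from_pretrained("albert-base-v2")

inputs = tokenizer("ALBERT shares parameters across its layers.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768) for the base model
```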
Check the paper here.
Plug and Play Language Models
Plug and Play Language Models (PPLM) combine a pre-trained language model with one or more simple attribute classifiers that guide text generation, without any further training of the language model. The attribute models are either a user-specified bag of words or a single learned layer with 100,000 times fewer parameters than the language model. Model samples demonstrate control over sentiment styles, and extensive automated and human-annotated evaluations show attribute alignment and fluency.
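To make the idea concrete, here is a conceptual sketch of the kind of update PPLM performs, using toy tensors rather than the authors' implementation: the LM's hidden state is nudged along the gradient of the attribute model's log-likelihood (here a hypothetical bag of words) before the next token is re-sampled.

```python
import torch
import torch.nn.functional as F

# Toy stand-ins for the LM hidden state and output projection (not real GPT-2 weights).
hidden = torch.randn(1, 768, requires_grad=True)
lm_head = torch.nn.Linear(768, 50257)
bow_ids = torch.tensor([1000, 2000, 3000])  # hypothetical token ids of the bag of words

log_probs = F.log_softmax(lm_head(hidden), dim=-1)
attr_loss = -log_probs[0, bow_ids].logsumexp(dim=-1)  # make bag-of-words tokens more likely
attr_loss.backward()

step_size = 0.02
perturbed_hidden = hidden - step_size * hidden.grad  # perturbed state drives generation
```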
Check the paper here.
Meta-Learning without Memorization
Meta-learning leverages data from previous tasks to enable efficient learning of new tasks. However, most meta-learning algorithms implicitly require the meta-training tasks to be mutually exclusive, such that no single model can solve all of the tasks at once.
In this paper, the authors address this challenge by designing a meta-regularization objective, grounded in information theory, that places precedence on data-driven adaptation. By doing so, the resulting algorithm can successfully use data from non-mutually-exclusive tasks to efficiently adapt to novel tasks.
Check the paper here.
Reformer: The Efficient Transformer
This paper introduces two techniques to improve the efficiency of Transformers. First, dot-product attention is replaced with one that uses locality-sensitive hashing, which reduces its cost on long sequences. Second, the authors use reversible residual layers instead of the standard residuals, which allows activations to be stored only once during training. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.
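The sketch below illustrates the bucketing idea behind LSH attention with toy tensors; it is a deliberate simplification of the paper's scheme, not the Reformer implementation. Similar vectors are hashed into the same bucket via random rotations, and attention is then restricted to tokens within a bucket.

```python
import torch

seq_len, dim, n_buckets = 16, 8, 4
x = torch.randn(seq_len, dim)            # shared query/key vectors, as in the paper
rotation = torch.randn(dim, n_buckets // 2)

rotated = x @ rotation
buckets = torch.cat([rotated, -rotated], dim=-1).argmax(dim=-1)  # one LSH bucket per token

# Attend only within buckets, avoiding the full O(L^2) attention matrix.
for b in buckets.unique():
    idx = (buckets == b).nonzero(as_tuple=True)[0]
    scores = x[idx] @ x[idx].T / dim ** 0.5
    attn = scores.softmax(dim=-1)        # attention restricted to one bucket
```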
Check the paper here.
Understanding the Effectiveness of MAML
Model-Agnostic Meta-Learning (MAML) is a method that consists of two optimisation loops: the outer loop finds a meta-initialisation, from which the inner loop can efficiently learn new tasks. Despite MAML’s popularity, the source of its effectiveness is still questioned. The authors investigated this question via ablation studies and analysis of the latent representations. The results show that feature reuse is the dominant factor, which led to the ANIL (Almost No Inner Loop) algorithm, a simplification of MAML in which the inner loop is removed for all but the task-specific head of the underlying neural network.
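The toy sketch below shows what an ANIL-style inner loop looks like in practice, under assumed toy shapes and data rather than the authors' code: only the task-specific head is adapted, while the feature extractor is reused without inner-loop updates.

```python
import torch
import torch.nn.functional as F

features = torch.nn.Sequential(torch.nn.Linear(4, 32), torch.nn.ReLU())  # reused body
head = torch.nn.Linear(32, 2)                                            # task-specific head

x_support = torch.randn(8, 4)
y_support = torch.randint(0, 2, (8,))

fast_weights = [p.clone() for p in head.parameters()]  # adapt a copy of the head only
for _ in range(5):                                     # inner loop: a few gradient steps
    logits = F.linear(features(x_support), fast_weights[0], fast_weights[1])
    loss = F.cross_entropy(logits, y_support)
    grads = torch.autograd.grad(loss, fast_weights, create_graph=True)
    fast_weights = [w - 0.1 * g for w, g in zip(fast_weights, grads)]
# An outer loop would then backpropagate the query-set loss into both `features` and `head`.
```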
Check the paper here.
Your Classifier is Secretly an Energy-Based Model
This paper attempts to reinterpret a standard discriminative classifier as an energy-based model. In this setting, the authors write, the standard class probabilities can still be computed, while the unnormalised logits define a model of the joint distribution over inputs and labels. They demonstrated that energy-based training of the joint distribution improves calibration, robustness, and out-of-distribution detection, while also enabling the models to generate samples rivalling the quality of recent GAN approaches. This work improves upon recently proposed techniques for scaling up the training of energy-based models and is the first to achieve performance rivalling the state of the art in both generative and discriminative learning within one hybrid model.
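In code, the reinterpretation is very short. The sketch below uses toy tensors to show the core identity: the usual softmax gives p(y|x), while summing the exponentiated logits gives an unnormalised density over inputs.

```python
import torch

# Toy logits f(x) for a batch of 5 inputs and 10 classes; E(x, y) = -f(x)[y].
logits = torch.randn(5, 10)
log_p_y_given_x = logits.log_softmax(dim=-1)     # standard discriminative probabilities
unnormalised_log_p_x = logits.logsumexp(dim=-1)  # log p(x) up to the intractable log-partition
```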
Check the paper here.
Lottery Tickets in Reinforcement Learning and NLP
The authors evaluated whether “winning ticket” initializations exist in two different domains: natural language processing (NLP) and reinforcement learning (RL). For NLP, they examined both recurrent LSTM models and large-scale Transformer models; for RL, they analysed a number of discrete-action-space tasks. The results suggest that the lottery ticket hypothesis is not restricted to supervised learning of natural images, but rather represents a broader phenomenon in deep neural networks.
Check the paper here.
On Identifiability in Transformers
In this paper, the authors delved deep into the Transformer architecture by investigating two of its core components: self-attention and contextual embeddings. In particular, they examined the identifiability of attention weights and token embeddings, and the aggregation of context into hidden tokens. The authors demonstrated that, for sequences longer than the attention head dimension, attention weights are not identifiable. They also propose effective attention as a complementary tool for improving explanatory interpretations based on attention. Overall, this work shows that self-attention distributions are not directly interpretable and presents tools to better understand and further investigate Transformer models.
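The non-identifiability argument can be illustrated numerically. The toy sketch below (assumed toy sizes, and ignoring the constraint that attention rows are non-negative and sum to one, which the paper does account for) shows that when the sequence length exceeds the head dimension, different attention matrices can produce exactly the same output.

```python
import torch

L, d = 6, 3                                       # sequence length > head dimension
V = torch.randn(L, d)                             # value vectors
A = torch.softmax(torch.randn(L, L), dim=-1)      # some attention distribution

null_basis = torch.linalg.svd(V.T).Vh[d:]         # directions with v @ V == 0
perturb = null_basis[0].unsqueeze(0).repeat(L, 1)  # add a null-space direction to every row
assert torch.allclose(A @ V, (A + perturb) @ V, atol=1e-5)  # same output, different weights
```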
Check the paper here.
Neural Tangents
Neural Tangents is a library designed to enable research into infinite-width neural networks. It provides a high-level API for specifying complex and hierarchical neural network architectures, along with tools to study the gradient-descent training dynamics of wide but finite networks in either function space or weight space.
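The snippet below is a small sketch of the stax-style API described in the paper (exact argument names may differ across versions): the same network definition yields functions for finite networks as well as a kernel function for the corresponding infinite-width NNGP/NTK kernels.

```python
import numpy as np
from neural_tangents import stax

# One architecture definition gives init/apply functions and an analytic kernel function.
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512), stax.Relu(), stax.Dense(1)
)

x1 = np.random.randn(10, 32).astype(np.float32)
x2 = np.random.randn(5, 32).astype(np.float32)
ntk_kernel = kernel_fn(x1, x2, "ntk")  # closed-form Neural Tangent Kernel, shape (10, 5)
```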
Check the paper here.
Generalization through Memorization
The authors introduce kNN-LM, which augments a pre-trained language model by interpolating its next-word distribution with a k-nearest-neighbours model over a datastore of cached context representations. They show how to efficiently scale up to larger training sets and allow for effective domain adaptation by simply varying the nearest-neighbour datastore, without further training. The model is particularly helpful in predicting rare patterns, and the results strongly suggest that learning similarity between sequences of text is easier than predicting the next word.
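Below is a toy sketch of the interpolation step with made-up tensors, a simplification of the paper's setup rather than its implementation: the base LM's prediction is mixed with a distribution induced by the nearest neighbours retrieved from the datastore.

```python
import torch

vocab, k, lam = 100, 4, 0.25
p_lm = torch.softmax(torch.randn(vocab), dim=-1)  # base LM next-word distribution

# Datastore: cached context vectors and the word that followed each context.
keys, values = torch.randn(1000, 16), torch.randint(0, vocab, (1000,))
query = torch.randn(16)                            # representation of the current context

dists = ((keys - query) ** 2).sum(dim=-1)
nn_idx = dists.topk(k, largest=False).indices      # k nearest cached contexts
p_knn = torch.zeros(vocab)
p_knn.scatter_add_(0, values[nn_idx], torch.softmax(-dists[nn_idx], dim=-1))

p_final = lam * p_knn + (1 - lam) * p_lm           # interpolated next-word distribution
```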
Check the paper here.
There are other interesting works as well, which are beyond the scope of this article. Please check the full list of papers here.