With the aim of breaking down language barriers so that anyone across the globe can understand and communicate with anyone else, researchers at Facebook AI Research (FAIR) work on complex problems to deploy robust language translation solutions. Their work spans topics such as deep learning, natural language processing, text normalisation, word sense disambiguation and much more.
Recently, the researchers at Facebook AI Research presented a new language modelling approach known as kNN-LM, which is based on the hypothesis that the representation learning problem may be easier than the prediction problem. A neural language model typically solves two subproblems: mapping sentence prefixes to fixed-sized representations, and using those representations to predict the next word in the text.
This approach extends a pre-trained LM by linearly interpolating its next-word distribution with a k-nearest neighbours (kNN) model. The nearest neighbours are computed according to distance in the pre-trained embedding space and can be drawn from any text collection, including the original language model (LM) training data. According to the researchers, this approach allows rare patterns to be memorised explicitly, rather than implicitly in the model parameters.
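The interpolation step itself is simple to sketch. In the toy example below, `p_lm` and `p_knn` are made-up distributions over a four-word vocabulary, and the interpolation weight `lam` is a hypothetical value (in practice it would be tuned on held-out data):

```python
import numpy as np

def interpolate(p_lm, p_knn, lam=0.25):
    """Linearly interpolate the base LM's next-word distribution with
    the kNN distribution: p = lam * p_knn + (1 - lam) * p_lm."""
    return lam * p_knn + (1 - lam) * p_lm

# Toy 4-word vocabulary: the LM and the kNN retrieval each produce
# a probability distribution over the next word.
p_lm = np.array([0.5, 0.3, 0.1, 0.1])
p_knn = np.array([0.0, 0.9, 0.1, 0.0])  # mass concentrated on retrieved neighbours
p = interpolate(p_lm, p_knn, lam=0.25)
```

Because both inputs are valid distributions and the weights sum to one, the result is again a valid distribution, with extra mass shifted onto words that appeared after similar contexts in the datastore.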
How It Works
The kNN-LM augments a pre-trained LM with a nearest-neighbours retrieval mechanism, without any additional training, which means the representations learned by the LM remain unchanged throughout the process.
One crucial point of the kNN-LM approach is that it is compatible with any model that produces fixed-size context representations. The researchers used decoder-only Transformers for language modelling, and since kNN-LM makes no changes to the underlying LM, the same architecture was used unchanged for inference.
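The retrieval mechanism can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `embed` function, the bag-of-words toy embedding, and the exponential distance weighting are stand-ins (the actual work uses Transformer context embeddings and a large-scale nearest-neighbour index):

```python
import numpy as np

def build_datastore(contexts, next_tokens, embed):
    # One (key, value) pair per training example: key = the fixed-size
    # embedding of the context prefix, value = the word that followed it.
    keys = np.stack([embed(c) for c in contexts])
    values = np.array(next_tokens)
    return keys, values

def knn_distribution(query, keys, values, vocab_size, k=3):
    # Retrieve the k nearest keys by L2 distance in the embedding space,
    # softmax the negative distances, and aggregate the mass per token.
    dists = np.linalg.norm(keys - query, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = np.exp(-dists[nearest])
    weights /= weights.sum()
    p_knn = np.zeros(vocab_size)
    for idx, w in zip(nearest, weights):
        p_knn[values[idx]] += w
    return p_knn

# Toy setup: contexts are token-id lists, embedded as bag-of-words counts.
embed = lambda c: np.bincount(c, minlength=4).astype(float)
contexts = [[0, 1], [0, 1], [2, 3], [1, 3]]
next_tokens = [2, 2, 0, 3]
keys, values = build_datastore(contexts, next_tokens, embed)
p_knn = knn_distribution(embed([0, 1]), keys, values, vocab_size=4)
```

Note that nothing here is trained: the datastore is built with a single forward pass over the corpus, and retrieval reuses the frozen embedding function, which is what makes the method a drop-in extension of an existing LM.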
According to the researchers, kNN-LM improves performance for the following three reasons:
- With an implicit notion of similarity, the Transformer LM is efficient at learning a representation function for contexts.
- While the Transformer has the capacity to memorise all the training examples, doing so causes its representations to generalise less effectively.
- The kNN-LM allows the model to memorise the training data while retaining an effective similarity function.
The researchers used several datasets for this project; they are mentioned below:
- WikiText-103, a standard benchmark for autoregressive language modelling with a 250K word-level vocabulary, consisting of 103M tokens of Wikipedia in the training set and 250K tokens in each of the dev and test sets.
- Books dataset, which is the Toronto Books Corpus.
- Wiki-3B is an English Wikipedia dataset which contains about 2.87B tokens.
- Wiki-100M is a random 100M token subset of Wiki-3B Corpus.
Advantages of This Model:
- This approach has implications for efficiently scaling up to larger training sets and allows for effective domain adaptation by simply varying the nearest-neighbour datastore, again without further training.
- The model is particularly helpful in predicting rare patterns, such as factual knowledge, names, and near-duplicate sentences from the training set.
- It also improves performance when the same training data is used for learning the prefix representations and the kNN model, strongly suggesting that the prediction problem is more challenging than previously appreciated.
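The domain-adaptation point above follows directly from the design: the LM stays frozen and only the datastore changes. The sketch below illustrates this with two toy datastores standing in for different corpora; the helper, the embeddings, and the datastore contents are all hypothetical:

```python
import numpy as np

def knn_probs(query, keys, values, vocab_size, k=2):
    # Same retrieval step as before: nearest keys by L2 distance,
    # softmax over negative distances, mass aggregated per token.
    d = np.linalg.norm(keys - query, axis=1)
    nn = np.argsort(d)[:k]
    w = np.exp(-d[nn])
    w /= w.sum()
    p = np.zeros(vocab_size)
    for i, wi in zip(nn, w):
        p[values[i]] += wi
    return p

# Two toy datastores standing in for different domains; the frozen LM's
# context embedding (query) is reused unchanged against either one.
wiki_keys, wiki_vals = np.array([[0.0, 1.0], [1.0, 0.0]]), np.array([0, 1])
news_keys, news_vals = np.array([[0.9, 0.1], [1.0, 0.2]]), np.array([2, 2])
query = np.array([1.0, 0.0])
p_wiki = knn_probs(query, wiki_keys, wiki_vals, vocab_size=3)
p_news = knn_probs(query, news_keys, news_vals, vocab_size=3)
```

Swapping the datastore changes which next words the retrieval favours, while the underlying model and its parameters are untouched.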
Over the years, the researchers at FAIR have carried out several remarkable projects related to Natural Language Processing (NLP) and Natural Language Understanding (NLU). The kNN-LM model significantly outperforms standard language models by directly querying training examples at test time, and it can be applied to any neural language model. According to the researchers, the success of this method suggests that learning similarity functions between contexts may be an easier problem than predicting the next word from a given context.
Read the paper here.