
What’s Generative Insertion Transformer?

GIT is pre-trained using the BERT encoder and KERMIT objective on an unsupervised LM task.


Continuous annotation of user data is a challenge when deploying natural language understanding (NLU) techniques at scale in commercial applications. Models must be re-trained and updated to keep performance at an optimal level, but the process is expensive, labour-intensive, and time-consuming. Furthermore, with rising concerns around privacy, the manual review of user data needed for annotation is not ideal.

Researchers at Amazon and the University of Massachusetts Lowell have proposed a generative model to produce labelled synthetic data. The idea is to improve model robustness and performance by generating synthetic utterances and augmenting the original training data.

Synthetic augmentation with GIT

The Generative Insertion Transformer (GIT) is a non-autoregressive insertion transformer model extended to solve the inverse NLU problem: producing valid labelled utterances that match the annotation given in a template.

Source: amazon.science

In this generative model, the decoder generates a sequence by inserting tokens between previously generated tokens. Carrier tokens are inserted between the labels in the template iteratively. The insertion at each position in the utterance is independent of every other position, and the process stops when the EOS token is generated at all positions, yielding a fully annotated synthetic utterance that can be used directly to augment real data for model building.
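A minimal sketch of this insertion loop is shown below; `predict_insertions` is a hypothetical stand-in for GIT's decoder that returns one proposal per insertion slot (EOS meaning "insert nothing" at that slot):

```python
# Illustrative sketch of insertion-based decoding (not the authors' code).
# `model.predict_insertions` is a hypothetical interface returning one token
# proposal per insertion slot: before, between, and after the current tokens.
EOS = "<eos>"

def insertion_decode(model, template_tokens, max_rounds=20):
    seq = list(template_tokens)  # start from the labelled template
    for _ in range(max_rounds):
        proposals = model.predict_insertions(seq)  # len(seq) + 1 proposals
        if all(tok == EOS for tok in proposals):
            break  # EOS at every slot: the utterance is complete
        new_seq = []
        for i, tok in enumerate(proposals):
            if tok != EOS:
                new_seq.append(tok)      # carrier token inserted at slot i
            if i < len(seq):
                new_seq.append(seq[i])   # keep the existing token
        seq = new_seq
    return seq
```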

The process can be divided into three sections:

Pretraining: GIT is pre-trained using the BERT encoder and the KERMIT objective on an unsupervised language-modelling task: given a sentence with some tokens masked out, GIT is trained to insert the missing tokens. Two pre-training configurations were tested (a sketch of this insertion objective follows the list below):

  1. Pre-training using only English Wikipedia
  2. Pre-training using an internal corpus of 800M unlabeled utterances randomly sampled from de-identified Alexa requests, using English Wikipedia pre-trained models as initialization.
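As a rough illustration of how such a masked-insertion training example could be constructed, the sketch below drops random tokens from a sentence and records which token must be re-inserted at which slot; the masking scheme and data format here are assumptions, not the authors' implementation:

```python
import random

def make_insertion_example(tokens, drop_prob=0.5, seed=None):
    """Drop random tokens; targets record (slot_index, token) pairs to re-insert."""
    rng = random.Random(seed)
    kept, targets = [], []
    for tok in tokens:
        if rng.random() < drop_prob:
            # The dropped token must be re-inserted at the slot after the last kept token.
            targets.append((len(kept), tok))
        else:
            kept.append(tok)
    return kept, targets

partial, targets = make_insertion_example("play jazz music in the kitchen".split(), seed=0)
# `partial` is the masked input sequence; `targets` lists the insertions to learn.
```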

Fine-tuning: The pre-trained GIT model is then fine-tuned for each domain using annotated real data. For each utterance, a template is provided as model input and the complete utterance as output. During training, each insertion slot can have multiple candidate tokens from the ground truth, unlike autoregressive generation, which entails a single token per generation step. The ground-truth distribution sets non-candidate token probabilities to 0 and weights all candidate token probabilities uniformly.
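In other words, the training target at each slot is a uniform distribution over the valid candidate tokens. A minimal sketch (the vocabulary size and token ids below are arbitrary):

```python
import numpy as np

def target_distribution(candidate_ids, vocab_size):
    """Uniform probability over ground-truth candidate tokens; zero everywhere else."""
    dist = np.zeros(vocab_size)
    dist[list(candidate_ids)] = 1.0 / len(candidate_ids)
    return dist

# Example: at one insertion slot, tokens with ids 17, 342 and 905 are all valid.
dist = target_distribution([17, 342, 905], vocab_size=30522)
assert abs(dist.sum() - 1.0) < 1e-9
```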

Generation: To generate synthetic data for NLU, a template is constructed that contains the desired intent, slot types, and slot values for the synthetic example. This priming sequence is provided as an input to the decoder, which inserts carrier tokens in an iterative manner to form a coherent utterance. The generation process addresses both the label projection and entity control challenges. Templates used in inference are constructed from the reduced real data.
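The exact template serialization is not spelled out here, so the sketch below uses an assumed tag-style format to show how an intent and slot values could be assembled into a priming sequence for the decoder (one that could, for instance, be fed to the `insertion_decode` sketch above):

```python
def build_template(intent, slots):
    """Serialize an intent and (slot_type, slot_value) pairs into a priming sequence.
    The tag format here is an illustrative assumption, not the authors' format."""
    parts = [f"<intent:{intent}>"]
    for slot_type, value in slots:
        parts.append(f"<{slot_type}>")
        parts.extend(value.split())
        parts.append(f"</{slot_type}>")
    return parts

template = build_template("PlayMusic", [("genre", "jazz"), ("location", "kitchen")])
# The decoder then inserts carrier tokens (e.g. "play", "in", "the") around the
# labelled slot values until EOS is predicted at every slot.
```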

Performance

To study the effectiveness of synthetically generated data, NLU model performance was evaluated in a reduced-data regime. For each domain, multiple intent classification and named entity recognition (IC-NER) models were built using all real data, a reduced set of real data, and a combination of the reduced real data and synthetic data. All models within a domain share the same training hyper-parameters, including architecture and encoder; they differ only in training-data composition.
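A minimal sketch of that comparison setup, where the reduction fraction and data handling are illustrative assumptions rather than the paper's exact protocol:

```python
def build_training_sets(real_data, synthetic_data, reduced_fraction=0.33):
    """Three training-data compositions compared per domain (illustrative only)."""
    reduced_real = real_data[: int(len(real_data) * reduced_fraction)]
    return {
        "full_real": real_data,
        "reduced_real": reduced_real,
        "reduced_real_plus_synthetic": reduced_real + synthetic_data,
    }
```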

Conclusion


The researchers demonstrated data augmentation (DA) using GIT as a feasible data generation technique to mitigate reduced annotation volumes for IC and NER tasks. NLU models trained on 33% real data plus synthetic data performed on par with models trained on the full real data. Further, on the domains with the highest semantic error rate (SemER) regressions, the quality of synthetic data was improved by filtering it with model confidence scores. Among domains that benefit from synthetic data, appropriate carrier-token insertion enhanced the utterances' semantics and their value as training samples. Future work points to data generation with entities replaced through knowledge-base sampling; such finer control over entities supports new feature expansion and enhances customer privacy.
