Meta’s answer to GitHub Copilot – InCoder

InCoder is trained to maximise the likelihood of a corpus of code and can infill blocks of code conditioned on arbitrary left and right context.


Published on April 25, 2022

by Avi Gopani

InCoder, a unified generative model that can perform both program synthesis and editing, is the result of a collaboration between Facebook AI Research, the University of Washington, UC Berkeley, TTI-Chicago and Carnegie Mellon University. The 6.7-billion-parameter decoder-only Transformer model can both extend and insert/infill code.

Large language models trained on huge code repositories typically generate code left-to-right, which limits their direct application to many ubiquitous code editing tasks, such as fixing bugs, adding comments, or renaming variables. InCoder is trained to maximise the likelihood of a corpus of code and can infill blocks of code conditioned on arbitrary left and right context.

InCoder

InCoder is trained to generate code files from a large corpus of permissively licensed code, in which parts of the code are randomly masked and moved to the end of each file, permitting code infilling with bidirectional context. The team claims InCoder is the first large generative code model that can infill arbitrary code regions. During training, spans of code are randomly replaced with a sentinel token and moved to the end of the sequence, and the model is trained to predict all tokens in the complete sequence. During inference, code is edited by replacing spans with sentinel tokens and prompting the model with the new sequence, which triggers it to generate new tokens to replace the masked spans. The model thus follows a unified approach for both program synthesis (via left-to-right generation) and editing (via infilling).
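The span-moving transformation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the team's implementation: the function names, the sentinel string `<mask:0>` and the end-of-mask marker `<eom>` are placeholders chosen for this example, and the sketch handles a single span.

```python
def mask_span(tokens, start, end, sentinel="<mask:0>", eom="<eom>"):
    """Replace tokens[start:end] with a sentinel and move the span to the end."""
    if not 0 <= start < end <= len(tokens):
        raise ValueError("span must be non-empty and inside the document")
    span = tokens[start:end]
    # Left context, sentinel placeholder, right context ...
    masked = tokens[:start] + [sentinel] + tokens[end:]
    # ... followed by the sentinel again, the hidden span, and an end marker.
    return masked + [sentinel] + span + [eom]


def infill_prompt(left, right, sentinel="<mask:0>"):
    """Build an inference-time prompt; a model would continue after the last sentinel."""
    return left + [sentinel] + right + [sentinel]


def splice(left, right, generated, eom="<eom>"):
    """Insert the generated tokens (cut at the end marker) between left and right."""
    if eom in generated:
        generated = generated[:generated.index(eom)]
    return left + generated + right


tokens = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"]
print(mask_span(tokens, 8, 12))          # training-style sequence
print(infill_prompt(tokens[:8], []))     # prompt asking for the function body
```

Because the hidden span sits at the end of the sequence, an ordinary left-to-right model can predict it after reading both the code before and the code after the gap.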

Causal masking procedure

Neural models for code generation have typically used either a left-to-right (causal) autoregressive language modeling objective or a masked language modeling objective. The former conditions only on context to the left of the generated tokens, which prevents infilling, but can autoregressively generate entire documents. The latter conditions on both left and right context, so it can infill a masked region, but its training objective is limited to generating only a small part of the document. The InCoder team proposes a causal masking objective that combines the strengths of both causal and masked language models.
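To make the difference concrete, the toy Python sketch below shows which tokens each kind of objective can condition on. The helper names are invented for this illustration and do not come from the InCoder code.

```python
def causal_context(tokens, position):
    """Context a left-to-right model sees when predicting tokens[position]."""
    return tokens[:position]


def bidirectional_context(tokens, start, end):
    """Left and right context a masked model sees when filling tokens[start:end]."""
    return tokens[:start], tokens[end:]


tokens = ["x", "=", "compute", "(", "y", ")", ";", "print", "(", "x", ")"]

# Predicting "compute" left-to-right: only "x" and "=" are visible.
print(causal_context(tokens, 2))

# Filling the span "compute ( y )": the later use of x is visible as well.
print(bidirectional_context(tokens, 2, 6))
```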

The causal masking procedure samples several spans of contiguous tokens in each document to mask. The length of each span is sampled uniformly from the length of the document. If any of the sampled spans overlap, the whole set is rejected and resampled.
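A rough Python sketch of this sample-and-reject loop follows. Several details are assumptions made for illustration: the number of spans is passed in as a parameter, span lengths are drawn uniformly from 1 to the document length, start positions are drawn uniformly among those that keep the span in bounds, and a retry cap guards against endless loops.

```python
import random


def spans_overlap(spans):
    """Return True if any two half-open (start, end) spans share a token."""
    ordered = sorted(spans)
    return any(prev[1] > cur[0] for prev, cur in zip(ordered, ordered[1:]))


def sample_spans(doc_len, num_spans, rng, max_tries=1000):
    """Sample num_spans spans, resampling the whole set whenever any overlap."""
    if num_spans < 1 or doc_len < num_spans:
        raise ValueError("document too short for the requested number of spans")
    for _ in range(max_tries):
        spans = []
        for _ in range(num_spans):
            length = rng.randint(1, doc_len)          # uniform over 1..doc_len
            start = rng.randint(0, doc_len - length)  # keeps the span in bounds
            spans.append((start, start + length))
        if not spans_overlap(spans):
            return sorted(spans)
    raise RuntimeError("could not sample non-overlapping spans")


rng = random.Random(0)
print(sample_spans(doc_len=50, num_spans=2, rng=rng))
```

Rejecting the whole set, rather than patching up individual spans, keeps the sampling rule simple at the cost of occasional retries.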

InCoder training

The models are trained on two major corpora:

(1) Public code

(2) StackOverflow questions, answers, and comments.

The training data focuses heavily on Python but includes code files from 28 languages in total, along with all of StackOverflow’s content. After filtering and deduplication, the corpus comprised 159 GB of code, 52 GB of it in Python, plus 57 GB of content from StackOverflow. The code component draws on code files and repository metadata from GitHub and GitLab, with 670,000 public non-fork repositories detected as containing Python, JavaScript, or Jupyter Notebook files. It also includes code from the 28 languages, and text and code preprocessed from Jupyter notebooks. The second component of the corpus consists of questions, answers, and comments from StackOverflow.

Fairseq architecture

InCoder was trained using Fairseq, a sequence modelling toolkit for training custom models for translation, summarisation, and other text generation tasks. For InCoder, it was used to improve memory efficiency through fully sharding model states. The toolkit supports multi-GPU training on one machine or across multiple machines, and fast beam search generation on both CPU and GPU. InCoder was trained on 248 V100 GPUs for 24 days. The team used the causal masking implementation in Fairseq, with PyTorch as the underlying library. Training used a per-GPU batch size of 8 and a maximum token sequence length of 2048.
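The reported figures give a sense of scale with some back-of-the-envelope arithmetic, shown here in Python. The calculation assumes that every GPU processes its own batch on each step, with no gradient accumulation and with every sequence filled to the maximum length, none of which the article confirms, so the token count is best read as an upper bound per step.

```python
# Figures reported for InCoder training.
num_gpus = 248
per_gpu_batch = 8
max_seq_len = 2048
training_days = 24

# Assumption: one batch per GPU per step, no gradient accumulation.
sequences_per_step = num_gpus * per_gpu_batch
max_tokens_per_step = sequences_per_step * max_seq_len
gpu_days = num_gpus * training_days

print(sequences_per_step)   # 1984
print(max_tokens_per_step)  # 4063232
print(gpu_days)             # 5952
```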

Model performance

Using a causal masking objective while training a generative model of code results in strong zero-shot performance on many challenging and practical code infilling and editing tasks, the team said. In ablation and comparison experiments, the model showed performance comparable to similarly resourced models on standard left-to-right language-to-code synthesis benchmarks. The team also evaluated the model in a zero-shot setting on tasks such as type inference, comment generation, and variable renaming. The model’s ability to condition on bidirectional context improved its performance on these tasks.


Avi Gopani

Avi Gopani is a technology journalist at Analytics India Magazine who seeks to analyse industry trends and developments from an interdisciplinary perspective. Her articles chronicle cultural, political and social stories, curated with a focus on the evolving technologies of artificial intelligence and data analytics.