InCoder, a unified generative model that can perform both program synthesis and editing, is the result of a collaboration between Facebook AI Research, the University of Washington, UC Berkeley, TTI-Chicago, and Carnegie Mellon. The 6.7-billion-parameter decoder-only Transformer model can both extend code left-to-right and insert/infill it.
Large language models trained on huge code repositories generate code left-to-right, which prevents their direct application to many ubiquitous code editing tasks, such as fixing bugs, adding comments, or renaming variables. InCoder is trained to maximise the likelihood of a corpus of code and can infill blocks of code conditioned on arbitrary left and right contexts.
InCoder is trained on a large corpus of permissively licensed code in which spans of code are randomly masked and moved to the end of each file, permitting code infilling with bidirectional context. The team claims InCoder is the first large generative code model able to infill arbitrary regions of code. InCoder learns to infill by randomly replacing spans of code with a sentinel token and moving them to the end of the sequence; the model is trained to predict all tokens in the resulting sequence. At inference time, code is edited by replacing spans with sentinel tokens: prompting the model with the new sequence triggers it to generate tokens that replace the masked spans. The model thus follows a unified approach to both program synthesis (via left-to-right generation) and editing (via infilling).
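The span-to-end transformation described above can be sketched roughly as follows. This is an illustrative simplification; the sentinel and end-of-span token names here are hypothetical, not InCoder's actual vocabulary.

```python
def causal_mask_document(tokens, span, sentinel="<MASK:0>", eos="<EOS>"):
    """Replace a contiguous span with a sentinel token and move the span's
    contents to the end of the sequence, so a left-to-right model can be
    trained to predict the span conditioned on both its left and right
    context. Token names are hypothetical placeholders."""
    start, end = span
    # Left context, sentinel marking the hole, then right context.
    masked = tokens[:start] + [sentinel] + tokens[end:]
    # The removed span is appended after a repeated sentinel, so generating
    # it happens after the model has seen the full bidirectional context.
    return masked + [sentinel] + tokens[start:end] + [eos]

doc = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"]
print(causal_mask_document(doc, (8, 12)))
# → ['def', 'add', '(', 'a', ',', 'b', ')', ':', '<MASK:0>',
#    '<MASK:0>', 'return', 'a', '+', 'b', '<EOS>']
```

At inference time the same format is reused: the user supplies the left context, sentinel, and right context, and the model continues the sequence, producing the infill.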
Causal masking procedure
Neural models for generating code typically use either a left-to-right (causal) autoregressive language modelling objective or a masked language modelling objective. The former conditions only on context to the left of the generated tokens, which prevents infilling, but can autoregressively generate entire documents. The latter conditions on both the left and right contexts and so can infill a masked region, but its training objective is limited to generating only a small part of each document. The InCoder team proposes a causal masking objective that combines the strengths of both causal and masked language models.
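The left-only versus bidirectional distinction can be made concrete with the attention masks the two objectives imply. A minimal sketch (plain boolean matrices, not any particular framework's API):

```python
def causal_attention_mask(n):
    """Causal LM: position i may attend only to positions j <= i,
    i.e. itself and its left context."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_attention_mask(n):
    """Masked LM: every position may attend to every position,
    left and right, which is what makes infilling possible."""
    return [[True] * n for _ in range(n)]

# For a 3-token sequence, the causal mask is lower-triangular:
for row in causal_attention_mask(3):
    print([int(x) for x in row])
# → [1, 0, 0]
#   [1, 1, 0]
#   [1, 1, 1]
```

The causal masking objective keeps the causal (lower-triangular) mask but rearranges the training sequence, as described above, so that right context still precedes the tokens being infilled.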
The causal masking procedure samples several spans of contiguous tokens in each document to mask. The length of each span is sampled uniformly up to the length of the document; if the sampled spans overlap, the whole set is rejected and resampled.
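A simplified sketch of this sampling-with-rejection step follows; the span count and the exact length distribution are assumptions made for illustration.

```python
import random

def sample_spans(doc_len, n_spans, seed=0):
    """Sample n_spans contiguous spans over a document of doc_len tokens.
    If any two sampled spans overlap, reject the whole set and resample,
    as in the procedure described above (simplified sketch)."""
    rng = random.Random(seed)
    while True:
        spans = []
        for _ in range(n_spans):
            length = rng.randint(1, doc_len)          # uniform up to doc length
            start = rng.randint(0, doc_len - length)  # span must fit in the doc
            spans.append((start, start + length))
        spans.sort()
        # Accept only if no span overlaps the next one.
        if all(a[1] <= b[0] for a, b in zip(spans, spans[1:])):
            return spans

print(sample_spans(20, 2))  # e.g. two disjoint (start, end) pairs
```

Rejection sampling keeps the implementation simple at the cost of occasionally discarding a draw; because short spans are likely enough, the loop terminates quickly in practice.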
The models are trained on two major corpora:
(1) Public code
(2) StackOverflow questions, answers, and comments.
InCoder is built on fairseq, a sequence modelling toolkit for training custom models for translation, summarisation, and other text generation tasks. For InCoder, it was used to improve memory efficiency by fully sharding model states. The toolkit supports multi-GPU training on one or several machines and fast beam-search generation on both CPU and GPU. InCoder was trained on 248 V100 GPUs for 24 days, using the causal masking implementation in fairseq with PyTorch as the underlying library. Further training details include a per-GPU batch size of eight and a maximum sequence length of 2,048 tokens.
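From the reported settings, a back-of-the-envelope token throughput per optimiser step can be worked out, assuming every sequence is padded or packed to the maximum length and no gradient accumulation is used (both assumptions):

```python
# Reported training settings (from the figures above).
per_gpu_batch = 8      # sequences per GPU per step
max_seq_len = 2048     # maximum tokens per sequence
num_gpus = 248         # V100 GPUs

# Assumes full-length sequences and no gradient accumulation.
tokens_per_gpu = per_gpu_batch * max_seq_len
tokens_per_step = tokens_per_gpu * num_gpus
print(tokens_per_gpu, tokens_per_step)  # → 16384 4063232
```

That is roughly four million tokens per synchronous update across the cluster.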
Using the causal masking objective while training a generative model of code yields strong zero-shot performance on many challenging and practical code infilling and editing tasks, the team said. In ablation and comparison experiments, the model showed performance comparable to similarly resourced models on standard left-to-right language-to-code synthesis benchmarks. The team also evaluated the model zero-shot on more complex tasks such as type inference, comment generation, and variable renaming, where its improved performance demonstrates its ability to condition on bidirectional context.
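Each of these zero-shot tasks can be cast as an infilling prompt: the answer region is replaced by a sentinel, and the model generates the missing text. A minimal sketch of the prompt format, with a comment-generation example; the sentinel token name and prompt layout are hypothetical illustrations, not InCoder's exact format.

```python
def infill_prompt(left, right, sentinel="<MASK:0>"):
    """Build a zero-shot infilling prompt: left context, a sentinel where
    the answer belongs, the right context, then the sentinel again to cue
    the model to generate the masked region. Token name is hypothetical."""
    return f"{left}{sentinel}{right}{sentinel}"

# Comment generation as infilling: mask the docstring position, so the
# model must produce a description conditioned on the code that follows.
prompt = infill_prompt(
    'def add(a, b):\n    """',
    '"""\n    return a + b\n',
)
print(prompt)
```

The same template covers type inference (mask a type annotation) and variable renaming (mask occurrences of a name), which is why a single infilling model handles all three without task-specific training.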