Meta’s answer to GitHub Copilot – InCoder

InCoder is trained to maximise the likelihood of a corpus of code and can infill blocks of code conditioned on arbitrary left and right context.

InCoder, a unified generative model that can perform both program synthesis and editing, is the result of a collaboration between Facebook AI Research, the University of Washington, UC Berkeley, TTI-Chicago and Carnegie Mellon. The 6.7-billion-parameter decoder-only Transformer can both extend code and insert (infill) it.

Large language models trained on huge code repositories generate code left-to-right, which limits their direct application to many ubiquitous code editing tasks, like fixing bugs, adding comments, or renaming variables. InCoder, by contrast, is trained to maximise the likelihood of a corpus of code and can infill blocks of code conditioned on arbitrary left and right context.

InCoder

InCoder is trained to generate code files from a large corpus of permissively licensed code, in which parts of the code are randomly masked and moved to the end of each file, permitting code infilling with bidirectional context. The team claims InCoder is the first large generative code model that can infill arbitrary regions of code. It learns to infill by randomly replacing spans of code with a sentinel token and moving them to the end of the sequence, and it is trained to predict all tokens in the resulting sequence. During inference, code is edited by replacing spans with sentinel tokens; prompting the model with the new sequence triggers it to generate new tokens to fill the masked spans. The model thus follows a unified approach for both program synthesis (via left-to-right generation) and editing (via infilling).
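The span-moving transformation described above can be sketched in a few lines of Python. This is a minimal illustration, not the team's implementation: the sentinel spelling `<MASK:i>`, the end-of-mask marker `<EOM>`, and the function names `causal_mask` and `infill_prompt` are all assumptions made for the example, and whitespace-split words stand in for real tokens.

```python
from typing import List, Tuple

EOM = "<EOM>"  # hypothetical end-of-mask marker (not taken from the article)


def sentinel(i: int) -> str:
    # Hypothetical sentinel spelling; the real token strings are not given here.
    return f"<MASK:{i}>"


def causal_mask(tokens: List[str], spans: List[Tuple[int, int]]) -> List[str]:
    """Replace each [start, end) span with a sentinel and move the span to the end."""
    body: List[str] = []
    tail: List[str] = []
    cursor = 0
    for i, (start, end) in enumerate(sorted(spans)):
        if start < cursor or start >= end or end > len(tokens):
            raise ValueError("spans must be non-empty, in range, and non-overlapping")
        body.extend(tokens[cursor:start])   # untouched text before the span
        body.append(sentinel(i))            # placeholder where the span used to be
        tail.append(sentinel(i))            # same sentinel announces the moved span
        tail.extend(tokens[start:end])
        tail.append(EOM)
        cursor = end
    body.extend(tokens[cursor:])            # untouched text after the last span
    return body + tail


def infill_prompt(left: List[str], right: List[str]) -> List[str]:
    """Inference-time prompt for a single gap: the model continues after the final sentinel."""
    return left + [sentinel(0)] + right + [sentinel(0)]


tokens = "def add ( a , b ) : return a + b".split()
masked = causal_mask(tokens, [(3, 6)])
print(" ".join(masked))
# def add ( <MASK:0> ) : return a + b <MASK:0> a , b <EOM>
```

Because the moved span sits after both its left and right context, an ordinary left-to-right model sees the whole surrounding file before it has to predict the missing piece. At inference, `infill_prompt` builds the same layout with the span left for the model to generate.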

Causal masking procedure

Neural models for code generation typically use either a left-to-right (causal) autoregressive language modeling objective or a masked language modeling objective. The former conditions only on context to the left of the generated tokens, which prevents infilling, but can autoregressively generate entire documents. The latter conditions on both left and right context, so it can infill a masked region, but its training objective is limited to generating only a small part of the document. The InCoder team uses a causal masking objective that aims to combine the strengths of both causal and masked language models. 

The causal masking procedure samples several spans of contiguous tokens in each document to mask. The length of each span is sampled uniformly from the length of the document. If any of the sampled spans overlap, the whole set is rejected and resampled.
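A rough sketch of this rejection-sampling step follows. Several details are assumptions rather than facts from the article: the number of spans is passed in as `num_spans` because the article does not say how it is chosen, each start position is drawn uniformly from the valid range, and `max_tries` is a safety cap added for the example.

```python
import random
from typing import List, Tuple


def sample_spans(doc_len: int, num_spans: int, rng: random.Random,
                 max_tries: int = 10_000) -> List[Tuple[int, int]]:
    """Sample `num_spans` non-overlapping [start, end) spans by rejection sampling."""
    if num_spans < 1 or doc_len < num_spans:
        raise ValueError("need 1 <= num_spans <= doc_len")
    for _ in range(max_tries):
        spans = []
        for _ in range(num_spans):
            # Span length drawn uniformly from 1..doc_len (one reading of the text).
            length = rng.randint(1, doc_len)
            # Assumption: the start is uniform over positions where the span fits.
            start = rng.randint(0, doc_len - length)
            spans.append((start, start + length))
        spans.sort()
        # Reject the whole set if any neighbouring pair overlaps, then resample.
        if all(prev[1] <= nxt[0] for prev, nxt in zip(spans, spans[1:])):
            return spans
    raise RuntimeError("no non-overlapping set found within max_tries")


spans = sample_spans(doc_len=100, num_spans=2, rng=random.Random(0))
print(spans)
```

Rejecting the entire set, rather than nudging individual spans until they fit, keeps each accepted set a draw from the same distribution conditioned on the spans being disjoint.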

InCoder training

The models are trained on two major corpora: 

(1) Public code 

(2) StackOverflow questions, answers, and comments.

The training data leans heavily on Python but includes code files from 28 languages in total, along with all of StackOverflow’s content. The code component was drawn from code files and repository metadata on GitHub and GitLab, where 670,000 public non-fork repositories were detected to contain Python, JavaScript, or Jupyter Notebook files, and it also includes text and code preprocessed from Jupyter notebooks. The second component of the corpus consists of questions, answers, and comments from StackOverflow. After filtering and deduplication, the corpus contained 159 GB of code, 52 GB of it Python, plus 57 GB of content from StackOverflow. 

Fairseq architecture

InCoder was built with Fairseq, a sequence modelling toolkit for training custom models for translation, summarisation, and other text generation tasks. The toolkit supports multi-GPU training on one machine or across several, and fast beam search generation on both CPU and GPU. For InCoder, it was used to improve memory efficiency by fully sharding model states. The team used the causal masking implementation in Fairseq, with PyTorch as the underlying library. InCoder was trained on 248 V100 GPUs for 24 days, with a per-GPU batch size of 8 and a maximum sequence length of 2,048 tokens. 
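To give a sense of scale, the figures above can be multiplied out. The arithmetic below is a back-of-the-envelope estimate under stated assumptions: no gradient accumulation, every sequence filled to the maximum length, and all GPUs busy for the full 24 days. None of these is confirmed by the article, so the token count is an upper bound per optimisation step rather than a reported number.

```python
# Figures quoted in the article.
NUM_GPUS = 248
PER_GPU_BATCH = 8
MAX_SEQ_LEN = 2048
TRAIN_DAYS = 24

# Assumption: one forward/backward pass per optimiser step (no gradient accumulation).
sequences_per_step = NUM_GPUS * PER_GPU_BATCH

# Assumption: every sequence is packed to the maximum length, so this is an upper bound.
max_tokens_per_step = sequences_per_step * MAX_SEQ_LEN

# Assumption: all GPUs were in use for the whole run.
gpu_days = NUM_GPUS * TRAIN_DAYS

print(sequences_per_step)   # 1984
print(max_tokens_per_step)  # 4063232
print(gpu_days)             # 5952
```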

Model performance

Training a generative model of code with a causal masking objective results in strong zero-shot performance on many challenging and practical code infilling and editing tasks, the team said. In ablation and comparison experiments, the model performed comparably to similarly resourced models on standard left-to-right language-to-code synthesis benchmarks. The team also evaluated the model in a zero-shot setting on tasks such as type inference, comment generation, and variable renaming, where the ability to condition on bidirectional context improved performance. 

Avi Gopani
Avi Gopani is a technology journalist that seeks to analyse industry trends and developments from an interdisciplinary perspective at Analytics India Magazine. Her articles chronicle cultural, political and social stories that are curated with a focus on the evolving technologies of artificial intelligence and data analytics.