
Meta’s answer to GitHub Copilot – InCoder

InCoder is trained to maximise the likelihood of a corpus of code and can infill blocks of code conditioned on arbitrary left and right context.


InCoder, a unified generative model that can perform both program synthesis and editing, is the result of a collaboration between Facebook AI Research, the University of Washington, UC Berkeley, TTI-Chicago and Carnegie Mellon. The 6.7-billion-parameter decoder-only Transformer model can both extend existing code and insert/infill code.

Large language models trained on huge code repositories generate code left-to-right, which prevents their direct application to many ubiquitous code editing tasks, such as fixing bugs, adding comments, or renaming variables. InCoder is trained to maximise the likelihood of a corpus of code and can infill blocks of code conditioned on arbitrary left and right context.

InCoder

InCoder is trained to generate code files from a large corpus of permissively licensed code in which parts of the code are randomly masked and moved to the end of each file, permitting code infilling with bidirectional context. The team claims InCoder is the first large generative code model that can infill arbitrary regions of code. InCoder learns to infill blocks by randomly replacing spans of code with a sentinel token and moving them to the end of the sequence; the model is trained to predict all tokens in this rearranged sequence. At inference time, code can be edited by replacing spans with sentinel tokens: prompting the model with the new sequence triggers it to generate replacement tokens for the masked spans (see the sketch below). The model thus follows a unified approach to both program synthesis (via left-to-right generation) and editing (via infilling).
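The span-replacement idea can be made concrete with a short sketch. The snippet below is illustrative only: the sentinel strings are placeholders for whatever special tokens the released tokenizer actually defines, and the `mask_span` helper is hypothetical, not part of InCoder's code.

```python
# Minimal sketch of the infilling transformation described above.
# The sentinel literals and the mask_span helper are illustrative placeholders,
# not the exact tokens or code used by InCoder.

def mask_span(document: str, start: int, end: int, span_id: int = 0) -> str:
    """Replace document[start:end] with a sentinel and move the span to the end."""
    sentinel = f"<|mask:{span_id}|>"   # stands in for the masked region
    end_of_mask = "<|endofmask|>"      # marks where the moved span ends
    left, span, right = document[:start], document[start:end], document[end:]
    # Training sequence: left context, sentinel, right context,
    # then the sentinel again followed by the original (masked-out) span.
    return left + sentinel + right + sentinel + span + end_of_mask


code = "def add(a, b):\n    return a + b\n"
# Mask the function body ("return a + b") and move it to the end of the sequence.
print(mask_span(code, start=19, end=31))
```

At inference time the same layout is used as a prompt, except that the masked span is left for the model to generate.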

Causal masking procedure

Neural models for code generation typically use either a left-to-right (causal) autoregressive language modelling objective or a masked language modelling objective. The former conditions only on context to the left of the generated tokens, which prevents infilling but allows entire documents to be generated autoregressively. The latter conditions on both the left and right contexts and so can infill a masked region, but its training objective is limited to generating only a small part of the document. The InCoder team proposes a causal masking objective that combines the strengths of both causal and masked language models.

The causal masking procedure samples several spans of contiguous tokens to mask in each document, with the length of each span sampled uniformly up to the length of the document. If the sampled spans overlap, the whole set is rejected and resampled (a minimal sketch follows).
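As a rough illustration of the rejection-and-resample step, here is a small sketch under the assumptions stated above (uniform span lengths, resampling on overlap); the sampling distributions in the actual training code may differ.

```python
import random

def sample_spans(doc_len: int, num_spans: int, rng: random.Random):
    """Sample non-overlapping (start, end) token spans; resample the whole set on overlap."""
    while True:
        spans = []
        for _ in range(num_spans):
            length = rng.randint(1, doc_len)          # span length, uniform in the document length
            start = rng.randint(0, doc_len - length)  # uniform start position
            spans.append((start, start + length))
        spans.sort()
        # Reject and resample if any two spans overlap.
        if all(a_end <= b_start for (_, a_end), (b_start, _) in zip(spans, spans[1:])):
            return spans


rng = random.Random(0)
print(sample_spans(doc_len=500, num_spans=2, rng=rng))
```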

InCoder training

The models are trained on two major corpora:

(1) Public code 

(2) StackOverflow questions, answers, and comments.

The corpus is heavily weighted toward Python but includes code files from 28 languages in total, along with StackOverflow content. The training data was filtered and deduplicated to create a corpus of 159 GB of code, of which 52 GB is Python and 57 GB is StackOverflow content. Code files and repository metadata were collected from GitHub and GitLab, covering around 670,000 public non-fork repositories detected to contain Python, JavaScript or Jupyter Notebook files; text and code were also preprocessed from Jupyter notebooks. The second component of the corpus consists of questions, answers and comments from StackOverflow.

Fairseq toolkit

InCoder was built with Fairseq, a sequence modelling toolkit for training custom models for translation, summarisation, and other text generation tasks. For InCoder, it was used to improve memory efficiency by fully sharding model states. The toolkit supports multi-GPU training on one or more machines and fast beam-search generation on both CPU and GPU. InCoder was trained on 248 V100 GPUs for 24 days, using the causal masking implementation in Fairseq with PyTorch as the underlying library. Additional training details include a per-GPU batch size of eight and a maximum token sequence length of 2,048. A minimal usage sketch of the released model follows.
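The sketch below assumes the publicly released checkpoints on Hugging Face (the smaller facebook/incoder-1B is used here to keep memory needs modest; the 6.7B model described in this article is published as facebook/incoder-6B). Model names and generation settings are illustrative, not a definitive recipe.

```python
# Minimal sketch: loading a released InCoder checkpoint from Hugging Face and
# generating a left-to-right completion. Assumes the transformers library is installed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/incoder-1B"   # the article's 6.7B model is facebook/incoder-6B
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "def count_lines(path):\n    "
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=48, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```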

Model performance

The team said that using a causal masking objective while training a generative model of code results in strong zero-shot performance on many challenging and practical code infilling and editing tasks. In ablation and comparison experiments, the model showed performance comparable to similarly resourced models on standard left-to-right language-to-code synthesis benchmarks. Additionally, they evaluated the model zero-shot on tasks such as type inference, comment generation and variable renaming. The model performed better on these tasks when allowed to condition on bidirectional context, demonstrating the value of infilling over purely left-to-right generation; an illustrative infilling prompt for one such task follows.
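For comment (docstring) generation, for example, a zero-shot infilling prompt can be laid out as left context, sentinel, right context, sentinel, with the model then asked to generate the masked docstring. As before, the sentinel literals below are placeholders rather than the exact tokens of the released tokenizer.

```python
# Illustrative zero-shot docstring-infilling prompt, following the same
# left / sentinel / right layout sketched earlier; sentinel literals are placeholders.
left = 'def mean(xs):\n    """'
right = '"""\n    return sum(xs) / len(xs)\n'
prompt = left + "<|mask:0|>" + right + "<|mask:0|>"
# Feeding `prompt` to the model and decoding until its end-of-mask token
# yields a candidate docstring for the masked region.
print(prompt)
```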
