
Microsoft Introduces First Bimodal Pre-Trained Model for Natural Language Generation


Over the past few years, large pre-trained models such as BERT, ELMo and XLNet have brought significant improvements to almost every natural language processing (NLP) task. Microsoft, too, has been doing a lot of research around natural language processing (NLP) and natural language understanding (NLU) for several years now.

The Natural Language Processing group at Microsoft focuses on developing efficient algorithms for processing text and on designing and building software that can analyse, understand, and generate the languages humans use naturally. Recently, the researchers at the tech giant developed a bimodal pre-trained model for natural language (NL) and programming languages (PL) such as Python, Java and JavaScript, known as CodeBERT.

About the Model

CodeBERT captures the semantic connection between natural language and programming language and produces general-purpose representations that can broadly support NL-PL understanding tasks such as natural language code search and generation tasks such as code documentation generation. 

The model was evaluated on two NL-PL applications, natural language code search and code documentation generation, by fine-tuning the model parameters. It achieved state-of-the-art performance on both tasks.
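To make the code-search use case more concrete, here is a minimal sketch of retrieving code snippets with CodeBERT embeddings. It assumes the Hugging Face transformers library and the publicly released microsoft/codebert-base checkpoint; the first-token pooling and cosine-similarity scoring below are illustrative simplifications, not the exact fine-tuning setup used in the paper.

```python
# A minimal sketch of natural language code search with CodeBERT embeddings.
# Assumes the Hugging Face `transformers` library and the public
# "microsoft/codebert-base" checkpoint; pooling/scoring choices are illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")
model.eval()

def embed(text: str) -> torch.Tensor:
    """Return the first-token ([CLS]) embedding for a query or code snippet."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0, :]

query = "read a file line by line"
candidates = [
    "def read_lines(path):\n    with open(path) as f:\n        return f.readlines()",
    "def add(a, b):\n    return a + b",
]

query_vec = embed(query)
scores = [torch.cosine_similarity(query_vec, embed(c)).item() for c in candidates]
print(candidates[scores.index(max(scores))])  # the snippet closest to the query
```

In the paper, the model parameters are fine-tuned on labelled query-code data rather than used zero-shot, but the retrieve-by-similarity pattern is the same.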

How It Works

The researchers developed the model with a multi-layer Transformer-based neural architecture, the architecture adopted by a majority of large pre-trained models. In order to make use of both bimodal instances of NL-PL pairs and the large amount of available unimodal code, CodeBERT is trained with a hybrid objective function that combines standard masked language modelling (MLM) and replaced token detection (RTD).

The MLM objective is to predict the original tokens that have been masked out, while the RTD objective is to detect which tokens in the input have been replaced with plausible alternatives, an approach originally developed for efficiently learning pre-trained models for natural language. Here, bimodal data refers to parallel natural language-code pairs, while unimodal data refers to code without paired natural language text and to natural language without paired code.
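As a rough illustration of how such a hybrid objective can be combined in training code, the sketch below sums an MLM loss and an RTD loss in PyTorch. The generator and discriminator modules and the tensor shapes are placeholders for illustration; the paper's actual setup (for example, its use of separate NL and code generators to sample replacement tokens) differs in detail.

```python
# A conceptual sketch of a hybrid MLM + RTD pre-training loss in PyTorch.
# `generator` and `discriminator` are placeholder modules, not the released
# CodeBERT training code; shapes: input_ids (batch, seq), mask_positions bool.
import torch
import torch.nn.functional as F

def hybrid_loss(generator, discriminator, input_ids, masked_ids, mask_positions):
    # MLM: predict the original tokens at the masked-out positions.
    gen_logits = generator(masked_ids)                       # (batch, seq, vocab)
    mlm_loss = F.cross_entropy(gen_logits[mask_positions],
                               input_ids[mask_positions])

    # RTD: fill masked positions with plausible sampled tokens, then ask the
    # discriminator to label every token as original (0) or replaced (1).
    with torch.no_grad():
        sampled = gen_logits.argmax(dim=-1)                  # (batch, seq)
    corrupted = torch.where(mask_positions, sampled, input_ids)
    replaced = (corrupted != input_ids).float()
    disc_logits = discriminator(corrupted).squeeze(-1)       # (batch, seq)
    rtd_loss = F.binary_cross_entropy_with_logits(disc_logits, replaced)

    return mlm_loss + rtd_loss
```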

The researchers built the model using the same architecture as RoBERTa-base, developed by Facebook. They then trained CodeBERT on a large dataset drawn from GitHub code repositories in six programming languages: Python, Java, JavaScript, PHP, Ruby and Go. Each bimodal data point is an individual function paired with its documentation, and each unimodal data point is a function without paired documentation. The dataset includes 2.1M bimodal data points and 6.4M unimodal code functions across the six languages.
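Since each bimodal instance pairs a function with its documentation, an NL-PL pair is fed to the model as a single sequence. The snippet below shows what that looks like with the microsoft/codebert-base tokenizer, which follows RoBERTa's special-token conventions; the example docstring and function are made up for illustration.

```python
# Encoding a bimodal (documentation, code) pair as one input sequence.
# Uses the public "microsoft/codebert-base" tokenizer; for a RoBERTa-style
# model the result is laid out as <s> documentation </s></s> code </s>.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")

doc = "Return the maximum of two numbers."
code = "def max2(a, b):\n    return a if a > b else b"

encoding = tokenizer(doc, code, truncation=True, max_length=512)
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
```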

Key Points of CodeBERT

  • CodeBERT is the first large bimodal pre-trained model for natural language and programming language.
  • This model has outperformed Facebook’s RoBERTa model.
  • CodeBERT provides good initialisation for learning downstream tasks.
  • When fine-tuned on downstream NL-PL applications such as natural language code search and code documentation generation, the model achieved state-of-the-art performance on both tasks.
  • CodeBERT performs better than baselines on almost all languages on both NL and PL probing.

Wrapping Up

According to the researchers, CodeBERT, the pre-trained model for programming and natural languages, performed better than previous pre-trained models on NL-PL probing and consistently outperformed RoBERTa, Facebook's optimised method for pre-training self-supervised NLP systems.

Last year, Microsoft open-sourced the Intelligent Conversation Engine: Code and Pre-trained Systems, or Icecaps, a toolkit that not only allows researchers and developers to imbue their chatbots with different personas but also lets them incorporate other natural language processing features that emphasise conversation modelling.

Read the paper here.



Ambika Choudhury

A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.
