Salesforce Open Sources CodeT5: An ML Model That Generates Code

CodeT5 is built on a similar architecture to Google’s T5 (Text-to-Text Transfer Transformer) framework but comes with a better code understanding.

Salesforce recently open-sourced a machine learning model known as CodeT5, which can understand and generate code in real-time. The team claims that CodeT5 achieves state-of-the-art performance on coding tasks, including code defect detection, predicting whether code is vulnerable to exploits, and clone detection, which predicts whether two code snippets have the same functionality. 

The existing code pre-training methods have had two major limitations. First, they often rely on either an encoder-only model similar to BERT or a decoder-only model like GPT, which is not optimal for generation and understanding tasks. Second, most simply adopt the conventional NLP pre-training techniques on source code by regarding it as a sequence of tokens like natural language (NL). Unfortunately, this largely ignores the rich structural information in a programming language (PL), which is vital to understand code semantics fully.

To address these limitations, CodeT5, an identifier-aware unified pre-trained encoder-decoder model, comes to the rescue.

Image Source: CodeT5’s GitHub 

CodeT5 is built on a similar architecture to Google’s T5 (Text-to-Text Transfer Transformer) framework but comes with a better code understanding. A more unified model for natural language processing tasks is presented. It reframes text-to-text, where the input and output data are always strings of texts. This allows any task to be applied by this uninformed model.

The research team of CodeT5 had over 8.35 million examples to train the AI on, including user-written comments from open source GitHub repositories. While training, the largest and most capable version of CodeT5, which had 220 million parameters, took 12 days on a cluster of 16 Nvidia A100 GPUs with 40GB memory.

The coding assistant is packed with three code intelligence capabilities:

  • Text-to-code generation: Can generate code based on the natural language description.
  • Code autocompletion: Can complete the whole function of code given the target function name
  • Code summarization: It can generate the summary of a function in natural language description

However, CodeT5 has not yet been tested on APPS, thus leaving possibilities of generating unreliable code. Researchers also acknowledged one major drawback being encoding stereotypes like race or gender from text comments in datasets used to train.

Download our Mobile App

Victor Dey
Victor is an aspiring Data Scientist & is a Master of Science in Data Science & Big Data Analytics. He is a Researcher, a Data Science Influencer and also an Ex-University Football Player. A keen learner of new developments in Data Science and Artificial Intelligence, he is committed to growing the Data Science community.

Subscribe to our newsletter

Join our editors every weekday evening as they steer you through the most significant news of the day.
Your newsletter subscriptions are subject to AIM Privacy Policy and Terms and Conditions.

Our Recent Stories

Our Upcoming Events

3 Ways to Join our Community

Telegram group

Discover special offers, top stories, upcoming events, and more.

Discord Server

Stay Connected with a larger ecosystem of data science and ML Professionals

Subscribe to our Daily newsletter

Get our daily awesome stories & videos in your inbox

Oracle’s Grand Multicloud Gamble

“Cloud Should be Open,” says Larry at Oracle CloudWorld 2023, Las Vegas, recollecting his discussions with Microsoft chief Satya Nadella last week. 

How Generative AI is Revolutionising Data Science Tools

How Generative AI is Revolutionising Data Science Tools

Einblick Prompt enables users to create complete data workflows using natural language, accelerating various stages of data science and analytics. Einblick has effectively combined the capabilities of a Jupyter notebook with the user-friendliness of ChatGPT.