What is IBM’s Project CodeNet?

At its recently concluded Think 2021 conference, IBM introduced Project CodeNet to help develop machine learning models for programming tasks. The large dataset consists of 14 million code samples and 500 million lines of code in over 55 different languages, including C++, Java, Go, Python, COBOL, Pascal, and Fortran.

Project CodeNet

Modern computer programs can run to millions of lines of code and are hard to debug, maintain, update, and document. Using artificial intelligence to write code has been an important area of research for many years. However, it is easier said than done: programming languages are contextual, and every line of code depends on the code around it. Understanding that context is a tricky and time-consuming task, and the challenge grows with larger programs, where the context can span multiple code libraries.

IBM’s Project CodeNet can help extract this context with a sequence-to-sequence model. According to IBM’s team, this approach is geared towards machine understanding of code rather than mere machine processing of code.

Project CodeNet’s code samples are curated from open programming competitions held over the years, including challenges posted on coding platforms such as AIZU and AtCoder, and contain both correct and incorrect answers to the challenges. The dataset also includes high-quality metadata and annotations.

It also carries a rich set of information about each sample, such as code size, memory footprint, and CPU run time. The team used reinforcement learning techniques for code translation, determining the equivalence of two code samples in different languages by curating sample inputs and outputs from the problem description (which contains the problem statement and the input and output formats).
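The equivalence check described above — running two candidate programs on the same curated sample inputs and comparing their outputs — can be sketched as follows. This is a minimal illustration, not CodeNet’s actual harness; the `likely_equivalent` helper and the toy Python one-liners are assumptions made for the demo.

```python
import subprocess
import sys

def run_program(cmd, stdin_text, timeout=5):
    """Run one submission, feed it stdin, and return its stdout (None on failure)."""
    try:
        result = subprocess.run(cmd, input=stdin_text, capture_output=True,
                                text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return None
    return result.stdout.strip() if result.returncode == 0 else None

def likely_equivalent(cmd_a, cmd_b, sample_inputs):
    """Treat two programs as equivalent if they agree on every sample input.

    Agreement on a finite set of samples is evidence, not proof, of true
    functional equivalence.
    """
    for stdin_text in sample_inputs:
        out_a = run_program(cmd_a, stdin_text)
        out_b = run_program(cmd_b, stdin_text)
        if out_a is None or out_a != out_b:
            return False
    return True

# Demo: two toy "submissions" that both double an integer read from stdin.
cmd_a = [sys.executable, "-c", "print(int(input()) * 2)"]
cmd_b = [sys.executable, "-c", "n = int(input()); print(n + n)"]
print(likely_equivalent(cmd_a, cmd_b, ["3\n", "10\n"]))  # True
```

In CodeNet’s setting, the two commands would invoke programs in different languages (say, a C++ original and a Python translation); because only stdin and stdout are compared, the harness itself is language-agnostic.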

Since accepted code samples are available as part of the dataset, users can execute them to extract additional information and to verify the outputs of generative AI models for correctness. This feature is handy when translating code from one language to another. The dataset can also be used for code search and clone detection. Because the samples are labelled with their acceptance status, AI techniques can learn to distinguish correct from incorrect code; and because they are labelled with CPU run time and memory footprint, they can also support regression and prediction tasks.
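To give a flavour of what token-level clone detection looks like, here is a minimal sketch based on normalised token sets and Jaccard similarity. The tokeniser, the `ID` placeholder, and the 0.8 threshold are illustrative assumptions, not part of CodeNet; production clone detectors use far more robust representations such as ASTs or learned embeddings.

```python
import keyword
import re

def tokens(code):
    """Lex code into keywords, identifiers, numbers, and operator characters,
    normalising every non-keyword identifier to the placeholder ID so that
    renamed variables still match."""
    raw = re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)
    return {"ID" if re.match(r"[A-Za-z_]", t) and not keyword.iskeyword(t) else t
            for t in raw}

def jaccard(a, b):
    """Similarity of two snippets as the overlap of their normalised token sets."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def is_clone(a, b, threshold=0.8):
    return jaccard(a, b) >= threshold

snippet_1 = "def add(a, b):\n    return a + b"
snippet_2 = "def add(x, y):\n    return x + y"   # same code, renamed variables
snippet_3 = "print('hello world')"

print(jaccard(snippet_1, snippet_2))   # 1.0 -- detected as a clone
print(is_clone(snippet_1, snippet_3))  # False
```

Normalising identifiers is what makes the renamed-variable pair score a perfect match; without it, trivially renamed clones would slip below the threshold.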

“Given its wealth of programs written in a multitude of languages, we believe Project CodeNet can serve as a benchmark dataset for source-to-source translation and do for AI and code what the ImageNet dataset did years ago for computer vision,” the team said.

Teaching AI to code

IBM’s Project CodeNet is the latest stab at teaching AI to code

Earlier, researchers from Microsoft and the University of Cambridge developed DeepCoder to solve fundamental programming problems. The tool uses a technique called program synthesis: it creates new programs by collating lines of code sourced from existing software, and it requires a list of inputs and outputs for each code fragment to determine which pieces of code can be combined.
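DeepCoder’s core idea — composing programs out of known fragments and checking candidates against input-output examples — can be illustrated with a toy enumerative search. The fragment library and pipeline representation below are invented for this illustration; DeepCoder itself uses a neural network to predict which operations of its DSL are likely to appear and prioritises those during the search.

```python
from itertools import product

# Toy fragment library: each fragment is a small, named unary function on ints.
FRAGMENTS = {
    "inc":    lambda x: x + 1,
    "double": lambda x: x * 2,
    "square": lambda x: x * x,
    "neg":    lambda x: -x,
}

def synthesize(examples, max_depth=3):
    """Enumerate pipelines of fragments, shortest first, and return the first
    one consistent with every (input, output) example, or None."""
    for depth in range(1, max_depth + 1):
        for names in product(FRAGMENTS, repeat=depth):
            def run(x):
                for name in names:
                    x = FRAGMENTS[name](x)
                return x
            if all(run(i) == o for i, o in examples):
                return list(names)
    return None

# (x + 1) * 2 is consistent with both examples.
print(synthesize([(1, 4), (3, 8)]))  # ['inc', 'double']
```

Brute-force enumeration like this is exponential in program depth, which is precisely why DeepCoder learns to rank the fragments and explore the most promising compositions first.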

In 2019, MIT introduced SketchAdapt, a program-writing AI trained on tens of thousands of program examples that can compose short, high-level programs. The tool knows when to switch from statistical pattern-matching to a less efficient yet more versatile symbolic reasoning mode.

One use case of GPT-3 is code development. This language model from OpenAI can assist users in building applications from text prompts, using the user’s input to generate code. This is particularly useful where rapid prototyping of applications is required. Companies such as debuild.co are already using GPT-3 to accelerate the application development process.

Shraddha Goled
I am a technology journalist with AIM. I write stories focused on the AI landscape in India and around the world with a special interest in analysing its long term impact on individuals and societies. Reach out to me at shraddha.goled@analyticsindiamag.com.
