At its recently concluded Think 2021 conference, IBM introduced Project CodeNet to develop machine learning models that can help in programming. The large dataset consists of 14 million code samples and 500 million lines of code in over 55 different languages, including C++, Java, Go, Python, COBOL, Pascal, and Fortran.
Modern computer programs have millions of lines of code and are hard to debug, maintain, update, and document. The use of artificial intelligence to write code has been an important area of research for many years. However, it is easier said than done. The fact that programming languages have context poses a major challenge: every line of code must be interpreted in context. Understanding that context is a tricky and time-consuming task, and the challenge grows with larger programs, where context can span multiple code libraries.
IBM’s Project CodeNet can help extract this context with a sequence-to-sequence model. According to IBM’s team, this approach is aimed at machine understanding of code rather than mere machine processing of code.
Project CodeNet’s code samples are curated from open programming competitions held over the years, including challenges posted on coding platforms such as AIZU and AtCoder. The dataset contains both correct and incorrect answers to these challenges, along with high-quality metadata and annotations.
The dataset also records rich information such as code size, memory footprint, and CPU run time. The team applied reinforcement learning techniques to code translation, determining whether two code samples in different languages are equivalent by curating sample inputs and outputs from the problem description (which contains the problem statement and the input and output formats).
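The idea of judging equivalence through curated inputs and outputs can be sketched in a few lines. The snippet below is a simplified illustration, not CodeNet's actual tooling: the two "submissions" are modelled as Python callables standing in for programs in different languages, and they are declared equivalent when they agree on every curated input.

```python
# Toy sketch: behavioural equivalence of two code samples, checked
# against curated inputs (function names here are hypothetical).

def run_sample(fn, test_input):
    """Run one code sample (modelled as a callable) on a single input."""
    try:
        return fn(test_input)
    except Exception:
        return None  # a crashing sample cannot match anything

def equivalent(sample_a, sample_b, curated_inputs):
    """Treat two samples as equivalent when they agree on every
    curated input extracted from the problem description."""
    return all(run_sample(sample_a, x) == run_sample(sample_b, x)
               for x in curated_inputs)

# The same problem solved two ways (stand-ins for, say, a C++ and a
# Python submission to the same challenge):
iterative = lambda n: sum(range(n + 1))
closed_form = lambda n: n * (n + 1) // 2

print(equivalent(iterative, closed_form, [0, 1, 10, 100]))  # True
```

In the actual dataset the samples would be executed in their own languages and compared on the problem's official input/output format, but the principle is the same: agreement on curated test cases serves as the equivalence signal.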
Users can execute the accepted code samples included in the dataset to extract additional information and to verify the outputs of generative AI models for correctness. This feature is handy when translating from one language to another. The dataset can also be used for code search and clone detection. Since the code samples are labelled with their acceptance status, AI techniques can be trained to distinguish correct from incorrect code. Samples are also labelled with CPU run time and memory footprint, which supports regression and prediction tasks.
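As a concrete (and deliberately simple) illustration of clone detection over such a corpus, one baseline technique is token-level Jaccard similarity: two samples that share most of their tokens are likely clones. This is one of many possible methods; CodeNet provides the labelled data but does not prescribe this particular approach.

```python
# Baseline clone detection: token-set Jaccard similarity between
# two code snippets (an illustrative technique, not CodeNet's own).
import re

def tokens(code):
    """Split code into identifiers, numbers, and single symbols."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+|\S", code))

def similarity(a, b):
    """Jaccard similarity of the two snippets' token sets, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

snippet1 = "for i in range(10): total += i"
snippet2 = "for j in range(10): total += j"   # near-clone (renamed variable)
snippet3 = "print('hello world')"             # unrelated code

print(similarity(snippet1, snippet2) > similarity(snippet1, snippet3))  # True
```

Real clone detectors use richer representations (ASTs, embeddings), but even this baseline shows how labelled pairs from the dataset could be used to tune a similarity threshold.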
“Given its wealth of programs written in a multitude of languages, we believe Project CodeNet can serve as a benchmark dataset for source-to-source translation and do for AI and code what the ImageNet dataset did years ago for computer vision,” the team said.
Teaching AI to code
IBM’s Project CodeNet is the latest stab at teaching AI to code.
Earlier, researchers from Microsoft and the University of Cambridge developed DeepCoder to solve fundamental programming problems. The team used a technique called program synthesis, in which the tool creates new programs by collating lines of code sourced from existing software. It requires a list of inputs and outputs for each code fragment to determine which pieces of code can be used.
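The program-synthesis idea behind DeepCoder can be sketched as a search: enumerate compositions of known code fragments and keep the one consistent with the given input/output examples. The tiny DSL below is invented for illustration; DeepCoder's actual DSL is richer, and it uses a neural network to guide the enumeration rather than brute force.

```python
# Toy enumerative program synthesis: find a pipeline of DSL fragments
# that matches all given (input, output) examples. The DSL here is a
# hypothetical stand-in for DeepCoder's fragment library.
from itertools import product

DSL = {
    "double":  lambda xs: [x * 2 for x in xs],
    "negate":  lambda xs: [-x for x in xs],
    "sort":    sorted,
    "reverse": lambda xs: list(reversed(xs)),
}

def synthesize(examples, max_len=2):
    """Return the first fragment pipeline consistent with every example."""
    for length in range(1, max_len + 1):
        for names in product(DSL, repeat=length):
            def run(xs, names=names):
                for name in names:
                    xs = DSL[name](xs)
                return xs
            if all(run(inp) == out for inp, out in examples):
                return list(names)
    return None  # no pipeline of this length explains the examples

# Target behaviour: sort the list, then double every element.
examples = [([3, 1, 2], [2, 4, 6]), ([5, 4], [8, 10])]
print(synthesize(examples))  # a matching pipeline, e.g. ['double', 'sort']
```

The brute-force search explodes combinatorially as the fragment library grows, which is exactly where DeepCoder's learned ranking of likely fragments comes in.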
In 2019, MIT introduced SketchAdapt, a program-writing AI. SketchAdapt is trained on tens of thousands of program examples and can compose short, high-level programs. The tool knows when to switch from statistical pattern-matching to a less efficient but more versatile symbolic-reasoning mode.
One of the use cases of GPT-3 is code development. This language model from OpenAI can assist users in building their applications with text prompts; the system uses user input to generate code. This is particularly useful in cases where rapid prototyping of applications is required. Companies such as debuild.co are already using GPT-3 for accelerating the application development process.