The intersection of code generation tools and large language models (LLMs) is pushing the frontiers of artificial intelligence. Though tech giants have come up with cutting-edge models such as BERT and Codex, access to them has been limited. Last year, Carnegie Mellon University researchers developed PolyCoder, a model based on OpenAI's GPT-2 architecture and trained on 249GB of code across 12 programming languages. But how does PolyCoder stack up against large language models like Codex and GPT-NeoX-20B?
PolyCoder vs Codex: open-source vs proprietary
PolyCoder was tested against various language models, including masked language models, encoder-decoder models and left-to-right auto-regressive models. While some of these models are pretrained exclusively on GitHub code, others are trained on 'The Pile', a large repository that amalgamates natural language text, code from various languages and software documentation.
Source: arxiv.org
The models were put through both extrinsic and intrinsic evaluations.
Extrinsic evaluation: One of the most common ways to test such a model is to have it generate code from natural-language prompts. All models are evaluated on the HumanEval dataset, which consists of 164 prompts, each made up of a function signature, a docstring and example test cases. For every prompt, 100 samples are drawn from each model and checked against the accompanying unit tests.
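The paper reports these results as pass@k scores, estimated with the unbiased formula introduced alongside Codex: generate n samples per problem, count how many pass the unit tests, and estimate the probability that at least one of k attempts succeeds. A minimal sketch of that estimator (the sample counts below are illustrative, not figures from the paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the Codex paper (Chen et al., 2021).

    n -- total samples generated for one HumanEval problem
    c -- how many of those samples passed the unit tests
    k -- attempt budget being scored (e.g. 1, 10, 100)
    """
    if n - c < k:
        # Every size-k subset of the samples contains at least one correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 100 samples per prompt, 7 of them pass the tests.
print(pass_at_k(n=100, c=7, k=1))   # 0.07
print(pass_at_k(n=100, c=7, k=10))  # ~0.53
```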
Source: arxiv.org
Intrinsic Evaluation: Each language model's perplexity is measured on a set of unseen GitHub repositories; these repositories are deliberately kept out of the models' training data to prevent leakage from the training set into the test set. For each of the 12 languages in the evaluation dataset, a sample of 100 random files is used. Because every model uses a different tokeniser, the summed log-likelihoods are normalised by a common token count obtained with Pygments, so that perplexities can be compared fairly across models.
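The normalisation step is the crux of this comparison: a model with a coarser tokeniser would otherwise look better simply because it predicts fewer, larger tokens. A minimal sketch of the idea, using a small public checkpoint as a stand-in for the paper's models and glossing over details of the paper's exact evaluation script:

```python
import math
import torch
from pygments.lexers import get_lexer_by_name
from transformers import AutoModelForCausalLM, AutoTokenizer

def normalized_perplexity(code: str, model_name: str, language: str) -> float:
    """Perplexity with the total log-likelihood divided by a tokenizer-agnostic
    token count (Pygments), so models with different vocabularies are comparable."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    ids = tokenizer(code, return_tensors="pt").input_ids
    with torch.no_grad():
        # HuggingFace returns the mean cross-entropy over predicted tokens;
        # multiply back to recover the total negative log-likelihood in nats.
        loss = model(input_ids=ids, labels=ids).loss
    total_nll = loss.item() * (ids.size(1) - 1)

    # Count tokens with a lexer that is identical for every model under test.
    lexer = get_lexer_by_name(language)
    n_common_tokens = sum(1 for _ in lexer.get_tokens(code))

    return math.exp(total_nll / n_common_tokens)

# Illustrative usage on a tiny C snippet with GPT-2 as the stand-in model.
print(normalized_perplexity("int main() { return 0; }", "gpt2", "c"))
```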
Source: arxiv.org
Compared to GPT-Neo (2.7B), PolyCoder was trained on fewer Python tokens but on more code tokens from the other programming languages. That makes PolyCoder a better candidate for transferring from other languages to Python, meaning that in future, natural language as well as code from different languages could be used as a prompt for development. In the intrinsic evaluation, PolyCoder outperformed Codex and all other models on the C language, and it delivered superior performance to the similarly sized GPT-Neo 2.7B in C, JavaScript, Rust, Scala and TypeScript.
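PolyCoder's trained checkpoints have been released publicly. Below is a minimal sketch of prompting one through the Hugging Face transformers library; the model id is an assumption, so check the official Code-LMs repository for the published weights:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hub id for the 2.7B PolyCoder checkpoint (not confirmed by the paper).
MODEL_NAME = "NinedayWang/PolyCoder-2.7B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# A C-style prompt, the language where the paper found PolyCoder strongest.
prompt = "int binary_search(int *arr, int len, int target) {"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.2)
print(tokenizer.decode(output[0]))
```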
Codex
Last year, OpenAI released an improved version of Codex, an AI system that translates natural language to code. Codex powers AI pair programmer GitHub Copilot and is proficient in more than a dozen programming languages. The AI system can interpret simple commands in natural language and execute them on the user’s behalf.
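Codex-style models are typically prompted with the natural-language description embedded in code, for example as a docstring or comment, and asked to complete what follows. An illustrative prompt of that form (the completed body is a hand-written example of the kind of output such a model produces, not actual Codex output):

```python
# A docstring states the task in plain English; a Codex-style model is
# expected to complete the function body that follows it.
def celsius_to_fahrenheit(celsius: float) -> float:
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    # A typical model completion:
    return celsius * 9 / 5 + 32

print(celsius_to_fahrenheit(100))  # 212.0
```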
Future of PolyCoder
DeepMind recently launched AlphaCode, a 41.4-billion-parameter model that is among the first AI-based engines able to generate code at a competitive level. AlphaCode demonstrated its capabilities in programming contests hosted by Codeforces, ranking within the top 54.3 per cent of human programmers. However, AlphaCode is not open-sourced. The researchers at Carnegie Mellon University hope their efforts with PolyCoder will encourage the giants to follow suit and act as a catalyst for AI research and the democratisation of LLMs.
The performance of LLMs generally depends on training time and model size. The results suggest that training on natural language alongside code, as GPT-Neo was, improves performance relative to PolyCoder's code-only training. With respect to the C programming language, however, PolyCoder achieved lower perplexity than all other models, including Codex.