PolyCoder vs OpenAI Codex: A comparison of the code-generation models

PolyCoder outperformed the similarly sized GPT-Neo 2.7B in C, JavaScript, Rust, Scala and TypeScript.

The intersection of code-generation tools and large language models (LLMs) is pushing the frontiers of artificial intelligence. Though tech giants have come up with cutting-edge models like BERT and Codex, access to such models has been limited. Last year, Carnegie Mellon University researchers developed PolyCoder, a 2.7-billion-parameter model based on OpenAI's GPT-2 architecture and trained on 249GB of code across 12 programming languages. But how does PolyCoder stack up against large language models like Codex and GPT-NeoX-20B?

PolyCoder vs Codex: open-source vs proprietary

PolyCoder was tested against various language models: masked language models, encoder-decoder models and left-to-right autoregressive models. While some of these models are pretrained exclusively on GitHub code, others are trained on 'The Pile', a large repository that amalgamates natural-language text, code from various languages and software documentation.

[Figure: Parameter comparison of PolyCoder and other models. Source: arxiv.org]



The models were tested with both extrinsic and intrinsic evaluations.

Extrinsic evaluation: One of the most common ways to test a model is to have it generate code from natural-language prompts. All models are evaluated on the HumanEval dataset, which consists of 164 prompts, each giving a function signature together with a description in the form of a docstring, comments and examples. A random sample of 100 completions was drawn per prompt to evaluate each model.
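The standard metric on HumanEval is pass@k, estimated from n sampled completions per prompt of which c pass the unit tests. A minimal sketch of the unbiased estimator introduced with HumanEval (illustrative, not code from the PolyCoder paper) might look like:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of
    k samples drawn (without replacement) from n generated completions,
    c of which pass the unit tests, is correct."""
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k must
        # include at least one passing completion.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 100 samples per prompt, 30 of which pass:
print(round(pass_at_k(100, 30, 1), 2))  # → 0.3
```

With n = 100 samples, this single formula yields pass@1, pass@10 and pass@100 from the same set of completions, which is why sampling many completions per prompt is the standard protocol.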

[Figure: HumanEval performance comparison of the models. Source: arxiv.org]

Intrinsic evaluation: Each language model's perplexity is compared on previously unseen GitHub repositories. Repositories seen during training are excluded to prevent leakage from the training set into the test set. A sample of 100 random files is used for each of the 12 coding languages in the evaluation dataset. Because the models use different tokenisers, each model's log-likelihood sum is normalised with a shared Pygments-based token count so that perplexities are comparable.
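Perplexity itself is just the exponentiated negative mean log-likelihood per token. A small sketch (illustrative, not the paper's code) of the quantity being compared:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp of the negative mean log-likelihood per token.
    To compare models with different tokenisers fairly, the same
    (shared) token count must be used as the denominator."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A uniform model over a 4-symbol vocabulary assigns log(1/4) to every
# token, so its perplexity equals the vocabulary size:
lps = [math.log(0.25)] * 8
print(round(perplexity(lps), 6))  # → 4.0
```

Lower perplexity means the model finds real code in that language less "surprising", which is why PolyCoder's low perplexity on C is notable.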

[Figure: Perplexity of PolyCoder and other models across languages. Source: arxiv.org]

Compared to GPT-Neo (2.7B), PolyCoder saw fewer Python tokens during training but more code tokens in other programming languages, which makes it a better candidate for transferring from other languages to Python; in future, both natural language and code from different languages could serve as prompts for development. In the intrinsic evaluation, PolyCoder outperformed Codex and all other models in the C language, and it delivered superior performance to the similarly sized GPT-Neo 2.7B in C, JavaScript, Rust, Scala and TypeScript.


Last year, OpenAI released an improved version of Codex, an AI system that translates natural language to code. Codex powers AI pair programmer GitHub Copilot and is proficient in more than a dozen programming languages. The AI system can interpret simple commands in natural language and execute them on the user’s behalf.
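As a hypothetical illustration of the task (not an actual Codex transcript), a HumanEval-style prompt supplies a function signature and a natural-language docstring, and the model must synthesise the body:

```python
# The prompt a code model sees: signature plus docstring, no body.
def has_close_elements(numbers: list[float], threshold: float) -> bool:
    """Return True if any two numbers in the list are closer to each
    other than the given threshold."""
    # The kind of completion a code model might produce:
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) < threshold:
                return True
    return False

print(has_close_elements([1.0, 2.8, 3.0], 0.3))  # → True
print(has_close_elements([1.0, 2.0, 3.9], 0.3))  # → False
```

The generated body is then run against hidden unit tests; a prompt counts as solved only if the completed function passes all of them.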

Future of PolyCoder

DeepMind recently launched AlphaCode, a 41.4-billion-parameter model and among the first AI systems able to generate code at a competitive level. AlphaCode demonstrated its capabilities in programming contests hosted by Codeforces, ranking in the top 54.3 per cent against human programmers. However, AlphaCode is not open source. The researchers at Carnegie Mellon University hope their efforts with PolyCoder will encourage the giants to follow suit and act as a catalyst for AI research and the democratisation of LLMs.

The performance of LLMs generally scales with training time and model size. The results showed that training on both natural language and code improved GPT-Neo's performance relative to PolyCoder. However, in the C programming language, PolyCoder achieved lower perplexity than all other models, including Codex.

Kartik Wali
