PolyCoder vs OpenAI Codex: A comparison of these code generation tools

PolyCoder delivered superior performance in comparison to similarly sized GPT-Neo 2.7B in C, JavaScript, Rust, Scala and TypeScript.

The intersection of code generation tools and large language models (LLMs) is pushing the frontiers of artificial intelligence. Though tech giants have come up with cutting-edge models like BERT and Codex, access to such models has been limited. Last year, Carnegie Mellon University researchers developed PolyCoder, a model based on OpenAI’s GPT-2 architecture and trained on 249GB of code across 12 programming languages. But how does PolyCoder stack up against large language models like Codex and GPT-NeoX-20B?

PolyCoder vs Codex: open-source vs proprietary

PolyCoder was tested against various language models, including masked language models, encoder-decoder models and left-to-right auto-regressive models. While some models are pretrained exclusively on GitHub code, others are trained on ‘The Pile’, a large dataset amalgamating natural language text, code from various languages and software documentation.

Parameter comparison of PolyCoder and other models (Source: arxiv.org)


The models were tested on a set of extrinsic and intrinsic evaluations.

Extrinsic evaluation: One of the most common ways to test a model is to have it generate code from natural language prompts. All models are evaluated on the HumanEval dataset, which consists of 164 prompts described through code, comments and docstrings. A random sample of 100 examples was drawn per prompt to evaluate each model.
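HumanEval results are conventionally reported as pass@k: the probability that at least one of k sampled completions passes the prompt's unit tests. A minimal sketch of the standard unbiased estimator is below (the function name is our own; n is the number of samples drawn per prompt and c the number that passed):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k completions,
    drawn without replacement from n samples (c of them correct), passes."""
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k must
        # include at least one passing completion.
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a running product for stability
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

# e.g. 1 correct out of 2 samples gives pass@1 = 0.5
```

Averaging this quantity over all 164 prompts yields the benchmark score reported for each model.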


Performance comparison of the models (Source: arxiv.org)

Intrinsic evaluation: Each language model’s perplexity is compared on unseen GitHub repositories to evaluate its intrinsic performance. The dataset is kept hidden from the models to prevent leakage from the training set into the test set. A sample of 100 random files is used for each of the 12 programming languages in the evaluation dataset. Because the models use different tokenisation methods, each model’s log-likelihood sum is normalised over a common token count, obtained by lexing the files with Pygments, so that perplexities are directly comparable.
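Perplexity here is simply the exponential of the negative mean log-likelihood per token, so lower values mean the model is less “surprised” by unseen code. A minimal illustration (the function and example values are ours, not from the paper):

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(-mean log-likelihood per token). Normalising by a
    shared token count makes models with different tokenisers comparable."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A model assigning uniform probability 1/4 to each of 5 tokens
# has a perplexity of exactly 4.
logs = [math.log(0.25)] * 5
```

In the paper’s setup, the denominator would be the Pygments token count of the file rather than each model’s own token count.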

Model performance of PolyCoder (Source: arxiv.org)

Compared to GPT-Neo (2.7B), PolyCoder was trained on fewer Python tokens but more code tokens from other programming languages, making it a better candidate for transferring knowledge from other languages to Python. This suggests that, in future, both natural language and code from different languages could be used as prompts for development. In the intrinsic evaluation, PolyCoder outperformed Codex and all other models in the C language, and it delivered superior performance to the similarly sized GPT-Neo 2.7B in C, JavaScript, Rust, Scala and TypeScript.


Last year, OpenAI released an improved version of Codex, an AI system that translates natural language to code. Codex powers AI pair programmer GitHub Copilot and is proficient in more than a dozen programming languages. The AI system can interpret simple commands in natural language and execute them on the user’s behalf.

Future of PolyCoder

DeepMind recently launched AlphaCode, a 41.4-billion-parameter model and one of the first AI systems able to generate code at a competitive level. AlphaCode demonstrated its capabilities in programming contests hosted by Codeforces, placing in the top 54.3 percentile against human programmers. However, AlphaCode is not open-sourced. The researchers at Carnegie Mellon University hope their efforts with PolyCoder will encourage the giants to follow suit and act as a catalyst for AI research and the democratisation of LLMs.

The performance of LLMs generally scales with training time and model size. The results showed that training on both natural language and code improves GPT-Neo’s performance relative to PolyCoder. However, in the C programming language, PolyCoder achieved lower perplexity than all other models, including Codex.


Kartik Wali

