Code Llamas Fight Over GPT-4 

Wizard LM compares wrong HumanEval numbers


A few days ago, fine-tuned Code Llama-based models from Wizard LM (WizardCoder 34B) and Phind were released. The two are now locked in a heated argument over whether Phind used Wizard LM’s WizardCoder-style dataset to train their V1 model. Phind has dismissed the claims, but the debate is still on!

Taking it a step further, Wizard LM recently started a discussion on X to debate the matter and settle which party’s account is accurate.

Everybody is rigorously evaluating on OpenAI’s HumanEval, trying to beat GPT-4 on various tasks. Just two days after the launch of Code Llama, Wizard LM introduced WizardCoder 34B, a fine-tuned version based on Code Llama. The company proudly claimed that WizardCoder 34B performed even better than GPT-4, ChatGPT-3.5, and Claude-2 on HumanEval, with a pass@1 rate of 73.2%.

It seemed like Wizard LM attempted to deceive developers by cleverly omitting the fact that it had compared the 73.2% score with the HumanEval rating of GPT-4’s March version, rather than the August version, which Wizard LM itself measured at 82%. Notably, the HumanEval results of GPT-4 and ChatGPT-3.5 are 67.0 and 48.1, respectively, as per the GPT-4 Technical Report (2023/03/15), which makes it odd that OpenAI’s own HumanEval figure for GPT-4 is lower than the one Wizard LM reports.

https://twitter.com/cto_junior/status/1695399872151622009

However, Wizard LM isn’t the only player in this race. Another startup, Phind, claimed that its fine-tuned versions of CodeLlama-34B and CodeLlama-34B-Python achieved pass rates of 67.6% and 69.5% on HumanEval, using its own Phind dataset. These numbers are almost equivalent to GPT-4’s.

Obsession with GPT-4 

It clearly shows that the open-source community considers GPT-4 to be the ultimate benchmark. Pick up any LLM research paper by Meta and you will find its results compared against GPT-based models, particularly on OpenAI’s HumanEval.

Ironically, Meta needs OpenAI and vice versa. In the paper ‘Code Llama: Open Foundation Models for Code’, the word ‘GPT’ appears 37 times; on the other hand, OpenAI didn’t use the word ‘Meta’ or ‘LLaMA’ in its ‘GPT-4 Technical Report’. What would happen if the open-source community stopped comparing itself with closed-source models? Apparently, the evaluation metrics created by OpenAI give purpose to the existence of open-source models; otherwise, it would be difficult to assess their performance and position.

In the Code Llama research paper, Meta did not use any evaluation metric of its own making. Besides HumanEval, the only other metric employed was MBPP (Mostly Basic Python Problems), which Google created. Another important thing to note is that GPT-4 does more than just coding tasks. Meta, on the other hand, is creating models meant for specific tasks, and trying to surpass GPT-4 in those particular tasks.

If a model is designed specifically for coding, there’s a good chance it might outperform GPT-4. Phind’s performance on HumanEval is also roughly on par with GPT-4’s. Moreover, there’s a strong likelihood that Code Llama was trained using datasets generated by GPT-4; otherwise, it would be quite challenging for an open-source model to come close to competing with GPT-4.

Is HumanEval enough? 

A discussion has been going on on Reddit about whether HumanEval is a suitable benchmark for measuring the coding abilities of large language models. The thread argues that solving HumanEval’s 164 Python programming problems is not everything one would expect from a code model, and that real-world usage of code models is not captured by a single number based on those problems.
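For context, the pass rates quoted above are pass@k scores in the sense described in OpenAI’s Codex/HumanEval paper: a problem counts as solved if at least one of k sampled completions passes the problem’s unit tests. Below is a minimal sketch of that estimator; the sample counts in the example are purely illustrative, not figures from any of the models discussed here.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator, following the formula in the Codex/HumanEval paper.

    n: total completions sampled for a problem
    c: completions that passed the problem's unit tests
    k: attempt budget being scored (k=1 for the pass@1 figures quoted above)
    """
    if n - c < k:
        return 1.0  # so many samples are correct that any k-subset contains one
    # 1 - probability that a randomly chosen k-subset contains no correct completion
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Illustrative example: 200 samples per problem, 140 pass -> pass@1 reduces to c/n
print(pass_at_k(200, 140, 1))  # 0.7
```

A model’s HumanEval score is then the average of this quantity over all problems, which is exactly the “single number” the Reddit thread is objecting to.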

The thread further notes that factors like code explanation, docstring generation, code infilling, Stack Overflow questions, and writing tests are not captured by HumanEval. One X user expressed the same sentiment, saying, “sadly real life performance is still way beyond GPT-4 for Python Code”. “Tried different, real life examples for creating minimal flask microservices (which I test on a bunch of LLMs) and GPT-4 still outperforms all open-source LLMs,” he added, praising GPT-4’s capabilities in real-world usage.

Interestingly, Can Xu, a senior researcher at Wizard LM, replied that he would look into it and try to improve the model. “Thank you for pointing out the points of potential improvement, we will work on the real life examples soon,” Xu said.

In another conversation, an X user said he finds that these benchmarks tend to be poor metrics for how well models perform in actual real-world workflows. Phind cofounder Michael Royzen replied that it was an early experiment to reproduce (and exceed) the “Unnatural CodeLlama” results from the paper, and that more work will be done in the future to make these models production-ready. “In the future, we’ll have a Mixture of Experts of different Code Llama models and I think that those will be competitive in real-world workflows,” Royzen added optimistically.

While open-source models might not yet match GPT-4 and are striving to catch up, it’s heartening to see that they’re openly engaging with the community and acknowledging their shortcomings. The discussion between Wizard LM and Phind on X is a good sign, and it shows that the open-source community is dedicated.

This transparency in the open-source community is a positive step towards ‘responsible AI’. In contrast, OpenAI keeps its trade secrets hidden, leaving everyone guessing about its upcoming plans.

[Updated: 2 p.m., August 30, 2023] The article has been updated to include the latest developments from Wizard LM.

Siddharth Jindal

Siddharth is a media graduate who loves to explore tech through journalism and putting forward ideas worth pondering about in the era of artificial intelligence.