
Meta’s Code Llama is Here, But Unnaturally

Meta has decided not to release its most powerful code generation model because it was trained on 'Unnatural Instructions'


Within a few months of the launch of LLaMA, Meta caught up with OpenAI in almost every aspect except coding. Now, the company has finally released its code generation model, Code Llama, which generates code from both code and natural-language prompts. The best part is that, just like Llama 2, Code Llama is open source and available for commercial use. 

Code Llama is built upon the foundation of Llama 2, fine-tuned with specialised code-related datasets. The company announced four versions — Code Llama, Code Llama Instruct, Code Llama Python, and Unnatural Code Llama — in three sizes: 7B, 13B, and 34B parameters. However, the release includes only the first three; Unnatural Code Llama has been withheld. 

Code Llama models can effectively process up to 100,000 tokens of context, resulting in more relevant code generation. This proves useful for understanding large codebases and debugging extensive code: developers can input substantial portions of a codebase to get help resolving issues and comprehending intricate coding challenges. The 7B model can run on a single GPU, enabling lower latency and real-time code completion.

Extensive benchmark testing validates Code Llama’s prowess. Compared with other code-specific AI models, Code Llama’s 34B variant achieves an impressive 53.7% on HumanEval and 56.2% on Mostly Basic Python Programming (MBPP), rivalling even ChatGPT’s performance. 
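For context, HumanEval and MBPP report functional correctness as pass@k: the probability that at least one of k sampled completions passes the benchmark's unit tests. A minimal sketch of the standard unbiased estimator (this is an illustration of how such scores are computed in general, not Meta's evaluation code):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n completions sampled per problem, c of them correct.

    Returns the probability that at least one of k completions drawn
    without replacement from the n samples is correct.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 samples per problem, 4 correct: pass@1 = 0.4
print(pass_at_k(10, 4, 1))
```

A model's headline score is this quantity averaged over all problems in the benchmark; the 53.7% HumanEval figure quoted above is a pass@1 of this kind.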

One of the interesting parts of Code Llama’s training data is the use of Unnatural Instructions, a dataset created using existing AI models. And surprisingly enough, the company has decided not to release the Unnatural model, a Code Llama Python 34B variant fine-tuned on 15,000 unnatural instructions. According to the paper, this was the most powerful version of Code Llama. 

What is Unnatural code?

In December 2022, Meta AI, together with Tel Aviv University, published a paper titled Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor. The paper describes how the researchers built a large dataset of creative and diverse instructions, collecting 64,000 examples by prompting a language model with a handful of seed demonstrations. The model was then prompted again to expand these into a total of roughly 240,000 examples of instructions, inputs, and outputs, containing only a small amount of noise. 

Basically, Meta AI created a synthetic instruction dataset through an entirely automated process. According to the paper, models fine-tuned on this dataset were able to outperform models like ChatGPT on several natural language processing tasks. The same method has now been applied to Code Llama with a coding dataset. 
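The core loop behind such a pipeline is easy to sketch: prompt a model with a few seed demonstrations, collect the generated instruction-input-output triples, and discard duplicates until the dataset reaches the target size. Below is a minimal, hypothetical illustration with the model call stubbed out; the actual prompts, sampling strategy, and filtering in the paper are considerably more involved:

```python
def generate_unnatural_dataset(seed_examples, sample_model, target_size):
    """Grow a synthetic instruction dataset by repeatedly prompting a model.

    sample_model(prompt) -> (instruction, input, output) stands in for a
    real completion call (the paper used OpenAI's text-davinci-002).
    """
    dataset = list(seed_examples)
    seen = {ex[0] for ex in dataset}  # deduplicate on the instruction text
    while len(dataset) < target_size:
        # Show the model a few demonstrations and ask it to continue.
        prompt = "\n\n".join(
            f"Instruction: {ins}\nInput: {inp}\nOutput: {out}"
            for ins, inp, out in dataset[:3]
        )
        example = sample_model(prompt)
        if example[0] not in seen:  # drop exact duplicates
            seen.add(example[0])
            dataset.append(example)
    return dataset
```

The appeal is obvious: once the seed examples are written, no further human labour is needed, and the generating model's diversity (plus the dedup filter) determines the dataset's quality.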

Interestingly, according to the Unnatural Instructions paper, the data generation relied on text-davinci-002, a GPT-3-based model, to produce the input and output data. Though Meta has not specified which model generated the data for Code Llama, there is a high possibility that it is a mix of code generated by Llama 2 and possibly GPT-4 as well. This might be one of the reasons not to release the Unnatural Code Llama model. 

Synthetic data is too precious, but too problematic

Just as everyone tried to mimic ChatGPT’s success by training their models on its output, the same is now happening with code generation models. There is a high possibility that the Unnatural model was trained on GPT-4 output, or more specifically on output from OpenAI’s Codex via GitHub Copilot. That would land Meta in legal trouble with OpenAI, whose terms now clearly restrict training on GPT output.

It is, however, clear that synthetic data is proving to be the winner when it comes to expanding the capabilities of generative models. Fine-tuned on just 15,000 synthetic examples, the unreleased Unnatural Code Llama benchmarked as the most powerful variant. Meta could have avoided any legal trouble with OpenAI had it simply used Llama 2’s output for training. Or possibly, the company just wants to keep the model for internal use. 

In the meantime, releasing Code Llama for commercial use, just like Llama 2, gives Meta an edge over code generation platforms such as Copilot, which remain pay-to-use. Moreover, the 7B model allows code generation to run locally on a single GPU. 

With the release of Code Llama, Meta continues to play the good guy in the open-source and developer ecosystem, just as it did with Llama 2 and even PyTorch. The company is making its moat even stronger, and its partnership with Microsoft is definitely going to help it make some bucks in the future.


Mohit Pandey

Mohit dives deep into the AI world to bring out information in simple, explainable, and sometimes funny words. He also holds a keen interest in photography, filmmaking, and the gaming industry.