
Meta’s Code Llama is Here, But Unnaturally

Meta has decided not to release its most powerful code generation model because it was trained on 'Unnatural Instructions'


Within a few months of the launch of LLaMA, Meta caught up with OpenAI in almost every aspect except coding. Now, the company has finally released its code generation model, Code Llama, which generates code from both code and natural-language prompts. The best part is that, just like Llama 2, Code Llama is open source and available for commercial use. 

Code Llama is built upon the foundation of Llama 2, fine-tuned with specialised code-related datasets. The company announced four versions — Code Llama, Code Llama Instruct, Code Llama Python, and Unnatural Code Llama — in three sizes: 7B, 13B, and 34B parameters. However, the release includes only the first three; Unnatural Code Llama has been withheld. 

Code Llama models can effectively process up to 100,000 tokens of context, resulting in more relevant code generation. This proves useful for understanding large codebases and debugging extensive code: developers can input substantial portions of a codebase to get help resolving issues and comprehending intricate coding challenges. The 7B model can run on a single GPU, enabling lower latency and real-time code completion.

Extensive benchmark testing validates Code Llama’s prowess. Compared with other code-specific AI models, Code Llama’s 34B variant achieves an impressive 53.7% on HumanEval and 56.2% on Mostly Basic Python Programming (MBPP), rivalling even ChatGPT’s performance. 
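For context, HumanEval and MBPP report functional correctness as pass@k: the probability that at least one of k sampled completions passes the benchmark's unit tests. A minimal sketch of the standard unbiased estimator (this is an illustration of how such scores are computed in general, not Meta's evaluation code):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n completions sampled per problem, c of them correct.

    Returns the probability that at least one of k completions drawn
    without replacement from the n samples is correct.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 samples per problem, 4 correct: pass@1 = 0.4
print(pass_at_k(10, 4, 1))
```

A model's headline score is this quantity averaged over all problems in the benchmark; the 53.7% HumanEval figure quoted above is a pass@1 of this kind.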

One of the interesting parts of Code Llama’s training data is the use of Unnatural Instructions, a dataset created using existing AI models. And surprisingly enough, the company has decided not to release the Unnatural model, a Code Llama Python 34B variant fine-tuned on 15,000 unnatural instructions. According to the paper, this was the most powerful version of Code Llama. 

What is Unnatural code?

In December 2022, Meta AI, together with Tel Aviv University, published a paper titled Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor. The paper describes how the researchers built a large dataset of creative and diverse instructions, collecting 64,000 examples by prompting a language model with a handful of seed demonstrations. The model was then prompted again to expand these into a total of roughly 240,000 examples of instructions, inputs, and outputs, containing only a small amount of noise. 

Basically, Meta AI created a synthetic instruction dataset through an entirely automated process. According to the paper, models fine-tuned on this dataset were able to outperform models like ChatGPT on several natural language processing tasks. The same method has now been applied to Code Llama with a coding dataset. 
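The core loop behind such a pipeline is easy to sketch: prompt a model with a few seed demonstrations, collect the generated instruction-input-output triples, and discard duplicates until the dataset reaches the target size. Below is a minimal, hypothetical illustration with the model call stubbed out; the actual prompts, sampling strategy, and filtering in the paper are considerably more involved:

```python
def generate_unnatural_dataset(seed_examples, sample_model, target_size):
    """Grow a synthetic instruction dataset by repeatedly prompting a model.

    sample_model(prompt) -> (instruction, input, output) stands in for a
    real completion call (the paper used OpenAI's text-davinci-002).
    """
    dataset = list(seed_examples)
    seen = {ex[0] for ex in dataset}  # deduplicate on the instruction text
    while len(dataset) < target_size:
        # Show the model a few demonstrations and ask it to continue.
        prompt = "\n\n".join(
            f"Instruction: {ins}\nInput: {inp}\nOutput: {out}"
            for ins, inp, out in dataset[:3]
        )
        example = sample_model(prompt)
        if example[0] not in seen:  # drop exact duplicates
            seen.add(example[0])
            dataset.append(example)
    return dataset
```

The appeal is obvious: once the seed examples are written, no further human labour is needed, and the generating model's diversity (plus the dedup filter) determines the dataset's quality.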

Interestingly, according to the Unnatural Instructions paper, the data generation relied on text-davinci-002, a GPT-3-based model, to produce the input and output data. Though Meta has not specified which model generated the data for Code Llama, there is a high possibility that it is a mix of code generated by Llama 2 and possibly GPT-4 as well. This might be one of the reasons not to release the Unnatural Code Llama model. 

Synthetic data is too precious, but too problematic

Just as everyone tried to mimic ChatGPT’s success by training their models on its output, the same is now happening with code generation models. There is a high possibility that the Unnatural model was trained on GPT-4 output, or more specifically on output from OpenAI’s Codex via GitHub Copilot. That would land Meta in legal trouble with OpenAI, whose terms now clearly restrict training on GPT output.

It is, however, clear that synthetic data is proving to be the winner when it comes to expanding the capabilities of generative models. Fine-tuned on just 15,000 synthetic examples, the unreleased Unnatural Code Llama benchmarked as the most powerful variant. Meta could have avoided any legal trouble with OpenAI had it simply used Llama 2’s output for training. Or possibly, the company just wants to keep the model for internal use. 

In the meantime, releasing Code Llama for commercial use, just like Llama 2, gives Meta an edge over code generation platforms such as Copilot, which remain pay-to-use. Moreover, the 7B model allows code generation to run locally on a single GPU. 

With the release of Code Llama, Meta continues to play the good guy in the open-source and developer ecosystem, just as it did with Llama 2 and even PyTorch. The company is making its moat even stronger, and its partnership with Microsoft is definitely going to help it make some bucks in the future.


Mohit Pandey

Mohit dives deep into the AI world to bring out information in simple, explainable, and sometimes funny words. He also holds a keen interest in photography, filmmaking, and the gaming industry.