
Now Run Programs in Real Time with Llama 3 on Groq

Groq, creator of the first language processing unit (LPU) inference engine, delivers scalable, repeatable inference up to 10x faster than GPUs.

Illustration by Raghavendra Rao

Groq is lightning fast. It recently achieved a throughput of 877 tokens/s on Llama 3 8B and 284 tokens/s on Llama 3 70B. A user on X compared Llama 3 (on Groq) with GPT-4 by asking each to code a snake game in Python, and Groq was exceptionally fast. “There is no comparison,” he said.

Andrej Karpathy, a former OpenAI researcher, was also impressed by Groq’s speed and jokingly commented: “Ugh kids these days! Back in my days, we used to watch the tokens stream one at a time and wait for the output.” 

Another user wrote, “Llama 3 8b on Groq is absurdly fast and good quality! Here, with a simple prompt, it pumps out interrogatories for a trademark case at 826 tokens/second. Not perfect, but useful, and the output approaches GPT-4 level quality.” 

Llama 3 is a compelling choice for enterprises integrating LLMs into their operations. On Groq, Llama 3 is priced at $0.59 per million input tokens and $0.79 per million output tokens, significantly lower than Anthropic’s Claude 3 and OpenAI’s GPT-4.
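
Taken at face value, those list prices make per-request costs easy to estimate. The short Python sketch below uses hypothetical token counts and simply multiplies them by the quoted rates.

INPUT_PRICE_PER_M = 0.59   # USD per 1M input tokens (Groq's listed Llama 3 rate)
OUTPUT_PRICE_PER_M = 0.79  # USD per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    # Estimated dollar cost of a single request at the rates above.
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# Hypothetical request: a 2,000-token prompt with a 1,000-token completion
# works out to roughly $0.00118 + $0.00079, i.e. about a fifth of a cent.
print(f"${request_cost(2_000, 1_000):.5f}")  # $0.00197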

https://twitter.com/iamhitarth/status/1782548444130976168

Groq doesn’t offer its LPU hardware directly as a standalone product. Instead, it provides access to the processing power of its LPUs through its cloud service, GroqCloud. 

Recently, the company acquired Definitive Intelligence, a Palo Alto-based company that provides various AI solutions designed for businesses, such as chatbots, data analytics tools, and documentation builders.

In a recent interview, Groq founder Jonathan Ross said that within four weeks of launching the cloud service, the company had attracted 70,000 developers, and approximately 18,000 API keys had already been generated.

“It’s really easy to use and doesn’t cost anything to get started. You just use our API, and we’re compatible with most applications that have been built,” said Ross. He added that if any customer has a large-scale requirement and is generating millions of tokens per second, the company can deploy hardware for the customer on-premises.
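
As an illustration of what “just use our API” can look like, here is a minimal sketch of calling Llama 3 on GroqCloud through an OpenAI-compatible client. The base URL, model identifier, and environment variable name are assumptions for this example, not details confirmed by the article; check GroqCloud’s documentation for the exact values.

# Minimal sketch: querying Llama 3 on GroqCloud via an OpenAI-compatible client.
# Assumptions: the openai Python package (v1+), a GROQ_API_KEY environment
# variable, the base URL "https://api.groq.com/openai/v1", and the model id
# "llama3-8b-8192" -- adjust these to whatever GroqCloud documents.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["GROQ_API_KEY"],
    base_url="https://api.groq.com/openai/v1",
)

response = client.chat.completions.create(
    model="llama3-8b-8192",
    messages=[{"role": "user", "content": "Write a snake game in Python."}],
)
print(response.choices[0].message.content)

Because the interface mirrors the widely used chat completions format, existing applications can, in principle, switch over by changing little more than the base URL and model name, which is what Ross’s compatibility claim amounts to.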

What’s the secret? 

Founded in 2016 by Ross, Groq distinguishes itself by eschewing GPUs in favour of its proprietary hardware, the language processing unit (LPU). 

Prior to Groq, Ross worked at Google, where he created the tensor processing unit (TPU). He was responsible for designing and implementing the core elements of the original TPU chip, which played a pivotal role in Google’s AI efforts, including the AlphaGo competition. 

LPUs are meant only to run LLMs, not to train them. “The LPUs are about 10 times faster than GPUs when it comes to inference or the actual running of the models,” said Ross, adding that training LLMs remains a task for GPUs.

When asked about the purpose of this speed, Ross said, “Human beings don’t like to read like this, as if something is being printed out like an old teletype machine. Eyes scan a page really quickly and figure out almost instantly whether or not they’ve got what they want.”

Groq’s LPU poses a significant challenge to traditional GPU manufacturers like NVIDIA, AMD, and Intel. Groq built its tensor streaming processor specifically to speed up deep learning computations, rather than modifying general-purpose processors for AI.

The LPU is designed to overcome the two main LLM bottlenecks: compute density and memory bandwidth. For LLM workloads, an LPU has greater compute capacity than a GPU or CPU, which reduces the time taken per generated word and allows text sequences to be produced much faster.

Additionally, eliminating external memory bottlenecks enables the LPU inference engine to deliver orders of magnitude better performance on LLMs compared to GPUs. The LPU is designed to prioritise the sequential processing of data, which is inherent in language tasks. This contrasts with GPUs, which are optimised for parallel processing tasks such as graphics rendering. 
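
A back-of-the-envelope calculation shows why memory bandwidth is the binding constraint on conventional accelerators. The numbers below are illustrative assumptions, not measurements: every generated token requires streaming the full set of weights from memory, so bandwidth divided by model size gives a rough ceiling on single-stream speed.

# Back-of-the-envelope: why memory bandwidth caps single-stream decode speed.
# Assumed, illustrative figures: a 70B-parameter model in 16-bit weights
# (~140 GB) and an accelerator with ~2 TB/s of memory bandwidth.
model_bytes = 70e9 * 2          # 70B params * 2 bytes (fp16/bf16)
bandwidth_bytes_per_s = 2e12    # ~2 TB/s of memory bandwidth (assumed)

max_tokens_per_s = bandwidth_bytes_per_s / model_bytes
print(f"~{max_tokens_per_s:.0f} tokens/s upper bound for a single sequence")
# ~14 tokens/s -- which is why removing the external-memory round trip matters.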

“You can’t produce the 100th word until you’ve produced the 99th so there is a sequential component to them that you just simply can’t get out of a GPU,” said Ross.
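
That sequential dependency is easy to see in a generic autoregressive decoding loop: each step consumes the token produced by the step before it, so the steps cannot run in parallel. The sketch below is illustrative pseudocode, not Groq’s implementation; model and sample are placeholder names.

# Generic autoregressive decoding loop (illustrative only, not Groq-specific).
# Each new token is conditioned on all previously generated tokens, so step N
# cannot start until step N-1 has finished -- the sequential bottleneck Ross describes.
def generate(model, prompt_tokens, max_new_tokens, sample):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = model(tokens)        # forward pass over everything produced so far
        next_token = sample(logits)   # pick the next token from the distribution
        tokens.append(next_token)     # this token becomes input to the next step
    return tokens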

Moreover, he added that GPUs are notoriously thirsty for power, often requiring as much power as the average household per chip. “LPUs use as little as a tenth as much power,” he said. 

What’s Next? 

Groq recently partnered with Earth Wind & Power to develop the first European vertically integrated AI Compute Centre in Norway. Groq has committed to deploying and operating 21,600 LPUs at Earth Wind & Power’s AI Compute Centre in 2024, with the option to increase this number to 129,600 LPUs in 2025.

“If we can deploy over 220,000 LPUs this year, given how much faster they are than GPUs, it would be equivalent to more than all of Meta’s compute,” said Ross, adding that next year they want to deploy 1.5 million LPUs, which would be more than all of the compute of all the tech hyperscalers combined.


Siddharth Jindal

Siddharth is a media graduate who loves exploring tech through journalism and putting forward ideas worth pondering in the era of artificial intelligence.