
Now Run Programs in Real Time with Llama 3 on Groq

Groq, creator of the first language processing unit (LPU) inference engine, delivers scalable, repeatable inference up to 10x faster than GPUs.

Illustration by Raghavendra Rao

Groq is lightning fast. It recently achieved a throughput of 877 tokens/s on Llama 3 8B and 284 tokens/s on Llama 3 70B. A user on X compared Llama 3 (on Groq) with GPT-4 by asking each to code a snake game in Python, and Groq was exceptionally fast. “There is no comparison,” he said.

Andrej Karpathy, a former OpenAI researcher, was also impressed by Groq’s speed and jokingly commented: “Ugh kids these days! Back in my days, we used to watch the tokens stream one at a time and wait for the output.” 

Another user wrote, “Llama 3 8b on Groq is absurdly fast and good quality! Here, with a simple prompt, it pumps out interrogatories for a trademark case at 826 tokens/second. Not perfect, but useful, and the output approaches GPT-4 level quality.” 

Llama 3 is a compelling choice for enterprises integrating LLMs into their operations. On Groq, Llama 3 is priced at $0.59 per million input tokens and $0.79 per million output tokens, significantly lower than Anthropic’s Claude 3 and OpenAI’s GPT-4.
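
Taken at face value, those list prices make per-request costs easy to estimate. The short Python sketch below uses hypothetical token counts and simply multiplies them by the quoted rates.

INPUT_PRICE_PER_M = 0.59   # USD per 1M input tokens (Groq's listed Llama 3 rate)
OUTPUT_PRICE_PER_M = 0.79  # USD per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    # Estimated dollar cost of a single request at the rates above.
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# Hypothetical request: a 2,000-token prompt with a 1,000-token completion
# works out to roughly $0.00118 + $0.00079, i.e. about a fifth of a cent.
print(f"${request_cost(2_000, 1_000):.5f}")  # $0.00197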

https://twitter.com/iamhitarth/status/1782548444130976168

Groq doesn’t offer its LPU hardware directly as a standalone product. Instead, it provides access to the processing power of its LPUs through its cloud service, GroqCloud. 

Recently, the company acquired Definitive Intelligence, a Palo Alto-based company that provides various AI solutions designed for businesses, such as chatbots, data analytics tools, and documentation builders.

In a recent interview, Groq founder Jonathan Ross said that within four weeks of launching the cloud service, the company had attracted 70,000 developers, and approximately 18,000 API keys had already been generated.

“It’s really easy to use and doesn’t cost anything to get started. You just use our API, and we’re compatible with most applications that have been built,” said Ross. He added that if any customer has a large-scale requirement and is generating millions of tokens per second, the company can deploy hardware for the customer on-premises.
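
As an illustration of what “just use our API” can look like, here is a minimal sketch of calling Llama 3 on GroqCloud through an OpenAI-compatible client. The base URL, model identifier, and environment variable name are assumptions for this example, not details confirmed by the article; check GroqCloud’s documentation for the exact values.

# Minimal sketch: querying Llama 3 on GroqCloud via an OpenAI-compatible client.
# Assumptions: the openai Python package (v1+), a GROQ_API_KEY environment
# variable, the base URL "https://api.groq.com/openai/v1", and the model id
# "llama3-8b-8192" -- adjust these to whatever GroqCloud documents.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["GROQ_API_KEY"],
    base_url="https://api.groq.com/openai/v1",
)

response = client.chat.completions.create(
    model="llama3-8b-8192",
    messages=[{"role": "user", "content": "Write a snake game in Python."}],
)
print(response.choices[0].message.content)

Because the interface mirrors the widely used chat completions format, existing applications can, in principle, switch over by changing little more than the base URL and model name, which is what Ross’s compatibility claim amounts to.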

What’s the secret? 

Founded in 2016 by Ross, Groq distinguishes itself by eschewing GPUs in favour of its proprietary hardware, the language processing unit (LPU). 

Prior to Groq, Ross worked at Google, where he created the tensor processing unit (TPU). He was responsible for designing and implementing the core elements of the original TPU chip, which played a pivotal role in Google’s AI efforts, including the AlphaGo competition. 

LPUs are meant only to run LLMs, not to train them. “The LPUs are about 10 times faster than GPUs when it comes to inference or the actual running of the models,” said Ross, adding that training LLMs remains a task for GPUs.

When asked about the purpose of this speed, Ross said, “Human beings don’t like to read like this, as if something is being printed out like an old teletype machine. Eyes scan a page really quickly and figure out almost instantly whether or not they’ve got what they want.”

Groq’s LPU poses a significant challenge to traditional GPU manufacturers like NVIDIA, AMD, and Intel. Groq built its tensor streaming processor specifically to speed up deep learning computations, rather than modifying general-purpose processors for AI.

The LPU is designed to overcome the two main LLM bottlenecks: compute density and memory bandwidth. For LLM workloads, an LPU has greater compute capacity than a GPU or CPU, which reduces the time taken per generated word and allows text sequences to be produced much faster.

Additionally, eliminating external memory bottlenecks enables the LPU inference engine to deliver orders of magnitude better performance on LLMs compared to GPUs. The LPU is designed to prioritise the sequential processing of data, which is inherent in language tasks. This contrasts with GPUs, which are optimised for parallel processing tasks such as graphics rendering. 
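
A back-of-the-envelope calculation shows why memory bandwidth is the binding constraint on conventional accelerators. The numbers below are illustrative assumptions, not measurements: every generated token requires streaming the full set of weights from memory, so bandwidth divided by model size gives a rough ceiling on single-stream speed.

# Back-of-the-envelope: why memory bandwidth caps single-stream decode speed.
# Assumed, illustrative figures: a 70B-parameter model in 16-bit weights
# (~140 GB) and an accelerator with ~2 TB/s of memory bandwidth.
model_bytes = 70e9 * 2          # 70B params * 2 bytes (fp16/bf16)
bandwidth_bytes_per_s = 2e12    # ~2 TB/s of memory bandwidth (assumed)

max_tokens_per_s = bandwidth_bytes_per_s / model_bytes
print(f"~{max_tokens_per_s:.0f} tokens/s upper bound for a single sequence")
# ~14 tokens/s -- which is why removing the external-memory round trip matters.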

“You can’t produce the 100th word until you’ve produced the 99th so there is a sequential component to them that you just simply can’t get out of a GPU,” said Ross.
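
That sequential dependency is easy to see in a generic autoregressive decoding loop: each step consumes the token produced by the step before it, so the steps cannot run in parallel. The sketch below is illustrative pseudocode, not Groq’s implementation; model and sample are placeholder names.

# Generic autoregressive decoding loop (illustrative only, not Groq-specific).
# Each new token is conditioned on all previously generated tokens, so step N
# cannot start until step N-1 has finished -- the sequential bottleneck Ross describes.
def generate(model, prompt_tokens, max_new_tokens, sample):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = model(tokens)        # forward pass over everything produced so far
        next_token = sample(logits)   # pick the next token from the distribution
        tokens.append(next_token)     # this token becomes input to the next step
    return tokens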

Moreover, he added that GPUs are notoriously thirsty for power, often requiring as much power as the average household per chip. “LPUs use as little as a tenth as much power,” he said. 

What’s Next? 

Groq recently partnered with Earth Wind & Power to develop the first European vertically integrated AI Compute Centre in Norway. Groq has committed to deploying and operating 21,600 LPUs at Earth Wind & Power’s AI Compute Centre in 2024, with the option to increase this number to 129,600 LPUs in 2025.

“If we can deploy over 220,000 LPUs this year, given how much faster they are than GPUs, it would be equivalent to more than all of Meta’s compute,” said Ross, adding that next year they want to deploy 1.5 million LPUs, which would be more than all of the compute of all the tech hyperscalers combined.


Siddharth Jindal

Siddharth is a media graduate who loves exploring tech through journalism and putting forward ideas worth pondering in the era of artificial intelligence.