
Benchmarks are a Waste of Time 

“Users don’t care about a 3% increase in this benchmark or an 8% boost on that one.”


Illustration by Diksha Mishra


Since the advent of LLMs, benchmarks have been the litmus test for judging their capabilities, at least on paper. However, companies often massage the numbers to put themselves on top, and in this race, a clear winner is yet to emerge.

The recent launch of Gemini, and its comparison with GPT-4 across different benchmarks, offers a glimpse of this benchmark manipulation. For instance, Google claimed Gemini outperformed GPT-4 on the MMLU benchmark. However, it was later discovered that Google had reported Gemini’s score using CoT@32 (chain-of-thought prompting with 32 samples), whereas GPT-4’s published score was based on standard 5-shot prompting.
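The two settings reward different things, so the scores are not directly comparable: 5-shot prompting asks for a single answer after five worked examples, while CoT@32 samples many reasoning chains and aggregates them. The Python sketch below is a minimal illustration of that gap, assuming a hypothetical model.generate(prompt) API; Gemini’s actual CoT@32 reportedly uses an uncertainty-routed consensus rather than the plain majority vote shown here.

```python
from collections import Counter

def five_shot_answer(model, question: str, examples: list[str]) -> str:
    # One answer from a prompt carrying five worked examples
    # (the setting behind GPT-4's reported MMLU score).
    prompt = "\n\n".join(examples[:5] + [question])
    return model.generate(prompt)

def cot_at_32_answer(model, question: str) -> str:
    # Sample 32 chain-of-thought completions and keep the most
    # common final answer: a simplified stand-in for CoT@32.
    prompt = f"{question}\nLet's think step by step."
    samples = [model.generate(prompt, temperature=0.7) for _ in range(32)]
    return Counter(samples).most_common(1)[0][0]
```

Because the second method gets 32 attempts and a vote, it will usually score higher than the first on the same model, which is why quoting one number against the other flatters whoever picks the setting.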

Meanwhile, Microsoft didn’t stay silent, striking back with a blog post stating that, by applying Medprompt+ to GPT-4, it achieved a record MMLU score of 90.10%. Microsoft stated that with systematic prompt engineering, one can extract maximum performance from a model.

The tech giant said it continues to explore the out-of-the-box performance of frontier models using simple prompts. Moreover, Microsoft recently launched Phi-2, claiming it outperforms Mistral 7B, Llama 2, and Gemini Nano. Similarly, Mistral AI asserted that its latest model, Mixtral 8x7B, performs better than GPT-3.5 and Llama 2.

The battle over benchmarks, however, is nothing new. LLM makers’ obsession with them isn’t benefiting anyone, as they continue to compete over a few points here and there.

Do Benchmarks Really Matter?

While benchmarks provide a general indication of where a particular LLM stands, they shouldn’t be the sole criterion for judging it. The primary purpose of any LLM is to serve its customers and streamline their tasks.

For example, if you’re utilising an LLM for tasks like creating meeting notes, summaries, and blogs, its ability to perform those tasks accurately matters more than its benchmark score.

Echoing similar sentiments, AI advisor Vin Vashishta said, “Users don’t care about a 3% increase in this benchmark or an 8% boost on that one. If they don’t see the difference, they don’t know there is one.”

“Users want more accurate speech-to-text for meeting notes and summarization, BUT a 3% improvement in accuracy must be noticeable to users in a head-to-head comparison,” he added. 

Regarding Gemini, he said, “I understand that Gemini’s benchmarks are better, but generative AI winners won’t be decided by benchmarks. That’s how models win Kaggle, not how products win over customers.”

He added that model metrics must connect with customer and user outcomes, or they’re just vanity metrics. Companies are spending millions to publish benchmarks that customers ignore. 

Even though Llama 2 lags behind OpenAI’s and other models on benchmarks, it is still actively used by several enterprises due to its low cost and high adaptability. Recently, the Indian startup Sarvam AI launched its first Hindi LLM, called OpenHathi, which is built on top of Llama 2.

Benchmarks Can Be Manipulated

Manipulating metrics is not difficult: all you have to do is train your model on the benchmark’s own test data and then evaluate on it. This stands out as one of the simplest tricks to achieve a high benchmark score.

Affirming a similar viewpoint, Abacus AI founder Bindu Reddy said, “Now that we are all obsessed about benchmarks, keep in mind that they can be easily gamed. All you have to do is train on benchmark datasets to improve your MMLU scores.” This is not the first time Reddy has spoken about the futility of benchmarks.

“It’s the age-old trick of ‘training on test’ data and works pretty well on LLMs. As the number of LLMs explodes, we will see more and more folks employ this trick,” she added.
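One way researchers probe for this kind of gaming is a contamination check: scan the training corpus for benchmark questions that already appear in it. The sketch below is a minimal n-gram-overlap version; train_docs and benchmark_questions are hypothetical inputs, and the real checks described in model technical reports are considerably more involved.

```python
def ngrams(text: str, n: int = 8) -> set:
    # Sliding window of n consecutive lowercased tokens.
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def flag_contaminated(benchmark_questions: list, train_docs: list, n: int = 8) -> list:
    # A benchmark question is suspect if any n-gram from it also
    # appears verbatim somewhere in the training corpus.
    train_set = set()
    for doc in train_docs:
        train_set |= ngrams(doc, n)
    return [q for q in benchmark_questions if ngrams(q, n) & train_set]
```

A model card that reports such overlap statistics gives at least some assurance that a headline score wasn’t earned by memorisation; most benchmark announcements don’t.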

https://twitter.com/Karmedge/status/1734476310825554069

Earlier, she had posted on X, “When it comes to ChatGPT-like apps, vibes matter, not benchmarks. If your LLM isn’t interesting or spicy and generates boring corporate speak, it’s not going to make it.”

Furthermore, LLM benchmarks don’t necessarily paint an accurate picture of how the models can be utilised in real-life scenarios. A user on X said, “After a certain limit benchmarks are a spook for certain use cases. I don’t care about an LLM’s lack of making mistakes – I actually want my LLM to make the best guess at some code because it followed my instructions.”

Another user on X said, “With all the LLM benchmarks, have we reached the point where Goodhart’s law applies? i.e.: ‘When a measure becomes a target, it ceases to be a good measure.’”

Simply put, when a specific metric or indicator is used as a target for policy or decision-making, individuals or entities may optimise their behaviour to achieve favourable results on that particular metric. However, this can lead to distortions, as the original measure may no longer accurately reflect the intended goals or overall system performance.

This absolutely makes sense. Whenever a company releases an LLM, it tends to emphasise the model’s strong points without revealing the full picture. For instance, when Claude 2 was announced, its GSM8K and Codex HumanEval scores were published, while its MMLU score was not even mentioned.

Siddharth Jindal

Siddharth is a media graduate who loves to explore tech through journalism and putting forward ideas worth pondering about in the era of artificial intelligence.