
Google Fools Everyone with Gemini 

“Gemini doesn't really beat GPT-4"



Google appears desperate. After announcing that it would launch Gemini in the fall of this year, Google was unable to deliver on its promise. Now, the sudden launch of Gemini as the year ends suggests that Google did not want to be left behind. It seems to have acted under pressure while other players like OpenAI and Microsoft were unveiling new products.

Among the three Gemini models released by Google, Gemini Ultra created a buzz as it outperformed OpenAI’s GPT-4 on various benchmarks, including MMLU—a key metric used to evaluate a language model’s capabilities across a spectrum of subjects, ranging from STEM to social sciences and humanities.

Something’s Fishy 

Delving into Gemini’s technical report reveals that on the MMLU benchmark, Gemini Ultra outperformed both GPT-4 and GPT-3.5. However, the twist in the tale is that Google cleverly employed CoT@32 instead of 5-shot learning to enhance the perceived performance of Gemini.

“Digging deeper into the MMLU Gemini Beat – Gemini doesn’t really beat GPT-4. When we evaluate any large language model (LLM) on the MMLU benchmark, we typically employ 5-shot learning,” pointed out Bindu Reddy, the founder of Abacus AI. 

In 5-shot evaluation, the model is shown five worked examples directly in its prompt. This small in-context set is the only guidance the model receives, and it is expected to recognise the pattern and generalise from those five examples alone—no fine-tuning is involved.
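The 5-shot setup described above amounts to prepending five solved examples to the question being evaluated. A minimal sketch, using hypothetical placeholder questions rather than real MMLU items:

```python
# Minimal sketch of building a 5-shot prompt for MMLU-style evaluation.
# The exemplar questions and answers below are hypothetical placeholders.

def build_five_shot_prompt(exemplars, target_question):
    """Concatenate five solved examples followed by the target question.

    exemplars: list of (question, answer) pairs shown in-context.
    target_question: the question the model must answer itself.
    """
    parts = []
    for question, answer in exemplars:
        parts.append(f"Question: {question}\nAnswer: {answer}\n")
    # The prompt ends at "Answer:" so the model completes the final answer.
    parts.append(f"Question: {target_question}\nAnswer:")
    return "\n".join(parts)

# Hypothetical exemplars standing in for real MMLU items.
exemplars = [
    ("What is 2 + 2?", "4"),
    ("What gas do plants absorb?", "CO2"),
    ("Who wrote Hamlet?", "Shakespeare"),
    ("What is the capital of France?", "Paris"),
    ("What is H2O?", "Water"),
]

prompt = build_five_shot_prompt(
    exemplars, "What planet is known as the Red Planet?"
)
print(prompt.count("Question:"))  # 6: five exemplars plus the target
```

The key point is that the five examples live in the prompt, not in any training run, which is why the choice of prompting scheme changes the reported score.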

Chain of Thought (CoT) prompting, on the other hand, involves providing a series of reasoning steps to guide the model in generating intermediate rationales while solving a problem. It aims to enhance the multi-step reasoning abilities of LLMs by encouraging them to produce coherent and logical intermediate steps during problem-solving.
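The difference from a plain few-shot prompt is that each CoT exemplar includes the reasoning, not just the answer. A hedged sketch with a made-up arithmetic exemplar:

```python
# Sketch of a chain-of-thought exemplar: the worked example includes
# intermediate reasoning steps before the final answer, nudging the
# model to reason step by step on the new question. The exemplar text
# is a hypothetical illustration, not from any real benchmark.

COT_EXEMPLAR = (
    "Question: Roger has 5 balls and buys 2 cans of 3 balls each. "
    "How many balls does he have?\n"
    "Reasoning: He starts with 5. Two cans of 3 is 2 * 3 = 6. "
    "Then 5 + 6 = 11.\n"
    "Answer: 11\n"
)

def build_cot_prompt(target_question):
    # Prepend the worked exemplar, then prompt the model for its own
    # reasoning on the target question.
    return COT_EXEMPLAR + f"\nQuestion: {target_question}\nReasoning:"

prompt = build_cot_prompt("What is 3 * 4 + 2?")
```

In CoT@32, many such reasoning chains are sampled per question and their final answers are aggregated, which is why it is not directly comparable to a single 5-shot pass.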

“Google has invented a different methodology around CoT@32 to claim that it’s better than GPT-4. CoT@32 only surpasses when you factor in ‘uncertainty routing.’ I need to dig into this more, but it seems like a method that optimises a consensus cutoff to determine when to use the majority approach versus falling back to the max likelihood greedy strategy,” Reddy said, adding, “GPT-4 is still better than Gemini Ultra.” 
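Based on Reddy's description, the "uncertainty routing" step would take the majority answer across the 32 sampled chains only when consensus clears some threshold, otherwise falling back to the greedy answer. A hypothetical sketch of that logic (the threshold value and function name are assumptions, not from Google's report):

```python
from collections import Counter

def uncertainty_routed_answer(sampled_answers, greedy_answer,
                              consensus_threshold=0.5):
    """Pick the majority answer from N sampled reasoning chains if the
    consensus is strong enough; otherwise fall back to the single
    max-likelihood (greedy) answer.

    sampled_answers: final answers extracted from N sampled chains
                     (e.g. N = 32 for CoT@32).
    greedy_answer:   the greedy-decoded answer.
    consensus_threshold: minimum fraction of chains that must agree
                         (an assumed value, for illustration only).
    """
    answer, votes = Counter(sampled_answers).most_common(1)[0]
    if votes / len(sampled_answers) >= consensus_threshold:
        return answer
    return greedy_answer

# 32 hypothetical sampled answers with a clear majority:
samples = ["B"] * 20 + ["C"] * 8 + ["A"] * 4
print(uncertainty_routed_answer(samples, "C"))  # "B": 20/32 clears 0.5
```

When no answer dominates the samples, the routine returns the greedy answer instead of a weak majority, which matches the "consensus cutoff" behaviour Reddy describes.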

Even if Gemini Ultra beats GPT-4, does it truly make a difference? Every other day, new open-source LLMs emerge, boasting superior performance to GPT-4 or GPT-3.5. For instance, Llama 2 is on par with GPT-3.5, while TII’s Falcon 180B, at least on paper, surpasses GPT-3.5.

Regarding Gemini, AI Advisor Vin Vashishta said, “I understand that Gemini’s benchmarks are better, but Generative AI winners won’t be decided by benchmarks. That’s how models win Kaggle, not how products win over customers.”

He added that model metrics must connect with customer and user outcomes, or they’re merely vanity metrics. “Companies are spending millions to publish benchmarks that customers often ignore,” he added. 

Echoing similar sentiments, Reddy said, “When it comes to ChatGPT-like apps, vibes matter, not benchmarks. If your LLM isn’t interesting or spicy and generates boring corporate speak, it’s not going to make it.”

Google Fooled Everyone 

Google showcased the multi-modal capabilities of Gemini Ultra through a demo video. However, it later emerged that the video was staged.

The six-minute video uploaded by Google guides us through various examples where Gemini engages in fluent conversations, responding to queries and participating in activities such as playing games like rock-paper-scissors with a person. 

In the demo, everything appears to happen in real time, with Gemini responding quickly. However, the YouTube description of the video reads, “For the purposes of this demo, latency has been reduced and Gemini outputs have been shortened for brevity.”

In reality, the demonstration didn’t happen in real-time or with voice interaction. When Bloomberg reached out to Google about the video, a spokesperson explained that it was created “using still image frames from the footage, and prompting via text.” Simply put, they first gave pictures to Gemini, and then they wrote text prompts to get the output.

This is not the first time Google has tried to pull off something through marketing alone. In a recent move, it took a dig at AWS by displaying a Google Cloud ad on the Sphere in Las Vegas during AWS re:Invent.

However, Gemini Ultra isn’t out yet. Who knows, it might actually be better than GPT-4 by the time it comes out next year. Google can only hope that OpenAI doesn’t release GPT-5 by then.


Siddharth Jindal

Siddharth is a media graduate who loves to explore tech through journalism and putting forward ideas worth pondering about in the era of artificial intelligence.