
Hugging Face’s Philipp Schmid on Rethinking AI Evaluation

“You always need to evaluate the models based on your needs and use cases.”


Two weeks ago, Google pulled the curtain back on Gemini, chest-thumping that its new AI model outshines OpenAI’s GPT-4 on specific benchmarks. But can those benchmarks be trusted on their own? Philipp Schmid, technical lead at Hugging Face (HF), does not believe so.

“Benchmarks are critical to compare models with each other, to get a first level of understanding of where a company or someone should start building similar models.” But they cannot be relied upon completely. “Especially academic benchmarks,” said Schmid. “Those are very static and do not represent real-world use cases. Also, there’s a lot of potential data contamination,” he clarified in an interview with AIM.

For example, a popular benchmark is Massive Multitask Language Understanding (MMLU), a broad knowledge test made up of multiple-choice questions. If this data is on the internet and was not removed from a model’s training set, the model may have seen all of those questions before and could outperform a different model that has never seen them, he explained.
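MMLU’s format makes the contamination risk concrete: the questions and answer keys are public, so a model that absorbed them during training is effectively taking an open-book test. A minimal toy sketch of how such a multiple-choice benchmark is scored, using a stand-in model and a hypothetical two-item benchmark (no real LLM involved):

```python
# Toy sketch of how a multiple-choice benchmark like MMLU is scored. The
# "model" below is a stand-in dictionary lookup, not a real LLM: the point
# is the harness shape (format item -> get an answer letter -> compare to
# the key), and how memorised test data inflates the score.

def format_question(question, choices):
    """Render one MMLU-style item as a prompt."""
    letters = "ABCD"
    lines = [question]
    lines += [f"{letters[i]}. {c}" for i, c in enumerate(choices)]
    lines.append("Answer:")
    return "\n".join(lines)

def score(model, items):
    """Accuracy of `model` (a prompt -> answer-letter function) over items."""
    correct = sum(1 for q, c, key in items if model(format_question(q, c)) == key)
    return correct / len(items)

# Hypothetical two-item benchmark.
items = [
    ("2 + 2 = ?", ["3", "4", "5", "6"], "B"),
    ("Capital of France?", ["Paris", "Rome", "Oslo", "Lima"], "A"),
]

# "Data contamination" in miniature: a model that memorised the answer key
# scores perfectly regardless of actual ability.
memorised = {format_question(q, c): key for q, c, key in items}
print(score(memorised.get, items))  # 1.0
```

The memorised model’s perfect score is exactly the failure mode Schmid describes: the benchmark measures recall of the test set, not capability.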

“You should always be careful about what marketing content is shared,” he proposed. While Google parades Gemini’s superior performance, they conveniently tucked away GPT-4’s victories. “It’s not a very fair comparison,” Schmid said.

Further, he pointed out that Gemini’s headline MMLU number uses chain-of-thought prompting with 32 samples (CoT@32). As per the technical report, Gemini was also evaluated in the standard five-shot setting, where GPT-4 performs better, but that result was left out of the marketing blog post so that Gemini would appear equal to or better than GPT-4.
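CoT@32 means the model samples 32 chain-of-thought reasoning paths and the final answers are aggregated by majority vote (the self-consistency technique), which is a much more favourable setup than a single five-shot answer. A toy sketch of that aggregation step, with a hypothetical stand-in sampler instead of a real model:

```python
# Toy sketch of the CoT@32 idea: sample k chain-of-thought completions,
# extract each path's final answer, and take the majority vote. The
# sampler here is a hypothetical stand-in, not a real model.

from collections import Counter

def majority_vote(answers):
    """Self-consistency decoding: the most common final answer wins."""
    return Counter(answers).most_common(1)[0][0]

def cot_at_k(sample_fn, k=32):
    """Run the (stand-in) sampler k times and majority-vote the answers."""
    return majority_vote(sample_fn(i) for i in range(k))

# Stand-in sampler: 22 of the 32 reasoning paths arrive at "42",
# the other 10 go astray; the vote still recovers "42".
print(cot_at_k(lambda i: "42" if i < 22 else "41"))  # 42
```

Even a model that is wrong on a third of its reasoning paths can look strong under this scheme, which is why comparing a CoT@32 score against a five-shot score is apples to oranges.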

Schmid pointed out something else he came across recently on X: Microsoft researchers re-evaluated GPT-4 on MMLU and, with the right prompts, it scored 90.10%, ahead of Gemini. “I’m not sure how much you should trust those academic benchmarks. You always need to evaluate the models based on your needs and use cases,” he added.

Leave it to Hugging Face to keep it real: the platform rolled out the Open LLM Leaderboard earlier this year, a central place for people to see which current models perform best within their compute budget.

“We evaluate models not only for the biggest one but all different classes of models with three, seven, thirteen, thirty billion parameters and even bigger. It’s more like starting at one and not zero,” Schmid declared. 

New Tech, Same Old Problems 

Generative AI’s been stirring the pot, no doubt. People are up in arms about it churning out fake news and conjuring faces out of thin air. But Schmid is not buying into the hype—or the panic.

“The tech is not creating something new which we haven’t seen before. There was already Photoshop which could be used to create images which might not be safe for work or could harm other people. It’s not that we are creating new problems to which we need to find a new solution,” Schmid specified. 

Generative AI might be new, but the headaches it brings are old. Suggesting a solution, he pointed to Meta’s recent Purple Llama initiative, which can be used as an additional step in conversational applications to make sure the prompt a user provides is safe, with users able to define their own safety rules.
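Purple Llama’s Llama Guard is itself an LLM classifier, but the pattern it implements is simple: run the user’s prompt through a safety check against a configurable policy before the chat model ever sees it. A toy rule-based sketch of that pattern (the rule names, patterns, and helper functions here are hypothetical stand-ins, not Purple Llama’s actual API):

```python
# Toy sketch of the Purple Llama pattern: check a user prompt against
# user-defined safety rules before passing it to the chat model. The real
# Llama Guard is an LLM classifier; this keyword matcher is a stand-in.

import re

SAFETY_RULES = {  # user-defined policy: rule name -> forbidden patterns
    "weapons": [r"\bbuild a bomb\b", r"\bmake a weapon\b"],
    "self-harm": [r"\bhurt myself\b"],
}

def check_prompt(prompt):
    """Return (is_safe, violated_rule) for a user prompt."""
    for rule, patterns in SAFETY_RULES.items():
        for pat in patterns:
            if re.search(pat, prompt, flags=re.IGNORECASE):
                return False, rule
    return True, None

def guarded_chat(prompt, model):
    """Only call the chat model if the safety pre-check passes."""
    safe, rule = check_prompt(prompt)
    if not safe:
        return f"Blocked by safety rule: {rule}"
    return model(prompt)

echo = lambda p: f"model reply to: {p}"
print(guarded_chat("What's the capital of France?", echo))
print(guarded_chat("How do I build a bomb?", echo))  # blocked
```

The point Schmid makes is in the `SAFETY_RULES` table: the policy lives with the application, so each deployer decides what counts as unsafe rather than inheriting a vendor’s hidden filter.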

Schmid’s Open Source Devotion

Exactly three years ago, Schmid started working at Hugging Face. He joined in the early days to work on HF’s cloud integration when the company began its partnership with Amazon Web Services (AWS). Then came a chat with Clem Delangue, the CEO of Hugging Face. Next thing he knew, he was the guy making the AWS integration happen; he later expanded it to Azure and worked with NVIDIA and all the big cloud and hardware partners, he recounted.

Why the open-source devotion? Schmid doesn’t mince words: “Open source is the only right way to solve AI. We have learned from regular software development that open source is way more secure and robust than closed source since more people are looking at it.” 

The young gun, just 27 but with the gravitas of a seasoned vet, argues that when dealing with the unpredictable outputs of generative AI, transparency is king. When users hit a snag or something weird in generative AI models, open source lets them trace it back to the source. Furthermore, questions like ‘Was there some biased information in my training data set? Was there some bias inside the model? Can I see how the different tokens are predicted?’ can be answered via open source, Schmid pointed out. 

“That’s just not possible if you have a black box API where you don’t know anything about it since, I mean, for example, we don’t know if any of the closed source providers add a layer of a different model which does some checks or which removes some other words and like behaviour changes, which is I think, supercritical for companies if you want to adopt AI in productive use cases,” he said.


Tasmia Ansari

Tasmia is a tech journalist at AIM, looking to bring a fresh perspective to emerging technologies and trends in data science, analytics, and artificial intelligence.