
Cohere Rolls Out Multi-Model Framework PoLL for Comprehensive LLM Evaluation

By leveraging a diverse panel of models, PoLL aligns more closely with human judgment and provides a scalable, cost-effective solution to the growing need for accurate LLM assessments.


In response to the challenges of evaluating ever-evolving LLMs, AI startup Cohere has introduced a new evaluation framework called the Panel of LLM Evaluators (PoLL). PoLL leverages a panel of smaller models drawn from distinct model families to assess LLM outputs, promising a more accurate, less biased, and more cost-effective method than traditional single-model evaluations.

Traditional evaluations often use a single large model, such as GPT-4, to judge the quality of other models' outputs. This approach is not only costly but also prone to intra-model bias, where the evaluator favours outputs produced by models similar to itself.

PoLL addresses these challenges by assembling a panel of smaller models from different model families to evaluate LLM outputs. This setup reduces evaluation costs by over seven times compared to using a single large model and minimises bias through its varied model composition. The framework’s effectiveness has been validated across multiple settings, including single-hop QA, multi-hop QA, and competitive benchmarks like the Chatbot Arena.

Studies utilising PoLL have demonstrated a stronger correlation with human judgments compared to single-model evaluations. This suggests that a diverse panel can better capture the nuances of language that a single, large model might miss due to its broader, generalised training.

Methodology Behind PoLL

The PoLL panel comprises models from three distinct families (GPT-3.5, Command R, and Claude 3 Haiku), each contributing a different perspective to the evaluation process. This diversity allows PoLL to offer a well-rounded assessment of LLM outputs, covering different aspects of language understanding and generation.
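The panel's individual verdicts have to be combined into a single judgment. A minimal sketch of how such aggregation might work is below, assuming binary correctness votes are combined by majority voting and numeric ratings by averaging; the function names and the example judge outputs are hypothetical, not Cohere's actual implementation.

```python
from collections import Counter
from statistics import mean

def poll_majority_vote(judgments):
    """Combine categorical verdicts (e.g. 'correct' / 'incorrect')
    from a panel of judge models by taking the most common vote."""
    return Counter(judgments).most_common(1)[0][0]

def poll_average_score(scores):
    """Combine numeric ratings from a panel of judge models
    by average pooling."""
    return mean(scores)

# Hypothetical verdicts from a three-judge panel
# (e.g. GPT-3.5, Command R, Claude 3 Haiku) on one answer:
votes = ["correct", "correct", "incorrect"]
print(poll_majority_vote(votes))   # majority verdict: "correct"

scores = [7, 8, 6]                 # hypothetical 1-10 ratings
print(poll_average_score(scores))  # pooled score: 7
```

Because each judge comes from a different model family, a single judge's idiosyncratic preference for outputs resembling its own is outvoted by the rest of the panel.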

The success of PoLL paves the way for more decentralised and diversified approaches to LLM evaluation. Future research could explore different combinations of models in the panel to further optimise accuracy and cost. Moreover, applying PoLL to other language processing tasks, such as summarisation or translation, could help establish its effectiveness across the field.

Shritama Saha

Shritama (she/her) is a technology journalist at AIM who is passionate about exploring the influence of AI on domains including fashion, healthcare and banking.