In response to the complexities and challenges of evaluating ever-evolving LLMs, influential AI startup Cohere has introduced a new evaluation framework called the Panel of LLM Evaluators (PoLL). PoLL leverages a panel of smaller models drawn from distinct model families to assess LLM outputs, promising a more accurate, less biased, and more cost-effective method than traditional single-model evaluations.
Traditional evaluations often use a single large model, such as GPT-4, to judge the quality of other models’ outputs. However, this method is not only costly but also prone to intra-model bias, where the evaluator tends to favour outputs that resemble its own style of generation.
PoLL addresses these challenges by assembling a panel of smaller models from different model families to evaluate LLM outputs. This setup reduces evaluation costs by over seven times compared to using a single large model and minimises bias through its varied model composition. The framework’s effectiveness has been validated across multiple settings, including single-hop QA, multi-hop QA, and competitive benchmarks like the Chatbot Arena.
Studies utilising PoLL have demonstrated a stronger correlation with human judgments compared to single-model evaluations. This suggests that a diverse panel can better capture the nuances of language that a single, large model might miss due to its broader, generalised training.
Methodology Behind PoLL
The PoLL consisted of models from three distinct families—GPT-3.5, CMD-R, and Haiku—each contributing a different perspective to the evaluation process. This diversity allows PoLL to offer a well-rounded assessment of LLM outputs, covering different aspects of language understanding and generation.
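The panel idea can be sketched in a few lines of code. The snippet below is a minimal illustration, not Cohere's implementation: it assumes each judge is a callable that returns a verdict string, and it aggregates the panel's verdicts by majority (max) vote. In practice each judge would be a prompted API call to a different model family, and the aggregation scheme may vary by task.

```python
from collections import Counter

def poll_judgment(output: str, reference: str, judges) -> str:
    """Aggregate verdicts from a panel of judge models by majority vote.

    `judges` is a list of callables; each compares a candidate output
    against a reference answer and returns a verdict string
    ("correct" / "incorrect"). In a real setup, each callable would
    wrap a call to a distinct model family.
    """
    verdicts = [judge(output, reference) for judge in judges]
    # Max voting: the verdict returned by the most judges wins.
    return Counter(verdicts).most_common(1)[0][0]

# Stand-in judges for illustration only (real ones would prompt an LLM):
exact_match = lambda out, ref: "correct" if out.strip() == ref.strip() else "incorrect"
contains = lambda out, ref: "correct" if ref.lower() in out.lower() else "incorrect"
always_no = lambda out, ref: "incorrect"

verdict = poll_judgment("Paris", "Paris", [exact_match, contains, always_no])
# Two of the three judges say "correct", so the panel verdict is "correct".
```

Because the final verdict requires agreement from multiple independent judges, a single evaluator's idiosyncratic preference cannot dominate the score, which is the intuition behind the bias reduction PoLL reports.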
The success of PoLL paves the way for more decentralised and diversified approaches to LLM evaluation. Future research could explore different combinations of models in the panel to further optimise accuracy and cost. Moreover, applying PoLL to other language processing tasks, such as summarisation or translation, could help establish its effectiveness across the field.