Chinese artificial intelligence startup Baichuan Intelligent Technology recently introduced two open-source large language models, Baichuan 2-7B and Baichuan 2-13B.
What caught everyone’s eye was that Baichuan 2-13B performed better than ChatGPT on AGIEval, a benchmark created by Microsoft Research: ChatGPT scored 46.13 on AGIEval while Baichuan 2-13B scored 48.17. Word soon spread that Baichuan 2-13B beats ChatGPT on AGIEval.
“Baichuan 2. The 2nd iteration of the leading Chinese model is a major improvement. Baichuan 2-13B beats ChatGPT on AGIEval. Code: https://t.co/pwJIGDJ6nz Paper: https://t.co/j7WRMJ5O5u The effort is on a different scale. The paper detailing the process for creating…” — Yam Peleg (@Yampeleg), September 13, 2023
This isn’t new. Whenever a new foundational model arrives, its makers are quick to show how it measures up against ChatGPT. The real question here, though, is how Baichuan 2-13B managed to pull it off.
Language matters
The rankings of LLMs on benchmarks often depend on the training data they use, and AGIEval is no exception. AGIEval primarily assesses foundational models on standardised admission exams such as the SAT and the LSAT, along with various math competitions.
Surprisingly, the real reason for outperforming ChatGPT is that Baichuan 2-13B was trained on a Chinese-English bilingual dataset comprising several million webpages from hundreds of reputable websites representing various positive value domains, encompassing areas such as policy, law, vulnerable groups, general values, traditional virtues, and more.
Upon closer inspection of the AGIEval research paper, it becomes evident that, in addition to entrance exams like the SAT and the LSAT, the benchmark also covers Chinese entrance exams such as the Gaokao, and includes bilingual tasks in both Chinese and English.
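Because AGIEval’s aggregate score blends English and Chinese exam tasks, a model that is strong on the Chinese portions can lift its overall number even when its English performance is unremarkable. The minimal sketch below illustrates the arithmetic; the task names, record format and values are hypothetical and are not AGIEval’s actual output.

# Minimal sketch: an aggregate, AGIEval-style score blends languages.
# The records below are hypothetical; AGIEval's real tasks and result format differ.
from collections import defaultdict

results = [
    {"task": "sat-math",       "language": "en", "correct": False},
    {"task": "lsat-lr",        "language": "en", "correct": False},
    {"task": "gaokao-chinese", "language": "zh", "correct": True},
    {"task": "gaokao-physics", "language": "zh", "correct": True},
]

def accuracy(records):
    return sum(r["correct"] for r in records) / len(records) if records else 0.0

by_language = defaultdict(list)
for record in results:
    by_language[record["language"]].append(record)

print(f"overall accuracy: {accuracy(results):.0%}")     # the single headline number
for lang, records in sorted(by_language.items()):
    print(f"{lang} accuracy: {accuracy(records):.0%}")  # the split the headline hides

In this toy example the headline figure is 50%, yet every correct answer comes from the Chinese-language tasks, which is exactly the kind of detail a single aggregate score obscures.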
Open-source models like LLaMA and Llama 2, on the other hand, have focused primarily on English. For instance, the main data source for LLaMA is Common Crawl, which makes up 67% of LLaMA’s pre-training data but is filtered to English content only.
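To get a sense of what “filtered to English content only” means in practice, here is a rough sketch of language-based filtering using the langdetect package. It is illustrative only: LLaMA’s actual Common Crawl processing uses the CCNet pipeline with a fastText language classifier and additional quality filters, not this code.

# Rough sketch of English-only corpus filtering; illustrative, not LLaMA's pipeline.
from langdetect import detect                              # pip install langdetect
from langdetect.lang_detect_exception import LangDetectException

documents = [
    "Large language models are trained on web-scale text.",
    "大语言模型在海量网页文本上进行训练。",  # a Chinese page, dropped by an English-only filter
]

def keep_english_only(docs):
    kept = []
    for doc in docs:
        try:
            if detect(doc) == "en":
                kept.append(doc)
        except LangDetectException:
            # Text too short or ambiguous to classify; skip it.
            continue
    return kept

print(keep_english_only(documents))

A model pre-trained only on the surviving documents sees very little Chinese text, which goes a long way towards explaining a weaker showing on the Chinese and bilingual portions of AGIEval.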
As Baichuan is China-based, it has easy access to Chinese material to train its models. A recent report said that Chinese authorities have approved the requests of Baichuan Intelligent Technology and Zhipu AI to open their large language models to the public.
It appears Chinese authorities have not intervened to prevent these companies from accessing data from the Chinese internet, which is largely separate from the global internet used elsewhere.
Microsoft is Behind This
Microsoft, which created the AGIEval benchmark, says that evaluating the general abilities of foundational models to tackle human-level tasks is a vital aspect of their development and application in the pursuit of AGI.
The paper casually dismisses traditional benchmarks that rely on artificial datasets, saying they may not accurately represent human-level capabilities. Does that mean Baichuan 2-13B is closer to AGI than ChatGPT? If that is indeed the case, it would be a significant achievement.
However, on closer reflection, AGIEval is no different from any other benchmark: each one evaluates models against a particular dataset.
Beyond AGIEval, if we check Baichuan 2’s coding and math problem-solving abilities, it is way behind ChatGPT. So how can we conclude that AGIEval is the criterion for judging AGI?
Recently, Baidu also claimed that Ernie 3.5, the latest version of its Ernie AI model, had surpassed “ChatGPT in comprehensive ability scores” and outperformed “GPT-4 in several Chinese capabilities”. The Beijing-based company referred to a test conducted by the state newspaper China Science Daily, which included datasets like AGIEval and C-Eval.
Interestingly, Microsoft’s Orca had also claimed earlier this year that it performs better on AGIEval. Orca’s research paper specifically mentions that “evaluation benchmarks like AGIEval, which relies on standardized tests such as GRE, SAT, LSAT, etc., offer more robust evaluation frameworks”. However, if we dig into Orca’s dataset, we find that it, too, is trained on Chinese data.
Orca scored higher than ChatGPT and was nearly identical to text-davinci-003 in the AGIEval benchmark. However, Orca still significantly lags behind GPT-4 in these metrics.
— Tiz (@tatendampofu4) June 16, 2023
The marketing of Orca revolved around the AGIEval benchmark. Similarly, most of the foundational models that perform well on AGIEval have Chinese data in their training mix, which gives them an undue advantage. That isn’t fair to the other models out there.
In Conclusion
The performance of AI models on benchmarks like AGIEval is not solely indicative of their progress towards AGI. While models like Baichuan 2-13B have showcased impressive scores, the underlying advantage often lies in their training data, particularly access to specific Chinese internet content.
AGIEval’s focus on real-world tasks is commendable, but it is crucial to recognise that a broader spectrum of abilities is equally vital in assessing AGI. Can we really say that an LLM that passes the SAT, the LSAT, or any other exam is closer to AGI?