The development of large language generation models is one of the most exciting fields to be in right now, as they are used across a diverse range of applications: better customer service, chatbots and virtual assistants, enhanced gaming experiences, improved search engines, and more. Big names such as Meta, Google, Microsoft, and NVIDIA are investing time, energy and money in building large language generation models. Innovation leader DeepMind, which has had path-breaking innovations like AlphaFold, AlphaFold 2, and Enformer in the past, has also come out with something notable in the language model space. It has introduced a 280 billion parameter transformer language model called Gopher.
DeepMind’s research went on to say that Gopher almost halves the accuracy gap from GPT-3 to human expert performance and exceeds forecaster expectations. It stated that Gopher lifts performance over current state-of-the-art language models across roughly 81% of tasks containing comparable results. The improvement is most notable in knowledge-intensive domains like fact-checking and general knowledge.
DeepMind said that larger models are more likely to generate toxic responses when provided with toxic prompts, but they can also classify toxicity more accurately. Model scale does not significantly improve results for areas like logical reasoning and common-sense tasks. The research team found that the capabilities of Gopher exceed those of existing language models on a number of key tasks. These include the Massive Multitask Language Understanding (MMLU) benchmark, where Gopher demonstrates a significant advancement towards human expert performance over prior work.
Along with Gopher, DeepMind has also released two other papers. One deals with the study of ethical and social risks associated with large language models, and the second investigates a new architecture with better training efficiency.
In a lengthy 118-page paper, DeepMind takes a deep dive into what Gopher actually is. According to the paper, DeepMind trained the Gopher family of models on MassiveText, a collection of large English-language text datasets from diverse sources such as web pages, books, news articles, and code. The data pipeline includes text quality filtering, removal of repetitious text, deduplication of similar documents, and removal of documents with significant test-set overlap. The team found that successive stages of this pipeline improve downstream language model performance, emphasising the importance of dataset quality.
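To make the four pipeline stages concrete, here is a minimal Python sketch. It is not DeepMind's implementation: the function names, the thresholds, the use of exact hashing in place of similar-document deduplication, and the 3-word overlap check against test data are all simplifying assumptions made for illustration.

```python
import hashlib


def word_ngrams(text, n):
    """Return the set of lowercase word n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def passes_quality(text, min_words=5):
    """Toy quality filter (assumed rule): keep documents with enough words."""
    return len(text.split()) >= min_words


def is_repetitious(text, max_dup_fraction=0.5):
    """Toy repetition check (assumed rule): too many duplicated lines."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return True
    dup_fraction = 1 - len(set(lines)) / len(lines)
    return dup_fraction > max_dup_fraction


def overlaps_test_set(text, test_ngrams, n=3):
    """Flag a document that shares any word n-gram with the test data."""
    return bool(word_ngrams(text, n) & test_ngrams)


def clean_corpus(docs, test_docs, n=3):
    """Apply the four stages in order and return the surviving documents."""
    test_ngrams = set()
    for t in test_docs:
        test_ngrams |= word_ngrams(t, n)

    seen = set()
    kept = []
    for doc in docs:
        if not passes_quality(doc):
            continue
        if is_repetitious(doc):
            continue
        # Exact-match deduplication on whitespace- and case-normalised text.
        # This is a stand-in for deduplication of *similar* documents.
        normalised = " ".join(doc.lower().split())
        key = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        if key in seen:
            continue
        if overlaps_test_set(doc, test_ngrams, n):
            continue
        seen.add(key)
        kept.append(doc)
    return kept
```

Calling clean_corpus(docs, test_docs) on a list of raw strings returns only the documents that survive every stage. A real pipeline would use far more careful heuristics at each step.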
MassiveText contains 2.35 billion documents or about 10.5 TB of text. The research team added, “Since we train Gopher on 300B tokens (12.8% of the tokens in the dataset), we subsample from MassiveText with sampling proportions specified per subset (books, news, etc.). We tune these sampling proportions to maximise downstream performance.”
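The arithmetic behind these figures, and the idea of sampling by per-subset proportions, can be sketched in a few lines of Python. Only the 300B token count, the 12.8% share, and the 27% figure for books come from the reporting on the paper. The other proportions and all function names below are placeholders assumed for illustration, and the per-subset budget assumes the proportions are measured in tokens.

```python
import random


def implied_total_tokens(train_tokens, fraction):
    """Dataset size implied by training on `fraction` of its tokens."""
    return train_tokens / fraction


def token_budget(train_tokens, proportions):
    """Split the training token count across subsets by sampling proportion."""
    total = sum(proportions.values())
    return {name: train_tokens * p / total for name, p in proportions.items()}


def sample_subsets(proportions, k, seed=0):
    """Draw k subset names at random, weighted by sampling proportion."""
    rng = random.Random(seed)
    names = list(proportions)
    weights = [proportions[name] for name in names]
    return rng.choices(names, weights=weights, k=k)


# Placeholder proportions: only the books figure (27%) is reported in the
# article; the remaining values are assumptions chosen to sum to 1.
PROPORTIONS = {"books": 0.27, "web": 0.50, "news": 0.13, "code": 0.10}

total_tokens = implied_total_tokens(300e9, 0.128)
budget = token_budget(300e9, PROPORTIONS)
draws = sample_subsets(PROPORTIONS, k=10)
```

With these numbers, 300B tokens at 12.8% implies a dataset of roughly 2.34 trillion tokens, and a 27% share for books corresponds to about 81 billion training tokens.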
The War of the Large Language Models
2021 has been a revolutionary year for the development of large language models.
GPT-3, the autoregressive language model from San Francisco-based artificial intelligence research laboratory OpenAI, is widely regarded as path-breaking in the field of language generation. Launched last year, the full version of GPT-3 has a massive 175 billion parameters. Other tech giants have also been paying attention to this field and stepping up their game. AI21 Labs released Jurassic-1, which has 178 billion parameters. Gopher is larger than both of them and stands at a whopping 280 billion parameters.
But it is definitely not the largest. Microsoft and NVIDIA teamed up earlier this year to bring out the Megatron-Turing Natural Language Generation (MT-NLG) model with an astounding 530 billion parameters. Google has developed and benchmarked Switch Transformers, a technique for training language models with over a trillion parameters. The Chinese government-backed Beijing Academy of Artificial Intelligence (BAAI) has introduced Wu Dao 2.0 with 1.75 trillion parameters.
Who Wins the Race?
In the research paper, DeepMind compares Gopher with existing models. The paper reports that Gopher outperforms the current state-of-the-art on 100 tasks (81% of all tasks). The baselines include large language models such as GPT-3 (175 billion parameters), Jurassic-1 (178 billion parameters), and Megatron-Turing NLG (530 billion parameters). The team found that Gopher showed the most uniform improvement across the reading comprehension, humanities, ethics, STEM and medicine categories. It also displayed a general improvement on fact-checking. The general trend is less improvement in reasoning-heavy tasks (say, Abstract Algebra) and a larger and more consistent improvement in knowledge-intensive tests (say, General Knowledge).
For language model benchmarks, the paper breaks down the relative performance of Gopher versus the current state-of-the-art models, the 178 billion parameter Jurassic-1 and the 175 billion parameter GPT-3. Gopher does not outperform the state-of-the-art on 8 of 19 tasks, and it under-performs on Ubuntu IRC and DM Mathematics in particular. This may be due to a poor tokeniser representation for numbers. Gopher demonstrates improved modelling on 11 of 19 tasks, in particular books and articles. The paper suggests this may be due to the heavy use of book data in MassiveText (a sampling proportion of 27%, compared to 16% in GPT-3).
Too Early to Know How Impactful Gopher Can Be
Gopher has created the same kind of buzz that GPT-3 did around its launch.
But GPT-3, though described as revolutionary by some, was also criticised by well-known tech leaders. We are yet to see whether Gopher will draw similar criticism from the tech world. It is too early to tell at the moment, as the model has just been introduced.
As more and more large language models are developed, the need of the hour is faster progress on interpretability tools and data quality, so that the models can be better understood. Only then can such models be used for the benefit of society.