Ever since the release of ChatGPT, people have been obsessed with the LLM chatbot. There are users, doomers, and then there are developers and other AI companies that have been trying to build their own version of it. While OpenAI’s model is closed source, a lot of open source models have been giving developers hope to build something similar to probably the best chatbot in the market.
One of the emerging inexpensive methods to improve an open source language model such as LLaMa, Alpaca, or Self-Instruct is to fine-tune it on outputs from stronger proprietary systems like ChatGPT. It might seem like an efficient way to build up the weaker model’s capabilities to match the stronger model, but in practice it falls well short.
Researchers from UC Berkeley recently published a paper – The False Promise of Imitating Proprietary LLMs – which critically analysed the efficacy of this imitation approach. The researchers explain how training models using proprietary language models raises various legal and ethical concerns.
The technical improvement that this approach achieves is also negligible. According to the paper, fine-tuning a weaker model to improve its knowledge capabilities has little to no impact on the language model.
Pre-training is the main source of a language model’s capabilities. Thus, fine-tuning smaller models like LLaMa, Vicuna, Alpaca, and Self-Instruct on the output of large models like ChatGPT and Bard does not improve the knowledge of the model, because the base data remains unchanged. It only alters the style of the model.
Moreover, when trying to imitate the knowledge and capabilities of large models like ChatGPT through their outputs, weaker models also end up inheriting their flaws and biases.
Fine-tuning these models through imitation also means inheriting the design decisions of companies with closed AI models, such as ChatGPT, with no way to improve on those decisions directly, so the imitation models perform even worse than the originals. The models fail to improve on important aspects like factuality, coding, and problem solving.
Data imitation not a good idea
Big companies that have large base models like GPT-4 or Google’s PaLM have no reason to worry about imitation, as there is a huge gap between them and the models trying to imitate them. Companies that acquire large amounts of data, compute, and algorithmic advances are much more likely to maintain their competitive advantages.
Smaller companies that are trying to establish their moat by utilising off-the-shelf LLMs such as LLaMa or other open source offerings are at greater risk of being imitated. As explained earlier, fine-tuning on ChatGPT data improves a language model far less than a company building and improving its own pre-training data.
Still, companies such as OpenAI have never openly disclosed the data that their models are trained on. It is safe to say that a lot of it is just open internet data and not their proprietary data. The UC Berkeley paper said that it might be illegal to use this data, a claim that, as users on Hacker News pointed out, might not be correct.
The paper draws various conclusions that are not based on the current developments in open source software. Before the release of LLaMa, it was believed that the larger the model, the better it performs. But various models built on top of a smaller model like LLaMa are outperforming GPT-4 and PaLM.
What if data gets private?
Google’s leaked document, which said that neither Google nor OpenAI has a moat in AI, praised the open source community and, in many ways, Meta’s LLaMa. Even though Google and OpenAI do not hold proprietary rights to the data that their models are trained on, the future of language models might not stay that way, and it could get rather scary.
Recently, Mark Cuban, the American businessman and AI enthusiast, said that the next step for LLM-based chatbots from big companies is to buy private data and use it to build exclusive large knowledge models, instead of LLMs. These models would be able to perform better and act as a moat for the companies, as no one else would have access to the data.
“We ignore what created us; we adore what we create.” — Aleister Crowley, The Book of Lies
We already have an example: Elon Musk, the chief twit, stopped OpenAI from accessing Twitter’s data and threatened to file a lawsuit against OpenAI and Microsoft for still using it. Now, his new chatbot is in the pipeline, possibly trained on Twitter data that is exclusive to him. Not even the open source community can access it.
What this means for open source is that developers who rely on off-the-shelf language models like LLaMa would not be able to fine-tune them on the outputs of models built by the large companies.
Currently there are no legal or ethical issues around using data from ChatGPT or Bard to fine-tune your models as the data does not belong solely to the companies. But if Cuban is right, and companies start buying exclusive access to data, then the open source community would stop thriving.
These companies might not want to allow anyone else to compete, and with open source on the rise, restricting data access might be their biggest move. Currently, one of the important strengths of smaller models is their capability in specific use cases. But if big companies acquire exclusive rights to other companies’ data and make it private, they might be able to outperform smaller models easily.
No one reveals or releases their data sources, but the source is clearly the internet. If these giants start acquiring intellectual property rights to data from other companies, their models would become the most knowledgeable in the field, and the open source community might die.