About five years ago, building a data moat was considered as important for a company as storing its crown jewels in a bank vault. Even in the pre-internet era, companies like IBM locked clients in with proprietary vendor data and intimate knowledge of their business requirements. A strong data moat is why Microsoft bought LinkedIn in 2016: it gave the software giant access to the data of more than 433 million members and a highly competitive social graph. It’s also why Twitter cut the cord on its free API. The irony is that, with time and successive innovations, the data land grab has come to mean almost nothing.
NEW: Prominent Google AI researcher resigned after warning Alphabet CEO Sundar Pichai and other senior execs that Bard—Google’s rival to ChatGPT—was *using data from ChatGPT*. Big no-no in that world. https://t.co/a5NeclJPK5 w/ @jon_victor_ pic.twitter.com/YEZqEqpzPS
— Amir Efrati (@amir) March 29, 2023
Cheaper spinoff models
Last week, a report by The Information revealed that Google’s Bard was trained on output from OpenAI’s landmark chatbot, ChatGPT. The output was said to be hosted on ShareGPT, a website where users share their conversations with ChatGPT. While Google has outright denied the allegations, the report noted that the engineer who had tried to warn Google against using OpenAI’s data, Jacob Devlin, had since resigned.
Over the past couple of weeks, the pace at which LLMs have been churned out has been hard to keep up with. Given how hot the AI race is, developers have taken to training their models on data generated by foundational models like ChatGPT, producing smaller, lighter models of their own.
Most high-performing large language models (LLMs) are closed-source and can only be accessed via paid APIs. However, the public release of LLaMA has recently challenged this trend. Here’s what you need to know about LLaMA… 🧵[1/7] pic.twitter.com/6nZjilarh6
— Cameron R. Wolfe, Ph.D. (@cwolferesearch) April 11, 2023
In the past few weeks, Meta’s LLaMA has given birth to a bunch of spinoffs. Alpaca fine-tuned LLaMA on data generated by OpenAI’s text-davinci-003. Koala, built by UC Berkeley researchers, was trained on data from ShareGPT, while Vicuna, a 13-billion-parameter model, was fine-tuned from a LLaMA base on conversations gathered from ShareGPT as well.
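For a sense of how such spinoffs are put together, here is a minimal sketch of the fine-tuning step using Hugging Face’s Transformers library. The base model, data file name and hyperparameters below are illustrative assumptions, not the actual Alpaca, Koala or Vicuna recipes.

```python
# A minimal sketch (not the actual Alpaca/Koala/Vicuna recipe):
# fine-tune a small causal LM on prompt/response pairs distilled
# from a ChatGPT-style model.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base_model = "facebook/opt-1.3b"  # stand-in for a LLaMA-style base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# A JSON file of {"prompt": ..., "response": ...} pairs, e.g. exported
# from ShareGPT-style conversations (file name is a placeholder).
data = load_dataset("json", data_files="sharegpt_style_pairs.json")["train"]

def tokenize(example):
    # Pack each pair into one instruction-following training sequence.
    text = (f"### Instruction:\n{example['prompt']}\n"
            f"### Response:\n{example['response']}")
    return tokenizer(text, truncation=True, max_length=512)

data = data.map(tokenize, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="distilled-model",
        per_device_train_batch_size=4,
        num_train_epochs=3,
        learning_rate=2e-5,
    ),
    train_dataset=data,
    # mlm=False gives the standard next-token (causal LM) objective
    # and copies input_ids into labels at batch time.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```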

On the one hand, this makes perfect sense for developers and researchers who want to build specialised, GPT-style applications that cost far less per inference. They can skip several tedious and costly steps in the process: there is no need to collect or label datasets. What use, then, is it to a market leader like OpenAI to spend resources building a benchmark model like GPT-4 on a large, labelled training dataset of its own? And what is the real significance of a data moat now?
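To make concrete what skipping those steps looks like, here is a hypothetical sketch using the OpenAI Python client as it existed at the time: rather than collecting and labelling data by hand, a developer queries the stronger model and saves its answers as training pairs. The seed instructions and output file name are placeholders.

```python
# Hypothetical sketch: building a training set by querying a stronger
# model instead of collecting and labelling data manually.
# (As discussed below, OpenAI's terms restrict using such output to
# train competing models.)
import json

import openai  # pre-1.0 client, current when this article was written

seed_instructions = [
    "Explain the difference between a list and a tuple in Python.",
    "Summarise the plot of Hamlet in two sentences.",
]

pairs = []
for instruction in seed_instructions:
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": instruction}],
    )
    pairs.append({
        "prompt": instruction,
        "response": completion["choices"][0]["message"]["content"],
    })

# Save in the same shape the fine-tuning sketch above consumes.
with open("sharegpt_style_pairs.json", "w") as f:
    json.dump(pairs, f, indent=2)
```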
Does the data moat even exist anymore?
Andrew Ng, co-founder and former head of Google Brain, questioned the defensibility of the data moat. “You may have spent a lot of effort to collect a large labelled training set, yet a competitor can use your model’s output to gain a leg up. This possibility argues that, contrary to conventional tech-business wisdom, data doesn’t always make your business more defensible,” he noted.
We may still not know exactly how much OpenAI spent, but estimates have been made. Rowan Curran, a Forrester analyst covering AI/ML, has stated that it could have cost OpenAI USD 40 million to process the millions of prompts fed into ChatGPT.
By comparison, Alpaca was built for a meagre USD 600, and Vicuna, which reportedly achieved 90% of ChatGPT’s quality, cost just USD 300 to train.
But the method itself raises questions around ethics, potential legal squabbles and engineering. Every model essentially becomes a clone of a clone, producing responses much like one another’s. In addition, OpenAI’s terms of use explicitly forbid ‘using output from the Services to develop models that compete with OpenAI’. Even though it makes perfect sense for others to sidestep the steep training costs of foundational models, it is akin to stealing a recipe instead of writing your own.

So, if developers are scraping ChatGPT outputs from a website like ShareGPT, do OpenAI’s conditions still apply? Or is the data simply fair game, a levelling of the playing field?
Even as these grey areas are worth mulling over, several experts and investors have essentially called building a data moat overrated. Silicon Valley VC firm Andreessen Horowitz has called reliance on a data moat an empty endeavour. In a blog post, Martin Casado and Peter Lauten, partners at the firm, wrote, “Treating data as a magical moat can misdirect founders from focusing on what is really needed to win.”
Not long ago, every AI startup founder was asked about their data moat by investors, almost as if building one were a standard requirement. But as with many other elements of AI, it depends: a data moat can protect, but only to a limited degree, and its effectiveness varies. For OpenAI, at the moment, the moat has long since been crossed.