Data Moats Have Fallen with GPT Clones

Andrew Ng, the co-founder and former head of Google Brain, questioned the defensibility of the data moat.

About five years ago, building a data moat was considered as important for a company as storing its crown jewels in a bank vault. Even in the pre-internet era, companies like IBM relied on vendor lock-in and intimate knowledge of their clients' business requirements. A strong data moat is why Microsoft bought LinkedIn in 2016 – the acquisition gave the software giant access to data on more than 433 million members and a highly competitive social graph. It's also why Twitter cut the cord on its free API. The irony is that, with time and successive innovations, the data land grab has come to mean very little.

Cheaper spinoff models

Last week, a report by The Information revealed that Google's Bard was trained on output from OpenAI's landmark chatbot ChatGPT. The output was said to be hosted on ShareGPT, a website where users share their conversations with ChatGPT. While Google has outright denied the allegation, the report mentioned that the engineer Jacob Devlin had left the company after trying to warn Google against using OpenAI's data.

In the past couple of weeks, LLMs have been churned out at a pace that is hard to keep up with. Given how heated the AI race is, developers have taken to training their models on data generated by foundational models like ChatGPT to produce smaller, lighter models of their own.

Meta's LLaMA has spawned a string of spinoffs. Stanford's Alpaca fine-tuned LLaMA on 52,000 instruction-following examples generated with OpenAI's text-davinci-003. Koala, built by researchers at UC Berkeley, was trained on data that included conversations from ShareGPT, while Vicuna, a 13-billion-parameter model, was fine-tuned from a LLaMA base on user-shared conversations gathered from ShareGPT as well.
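The recipe these spinoffs follow can be sketched in a few lines: collect (instruction, response) pairs from a stronger "teacher" model, then save them in the JSONL format commonly used for supervised fine-tuning. In the illustrative sketch below, `query_teacher` is a hypothetical stand-in for a real API call to a model like text-davinci-003; the function names and file format are assumptions for illustration, not any project's actual code.

```python
import json

def query_teacher(instruction: str) -> str:
    """Hypothetical stand-in for a call to a teacher model's API."""
    return f"Teacher answer to: {instruction}"

def build_distillation_set(seed_instructions, path="distill.jsonl"):
    """Collect teacher responses and write them as JSONL records."""
    records = []
    for instruction in seed_instructions:
        records.append({
            "instruction": instruction,
            "output": query_teacher(instruction),
        })
    # One JSON object per line -- a common format for fine-tuning data
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    return records

seeds = ["Explain what a data moat is.",
         "Summarise the AI race in one line."]
dataset = build_distillation_set(seeds)
print(len(dataset))  # 2 records ready for supervised fine-tuning
```

The resulting file would then feed a standard fine-tuning pipeline over a smaller base model, which is why the dataset-collection and labelling steps disappear from the spinoff's budget.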

Alpaca's training pipeline | Source: Stanford

On the one hand, this makes perfect sense for developers and researchers who want to build specialised, GPT-style applications that cost far less per inference. They can skip several tedious and costly steps in the process: there's no need to collect or label datasets. But then, what is the point of a market leader like OpenAI spending resources to build a benchmark model like GPT-4 on a large, labelled training dataset of its own? And what is the real significance of a data moat now?

Does the data moat even exist anymore?

Andrew Ng, the co-founder and former head of Google Brain, questioned the defensibility of the data moat. “You may have spent a lot of effort to collect a large labelled training set, yet a competitor can use your model’s output to gain a leg up. This possibility argues that, contrary to conventional tech-business wisdom, data doesn’t always make your business more defensible,” he noted.

We still don't know exactly how much OpenAI spent, but large estimates have been made. Rowan Curran, a Forrester analyst covering AI/ML, has estimated that it could have cost OpenAI USD 40 million to process the millions of prompts fed into ChatGPT.

By comparison, Alpaca was built for a meagre USD 600, and Vicuna, which reportedly achieves 90% of ChatGPT's quality, cost around USD 300 to train.
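A back-of-the-envelope calculation, using only the figures quoted above, makes the gap concrete:

```python
# Reported figures from the article: a Forrester estimate for OpenAI
# versus the published training costs of the spinoff models.
openai_cost = 40_000_000  # estimated cost of processing ChatGPT prompts, USD
alpaca_cost = 600         # reported Alpaca cost, USD
vicuna_cost = 300         # reported Vicuna training cost, USD

alpaca_ratio = openai_cost // alpaca_cost
vicuna_ratio = openai_cost // vicuna_cost
print(f"Alpaca: roughly {alpaca_ratio:,}x cheaper")  # roughly 66,666x cheaper
print(f"Vicuna: roughly {vicuna_ratio:,}x cheaper")  # roughly 133,333x cheaper
```

The two numbers measure different things (serving prompts versus one training run), so the ratios are indicative rather than exact, but the order-of-magnitude difference is the point.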

But the method itself raises questions around ethics, potential legal squabbles and engineering issues. Every model essentially becomes a clone of a clone, producing similar responses to the others. Moreover, OpenAI's terms of use explicitly forbid ‘using output from the Services to develop models that compete with OpenAI.’ Even if it makes perfect sense for others to sidestep the steep training costs of foundational models, it is akin to stealing a recipe instead of writing your own.

The assumed importance of a data moat | Source: Noy Shulman, Medium

So, if developers are scraping ChatGPT outputs from a website like ShareGPT, do OpenAI's conditions still apply? Or is the data simply fair game, and its use just a levelling of the playing field?

Even as these grey areas are worth mulling over, several experts and investors have called building a data moat overrated. Silicon Valley VC firm Andreessen Horowitz has called reliance on the data moat an empty endeavour. In a blog post, Martin Casado and Peter Lauten, partners at the firm, wrote: “Treating data as a magical moat can misdirect founders from focusing on what is really needed to win.”

Not long ago, every AI startup founder was asked about their data moat by investors, almost as if building one were a standard requirement. But as with many other things in AI, it depends. A data moat can protect, but only to a limited degree, and its effectiveness varies. For OpenAI, at any rate, the moat has already been crossed.


Poulomi Chatterjee
Poulomi is a Technology Journalist with Analytics India Magazine. Her fascination with tech and eagerness to dive into new areas led her to the dynamic world of AI and data analytics.
