About five years ago, building a data moat was considered as important for a company as storing its crown jewels in a bank vault. Even in the pre-internet era, companies like IBM locked clients in with proprietary vendor data and intimate knowledge of their business requirements. A strong data moat is why Microsoft bought LinkedIn in 2016: it gave the software giant access to the data of more than 433 million members and a highly competitive social graph. It’s also why Twitter cut the cord on its free API. The irony is that, with time and successive innovations, the data land grab has come to mean almost nothing.
NEW: Prominent Google AI researcher resigned after warning Alphabet CEO Sundar Pichai and other senior execs that Bard—Google’s rival to ChatGPT—was *using data from ChatGPT*. Big no-no in that world. https://t.co/a5NeclJPK5 w/ @jon_victor_ pic.twitter.com/YEZqEqpzPS
— Amir Efrati (@amir) March 29, 2023
Cheaper spinoff models
Last week, a report by The Information revealed that Google’s Bard was trained on output from OpenAI’s landmark chatbot, ChatGPT. The output was said to be hosted on ShareGPT, a website where users share their conversations with ChatGPT. While Google has outright denied the allegations, the report noted that the engineer who had tried to warn Google against using OpenAI’s data, Jacob Devlin, had since resigned.
Over the past couple of weeks, the pace at which LLMs have been churned out has been hard to keep up with. Given how hot the AI race is, developers have taken to training their models on data generated by foundational models like ChatGPT, producing smaller, lighter models of their own.
Most high-performing large language models (LLMs) are closed-source and can only be accessed via paid APIs. However, the public release of LLaMA has recently challenged this trend. Here’s what you need to know about LLaMA… 🧵[1/7] pic.twitter.com/6nZjilarh6
— Cameron R. Wolfe, Ph.D. (@cwolferesearch) April 11, 2023
In the past few weeks, Meta’s LLaMA has given birth to a bunch of spinoffs. Alpaca fine-tuned LLaMA on data generated by OpenAI’s text-davinci-003. Koala, built by UC Berkeley researchers, was trained on data from ShareGPT, while Vicuna, a 13-billion-parameter model, was fine-tuned from a LLaMA base on conversations gathered from ShareGPT as well.
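For a sense of how such spinoffs are put together, here is a minimal sketch of the fine-tuning step using Hugging Face’s Transformers library. The base model, data file name and hyperparameters below are illustrative assumptions, not the actual Alpaca, Koala or Vicuna recipes.

```python
# A minimal sketch (not the actual Alpaca/Koala/Vicuna recipe):
# fine-tune a small causal LM on prompt/response pairs distilled
# from a ChatGPT-style model.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base_model = "facebook/opt-1.3b"  # stand-in for a LLaMA-style base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# A JSON file of {"prompt": ..., "response": ...} pairs, e.g. exported
# from ShareGPT-style conversations (file name is a placeholder).
data = load_dataset("json", data_files="sharegpt_style_pairs.json")["train"]

def tokenize(example):
    # Pack each pair into one instruction-following training sequence.
    text = (f"### Instruction:\n{example['prompt']}\n"
            f"### Response:\n{example['response']}")
    return tokenizer(text, truncation=True, max_length=512)

data = data.map(tokenize, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="distilled-model",
        per_device_train_batch_size=4,
        num_train_epochs=3,
        learning_rate=2e-5,
    ),
    train_dataset=data,
    # mlm=False gives the standard next-token (causal LM) objective
    # and copies input_ids into labels at batch time.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```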

On the one hand, this makes perfect sense for developers and researchers who want to build specialised, GPT-style applications that cost far less per inference. They can skip several tedious and costly steps in the process: there is no need to collect or label datasets. What use, then, is it to a market leader like OpenAI to spend resources building a benchmark model like GPT-4 on a large, labelled training dataset of its own? And what is the real significance of a data moat now?
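To make concrete what skipping those steps looks like, here is a hypothetical sketch using the OpenAI Python client as it existed at the time: rather than collecting and labelling data by hand, a developer queries the stronger model and saves its answers as training pairs. The seed instructions and output file name are placeholders.

```python
# Hypothetical sketch: building a training set by querying a stronger
# model instead of collecting and labelling data manually.
# (As discussed below, OpenAI's terms restrict using such output to
# train competing models.)
import json

import openai  # pre-1.0 client, current when this article was written

seed_instructions = [
    "Explain the difference between a list and a tuple in Python.",
    "Summarise the plot of Hamlet in two sentences.",
]

pairs = []
for instruction in seed_instructions:
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": instruction}],
    )
    pairs.append({
        "prompt": instruction,
        "response": completion["choices"][0]["message"]["content"],
    })

# Save in the same shape the fine-tuning sketch above consumes.
with open("sharegpt_style_pairs.json", "w") as f:
    json.dump(pairs, f, indent=2)
```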
Does the data moat even exist anymore?
Andrew Ng, co-founder and former head of Google Brain, questioned the defensibility of the data moat. “You may have spent a lot of effort to collect a large labelled training set, yet a competitor can use your model’s output to gain a leg up. This possibility argues that, contrary to conventional tech-business wisdom, data doesn’t always make your business more defensible,” he noted.
We may still not know exactly how much OpenAI spent, but estimates have been made. Rowan Curran, a Forrester analyst covering AI/ML, has stated that it could have cost OpenAI USD 40 million to process the millions of prompts fed into ChatGPT.
By comparison, Alpaca was built for a meagre USD 600, and Vicuna, which reportedly achieved 90% of ChatGPT’s quality, cost just USD 300 to train.
But the method itself raises questions around ethics, potential legal squabbles and engineering. Every model essentially becomes a clone of a clone, producing responses much like one another’s. In addition, OpenAI’s terms of use explicitly forbid ‘using output from the Services to develop models that compete with OpenAI’. Even though it makes perfect sense for others to sidestep the steep training costs of foundational models, it is akin to stealing a recipe instead of writing your own.

So, if developers are scraping ChatGPT outputs from a website like ShareGPT, do OpenAI’s conditions still apply? Or is the data simply fair game, a levelling of the playing field?
Even as these grey areas are worth mulling over, several experts and investors have essentially called building a data moat overrated. Silicon Valley VC firm Andreessen Horowitz has called reliance on a data moat an empty endeavour. In a blog post, Martin Casado and Peter Lauten, partners at the firm, wrote, “Treating data as a magical moat can misdirect founders from focusing on what is really needed to win.”
Not long ago, every AI startup founder was asked about their data moat by investors, almost as if building one were a standard requirement. But as with many other elements of AI, it depends: a data moat can protect, but only to a limited degree, and its effectiveness varies. For OpenAI, at the moment, the moat has long since been crossed.