Bard and ChatGPT Will Die If You Don’t Help Them

To sustain an LLM and make it better than its previous version, it needs human-generated content
Soon, human-generated content is going to sell at a big premium
Google and OpenAI aren’t shying away from admitting that they need your data, by whatever means, to improve Bard and ChatGPT, respectively. Recently, The Guardian reported that Google wants copyright law altered to allow generative AI systems to scrape the internet.

The company is urging Australian policymakers to endorse “copyright systems that enable appropriate and fair use of copyrighted content to enable the training of AI models in Australia on a broad and diverse range of data” while also offering an option to opt out for entities that prefer not to have their data used for AI training.

On the other hand, in the midst of debates surrounding web scraping without consent, OpenAI introduced GPTBot, an automated website crawler. The bot is designed to collect publicly accessible data to train AI models, a process that OpenAI assures will be executed transparently and responsibly.


As generative AI gains popularity, the need for data keeps growing. LLM-based chatbots such as ChatGPT and Google Bard depend on vast amounts of text, images, and videos.

OpenAI says that GPT-4 learns from a wide variety of licensed, curated, and publicly available data sources. The company has also recently filed a trademark application for GPT-5, whose success will depend on the quality of the data it is trained on as much as on the computational power of GPUs.

However, continued access to that data is uncertain for both OpenAI and Google, as awareness of their internet-scraping practices spreads and public opposition grows.

Human Content is the Lifeline

To sustain an LLM and make it better than its previous version, it needs human-generated content. The problem arises in deciding whether companies should pay for that content or simply take it from the internet. Going by the current scenario, it would not be surprising if human-generated content sells at a premium in the future.

OpenAI does not train GPT-4 on human content alone; it recently started training GPT-4 on datasets created by ChatGPT. That cannot go on for long, however, because it eventually leads to model collapse, a degenerative process in which models learn from data produced by other models and gradually lose an accurate representation of the true data distribution.
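
The dynamic is easy to reproduce in miniature. Here is a minimal sketch, assuming a toy Gaussian "model" in place of an LLM; the distribution, sample sizes, and generation count are illustrative assumptions, not anything OpenAI actually does:

    # Toy illustration of model collapse: each "generation" is fitted only
    # to samples drawn from the previous generation's fitted model.
    # (Illustrative only; real LLM training is far more complex.)
    import numpy as np

    rng = np.random.default_rng(0)
    human_data = rng.normal(loc=0.0, scale=1.0, size=50)  # stand-in for human-written data
    mu, sigma = human_data.mean(), human_data.std()        # first "model" fitted on human data

    for generation in range(1, 31):
        synthetic = rng.normal(mu, sigma, size=50)          # data produced by the current "model"
        mu, sigma = synthetic.mean(), synthetic.std()       # next "model" fitted on synthetic data only
        print(f"generation {generation:2d}: mu = {mu:+.3f}, sigma = {sigma:.3f}")

    # sigma tends to drift and shrink across generations, so the fitted
    # distribution gradually stops representing the original human data.

Because every generation sees only the previous generation's output, errors compound and the fitted spread drifts away from the true distribution, which is the essence of model collapse.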

Anyone asking ChatGPT to write poems can easily tell that it was trained on poetry books and essays of the highest calibre. Unfortunately, OpenAI did not take permission from the authors.

Last month, 8,000 authors, including Margaret Atwood, Viet Thanh Nguyen, and Philip Pullman, signed a petition calling on artificial intelligence companies to stop using writers’ work without consent or credit. They argued that the hard work behind any form of art needs to be acknowledged and that credit should go to the respective creator.

However, when it comes to copyright for AI-generated works, the question of ownership arises. The Copyright Act usually assigns initial ownership to the creators of a work, but because there have been no legal or copyright-office rulings on AI-made creations, there is still uncertainty about who the actual creators are.

Twist in the Tale

At the moment, OpenAI and Google are playing it safe, transferring the onus of sharing data onto publishers. Google said that publishers should be able to opt out of having their work mined by generative AI. A Google spokesperson said the company wants a discussion around a community-developed standard, similar to the robots.txt system, that would let publishers opt out of having parts of their sites crawled.

In a similar vein, OpenAI mentioned in a blog post that if you don’t want GPTBot to visit your website, you can prevent it by adding GPTBot to your site’s robots.txt file. This means website owners must actively take a step to stop OpenAI from accessing their sites, rather than choosing to let OpenAI use their content for training. It is OpenAI’s first move to let people on the internet choose not to have their information used to train its large language models.
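
In practice, the opt-out OpenAI describes comes down to two lines in a site’s robots.txt file, per its GPTBot documentation:

    User-agent: GPTBot
    Disallow: /

Standard robots.txt rules still apply, so site owners can also restrict GPTBot to specific directories with Allow and Disallow entries instead of blocking it outright.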

The question is whether this is the right approach. Notably, users are asked to opt out rather than opt in, likely because LLM makers would struggle to persuade individuals to volunteer their data and compromise their privacy.

OpenAI has also taken measures to avoid legal tussles, such as recently partnering with the Associated Press to license news content that can be freely used to train its future models.

Fighting these firms for compensation might not yield results, as there are no proper laws to back such claims, and litigation consumes time and money. So, if you depend on ChatGPT or Bard for tasks like composing emails or writing code, the trade-off is that your data is the price you pay.

Siddharth Jindal
Siddharth is a media graduate who loves to explore tech through journalism and putting forward ideas worth pondering about in the era of artificial intelligence.
