Google and OpenAI are not shy about admitting that they need your data to improve Bard and ChatGPT, respectively. The Guardian recently reported that Google has argued that copyright law should be changed to allow generative AI systems to scrape the internet.
The company is urging Australian policymakers to endorse “copyright systems that enable appropriate and fair use of copyrighted content to enable the training of AI models in Australia on a broad and diverse range of data” while also offering an option to opt out for entities that prefer not to have their data used for AI training.
Meanwhile, amid debates over web scraping without consent, OpenAI introduced GPTBot, an automated website crawler. The bot is designed to collect publicly accessible data to train AI models, a process that OpenAI says will be carried out transparently and responsibly.
As generative AI grows more popular, so does its need for data. LLM-based chatbots such as ChatGPT and Google Bard depend on large volumes of text, images, and video.
OpenAI says that GPT-4 has learned from a variety of licensed, created, and publicly available data sources, which may include publicly available personal information. OpenAI has also recently filed a trademark application for GPT-5, whose success will depend on the quality of the data it is trained on, as well as on the computational power of GPUs.
However, how much data OpenAI and Google will be able to access remains uncertain. Their internet scraping practices are now widely known and are attracting considerable public opposition.
Human Content is the Lifeline
To sustain an LLM and make each version better than the last, companies need human content. The question is whether they should pay for it or simply take it from the internet. Given the current situation, it would not be surprising if human-generated content sells at a premium in the future.
OpenAI does not train GPT-4 only on human content; it recently started training GPT-4 on datasets created by ChatGPT. However, this cannot go on for long, as it will eventually lead to model collapse. This degenerative process takes place when models learn from data produced by other models. With each generation, the model gradually loses its accurate representation of the true data distribution.
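The degenerative process described above can be illustrated with a toy simulation. This is only a sketch under loud assumptions: the "model" is a single Gaussian rather than an LLM, "training" means re-estimating a mean and a spread, and the function name, sample size, and number of generations are all hypothetical choices made for illustration. Each generation is fitted only to samples produced by the previous one.

```python
import random
import statistics


def simulate_collapse(generations=200, sample_size=10, seed=0):
    """Repeatedly refit a Gaussian to samples drawn from the previous fit.

    Toy stand-in for model collapse: every value here is an illustrative
    assumption, not a description of how any real LLM is trained.
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0: the "true" human data distribution
    spreads = [std]
    for _ in range(generations):
        # Each generation sees only data produced by the previous model.
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        # "Training" is just re-estimating the mean and the spread.
        mean = statistics.fmean(samples)
        std = statistics.pstdev(samples)
        spreads.append(std)
    return spreads


spreads = simulate_collapse()
print(f"spread at generation 0: {spreads[0]:.3f}")
print(f"spread at generation {len(spreads) - 1}: {spreads[-1]:.3g}")
```

Because every refit is based on a small, finite sample, estimation errors compound, and the estimated spread tends to drift toward zero. The later "models" cover less and less of the range of the original distribution, which is the loss of representation the paragraph above describes.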
Anyone who has asked ChatGPT to write a poem can tell that it was trained on poetry books and essays of the highest quality. Unfortunately, OpenAI did not seek permission from the authors.
Last month, 8,000 authors, including Margaret Atwood, Viet Thanh Nguyen and Philip Pullman, signed a petition calling on artificial intelligence companies to stop using writers’ work without consent or credit. They argued that the hard work behind any form of art needs to be recognised and that credit should be given to the respective creator.
When it comes to copyright for AI-generated works, however, the question of ownership arises. Usually, the Copyright Act assigns initial ownership to the creators of the work. But because there have not been any legal or copyright office rulings on AI-made creations, it remains uncertain who the actual creators are.
Twist in the tale
At the moment, OpenAI and Google are playing it safe. They have shifted the onus of withholding data onto publishers. Google said that publishers should be able to opt out of having their work mined by generative AI. A Google spokesperson said the company wants a discussion around creating a community-developed standard, similar to the robots.txt system, that would let publishers opt out of having parts of their sites crawled.
In a similar vein, OpenAI mentioned in a blog post that if you do not want GPTBot to visit your website, you can block it by adding a GPTBot rule to your site’s robots.txt file. This means website owners must actively take a step to stop OpenAI from accessing their sites, rather than choosing to let it use their content for training. It is OpenAI’s first move to let people on the internet choose not to have their information used to train its large language models.
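As a rough sketch of what that opt-out looks like in practice, the snippet below holds a robots.txt that disallows the GPTBot user agent across an entire site and checks it with Python’s standard `urllib.robotparser` module. The `example.com` URL, the `SomeOtherBot` name, and the `is_allowed` helper are hypothetical and used only for illustration; the check shows how a crawler that honours robots.txt would read the file, not how GPTBot itself is implemented.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: block GPTBot everywhere, leave other crawlers alone.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if a compliant crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical page on a hypothetical site.
page = "https://example.com/articles/1"
print(is_allowed(ROBOTS_TXT, "GPTBot", page))        # False
print(is_allowed(ROBOTS_TXT, "SomeOtherBot", page))  # True
```

Note that robots.txt is a convention that crawlers choose to follow. It blocks nothing by itself, so the opt-out works only to the extent that the crawler respects it.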
The question is whether this is the right approach. Notably, users are asked to opt out rather than to opt in. This could be because LLM creators would find it hard to persuade individuals to compromise their privacy voluntarily.
OpenAI has also taken several measures to avoid legal tussles, such as its recent partnership with the Associated Press, which gives it real-time data that it can freely use to train its future models.
Fighting these firms for compensation might not yield any results, as there are no proper laws to back such claims and the process costs both time and money. So, if you depend on ChatGPT or Bard for tasks like composing emails or coding, giving up your data is the price you pay.