Last updated January 31, 2023

ChatGPT, Pay Us!

While Google draws all the attention and the lawsuits, a new monster is quietly crawling news websites and eating their content, all while paying next to nothing to the publishers.


Published on January 31, 2023

by Lokesh Choudhary


From Google to TikTok to ChatGPT, internet users are finding newer ways to unearth information and experience entertainment. While users interact with these websites and apps directly, the platforms source their information from websites all over the internet. This raises a major copyright infringement issue.

Recently, India's minister of state in the ministry of electronics and information technology, Rajeev Chandrasekhar, said big-tech content aggregators should give a “fair share of revenues” to the digital platforms of print news publishers.

In the meantime, Google is being sued in US courts for monopolising digital advertising technology products. In the past too, several governments have taken action against Google over the non-payment of revenue to news publications. In August last year, US lawmakers introduced a revised Bill that would force Google and Facebook to support journalism by letting news organisations negotiate collectively with big-tech companies.

However, while Google draws all the attention and the lawsuits, a new monster is quietly crawling news websites and eating their content, while paying next to nothing to the publishers. Yes, it’s ChatGPT. Large language models like ChatGPT have come under heavy criticism for being trained on web content whose original sources are neither informed nor credited. That same data is then openly used by ChatGPT to answer queries.

Websites tackle this menace by implementing the Robots Exclusion Protocol (robots.txt), which tells web crawlers and other web robots which portions of the site they are allowed to visit. However, a blanket restriction also reduces a site's visibility on the internet, as it stops Google's crawlers from accessing the pages too.
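A robots.txt file does not have to be all-or-nothing, though. As a rough sketch, and not something any publisher or OpenAI is known to use, the Python snippet below feeds a hypothetical robots.txt to the standard library's urllib.robotparser. The crawler name ExampleAIBot, the domain news.example.com and the /private/ path are made-up assumptions for illustration. The file blocks one named crawler while leaving the rest of the site open to others such as Googlebot.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt. "ExampleAIBot" and the paths are illustrative
# assumptions, not the name or layout of any real crawler or site.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network request is made)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
story_url = "https://news.example.com/story/ai-copyright"
draft_url = "https://news.example.com/private/draft"

# The named crawler is shut out of the whole site.
print(parser.can_fetch("ExampleAIBot", story_url))  # False
# Other crawlers fall under the "*" group and can still read public stories.
print(parser.can_fetch("Googlebot", story_url))     # True
# The "*" group still keeps everyone out of /private/.
print(parser.can_fetch("Googlebot", draft_url))     # False
```

This only helps against crawlers that identify themselves and choose to honour the file; robots.txt is a convention, not an enforcement mechanism.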

When asked which websites it was trained on, ChatGPT says that it was trained on text from websites like BBC, CNN, The New York Times, The Guardian, etc. News websites cannot afford restrictive practices, as they depend heavily on views.

This raises a pertinent question: if ChatGPT is using the data from these news websites, should it not be paying them?

When asked if it was trained on Analytics India Magazine, the model answered in the affirmative and added that it continues to use our data in its responses.

Under Indian law, any original online content or feed in the form of text, image, video, or music is protected as a literary work under Section 14 of the Copyright Act. The law clearly states that no person can copy or publish the content without the permission of the original creator.

Microsoft-OpenAI lawsuit paving the way?

Recently, Microsoft and OpenAI were sued by Matthew Butterick, a lawyer and open-source programmer, over the creation of GitHub Copilot, an AI-powered coding assistant. Copilot has been trained on public repositories of code scraped off the web, many of which were allegedly published under licences. Many developers claim that GitHub Copilot reproduces their copyrighted code without any attribution or licence.

One such developer is Tim Davis, who also claims that Copilot reproduced his copyrighted code without giving due credit.

@github copilot, with "public code" blocked, emits large chunks of my copyrighted code, with no attribution, no LGPL license. For example, the simple prompt "sparse matrix transpose, cs_" produces my cs_transpose in CSparse. My code on left, github on right. Not OK. pic.twitter.com/sqpOThi8nf
— Tim Davis (@DocSparse) October 16, 2022

Many in the AI community argue that these kinds of lawsuits restrict the development of generative AI. Butterick, the plaintiff, on the other hand, said the case will allow people to come forward and find a better way to do it. In the early 2000s, Napster was loved by everyone because it offered free music, but it was illegal. Spotify and iTunes, according to Butterick, are good examples of finding a better way of doing it.

It’s not just Butterick who is suing OpenAI over Copilot. Companies like Getty Images are also suing Stability AI for unlawfully copying and processing millions of images that were protected by copyright. Getty Images says it provides licences for training AI models in a manner that respects personal and IP rights. Stability AI, however, did not seek any such licence, as per Getty Images.

TikTok shows the way for webpages

In April 2020, a body comprising several music studios, with Universal Studio in the lead, threatened TikTok with a copyright infringement suit for using their music on its platform without paying for the content. The body, the National Music Publishers' Association (NMPA), has a successful track record of suing platforms like YouTube and Spotify over copyright.

The rift between TikTok and music labels is still continuing, even though the former has agreed to pay a part of its revenue to the latter. YouTube, by comparison, paid the music industry around $6 billion between July 2021 and July 2022. Furthermore, since TikTok is expected to pocket revenues of around $12 billion in FY 2023, music labels want a bigger share of the pie.

Similarly, since LLMs like ChatGPT are trained on educational platforms, research papers, news publications, etc., their makers should be liable for the content the models are trained on. Moreover, if they can’t pay the creators, there should at least be an option for website owners to remove their content from the dataset.

Stability AI recently announced that it will allow artists to remove their work from the training dataset for the upcoming Stable Diffusion 3.0 release.
