
The Dark Consequence of AI’s Data Cannibalism

Eventually, the AI models producing today’s content will start training on data they themselves generated, leading to data cannibalism


AI is eating itself. The internet has become an AI dumping ground, and the models being trained on the web are feeding on their own kind. That’s data cannibalism.

In an article for The New Yorker, acclaimed science fiction author Ted Chiang drew attention to the perils of AI copies breeding copies, a kind of digital photocopying. He likens this burgeoning dilemma to the JPEG effect, where each subsequent copy degrades in quality, revealing a mosaic of unsightly artefacts. As the boundaries of AI replication blur, the point to ponder is: what happens as AI-generated content proliferates around the internet and AI models begin to train on it, instead of primarily human-generated content?

Recent findings by researchers from Britain and Canada show that generative AI models exhibit a phenomenon known as “model collapse”. This degenerative process occurs when models learn from data generated by other models, leading to a gradual loss of accurate representation of the true data distribution. Remarkably, the effect is deemed unavoidable, even in scenarios where the conditions for long-term learning are nearly ideal.
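The intuition can be conveyed with a toy simulation (an illustrative sketch, not the researchers’ actual experiment): repeatedly fit a simple Gaussian “model” to samples drawn from the previous generation’s model, rather than from the true distribution. Generation after generation, the estimated spread of the distribution withers away, with the rare tails forgotten first.

```python
import random
import statistics

# Toy illustration of "model collapse": each generation's "model" (here,
# just a Gaussian described by a mean and standard deviation) is fitted
# to samples drawn from the PREVIOUS generation's model instead of the
# true distribution. Estimation error compounds, and the fitted spread
# collapses, erasing the tails of the original distribution.
# Sample size and generation count are arbitrary choices for the demo.

random.seed(0)
N = 50              # samples drawn per generation
GENERATIONS = 2000  # how many times models train on model output

mu, sigma = 0.0, 1.0  # the "true" distribution: a standard normal
for gen in range(GENERATIONS):
    data = [random.gauss(mu, sigma) for _ in range(N)]
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)  # MLE std: biased low, drives collapse

print(f"std after {GENERATIONS} generations: {sigma:.6f}")  # far below 1.0
```

Real language models are vastly more complex than a two-parameter Gaussian, but the mechanism is analogous: low-probability events are undersampled at each generation, so successive models see less and less of them.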

Repercussions

According to Ross Anderson, a professor of security engineering at Cambridge University and co-author of the ‘model collapse’ research paper, the internet is at risk of being flooded with insignificant content, much as the oceans are littered with plastic waste. This flood of content could hinder the training of new AI models through web scraping, benefiting firms that have already amassed data or control large-scale human interfaces. He also pointed to the recent Internet Archive fiasco, in which AI startups aggressively mined the website for valuable training data, as further evidence of this concern.

According to a recent report by media research organisation NewsGuard, an alarming trend has emerged in which websites are being filled with AI-generated junk content to attract advertisers. The report reveals that over 140 prominent brands are unknowingly paying for advertisements displayed on websites running AI-written content. This growing volume of spammy AI-generated material poses a threat to the very AI companies responsible for these models: as training datasets become increasingly saturated with AI-produced content, concerns are being raised about the diminishing utility of language models.

“There are many other aspects that will lead to more serious implications, such as discrimination based on gender, ethnicity or other sensitive attributes,” said Ilia Shumailov, a research fellow at Oxford University’s Applied and Theoretical Machine Learning Group, especially if generative AI learns over time to produce, say, one race in its responses, while “forgetting” others exist.

Inclusivity All The Way

The current models are already in the bad books of AI ethicists for their lack of inclusivity. In 2021, a group of researchers warned about the white male problem in language models. Lead author Anders Søgaard, a professor at UCPH’s department of computer science, explains that these models exhibit systematic bias: surprisingly, they align best with the language used by white men under 40 with less education, while showing the weakest alignment with language from young, non-white men. This discovery emphasises the pressing need to address and rectify the biases within language models to ensure fairness and inclusivity for all.

Along similar lines, Shumailov said, “To stop model collapse, we need to make sure that minority groups from the original data get represented fairly in the subsequent datasets.”

Some companies are working towards a more inclusive AI. Meta, for instance, recently released ‘Casual Conversations v2‘, an open-sourced, consent-driven dataset of recorded monologues built to serve a broad spectrum of use cases. By offering researchers a robust resource, it empowers them to evaluate the performance of their models with greater depth.

Google, on the other hand, has been in the news for not-so-good reasons ever since renowned AI ethicist Timnit Gebru was fired, followed by the exit of the rest of her team, who called Google a ‘white tech organisation’.

Flawed, not useless

While language models have a long list of ethical defects, they come with an array of advantages, sometimes in unexpected places. Shumailov and his team originally called the model collapse effect ‘model dementia’, but decided to rename it after objections from a colleague. “We couldn’t think of a replacement until we asked Bard, which suggested five titles, of which we went for The Curse of Recursion,” he wrote.

Currently, language models are becoming part of every second company’s strategy. Firms in every sector are learning to unlock the full potential of advanced chatbots built on language models like GPT-4. While it is too early to judge, companies are sifting through the generative AI clutter, trying to figure out the best use cases for their business.


Tasmia Ansari

Tasmia is a tech journalist at AIM, looking to bring a fresh perspective to emerging technologies and trends in data science, analytics, and artificial intelligence.