
The Dark Consequence of AI’s Data Cannibalism

Eventually, the AI models producing today’s content will start training on data they themselves generated, leading to data cannibalism


AI is eating itself. The internet has become an AI dumping ground, and the models being trained on the web are feeding on their own kind. That’s data cannibalism.

In an article for The New Yorker, acclaimed science fiction author Ted Chiang drew attention to the perils of AI copies breeding copies, a kind of digital photocopying. He likens this burgeoning dilemma to the JPEG effect, where each subsequent copy degrades in quality, revealing a mosaic of unsightly artefacts. As the boundaries of AI replication blur, the point to ponder is: what happens as AI-generated content proliferates around the internet and AI models begin to train on it, instead of primarily human-generated content?

Recent findings by researchers from Britain and Canada show that generative AI models exhibit a phenomenon known as “model collapse”. This degenerative process occurs when models learn from data generated by other models, leading to a gradual loss of accurate representation of the true data distribution. Remarkably, the effect is deemed unavoidable, even in scenarios where the conditions for long-term learning are nearly ideal.
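The intuition can be conveyed with a toy simulation (an illustrative sketch, not the researchers’ actual experiment): repeatedly fit a simple Gaussian “model” to samples drawn from the previous generation’s model, rather than from the true distribution. Generation after generation, the estimated spread of the distribution withers away, with the rare tails forgotten first.

```python
import random
import statistics

# Toy illustration of "model collapse": each generation's "model" (here,
# just a Gaussian described by a mean and standard deviation) is fitted
# to samples drawn from the PREVIOUS generation's model instead of the
# true distribution. Estimation error compounds, and the fitted spread
# collapses, erasing the tails of the original distribution.
# Sample size and generation count are arbitrary choices for the demo.

random.seed(0)
N = 50              # samples drawn per generation
GENERATIONS = 2000  # how many times models train on model output

mu, sigma = 0.0, 1.0  # the "true" distribution: a standard normal
for gen in range(GENERATIONS):
    data = [random.gauss(mu, sigma) for _ in range(N)]
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)  # MLE std: biased low, drives collapse

print(f"std after {GENERATIONS} generations: {sigma:.6f}")  # far below 1.0
```

Real language models are vastly more complex than a two-parameter Gaussian, but the mechanism is analogous: low-probability events are undersampled at each generation, so successive models see less and less of them.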

Repercussions

According to Ross Anderson, a professor of security engineering at Cambridge University and co-author of the ‘model collapse’ research paper, the internet is at risk of being flooded with insignificant content, much as the oceans are littered with plastic waste. This flood of content could hinder the training of new AI models through web scraping, benefiting firms that have already amassed data or control large-scale human interfaces. He also pointed to the recent Internet Archive fiasco, in which AI startups aggressively mined the website for valuable training data, as further evidence of this concern.

According to a recent report by media research organisation NewsGuard, an alarming trend has emerged in which websites are being filled with AI-generated junk content to attract advertisers. The report reveals that over 140 prominent brands are unknowingly paying for advertisements displayed on websites running AI-written content. This growing volume of spammy AI-generated material poses a threat to the very AI companies responsible for these models: as training datasets become increasingly saturated with AI-produced content, concerns are being raised about the diminishing utility of language models.

“There are many other aspects that will lead to more serious implications, such as discrimination based on gender, ethnicity or other sensitive attributes,” said Ilia Shumailov, a research fellow at Oxford University’s Applied and Theoretical Machine Learning Group, especially if generative AI learns over time to produce, say, one race in its responses, while “forgetting” others exist.

Inclusivity All The Way

The current models are already in the bad books of AI ethicists for their lack of inclusivity. In 2021, a group of researchers warned about the white male problem in language models. Lead author Anders Søgaard, a professor at UCPH’s department of computer science, explains that these models exhibit systematic bias: surprisingly, they align best with the language used by white men under 40 with less education, while showing the weakest alignment with language from young, non-white men. This discovery emphasises the pressing need to address and rectify the biases within language models to ensure fairness and inclusivity for all.

Along similar lines, Shumailov said, “To stop model collapse, we need to make sure that minority groups from the original data get represented fairly in the subsequent datasets.”

Some companies are working towards a more inclusive AI. Meta, for instance, recently released ‘Casual Conversations v2‘, an open-sourced, consent-driven dataset of recorded monologues built to serve a broad spectrum of use cases. By offering researchers a robust resource, it empowers them to evaluate the performance of their models with greater depth.

Google, on the other hand, has been in the news for not-so-good reasons ever since renowned AI ethicist Timnit Gebru was fired, followed by the exit of the rest of her team, who called Google a ‘white tech organisation’.

Flawed, not useless

While language models have a long list of ethical defects, they come with an array of advantages, sometimes in unexpected places. Shumailov and his team originally called the model collapse effect ‘model dementia’, but decided to rename it after objections from a colleague. “We couldn’t think of a replacement until we asked Bard, which suggested five titles, of which we went for The Curse of Recursion,” he wrote.

Currently, language models are becoming part of every second company’s strategy. Firms in every sector are learning to unlock the full potential of advanced chatbots built on language models like GPT-4. While it is too early to judge, companies are sifting through the generative AI clutter, trying to figure out the best use cases for their business.


Tasmia Ansari

Tasmia is a tech journalist at AIM, looking to bring a fresh perspective to emerging technologies and trends in data science, analytics, and artificial intelligence.