Why Google Is Killing Itself

The company is scraping the web to build better AI models, but the move could backfire

Two weeks ago, Google quietly updated its privacy policy, disclosing its practice of mining public data from web sources to enhance its AI services such as Bard and Cloud. Since the internet is already contaminated with AI-generated junk, future AI models trained on web-scraped data will eventually perpetuate more biases, leading to flawed outputs. While Google is busy scraping the web, OpenAI seems to have charted a better route to accurate data for its models.

Google spokesperson Christa Muldoon asserted that the company has long maintained a transparent privacy policy on its use of publicly available data from the open web to train language models for services such as Google Translate. With the recent update, this practice has been extended to “newer services like Bard”. Muldoon emphasised that Google takes extensive measures to integrate privacy principles and safeguards into the development of its AI technologies, in line with its established AI principles.

Contrary to her statement, the policy revision for “publicly accessible sources” is not displayed prominently but is buried behind an embedded link within the “Your Local Information” tab of the privacy policy. Readers must click this link to reach the relevant section.

Bard Has Eyes

Google has been hoarding everyone’s data, and that is no secret. The company processes over 20 petabytes of data daily, but it hasn’t been without its share of legal skirmishes. The largest newspaper publisher in the US sued Google, claiming that advancements in AI have helped the search giant hold a monopoly over the digital ad market. Google’s AI search beta has also been labelled a “plagiarism engine” and accused of gulping down website traffic, leaving others to starve for attention.

While the change in privacy policy will help Google collect every chunk of data on its platforms, it also increases the risk that unfiltered, spam-laden datasets will be used to train future AI models. When it comes to collecting clean data, OpenAI seems to be a step ahead, judging by its recent partnerships with organisations such as Associated Press (AP), one of the biggest US news agencies, Shutterstock and Boston Consulting Group.

The partnership with AP is said to explore ways to develop AI to support local news, and in the process OpenAI will indirectly tie up with the 41 news agencies that AJP supports. Under the six-year partnership with Shutterstock, the Altman-run company will use Shutterstock’s images, videos and music from content creators to train its large language model.

OpenAI Is A Parasite 

OpenAI’s recent efforts to partner with media organisations, stock audio-visual providers and veteran consulting firms show its plan to obtain clean, first-hand source information for its datasets. In this case, Google can learn the art of harvesting data from OpenAI.

But OpenAI has been extremely cagey about where it got the data used to train GPT-4, the driving force behind the internet’s favourite, ChatGPT. Questions have been raised, yet the data theft issue currently sits in a legal grey area. No concrete solution has been proposed, but several countries around the world have taken steps towards stricter AI regulation.

NewsGuard, an information tracking site, has identified 50 websites as “almost entirely written by artificial intelligence software”. According to a new report from Europol, “Experts estimate that as much as 90 percent of online content may be synthetically generated by 2026,” a reference to the mass-produced AI junk on the internet and the models being trained on it.

‘Don’t believe everything you see on the internet’ has been standard advice for a while now. It’s high time that big tech companies like Google took their data seriously, as the effects of ignoring the issue will ripple outward and lead to a digital collapse.

Tasmia Ansari

Tasmia is a tech journalist at AIM, looking to bring a fresh perspective to emerging technologies and trends in data science, analytics, and artificial intelligence.