Two weeks ago, Google quietly updated its privacy policy to disclose its practice of mining publicly available data from the web to enhance AI services such as Bard and Cloud AI. Since the internet is already contaminated with AI-generated junk, future AI models trained on web-scraped data will perpetuate more biases, leading to flawed outputs. While Google is busy scraping the web, OpenAI seems to have charted a better route to accurate data for its models.
Google spokesperson Christa Muldoon asserted that the company has long maintained a transparent privacy policy concerning the use of publicly available data from the open web to train language models for services such as Google Translate. In a recent update, this practice has been extended to “newer services like Bard”. Muldoon emphasised that Google takes extensive measures to integrate privacy principles and safeguards into the development of its AI technologies, in line with its established AI principles.
Contrary to her statement, the policy revision concerning “publicly accessible sources” is not displayed prominently but buried under an embedded link within the “Your Local Information” tab of the privacy policy. Only by clicking through this link can users reach the relevant section.
Bard Has Eyes
Google has been hoarding everyone’s data, and that is no secret. The company processes over 20 petabytes of data daily, but it hasn’t been without its share of legal skirmishes. The largest newspaper publisher in the US sued Google, claiming that advancements in AI have helped the search giant hold a monopoly over the digital ad market. Google’s AI search beta has also been labelled a “plagiarism engine”, accused of gulping down website traffic and leaving others to starve for attention.
While the change in privacy policy will help Google collect every chunk of data across its platforms, it also raises the risk of unfiltered, spam-ridden datasets being used to train future AI models. In terms of collecting clean data, OpenAI seems to be a step ahead, judging by its recent partnerships with organisations like the Associated Press (AP), one of the biggest US news agencies, Shutterstock, and Boston Consulting Group.
The partnership with AP is said to explore ways to develop AI that supports local news, and in the process, OpenAI will indirectly tie up with the 41 news organisations that AJP supports. Under the six-year partnership with Shutterstock, the Altman-run company will use the platform’s images, videos, and music from content creators to train its large language models.
OpenAI Is A Parasite
The recent efforts to partner with media organisations, stock audio-visual providers, and veteran consulting firms show OpenAI’s strategy for obtaining clean, first-source information for its datasets. In this case, Google can learn from OpenAI the art of harvesting data.
But OpenAI has been extremely cagey about where it got the data used to train GPT-4, the driving force behind the internet’s favourite ChatGPT. Questions have been raised, yet the data-theft issue currently sits in a legal grey area. No concrete solution has been proposed, though several countries around the world have taken steps towards stricter AI regulations.
NewsGuard, an information-tracking site, has identified 50 websites as “almost entirely written by artificial intelligence software”. A recent report from Europol states, “Experts estimate that as much as 90 percent of online content may be synthetically generated by 2026,” referring to AI-produced mass junk on the internet and the models being trained on it.
‘Don’t believe everything you see on the internet’ has been standard advice for a while now. It is high time that big tech companies like Google take their data seriously, as ignoring the issue will have ripple effects that could lead to a digital collapse.