Last updated February 28, 2024

Why is Google Eyeing Reddit’s Data?

Google is reportedly forking out $60 million a year to Reddit for access to its wild conversations to train its AI, while Reddit gets to spruce up its search capabilities with Google's AI magic.


Illustration by Nikhil Kumar

Published on February 24, 2024

by K L Krithika


Google recently signed a data licensing agreement with Reddit, reportedly worth $60 million per year, to access the social media and news aggregation platform’s real-time content through its data API. The deal comes just ahead of Reddit’s initial public offering (IPO) in March this year.

Reddit, the first major social media company to go public since Pinterest in 2019, will likely aim for a valuation of at least $5 billion in the IPO. Since first announcing its IPO plans in 2021, the company has made sweeping changes, apparently to increase its revenue and boost its valuation.

In one big change, Reddit eliminated free access to most of its API. Developers now need to pay for access, with pricing based on the number of requests made. The prices were exorbitant, forcing third-party alternative apps for the site to shut down.

The primary purpose of the Google deal is to provide Google’s AI with a large amount of conversational data to train and improve its LLMs. Reddit’s users sign over to the platform the right to use their content however it sees fit. However, it is unclear how the data will be anonymised before being used for training. Reddit also benefits, as it can use Google’s Vertex AI to improve its not-so-good search feature.

Useful Data?

Much of Reddit is known for conversations that range from flippant to hateful. The polarised crowds in each Subreddit form their own echo chambers. As a recent example, conversations on the Russia-Ukraine war showed a bias towards Ukraine, with users often spewing vitriol when this stance was countered.

Although moderators on the platform try to curb hate speech, they are not always successful, as the rules vary from Subreddit to Subreddit.

Bindu Reddy, CEO and co-founder of Abacus.AI, said on X, “Once they (Google) pre-train their model on this largely uncensored corpus where humans routinely reveal their true opinions, they will spend > $60M suppressing the Reddit content, nerfing, and nudging their model to reflect their ideology!”

Reddit is popular for its diverse content, and the top comments on most posts are funny and satirical. This irony, satire, and humour offer a unique data set for training Google’s AI, one that could contribute to a deeper understanding of complex human communication.

The data on Reddit is also organised and sorted, with upvote information giving it structure. This gives Google instant access to many types of data that it can put to better use. As announced by Google and Reddit, Google plans to use “enhanced signals” to improve how it displays information, for example by showing more content from Reddit.

There is still a long way to go from training on such content to an AI that can grasp nuance, distinguish satire from misinformation, and generate informal, creative language. The challenges stem from the heavy reliance on context, potential bias amplification, and the limited generalisability of niche or slang-dominated data.

Google’s blog post further noted that Reddit searches are popular on Google: “This partnership will facilitate more content-forward displays of Reddit information that will make our products more helpful for our users and make it easier to participate in Reddit communities and conversations.”

AI experts suggest that messy, diverse data can enhance model performance, provided it is paired with sophisticated filtering and a balanced training approach. Privacy advocates, however, raise concerns about using even anonymised Reddit data to advance profiling and targeted advertising techniques.

The massive downside is inaccurate information. During the peak of COVID-19, Reddit said it would leave up Subreddits that spread misinformation related to the disease. Days later, after protests from many of its own users, Reddit banned the forum in question, saying it had violated other rules.

Meanwhile, Google has paused its image generator and apologised for it ‘missing the mark’ owing to inaccuracies. The irony here is that the model was inclusive, but to the point of showing African and Asian Nazi soldiers. Although training on Reddit’s structured data would be easier, the questionable quality of that data could lead to similarly problematic outputs, where harmful stereotypes or misinformation are reinforced rather than detected.
