Reddit, Stack Overflow Chase Fool’s Gold in Generative AI Rush

Will Wikipedia follow in the footsteps of Reddit, Stack Overflow, and other platforms by limiting access to its data for AI purposes?


Published on May 2, 2023

by Mohit Pandey


Everyone is going after generative AI these days, from big tech to IT companies to tech influencers, and now online communities. Last week, Reddit announced changes to its API that will restrict the content pipeline used by big-tech companies like Microsoft, Google, and OpenAI to train AI models. This well-thought-out move will enable Reddit to put the content that fuels chatbots like ChatGPT or Bard behind a paywall. But this raises a question: why the sudden shift towards monetisation?

Reddit chief Steve Huffman recognises the importance and value of the corpus of data that the community platform hosts. Interestingly, Reddit is also planning an initial public offering (IPO) this year. Since most of its revenue comes from advertising, the company’s plan to capitalise on the generative AI landscape with the most valuable offering it has is a smart move. “We don’t need to give all of that value to some of the largest companies in the world for free,” Huffman told The New York Times.

The current restriction on Reddit’s data API applies only to big-tech companies that are building AI chatbots using LLMs.

The data API has been available in a structured form for developers since 2008. Unlike the unstructured data that is available on the internet through web scraping, Reddit’s API allows developers to research and build moderation and other tools by providing “data dumps”. The company says that it will still allow free access to the Reddit data API for developers.

Following in Reddit’s footsteps, the ‘LLM-obsessed’ Stack Overflow also announced that it plans to begin charging large AI developers for access to its programming-focused community questions and answers. Stack Overflow chief Prashanth Chandrasekar told Wired that he was very supportive of Reddit’s approach.

“Community platforms that fuel LLMs absolutely should be compensated for their contributions so that companies like us can reinvest back into our communities to continue to make them thrive,” explained Chandrasekar.

Reddit and Stack Overflow have not yet released exact pricing details for access to their data APIs. But with Twitter, under Musk, recently charging $42,000 per month for access to 50 million tweets, it is possible that these two platforms will charge somewhere around that figure.

Chandrasekar said that companies that are building LLMs are violating the platform’s terms of service. Even though companies can freely use the data to train models, the content posted by users on the platform falls under a Creative Commons licence, which requires proper attribution of where the data came from, in this case the questions and answers of specific users. LLMs cannot provide this attribution, which, in his view, makes their use of the data a clear violation. This is similar to how Musk accused Microsoft and OpenAI of illegally using Twitter data and stopped their access.

Sailing Against the ‘Generative AI’ Tides

Somewhat absurdly, Stack Overflow had previously banned the posting of chatbot-generated answers. But later, the company announced that it plans to integrate generative AI services within the community. Now, by putting its data behind a paywall, the community is clearly trying either to surf the generative AI wave or to stop it from rising higher.

Chandrasekar said that to ensure future chatbots perform better than the current ones, it is essential that they are trained on evolving and progressing data. Fencing off valuable data might deter AI training and slow improvements in LLMs. He believes that proper licensing of the data API will accelerate the development of high-quality LLMs.

Similarly, publishers have been wary about the use of their websites for training AI chatbots. According to the Washington Post, Google’s Bard uses data from Wikipedia, The New York Times, The Guardian, and many more websites in its CommonCrawl database. It is quite possible that Wikipedia might also put up some walls around the usage of its data for AI, since it has been seeking donations for the last few years. Jimmy Wales, co-founder of Wikipedia, however, believes that generative AI could actually help improve the online encyclopaedia.

On the flip side, Discord has announced no plans to modify its API offerings, which will remain free. Swaleha Carlson, a spokesperson for the company, said the API is provided under terms that forbid AI training anyway.

When it comes to Reddit, the situation might be tricky. The company has a largely healthy relationship with Google and Microsoft. Their search engines “crawl” the community platform’s pages to index information for search results. This has worked well for Reddit, as its pages appear high in the search results.

The dynamics are clearly a little different when it comes to data-gobbling LLMs. Now that the company is putting its data behind a paywall, and doing so for big AI makers like Google and Microsoft, it might run into a situation where the search engines stop crawling the community platform’s pages for search results. This might result in platforms like Reddit and Stack Overflow losing the revenue they currently generate through visitors and advertisers.

Everyone is chasing generative AI’s fool’s gold (if data is the gold, data APIs are the fool’s gold). When it comes to community platforms like Stack Overflow and Reddit, the move to monetise data APIs has a high chance of backfiring. At the same time, this could be the best bet that they can make.
