To say that Stack Overflow has been having a bad year is an understatement. From considerable community backlash over its proposed LLM product to uproar over its API access changes, the community question-and-answer platform has come under fire since ChatGPT exploded in popularity. However, this isn’t the only reason the site has declined in popularity.
New statistics show that Stack Overflow has lost around 50% of its traffic over the past year and a half. Its lifeblood, the volume of questions and answers posted, has also fallen by 50%. This comes at a time when many users of the site feel increasingly strangled by moderation.
Even though the site’s continued crackdown on content quality frustrates users, a case can be made for the increase in moderation. As the Internet continues to be filled with AI junk, Stack Overflow’s heavily moderated database of rich user-driven content might be the last bastion of human-generated, domain-specific data.
Stack Overflow’s unsteady mutiny
Even before the launch of ChatGPT in November last year, Stack Overflow was seeing a steady decline in users. This was mainly caused by the company’s new-found attitude towards moderation, which started to veer into the extreme. Hacker News forum member JohnMakin stated,
“Moderation on SO has gotten progressively more horrible. Can’t tell you how many times I found the exact, bizarre question I was asking only to see one comment trying to answer it and then a mod aggressively shutting it down for not being “on topic” enough or whatever….Oftentimes the best answer is buried in comments and has very negative feedback despite answering the exact question.”
This can largely be traced back to a moderation strike in which curators, contributors, and moderators of the site participated on June 5th, 2023. Its main objective was to protest Stack Overflow’s flip-flopping AI policy. The policy first led to thousands of posts being removed and hundreds of users being suspended. It was then revoked in May of this year, allowing AI content to be published on the platform, much to the moderators’ chagrin.
This led moderators to raise the alarm over AI-generated content, believing that it will “over time, drive the value of the sites to zero”. They also argued that the company has ignored the needs of its community, focusing instead on business pivots. Through the strike, they aimed to draw attention to the issues that moderators on the site face.
While the moderators are currently engaged in a protracted battle against the site’s owners, it seems that they are slowly winning. They have secured an interim solution on the generative AI front, wherein suspected AI-generated content will be checked against a set of ‘strong’ and ‘weak’ heuristics, which will determine whether a post should be removed or not. The moderators were also successful in getting Stack Overflow to continue providing the data dumps and API access. This battle underscores the importance of sticking to human-generated content in the age of AI, especially when the company is trying to make a living selling training data.
Saving the golden goose
Many developers have now turned to chatbots to solve their programming issues. As models like ChatGPT get better, their ability to logically deconstruct code also improves. Kartik D, a Senior Backend Developer at MachineHack, said on using Stack Overflow, “Finding the right Stack Overflow answer for an issue is difficult, but it’s easier in ChatGPT. Combining GPT-3.5 and Bard you get a good result, but the suggested results in Bard usually redirect to Stack Overflow.”
This shows the impact that Stack Overflow has on the training datasets of large language models like GPT-4. It is well known that question-answer sites are some of the richest sources of data, especially for large language models. Not only is the quality of the data high, but it is also structured in a format that lends itself well to training.
User maxlin on the Hacker News forum summarised this perfectly, stating, “Even though StackOverflow in the common use case has been taken over by ChatGPT, I sincerely hope it keeps operating, stays strict (even if it causes collateral) and keeps ban on LLM-generated content…Obviously ChatGPT was trained partly with data only gainable from a healthy StackOverflow-kind of site with users actively asking unique questions and enough people answering those unique questions with well-thought-out answers.”
This also echoes the statements of Reddit CEO Steve Huffman, who has stated that Reddit’s ‘corpus of data is really valuable’, as it contains things that people would ‘only ever say in therapy, or A.A., or never at all’. Similarly, Stack Overflow contains answers to some of the most specific technical queries on the Internet, and its moderation keeps that content high in quality and up to date.
If AI content is allowed on the site, the overall quality of its content would deteriorate and move away from the carefully worded and constructed answers of today. Moreover, stronger moderation will only increase the quality of the data, which is something Stack Overflow will soon depend on heavily as self-debugging LLMs become more prominent.