During the Web 2.0 era, Google and other browser makers became profitable by selling user data to advertisers. To break into this monopolistic market, newcomers needed a unique proposition that set them apart.
While many browsers used privacy as a selling point, Brave offered a standout feature: rewarding users for their anonymised data. This was powered by a blockchain-based token called the Basic Attention Token (BAT). Now that the blockchain hype has died down, Brave appears to be jumping on the AI bandwagon by selling an API for AI training data.
Collating and selling training data has become one of the fastest-growing markets of the generative AI wave. Recognising this, many top text-based platforms, like Twitter and Reddit, have locked down access to their APIs. Even companies that focus on data security and privacy have thrown these ideals by the wayside in a bid to make money off the AI wave.
Reports have emerged that privacy-focused browser Brave is building a business by selling access to a paid API of web data. The fine print for the API has led many to question Brave’s strong privacy and security stance, and has raised ethical issues about the copyright of the content.
Brave Search API explained
In a bid to capitalise on the AI wave, the Brave Search API offers plans targeted specifically at use in AI models. Subscribers to the paid API get results from the web, access to Brave’s news cluster, as well as “rights to use data for AI inference”. Aiming to feed the ever-growing appetite of AI algorithms, it seems that Brave has resorted to selling the Internet.
There is an alternative to Brave’s Search API, namely Bing’s competing offering. The main difference is that Bing does not mention using its API data for AI training, which could be due to a combination of vested interests in OpenAI and a desire to avoid possible copyright kerfuffles.
Brave, on the other hand, does not seem to have any qualms about freely distributing web content. According to research by Alex Ivanovs at StackDiary, the output from the Brave web search API for AI extracts up to 260 words per result in a machine-readable format through its ‘Extra Snippets’ feature. While these are functionally similar to Google’s Featured Snippets, they routinely extend above 150 words, which stretches the limits of what is allowed under Fair Use.
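To make the word-count concern concrete, here is a minimal Python sketch of how one might check excerpts against the 150-word figure mentioned above. It does not call the Brave API; the function names, the sample excerpts and the use of 150 words as a cut-off are all assumptions for illustration, and the threshold is the figure quoted in this article rather than a legal rule.

```python
# Illustrative only: the names and sample data below are hypothetical,
# and 150 words is the figure quoted in the article, not a legal limit.
WORD_THRESHOLD = 150


def count_words(text: str) -> int:
    """Count whitespace-separated words in an excerpt."""
    return len(text.split())


def flag_long_excerpts(snippets, threshold=WORD_THRESHOLD):
    """Return (index, word_count) for every excerpt longer than the threshold."""
    flagged = []
    for index, snippet in enumerate(snippets):
        words = count_words(snippet)
        if words > threshold:
            flagged.append((index, words))
    return flagged


# Placeholder excerpts standing in for API output: one short, one at
# the 260-word upper bound reported in the research cited above.
sample_snippets = [
    "word " * 40,
    "word " * 260,
]

flagged = flag_long_excerpts(sample_snippets)
print(flagged)
```

A check like this only measures length. Whether a given excerpt actually qualifies as Fair Use depends on more than its word count.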
In addition to the Extra Snippets feature, Brave also offers rich and structured web result data through Schema and access to its FAQs and Discussions features. This combination of features would allow any paying customer of the API to extract valuable data in a certain domain and even use it to fine-tune trained models.
To build this database, Brave makes heavy use of its own crawler, which has indexed over 8 billion pages to date. Moreover, it crawls over 40 million pages every day, contributing further to the search engine’s ever-growing index. However, by selling this data for a monthly fee, there is a case to be made that Brave is in violation of licences like CC BY-NC-ND, which expressly prohibit using content for commercial purposes.
While it is possible that Brave is being careful about the type of data it indexes, this is difficult to prove. Moreover, once copyrighted data has been used to train an AI model, there is no provenance trail with which to trace the data’s source. This, coupled with the recent trend of selling API access, has the potential to set a bad example for the rest of the industry.
Selling what they don’t own
Web APIs famously began with commercial roots, spearheaded by Salesforce’s automation API, which is widely considered to be the first web API. However, this trend quickly shifted to websites providing services in an XML or JSON format, mostly for free. Facebook’s API launch arguably played a big part in its growth, and Flickr’s API was ever-present in websites from the 2000s.
However, with the value now ascribed to data thanks to AI, companies are walking back down the route of closed and paid APIs. It seems that APIs are once again becoming a sure-fire route to monetisation, mainly thanks to the value of high-quality training data. Even in this market, Brave is treading on dangerous ground.
Apart from the API service, Brave also offers a ‘bespoke, large-data solution’ for companies looking to build a product beyond the API’s capabilities. This also seems to suggest that Brave has a dataset, similar to LAION, that encompasses the entirety of the Internet. This approach has been shown to be risky, as evidenced by the recent spate of copyright lawsuits against AI companies.
Even industry leaders like OpenAI and Meta have come under fire recently for their wanton use of copyrighted materials to train their algorithms. In a class-action lawsuit headed up by author Sarah Silverman, OpenAI and Meta were accused of using copyrighted books from shadow libraries like Library Genesis and Z-Library as training data.
As AI continues to eat more of the Internet, companies are looking to make a quick buck by selling more of this data. However, without adequate safeguards against copyright claims, these services find themselves increasingly in a legal grey area.