Amid controversies about scraping websites without consent, OpenAI has released GPTBot, a web crawler that collects website content automatically. The bot will gather publicly available data for training AI models, which the company says it will do in a transparent and responsible manner.
OpenAI said in its documentation for the release that the web crawler filters out sources that require paywall access, and also removes personally identifiable information (PII) and text that violates its policies. The GPT creator claims that allowing the bot can help improve the accuracy and capabilities of AI systems in the future. It can be identified by the following user agent token and string:
User agent token: GPTBot
Full user-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)
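For site owners who want to spot the crawler in their own server logs, the check can be sketched as a simple match on the user agent token. This is a minimal illustration, not something from OpenAI's documentation: the function name `is_gptbot` and the sample browser string are assumptions made for the example, and a user-agent header can be spoofed, so a match only shows what the client claims to be.

```python
def is_gptbot(user_agent: str) -> bool:
    """Return True if the User-Agent header contains the GPTBot token.

    Hypothetical helper for illustration; matching is case-insensitive
    because header casing can vary between clients.
    """
    return "gptbot" in user_agent.lower()


# The full user-agent string quoted in the article.
GPTBOT_UA = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
    "compatible; GPTBot/1.0; +https://openai.com/gptbot)"
)

# An ordinary browser string (assumed example) for comparison.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/115.0 Safari/537.36"

print(is_gptbot(GPTBOT_UA))
print(is_gptbot(BROWSER_UA))
```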
On the other hand, you can also block GPTBot from accessing your site by adding it to your site’s robots.txt. This means that website owners have to actively take a step to deny OpenAI access to their website, rather than opting in to training.
User-agent: GPTBot
Disallow: /
You can also control GPTBot’s access to certain parts of your website by adding the rules below to robots.txt.
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
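Before deploying either set of rules, they can be checked locally. The sketch below uses Python’s standard `urllib.robotparser` module to test which URLs the rules would permit for the GPTBot token. The helper name `gptbot_allowed` and the `example.com` URLs are placeholders chosen for illustration. It shows only how Python’s parser reads the rules, which may differ in edge cases from how a given crawler resolves overlapping Allow and Disallow lines, and it says nothing about whether a crawler actually obeys them.

```python
from urllib.robotparser import RobotFileParser


def gptbot_allowed(robots_txt: str, url: str) -> bool:
    """Check whether the given robots.txt text permits GPTBot to fetch url.

    Hypothetical helper: parses the rules from a string instead of
    downloading them, so the check runs without network access.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("GPTBot", url)


# Rules that block GPTBot from the whole site.
BLOCK_ALL = """\
User-agent: GPTBot
Disallow: /
"""

# Rules that allow one directory and block another.
PARTIAL = """\
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
"""

# example.com is a placeholder domain.
print(gptbot_allowed(BLOCK_ALL, "https://example.com/any-page"))
print(gptbot_allowed(PARTIAL, "https://example.com/directory-1/post.html"))
print(gptbot_allowed(PARTIAL, "https://example.com/directory-2/post.html"))
```

Paths that match no rule, such as the site root under the second rule set, are treated as allowed by this parser.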
Though OpenAI is acknowledging that it scrapes the internet for training its large language models like GPT-4, this still looks like a half-baked approach to address the ethical dilemmas around copying data from other people’s websites.
Users on Hacker News have been debating the ethics of releasing this web crawler to train AI models. “OpenAI isn’t even citing in moderation. It’s making a derivative work without citing, thus obscuring it,” said one user. Moreover, OpenAI has not acknowledged which websites it has already used to build its models.
Recently, OpenAI also filed a trademark application for ‘GPT-5’, hinting that the company is training the successor to GPT-4, which, according to several reports, would be close to AGI, the company’s goal all along. GPTBot is clearly going to help the company gather more data from across the internet to train this model. On the other hand, the company has also discontinued its AI Classifier for detecting GPT-generated text.