
OpenAI Now Crawls the Internet with GPTBot

This means that website owners have to take an active step to block OpenAI’s access to their websites, rather than opting in to training.


Amid controversy over scraping websites without consent, OpenAI has released GPTBot, a bot that crawls websites automatically. The bot will gather publicly available data for training AI models, which the company says it will do in a transparent and responsible manner.

OpenAI said in its documentation for the release that the web crawler filters out sources that require paywall access, and also removes personally identifiable information (PII) and text that violates its policies. The GPT creator claims that allowing the bot can help improve the accuracy and capabilities of AI systems in the future. The crawler can be identified by the following user agent token and string:

User agent token: GPTBot
Full user-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)
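
For site owners who want to spot the crawler in their own request logs, a minimal Python sketch of a check against the token above might look like this. The function names `is_gptbot` and `gptbot_version` are hypothetical helpers introduced here for illustration, not part of any OpenAI tooling. A User-Agent header can be spoofed, so a match is only a hint.

```python
import re

# Full user-agent string as quoted in the article.
GPTBOT_UA = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
    "compatible; GPTBot/1.0; +https://openai.com/gptbot)"
)

# Match the "GPTBot" token, optionally followed by a version such as "/1.0".
_GPTBOT_RE = re.compile(r"\bGPTBot(?:/(\d+(?:\.\d+)*))?", re.IGNORECASE)


def is_gptbot(user_agent):
    """Return True if the User-Agent header contains the GPTBot token."""
    return _GPTBOT_RE.search(user_agent or "") is not None


def gptbot_version(user_agent):
    """Return the advertised GPTBot version string, or None if absent."""
    match = _GPTBOT_RE.search(user_agent or "")
    return match.group(1) if match else None


if __name__ == "__main__":
    print(is_gptbot(GPTBOT_UA))
    print(gptbot_version(GPTBOT_UA))
```

The check matches on the token rather than on the full string, on the assumption that the version number may change over time.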

On the other hand, you can block GPTBot from accessing your site by adding it to your site’s robots.txt. This means that website owners have to take an active step to block OpenAI’s access to their websites, rather than opting in to training.

User-agent: GPTBot
Disallow: /

You can also control GPTBot’s access to certain parts of your website by adding the rules below to robots.txt.

User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
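
Before deploying rules like these, it can help to check how a standards-following parser reads them. The sketch below uses Python's standard-library `urllib.robotparser`; the helper `gptbot_can_fetch` and the `example.com` URLs are assumptions made for illustration. It shows only how Python's parser interprets the rules, not how GPTBot itself behaves.

```python
from urllib.robotparser import RobotFileParser

# The two rule sets quoted in the article.
BLOCK_ALL = """\
User-agent: GPTBot
Disallow: /
"""

PARTIAL = """\
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
"""


def gptbot_can_fetch(robots_txt, url):
    """Parse robots.txt text and report whether 'GPTBot' may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("GPTBot", url)


if __name__ == "__main__":
    # example.com is a placeholder domain, not taken from the article.
    for path in ("/directory-1/page.html", "/directory-2/page.html"):
        url = "https://example.com" + path
        print(path, gptbot_can_fetch(PARTIAL, url))
```

With the partial rule set, paths that match neither line are treated as allowed by this parser, so anything outside /directory-2/ stays crawlable unless more Disallow lines are added.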

Though OpenAI is acknowledging that it scrapes the internet for training its large language models like GPT-4, this still looks like a half-baked approach to address the ethical dilemmas around copying data from other people’s websites.

Users on Hacker News have been debating the ethics of releasing this web crawler for training AI models. “OpenAI isn’t even citing in moderation. It’s making a derivative work without citing, thus obscuring it,” said one user. Moreover, OpenAI has not acknowledged the websites it has already used to build its models.

OpenAI also recently filed for a trademark for ‘GPT-5’, hinting that the company is training the successor to GPT-4, which, according to several reports, would be close to AGI, the company’s goal all along. GPTBot is clearly going to help the company gather more data from across the internet to train this model. On the other hand, the company has also discontinued its AI Classifier for detecting GPT-generated text.


Mohit Pandey

Mohit dives deep into the AI world to bring out information in simple, explainable, and sometimes funny words. He also holds a keen interest in photography, filmmaking, and the gaming industry.