
Google Open-Sources Robots.txt Parser To Help Standardise Robots Exclusion Protocol


Open-sourcing has become a popular practice. Companies have realised the value of the feedback they get from open-sourcing their projects and have seen tangible improvements from doing so. No matter how large the company, open-sourcing a project surfaces insights into possible refinements.

Google has taken a strong stand on standardising the Robots Exclusion Protocol (REP). Now, as part of these efforts, the search engine giant has open-sourced its robots.txt parser.

Need For REP

The REP, in simple words, is a standard that lets websites communicate with web robots. It tells them which sections of a site they should and should not scan and process. But despite the standard, not all web robots comply: some scan parts of a website that they have been advised not to scan. And because the handling of conflicting rules in a robots.txt file was never formally specified, different crawlers may resolve the same conflict differently.
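For illustration, a minimal robots.txt file might look like the following sketch; the paths and the crawler name here are hypothetical:

    User-agent: *            # rules that apply to every crawler
    Disallow: /private/      # do not crawl anything under /private/
    Allow: /private/faq      # ...except this one path

    User-agent: ExampleBot   # a hypothetical crawler with stricter rules
    Disallow: /              # this crawler may not crawl the site at all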

For 25 years, the REP remained only a de-facto standard. This had negative consequences:

  • It left webmasters uncertain about how corner cases would be handled. 
  • It introduced the same uncertainty for crawler and tool developers.

The REP specifies that, where multiple subdomains exist, each subdomain must have its own robots.txt file. Crawler directives tell web crawlers which sections of a site they may crawl, and the major search engines follow them. Indexer directives, such as noindex, work differently: no crawler-level equivalent exists, so search engines must be allowed to crawl a resource before they can see the indexer directive it carries. Indexer directives can, however, be applied to groups of URIs, as the example below shows.
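As a sketch of that distinction, a page-level indexer directive and its HTTP-header form (the mechanism that makes directives possible for non-HTML resources and groups of URIs) might look like this:

    <!-- indexer directive embedded in an HTML page -->
    <meta name="robots" content="noindex">

    # the same directive as an HTTP response header, usable for non-HTML resources
    X-Robots-Tag: noindex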

Robots.txt Parser Open-Sourced

As part of its effort to turn the REP into an internet standard, Google has open-sourced its robots.txt parser. The parser is a C++ library, parts of which are about 20 years old, that Google's crawlers use to match the rules in robots.txt files. Over those 20 years, Google has upgraded the library substantially, learning a great deal about how webmasters write robots.txt files and which corner cases it had to cover. The library is now hosted in a GitHub repository (github.com/google/robotstxt), which the community can use to access the work.
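As a sketch of how the library is meant to be used, checking whether a URL may be crawled might look like the snippet below. The RobotsMatcher class and OneAgentAllowedByRobots method appear in the repository's documentation, but treat the exact signatures as assumptions to verify against the source; the robots.txt body, agent name and URL are hypothetical.

    #include <iostream>
    #include <string>

    #include "robots.h"  // from github.com/google/robotstxt

    int main() {
      // Hypothetical robots.txt content and a request to check against it.
      const std::string robots_txt =
          "User-agent: *\n"
          "Disallow: /private/\n";
      const std::string user_agent = "ExampleBot";  // hypothetical crawler name
      const std::string url = "https://example.com/private/page.html";

      // RobotsMatcher applies the same matching rules Google's crawlers use.
      googlebot::RobotsMatcher matcher;
      bool allowed =
          matcher.OneAgentAllowedByRobots(robots_txt, user_agent, url);

      std::cout << (allowed ? "allowed" : "disallowed") << std::endl;
      return 0;
    }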

The main objective of open-sourcing the robots.txt parser is to enlist the worldwide community's help in standardising the REP. With the code in the open, the entire community can inspect it, make suggestions and contribute to the effort. Google has also bundled a testing tool with the package so the community can test robots.txt rules, and the accompanying draft specification defines scenarios of robots.txt parsing that were previously undefined.
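The repository also builds a small command-line tester around the same library; invoking it might look roughly like this (the binary name and Bazel target are assumptions based on the repository's build setup, and the file, agent and URL are hypothetical):

    bazel build :robots_main
    bazel-bin/robots_main my_robots.txt ExampleBot https://example.com/private/page.html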

With this open-sourced parser from Google, the developer community can build parsers that behave as the REP instructs, helping ensure that web robots scan only the parts of the web they are permitted to scan. Under the proposed standard, robots.txt is also no longer tied to HTTP: it can be used with any URI-based transfer protocol, such as FTP or CoAP.

https://twitter.com/googlewmc/status/1145648654415478785

Outlook

Google is not commonly known for opening up parts of its core infrastructure, which makes this release a rare move. With it, any crawler developer or coder gets a production-tested reference for reading robots.txt files. Several industry players have extended the REP over the years: the Yahoo! Search Blog and the Microsoft Live Search Webmaster team, for instance, announced support for wildcards, sitemaps and extra META tags.


Disha Misal

Found a way to Data Science and AI through her fascination for technology. Likes to read, watch football and has an enormous amount of affection for Astrophysics.