Open-sourcing has become a popular practice. Companies have realised the value of the feedback they get from open-sourcing their projects and have seen improvements as a result. No matter how large the company is, open-sourcing a project brings insight into possible refinements.
Google has taken a strong stand on standardising the Robots Exclusion Protocol (REP). As part of these efforts, the search engine giant has now open-sourced its robots.txt parser.
Need For REP
REP, in simple words, is a standard that lets websites communicate with web robots. It tells them which sections of a site they should and should not scan and process. But not all web robots follow the standard: some scan parts of a website that they have been advised not to scan. If a directive conflicts with a statement in robots.txt, robots.txt disallows it.
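As a rough illustration of how a well-behaved robot consults these rules, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The robots.txt content, the `ExampleBot` name and the `may_fetch` helper are made up for the example, and this module is not Google's parser, so its matching behaviour may differ in corner cases.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
# parse() takes the file as a list of lines, so no network access is needed.
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(user_agent: str, url: str) -> bool:
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(may_fetch("ExampleBot", "https://example.com/index.html"))
    print(may_fetch("ExampleBot", "https://example.com/private/report.html"))
```

A polite crawler would run a check like this before every request. Nothing in the file itself can force a robot to comply, which is why some robots still scan sections they were asked to skip.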
The REP was only a de facto standard for 25 years. This had negative effects:
- It left webmasters uncertain about corner cases.
- It introduced uncertainty for crawler and tool developers as well.
The REP says that when a domain has multiple subdomains, each subdomain must have its own robots.txt file. Crawler directives tell web crawlers which sections of a site they can crawl, and major search engines follow them. Indexer directives, by contrast, can only be read once a resource has been crawled, so search engines must be allowed to crawl any resource that carries one. Even so, indexer directives can be set for groups of URIs.
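The per-subdomain rule can be sketched as follows: a crawler derives the robots.txt location from the scheme and host of the page it wants to fetch. The helper name `robots_txt_url` and the example hosts are assumptions for illustration only.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Build the robots.txt location for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any subdomain and port) and
    # replace the path; the query string and fragment are dropped.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_txt_url("https://blog.example.com/posts/1"))
    print(robots_txt_url("https://shop.example.com/cart?item=3"))
```

Because the host is part of the result, `blog.example.com` and `shop.example.com` resolve to two different files, and rules published for one do not apply to the other.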
Google's robots.txt Parser Open-Sourced
As a part of its efforts to make the REP an internet standard, Google open-sourced its robots.txt parser. The parser is a C++ library that is about 20 years old. The team at Google uses it to match the rules included in robots.txt files. Over these 20 years, Google has upgraded the library to a large extent, having learned a lot about how webmasters write robots.txt files and about the corner cases it had to cover. The library is hosted in a GitHub repository, where the community can access the work.
The main objective of open-sourcing the parser is to get help from the community worldwide in standardising the REP. With the code open, the entire community can access it, make suggestions and help in these efforts. Google also includes a testing tool in the package to help the community test a few rules. The effort is meant to define the previously undefined scenarios for robots.txt parsing.
With this open-sourced parser from Google, the developer community can create parsers that behave as the REP instructs. The aim is to make sure that web robots only scan the parts of the web that they are instructed to scan. Under the proposed standard, robots.txt can also be used with any URI-based transfer protocol, rather than only HTTP.
Google is not commonly known for releasing its core research to the open-source world, and this is one of the rare times it has done so. With the parser available, any crawler developer or coder can read and interpret robots.txt files the way Google does. Some in the industry are also trying to extend the REP: the Yahoo! Search Blog and the Microsoft Live Search Webmaster team have described extensions that include wildcard support, sitemaps and extra META tags.
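To make the wildcard and sitemap extensions more concrete, here is a small sketch in Python. It assumes the commonly used semantics in which `*` matches any run of characters and a trailing `$` anchors the pattern to the end of the path. The function names are hypothetical, and this is not how Google's C++ library is implemented.

```python
import re
from typing import List


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regular expression.

    Assumed semantics: '*' matches any run of characters, and a trailing
    '$' pins the match to the end of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' per '*'.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def path_matches(pattern: str, path: str) -> bool:
    """Return True if path matches pattern from the start of the path."""
    return pattern_to_regex(pattern).match(path) is not None


def sitemap_urls(robots_txt: str) -> List[str]:
    """Collect the value of every 'Sitemap:' line (key is case-insensitive)."""
    urls = []
    for line in robots_txt.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


if __name__ == "__main__":
    print(path_matches("/*.pdf$", "/docs/guide.pdf"))
    print(sitemap_urls("Sitemap: https://example.com/sitemap.xml"))
```

Translating patterns to regular expressions keeps the sketch short. A production matcher would more likely match character by character to avoid regex overhead and pathological patterns.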
Found a way to Data Science and AI through her fascination for technology. Likes to read, watch football and has an enormous affection for astrophysics.