“In the interest of maintaining a healthy ecosystem and preparing for potential future open source releases, we’re retiring all code that handles unsupported and unpublished rules (such as noindex) on September 1, 2019.”
This means that rules unsupported by the internet draft, such as crawl-delay, nofollow, and noindex, will no longer have any effect in robots.txt.
What are the alternatives suggested by Google?
Since these rules are about to become ineffective, what other options are there for controlling crawling and indexing? For those who relied on the noindex directive in the robots.txt file, Google suggests several alternatives:
- Noindex in robots meta tags: Supported both in HTTP response headers and in HTML, the noindex directive is the most effective way to remove URLs from the index when crawling is allowed.
- 404 and 410 HTTP status codes: Both status codes signal that the page does not exist, so such URLs are dropped from Google’s index once they are crawled and processed.
- Password protection: Unless markup is used to indicate subscription or paywalled content, hiding a page behind a login will generally remove it from Google’s index.
- Disallow in robots.txt: Search engines can only index pages that they know about, so blocking a page from being crawled usually means its content will not be indexed. A search engine may still index the URL based on links from other pages without seeing the content itself, although Google says it aims to make such pages less visible in the future.
- Search Console Remove URL tool: A quick and easy way to remove a URL temporarily from Google’s search results.
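The first two options in the list can be sketched in code. The following is a minimal Python sketch, assuming a hypothetical site with a set of retired paths and a set of paths that should stay out of the index. The names `RETIRED_PATHS`, `NOINDEX_PATHS`, and `response_for` are illustrative assumptions, not part of any Google API. `X-Robots-Tag: noindex` is the HTTP header form of the robots meta tag.

```python
from http import HTTPStatus

# Hypothetical example data: these paths are assumptions for illustration only.
RETIRED_PATHS = {"/old-campaign"}
NOINDEX_PATHS = {"/internal-search"}


def response_for(path):
    """Return (status_code, headers) for a request path."""
    if path in RETIRED_PATHS:
        # 410 Gone signals that the page was removed on purpose.
        return HTTPStatus.GONE.value, {}
    if path in NOINDEX_PATHS:
        # The page stays crawlable, but the header asks search engines not to index it.
        return HTTPStatus.OK.value, {"X-Robots-Tag": "noindex"}
    return HTTPStatus.OK.value, {}
```

The HTML equivalent of the header is a `<meta name="robots" content="noindex">` tag in the page head. Either form only works if the page is not also disallowed in robots.txt, because the crawler has to fetch the page to see the directive.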
Why Google decided to stop supporting noindex in robots.txt
The main reason for retiring the code that handles rules not published in the internet draft is that those rules are unofficial.
The unofficial robots.txt directives that Google supported in the past will stop working, so it is important to settle on the right course of action for controlling the crawling and indexing of web pages.
Let us see what Google says…
“For 25 years, the Robots Exclusion Protocol (REP) was only a de-facto standard. This had frustrating implications sometimes. On one hand, for webmasters, it meant uncertainty in corner cases, like when their text editor included BOM characters in their robots.txt files. On the other hand, for crawler and tool developers, it also brought uncertainty.”
With the recently published internet draft, which provides an extensible architecture for rules that are not part of the standard, the noindex directive in robots.txt will no longer help publishers control the indexing of their pages or keep crawlers away from pages they do not want crawled.
One of the most basic and critical components of the web, the Robots Exclusion Protocol (REP) allows website owners to exclude automated clients, such as web crawlers, from accessing their sites, either partially or completely.
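To illustrate how a crawler applies such exclusion rules, here is a small sketch using Python’s standard `urllib.robotparser` module. The robots.txt content, the `ExampleBot` user agent, and the example.com URLs are assumptions made up for this illustration. The standard-library parser is one interpretation of the protocol and may not match Google’s parser in every corner case.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
# Feed the rules directly instead of fetching them over the network.
parser.parse(ROBOTS_TXT.splitlines())

# A URL under /private/ is excluded for every user agent.
blocked = parser.can_fetch("ExampleBot", "https://example.com/private/report.html")
# Any other URL stays open to crawling.
allowed = parser.can_fetch("ExampleBot", "https://example.com/blog/")

print(blocked, allowed)
```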
However, the REP was never made part of an official Internet standard, which is why developers have interpreted the protocol somewhat differently over the years. It also has not been updated since its inception to cover today’s corner cases. This ambiguous de facto standard made it difficult for website owners to write rules correctly.
“We wanted to help website owners and developers create amazing experiences on the internet instead of worrying about how to control crawlers,” says Google.
The proposed REP draft was prepared after looking at 20 years of real-world experience with robots.txt rules. It does not change the original rules. Rather, it defines essentially all previously undefined scenarios for robots.txt parsing and matching and extends the protocol for the modern web. These fine-grained controls still let publishers decide which pages on their site are crawled and potentially shown to interested users.
One thing is clear: the noindex directive in robots.txt is no longer going to work. If you still depend on it to keep pages from being indexed, you need to change your strategy and look for other options or use those suggested by Google. Make sure you do this before the September 1, 2019 deadline, when Google will completely stop supporting noindex in robots.txt and other undocumented rules such as nofollow and crawl-delay.
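As a starting point for that review, the sketch below scans the text of a robots.txt file for the three rules named in this article. It is a simple line-based check under stated assumptions, not a full REP parser. The function name `find_unsupported_rules` and the sample file are hypothetical.

```python
# Rules named in this article as unsupported in robots.txt.
UNSUPPORTED_RULES = {"noindex", "nofollow", "crawl-delay"}


def find_unsupported_rules(robots_txt):
    """Return (line_number, line) pairs that use an unsupported rule."""
    hits = []
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        # Drop a leading BOM and any trailing comment before inspecting the line.
        line = raw.lstrip("\ufeff").split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field = line.split(":", 1)[0].strip().lower()
        if field in UNSUPPORTED_RULES:
            hits.append((number, line))
    return hits


# Hypothetical sample file, for illustration only.
SAMPLE = """\
User-agent: *
Disallow: /tmp/
Noindex: /drafts/
Crawl-delay: 10
"""

for number, line in find_unsupported_rules(SAMPLE):
    print(f"line {number}: {line}")
```

Each reported line is a candidate for replacement with one of the alternatives listed earlier, such as a noindex robots meta tag or a Disallow rule.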