FeedHall

State of robots.txt

The purpose of robots.txt is to tell a web crawler which URLs it can and cannot access. It is, however, only advisory, and several crawlers do not fully obey it.

After accessing and analyzing over a hundred million domains, some patterns in robots.txt usage have emerged.

Before digging deeper, there are some facts that need to be mentioned:

  • robots.txt must be placed in the site's top-level directory (e.g. example.com/robots.txt, not example.com/robots/robots.txt). See the fetch sketch after this list.
  • Crawlers may choose to ignore your robots.txt file. This is common for scrapers.
  • robots.txt is a publicly available file, which means you should not use it to hide URLs from search engines. Do not use it as a security measure.
  • The file name is always lower case ("robots.txt") and the file is always UTF-8 encoded.
  • robots.txt is not bullet-proof for removing pages from search engines. A page can be indexed without being crawled if other pages link to it.
  • Crawlers may impose a robots.txt size limit of 500 KiB, so it's good to keep it under that size.
  • Crawlers should not use a cached version for longer than 24 hours, unless the file is unreachable.
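
To make the first and the last two points concrete, here is a minimal Python sketch of fetching robots.txt from the site root while applying the 500 KiB limit. The function name, the timeout and the size constant are illustrative choices made for this article, not part of any specification or standard library API.

# A minimal sketch: fetch robots.txt from the top-level directory and
# ignore anything beyond the 500 KiB soft limit mentioned above.
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen

MAX_BYTES = 500 * 1024  # 500 KiB soft limit that many crawlers impose

def fetch_robots_txt(site_url: str) -> str:
    """Fetch robots.txt from the top-level directory of the given site."""
    parts = urlsplit(site_url)
    # robots.txt must live at the root, e.g. https://example.com/robots.txt
    robots_url = urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
    with urlopen(robots_url, timeout=10) as response:
        raw = response.read(MAX_BYTES)  # read at most MAX_BYTES
    return raw.decode("utf-8", errors="replace")

print(fetch_robots_txt("https://www.example.com/some/page"))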

Syntax

There are five common directives in a robots.txt file. Note that the Standard column refers to the proposed RFC standard drafted by Google: https://tools.ietf.org/html/draft-rep-wg-topic-00#section-2.2 .

Term | Type | Description | Example | Standard
User-agent | Group | Used for grouping of rules. Matching is case insensitive; allowed characters: "a-zA-Z_-" | User-agent: Googlebot | Yes
Disallow | Rule | Disallows crawling of the matching path | Disallow: /wp-admin | Yes
Allow | Rule | Allows crawling of the matching path | Allow: /Posts/ | Yes
Crawl-delay | Rule | Delay between fetches for the crawler, in seconds | Crawl-delay: 10 | No
Sitemap | Extra | Location of the site's sitemap | Sitemap: https://www.example.com/sitemap.xml | No
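
As a rough illustration of how these directives can be queried, the sketch below uses Python's standard library parser, urllib.robotparser. The robots.txt content is made up for the example; the crawl_delay() and site_maps() accessors were added in Python 3.6 and 3.8 respectively.

# Parse an illustrative robots.txt and query each directive.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse("""
User-agent: *
Disallow: /wp-admin
Allow: /Posts/
Crawl-delay: 10
Sitemap: https://www.example.com/sitemap.xml
""".splitlines())

print(rp.can_fetch("Googlebot", "https://www.example.com/Posts/"))    # True
print(rp.can_fetch("Googlebot", "https://www.example.com/wp-admin"))  # False
print(rp.crawl_delay("Googlebot"))  # 10
print(rp.site_maps())               # ['https://www.example.com/sitemap.xml']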

There are three special characters to consider when it comes to the rules:

Character | Description | Example
# | Designates an end-of-line comment; a comment can also occupy a whole line | allow: / # comment at the end of a line
$ | Designates the end of the match pattern; the URI must end exactly where the $ is placed | allow: /this/path/exactly$
* | Designates 0 or more instances of any character (wildcard) | allow: /this/*/exactly
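
To show how $ and * affect matching, here is a small sketch that converts a rule path into a regular expression. rule_to_regex is a hypothetical helper written for this article, not part of any robots.txt library.

# Translate a rule path with * and $ into a regular expression.
import re

def rule_to_regex(rule_path: str) -> re.Pattern:
    anchored = rule_path.endswith("$")
    if anchored:
        rule_path = rule_path[:-1]
    # Escape everything except *, which becomes ".*" (0 or more of any character)
    pattern = ".*".join(re.escape(part) for part in rule_path.split("*"))
    # Without a trailing $, a rule matches any path that starts with it
    return re.compile("^" + pattern + ("$" if anchored else ""))

print(bool(rule_to_regex("/this/*/exactly").match("/this/one/path/exactly")))        # True
print(bool(rule_to_regex("/this/path/exactly$").match("/this/path/exactly")))        # True
print(bool(rule_to_regex("/this/path/exactly$").match("/this/path/exactly.html")))   # False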

By the proposed standard, clients are permitted to interpret records that are not part of the standard, such as the Crawl-delay and Sitemap directives.

General matching rules

When evaluating a URI against the allow/disallow rules, the most specific match should be used. The most specific match is determined by the "longest path" strategy, measured in octets. If an allow rule and a disallow rule match with equal length, the allow rule takes precedence.
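
A minimal sketch of this rule, assuming the rules for one user-agent group have already been collected as (path, is_allow) pairs. is_allowed is a hypothetical helper written for this article, and for brevity it treats rule paths as plain prefixes, ignoring the * and $ special characters described above.

# Longest match (in octets) wins; on a tie, allow beats disallow.
def is_allowed(path, rules):
    """rules is a list of (rule_path, is_allow) pairs for one user-agent group."""
    matches = [(rule_path, is_allow) for rule_path, is_allow in rules
               if path.startswith(rule_path)]
    if not matches:
        return True  # no matching rule means the path is allowed
    best = max(matches, key=lambda m: (len(m[0].encode("utf-8")), m[1]))
    return best[1]

rules = [("/posts/", False), ("/posts/public/", True)]
print(is_allowed("/posts/public/test.html", rules))   # True: the allow rule is longer
print(is_allowed("/posts/private/test.html", rules))  # False: only the disallow matches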

User agent

The user agent of the bot/crawler/spider you are giving instructions to. Matching is case insensitive.

Disallow

Disallows a specific URL. Think of the path you set here as "starts with", that is: "/" will disallow everything. If you only want to disallow a specific path, end the string with $, for example "/images/$". That will disallow "/images/" but not "/images/test.png", for instance.

Lastly, there is another peculiarity when it comes to disallow. In the beginning there was no "Allow" directive (and some crawlers still don't understand it). If you want to allow one crawler everything while disallowing another, you have to use an empty disallow:

User-agent: *
Disallow: /

User-agent: googlebot
Disallow: 

The above means: for all agents, disallow everything; for googlebot, allow everything. "Disallow: " essentially means "Allow: /".

  • Think "starts with"
  • An exact disallow path should end with $ to mark the end of the path
  • An empty "Disallow: " means the opposite: allow everything

Allow

Makes it possible to allow paths that would otherwise be disallowed, by providing a longer / more specific match.

Example:

User-agent: *
Disallow: /posts/
Allow: /posts/public/

Accessing /posts/public/test.html would be OK since the allow rule is more specific. Accessing /posts/private/test.html would not be allowed since only the disallow rule matches.

Crawl-delay

Crawl-delay is not part of the proposed standard. Google and Baidu do not support it, while Bing, Yandex and Yahoo do.

The purpose is to specify a crawl rate so that your site is not affected by the crawling itself (performance issues). With Google, you can use its "Google Search Console" to specify crawl rates instead. The same goes for Baidu.

Example

User-agent: *
Crawl-delay: 10

This would set the crawl delay to 10 seconds between each fetch, giving a maximum of 8,640 pages crawled per day (86,400 seconds / 10).
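
Below is a minimal sketch of a polite fetch loop that honors Crawl-delay, reading the delay via urllib.robotparser. The user agent, the URL list and the fallback delay are illustrative choices, not taken from any specification.

# Honor Crawl-delay between fetches; fall back to a delay of our own choosing.
import time
import urllib.robotparser
from urllib.request import urlopen

USER_AGENT = "ExampleBot"   # illustrative crawler name
DEFAULT_DELAY = 1           # seconds, our own fallback (not from the spec)

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

delay = rp.crawl_delay(USER_AGENT) or DEFAULT_DELAY

for url in ["https://www.example.com/a", "https://www.example.com/b"]:
    if not rp.can_fetch(USER_AGENT, url):
        continue
    with urlopen(url, timeout=10) as response:
        page = response.read()
    time.sleep(delay)       # wait Crawl-delay seconds between fetches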

Redirects and status codes

Crawlers should follow up to five redirects (301, 302). If the robots.txt has not been found within five redirects, the crawler may assume that it is unavailable.

HTTP status code | Type | Description
400-499 | Unavailable | If the robots.txt is determined to be unavailable, the crawler may access any resource on the server or use a cached version of robots.txt.
500-599 | Unreachable | If the robots.txt is determined to be unreachable, the crawler MUST assume complete disallow. If it is unreachable for a long period of time (approximately one month), the crawler may treat the robots.txt as "unavailable" and can access any resource or use a cached copy.
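
As a rough sketch of how a crawler might combine the redirect limit with the status-code table, the snippet below follows at most five redirects and then maps the final status code to "available", "unavailable" (4xx) or "unreachable" (5xx). The requests library, the function name and the returned labels are illustrative choices for this article, not mandated by the proposed standard.

# Follow up to five redirects, then classify the robots.txt response.
from urllib.parse import urljoin
import requests

def fetch_robots_policy(robots_url: str, max_redirects: int = 5):
    url = robots_url
    for _ in range(max_redirects + 1):
        resp = requests.get(url, allow_redirects=False, timeout=10)
        if resp.is_redirect or resp.is_permanent_redirect:
            location = resp.headers.get("Location")
            if not location:
                break
            url = urljoin(url, location)
            continue
        if 200 <= resp.status_code < 300:
            return "available", resp.text   # parse and obey the rules
        if 400 <= resp.status_code < 500:
            return "unavailable", None      # may crawl anything, or use a cached copy
        if 500 <= resp.status_code < 600:
            return "unreachable", None      # must assume complete disallow
        break
    # Too many redirects: treat robots.txt as unavailable
    return "unavailable", None

print(fetch_robots_policy("https://www.example.com/robots.txt"))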