The purpose of robots.txt is to tell a web crawler which URLs it can and cannot access. It is, however, only advisory, and several crawlers do not fully obey it.
After accessing and analyzing over a hundred million domains, some patterns in robots.txt usage have emerged.
Before digging deeper, there are some facts that need to be mentioned:
There are five common terms for the robots.txt file. Note that the Standard column refers to the proposed RFC standard drafted by Google: https://tools.ietf.org/html/draft-rep-wg-topic-00#section-2.2 .

|Term|Type|Description|Example|Standard|
|---|---|---|---|---|
|User-agent|Group|Used for grouping of rules. Matching is case insensitive; allowed characters: "a-zA-Z_-"|User-agent: Googlebot|Yes|
|Disallow|Rule|Disallows a path|Disallow: /wp-admin|Yes|
|Allow|Rule|Allows a path|Allow: /Posts/|Yes|
|Crawl-delay|Rule|Delay for the crawler, in seconds|Crawl-delay: 10|No|
|Sitemap|Extra|Location of the site's sitemap|Sitemap: https://www.example.com/sitemap.xml|No|
There are three special characters to consider when it comes to the rules:

|Character|Description|Example|
|---|---|---|
|#|Designates a comment, either at the end of a line or on a line of its own|allow: / # comment in line<br># comment at the end|
|$|Designates the end of the match pattern; the URI must end at exactly this point for the rule to match|allow: /this/path/exactly$|
|*|Designates 0 or more instances of any character (wildcard)|allow: /this/*/exactly|
By the proposed standard, clients are permitted to interpret records that are not part of the standard, such as the Crawl-delay and Sitemap directives.
When evaluating a URI against the allow/disallow rules, the most specific match should be used. The most specific match is determined by the "longest path" strategy (measured in octets). If an allow rule and a disallow rule are equally specific, the allow rule takes precedence.
Specifies the user agent of the bot/crawler/spider you are giving instructions to. Matching is case insensitive.
Disallows a specific URL. Think of the path you set here as "starts with", meaning "/" will disallow everything. If you only want to disallow a specific path, you can end the string with $, such as "/images/$". That will disallow "/images/" but not "/images/test.png", for instance.
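The "starts with", "$", and "*" semantics can be sketched as a small matcher. This is a hypothetical helper written for illustration, not part of any standard library:

```python
import re

def path_matches(pattern, path):
    # A minimal sketch of robots.txt path matching: patterns are
    # "starts with" by default, '*' matches any run of characters,
    # and a trailing '$' anchors the match at the end of the path.
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"
    return re.match(regex, path) is not None

path_matches("/images/$", "/images/")          # True
path_matches("/images/$", "/images/test.png")  # False
path_matches("/images/", "/images/test.png")   # True: "starts with"
```

Because `re.match` anchors at the start of the string, the "starts with" behavior comes for free; only the end anchor needs special handling.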
Lastly, there is another peculiarity when it comes to disallow. In the beginning there was no "allow" directive (and some crawlers still don't understand it), so if you want to allow one crawler and disallow another, you have to use an empty disallow:
```
User-agent: *
Disallow: /

User-agent: googlebot
Disallow:
```
The above means: for all agents, disallow everything; for googlebot, allow everything. An empty "Disallow:" essentially means "Allow: /".
Makes it possible to allow paths that are otherwise disallowed, by specifying longer / more specific paths.
```
User-agent: *
Disallow: /posts/
Allow: /posts/public/
```
Accessing /posts/public/test.html would be OK since the allow-rule is more specific. Accessing /posts/private/test.html would not be allowed since the disallow-rule would kick in.
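The longest-match precedence can be sketched in a few lines. This is a simplified model that ignores wildcards; the rule list and function name are made up for illustration:

```python
def is_allowed(rules, path):
    # rules: list of (allow: bool, pattern: str); the most specific
    # (longest) matching pattern wins, and on a tie allow beats disallow.
    best_len, allowed = -1, True  # no matching rule at all means allowed
    for allow, pattern in rules:
        if path.startswith(pattern):
            if len(pattern) > best_len or (len(pattern) == best_len and allow):
                best_len, allowed = len(pattern), allow
    return allowed

rules = [(False, "/posts/"), (True, "/posts/public/")]
is_allowed(rules, "/posts/public/test.html")   # True
is_allowed(rules, "/posts/private/test.html")  # False
```

Note that the order of the rules does not matter here: only the length of the matching pattern decides the winner.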
Crawl-delay is not part of the proposed standard. Google and Baidu do not support it, while Bing, Yandex and Yahoo do.
The purpose is to specify a crawl rate for the crawler so that your site is not affected by the crawling itself (performance issues). With Google, you can use its Google Search Console to specify crawl rates instead; the same applies to Baidu through its own tools.
```
User-agent: *
Crawl-delay: 10
```
Would set the crawl delay to 10 seconds between each fetch, giving a maximum of 8640 pages fetched per day.
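As a sketch, a crawler honoring the directive might simply sleep between fetches. The `fetch` callback here is a stand-in, not a real API:

```python
import time

CRAWL_DELAY = 10  # seconds, from "Crawl-delay: 10"

SECONDS_PER_DAY = 86_400
MAX_FETCHES_PER_DAY = SECONDS_PER_DAY // CRAWL_DELAY  # 86400 / 10 = 8640

def polite_crawl(urls, fetch):
    # Fetch URLs one at a time, waiting CRAWL_DELAY seconds in between.
    for url in urls:
        fetch(url)
        time.sleep(CRAWL_DELAY)
```

The 8640-pages-per-day figure in the text falls straight out of the arithmetic: 86400 seconds per day divided by a 10-second delay.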
Crawlers should follow up to five consecutive redirects (301, 302). If the robots.txt is not found within five redirects, the crawler may assume that it is unavailable.
|HTTP status code|Type|Description|
|---|---|---|
|400-499|Unavailable|If the robots.txt is determined to be unavailable, the crawler may access any resource on the server or use a cached version of robots.txt|
|500-599|Unreachable|If the robots.txt is determined to be unreachable, the crawler MUST assume complete disallow. If it stays unreachable for a long period of time (approx. one month), the crawler may treat the robots.txt as "Unavailable" and can access any resource or use a cached copy|
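The status-code handling above can be summed up in a small helper. This is a hypothetical function mirroring the table, not any real library's API:

```python
def robots_fetch_policy(status_code):
    # Map the HTTP status of the robots.txt fetch to crawler behavior,
    # following the proposed standard's unavailable/unreachable split.
    if 400 <= status_code <= 499:
        return "unavailable"  # may crawl anything, or use a cached copy
    if 500 <= status_code <= 599:
        return "unreachable"  # must assume complete disallow
    return "available"        # parse the body as normal

robots_fetch_policy(404)  # "unavailable"
robots_fetch_policy(503)  # "unreachable"
```

The asymmetry is deliberate: a missing file (4xx) is treated as "no restrictions", while a server error (5xx) is treated as "assume everything is off limits" until the error persists long enough to be reclassified.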