Search engines crawl from site to site, spidering the web. As they discover content, they index it so it can be served up in response to user searches. If they find a robots.txt file, the crawlers will use it as instructions for how to crawl the site.
Crawlers follow a set of web standards, called the Robots Exclusion Protocol (REP), that regulates how robots crawl the web. The major search engines that follow this standard include Ask, AOL, Baidu, Bing, DuckDuckGo, Google, Yahoo!, and Yandex.
Create a robots.txt File
This file lives in your site's root folder, and you can use the browser to check whether the file exists by going to:
https://www.domain.com/robots.txt
If you don't see a file there, you can simply create a robots.txt file with Notepad or any text editor and upload it via FTP.
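For example, a minimal robots.txt that allows every crawler to access the whole site is just two lines (an empty Disallow value means nothing is blocked):
User-agent: *
Disallow: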
Robots.txt File with WordPress
By default, WordPress generates a robots.txt file for you that looks like:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
If you are using WordPress, you can use the Yoast plugin to customize and manage your robots.txt file in the plugin's "Tools" section.
Optimize For Crawl Budget
It is important to optimize your robots.txt file because Googlebot has a “crawl budget.” This means each site has a limited number of URLs that Googlebot can and wants to crawl.
So we want to make sure crawlers focus on the most valuable pages and not on pages like search results, query strings, or thank-you pages.
This is broken down as part of Google's crawl rate limit documentation.
So we need to optimize the robots.txt file to keep crawlers away from any pages or files that would waste crawl budget on our site. Here is an example of a standard robots.txt file that allows all bots to crawl all pages except the template pages.
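(The /templates/ path below is just a placeholder; swap in whatever directory holds your template files.)
User-agent: *
Allow: /
Disallow: /templates/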
Don't Block Files Including .css and .js
While we want to optimize for crawl budget, it is a bad idea to disallow build files like images, .css, .js, and .htaccess. In an episode of Ask Google Webmasters, Google’s John Mueller goes over whether or not it’s okay to block special files in robots.txt. John says "Crawling CSS is absolutely critical as it allows Googlebots to properly render pages."
A Basic Robots.txt Example
This is the robots.txt file I am currently using here on Design 2 SEO. It is a basic example to get started giving instructions to the crawlers. I decided to block any URLs that use a question mark (?) query string, since those are only used by the site's dynamic APIs and search results.
User-agent: *
Allow: /
Disallow: /*?
Sitemap: https://example.com/sitemap.xml
Although it is not necessary, it is recommended that you add your sitemap to the bottom of the robots.txt file. This will help smaller search engines like Bing find the sitemap and crawl your site more efficiently.
Common Expressions
There are two expressions that are commonly used:
- Asterisk: * is treated as a wildcard and can represent any sequence of characters.
- Dollar sign: $ is used to designate the end of a URL.
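For example, combining the two (the file name is hypothetical):
# Blocks any URL containing .pdf, including /report.pdf?page=2
Disallow: /*.pdf
# Blocks only URLs that end in .pdf; /report.pdf?page=2 would still be crawlable
Disallow: /*.pdf$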
Using robots.txt with sitemap.xml
Be sure to exclude any disallowed pages from your sitemap.xml file to avoid having them automatically excluded by Google. Check your Google Search Console under Sitemaps > Coverage for errors or exclusions. I cover this in my video Fixing Sitemap Coverage Search Console Errors.
Disallow Examples
Here are some examples of URLs that you would want to disallow in your robots.txt file.
# Disallow any URLs that contain a question mark (query strings)
Disallow: /*?
# All PDF files
Disallow: *.pdf
# APIs
Disallow: /api.html?*
# Thank you page
Disallow: /thank-you/
# Unsubscribe pages
Disallow: /unsubscribe/
Disallow: /unsubscribe/*
The Different User-agents
Here is a list of internet bots that you can specify in your robots.txt file. I'm sure there are even more, but this gives you an idea of all the different types of bots out there.
User-agent: Mediapartners-Google
Allow: /
User-agent: AdsBot-Google
Allow: /
User-agent: Googlebot-Image
Allow: /
User-agent: Googlebot-Mobile
Allow: /
User-agent: Slurp
Allow: /
User-agent: DuckDuckBot
Allow: /
User-agent: Baiduspider
Allow: /
User-agent: Baiduspider-image
Allow: /
User-agent: YandexBot
Allow: /
User-agent: Facebot
Allow: /
User-agent: Bingbot
Allow: /
User-agent: MSNBot
Allow: /
User-agent: AMZNKAssocBot
Allow: /
User-agent: ia_archiver
Allow: /
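If you want to treat one bot differently from the rest, give it its own group; a crawler follows the most specific User-agent group that matches it. As a sketch, this allows everything for all crawlers except ia_archiver (the Internet Archive), which is blocked entirely:
User-agent: *
Allow: /
User-agent: ia_archiver
Disallow: /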
Conclusion
The robots.txt file controls crawler access and should be used in conjunction with your sitemap.xml file and Google Search Console. If there are no areas on your site where you want to control user-agent access, you may not need a robots.txt file at all.
So use it cautiously, as you don't want to accidentally disallow Googlebot from crawling your entire site!