What is a robots.txt file?
A robots.txt file tells search engine crawlers which pages or files the crawler can or can't request
from your site. This is used mainly to avoid overloading your site with requests; it is not a
mechanism for keeping a web page out of Google. To keep a web page out of Google, you
should use noindex directives, or password-protect your page.
It is a file webmasters create to instruct web robots how to crawl pages on their website. The
robots.txt file is a part of the robots exclusion protocol (REP), a group of web standards that
regulate how robots crawl the web, access and index content, and serve that content up to
users.
Basic Format:
User-agent: [user-agent name]
Disallow: [URL string not to be crawled]
Together, these two lines are considered a complete robots.txt file.
Examples of robots.txt files:
Robots.txt file URL: www.example.com/robots.txt
Blocking all web crawlers from all content:
User-agent: *
Disallow: /
Using this syntax in a robots.txt file would tell all web crawlers not to crawl any pages on
www.example.com, including the homepage.
Allowing all web crawlers access to all content:
User-agent: *
Disallow:
Blocking a specific web crawler from a specific folder:
User-agent: Googlebot
Disallow: /example-subfolder/
This syntax tells only Google’s crawler not to crawl any pages that contain the URL string
www.example.com/example-subfolder/
Blocking a specific web crawler from a specific web page:
User-agent: Bingbot
Disallow: /example-subfolder/blocked-page.html
This syntax tells only Bing’s crawler to avoid crawling the specific page at
www.example.com/example-subfolder/blocked-page.html
How does robots.txt work?
Search engines have two main jobs:
- Crawling the web to discover content;
- Indexing that content so that it can be served up to searchers who are looking for
information.
To crawl sites, search engines follow links to get from one site to another, ultimately crawling
across many billions of links and websites. This crawling behaviour is sometimes known as
"spidering".
After arriving at a website but before spidering it, the search crawler will look for a robots.txt file.
If it finds one, the crawler will read that file first before continuing through the page. Because the
robots.txt file contains information about how the search engine should crawl, the information
found there will instruct further crawler action on this particular site.
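To see the same decision a crawler makes, Python's standard urllib.robotparser module can fetch a robots.txt file and answer whether a given user-agent may crawl a given URL. This is only a minimal sketch of that read-first, crawl-second behaviour; the domain and paths below are placeholders.

from urllib import robotparser

# Point the parser at the site's robots.txt file (placeholder domain).
rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()  # fetch and parse the file, just as a crawler would before spidering the site

# Ask whether a given user-agent is allowed to fetch a given URL.
print(rp.can_fetch("*", "https://www.example.com/some-page.html"))
print(rp.can_fetch("Googlebot", "https://www.example.com/example-subfolder/"))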
Other quick robots.txt must-knows:
- In order to be found, a robots.txt file must be placed in a website’s top-level directory.
- Robots.txt is case sensitive: the file must be named "robots.txt" (not Robots.txt,
robots.TXT, or other).
- Some user agents may choose to ignore your robots.txt file. This is especially common
with more nefarious crawlers like malware robots or email address scrapers.
- The robots.txt file is publicly available.
- Each subdomain on a root domain uses a separate robots.txt file.
- It's generally a best practice to indicate the location of any sitemaps associated with this
domain at the bottom of the robots.txt file.
Technical robots.txt syntax:
There are five common terms you’re likely to come across in a robots file.
User-agent: the specific web crawler to which you're giving crawl instructions. Lists of common
user agents are widely available online.
Disallow: the command used to tell a user-agent not to crawl a particular URL.
Allow (only applicable for Googlebot): the command to tell Googlebot it can access a page or
subfolder even though its parent page or subfolder may be disallowed.
Crawl-delay: how many seconds a crawler should wait before loading and crawling page
content.
Sitemap: used to call out the location of any XML sitemaps associated with this URL. This
command is only supported by Google, Ask, Bing, and Yahoo.
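To make these terms concrete, here is an illustrative robots.txt file that uses all five; the folder names and sitemap URL are placeholders, not recommendations:

User-agent: Googlebot
Disallow: /example-private/
Allow: /example-private/allowed-page.html

User-agent: *
Crawl-delay: 10
Disallow: /example-temp/

Sitemap: https://www.example.com/sitemap.xml

Each User-agent line starts a new group of rules, and the Sitemap line sits at the bottom of the file, as suggested above.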
Pattern-Matching:
When it comes to the actual URLs to block or allow, robots.txt files can get fairly complex as
they allow the use of pattern-matching to cover a range of possible URL options. Google and
Bing both honor two pattern-matching characters that can be used to identify pages or subfolders that an
SEO wants excluded. These two characters are the asterisk (*) and the dollar sign ($).
- * is a wildcard that represents any sequence of characters
- $ matches the end of the URL
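As an illustration (the file extension and parameter pattern are just examples), the following directives combine both characters:

User-agent: *
Disallow: /*.pdf$
Disallow: /*?

The first rule blocks any URL that ends in .pdf, and the second blocks any URL that contains a question mark, such as internal search result URLs with query parameters.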
Why do you need a robots.txt file?
Robots.txt files control access to certain areas of your site. While this can be very dangerous if
you accidentally disallow Googlebot from crawling your entire site (!!), there are some situations in
which a robots.txt file can be very handy.
Some common use cases include:
- Preventing duplicate content from appearing in SERPs (meta robots is often a better
choice for this)
- Keeping entire sections of a website private
- Keeping internal search results pages from showing up on a public SERP
- Specifying the location of sitemap(s)
- Preventing search engines from indexing certain files on your website (images, PDFs,
etc.)
- Specifying a crawl delay in order to prevent your servers from being overloaded when
crawlers load multiple pieces of content at once
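For instance, a site whose internal search results live under a hypothetical /search/ path could ask crawlers to skip those pages while still pointing them at the sitemap; this is only a sketch, and the actual path varies from site to site:

User-agent: *
Disallow: /search/

Sitemap: https://www.example.com/sitemap.xml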
How do you check if you have a robots.txt file?
Not sure if you have a robots.txt file? Simply type in your root domain, then add /robots.txt to the
end of the URL. For instance, www.yourdomain.com/robots.txt
If no .txt page appears, you do not currently have a (live) robots.txt page.
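If you prefer to check from a script rather than a browser, here is a minimal sketch using Python's standard library; the domain is a placeholder for your own:

import urllib.request
import urllib.error

# Placeholder: substitute your own root domain.
url = "https://www.yourdomain.com/robots.txt"

try:
    with urllib.request.urlopen(url) as response:
        print("robots.txt found; contents:")
        print(response.read().decode("utf-8", errors="replace"))
except urllib.error.HTTPError as e:
    # A 404 (or similar error status) means there is no live robots.txt at this location.
    print(f"No robots.txt found (HTTP {e.code})")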