Robots.txt for SEO: Introduction to Basics of Robots.txt File


What is Robots.txt?

Robots.txt is a plain text file that webmasters create to instruct search engine bots (crawlers) on how to crawl their website. It is also known as the Robots Exclusion Protocol, and it is an easy way to improve your website’s SEO health. You do not have to be proficient in the technical aspects of a website to use it. So, let’s understand this file better!

Robots.txt tells search engines not only what to crawl but also what not to crawl. Let’s say a search engine is about to crawl The Tipsy Marketer’s website: before it actually does that, the crawler will look for the robots.txt file and follow the directives mentioned there.

Before crawling any site, search engine bots will scan the robots.txt file for directives about which pages to avoid and for the location of the site’s XML sitemap. While the robots.txt file is respected by the majority of search engines, it is important to note that search engine bots can choose to ignore it completely!
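
As a minimal sketch (the blocked path and sitemap URL are illustrative), a robots.txt file that hides one directory and points crawlers to the XML sitemap could look like this:

    User-agent: *
    Disallow: /private/
    Sitemap: https://www.example.com/sitemap.xml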

How to find your robots.txt file?

You can access your robots.txt file by appending /robots.txt to your root domain, as shown below:

www.example.com/robots.txt

Fun fact: Robots.txt is case-sensitive. The file must be named “robots.txt”, not “Robots.txt” or “robots.Txt”.

User-agents:

Each crawler identifies itself with a user-agent name, which you can use in robots.txt to address it directly (a sketch follows the list below). The most common user-agents are:

  • Google: Googlebot

  • Google Images: Googlebot-Image

  • Bing: Bingbot

  • DuckDuckGo: DuckDuckBot

  • Yahoo: Slurp

  • Baidu: Baiduspider
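
As a sketch of how these user-agent names are used (the directory names are illustrative), a robots.txt file can give different instructions to different crawlers by naming them in the User-agent line:

    User-agent: Googlebot
    Disallow: /not-for-google/

    User-agent: Bingbot
    Disallow: /not-for-bing/

    User-agent: *
    Disallow: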

Which directives are used in robots.txt?

There are two main directives to consider:
  • Allow
  • Disallow

Allow: 

This directive tells crawlers which path they are allowed to crawl.

Example: Allow: {path}
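
In practice, Allow is most useful for carving out an exception inside a blocked section. A sketch with illustrative paths, as Google and most modern crawlers interpret the combination:

    User-agent: *
    Disallow: /blog/
    Allow: /blog/public-post.html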

Disallow: 

This directive tells crawlers which path they are not allowed to crawl.

Example: Disallow: {path}

The path is matched against the beginning of the URL.

Let’s consider an example: if the given path is /blog, the sketch below shows which URLs would be blocked and which would still be crawled.
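
With the directive:

    Disallow: /blog

these illustrative URLs would be blocked, because the path matches the beginning of the URL:

    www.example.com/blog
    www.example.com/blog/seo-checklist
    www.example.com/blogging-tips

while these would still be crawled, because the path does not match:

    www.example.com/about
    www.example.com/contact
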
Fun fact: Google no longer supports the crawl-delay, noindex, and nofollow directives in robots.txt.

Example robots.txt file:

1. To allow all robots complete access
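
A minimal sketch of such a file: the wildcard user-agent addresses every bot, and an empty Disallow blocks nothing.

    User-agent: *
    Disallow: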

Alternative: Do not use a robots.txt file at all; with no file present, every bot is free to crawl your entire website.

2. To exclude all robots from the entire website
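
A sketch that blocks every crawler from the whole site:

    User-agent: *
    Disallow: /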

3. To exclude a single bot
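
A sketch, where “BadBot” is a placeholder for the user-agent of the crawler you want to block (see the user-agent list above):

    User-agent: BadBot
    Disallow: /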

4. To exclude specific directories
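
A sketch with illustrative directory names; everything outside these directories remains crawlable:

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /tmp/
    Disallow: /private/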

About the Robots <meta> tag:

You can use the robots <meta> tag, a special HTML element, to instruct bots not to index the content of a webpage or follow the links on it.

Refer to the code below for an example:
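
A minimal sketch of a page that should not be indexed and whose links should not be followed (the values are case-insensitive):

    <html>
      <head>
        <title>Example page</title>
        <meta name="robots" content="noindex, nofollow">
      </head>
      <body>...</body>
    </html>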



Like any <meta> tag, it should be placed in the <head> section of an HTML page, as in the example above. The valid values for the content attribute are “INDEX”, “NOINDEX”, “FOLLOW”, and “NOFOLLOW”.

To address a specific crawler, use that crawler’s user-agent name in the “name” attribute.
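
For example, to keep only Google’s crawler from indexing a page while leaving other bots unaffected, a sketch would be:

    <meta name="googlebot" content="noindex">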


Fun fact: Robots can ignore the <meta> tags on your page, especially malware bots.

How to check robots.txt for blocked URLs?

1. Open the Screaming Frog SEO Spider tool.
2. Input the site URL and crawl the site using Screaming Frog.
3. Check the Crawl Overview report to obtain the list of URLs blocked by robots.txt.
4. You can then export this list to an Excel file.
