Understanding Robots.txt

Published on Feb 2nd, 2015 by Kyle Piira

Almost every site on the web has a robots.txt file at the root of its domain (for example, Example.com/robots.txt) that tells web crawlers, or bots, which parts of the site they may visit and which they may not. This is helpful for keeping crawlers out of admin areas or backend files that you do not need or want crawled, and for giving bots information such as the URL(s) of your sitemap(s).

So what actually is a robots.txt? Well, it's just a simple text file, made in Notepad or any other plain-text editor, that contains a few directives that web crawlers understand. The following are some examples of common elements that you might see in a robots.txt file.

# Basic robots.txt allowing bots to visit your whole site
User-agent: *
Disallow:

# Robots file telling bots not to crawl any part of the site
User-agent: *
Disallow: /

# Robots file telling bots not to crawl any files in the admin directory (backend)
User-agent: *
Disallow: /admin/
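
To see how a well-behaved crawler might read the three examples above, here is a minimal sketch using Python's standard urllib.robotparser module. The helper name allowed, the bot name ExampleBot, and the example.com URLs are assumptions made up for illustration; real crawlers may apply additional matching rules.

```python
from urllib.robotparser import RobotFileParser

# The three example files from above, as plain strings.
ALLOW_ALL = "User-agent: *\nDisallow:\n"
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
BLOCK_ADMIN = "User-agent: *\nDisallow: /admin/\n"


def allowed(robots_txt: str, url: str, agent: str = "ExampleBot") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text.

    `agent` is a hypothetical bot name; it falls under the `User-agent: *` group.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse from memory, no network access
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    print(allowed(ALLOW_ALL, "https://example.com/admin/"))         # True
    print(allowed(BLOCK_ALL, "https://example.com/"))               # False
    print(allowed(BLOCK_ADMIN, "https://example.com/admin/login"))  # False
    print(allowed(BLOCK_ADMIN, "https://example.com/blog/"))        # True
```

An empty Disallow line blocks nothing, while Disallow: / matches every path because every path starts with a slash.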

You can also use directives such as Sitemap to tell bots where to find your XML sitemap or RSS feeds, which can help increase crawl rates.
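
As a rough illustration of the Sitemap line, the sketch below parses a small robots.txt and reads back the sitemap URLs with urllib.robotparser. It assumes Python 3.8 or newer (where site_maps() is available); the helper name sitemap_urls and the example.com sitemap address are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that blocks the admin area and advertises a sitemap.
ROBOTS_WITH_SITEMAP = """\
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
"""


def sitemap_urls(robots_txt: str) -> list:
    """Return the sitemap URLs listed in the robots.txt text (empty list if none)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when the file has no Sitemap lines.
    return parser.site_maps() or []


if __name__ == "__main__":
    print(sitemap_urls(ROBOTS_WITH_SITEMAP))
```

The Sitemap line stands apart from the User-agent groups, so it applies to any bot that reads the file.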

There are a few drawbacks to using a robots.txt file, though. The main one is that the file is public: if you use Disallow to hide an exposed backend directory or an admin area, anybody can open your robots.txt and see those paths, and it does not stop regular users from visiting those pages. Another is that not all bots obey robots.txt. The ones that ignore it are mainly malicious bots trying to post spam comments or upload malware to your web servers. These bots may even use your robots.txt against you by visiting your disallowed links in an attempt to crack your admin accounts.
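
To make the first drawback concrete, the following sketch lists every path a robots.txt file discloses through its Disallow lines, which is a quick way to review what your own file reveals. The function name disallowed_paths and the sample paths are assumptions for illustration, and the parsing is deliberately simplified.

```python
def disallowed_paths(robots_txt: str) -> list[str]:
    """Collect every non-empty Disallow path found in the robots.txt text."""
    paths = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        value = value.strip()
        if field.strip().lower() == "disallow" and value:
            paths.append(value)
    return paths


if __name__ == "__main__":
    sample = "User-agent: *\nDisallow: /admin/\nDisallow: /backup/  # old files\n"
    # Anyone who can download the file can produce the same list.
    print(disallowed_paths(sample))
```

Since the list is visible to everyone, sensitive areas still need real protection such as authentication; Disallow only asks polite bots to stay away.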

In most cases the benefits of having an up-to-date robots.txt file outweigh the drawbacks because of the increased search engine crawling from spiders like GoogleBot or BingBot. It also tells smaller search engines that might not have a “webmaster tools” section, like DuckDuckGo or AskClash, where your sitemaps are located.

Overall, it's up to you how much time you want to spend designing your robots.txt file and how you want to deal with the possible threats of using one.