Robots.txt Celebrates 20 Years of Search Engine Control
Published on 15 Jul 2014 at 9:57 pm.
Filed under Informative, Search Engine Optimization.
This week marked 20 years since the creation of robots.txt. Want to know how it can help your site?
What is Robots.txt?
When the web was young, a typical web server could only handle a few requests at a time, and a few search engines crawling your entire site simultaneously could take everything down. After suffering an accidental denial-of-service attack when a search engine took down his site, a Dutch software developer named Martijn Koster proposed a new standard for search engine crawlers. His proposal called for search engines to request a file named robots.txt before crawling a site. This file contains rules listing URLs that a crawler should not visit. You can create rules that apply to all search engines or specific rules for each crawler. The file resides at the root of each website. For example, you can view our robots.txt file at:
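The format itself is simple: a rule block names a crawler with User-agent and lists the paths it should skip with Disallow. A minimal sketch (the /scripts/ path and BadBot name are made-up examples, not from this site's actual file):

```
# Rules for every crawler: skip the /scripts/ directory (hypothetical path)
User-agent: *
Disallow: /scripts/

# Rules for one specific crawler: block it from the whole site
# ("BadBot" is a placeholder name for illustration)
User-agent: BadBot
Disallow: /
```

An empty Disallow line means the crawler may fetch everything, which is the default when no robots.txt file exists at all.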
Let me be clear: crawling a part of a site is different from indexing a part of a site! To prevent search engines from indexing content, you need to include a special meta tag in the web page itself. If you add that meta tag and then tell search engines not to crawl the page, they will never re-crawl it and see the instruction to stop indexing it. Instead of getting the content out of the index, you are preventing it from ever coming out.
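For reference, the meta tag in question is the standard robots meta tag, placed in the head of the page you want removed from the index:

```
<!-- In the <head> of the page to be de-indexed.
     The crawler must still be ALLOWED to fetch this page,
     or it will never see the instruction. -->
<meta name="robots" content="noindex">
```

Only once the search engine has re-crawled the page and dropped it from its index is it safe to block the URL in robots.txt.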
Why Should I use Robots.txt?
Web servers are much more powerful than they were in 1994. It’s highly unlikely that Google, Bing, and Yahoo would take down your server even if all three crawled your site at the same time. So if your site isn’t going to crash, why would you ever bother with a robots.txt file? A few reasons include:
- You have a part of the site that is not meant for public interaction, such as the scripts that control the website.
- You have scheduled downtime for parts of the site. You can disallow search engines from crawling during the downtime, which prevents them from updating their index to replace your previous content with your downtime message. When doing this you must update your robots.txt file at least 24 hours in advance.
- You know your traffic is going to spike at a certain time. Think of the recent roll-out of HealthCare.gov: obviously traffic would spike when enrollment began and again near the enrollment deadline. Keeping search engines from accessing the site during those times would help reduce server load.
- You want specific crawlers to exclude certain types of content. For example, you can prevent images from appearing in Google Image Search. This lets Google’s main crawler index your web pages but keeps the images out of image search.
- You want to tell search engines about your XML Sitemap files. In my experience I have never seen Google detect an XML Sitemap from a robots.txt file, but I have seen Bing, and thus Yahoo, learn about one this way. I cannot speak for any other crawler.
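The last two use cases above can be combined in one file. A sketch, using a made-up /photos/ path and the placeholder domain example.com:

```
# Keep Google Image Search away from the images in /photos/
# (hypothetical path), while Google's main crawler is unaffected
User-agent: Googlebot-Image
Disallow: /photos/

# Advertise the XML Sitemap; the Sitemap directive takes a full
# absolute URL and applies regardless of the User-agent blocks
Sitemap: https://www.example.com/sitemap.xml
```

Crawlers obey the most specific User-agent block that matches them, so Googlebot-Image follows its own rules here while every other crawler falls back to the defaults.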
Now that we’ve explained the purpose behind a robots.txt file, why don’t you check whether one is running on your site? By fine-tuning your robots.txt file you might see an improvement to your SEO.
Thank you and keep building your brand.
This post was originally published as Robots.txt Celebrates 20 Years of Search Engine Control for Brand Builder Websites.