Why you should have a robots.txt on your site
By Ikki on Apr 9, 2008 in Tutorials, Web Development
Hi peeps,
I’m pretty sure that most of you have heard about the world famous robots.txt file at least once in your lifetime, right? Let’s see what it is, how it can be used and why you should have it on your sites.
What is robots.txt?
One of the most important files for your site is robots.txt. With it, you can let the spiders (a.k.a. bots or web crawlers) know which areas of your site they can or cannot crawl. This is especially useful when you don’t want them to index specific folders or pages of your site (e.g. admin, cgi-bin, etc.).
Note that this file must be placed at the root of your site and must be named “robots.txt”. The reason for this is that spiders only check for this file in the root directory. If it’s not found there, they will go on to crawl and index all your pages.
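To make the "root of your site" rule concrete, here is a minimal Python sketch that works out where a spider would look for the file, starting from any page URL. The function name robots_url and the www.example.com address are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the root-level robots.txt URL for the site serving page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path with /robots.txt,
    # and drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.example.com/blog/2008/04/post.html?page=2"))
# http://www.example.com/robots.txt
```

No matter how deep the page is, the answer is always the same file at the top of the host, which is why a robots.txt sitting in a subfolder is never picked up.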
Why should I use this, anyway?
I’ve listed a few reasons why you might want to partially/fully exclude bots from your site:
- Your site is still under construction: you don’t want your unfinished work appearing on the search engines, do you?
- Keep some folders out of public view: there are some directories that you might want to keep private - for example: cgi-bin, admin, my-porn-collection xD I don’t think you want to see these being indexed on Google. Got the idea?
- Exclude some evil bots from your site: there are some spiders out there whose purpose is to collect email addresses, suck up your bandwidth, scrape your content, mine your data, etc.
Got it, now how do I use this robots.txt thing?
Creating a robots.txt file isn’t that hard. All you need is a text editor (like Notepad) and a couple of commands: User-Agent and Disallow. The usage:
User-Agent: [Spider's name]
Disallow: [Folder or File name]
Please note that the order of these two lines is important. Always start with User-Agent and next add the Disallow command.
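If you would rather generate the file than type it by hand, a rough Python sketch could look like this. The helper name build_record is hypothetical; it simply writes the User-Agent line first and then one Disallow line per path, in the order described above.

```python
def build_record(user_agent, disallowed):
    """Build one robots.txt record: User-Agent first, then the Disallow lines."""
    # An empty list means "no restrictions", written as a bare "Disallow:".
    paths = list(disallowed) or [""]
    lines = [f"User-Agent: {user_agent}"]
    lines += [f"Disallow: {path}".rstrip() for path in paths]
    return "\n".join(lines) + "\n"


print(build_record("*", ["/cgi-bin/", "/admin/"]))
```

You would then save the returned text as robots.txt in the root of your site.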
User-Agent can also take a wildcard character, “*”, which tells all spiders that the Disallow commands that follow apply to them. Let’s see an example:
User-Agent: *
Disallow: /cgi-bin/
This would allow all spiders to crawl every page in your site except the cgi-bin directory. All files inside this folder will be ignored.
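If you want to double-check how a rule set like this is read, here is a small sketch using Python’s standard urllib.robotparser module. The bot name AnyBot and the www.example.com URLs are made up for illustration, and real crawlers may handle edge cases a bit differently from this parser.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-Agent: *
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())  # feed the rules directly, no download needed

# Anything under /cgi-bin/ is off limits, everything else is fine.
print(parser.can_fetch("AnyBot", "http://www.example.com/cgi-bin/form.pl"))  # False
print(parser.can_fetch("AnyBot", "http://www.example.com/about.html"))       # True
```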
How about blocking a specific bot from your site? Try this:
User-Agent: Googlebot
Disallow: /
This robots.txt file would tell Googlebot (Google’s spider) not to crawl any part of your site.
Another example: let’s allow all bots to crawl any part of your site:
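Keep in mind that a record like this only speaks to the bot it names. Here is a quick sketch with Python’s standard urllib.robotparser module; SomeOtherBot and the www.example.com URL are hypothetical names used just for the check.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-Agent: Googlebot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

url = "http://www.example.com/index.html"
print(parser.can_fetch("Googlebot", url))     # False: this record names Googlebot
print(parser.can_fetch("SomeOtherBot", url))  # True: no record applies to it
```

Any spider that isn’t mentioned, and isn’t covered by a wildcard record, is treated as having no restrictions at all.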
User-Agent: *
Disallow:
We’re leaving our Disallow command empty so no restriction is applied to bots. The following will do the exact opposite - block all bots:
User-Agent: *
Disallow: /
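The only difference between the last two files is a single slash, so it’s worth testing them side by side. This is a sketch using Python’s standard urllib.robotparser module; the helper can_crawl, the bot name AnyBot and the example URL are assumptions made for the demo.

```python
from urllib.robotparser import RobotFileParser

ALLOW_ALL = "User-Agent: *\nDisallow:\n"    # empty Disallow: nothing is blocked
BLOCK_ALL = "User-Agent: *\nDisallow: /\n"  # "/" matches every path on the site


def can_crawl(rules: str, url: str, agent: str = "AnyBot") -> bool:
    """Parse the given robots.txt text and ask whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


page = "http://www.example.com/articles/robots.html"
print(can_crawl(ALLOW_ALL, page))  # True
print(can_crawl(BLOCK_ALL, page))  # False
```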
Now, let’s disallow access to several folders:
User-Agent: *
Disallow: /cgi-bin
Disallow: /admin
Disallow: /stats
Disallow: /includes
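One detail worth knowing about the rules above: Disallow values are matched as path prefixes, so a rule without a trailing slash also catches other paths that merely start the same way. The sketch below shows this with Python’s standard urllib.robotparser module; the bot name and the example paths are invented, and other parsers may differ in the finer details.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-Agent: *
Disallow: /cgi-bin
Disallow: /admin
Disallow: /stats
Disallow: /includes
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

site = "http://www.example.com"
print(parser.can_fetch("AnyBot", site + "/admin/login.php"))  # False
print(parser.can_fetch("AnyBot", site + "/administrator/"))   # False: same prefix
print(parser.can_fetch("AnyBot", site + "/blog/"))            # True
```

If you only want to block the folder itself, writing the rule as /admin/ (with the trailing slash, as in the cgi-bin example earlier) keeps it from matching look-alike paths.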
Final Words
As you can see, creating a robots.txt isn’t that hard at all. However, you must be aware that while most spiders out there respect your robots.txt, some won’t and will continue to crawl your site. These spiders are known as badly behaved bots and are designed to ignore your robots.txt file. Because of this, you cannot treat this technique as a security measure.
Some people like to combine a robust robots.txt file with some .htaccess configuration directives to effectively block bad bots. But that’s another story.
Oh, and if you found my tutorial helpful please don’t forget to Digg it!



This is exactly what I am looking for. Thanks a lot.
I am also wondering about the “honeypot” term. Do you think you can cover it in your next post? (Or is it already somewhere here?)
Budhi | Apr 10, 2008 | Reply
Hi there,
I’m glad you liked my tutorial
About your request, well no I haven’t covered that. Maybe I will sometime! Thanks for the tip!
Ikki | Apr 10, 2008 | Reply
Hi ikki! It’s a cool article..
I also wrote something about it last year and it could be an addition to what you wrote
Here’s the link: http://wakish.info/robotstxt-why-it-this-simple-file-still-so-widely-used/
Cheers!
- Wakish -
Wakish | Apr 11, 2008 | Reply
Hey Wakish
Thanks for sharing this!
I’ve read your post and it’s pretty good, too!
Again, Thank you!
Ikki | Apr 11, 2008 | Reply
hi dude, i setup my robots.txt after seeing ur article… i am const my own site and took to change robots.txt reading a well quoted article..
Kudos to you
sunil | Jul 6, 2008 | Reply
@sunil: Great! Nice to know that you found my post useful
Thanks for your comments!
Ikki | Jul 6, 2008 | Reply