The robots.txt file
Search robots that crawl the web and index pages use (or at least should use) the robots.txt file created by the web site owner. The file specifies which portions of the web site a search robot may crawl and which it should not.
Well-behaved search robots (like Google’s Googlebot) respect robots.txt. Such a robot looks for the file in the root directory of your site, and if the file is missing, it keeps looking for it on later visits.
The standard (RFC) for the robots.txt file is available here:
There are basically three important name-value pairs in robots.txt:
- User-agent: (name of the spider to which the following rules apply)
- Allow: (the files/directories that you want the above spider to index)
- Disallow: (the files/directories that you don’t want the above spider to index)
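Values for Allow and Disallow are matched as path prefixes rather than as regular expressions, although the major spiders (Googlebot among them) also understand the wildcards * and $. For example, if your web server has plenty of bandwidth and you don’t care which spiders visit your site, just put in a User-agent: * line followed by an empty Disallow: line.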
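The allow-all case can be checked with a minimal sketch that uses Python's standard urllib.robotparser module. The two-line file assumes the conventional allow-all form (a wildcard User-agent with an empty Disallow), and the example.com URL and the can_crawl helper are placeholders for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Assumed allow-all file: applies to every spider, disallows nothing.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""

def can_crawl(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

if __name__ == "__main__":
    # Placeholder URL; every spider should be let through.
    url = "http://www.example.com/any/page.html"
    for agent in ("Googlebot", "Slurp", "MSNbot"):
        print(agent, can_crawl(ALLOW_ALL, agent, url))
```

An empty Disallow value means "nothing is off limits", so the check should come back true for each spider.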
For a new site with a decent hosting plan, you would typically want that site to be found by as many search spiders as possible. If your content changes frequently, you would want many spiders (or at least the top ones) to come to your site as often as possible. Most public sites (and the pages in them) are found by users via search engines.
At the other extreme, if your web site is very private and you are not trying to market it to anybody other than its members (who know of the site through means other than search engines), you might not want any search engine to index your site. In that case, use a User-agent: * line followed by a Disallow: / line.
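The keep-everyone-out case can be sketched the same way. The file below assumes the conventional form (a wildcard User-agent with Disallow: /), and the example.com URLs and the can_crawl helper are again placeholders, not part of any real site.

```python
from urllib.robotparser import RobotFileParser

# Assumed block-all file: applies to every spider, disallows the whole site.
DISALLOW_ALL = """\
User-agent: *
Disallow: /
"""

def can_crawl(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

if __name__ == "__main__":
    # Placeholder URLs; a compliant spider should skip all of them.
    for url in ("http://www.example.com/", "http://www.example.com/members/list.html"):
        print(url, can_crawl(DISALLOW_ALL, "Googlebot", url))
```

Because every path on the site starts with /, the single Disallow line covers all of them for any spider that honours the file.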
Between these two extremes is disallowing access to only some directories. Such directories, like images, might not add much to the findability of the site. However, if your site has high-quality image content, then you want not only to allow access to these images but also to provide as much metadata about them as possible.
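A partial policy might look like the sketch below, again checked with Python's urllib.robotparser. The /images/ path is an assumed directory name chosen to match the example in the text, and the example.com URLs and the can_crawl helper are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed partial policy: everything is crawlable except the images directory.
PARTIAL = """\
User-agent: *
Disallow: /images/
"""

def can_crawl(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

if __name__ == "__main__":
    # Paths under /images/ are blocked by prefix match; the rest stay open.
    print(can_crawl(PARTIAL, "Googlebot", "http://www.example.com/index.html"))
    print(can_crawl(PARTIAL, "Googlebot", "http://www.example.com/images/logo.png"))
```

The Disallow value is a path prefix, so it covers every file below /images/ without listing them individually.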
- Google’s search robot is called: Googlebot
- Additional help is available here: http://google.com/bot.html
- Yahoo’s search robot is called: Slurp
- Additional help is available here: http://help.yahoo.com/help/us/ysearch/slurp/index.html
- Microsoft’s search robot is called: MSNbot
- Additional help is available here: http://search.msn.com/docs/siteowner.aspx?t=SEARCH_WEBMASTER_FAQ_MSNBotIndexing.htm
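The robot names listed above are what goes on a User-agent line when one spider should get different rules from the rest. As a hedged sketch, the policy below (invented purely for illustration) admits Googlebot and turns every other spider away. It is checked with Python's urllib.robotparser, whose name matching is a case-insensitive substring test; real spiders may match their names differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: Googlebot may crawl everything, all other spiders nothing.
PER_AGENT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

def can_crawl(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

if __name__ == "__main__":
    url = "http://www.example.com/index.html"  # placeholder URL
    for agent in ("Googlebot", "Slurp", "MSNbot"):
        print(agent, can_crawl(PER_AGENT, agent, url))
```

A spider uses the group that names it and falls back to the * group only when no named group matches.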
A discussion of the important user agents (browsers and search spiders) that access your site over time can be found here:
Important User Agent Strings