{infiniteZest}
// Articles. Tutorials. Utilities.
Search robots and robots.txt
Summary
This article describes search robots and the robots.txt file, with thoughts on which directories to open to search engines and which to keep away from them.
 
Table of Contents

The robots.txt file

The robots

 

The robots.txt file

Search robots crawl the web and index its pages. As they do, they use (or should use) the robots.txt file created by the web site owner. The robots.txt file specifies which portions of the web site may be crawled by a search robot and which should NOT be.

Well-behaved search robots (like Google’s Googlebot) respect robots.txt. They request the file from the root directory of your site, and until you place a robots.txt file there they keep looking for it on later visits.
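
Because a robot looks for the file only at the root of a host, the robots.txt address for any page can be derived from the page's own URL. The following is a minimal Python sketch using the standard-library urllib.parse module; the function name robots_txt_url and the example.com addresses are hypothetical and not part of any crawler.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the root-level robots.txt URL for the host serving page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; the path is always /robots.txt.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.example.com/articles/page.aspx?id=3"))
```

A robots.txt file placed in a subdirectory (for example /articles/robots.txt) would not be found this way, which is why the file belongs in the root directory.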

The standard for the robots.txt file (an Internet draft written in RFC style) is available here:

http://www.robotstxt.org/wc/norobots-rfc.html

There are basically three important name-value pairs in robots.txt:

  • User-agent: (name of the spider to which the following rules apply)
  • Allow: (the files/directories that you want the above spider to index)
  • Disallow: (the files/directories that you don’t want the above spider to index)
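
To see how these three names work together, the sketch below feeds a small rule set to Python's standard-library urllib.robotparser module and asks whether two pages may be fetched. The rules, the robot name ExampleBot, and the example.com URLs are hypothetical. The Allow line is placed before the Disallow line on purpose: Python's parser applies the first rule that matches, while some spiders pick the most specific rule, and with this ordering both readings give the same answer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: everything is off limits except /articles/.
RULES = """\
User-agent: *
Allow: /articles/
Disallow: /
"""


def load_rules(text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = load_rules(RULES)
for path in ("/articles/robots.html", "/admin/login.html"):
    allowed = parser.can_fetch("ExampleBot", "http://www.example.com" + path)
    print(path, "allowed" if allowed else "blocked")
```

With these rules the first path is reported as allowed and the second as blocked.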

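Values are matched as path prefixes, not as regular expressions: a Disallow value of /images/ covers every URL whose path starts with /images/. Some spiders, including Googlebot, additionally understand the wildcard characters * and $, but that is an extension rather than part of the original standard. For example, if your web server has plenty of bandwidth and you don’t care which spiders visit your site, just put the following lines: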

User-agent: *
Disallow:

For a new site with a decent hosting plan, you would typically want the site to be found by as many search spiders as possible. If your content changes frequently, you would want many spiders (or at least the top ones) to come to your site as often as possible. Most public sites, and the pages in them, are found by users via search engines.

At the other extreme, if your web site is private and you are not marketing it to anybody other than its members (who learn of the site through means other than search engines), you might not want any search engine to index your site. In that case, use the following:

User-agent: *
Disallow: /

The situation in between these two extremes is disallowing access to some directories. These directories, like images, might not add much to the findability of the site. However, if your site has high-quality image content, then you want not only to allow access to these images but also to provide as much metadata about them as possible.
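
The three situations (open to everyone, closed to everyone, and closed only for some directories) differ only in the list of Disallow values. The Python sketch below generates the file text for each case. The function name build_robots_txt and the /images/ and /scripts/ directories are hypothetical choices for illustration.

```python
def build_robots_txt(disallow, user_agent="*"):
    """Return robots.txt text for one user agent.

    disallow is a list of path prefixes; an empty list allows everything.
    """
    lines = ["User-agent: " + user_agent]
    if not disallow:
        # An empty Disallow value means nothing is off limits.
        lines.append("Disallow:")
    for prefix in disallow:
        lines.append("Disallow: " + prefix)
    return "\n".join(lines) + "\n"


print(build_robots_txt([]))                          # open to all spiders
print(build_robots_txt(["/"]))                       # closed to all spiders
print(build_robots_txt(["/images/", "/scripts/"]))   # in-between
```

The lines for one user agent are written without blank lines between them, since a blank line marks the end of a record in the file.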

The robots

  • Google’s search robot is called: Googlebot
  • Additional help is available here: http://google.com/bot.html
  • Yahoo’s search robot is called: Slurp
  • Additional help is available here: http://help.yahoo.com/help/us/ysearch/slurp/index.html
  • Microsoft’s search robot is called: MSNbot
  • Additional help is available here: http://search.msn.com/docs/siteowner.aspx?t=SEARCH_WEBMASTER_FAQ_MSNBotIndexing.htm
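
The robot names above can be used as User-agent values to give each spider its own rules. The sketch below, again using Python's standard-library urllib.robotparser, checks a hypothetical policy: Googlebot may crawl everything, Slurp and msnbot are kept out of /images/, and all other robots are kept out entirely. The policy, the helper name may_fetch, the robot name OtherBot, and the example.com URLs are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy with one record per group of robots.
# Records are separated by blank lines.
RULES = """\
User-agent: Googlebot
Disallow:

User-agent: Slurp
User-agent: msnbot
Disallow: /images/

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def may_fetch(robot, path):
    """Return True if the named robot may fetch the given path."""
    return parser.can_fetch(robot, "http://www.example.com" + path)


for robot in ("Googlebot", "Slurp", "msnbot", "OtherBot"):
    print(robot, may_fetch(robot, "/images/logo.png"), may_fetch(robot, "/articles/"))
```

The record for * acts as the fallback for any robot that is not named in an earlier record.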

A discussion of the important user agents (browsers and search spiders) that access your site over time can be found here:

Important User Agent Strings
