Everyone wants to be indexed by Google, Yahoo, Bing and the other popular search engines, but there are often parts of your website that you should keep out of the indexes.
There are a number of ways to accomplish this, but if you are dealing with reputable search engines you only need to do a few simple things.
Robots Text File
Create a robots.txt file that you place in the root of your website.
This file lists the directories or files that you don’t want in search engines.
Remember that search engines need links to find things to index, and they find those links in your pages. This means that if one of your pages references an Images or Media directory to include an image, the search engines may try to access that directory directly and pull back a list of its files.
This can cause unwanted traffic and data transfer on your system.
An example of a robots.txt file may look like this:
User-agent: *
Disallow: /folder1/

User-agent: Googlebot
Disallow: /folder2/
In the example above, each User-agent line names the robot software, and the Disallow line under it restricts that robot from indexing a specific folder in your website.
The * symbol means all robots, while the second group applies only to Google’s Googlebot.
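One way to sanity-check rules like these before publishing them is to feed them to a parser. The sketch below uses Python's standard urllib.robotparser module on the example rules. The bot name OtherBot and the example.com URLs are placeholders made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The example rules, one directive per line.
RULES = """\
User-agent: *
Disallow: /folder1/

User-agent: Googlebot
Disallow: /folder2/
"""

def build_parser(rules_text):
    """Return a RobotFileParser loaded with the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser

parser = build_parser(RULES)

# OtherBot is a hypothetical robot with no group of its own,
# so the * group applies to it.
other_folder1 = parser.can_fetch("OtherBot", "https://example.com/folder1/page.html")
other_folder2 = parser.can_fetch("OtherBot", "https://example.com/folder2/page.html")

# Googlebot has its own group, and this parser checks it against that group only.
google_folder1 = parser.can_fetch("Googlebot", "https://example.com/folder1/page.html")
google_folder2 = parser.can_fetch("Googlebot", "https://example.com/folder2/page.html")

print(other_folder1, other_folder2, google_folder1, google_folder2)
```

With this parser, a robot that has its own group is checked against that group only, so Googlebot is still allowed into /folder1/ here. Google documents similar behavior for its crawler, so it is probably safest to repeat any shared rules inside the Googlebot group if you want both restrictions to apply to it.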
An Allow directive is also provided, so you can restrict all bots from everything and then allow some of them into portions of the site.
This matters if you run advertising on your website. Google provides specific instructions for your AdSense settings so that its crawler can read your pages and display relevant ads.
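As a rough illustration of that pattern, the sketch below blocks every robot from the whole site and then opens it to a single crawler. It assumes the AdSense crawler identifies itself as Mediapartners-Google (check Google's current AdSense documentation for your own setup), and it uses a made-up SomeBot and an example.com URL. The rules are checked with Python's urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser

# Assumed rules: shut out every robot, then allow one ad crawler.
RULES = """\
User-agent: *
Disallow: /

User-agent: Mediapartners-Google
Allow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Hypothetical page URL used only for the check.
page = "https://example.com/articles/page.html"

ad_crawler_allowed = parser.can_fetch("Mediapartners-Google", page)
other_bot_allowed = parser.can_fetch("SomeBot", page)

print(ad_crawler_allowed, other_bot_allowed)
```

The ad crawler is matched by its own group and allowed in, while any robot without a group of its own falls back to the * group and is kept out.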
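If you want to protect your images from indexing, add a Disallow line for the directory that holds them.
If you want to protect any URL that contains a ? (question mark), a Disallow line with a wildcard pattern should work with the search engines that support wildcards.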
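The exact lines depend on your own site layout. As a hedged sketch, assume the images live in /images/ and that the search engines you care about support the * wildcard in paths (Google and Bing document this, but the original robots.txt convention did not include it). The small matcher below is illustrative only: rule_matches is a made-up helper, not part of any library, and it shows which URLs the two assumed rules would cover.

```python
import re

# Assumed rules for illustration; adjust the directory name to your site.
IMAGE_RULE = "/images/"   # written in robots.txt as "Disallow: /images/"
QUERY_RULE = "/*?"        # written in robots.txt as "Disallow: /*?"

def rule_matches(pattern, path):
    """Return True if a robots.txt path pattern covers the given URL path.

    Supports the common extensions: * matches any run of characters and a
    trailing $ anchors the end. Otherwise the pattern is a prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with ".*" for each *.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None

print(rule_matches(IMAGE_RULE, "/images/logo.png"))  # inside the image directory
print(rule_matches(QUERY_RULE, "/page.php?id=3"))    # contains a question mark
print(rule_matches(QUERY_RULE, "/page.php"))         # no question mark
```

Robots that do not understand wildcards may treat the * as a literal character, so a rule like the second one is only a request to the engines that support it.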
Other Option for noindex
The noindex value of the robots meta tag is provided to allow you to protect specific pages or content from indexing. You may want to keep specific time-sensitive material out of the indexes so searchers do not get confused.
To use the noindex meta tag, place the tag in the head section of any page you do not want indexed.
<html>
<head>
<title>...</title>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
</head>
You may also want to control how links within your site, or links pointing out from your site, are seen by search engine robots.
To tell a robot not to index a page and not to follow the links on it, you can place this meta tag in the head section of the page:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
This tag tells robots not to index the current page and not to follow the links on the page.
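To confirm that a page really carries the directive, you can parse the page and read the tag back. The sketch below uses Python's standard html.parser module. The class name RobotsMetaFinder, the helper robots_directives and the sample page are made up for illustration.

```python
from html.parser import HTMLParser

class RobotsMetaFinder(HTMLParser):
    """Collects the directives from any robots meta tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values.
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            for item in attr_map.get("content", "").split(","):
                if item.strip():
                    self.directives.add(item.strip().lower())

def robots_directives(html_text):
    """Return the set of robots meta directives found in html_text."""
    finder = RobotsMetaFinder()
    finder.feed(html_text)
    return finder.directives

# Hypothetical sample page carrying the tag shown above.
SAMPLE_PAGE = """<html> <head> <title>Sample</title>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"> </head>
<body></body></html>"""

found = robots_directives(SAMPLE_PAGE)
print("noindex" in found, "nofollow" in found)
```

A page without the tag returns an empty set, which makes it easy to scan a list of pages and flag the ones you forgot to protect.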
Final Note About Search Bots
Not all bots that crawl your site are friendly. Some could be harmful, but many are research bots that do not honor your robots.txt or meta tag requests.
Reputable search engines will follow these directives, and that is about all you can ask.
For other ways to protect your content, learn about index files, .htaccess and .htpasswd files, and the language and server settings that can help you control how your website is seen by the world.