If you check your web site logs to see who is coming and going, you will see a number of bots and spiders requesting your pages and indexing them for their masters, most notably the search engines. This happens all the time, at all hours of the day and night.
You might not want some of your pages indexed. This is where you can create a robots.txt file, which the bots and spiders are supposed to adhere to. The file follows an industry-standard protocol: you can allow or disallow particular search engines from crawling your site, and you can keep crawlers out of certain parts of it. Bear in mind that it is a request rather than a lock. The file itself is publicly readable, and badly behaved bots can simply ignore it.
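As a rough illustration of allowing or disallowing particular crawlers, the sketch below feeds a small robots.txt to Python's standard urllib.robotparser module and asks what two crawlers may fetch. The crawler names ExampleBot and OtherBot, the /private/ directory and the example.com URLs are all made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: shut one crawler out entirely, and keep
# every other crawler out of a single directory.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def load_rules(text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
examplebot_home = rules.can_fetch("ExampleBot", "http://example.com/index.html")
otherbot_home = rules.can_fetch("OtherBot", "http://example.com/index.html")
otherbot_private = rules.can_fetch("OtherBot", "http://example.com/private/notes.html")
print(examplebot_home, otherbot_home, otherbot_private)
```

Under these rules Python's parser refuses ExampleBot everywhere, while other crawlers are only kept out of /private/. Real crawlers may interpret edge cases differently, so treat this as a quick sanity check rather than a guarantee.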
If you have no robots.txt file then you are allowing all and sundry in. But if you do have a robots.txt file and it is formatted like this:
User-agent: *
Disallow: /
...then you are blocking everything, as the “/” is your top level or root directory. Or like this:
User-agent: *
Disallow: /mysecretstuff/
...then any URL starting with /mysecretstuff/, meaning that directory and everything below it, is off limits to crawlers.
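To see how the two rule sets above behave, here is a minimal sketch using Python's standard urllib.robotparser module. The helper name allowed, the crawler name AnyBot, the page names and the example.com host are assumptions made for the illustration.

```python
from urllib.robotparser import RobotFileParser

# The two rule sets from the examples above.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
BLOCK_DIR = "User-agent: *\nDisallow: /mysecretstuff/\n"


def allowed(robots_txt, path, agent="AnyBot"):
    """Return True if `agent` may fetch `path` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # example.com is a placeholder host; only the path matters here.
    return parser.can_fetch(agent, "http://example.com" + path)


for path in ("/index.html", "/mysecretstuff/plans.html"):
    print(path, allowed(BLOCK_ALL, path), allowed(BLOCK_DIR, path))
```

Matching is done on the start of the path, which is why the trailing slash in the Disallow line matters: it limits the rule to that directory and its contents.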
The robots.txt file has to be placed at the root level of your site, typically in your www or public_html folder, so that it is served as /robots.txt. More info here.
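Because crawlers only look for the file at the root of the host, the address they request can be worked out from any page URL. The sketch below shows one way to do that in Python. The function name robots_url and the example.com addresses are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt address for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, and replace the path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.example.com/blog/some-post.html?page=2"))
```

A robots.txt sitting in a subdirectory, such as /blog/robots.txt, would not be picked up by this lookup, which is why the file needs to live in the root folder.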
One benefit of a robots.txt file that initially blocks all crawlers and spiders comes while your site is still in development: it keeps the site from being indexed until you have it ready for the outside world.
If you have a self-hosted WordPress site located at the root level, there is a plugin that lets you edit the file from within WordPress, and it's by our Dutch friend again. It's pretty cool: you can disallow directories that do not need to be indexed and tweak further settings.