John Blackburn announces that the refreshed robots.txt analysis tool can now recognize sitemap declarations and relative URLs.
“Earlier versions weren’t aware of sitemaps at all and understood only absolute URLs; anything else was reported as ‘Syntax not understood.’ The improved version now tells you whether your sitemap’s URL and scope are valid. You can also test against relative URLs with a lot less typing.
Reporting is better, too. You’ll now be told of multiple problems per line if they exist, unlike earlier versions which only reported the first problem encountered. And we’ve made other general improvements to analysis and validation.”
To let search engine bots index everything on your site except the images folder, your robots.txt file will look like this:
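A minimal robots.txt matching that description might be the following (the folder name and sitemap URL are assumptions for illustration):

```
User-agent: *
Disallow: /images/
Sitemap: http://www.example.com/sitemap.xml
```

The Sitemap line is the kind of declaration the updated tool now understands.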
You visit Webmaster Central to test your site with the robots.txt analysis tool using these two test URLs:
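For example, two hypothetical test URLs (the paths are illustrative) could be a regular page, which should be allowed, and a file under the images folder, which should be blocked; the second is written as a relative URL, which the updated tool now accepts:

```
http://www.example.com/index.html
/images/logo.png
```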
[Screenshot: the previous version of the tool]

[Screenshot: the updated version of the tool]
For more information, read the official Google Webmaster Central blog.
In other news, Google confirms the new unavailable_after META tag, which we reported on last week.
“Let’s assume you are running a promotion that expires at the end of 2007. In the headers of page www.example.com/2007promotion.html, you can use the following:
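A sketch of the tag itself (the NAME value and date format follow Google's announced convention; the exact timestamp is an assumption mirroring the X-Robots-Tag example below):

```html
<META NAME="GOOGLEBOT" CONTENT="unavailable_after: 31-Dec-2007 23:59:59 EST">
```

After the stated date and time, the page would no longer appear in search results.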
The second exciting news: the new X-Robots-Tag directive, which adds Robots Exclusion Protocol (REP) META tag support for non-HTML pages! Finally, you can have the same control over your videos, spreadsheets, and other indexed file types. Using the example above, let’s say your promotion page is in PDF format. For www.example.com/2007promotion.pdf, you would use the following in the file’s HTTP headers:
X-Robots-Tag: unavailable_after: 31 Dec 2007 23:59:59 EST
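One way to actually emit that header (a sketch, assuming an Apache server with mod_headers enabled; the filename matches the promotion example above):

```apache
<Files "2007promotion.pdf">
  Header set X-Robots-Tag "unavailable_after: 31 Dec 2007 23:59:59 EST"
</Files>
```

Other servers have equivalent mechanisms for attaching response headers to specific files.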
Remember, REP META tags can be useful for implementing noarchive, nosnippet, and now unavailable_after tags for page-level instruction, as opposed to robots.txt, which is controlled at the domain root.”