Google's Crawling Through HTML Forms - Sounds Good but Comes with Side-Effects!

May 27, 2008 | 3,628 views | by Navneet Kaushal

Our readers might recall that last month I informed them about Google testing a new search-related technology that enables its crawlers to explore some HTML forms in an attempt to discover new web pages and URLs that have not yet been found and indexed. However, Michael VanDeMar at Smackdown has found a loophole in this technology, due to which certain pages are being indexed that are not supposed to be. Here is a look at the links that have been indexed. Please note that these pages have an almost identical format:

These pages were internal search results intended to surface posts from Michael's blog, but the phrases behind them were insignificant as searches, since users were unlikely to use them as search terms. Michael then tried an experiment on his WordPress install to keep such pages out of the index. This is what he did:

<?php if ( !is_archive() && !is_search() ) { // normal posts and pages: allow indexing ?>
<meta name="ROBOTS" content="ALL" />
<?php } else { // archive and search-result pages: presumably tagged noindex instead ?>
<meta name="ROBOTS" content="NOINDEX,FOLLOW" />
<?php } ?>

This code stopped the indexing of those pages. However, it was only a temporary reprieve. In February 2008, Michael noticed that his website had been visited by Googlers from within the Googleplex, and only Googlers had the tools to view pages that carried a 'noindex' tag. This wasn't an isolated occurrence; the same thing was happening to other webmasters as well. After a while, new pages had been indexed that, according to Michael, were not supposed to be indexed. Matt Cutts then explained that the new form of crawling had the sole purpose of discovering new links. According to Matt, “it’s less about crawling search results and more about discovering new links. A form can provide a way to discover different parts of a site to crawl. The team that worked on this did a really good job of finding new urls while still being polite to webservers.”

However, according to Michael, there are some side-effects to this new form of crawling, and they were evident on two mortgage websites he monitored, one of them Better Mortgage Rate. On these sites, the forms crawled by Googlebot were not search forms at all; they were interfaces for JavaScript mortgage calculators. According to the Official Google Webmaster Central Blog, the new method only retrieves forms that use GET as the method, as opposed to POST: “Needless to say, this experiment follows good Internet citizenry practices. Only a small number of particularly useful sites receive this treatment, and our crawl agent, the ever-friendly Googlebot, always adheres to robots.txt, nofollow, and noindex directives. That means that if a search form is forbidden in robots.txt, we won’t crawl any of the URLs that a form would generate. Similarly, we only retrieve GET forms and avoid forms that require any kind of user information.”
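As a rough illustration of that robots.txt point, a webmaster can simply forbid the path a search form submits to; a minimal sketch, assuming a hypothetical /search/ results path:

# Assumed example - the /search/ path is hypothetical.
# With the form's target disallowed, Googlebot will not crawl
# any of the URLs that filling in the form would generate.
User-agent: Googlebot
Disallow: /search/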

If a form declares no method, it defaults to GET; moreover, a page that has not explicitly declared GET is most probably using the form for something other than server-side scripting. The action attribute has no default value and is required for the form tag to be valid, so in its absence Googlebot assumes that the page the form resides on is the intended target.
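As a sketch of what such a form might look like (the field names here are hypothetical, modelled on the mortgage-calculator interfaces mentioned above):

<!-- No method attribute, so the form defaults to GET; no action
     attribute, so the page the form sits on is treated as the target.
     A JavaScript calculator would normally handle these fields itself
     and never send them to the server at all. -->
<form>
  <input type="text" name="loan_amount" />
  <input type="text" name="interest_rate" />
  <input type="text" name="term_years" />
  <input type="submit" value="Calculate" />
</form>
<!-- A crawler filling in the form could still request URLs such as
     calculator.html?loan_amount=200000&interest_rate=6.5&term_years=30,
     which return the same page content regardless of the values. -->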

In the case of the first website, Michael observed that taking values from the form's fields, mixing them up, and appending them to the URL does absolutely nothing to the content of the page. As a result, six identical copies of the same page were indexed, with only the URL differing in each case.

Image 1: Six identical copies of the same page indexed, differing only by URL.

According to Michael, in the case of Better Mortgage Rate, up to 172 of these fictitious pages appear to have been indexed.

Image 2: Up to 172 such pages indexed for Better Mortgage Rate.

This is what Matt Cutts had to say about the crawling of forms by Googlebot. “The main thing that I want to communicate is that crawling these forms doesn’t cause any PageRank hit for the other pages on a site. So it’s pretty much free crawling for the form pages that get crawled.”

In the end, it all comes down to this: if Google intends to make this new form of crawling part of its mainstream discovery, it definitely needs to implement some changes in the Robots Exclusion Protocol, such as a NOFORM meta tag. This could help reduce these inconsistencies.
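No such directive exists today, so this is purely a sketch of the suggestion; if it were ever adopted, it might piggyback on the existing robots meta tag:

<!-- Hypothetical: "noform" is only a proposed directive, not something
     Google or the Robots Exclusion Protocol currently supports. -->
<meta name="robots" content="noform" />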


Navneet Kaushal

Navneet Kaushal is the founder and CEO of PageTraffic, an SEO Agency in India with offices in Chicago, Mumbai and London. A leading search strategist, Navneet helps clients maintain an edge in search engines and the online media. Navneet's expertise has established PageTraffic as one of the most awarded and successful search marketing agencies.