Questions about website architecture, crawling, indexing, and ranking all revolve around one central issue: how easy is it for search engines to crawl your site? The Google Webmaster Central Blog has discussed this topic many times, and it has now returned to it with a slideshow presentation and some key points to consider.
New content is being created and uploaded to the Internet all the time. With limited resources, however, Googlebot can find and crawl only a fraction of the effectively infinite content available online, and only a portion of the crawled content is then indexed by Google.

This is where URLs come in. URLs are the bridges between a website and a search engine's crawler: crawlers must be able to find and crawl the URLs in order to reach a site's content. If a site's URLs are complicated or redundant, crawlers spend more time tracing and retracing the same steps. If the URLs are organized and each leads directly to a particular piece of content, crawlers can spend their time accessing that content rather than crawling through empty pages or crawling the same content over and over via different URLs.

The slides give some examples of what not to do in this regard. Below are some recommendations on the complicated issue of URL crawling. Following them can help crawlers find your website's content faster:
- Remove user-specific details from URLs: Consider removing URL parameters such as session IDs or sort order from the URL and putting them into a cookie. Storing this information in a cookie and 301 redirecting to a “clean” URL lets you retain the information while reducing the number of URLs that point to the same content.
- Rein in infinite spaces: If a website has a calendar that links to an infinite number of past or future dates (each with its own unique URL), or paginated data that returns a status code of 200 when &page=3563 is appended to the URL even though there aren't that many pages of data, it probably contains an infinite crawl space. In that case, crawlers could be wasting their bandwidth, and yours, trying to crawl it all. Google has published tips on how to rein in infinite crawl spaces.
- Disallow actions Googlebot can't perform: Using a robots.txt file, you can disallow crawling of login pages, contact forms, shopping carts, and other pages whose sole function is something a crawler can't perform. (Crawlers are notoriously cheap and shy, so they don't usually "Add to cart" or "Contact us.") This lets crawlers spend more of their time crawling content.
- One URL, one set of content: Ideally, each URL leads to a unique piece of content, and each piece of content can be accessed via only one URL. This one-to-one pairing between URL and content helps streamline a site for effective crawling and indexing. If your CMS or current site setup makes this difficult, you can use the rel=canonical element to indicate the preferred URL for a particular piece of content.
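As a rough illustration of two of the recommendations above, here is a minimal Python sketch using only the standard library. It is not taken from the Google presentation. The parameter names (`sessionid`, `sid`, `sort`), the `example.com` URLs, the robots.txt rules, and the helper names `clean_url` and `is_crawlable` are all assumptions made for the example; substitute whatever your own site uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

# Assumed parameter names; replace with the ones your site really uses.
USER_SPECIFIC_PARAMS = {"sessionid", "sid", "sort"}


def clean_url(url):
    """Drop user-specific query parameters, keeping the rest in order."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in USER_SPECIFIC_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


# Hypothetical robots.txt rules for pages whose only purpose is an action.
ROBOTS_TXT = """\
User-agent: *
Disallow: /login
Disallow: /cart
Disallow: /contact
"""


def is_crawlable(url, user_agent="Googlebot"):
    """Check a URL against the example rules above with the standard parser."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    variants = [
        "http://www.example.com/shoes?sessionid=abc123&color=red",
        "http://www.example.com/shoes?color=red&sort=price",
        "http://www.example.com/shoes?color=red",
    ]
    # All three variants should collapse to the same clean URL.
    print({clean_url(u) for u in variants})
    print(is_crawlable("http://www.example.com/cart"))
    print(is_crawlable("http://www.example.com/shoes"))
```

On a real site, the server would store the removed values in a cookie and answer the original request with a 301 redirect to the cleaned URL; that part depends on your web framework and is left out here.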
For more information on optimizing a site for crawling and indexing, visit the Webmaster Help Forum.