Googlebot is like a dream that knows us all, down to the soul. In this interview, Maile Ohye as the Website and Jeremy Lilley as the Googlebot, from the Google Webmaster Central Blog, answer the questions you have always wanted to ask.
Website: Would you crawl with the same headers if the site were in the U.S., Asia or Europe? Do you ever use different headers?
Googlebot: Typically the headers are the same worldwide. I crawl to see what a page looks like with the site's default language and settings. At times the User-Agent is different. For example, AdSense fetches use "User-Agent: Mediapartners-Google", and image search uses "User-Agent: Googlebot-Image/1.0". Wireless fetches mostly have carrier-specific user agents, while Google Reader RSS fetches include extra info such as the number of subscribers.
However, so that session-specific info does not affect the content, I usually avoid cookies. Further, if a server puts a session id in a dynamic URL rather than in a cookie, I can even identify it, and so I can avoid crawling the same page a million times with a million different session ids.
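As an illustration of the session-id point, here is a minimal sketch of how a crawler might normalize URLs so that the same page is not fetched once per session. It is not Googlebot's actual logic: the function name `strip_session_id` and the list of parameter names in `SESSION_PARAMS` are assumptions made up for this example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of query parameter names that often carry session ids.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}


def strip_session_id(url: str) -> str:
    """Return the URL with any session-id query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


# Two visits to the same page with different session ids map to one URL.
seen = {
    strip_session_id("http://www.example.com/page?id=7&sid=abc123"),
    strip_session_id("http://www.example.com/page?id=7&sid=zzz999"),
}
```

With this kind of normalization, `seen` ends up holding a single entry, so the page would be queued for crawling only once.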
Website: Do you index all URLs or are certain file extensions automatically filtered?
Googlebot: When indexing for regular web search, I don't download links to MP3s and videos. Similarly, I treat a JPG link differently than an HTML or PDF link. If I'm looking for links as Google Scholar, I'm more interested in the PDF article than the JPG file. But if I'm crawling for image search, I'm more interested in JPGs, and for news, in HTML and images.
Website: How do you treat an unknown file extension, for example http://www.example.com/page1.LOL111?
Googlebot: I would treat it fairly. After I download a file, I use the Content-Type header to check whether it really is HTML, an image, text, or something else. If it happens to be a special data type like a PDF file, Word document, or Excel spreadsheet, I make sure it is in a valid format and extract the text content. So, when crawling http://www.example.com/page1.LOL111 with its unknown file extension, I would start by downloading it and meanwhile try to figure out the content type from the header. If I can't figure out the content type, or it's a format that we don't index (e.g. mp3), it is put aside. Otherwise, we proceed to index the file.
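The decision described here depends on the Content-Type header, not on the file extension. A rough sketch of that idea follows, assuming a hypothetical `classify` function and an illustrative, incomplete set of media types; the real rules a search engine applies are not given in this interview.

```python
# Illustrative media types only; the real lists are assumptions here.
INDEXABLE_TEXT = {"text/html", "text/plain"}
EXTRACTABLE = {
    "application/pdf",
    "application/msword",        # Word document
    "application/vnd.ms-excel",  # Excel spreadsheet
}


def classify(content_type_header: str) -> str:
    """Decide what to do with a download based on its Content-Type header."""
    # Drop parameters such as "; charset=UTF-8" and normalize case.
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    if media_type in INDEXABLE_TEXT:
        return "index"
    if media_type in EXTRACTABLE:
        return "extract-text"
    if media_type.startswith("image/"):
        return "image"
    # Unknown or non-indexed formats (e.g. audio/mpeg) are put aside.
    return "skip"
```

Under these assumptions, a file named page1.LOL111 that is served with `Content-Type: text/html; charset=UTF-8` would be classified as "index" despite its extension.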
Website: Can you explain your header "Accept-Encoding: gzip,deflate"?
Googlebot: It means I accept gzip- or deflate-compressed content, which saves bandwidth. Whether you should enable compression doesn't have a simple answer. Both Apache and IIS have options to enable gzip and deflate compression, though there's a CPU cost involved for the bandwidth saved. Typically, it's only enabled for easily compressible text content such as HTML, CSS, and PHP output. And it only gets used if the user's browser or I (a search engine crawler) allow it. Personally, I prefer "gzip" over "deflate". Gzip is a slightly more robust encoding: there is consistently a checksum and a full header, giving me less guess-work than with deflate. Otherwise they're very similar compression algorithms. If you have some spare CPU on your servers, it might be worth experimenting with compression. But if you're serving dynamic content and your servers are already heavily CPU loaded, you might want to hold off.
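The "less guess-work" remark can be made concrete with Python's standard `gzip` and `zlib` modules. The sketch below compresses the same sample body three ways. The sample body and the helper `decode_deflate` are made up for this example, and the note that some servers send raw deflate without the zlib wrapper is general background rather than something stated in the interview.

```python
import gzip
import zlib

body = b"<html><body>" + b"hello crawler " * 200 + b"</body></html>"

# gzip: fixed magic bytes up front, CRC-32 and length in the trailer.
gzip_bytes = gzip.compress(body)

# "deflate" as specified for HTTP: deflate data inside a zlib wrapper.
zlib_bytes = zlib.compress(body)

# Raw deflate with no wrapper at all (negative wbits), sent by some servers.
raw = zlib.compressobj(wbits=-15)
raw_bytes = raw.compress(body) + raw.flush()


def decode_deflate(data: bytes) -> bytes:
    """Decode a 'deflate' body, guessing between zlib-wrapped and raw."""
    try:
        return zlib.decompress(data)
    except zlib.error:
        return zlib.decompress(data, wbits=-15)


# The gzip trailer stores a little-endian CRC-32 of the original content.
stored_crc = int.from_bytes(gzip_bytes[-8:-4], "little")
```

A gzip body can be recognized by its first two bytes (`1f 8b`) and verified against `stored_crc`, whereas a receiver of "deflate" may have to try two decodings, as `decode_deflate` does.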
Website: What do you have to say about over-protective robots.txt files?
Googlebot: Well, there are plenty of them. Some are mere HTML error pages rather than valid robots.txt files. Others have infinite redirects to totally unrelated sites, while still others are just huge, with thousands of different URLs listed individually. The problem I face is that once I see a restrictive robots.txt, I may have to start throwing away content I've already crawled from the index, and then recrawl a lot of content once I'm allowed to hit the site again. At least a 503 response code would've been temporary. Mostly, I re-check robots.txt once a day. For webmasters, trying to control crawl rate by swapping robots.txt files usually backfires. It's better to set the rate to "slower" in Webmaster Tools.
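To see why swapping in a restrictive robots.txt is such a blunt tool, here is a small sketch using Python's standard `urllib.robotparser`. The two robots.txt bodies and the helper `allowed` are invented for this example, and Python's parser is only a stand-in; it is not how Googlebot itself evaluates the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt bodies for the same site.
NORMAL = "User-agent: *\nDisallow: /private/\n"
BLOCK_ALL = "User-agent: *\nDisallow: /\n"


def allowed(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt body lets the agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


page = "http://www.example.com/page1.html"
normally_crawlable = allowed(NORMAL, page)
crawlable_after_swap = allowed(BLOCK_ALL, page)
```

Here `normally_crawlable` is true and `crawlable_after_swap` is false: the swapped file says every page is off limits, not "please slow down", which is why a temporary 503 or a lower crawl rate setting is the gentler signal.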
Googlebot: Hey, Website, thanks for all of your questions, but it's time to crawl away. You've been wonderful, but I'm going to have to say "FIN, my love."
Website: Thank you, Googlebot, for everything! Keep crawling!