Duplicate content is becoming a major issue, not only for the search engines but also for webmasters around the world. In this session, representatives of the big three, namely Google, Yahoo! Search and MSN Live, explained some of their strategies for dealing with duplicate content.
- Rand Fishkin
- Ben D'Angelo, Software Engineer of Google
- Derrick Wheeler, Senior Search Engine Optimization Architect of Microsoft
- Priyank Garg, Director of Product Management of Yahoo! Search
The session was opened by Ben D'Angelo, who started off by pointing out a crucial duplicate content issue: multiple URLs pointing to the same page or to very similar pages. Duplicate content is also found across other websites as syndicated or scraped content. The ideal situation is one URL leading to one piece of content.
There are a number of examples of duplicates, such as www vs. no-www, session IDs, URL parameters, print-version pages, CNAMEs, etc. There is also similar content on different URLs, as well as sites in different countries with the same content.
Ben went on to explain how Google handles duplicate content. Google basically clusters duplicate pages together and chooses the page that best represents the search. Google employs different kinds of filters for the different kinds of duplicate content, but this is simply a filter and not any kind of penalty.
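The clustering idea Ben described can be illustrated with a toy near-duplicate check. This is emphatically not Google's actual algorithm, just a minimal sketch of one common technique (word shingles compared with Jaccard similarity) that shows how two pages can be judged "near duplicates":

```python
# Toy near-duplicate detection via word shingles and Jaccard similarity.
# A sketch of the general idea only -- search engines use far more
# sophisticated (and undisclosed) methods.

def shingles(text: str, k: int = 3) -> set:
    """Break text into overlapping k-word shingles."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Similarity of two shingle sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Two pages differing by one word score 0.5; identical pages score 1.0.
page_a = shingles("the quick brown fox jumps")
page_b = shingles("the quick brown fox leaps")
similarity = jaccard(page_a, page_b)
```

Pages whose similarity exceeds some threshold would land in the same cluster, from which one representative URL is chosen.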
So how do you prevent this from happening to you? You can take some of the following measures:
- To prevent exact duplicates, one could use a 301 redirect.
- To prevent near duplicates, one could use robots.txt.
- A different language is not a duplicate. One could use unique content specific to each country.
- Don't put extraneous parameters in the URLs.
But there is a chance that other sites will cause duplicate content. If you are syndicating your content out, make sure there is a link back to your original article or content; you could also have partners publish only a short summary. If you are syndicating in someone else's content, do the reverse.
It would be an extremely rare case for scrapers to impact you or your content. However, one can't rule out the possibility, and if it does happen, you should file a DMCA complaint and/or a spam report.
Ben was followed by Priyank Garg, who explained how Yahoo! Search deals with duplicate content. Yahoo! applies duplicate filters at every step of its pipeline. He went on to showcase some examples and stated that most duplicate content is accidental. A large number of duplicates come from soft 404s, not real 404s. Many others are abusive, such as scraper sites.
The final speaker of the session was Derrick Wheeler of Microsoft. Derrick made no bones about the fact that duplicate content is Microsoft's worst nightmare. Microsoft follows the CIRTA methodology, which goes as:
- C= Crawl
- I= Index
- R= Rank
- T= Traffic
- A= Action
He offered the following tips on how to handle the problem of duplicate content:
- Try to detect when a search engine comes to your site.
- Sometimes this can be helpful, such as in the case of session IDs.
- Be fully aware of your parameters.
- Make sure that you link to your parameters in a consistent order.
- Exclude any form of duplicates using robots.txt, noindex, nofollow, etc.
- Try to get hold of a tool that can crawl your site. This will let you see how an engine looks at your site.
- Always focus on the strong URLs of your website first.
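The advice above about linking to your parameters in a consistent order can be sketched with a small normalizer. This is a hypothetical helper, not anything the speakers prescribed; it simply sorts query parameters so that the same page is always linked via one URL:

```python
# Hypothetical URL normalizer: emit query parameters in one consistent
# (alphabetical) order so internal links don't spawn duplicate URLs
# like /p?b=2&a=1 vs. /p?a=1&b=2 for the same content.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize_url(url: str) -> str:
    parts = urlsplit(url)
    # Sort parameters by key (then value) and re-encode them.
    query = urlencode(sorted(parse_qsl(parts.query)))
    # Drop the fragment; it is irrelevant to search engines.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))
```

Running every internally generated link through a function like this guarantees the consistent parameter order Derrick recommended.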