It seems Google is working hard on the content duplication issue. After introducing a new parameter for handling duplicate content problems, Google is now talking about reunifying duplicate content on its official Google Webmaster Blog. Site owners have always found it difficult to handle duplicate content on their sites. Websites grow over time: new features get added, changes are introduced and removed, and content is edited, added, and removed. After a while, many websites accumulate systematic cruft in the form of multiple URLs that return the same content. Duplicate content on a site does not in itself create a problem, but it does make it harder for search engines to crawl and index the pages. In addition, the PageRank and related information gathered from incoming links can get diffused across pages that search engines do not recognize as duplicates, which can lower the ranking of your preferred version in Google. So, here are a few steps to deal with duplicate content on your site:
Recognizing Duplicate Content on the Website
The first step is to recognize the duplicate content on your website. Take a unique text snippet from a page and search for it using a site: query in Google, which limits the results to pages of your own website. If multiple results return the same content, you have content duplication.
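As an illustration, assuming your site is example.com (a placeholder domain), the check might look like this:

```text
site:example.com "a unique sentence copied from one of your pages"
```

If Google returns two or more URLs for that exact snippet, those pages are likely duplicates of each other.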
Determining Preferred URLs
Before you fix the duplicate content issue, work out your preferred URL structure: for each piece of content, determine the URL that you would prefer to use.
Maintaining Consistency Within the Website
After choosing the preferred URLs, make sure that they are used in all possible locations within the site, including your Sitemap file.
Applying 301 Permanent Redirects
Redirect duplicate URLs to the preferred URLs with a 301 response code, as this helps both visitors and search engines locate your preferred URLs. If the website is available on several domain names, pick one and use 301 redirects from the other domains, making sure that you forward to the right page and not just the root of the domain. If you support both www and non-www host names, choose one, use the preferred domain setting in Webmaster Tools, and redirect accordingly.
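As a sketch of what such a redirect can look like, here is a minimal mod_rewrite rule for an Apache .htaccess file; example.com is a placeholder, and the exact mechanism will depend on your server:

```apacheconf
# Send every request on the non-www host to the same path on the
# preferred www host with a permanent (301) redirect
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

Note that the rule forwards the requested path ($1), not just the domain root, which is exactly the behaviour described above.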
Implementing the rel="canonical" Link Element Where You Can on Your Pages
Where 301 redirects are not possible, the rel="canonical" link element gives search engines a better understanding of your site and your preferred URLs. Other major search engines, including Yahoo!, Bing, and Ask.com, also support this link element.
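For example, a duplicate page can declare its preferred version like this (the domain and product path are placeholders):

```html
<!-- Placed inside the <head> of each duplicate page,
     pointing search engines at the preferred URL -->
<link rel="canonical" href="http://www.example.com/product?item=blue-widget" />
```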
Using the URL Parameter Handling Tool in Google Webmaster Tools
If some or all of the duplicate content on your website comes from URLs with query parameters, this tool can be of great help. It lets site owners notify Google about which parameters within the site's URLs are important and which are irrelevant.

Avoiding Blocking Duplicate Content with the robots.txt File

You could disallow crawling of duplicate content with a robots.txt file, but Google now recommends that site owners not block access to duplicate content with robots.txt or other methods. It is in fact asking you to use the rel="canonical" link element, 301 redirects, and the URL parameter handling tool instead. If you completely block access to duplicate content, search engines will treat those URLs as separate, unique pages, because they cannot tell that they are just different URLs for the same content. It is therefore better to let these URLs be crawled, while clearly marking them as duplicates through one of the methods recommended by Google. If you allow Google to crawl these URLs, Googlebot will learn to identify duplicate content by looking at the URL and will avoid unnecessary recrawls. In a few cases, duplicate content can lead to too much crawling of a website; you can avoid this by adjusting the crawl rate setting in Webmaster Tools.
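For contrast, this is the kind of robots.txt rule Google is advising against for duplicate content (the sessionid parameter is a hypothetical example):

```text
# Discouraged for duplicates: blocking these URLs prevents Googlebot
# from crawling them, so it cannot learn that they are merely
# alternate URLs for the same content
User-agent: *
Disallow: /*?sessionid=
```

Leaving such URLs crawlable and marking the preferred version with rel="canonical" or a 301 redirect lets search engines consolidate them instead.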