Go to Google Search and type in “filetype:pdf”, and you will be astounded by the sheer number of PDF files online: 542,000,000.
Despite such a high number of PDF files, as a webmaster you are often left wondering about the best practices for PDF indexing and when your non-HTML file will get priority over standard webpages. Google had not offered much clarity on the issue until recently, when a Matt Cutts video resolved one PDF-related query.
Here's a look at Matt Cutts' very informative video:
Perhaps this video was the driving force behind Google wanting to resolve other PDF indexing queries as well. The result: Google's Webmaster Trends Analyst, Gary Illyes, set out to answer the “most often-asked questions about PDF indexing”.
- Can Google index all types of PDF file?
In most cases Google can index textual content in any language from PDF files, as long as the files are not password protected or encrypted. Google adds: “The general rule of thumb is that if you can copy and paste the text from a PDF document into a standard text document, we should be able to index that text.”
However, if the text is embedded as images, Google processes the images with OCR algorithms, which let it “convert a picture (of a thousand words) into a thousand words” that can be searched and indexed, making these documents easier to find.
- Do your images get indexed?
As of now, Google does not index the images inside PDF files. If you want your images to be indexed, you should create HTML pages for them. Further information about image indexing can be found in the Google Help Center.
- What about the links in your PDF documents?
Links in PDF files are treated like links in HTML: they can pass PageRank and other indexing signals, and Google can follow them after the PDF file is crawled.
However, it is “not possible to ‘no follow’ links within a PDF document”.
- What if you do not want your PDF files to show up in the search results? And if they already appear, how do you remove them?
All you have to do is add an X-Robots-Tag: noindex header to the HTTP response used to serve the file. “If they’re already indexed, they’ll drop out over time if you use the X-Robot-Tag with the noindex directive.” If you want to remove them even faster, use the URL Removal Tool in Google Webmaster Tools.
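As a rough illustration of the X-Robots-Tag approach, here is a minimal sketch using Python's standard http.server module. The helper extra_headers, the class NoIndexPDFHandler, the serve function and the port number are hypothetical names chosen for this example. A production site would normally set the header in its web server configuration instead.

```python
from http.server import HTTPServer, SimpleHTTPRequestHandler
from urllib.parse import urlsplit


def extra_headers(request_path):
    """Return the extra response headers for a requested path (hypothetical helper)."""
    # Ignore any query string or fragment when checking the file extension.
    path = urlsplit(request_path).path
    if path.lower().endswith(".pdf"):
        return {"X-Robots-Tag": "noindex"}
    return {}


class NoIndexPDFHandler(SimpleHTTPRequestHandler):
    """Serves files from the current directory and marks PDF responses as noindex."""

    def end_headers(self):
        # Add the extra headers just before the header block is closed.
        for name, value in extra_headers(self.path).items():
            self.send_header(name, value)
        super().end_headers()


def serve(port=8000):
    """Run a local test server; the port is an arbitrary choice for this sketch."""
    HTTPServer(("127.0.0.1", port), NoIndexPDFHandler).serve_forever()
```

Calling serve() from the directory that holds your PDF files lets you check the response headers locally with any HTTP client before changing your real server.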
- What about ranking: can your PDF files rank highly in the search results?
The good news is that PDF files are ranked similarly to other webpages. So if your file has high-quality content and is embedded in or linked from other webpages, you have nothing to worry about!
- Would it be considered duplicate content if you have copies of your pages in both HTML and PDF?
Google recommends “serving a single copy of your content. If this isn’t possible, make sure you indicate your preferred version by, for example, including the preferred URL in your Sitemap or by specifying the canonical version in the HTML or in the HTTP headers of the PDF resource.”
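A PDF file has no HTML head section to hold a canonical tag, so the “HTTP headers” option means sending a Link header with rel="canonical" alongside the PDF response. The sketch below shows the shape of that header. The function name canonical_link_header and the example.com URL are assumptions made for illustration.

```python
def canonical_link_header(preferred_url):
    """Build a Link header naming the preferred (canonical) URL (hypothetical helper)."""
    return ("Link", '<{}>; rel="canonical"'.format(preferred_url))


# Example: the PDF response points to the HTML page as the preferred version.
name, value = canonical_link_header("https://www.example.com/white-paper.html")
print("{}: {}".format(name, value))
# prints: Link: <https://www.example.com/white-paper.html>; rel="canonical"
```

The same header could instead name the PDF's own URL if the PDF is the version you prefer to see in search results.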
- How do you influence the title shown in search results for the PDF document?
Google uses two main elements to determine the title shown:
1. The title metadata within the file.
2. The anchor text of links pointing to the PDF file.
To give Google's algorithms a strong signal about the proper title to use, it is recommended that you update both.