SEO for PDFs

As my partner in crime Travis recently pointed out, misconceptions abound in the SEO industry. Here’s another misconception: “PDF pages are so SEO-unfriendly that you can’t rank for any halfway competitive keywords with them.”

Some SEOs are still so set against Portable Document Format pages that they don’t think PDFs should be landing pages at all. These SEOs recommend replacing all PDFs with HTML pages or building additional HTML landing pages targeting the same keywords as the PDFs.

The truth is: the biggest reason PDF pages often rank so horribly is that they are rarely properly optimized.

Don’t get me wrong. In an overall SEO showdown, I’d still pick HTML over PDFs any day of the week, and you’re not likely to catch me creating brand new web content for my clients in Adobe Acrobat. The real reason HTML is SEO-superior in 2013 is the user experience. Most people are more comfortable with HTML and experience less freezing and slow loading with it. It’s easier to incorporate interactivity and social functionality into HTML pages. People also link to HTML pages and share them more frequently than PDFs (this is big).

Why Use PDFs then?

Don’t get me wrong twice: there are still reasons to keep PDFs as SEO landing pages. Below are a few common use cases:

When you already have many PDF pages on your site that people consider valuable. Before replacing PDFs, be sure to check whether they have backlinks, decent engagement metrics, and good traffic.

When you have really sexy PDFs that would be difficult to turn into equivalently sexy and user-friendly HTML pages.

When you have content that is meant to be printed or downloaded, like spec sheets, MSDSs, product manuals, brochures, forms meant to be printed and filled by hand, etc…

When the cost-benefit ratio just isn’t in favor of replacing PDFs. This might be the case if you only have a few PDFs and you don’t want to spend the upfront time or money converting the pages into HTML and redirecting the URLs. (That said, a good PDF-to-HTML converter may be worth the investment if you’ve got a lot of un-uploaded PDFs lying around.)

The Best Practices in SEO for PDF Files

The claim that search engines can’t digest PDF content was true years ago, but the search engines have come a long way, baby. So if you have reason to stick with your PDFs, just follow the simple tips below. I’ve listed the important stuff first.

Always use text-based PDFs

Search engines understand text waaay better than images (though the engines do have rudimentary optical character recognition capabilities), so make sure the words in your PDF are basic copy-and-paste-able text, not pictures of words. Most of the big PDF creators, like those in Adobe Creative Suite, have your back here. If you happen to have a scanned document you want to turn into a solid SEO landing page, you’ll need to use a little OCR yourself and convert the document into text.

Set your title in the document properties

This is such a common and easy-to-fix error that it drives me crazy. It’s common knowledge that the title tag is a huge ranking factor. A PDF’s equivalent of the title tag is the title set in the document properties. Almost all PDF creators support this, including Adobe applications such as InDesign. As usual, you want to use keywords smartly and optimize your title just as you would a title tag.

Set an SEO-friendly URL/filename

Typically, the PDF filename will become part of the URL, so give your document a good keyword-rich filename. Search engines often fall back on the filename or URL for the title in the results when the title is not set. Also, some document creators will default the title to the filename. So please set a descriptive title and filename. I’m sick of search results whose title is nothing but a raw filename.
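One way to keep filenames descriptive and consistent is to generate them from the document title. The snippet below is a minimal sketch, not a rule from any search engine: the helper name seo_filename, the eight-word cap, the hyphen separator, and the "document" fallback are all my own assumptions for illustration.

```python
import re
import unicodedata


def seo_filename(title: str, extension: str = "pdf", max_words: int = 8) -> str:
    """Turn a human-readable title into a lowercase, hyphen-separated filename.

    Hypothetical helper: the word cap and the fallback name are arbitrary choices.
    """
    # Strip accents so "Café" becomes "Cafe" instead of losing the letter.
    ascii_title = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Keep only runs of letters and digits; everything else separates words.
    words = re.findall(r"[a-z0-9]+", ascii_title.lower())
    slug = "-".join(words[:max_words])
    # Fall back to a generic name if the title had no usable characters.
    return f"{slug}.{extension}" if slug else f"document.{extension}"


print(seo_filename("Acme X200 Pump: Spec Sheet (2013)"))
# prints: acme-x200-pump-spec-sheet-2013.pdf
```

Generating the name from the title also makes it less likely that a re-export ends up under a slightly different filename, and therefore a different URL.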

Do good SEO

What do keyword-rich title tags and descriptive URLs have in common, besides being PDF SEO best practices? They are also standard SEO best practices. Follow the rest of your usual SEO best practices to optimize your PDFs as well. This includes:

internal linking to the PDF page to give it some link juice and authority (I see high-potential PDFs unnecessarily buried too deep in many websites). Speaking of internal linking and common pitfalls, please link from your PDF page to your other pages when relevant. It helps your SEO efforts and the user experience, and it isn’t done enough (seriously, I cringe when I have to copy and paste a URL from a PDF into the browser).

Keep the file size light

Huge files load more slowly, which hurts both the user experience and the search engines’ crawl. Adobe has the “PDF Optimizer” function, which will allow you to reduce file size, and you’ll want to use it for heavy PDFs. Learn the nitty-gritty on reducing PDF file size here.

Avoid duplicate content

Having both HTML and PDF versions of the same content can sometimes be a wise choice, but only if you take measures to prevent the duplicate content issue. Also, if you tweak a PDF and re-upload it, don’t create a duplicate by accidentally changing the filename, and with it the URL.

Set the other document properties too

Hey, while you’re in there (setting the title)… you might as well complete the other properties such as Author, Subject, and Keywords. I couldn’t honestly tell you I know how much impact this will have, but I keep reading on the Internets that it’s worth it. So fill out all the properties you can; I just wouldn’t spend all day on it. Some sources say the Subject will become the Meta Description (but I have yet to verify this myself).

Touchup the Reading Order

“Touchup” the Reading Order, and set alternate text as well as headings. The headings are said to be handled by the search engines much as header tags are handled in plain HTML.

Don’t save as the latest Acrobat version

Many readers might not have the latest Reader version (and no one wants to download it just for your stupid page). Search engines sometimes fall behind the times too, so save your PDF in an older version.

Write-protect your document

If you don’t write-protect your document, then someone can change it however they want (including editing out your links) and upload the whole file to their site.

Reid Bandremer is a Senior Search Project Manager. His background before joining LunaMetrics in 2011 includes eCommerce marketing experience and a pair of business degrees. He is a rabid fan of data, music, and holistic ROI-driven search marketing strategy. Other strengths include SEO metrics, migrations, and searcher segmentation (keyword research 2.0). Contrary to popular theory, Reid is not homeless – he just often stays at the office late because he is obsessed with maximizing the value of clients’ search traffic.

34 Responses to “SEO for PDFs”

Good article – something I actually needed this morning so it was nice that it was conveniently right on the front page of Inbound.

Re: duplicate content. Can you set the canonical for the HTML to be the PDF address? I don’t even know if that will work? rel canonical=”yoursite.com/this.pdf”? Would Google acknowledge that? Just something I don’t think I’ve ever tried.

Great question, Matt, and one I don’t know the answer to for sure. In my research I’ve yet to find any reason to believe you can’t point the rel="canonical" to the PDF, but I have yet to confirm you can do it, either. I can only say for sure that you can do it the other way around: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=139066#5. Sorry I can’t give you a more definitive answer; please let me know if you find one.

Good one, Reid. With rel="canonical", one can only mark the PDF as the canonical (preferred) version from the HTML page, not the HTML page as the preferred version from the PDF, as I don’t think any PDF editor offers an option to set a canonical version (the HTML page).

I think the best solution to avoid duplication would be just to link from the HTML page to the PDF, saying “download the PDF version of this content here” or something similar that users can understand. Search engines are so intelligent these days that they can understand almost everything users can.

You’re right. You can not track visits to a PDF landing page in Google Analytics to my knowledge. Thanks for pointing that out. In fact, it may indeed be the biggest challenge to getting more SEO love to oft-neglected PDF pages. One thing you can do (which is no substitute for full GA data) is if your pdf links to a conversion-page (for example, your PDF whitepaper has a call-to-action linking to a request-a-trial page), you can campaign tag the URL in the link and track the qualified traffic sent from the PDF.
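As a rough illustration of the campaign-tagging idea, here is a small sketch that appends Google Analytics campaign parameters to the URL you would place in the PDF’s call-to-action link. The function name tag_pdf_link, the example.com address, and the source, medium, and campaign values are all hypothetical; pick whatever naming scheme fits your own reporting.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def tag_pdf_link(url: str, source: str, campaign: str, medium: str = "pdf") -> str:
    """Append utm_* campaign parameters to a URL, keeping any existing query."""
    parts = urlsplit(url)
    # Preserve query parameters the landing page already relies on.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query += [
        ("utm_source", source),
        ("utm_medium", medium),
        ("utm_campaign", campaign),
    ]
    # Rebuild the URL with the extended query string.
    return urlunsplit(parts._replace(query=urlencode(query)))


print(tag_pdf_link(
    "https://www.example.com/request-a-trial?plan=pro",
    source="whitepaper-pdf",
    campaign="seo-for-pdfs",
))
# prints: https://www.example.com/request-a-trial?plan=pro&utm_source=whitepaper-pdf&utm_medium=pdf&utm_campaign=seo-for-pdfs
```

Visits that arrive through the tagged link should then show up under that campaign in your reports, which gives you at least a partial view of the traffic the PDF sends on.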

Even if you cannot track the number of views of the PDF, you can add tracking code to count clicks on the button or link on your site that leads to the PDF. But the tracking problem is not fully resolved, as you can’t know how many people reach it from other sites (via external links).

Question: we sell a PDF product which often ends up on customer web sites. I’m thinking we could use these to generate backlinks, but would we potentially suffer from over-optimisation penalties unless we varied the backlink anchor text? What do you think?

Seems like the backlink here would be treated as if it were from an infographic or widget. Some could disagree with me, but I think it is possible to trigger a penalty, especially if: 1) it’s the same piece of content (rather than different kinds of products), 2) it’s found on many low-authority, untrusted, or dodgy sites, 3) the anchor text is the same for each link, and 4) the anchor text is keyword-rich (rather than generic or branded anchor text). Also, Google may simply put less value into those links than it would other types. I think the more of those factors you can reduce, the more value and lower risk you’d have with this backlink tactic. But if you used branded (like “buyaplan.com”) or generic anchor text (like “here”), then there’s less need for varying anchor text.

Great article. Lots of helpful tips. I would like more control over what Google selects for the snippet on SERPs. It appears that Google will scrape the site and produce text from the PDF that matches the terms that were searched. But I sure wish I could get Google to use what I have composed in the Description tag in Doc Properties. Any suggestions? Or any other way(s) to control this?

Patricia, you’ve kind of stumped me. I’ve read that the Subject can serve as the Meta description (i.e., the description in the SERP snippet), but I have no proof. Also, text near the top of the document is more likely to be used as the description, so if it would be appropriate to lead with an abstract or a detailed subheading that summarizes the doc, maybe you can put that in, and there’d be a decent chance it becomes the SERP description.

I have read that the Subject in the Document Properties can serve as the SERP snippet, too, but have not actually seen that happen in practice (albeit on a really small testing group of PDFs). I know that I have spent time adding tasty keyword descriptions to the Subject field in Document Properties, but so far I have not seen them show up in SERPs. (But like I said, I haven’t tested a large group, and it is also possible that it could take several weeks for the bots to catch onto our newly tagged PDFs.)

I did look at some old, previously tagged PDFs that had the Subject property filled out, though, and I noticed that it was not picking up the Subject field as the snippet, either.

One article (http://www.jm-seo.org/seo-tutorial/adobe-pdf-seo.html) indicated that it was actually the Keywords in the Document Properties that would then show up in the description snippet in SERPs(!). The author of that article said that Keywords in Doc Properties was equivalent to the Meta Description field in HTML. Is this true? I tried to test for this by looking at old PDFs that had both the Subject and Keywords fields filled in under Document Properties, but neither one was used for the snippet! Has anyone else tested or experienced this?

What I have seen is that, in a Google search, it seems to scrape the whole PDF looking to grab a snippet of text that is close to the terms that were searched, even if those terms were in a footnote or at the bottom of the page. It seems logical that it would look for the best match at the top of the page, like you said, Reid, but I have not been able to confirm it.

Thoughts? I’m just trying to get an attractive description/snippet in SERPs for our PDFs, so that it helps the user decide whether it is worth the time to open up the PDF, and helps our click-through rates (CTR)!

As for the article, I’m skeptical that the keywords in the doc properties would serve as the description (due to the age of the article, because it seems arbitrary, and because the author has no proof). It couldn’t hurt to test, though. I’d look forward to hearing anyone’s results on that experiment.

(Also, I’d stay away from the author’s recommendation to create an HTML landing page for the PDF; that’s not going to help in 2013.)

Descriptions are always tricky – even for html pages, you can’t force Google to adopt the Meta description 100% of the time. They will utilize Meta descriptions when they feel it is relevant. From what I’ve been able to tell, words at the top of the page are much more likely to be used as descriptions than words at the bottom. But the words must contain words the user is searching for.

Thanks Reid for a great article. Good to know that I’m doing the right thing with the PDFs (and Word docs) on our website!

It is a WCAG 2.0 AA requirement that we not only make PDFs accessible (tagging, set language, etc) but also provide any downloadable content in PDF (i.e. all PDF content) in an alternative, downloadable format such as Word or plain text (preferably Word, due to semantic structure etc).

So, my question is would the two versions of the same content trigger a penalty? Even if the Word document includes document properties and is optimised for accessibility and SEO?

The short answer is “yeah, you’re going to want to set a canonical URL in the HTTP header.”

Long answer:
This would likely inhibit your SEO a bit, but it wouldn’t technically “trigger a penalty”. Technically, I only call something a penalty when Google inhibits the ranking power of your site due to a violation of its quality guidelines, usually resulting in a sudden drop in traffic.

But having two versions of a page instead of one will constitute non-malicious duplicate content and carry the associated issues.
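For illustration, the canonical HTTP header mentioned in the short answer could be built along these lines. This is only a sketch: the helper name canonical_link_header and the example.com URL are hypothetical, and how you actually attach the header to the response for the Word file depends on your web server or CMS.

```python
def canonical_link_header(canonical_url: str) -> str:
    """Build a rel="canonical" Link header line pointing at the preferred URL."""
    # A canonical URL should be absolute so crawlers cannot misresolve it.
    if not canonical_url.startswith(("http://", "https://")):
        raise ValueError("canonical_url must be an absolute http(s) URL")
    return f'Link: <{canonical_url}>; rel="canonical"'


# Hypothetical case: sent with the Word version, naming the PDF as preferred.
print(canonical_link_header("https://www.example.com/downloads/white-paper.pdf"))
# prints: Link: <https://www.example.com/downloads/white-paper.pdf>; rel="canonical"
```

Because a PDF or Word file has no HTML head to carry a link element, the response header is the place where this hint can live.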

Hi, I have an existing PDF document that was scanned from a magazine. I can select the text (copy and paste), so it was scanned as text-based. But which program can I use to add the title, author, subject, and keywords?

Great article, thanks! It’s very reassuring to see that all my hard work and time spent optimising my PDFs is worth it. I do everything on the list except write-protect the document (and I often forget hyperlinks, so I cringed when you said that you hate having to copy and paste a link from a PDF; I promise to do better).
The Touch-Up Reading Order feature in Adobe Acrobat could be a lot easier to use; I find that you have to save your work regularly in case Acrobat makes a mess of it. I sell hundreds of products and have decent PDF specifications for almost all of them. I know that PDF optimisation works beautifully because sometimes, when I am in a hurry, I will spend ages optimising the PDF, upload it, and then forget to create an actual webpage for the product. I still get calls asking how much the item is, and only then do I realise I forgot to create the page! Another way to check is to Google specific phrases that you entered only in the PDF; you will see them listed in the Google results.

I don’t know what “rel canonical=” means, but I will look it up.
My website ranks No. 1 in Google for just about every product I do, so I’m obviously doing something right.

PDF links pass link equity. Unfortunately, I do not know of any way to make the links nofollow. I’m actually hard-pressed to come up with a good use-case where you’d really want to make the links nofollow, unless they’re linking to pages you do not want crawled. If that is the case(?), there may be other things you can do.

It technically is duplicate content, but it should only be an issue for you if the content on those sites comes up in the search results instead of yours. If your site is authoritative, and so is the page you host the PDF on, this shouldn’t be a problem. It is also less likely to be a problem if the PDF has been up on your site for a while before you submit it to those other sites.

I enjoyed your article.
I was wondering if you had any ideas on how best to utilize the rel="author" tag with PDF files. The normal method is to link to the author’s Google+ profile, but that obviously only works with regular webpages.
Any help is appreciated.
-AW