Using rel=canonical On A PDF Document

I have a website with a few dozen pages, but it also has hundreds of .pdf documents. These .pdf documents are downloaded and printed by a lot of people. Each pdf contains an image file that has been formatted to print at a reliable scale on anyone's printer.

The PDF files also have a small amount of text and a clickable link to the homepage of my website. They are also optimized to display an SEO-friendly title in the SERPs, and they rank well. Lots of people have linked to these PDFs.

I think that these PDFs are causing a duplicate content problem or a thin content problem. To solve that I need to use rel=canonical in a way that attributes them back to the HTML page that visitors use to download them. Unfortunately there is no way to place a rel=canonical in a .pdf document (or I don't know of any way to do it). So I am going to fix this via .htaccess, following instructions from Search Engine People: http://www.searcheng...ccess-file.html

Indicate the canonical version of a URL by responding with the Link rel="canonical" HTTP header. Adding rel="canonical" to the head section of a page is useful for HTML content, but it can't be used for PDFs and other file types indexed by Google Web Search. In these cases you can indicate a canonical URL by responding with the Link rel="canonical" HTTP header, like this (note that to use this option, you'll need to be able to configure your server):

Link: <http://www.example.c...ite-paper.pdf>; rel="canonical"
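For what it's worth, a minimal .htaccess sketch of that header approach, assuming Apache with mod_headers enabled (the file name and the target page are placeholders, not your actual URLs):

<Files "white-paper.pdf">
# send a Link: rel="canonical" header pointing this PDF back to the HTML page it is downloaded from
Header set Link "<http://www.example.com/white-paper.html>; rel=\"canonical\""
</Files>

You would need one such block per PDF (or some way of generating them), each pointing at the HTML page that particular PDF belongs to.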

You could try iFraming the PDF files. Move them to a new directory as Jon suggests. Block the crawlers from that directory. Then create iFraming pages at the old *.pdf URLs which allow you to implement the SEO meta directives you want to use. The iFraming pages would simply link to or load the .PDFs into their content.

I have seen this done on a few Websites. I'm not entirely clear on how they do it (possibly with AJAX). It has never occurred to me before to wonder how it's done.

I dislike and refuse to use rel=canonical (but then, iamlost) so I can't offer suggestions on how best to utilise it in your specific situation.

Given your usage of the PDFs, I would have made them into HTML pages with the PDFs as a download/print option blocked via robots.txt (and .htaccess). As things stand, I would recommend:

<meta name="robots" content="noindex, follow">

This means that the SEs crawl the PDFs as normal and all value flows as usual through the various links and citations, but the PDFs themselves are removed from the public index, i.e. they will not show up in search query results.

Note: technically the 'follow' is the default and so not necessary but I believe in redundancy.
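For the robots.txt blocking mentioned above (i.e. if you did move the PDFs into their own directory and wanted crawlers kept out of it entirely), a minimal sketch would be (the directory name is only an example):

User-agent: *
Disallow: /blocked-directory/

Note that a robots.txt block is a different mechanism from the noindex meta tag: a blocked URL is not crawled at all, so the SEs never see any directives or links inside it.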

Of course the optimal solution would be to determine how/why they are causing a duplication issue and correct it. In all your spare time.

You know, I was thinking similar. If all pdfs were pages, then you could slap ads on them and still link to the pdf as Iamlost suggests. If each pdf also linked to the html page and it was in a follow/noindex directory, then the pages should get the pagerank?

If I move these pdfs to a new folder and block them from the crawlers then their value as linkbait evaporates.

Yes? No?

The pages that would frame the PDF files would take up the old URLs.

In other words, if you have a PDF at:

www.example.com/my-cool.pdf

You would move that to:

www.example.com/blocked-directory/my-cool.pdf

and put up a normal HTML page at

www.example.com/my-cool.pdf

in which you use an iFrame to link to

www.example.com/blocked-directory/my-cool.pdf

Hence, all the links still pointing to (www.example.com/my-cool.pdf) would still lead people to your PDFs. You're just wrapping them in HTML envelopes that allow you to control crawler access to the PDF files and set some robots meta directives.
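A minimal sketch of what such a wrapper page could contain, assuming the server can be configured to return HTML at the old .pdf URL (the title, description, and paths are only placeholders):

<html>
<head>
<!-- a normal, indexable head: this is where the SEO title, description and any meta directives go -->
<title>My Cool Printable</title>
<meta name="description" content="Whatever description you want the SERPs to show.">
</head>
<body>
<!-- link to and/or embed the PDF that now lives in the crawler-blocked directory -->
<p><a href="/blocked-directory/my-cool.pdf">Download or print the PDF</a></p>
<iframe src="/blocked-directory/my-cool.pdf" width="100%" height="800"></iframe>
</body>
</html>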

Of course the optimal solution would be to determine how/why they are causing a duplication issue and correct it. In all your spare time.

In my spare time! OK!

Each of my html pages links to several pdfs... so if I can rel=canonical several pdfs to a single html page then that single html page should be kickass in the SERPs. They already are kickass but this would make it even better.

If all pdfs were pages, then you could slap ads on them and still link to the pdf as Iamlost suggests. If each pdf also linked to the html page and it was in a follow/noindex directory, then the pages should get the pagerank?

PDFs do accumulate PageRank and pass PageRank through any links that are embedded within them.

Also, you can place ads in .pdfs. Adsense does not work but you can sell ads to others... or place ads in them that link to your own product pages.

And, most shopping carts can be triggered from "buy button" links placed in pdf documents. Most people just never think to try this.

.....and put up a normal HTML page at

www.example.com/my-cool.pdf

This is a really interesting idea.... a little sneaky... but I am going to think about it. Thanks.

I hope it doesn't seem sneaky. It's just a way to wrap an object in HTML. People have been doing that for Flash, so why not for PDFs?

ON EDIT: I suppose you could also set up alternative URLs for the framing pages and just implement 301-redirects from the old PDF URLs to the new framing pages, but that is not very efficient in my opinion.
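For reference, that alternative would amount to a line like this in .htaccess for each file (the URLs are placeholders):

# permanently redirect the old PDF URL to the new framing page
Redirect 301 /my-cool.pdf http://www.example.com/my-cool-frame.html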

Well, what would you do ideally?
Ideally, you'd have visitors use your HTML pages and link to them.

So, it makes sense to encourage them to print whatever they want from the HTML page, so they could link to it later, if necessary. Does it mean that you'd have to create HTML pages for the current PDF links or use 301 redirects? Not sure entirely.

As for PDFs, ideally, you'd:
- link to and from them to HTML pages
- noindex PDFs
- use the HTML pages as link magnets.

Since your visitors will still be linking to PDFs, it doesn't make much sense to change URLs permanently with 301 redirects instead of using the rel="canonical" header (if rel passes PR as efficiently as a 301 does). Otherwise, you'd have to repeat this with the new PDF URLs. Also, redirecting would make for a slightly worse visitor experience for those who have bookmarked the files.

So, to keep the links, you'd rather use rel=canonical than 301 redirects.

I agree that the pdfs should either be noindexed or have substantive content added.

I don't want to add substantive content because that would make them two pages long and they are currently formatted to print on a single page - and I don't want people who print them to get the page they want plus a big page of text that they don't need.

I have one question... if I noindex the pdfs will PR still pass through them?

I have one question... if I noindex the pdfs will PR still pass through them?

So long as you DO NOT include nofollow, as in

<meta name="robots" content="noindex, nofollow">

NOTE: DO NOT DO THE ABOVE UNLESS YOU REALLY UNDERSTAND THE CONSEQUENCES.

As I mentioned previously:

<meta name="robots" content="noindex, follow">

could simply be

<meta name="robots" content="noindex">

because the 'follow' is an implied default. I prefer to include it because I hate to rely on others to always correctly apply implied defaults.

One caution that I would like to throw out as a sea anchor: how any SE applies anything is subject to change, often without notice. Therefore I recommend that if you decide to proceed with noindexing the PDFs, you only do so on a limited number, e.g. 10%, and wait a month to see what change, if any, occurs. If all goes as expected (and I have a zillion such tags on pages without a problem) then phase in the remainder.

Note: the reason that I recommend phasing in changes is that SEs, especially G, have been known to get jittery with massive wholesale changes.
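One wrinkle: just as with rel=canonical, a meta robots tag can't be placed inside a PDF file itself, so for the PDFs the noindex would have to be sent as an HTTP header instead. A minimal .htaccess sketch, assuming Apache with mod_headers (this pattern catches every PDF; for a phased rollout you would narrow it to a subset):

<FilesMatch "\.pdf$">
# HTTP-header equivalent of the noindex, follow meta tag
Header set X-Robots-Tag "noindex, follow"
</FilesMatch>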

Matt Cutts: A NoIndex page can accumulate PageRank, because the links are still followed outwards from a NoIndex page.

Eric Enge: So, it can accumulate and pass PageRank.

Matt Cutts: Right, and it will still accumulate PageRank, but it won't be showing in our Index. So, I wouldn't make a NoIndex page that itself is a dead end. You can make a NoIndex page that has links to lots of other pages.

For example you might want to have a master Sitemap page and for whatever reason NoIndex that, but then have links to all your sub Sitemaps.

Eric Enge: Another example is if you have a page on a site where, from a user point of view, you recognize that it's valuable to have the page, but you feel that it's too duplicative of content on another page on the site.

That page might still get links, but you don't want it in the Index and you want the crawler to follow the paths into the rest of the site.

Matt Cutts: That's right. Another good example is, maybe you have a login page, and everybody ends up linking to that login page. That provides very little content value, so you could NoIndex that page, but then the outgoing links would still have PageRank.

NOTE: DO NOT DO THE ABOVE UNLESS YOU REALLY UNDERSTAND THE CONSEQUENCES.

I know!

One of my competitors had their site redesigned and the designer tossed it up with noindex. They disappeared the next day and it took them a couple of weeks to figure out what was wrong.

Also.... I accidentally added that to an article on my site. After I removed it, Google did not like the article and ignored it for a couple of months before indexing it again - and this is a site that gets TONS of spidering.

I don't want to add substantive content because that would make them two pages long and they are currently formatted to print on a single page - and I don't want people who print them to get the page they want plus a big page of text that they don't need.

Actually, if you have visitors print from HTML pages instead of PDF files, you'd have to make dedicated CSS for the print media, which would also give you complete control over your visitors' printing experience.

In this case, you would:
- preserve the HTML content instead of PDF
- be able to place additional content in the HTML files
- allow your visitors to get the printing experience they want, but you..
- have to create the best quality printing experience via CSS and your HTML page (hint: you can hide some content from printers and it shouldn't be punishable as cloaking, but correct me if I'm wrong about this) - see the sketch below.
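A minimal sketch of that print CSS idea (the class names and the physical width are only placeholders; the point is that print media rules let you hide screen-only sections and fix the printed scale):

@media print {
  /* hide navigation, ads and anything else that shouldn't reach the printer */
  .no-print { display: none; }
  /* give the printable image a fixed physical width so it prints at a reliable scale */
  .printable-image { width: 18cm; }
}

The same rules could also live in a separate stylesheet linked with media="print".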

I should point out here that Matt C (for what it's worth) has said specifically that Google does not like canonical relations between pages that are not at all similar. I think it was my bud Jon who linked to that interview (+ transcript).