Ben D’Angelo is up first. He’s been with Google a little more than three years. I think that means he went to Google straight out of grade school.

What are duplicate content issues? There are actually several distinct problems.

Duplicate content within your site or sites:

Multiple URLs point to the same page or similar pages

Different countries (same language)

Duplicate content across other sites:

Syndicated content

Scraped content

The guiding principle behind the search engines’ indexing is ONE URL for one piece of content. Why? Because users don’t like duplicates in results. It saves resources in Google’s index, leaving more room for other pages from your site. And it saves resources on their server. [So Ben is telling us to keep duplicate content low to save Google money? Man, that stock price must really be suffering.]

Sources of duplicate content:

Multiple URLs pointing to the same page

www vs. non-www

Session IDs, URL parameters

Printable versions of pages

CNAMEs

Similar content on different pages

Manufacturer’s databases

Different countries

How does Google handle this? They cluster like content and pick the best representative. There are variations on this depending on where it is in the pipeline. Different filters are used for different types of duplicate content. In general, it’s just a filter and it’s not going to destroy your site.

The problem comes in when Google doesn’t choose the page you want or makes a mistake in clustering. You need to take back control.
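The clustering idea above can be sketched in a few lines. This is a toy illustration only — real engines use fuzzy techniques like shingling or simhash rather than exact hashes, and the "shortest URL wins" rule is a stand-in heuristic, not how Google actually picks a representative:

```python
import hashlib

def cluster_duplicates(pages):
    """Group URLs whose whitespace/case-normalized content is identical,
    then pick one representative URL per cluster (shortest, as a toy heuristic)."""
    clusters = {}
    for url, content in pages.items():
        # Normalize content before hashing so trivial variations cluster together
        key = hashlib.md5(" ".join(content.lower().split()).encode()).hexdigest()
        clusters.setdefault(key, []).append(url)
    return {min(urls, key=len): urls for urls in clusters.values()}
```

If the representative the engine picks isn't the URL you want, that's when you step in with redirects and blocking, as described below.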

Use 301 redirects for exact duplicates, like tracking URLs, and to solve the www vs. non-www issue. You can also address exact duplicates in Google Webmaster Tools, but that only solves the problem for Google. He demos briefly.
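A minimal sketch of the www fix, assuming an Apache server with mod_rewrite enabled and a hypothetical example.com (adapt the host names to your own site):

```apache
# .htaccess — 301 the non-www host to the www host so one version gets all the credit
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

Unlike the Webmaster Tools setting, a server-side 301 fixes the problem for every search engine and for inbound links at once.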

For near duplicates, use a noindex meta tag or block them with robots.txt. Things like printer-friendly pages and site clones should get this treatment.
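Both options in one place — the /print/ path here is hypothetical; substitute whatever pattern your printable versions actually live under. Either block the crawler entirely:

```
# robots.txt — keep crawlers out of printer-friendly duplicates
User-agent: *
Disallow: /print/
```

or, if you want the page crawled but not indexed, put `<meta name="robots" content="noindex,follow">` in the duplicate page's head instead. Pick one approach per page — a robots.txt block prevents engines from ever seeing the noindex tag.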

Domains by country are a little different. Different languages are not duplicate content. Same language, different country? Don’t worry about it — the right one will usually be okay. You can geo-target in GWT or use different TLDs to help Google recognize where the content belongs. Best of all is creating unique content for that country.

Leave out URL parameters if you can. Put that data into a cookie instead.
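Where you can't drop a parameter entirely, you can at least strip the content-neutral ones before URLs are emitted or logged. A sketch in Python — the parameter names in CONTENT_NEUTRAL are hypothetical examples; use whatever your site actually appends:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Parameters that don't change what the page displays (illustrative list)
CONTENT_NEUTRAL = {"sessionid", "sid", "utm_source", "utm_medium"}

def canonical_url(url):
    """Drop content-neutral query parameters so one page has one URL."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k.lower() not in CONTENT_NEUTRAL]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), ""))
```

Session state that used to ride in `sessionid` then lives in a cookie instead, which crawlers never see.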

In Webmaster Tools you can check for all sorts of other problems too, like duplicate Title and Meta data. Fix those things.

If another site has content that duplicates yours, there’s less that you can do.

Duplicate content from syndication should include a link back to your site to make the canonical origin clear. Another option is to syndicate different content than what you publish on your site. If you’re publishing content you have syndicated, manage your expectations.

Priyank Garg is next up. He’s got a sore throat so he’ll be brief. His voice is all scratchy. Aw.

Much of this will be similar to Ben’s presentation — I’ll pull out the Yahoo-specific stuff. Like Google, Yahoo filters at several places in the pipeline. Session IDs and other “content neutral” parameters can really hurt your crawl queue. They might never get to the rest of your content because they’re crawling the same page over and over with a session ID. “Soft” 404 pages can also cause duplicate content problems. Repeated elements (perhaps with just a keyword replace) lead to problems.

Robots-nocontent can be used for syndicated content that may be useful to the user in context but not for search engines.
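In markup, that's a class attribute on the wrapping element — this is Yahoo's documented robots-nocontent mechanism, and other engines simply ignore it (the div's contents here are just placeholder text):

```html
<!-- Visible to users; Yahoo's indexer skips what's inside this element -->
<div class="robots-nocontent">
  Syndicated excerpt shown here for the reader's context.
</div>
```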

You can do dynamic URL rewriting in Site Explorer. Tell them which parameters are content neutral for your sites:

Ability to indicate parameters to remove from your site’s URLs

More efficient crawl with fewer duplicates

Better site coverage as fewer resources are wasted on duplicates

Fewer risks of crawler traps

Cleaner URL, easier for user to read and more likely to be clicked

Better ranking due to reduced link juice fragmentation — it’s equivalent to 301ing all the duplicates back to one URL, and it saves crawl time because they don’t have to fetch each duplicate

Derrick Wheeler is up. Here’s a bit of vintage Derrick for you all: “This crowd is a perfect Web site. You’re all unique. I would crawl, index and rank all of you.” Rand interjects “That’s dirty.” Derrick: “But I wouldn’t click or take action.” Hee.

Look for spider traps — for example, a page that appends another parameter and generates a new URL every time you navigate back and forth, creating endless "new" pages.

Make sure that when you’re creating sites for users, you still avoid spider traps. Just because you don’t think the search engines need to index those pages doesn’t mean the trap is harmless — crawlers can get stuck there and never reach the pages you do want indexed.

Document why you’re doing things. One site removed session IDs for search engines and got 10 million pages indexed. Down the line, someone forgot why it had been done, started serving session IDs to the engines again, and their indexed page count plummeted.

Look for things that might be causing problems, like dynamic breadcrumbs, based on how someone clicked through the site (Brookstone does this), related products, etc. They might be helpful for users but you’re probably going to get into trouble. Make your internal linking consistent and useful. Some products might be able to live in multiple categories, but you need to make a decision.

Anytime you see related, sort or compare, think “possible duplicate content”. When you see “select region” or “sign in”, think duplicate content. Disallow those pages in your robots.txt. “Email an article”, “send to a friend” — think duplicate content.
Once you screw up the parameter order, it’s hard to fix. Keep it consistent.
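One way to keep parameter order consistent is to sort parameters into a canonical order wherever URLs are generated. A sketch of the idea, assuming you control URL generation:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize_param_order(url):
    """Sort query parameters so the same page always gets the same URL,
    regardless of the order a template or user arrived at."""
    parts = urlsplit(url)
    params = sorted(parse_qsl(parts.query))
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(params), parts.fragment))
```

With this, `?size=9&color=red` and `?color=red&size=9` collapse to a single URL instead of splitting your links across two.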

Use absolute links, not relative links, especially when switching between http:// and https://. Other people could link to you with https:// as well and you can’t really do anything about that.
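The difference in markup, with a hypothetical example.com (a relative href inherits whatever scheme and host the visitor happened to arrive on, while an absolute href pins the canonical version):

```html
<!-- Relative: served over https://, this link keeps propagating https:// URLs -->
<a href="/products/widget">Widget</a>

<!-- Absolute: always points at the one canonical scheme and host -->
<a href="http://www.example.com/products/widget">Widget</a>
```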

Priyank suggests going after the low-hanging fruit. Try the dynamic URLs first so that you can see the benefit right away.

Brent Payne asks: How do you credit a story properly when you’re the Chicago Tribune? Can I get a link attribute or something? Just linking back doesn’t work. Google tells me it’s not a big deal but it is.

There’s not so much that the reps can say to that. They’re trying and he’s already doing the right thing. Poor Brent.

Derrick doesn’t think there is a solution right now. (He also reminded everyone that he’s an in-house SEM, not a search engine representative.)

How detrimental are different link IDs?

Priyank: Every different URL linking to the same content is duplicate content. That’s why you should use dynamic URL rewriting.

Ben: We try to handle that automatically. We might have to crawl the page once but we try to learn which parameters don’t affect the page content.

About the Author

Susan Esparza is former managing editor at Bruce Clay, Inc., and has written extensively for clients and internal publications. Along with Bruce Clay, she is co-author of Search Engine Optimization All-in-One Desk Reference For Dummies.
