SMX Notes – Duplicate Content Summit

It’s always tough to be the first to address an “advanced” audience. How advanced is advanced? Tough to say. Not everyone there is going to be a Cuttlet, a Mozzer or a Bruce Clay employee (you guys need a term. How about Bruisers?)

That being said, I think a lot of the people there already knew what duplicate content was and some of the presentations were a little low level. However, I’m sure that at least a few people there benefited greatly (if only the search engine reps getting feature requests).

Eytan Seidman, Lead Program Manager, Live Search, Microsoft
Takeaways:

Dupe content is bad because it:

Fragments anchor text

Creates different versions of the page you might want people to link to

Makes it hard (confusing) for others to link in

Can be difficult for search engines to determine which is the canonical URL

Tips:

Session parameters in URLs: keep them simple. Wherever possible, avoid feeding them to the engines (especially lots and lots of parameters).

Employ client-side versus server-side redirects. (He would later clarify this statement to include 301s as 'client side,' whereas server-side is like Wikipedia, where the client is not taken to a new URL/meta refresh.) Examples: JCPenney.com has the same content as JCPenny.com. Wikipedia does this a lot (look at the URLs for the pages on startrek vs. Star Trek).
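For the JCPenney example, the misspelled domain would typically be 301-redirected at the server. This is a hypothetical Apache sketch for illustration, not JCPenney's actual configuration:

```apache
# Hypothetical .htaccess: permanently redirect the misspelled domain
# to the canonical one with a 301
RewriteEngine On
RewriteCond %{HTTP_HOST} ^(www\.)?jcpenny\.com$ [NC]
RewriteRule ^(.*)$ http://www.jcpenney.com/$1 [R=301,L]
```

The 301 tells crawlers that the move is permanent, so link credit should consolidate on the canonical hostname.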

So, how do you avoid having people copy your content? (Speaking for myself, to prevent scraping, I have this gun… No, not really, I don’t own a gun.) Tell people not to take it and call out people who do. Verify user agents and block unknown IP addresses from crawling your site.
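The user-agent verification he mentions is usually a reverse-DNS-then-forward-DNS check. Here's a minimal Python sketch of the idea; the function names and domain list are mine, assuming Google's documented pattern of Googlebot resolving to googlebot.com/google.com hostnames:

```python
import socket

# Assumed allow-list: Googlebot IPs reverse-resolve to these domains
GOOGLEBOT_DOMAINS = (".googlebot.com", ".google.com")

def hostname_is_googlebot(hostname):
    # Pure string check: does the reverse-DNS name belong to Google?
    return hostname.rstrip(".").endswith(GOOGLEBOT_DOMAINS)

def verify_googlebot(ip):
    # Reverse-DNS the IP, check the domain, then forward-confirm that
    # the hostname resolves back to the same IP (defeats spoofed PTR records).
    try:
        host, _, _ = socket.gethostbyaddr(ip)
        return hostname_is_googlebot(host) and ip in socket.gethostbyname_ex(host)[2]
    except (socket.herror, socket.gaierror):
        return False
```

A crawler claiming to be Googlebot in its user-agent string but failing this check can safely be blocked.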

Not all dupe content is bad: proper attribution and links from other sites can add value and drive traffic to your site.

If you think you have duplicate content, make sure you're adding value, attribute the original source, and consider blocking local copies (Linux mirror pages, etc.) via robots.txt.
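The robots.txt suggestion looks something like this (the paths are hypothetical placeholders):

```
# Hypothetical robots.txt: keep crawlers out of duplicate local copies
User-agent: *
Disallow: /print/
Disallow: /mirrors/linux-docs/
```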

How Live Search handles duplicate content: We don’t have sitewide penalties, generally speaking. We use session parameter analysis at crawl time and keep them hidden from the crawler for the most part. We filter dupes at run time to prune the results returned to include only useful, unique results.

Peter Linsley, Senior Product Manager for Search, Ask.com
Dupe content is an issue for search engines because it impairs the user experience and consumes our resources. It’s an issue for webmasters because:

There’s a risk of missing votes (links)

There’s a risk of selecting the wrong candidate (the wrong URL)

Some cases are beyond our control

While concerns are valid, issues are rare.

How Ask handles dupe content: There’s no penalty; it’s similar to not being crawled. We pick one version to be the best one and keep that in our index. We only compare indexable content, not templates or HTML code. We filter them out when our confidence factor is high: we have low tolerance on false positives. The proper candidate is identified from numerous signals, much like ranking. Usually, it’s the most popular version of the site.

Can the four search engines agree on a variable we could use to indicate to spiders that they should strip all parameters after that parameter?
People don't know about parameters, and they won't know about these. Can all CMSes handle this? Would we do it through a meta tag, the sitemap, the webmaster console? Robots.txt? A wildcard?
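No such agreed-upon directive exists, but the effect being asked for, stripping known session/tracking parameters so that one canonical URL remains, can be sketched in Python. The parameter list here is an assumption for illustration; a real site would enumerate its own:

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Assumed set of parameters to strip (each site would list its own)
TRACKING_PARAMS = {"sessionid", "sid", "phpsessid", "utm_source", "utm_medium"}

def canonicalize(url):
    """Drop known session/tracking parameters, keeping the rest in order."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k.lower() not in TRACKING_PARAMS]
    return urlunparse(parts._replace(query=urlencode(kept)))
```

Two URLs that differ only in session ID then map to the same canonical string, which is exactly the consolidation the questioner wants.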

We nofollow links to the noncanonical versions of our page. Does that raise a red flag with search engines?
Vanessa: That’s not a good way to prevent dupes. Other people can link to the page. Robot them out instead.

WordPress as a CMS: Multiple author designations, tags, and tag/author combinations have resulted in posts appearing on lots of pages. Can search engines work with major blogging platforms to fix this?
Vanessa: Probably. They'd come up in SERPs, but we could work with CMSes to put the burden on their side.
Eytan: What's your goal? Do you want to designate which of those is the primary page or tag?
Vanessa: Mostly we can sort these things out.
Amit: In general, the linking structures lend themselves to the right things happening, noting where things are important.

Danny Sullivan characterized the three major types of duplicate content: scraping, syndication (including without permission) and site-specific (on your own site, for whatever reason, you have two versions of the same page). Vote: people were mostly concerned with site-specific, then syndication, then scraping. ("All of the above!" comes a shout from the back.)

Is there some way to provide a date-stamp, time-stamp or date of discovery? If our content is scraped or syndicated right away, often the scrape is discovered first. Do you use date of discovery, and how do you differentiate valid dates? How do you know who's first? (Focused mostly on blogs, but it applies to all pages.)
Eytan: It's a factor, but not a big part of it. We're looking for a lot of other signals to indicate which is the canonical content. There's some aspect of that on news, which has a time-based element.
Peter: It's gameable by scrapers.
Amit: If you want your content to show up earlier in the indexes, update your sitemap when you update your sites.
Danny: To me, best is first. To you, is it the most linked to?
Eytan: The best is the one that ranks highest in our run-time ranking (lots of factors). Over time, we won't look at time as the defining variable.
Danny: How many would be interested in a way to tell search engines that mine really is the canonical version?
Peter: That's still gameable; we can't solve it for mom & pops that way.
Danny: Then never solve it for anyone? Maybe they can learn. It just seems like there ought to be a way to tell search engines that "this is my doc, I just pinged you with it, I know you trust me."

You collaborated on the sitemaps standard. Is work continuing?
Danny: They walked over hand in hand. You missed it; it was beautiful.
Vanessa: Yes. We all support the ability to specify the location of your sitemap in the robots file, and we regularly discuss this.
Follow-up: Do you pull the files, or do you leave them on our servers?
Yahoo and Google pull the sitemap files. MSN is doing this on a fairly selective basis right now, but they're ramping it up. Ask is similar to MSN.
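Specifying the sitemap location in the robots file is a one-line directive; the URL here is a placeholder:

```
# robots.txt on a hypothetical site: point crawlers at the sitemap
Sitemap: http://www.example.com/sitemap.xml
```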

Amit Kumar—Poll: how many have sitemap files? All. How many specify sitemap in robots file? Much fewer. How many update sitemap weekly? Some.

Comment: Regarding tracking parameters: if we robot out parameterized URLs, we lose link popularity. I want to consolidate it, not throw those links away, since our pages are getting popular via incoming links that include session IDs.

Amit Kumar: How many have submitted sitemaps to Site Explorer? About 10-15.

How should we address duplicate content as we begin syndicating our video?
Vanessa: In video SERPs? (Exactly.)
Amit: Does the syndicated copy actually point back to you? (Yes and no; we also drive traffic back to our site from YouTube.)
Vanessa: Ask that they block it with robots.txt when making the syndication deal.

Can we use a digital signature to avoid scraping? There are no good reporting tools from the engines telling us which pages they consider dupes. Why not?
Danny: Maybe permission to have one hidden link; standardize the format to make sure that scrapers don't steal that one?
(Questioner) No, more like my special secret word that I set in the webmaster console.
Eytan: We've done this in email with moderate success. The challenge is adoption. How do we get people to broadly adopt it?
Vanessa: Of course, scrapers could steal content from sites that don't authenticate and authenticate it themselves.
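No engine offers this, but the "secret word" idea maps naturally onto a keyed hash: the publisher signs each page with a secret registered with the engine, and the engine can then verify that a given copy was produced by the key holder. A purely hypothetical sketch of the mechanism:

```python
import hashlib
import hmac

# Hypothetical shared secret, registered in a webmaster console
SECRET = b"secret-registered-in-webmaster-console"

def content_signature(body):
    """Keyed hash of the page body; only the secret holder can produce it."""
    return hmac.new(SECRET, body.encode("utf-8"), hashlib.sha256).hexdigest()

def signature_is_valid(body, signature):
    """Constant-time comparison against the claimed signature."""
    return hmac.compare_digest(content_signature(body), signature)
```

This illustrates Vanessa's objection too: a scraper who copies both the page and its published signature from an unauthenticated site could just re-sign the stolen text with their own registered secret.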

Danny: To your [the search engines'] credit, you've done lots of cool stuff. Some of it has worked and been widely adopted.
Amit: Robots nocontent came from conferences.

We've given resellers our content. Hundreds or thousands of sites have our content out there. Now what? How do you prove it's yours? What can we do to address it?
Vanessa: Unfortunately, you may just have to rewrite your content to regain control of it.
Amit: They might be outranking you for other reasons. You might want to work on other aspects of your marketing (get more authority/links/trust).
Danny: It's a trade-off: maybe resellers/affiliates will do a better job marketing for you.

Websites like eBay with good SEO teams are creating subdomains/multiple websites with unique content, and they're receiving more than the two results per SERP. Is this subdomain spam?
Vanessa: We're trying to serve up a variety of results.
Amit: Send test cases to Yahoo. We're working on it!
Danny: It's tough to fix this. Can you only show one blog hosted on wordpress.com in a SERP, then?
Eytan: We want to show a lot of unique but relevant content. We don't want to bring in a bunch more sites that are less relevant.

Why should we only see two listings or one listing of a piece of content? For example, when searching for a fact, why not just have something where you can click and see all dupes?
Peter: We try to be as accurate as possible. Dupes are the EXACT same content. Would that enhance the real user experience?

http://www.marketingpilgrim.com Andy Beal

LOL @ "So, how do you avoid having people copy your content? (Speaking for myself, to prevent scraping, I have this gun…"

Jordan McCollum

Totally wish I could claim credit for that, but really it comes from a friend in high school. We were doing mock college interviews and the teacher asked him what he’d do if there were a professor who wouldn’t work with him.

I thought of it immediately when the speakers brought that topic up.

http://www.technologyevangelist.com Ed Kohler

It seems like blog platforms could do a better job helping bloggers by addressing more of this stuff in their default themes. For example, avoiding republishing full posts on category pages.

http://www.alchemistmedia.com jonah stein

Jordan:

Just an FYI, I was the person asking the search engines to provide a duplicate content reporting tool as well as the ability to ask the engines to recognize a digital signature that we could register to detect scraping.

Thanks for the write-up.

Jonah

Vikleus

I had a big problem with duplicate content.
I was always adding new content to my site http://www.iptelshop.com, and then I saw that other people were copying it. I was very angry about it… I asked the people who copied my content to add my link after the articles… and only 60% of them did.