FAQ on Duplicate Content and Moving your site by Matt Cutts at PubCon, 2009

Matt Cutts from Google gave a phenomenal keynote at Pubcon 2009. Here are the audio and show notes for the FAQ session. In this session, he explained some of the key steps webmasters should remember when moving a site, how the duplicate content penalty works, and steps to take when optimizing a site.

Matt Cutts was asked about the impact of duplicate content and when to use the canonical link tag.

Matt Cutts: Guys, don’t worry about duplicate content. Why? Because who makes the duplicate content right? The white hats, they’re like, I make a valid site and maybe I had it in a different country. Maybe it’s (dot)uk, and I don’t want to be penalized for duplicate content; or I have a printable version of an article, and I don’t want that to be viewed as duplicate content. It’s really funny. Whenever someone asks me about duplicate content, I can pretty much assume that they’re a white hat. It’s kind of a nice little feature. So most of the time you do not need to worry. Search engines go out of their way to try to figure out, ‘What is the duplicate content, how do we remove it without causing any problems whatsoever?’ People very, very rarely get penalized for duplicate content.

How can the original content owner get the most PageRank?

Most of the time, if we detect one page and then another page being a duplicate, we will just say, ‘OK, this page we won’t show; we will show this page instead.’ If you’re the person who originally wrote the content, it’s very smart to embed a link to the original location of that article. Then if it ever gets scraped or copied or syndicated or whatever, that duplicate article will link to your original article. That pretty much guarantees that you, the original person who wrote that article, will have the most PageRank and show up first.

When and why to use the canonical tag?

We introduced something a few weeks ago called the ‘canonical link tag.’ The idea is that there are so many big sites that have duplicate content on their own site. Like, has anybody ever run into having a www and a non-www version of the site? Sometimes you tell the developers, ‘Put a 301 from the www version to the non-www version,’ and they don’t do it, and you end up with, like, 18 copies of the same site. Or the http version of the site and the https, secure, version of the site sometimes both show up. Sometimes you just have the same content, you know, sorted by ascending price and descending price.

In an HTML page, you have an element in the head, and that element basically says: the canonical version of this page, the preferred version, the pretty version, the clean version of this URL, is right here. So, for example, maybe you want all the sort pages to resolve to increasing price. You can say, ‘Okay, the canonical version of this page is sort equals ascending,’ or something like that. And it’s very, very simple. We treat it pretty much just like a 301.
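The head element Matt describes might look like this (the domain and sort parameter are illustrative, not from the talk):

```html
<!-- In the <head> of a sorted listing page, e.g. /products?sort=descending -->
<!-- Tells search engines which URL is the preferred (canonical) version -->
<link rel="canonical" href="http://www.example.com/products?sort=ascending" />
```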

What about hijacking?

Couldn’t I say that if I hack into this guy’s site right here, the canonical version is my black hat site? In order to prevent that from being an issue, you can only use the canonical tag within your own site. So that’s the automatic safety valve that protects against any kind of hijacking.

Now is it possible to have problems?

It is. IBM shot themselves in the foot with the canonical link tag. Whenever you visited the home page, they did a redirect and they also set the canonical link tag to be a page that was like a 404 and had never been crawled by Google. Yeah. That’s not good. So for a couple of days, we were like, ‘IBM.com?null,’ you know, or something like that. So don’t just jump into this. Take a little bit of time. Plan out how you’re going to do the canonical tag. Make sure that it makes sense. There’s time. It’s not an emergency, but it’s a really great way to solve any duplicate content problems.

When to file a Digital Millennium Copyright Act (DMCA) request and when to report spam?

Now if your duplicate content problem is an external person, someone who has scraped you, a spammer, there are two things you can do:

1. If they’re a legitimate company, file a Digital Millennium Copyright Act (DMCA) request. Google will take those pages out of the index.

2. Now if they’re a true spammer, if it’s casinos-viagra-cialis-onlinegambling.-info.US.biz or whatever, and they’re really scraping you and there’s nothing good on there at all, file a spam report. We do love to get those, and when we pull on that thread, we can sometimes find an entire spam network.

In short, you normally don’t need to worry about duplicate content. If you are interested, you’re more than welcome to use the canonical link tag; it works very well. Take your time, but it’s just like a little 301, a permanent redirect, on your site. And if you see somebody scraping you, or somebody who is a spammer, either file a DMCA request or file a spam report.

Q: Moving a site: We’ve talked 301’s, 302’s. What’s the best way to go about moving a site these days or a sub-site?

Matt Cutts: Okay. You want to know about moving to a new IP address, or moving to a new domain name?

Q: Both.

Matt Cutts: All right. Here we go. The IP address is easier, so let’s do that first. You have an old site at an old IP address. You bring up the same site at the new IP address, right? And the whole idea is that no matter which IP address you go to, you’ll still get the exact same site. So you want to mirror or duplicate across IP addresses. Before you even do that, there’s something in DNS, the Domain Name System. Whenever I type in ‘Google.com’ or ‘WebmasterWorld.com’ or ‘Pubcon.com’, there’s a domain name resolver that says, ‘Okay, the IP address for Pubcon.com is right here.’ And normally it says, ‘Okay, that’s cached for 24 hours, so if I look up Pubcon.com again, I’ll go straight to that IP address and not recheck for another 24 hours.’ There’s a setting called ‘TTL’, which stands for Time To Live. So you can set your DNS TTL, or Time To Live, to five minutes. And if you do that, then every five minutes, if you type in Pubcon.com, it’ll just recheck for the new IP address.

Step 1: Set your DNS Time To Live down to about five minutes.
Step 2: Bring up your site so that it exists on both IP addresses.
Step 3: Point from the old IP address to the new IP address.
Step 4: Wait about five minutes because that will let all those DNS caches flush.
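Steps 1 and 3 might look like this in a BIND-style zone file (the TTL values and IP addresses are illustrative, not from the talk):

```
; Step 1, before the move: drop the TTL so resolvers recheck every 5 minutes
$TTL 300                        ; was 86400 (24 hours)
www   IN   A   203.0.113.10     ; old server's IP address

; Step 3, after the new server is up: point the record at the new address
; www IN   A   198.51.100.20    ; new server's IP address
```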

In the worst case, if you want to be really paranoid, wait 24 hours in case there are a few people that cache things and aren’t really well behaved. As soon as you see Googlebot fetching things from the new IP address, you’re basically safe. At that point, feel free to take down the old site. So that’s how to move to a new IP address.

Moving to a new domain is a little trickier because you want to move to a new site, and you want to be as safe as possible. So typically what we recommend is doing what’s called a ‘Permanent Redirect.’

Whenever a client tries to fetch a web page, the server returns an HTTP status code. So, for example, if it returns the web page totally fine, that’s a 200. Whenever you try to fetch a page and it can’t be found, the web server says 404. Whenever you try to fetch a page and it’s moved, it can be a 301 or a 302. A 301 means a permanent redirect; a 302 means a temporary redirect. So you want to have a 301, permanent redirect, from the old site to the new site. And Google handles that completely fine, right? But this is your site. You want to be really safe. You want to be really cautious.
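On an Apache server, the whole-site permanent redirect Matt describes could be sketched like this (the domain name is hypothetical):

```apache
# .htaccess on the old domain: send every request to the new domain
# with a 301 (permanent) status rather than a 302 (temporary) one
RewriteEngine On
RewriteRule ^(.*)$ http://www.newdomain.example/$1 [R=301,L]
```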

So here’s the extra step. Don’t just move the entire domain from the old domain to the new domain all at once. Start out by moving a sub-directory or a sub-domain first; if you’ve got a forum, move one part of your forum. Move that over to the new domain, and make sure that the rankings for that one part of your site don’t crash. Sometimes it takes a week or so for them to equalize, because we have to crawl those pages to see that they’ve moved. So if you move a part of your site first, and it goes fine, then you know that you’re pretty safe. So instead of doing one huge move, if you can break it down into smaller chunks and start out by moving a small part of your site first, you’ll know that you’ll be golden.

301 old pages to new pages: no limit on the number of 301s

The other thing is, a lot of people worry: what if I have a 301 from every single page on my old site to every single page on my new site? That’s totally fine. In fact, it’s really good: if you’ve got five different pages on the old site, 301 them to the exact five corresponding pages on the new site. So don’t just redirect everything straight to the root page. We don’t have any limit at all on how many 301 redirects you can have on the old site, so that should be totally good.
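A hypothetical sketch of such page-to-page 301s in an Apache .htaccess on the old domain (paths and domain are made up):

```apache
# Each old URL maps to its exact counterpart on the new domain,
# not just to the new domain's root page
Redirect permanent /about.html   http://newdomain.example/about.html
Redirect permanent /forum/       http://newdomain.example/forum/
Redirect permanent /contact.html http://newdomain.example/contact.html
```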

What if you cannot do a 301 redirect because your old site does not exist anymore?

So here’s the tip that I would give for that. Go to Webmaster Tools, Google.com/Webmasters, verify your site, prove that you own your site. It’s a pretty simple process, and then you can see pretty much an exhaustive list of all the back-links that Google knows about for your site. So you can take those back-links and say, ‘Oh, that’s a really important one. I need to email her to ask her to update her link to my new site,’ right? You don’t have to email everybody that links to your old site, but if it’s, you know, Time.com or CNN.com that links to your old site, at least try to contact them and say, ‘Hey, I’ve moved to a new site. Why don’t you shift it over here?’ So you can at least cherry-pick amongst your back-links and try to get the important ones pointed directly to the new site.

Q: You visit a SERP, click through to a site, and have some experience on that site. Then you either go straight back to Google, perform another search, and click on another result, or you stay on that site. Somehow that interaction, that bounce rate or time-on-site, is being used to affect Google rankings. Is that true or not true?

Matt Cutts: So it’s really funny, because they do use things like bounce rate in ads, but like three or four Pubcons ago, right after we bought Google Analytics, somebody asked, ‘Do you promise never to use Google Analytics?’ and I made a promise at that time. I said, ‘I promise my team will never go and get data from Google Analytics,’ and we never have. So the web spam team absolutely does not use it.

People have asked about toolbar data, click-through, bounce rate, all that sort of stuff, and I’m not going to take any signal off the table. I’m not gonna say we might not use it. But here’s why it might be problematic: it’s a really noisy signal. I think Microsoft has actually confirmed that they do use click-through data for search results. I was on a panel, and the guy from Microsoft was right next to me, and you could hear the entire audience start getting mad. That’s not cool. And it was really funny, because they said that they use click-through, and so I genuinely believe that they get hit by a lot of spam.

Noisy stuff like toolbar data and behavioral metrics, if you used it, would be really, really easy to spam. So I personally am a pretty big skeptic of stuff like, you know, raw click-through, behavioral metrics, usage data, all that sort of stuff. I’m not gonna take it off the table. I’m not gonna say we will never use it, but, you know, you can imagine there are some creative people in this room, a few wearing black hats: ‘Oh, it’ll help me to go and click on my result every day.’ They would totally do it. So it’s a very appealing idea, but in practice you really have to worry about spam.

Q: When you have so many data sources coming in, you’ve got toolbars, you’ve got services, you’ve got Google Analytics, you’ve got AdSense, you know, AdWords. You have all the onsite stuff, offsite stuff. You’ve got ISP stuff coming in. What’s the weirdest thing you’ve heard? What’s the biggest misconception like this, like bounce rates, you know, that you hear about, that you just absolutely won’t do?

Matt Cutts: It’s kind of the whole thing; it’s the same thing with referrers. Everybody assumes, ‘Oh, there’s this master plan, you know, and it’s really one team, and Google is trying to make things really fast.’ Everybody assumes there’s this master data warehouse at Google, and all the data comes in and gets mixed together into one global profile of Brad. And, ‘Oh, he’s doing some unusual searches this morning,’ right? And it’s really not that way. Just like any company, you typically have silos, and so I don’t think I’ve talked to an ads person in a while, right? Web spam, search quality: we’ve got our stuff that we care about, and the vast majority of the time that’s links, that’s PageRank, things like that. Not, you know, thinking about whether to use Google Analytics. We just don’t.

Q: Okay, great. If you were an SEO, what would you do?

Matt Cutts: I’ll give a real quick one, which is: I still think there’s a ton of room for services. There’s so much room for services that do cool things with Twitter. There are so many ways to get good links. There was a site called ‘Dolores Labs,’ and they specialize in using Mechanical Turk from Amazon. They took the top 200 people on Twitter, and they said, ‘Okay, here is the Twitter stream for Barack Obama. Give me one word about Barack Obama.’ And they sent that out to five people each. So they got five words about Barack Obama: president, winner, spam, Democrat, bore, right? This is a random cross-section of people. Some of them are outside the United States.

So this is pretty interesting, and they didn’t just display it this way. They also made a graph, right. So Gary B, you know, you can mouse over, and you can see he’s pedestrian, active, simplistic, handsome, football. If you got a message that said, ‘Hey, five random people think you are a snore, a programmer, knowledgeable and cute,’ how could you not link to that or write a blog post about that, right?

So these guys have targeted 200 incredibly influential people. These guys got a lot of links, a ton of links. How much money do you think they spent? $25.00, right? So links can be had. It’s all about creativity, right? Now if you just repeat this experiment, ‘Oh, I’m gonna do it for the top 250 Twitterers,’ you probably won’t get as many links. But if you’re the first person to come up with a creative idea, you absolutely can get a ton of links as a result. So I would probably be coming up with some service, some product that a lot of people would like and want to talk about.

Q: Let’s go to that second one there. What would you do to optimize your own site?

Matt Cutts: So a web site is a lot like a house. It has architecture. You want the foundation to be strong. So I would say the first one would be your URL structure, your site architecture, right? If you’re using something like WordPress or Joomla, a lot of that is built in for you; you don’t have to worry about it. But think about DMOZ. They have a very clean structure, a very nice fan-out. It’s a very tree-like structure, you know, 13 top-level categories, and each top-level category goes down from there. If you have a URL structure where everything links crazily all around and you get confused, your visitors aren’t going to like it quite as much. So first off, I would say URL structure.

Second, think about titles. Now this question was all about on-page, so I’m not going to talk about back-links at all. Get good titles. Get good keywords not just in your title, but also in your URL. So, URL structure, titles, keywords and keywords in the content.

If you want a very simple tip, go to the Google AdWords keyword tool (search for ‘keyword tool’). Find out all the phrases associated with your search term. You get a list of four or five phrases. Incorporate them into your copy naturally, right? You end up with one page that mentions, ‘Yeah, a cat and cats’ in a completely natural way.

Good tip: if you go to my blog and search for something like Firefox printing in Linux, you’ll notice the titles of my posts are different from the URLs of my posts, because those are two different places you can put keywords, right? And you can totally say, like, ‘Default printer in Firefox’ for the URL and then ‘How to print in Firefox’ for the title.
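As a sketch, such a post would pair a keyword-rich URL slug with a different keyword-rich title (the slug and title below are from Matt's example; the markup around them is illustrative):

```html
<!-- Post URL: http://example.com/default-printer-in-firefox/ -->
<head>
  <!-- The title carries a different set of keywords than the URL slug -->
  <title>How to Print in Firefox</title>
</head>
```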

I would check your server logs. You are already ranking for a ton of terms that you do not expect to be ranking for, right? And so if you see that you’re on page 2 for ferminator, maybe if you add one more article about ferminator and cross-link them a little bit or something, you might be on page 1 for ferminator.

So look at the stuff that you’re really really close on. You’re already on page 2 and figure out how you can rank on page one.
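As an illustration of mining server logs for the terms you already rank for, here is a minimal sketch that pulls Google search queries out of the Referer field of Apache combined-format log lines (the sample log line and query term are made up):

```python
import re
from urllib.parse import urlparse, parse_qs

def search_queries(log_lines):
    """Extract Google search queries from the Referer field of
    Apache combined-format log lines."""
    queries = []
    for line in log_lines:
        # Quoted fields in combined format: request, referer, user agent
        fields = re.findall(r'"([^"]*)"', line)
        if len(fields) < 2:
            continue
        referer = urlparse(fields[1])
        # Keep only referrals from a Google search results page
        if "google." in referer.netloc and referer.path == "/search":
            q = parse_qs(referer.query).get("q")
            if q:
                queries.append(q[0])
    return queries

sample = [
    '1.2.3.4 - - [20/Nov/2009:10:00:00 -0800] "GET /ferminator.html HTTP/1.1" '
    '200 5120 "http://www.google.com/search?q=ferminator&start=10" "Mozilla/5.0"',
]
print(search_queries(sample))  # prints ['ferminator']
```

Tallying the extracted queries would then show which terms already send traffic and which near-miss terms are worth one more article.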

And then the last tip, add at least one page of content every day.

So that’s what I would say: titles, URLs, keywords naturally in the content, a page of content every day, and check your server logs.

Q: Okay. What’s your favorite up there? We’ve got time for one more up there.

Matt Cutts: Does Google have a fast index? Yes, we do. We absolutely do. A Google update in 2003 was the switch from monthly to daily. We can now go much, much faster than daily.

And since that was a fast question, let me just say: does Google give people with more followers on Twitter more PageRank? No, we don’t. You know, people ask us about .edus and .govs and about Twitter. They say, ‘Oh, surely you give Twitter URLs a boost. How can you not?’ We don’t.

If you get a lot of PageRank to your Twitter page or your Twitter profile, then we’re likely to show that in the search results. But we don’t say, ‘Oh,’ you know, ‘this guy has got 20,000 followers on Twitter, therefore give him a little extra PageRank’ or anything like that. We just let the links flow naturally the way that they’re gonna flow.
