If you’ve ever had any questions about the canonical tag, well, have we got the Whiteboard Friday for you. In today’s episode, Rand defines what rel=canonical means and its intended purpose, when it’s recommended you use it, how to use it, and sticky situations to avoid.

Video Transcription

Howdy, Moz fans, and welcome to another edition of Whiteboard Friday. This week, we’re going to chat about some SEO best practices for canonicalization and use of the rel=canonical tag.

Before we do that, I think it pays to talk about what a canonical URL is, because a canonical URL doesn’t just refer to a page that we’re targeting with the rel=canonical tag. Canonicalization has, in fact, been around much longer than the rel=canonical tag itself, which came out in 2009, and there are a bunch of different things that a canonical URL can mean.

What is a “canonical” URL?

So first off, what we’re trying to say is this URL is the one that we want Google and the other search engines to index and to rank. The other URLs, the ones that have similar content, serve a similar purpose, or are exact duplicates that for some reason live at additional URLs, should all tell the search engines, “No, no, this guy over here is the one you want.”

So, for example, I’ve got a canonical URL, ABC.com/a.

Then I have a duplicate of that for some reason. Maybe it’s a historical artifact or a problem in my site architecture. Maybe I intentionally did it. Maybe I’m doing it for some sort of tracking or testing purposes. But that URL is at ABC.com/b.

Then I have this other version, ABC.com/a?ref=twitter. What’s going on there? Well, that’s a URL parameter. The URL parameter doesn’t change the content. The content is exactly the same as A, but I really don’t want Google to get confused and rank this parameterized version, which can happen, by the way. You’ll sometimes see URLs that are not the original version, that have some weird URL parameter, ranking in Google. Sometimes the parameterized version even gets more links than the original because it’s the one that got shared on Twitter, and so that’s the one everybody picked up and copied and pasted and linked to. That’s all fine and well, so long as we canonicalize it.

Or this one, a print version at ABC.com/aprint.html. So, in all of these cases, what I want to do is tell Google, for each of the duplicates, “Don’t index this one. Index /a instead.”

I can do that using the link rel=canonical tag, with the href telling Google, “This is the page.” You put this in the head of any document and Google will know, “Aha, this is a copy or a clone or a duplicate of this other one. I should canonicalize all of my ranking signals, and I should make sure that the canonical version is the one that ranks.”

By the way, you can be self-referential. So it is perfectly fine for ABC.com/a to go ahead and use this as well, pointing to itself. That way, in the event that someone you’ve never even met decides to plug in question mark, some weird parameter and point that to you, you’re still telling Google, “Hey, guess what? This is the original version.”
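
To make that concrete, here’s a minimal sketch (using the hypothetical ABC.com URLs from this example, plus the requests and BeautifulSoup libraries) that fetches a page and reports the canonical URL declared in its head. If things are set up as described above, every variant, including /a itself, should report /a.

```python
# Fetch a page and report the canonical URL declared in its <head>.
import requests
from bs4 import BeautifulSoup

def get_canonical(url):
    """Return the href of the page's <link rel="canonical">, or None if absent."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all("link", href=True):
        if "canonical" in (tag.get("rel") or []):
            return tag["href"]
    return None

for url in ["https://abc.com/a",               # self-referential canonical
            "https://abc.com/b",               # duplicate that should point to /a
            "https://abc.com/a?ref=twitter"]:  # parameterized copy, same content
    print(url, "->", get_canonical(url))
```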

Great. So since I don’t want Google to be confused, I can use this canonicalization process to do it. The rel=canonical tag is a great way to go. By the way, FYI, it can be used cross-domain. So, for example, if I republish the content on A at something like a Medium.com/@RandFish, which is, I think, my Medium account, /a, guess what? I can put in a cross-domain rel=canonical telling them, “This one over here.” Now, even if Google crawls this other website, they are going to know that this is the original version. Pretty darn cool.

Different ways to canonicalize multiple URLs

There are different ways to canonicalize multiple URLs.

1. Rel=canonical.

I mention that rel=canonical isn’t the only one. It’s one of the most strongly recommended, and that’s why I’m putting it at number one. But there are other ways to do it, and sometimes we want to apply some of these other ones. There are also not-recommended ways to do it, and I’m going to discuss those as well.

2. 301 redirect.

The 301 redirect, this is basically a status code telling Google, “Hey, you know what? I’m going to take /b, I’m going to point it to /a. It was a mistake to ever have /b. I don’t want anyone visiting it. I don’t want it clogging up my web analytics with visit data. You know what? Let’s just 301 redirect that old URL over to this new one, over to the right one.”
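
If you want to confirm that a redirect of this kind is actually in place, a quick check (hypothetical URLs again) looks something like this:

```python
# Check that the non-canonical URL answers with a permanent redirect to the canonical one.
import requests

resp = requests.get("https://abc.com/b", allow_redirects=False, timeout=10)
print(resp.status_code)              # expect 301
print(resp.headers.get("Location"))  # expect https://abc.com/a
```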

3. Passive parameters in Google Search Console.

Some parts of me like this, some parts of me don’t. I think for very complex websites with tons of URL parameters and a ton of URLs, it can be just an incredible pain sometimes to go to your web dev team and say like, “Hey, we got to clean up all these URL parameters. I need you to add the rel=canonical tag to all these different kinds of pages, and here’s what they should point to. Here’s the logic to do it.” They’re like, “Yeah, guess what? SEO is not a priority for us for the next six months, so you’re going to have to deal with it.”

Probably lots of SEOs out there have heard that from their web dev teams. Well, guess what? You can do an end-around, and this is a fine way to do that in the short term. Log in to the Google Search Console account that’s connected to your website. Make sure you’re verified. Then you can basically tell Google, through the URL Parameters section, to make certain kinds of parameters passive.

So, for example, you have sessionid=blah, blah, blah. You can set that to be passive. You can set it to be passive on certain kinds of URLs. You can set it to be passive on all types of URLs. That helps tell Google, “Hey, guess what? Whenever you see this URL parameter, just treat it like it doesn’t exist at all.” That can be a helpful way to canonicalize.
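
The Search Console setting is configured in the interface rather than in code, but conceptually it amounts to collapsing URLs that differ only by a meaningless parameter. Here’s a small illustrative sketch (the parameter names are assumptions taken from the examples above):

```python
# Illustrative only: "treat this parameter as passive" means two URLs that differ
# only by that parameter collapse to the same page.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

PASSIVE_PARAMS = {"sessionid", "ref"}  # parameters we consider meaningless (assumption)

def strip_passive(url):
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in PASSIVE_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

print(strip_passive("https://abc.com/a?sessionid=12345"))  # -> https://abc.com/a
print(strip_passive("https://abc.com/a?ref=twitter"))      # -> https://abc.com/a
```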

4. Use location hashes.

So let’s say that my goal with /b was basically to have exactly the same content as /a but with one slight difference, which was I was going to take a block of content about a subsection of the topic and place that at the top. So A has the section about whiteboard pens at the top, but B puts the section about whiteboard pens toward the bottom, and they put the section about whiteboards themselves up at the top. Well, it’s the same content, same search intent behind it. I’m doing the same thing.

Well, guess what? You can use the hash in the URL. So it’s a#b and that will jump someone — it’s also called a fragment URL — jump someone to that specific section on the page. You can see this, for example, Moz.com/about/jobs. I think if you plug in #listings, it will take you right to the job listings. Instead of reading about what it’s like to work here, you can just get directly to the list of jobs themselves. Now, Google considers that all one URL. So they’re not going to rank them differently. They don’t get indexed differently. They’re essentially canonicalized to the same URL.
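
Under the hood, everything after the hash is a fragment, and search engines index only the part before it. A tiny illustration using Python’s standard library and the Moz URL from the example:

```python
# The fragment is resolved by the browser; search engines index the URL without it.
from urllib.parse import urldefrag

url, fragment = urldefrag("https://moz.com/about/jobs#listings")
print(url)       # https://moz.com/about/jobs  (the one URL that gets indexed)
print(fragment)  # listings
```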

NOT RECOMMENDED

I do not recommend…

5. Blocking Google from crawling one URL but not the other version.

Because guess what? Even if you use robots.txt and you block Googlebot’s spider and you send them away and they can’t reach it because you said robots.txt disallow /b, Google will not know that /b and /a have the same content on them. How could they?

They can’t crawl it. So they can’t see anything that’s here. It’s invisible to them. Therefore, they’ll have no idea that any ranking signals, any links that happen to point there, any engagement signals, any content signals, whatever ranking signals that might have helped A rank better, they can’t see them. If you canonicalize in one of these ways, now you’re telling Google, yes, B is the same as A, combine their forces, give me all the rankings ability.
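
Here’s a small sketch of why blocking can’t canonicalize, using Python’s robots.txt parser with the hypothetical Disallow: /b rule from this example: a compliant crawler never fetches /b, so it can never discover that /b duplicates /a.

```python
# A blocked URL is invisible to the crawler, so its signals can never be combined with /a's.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /b"])  # the hypothetical rule from this example

print(rp.can_fetch("Googlebot", "https://abc.com/b"))  # False: content and signals unseen
print(rp.can_fetch("Googlebot", "https://abc.com/a"))  # True
```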

6. I would also not recommend blocking indexation.

So you might say, “Ah, well Rand, I’ll use the meta robots noindex tag, so that way Google can crawl it, they can see that the content is the same, but I won’t allow them to index it.” Guess what? Same problem. They can see that the content is the same, but unless Google is smart enough to automatically canonicalize, which I would not trust them on (I would always trust yourself first), you are essentially, again, preventing them from combining the ranking signals of B into A, and that combination is exactly what you want.

7. I would not recommend using the 302, the 307, or any other 30x other than the 301.

The 301 is the guy that you want. It is a permanent redirect. It is the most likely to be successful in canonicalization, even though Google has said, “We often treat 301s and 302s similarly.” The exception to that rule is canonicalization: a 301 is probably better for it. And guess what we’re trying to do? Canonicalize!

8. Don’t 40x the non-canonical version.

So don’t take /b and be like, “Oh, okay, that’s not the version we want anymore. We’ll 404 it.” Don’t 404 it when you could 301. If you send /b over to /a with a 301, or you use the rel=canonical in its header, you take all the signals and you point them to A. You lose them if you 404 /b. Now, all the signals from B are gone. That’s a sad and terrible thing. You don’t want to do that either.

The only time I might do this is if the page is very new or it was just an error. You don’t think it has any ranking signals, and you’ve got a bunch of other problems. You don’t want to deal with having to maintain the URL and the redirect long term. Fine. But if this was a real URL and real people visited it and real people linked to it, guess what? You need to redirect it because you want to save those signals.

When to canonicalize URLs

Last but not least, when should we canonicalize URLs versus not?

I. If the content is extremely similar or exactly duplicate.

Well, if it is the case that the content is either extremely similar or exactly duplicate on two different URLs, two or more URLs, you should always collapse and canonicalize those to a single one.

II. If the content is serving the same (or nearly the same) searcher intent (even if the keyword targets vary somewhat).

If the content is not duplicate, maybe you have two pages that are completely unique about whiteboard pens and whiteboards, but even though the content is unique, meaning the phrasing and the sentence structures are different, that does not mean that you shouldn’t canonicalize.

For example, this Whiteboard Friday about using the rel=canonical, about canonicalization is going to replace an old version from 2009. We are going to take that old version and we are going to use the rel=canonical. Why are we going to use the rel=canonical? So that you can still access the old one if for some reason you want to see the version that we originally came out with in 2009. But we definitely don’t want people visiting that one, and we want to tell Google, “Hey, the most up-to-date one, the new one, the best one is this new version that you’re watching right now.” I know this is slightly meta, but that is a perfectly reasonable use.

What I’m trying to aim at is searcher intent. So if the content is serving the same or nearly the same searcher intent, even if the keyword targeting is slightly different, you want to canonicalize those multiple versions. Google is going to do a much better job of ranking a single piece of content that has lots of good ranking signals for many, many keywords that are related to it, rather than splitting up your link equity and your other ranking signal equity across many, many pages that all target slightly different variations. Plus, it’s a pain in the butt to come up with all that different content. You would be best served by the very best content in one place.

III. If you’re republishing or refreshing or updating old content.

Like the Whiteboard Friday example I just used, you should use the rel=canonical in most cases. There are some exceptions. If you want to maintain that old version, but you’d like the old version’s ranking signals to come to the new version, you can take the content from the old version and republish it at /a-old. Then publish the new version at /a, have that version be the canonical one, and let the old version live at the /a-old URL you’ve just created. So for republishing, refreshing, or updating old content, canonicalization is generally the way to go, and you can preserve the old version if you want.

IV. If content, a product, an event, etc. is no longer available and there’s a near best match on another URL.

If you have content that is expiring, a piece of content, a product, an event, something like that that’s going away, it’s no longer available and there’s a next best version, the version that you think is most likely to solve the searcher’s problems and that they’re probably looking for anyway, you can canonicalize in that case, usually with a 301 rather than with a rel=canonical, because you don’t want someone visiting the old page where nothing is available. You want both searchers and engines to get redirected to the new version, so good idea to essentially 301 at that point.

Okay, folks. Look forward to your questions about rel=canonicals, canonical URLs, and canonicalization in general in SEO. And we’ll see you again next week for another edition of Whiteboard Friday. Take care.

Wouldn’t it be useful to know the total organic sessions and conversions to all of your products? Every week?

If you have access to some analytics for an e-commerce company, try and generate that report now. Give it 5 minutes.

…

Done?

Or did that quick question turn out to be deceptively complicated? Did you fall into a rabbit hole of scraping and estimations?

Not being able to easily answer that question — and others like it — is costing you thousands every year.

Let’s jump back a step

Every online business, whether it’s a property portal or an e-commerce store, will likely have spent hours and hours agonizing over decisions about how their website should look, feel, and be constructed.

The biggest decision is usually this: What will we build our website with? And from there, there are hundreds of decisions, all the way down to what categories should we have on our blog?

Each of these decisions will generate future costs and opportunities, shaping how the business operates.

Somewhere in this process, a URL structure will be decided on. Hopefully it will be logical, but the context in which it’s created is different from how it ends up being used.

As a business grows, the desire for more information and better analytics grows. We hire data analysts and pay agencies thousands of dollars to go out, gather this data, and wrangle it into a useful format so that smart business decisions can be made.

It’s too late. You’ve already wasted £1000s a year.

It’s already too late; by this point, you’ve already created hours and hours of extra work for the people who have to analyze your data and thousands will be wasted.

All because no one structured the URLs with data gathering in mind.

How about an example?

Let’s go back to the problem we talked about at the start, but go through the whole story. An e-commerce company goes to an agency and asks them to get total organic sessions to all of their product pages. They want to measure performance over time.

Now this company was very diligent when they made their site. They’d read Moz and hired an SEO agency when they designed their website and so they’d read this piece of advice: products need to sit at the root. (E.g. mysite.com/white-t-shirt.)

Apparently a lot of websites read this piece of advice, because with minimal searching you can find plenty of sites whose ranking product pages do sit at the root: Appleyard Flowers, Game, Tesco Direct.

At one level it makes sense: a product might be in multiple categories (LCD & 42” TVs, for example), so you want to avoid duplicate content. Plus, if you changed the categories, you wouldn’t want to have to redirect all the products.

But from a data gathering point of view, this is awful. Why? There is now no way in Google Analytics to select all the products unless we had the foresight to set up something earlier, like a custom dimension or content grouping. There is nothing that separates the product URLs from any other URL we might have at the root.

How could our hypothetical data analyst get the data at this point?

They might have to crawl all the pages on the site so they can pick them out with an HTML footprint (a particular piece of HTML on a page that identifies the template), or get an internal list from whoever owns the data in the organization. Once they’ve got all the product URLs, they’ll then have to match that list to the Google Analytics data in Excel, probably with a VLOOKUP or, if the data set is too large, a database.
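
As a rough sketch of that matching step, assume two exported CSVs: a product URL list from a crawl and a Google Analytics landing-page export already filtered to organic sessions (file and column names here are illustrative):

```python
# Match a crawled list of product URLs against a GA landing-page export
# (the pandas equivalent of the VLOOKUP described above).
import pandas as pd

products = pd.read_csv("product_urls.csv")       # one column: "url"
analytics = pd.read_csv("ga_landing_pages.csv")  # columns: "landing_page", "sessions"

merged = products.merge(analytics, left_on="url", right_on="landing_page", how="left")
print(merged["sessions"].fillna(0).sum())        # total sessions to product pages
```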

Shoot. This is starting to sound quite expensive.

And of course, if you want to do this analysis regularly, that list will constantly change. The range of products being sold will change. So it will need to be a scheduled scrape or an automated report. If we go the scraping route, scheduled crawling isn’t possible with Screaming Frog, so now we’re either spending regular time on Screaming Frog or paying for a cloud crawler that can be scheduled. If we go the other route, we could have a dev build us an internal automated report, once we can get the resource internally.

Wow, now this is really expensive: a couple days’ worth of dev time, or a recurring job for your SEO consultant or data analyst each week.

This could’ve been a couple of clicks on a default report.

If we have the foresight to put all the products in a folder called /products/, this entire lengthy process becomes one step:

Load the landing pages report in Google Analytics and filter for URLs beginning with /products/.
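
The same one-step filter expressed against a GA export (again with illustrative column names):

```python
# With all products in a /products/ folder, the report is a single filter.
import pandas as pd

analytics = pd.read_csv("ga_landing_pages.csv")  # columns: "landing_page", "sessions"
product_rows = analytics[analytics["landing_page"].str.startswith("/products/", na=False)]
print(product_rows["sessions"].sum())
```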

Congratulations — you’ve just cut a couple days off your agency fee, saved valuable dev time, or gained the ability to fire your second data analyst because your first is now so damn efficient (sorry, second analysts).

As a data analyst or SEO consultant, you continually bump into these kinds of issues, which suck up time and turn quick tasks into endless chores.

What is unique about a URL?

For most analytics services, it’s the main piece of information you can use to identify the page. Google Analytics, Google Search Console, log files: most of the time the URL is all they have access to, and in some cases it’s all you’ll ever get. You can’t change that.

The vast majority of site analyses require working with templates and generalizing across groups of similar pages, and you need to be able to do that by URL.

It’s crucial.

There’s a Jeff Bezos saying that’s appropriate here:

“There are two types of decisions. Type 1 decisions are not reversible, and you have to be very careful making them. Type 2 decisions are like walking through a door — if you don’t like the decision, you can always go back.”

Setting URLs is very much a Type 1 decision. As anyone in SEO knows, you really don’t want to be constantly changing URLs; it causes a lot of problems, so when they’re being set up we need to take our time.

How should you set up your URLs?

How do you pick good URL patterns?

First, let’s define a good pattern. A good pattern is something which we can use to easily select a template of URLs, ideally using contains rather than any complicated regex.

This usually means we’re talking about adding folders because they’re easiest to find with just a contains filter, i.e. /products/, /blogs/, etc.

We also want to keep things human-readable when possible, so we need to bear that in mind when choosing our folders.

So where should we add folders to our URLs?

I always ask the following two questions:

Will I need to group the pages in this template together?

If a set of pages needs grouping, they need to go in the same folder so we can identify them by URL.

Are there crucial sub-groupings for this set of pages? If there are, are they mutually exclusive and how often might they change?

If there are common groupings I may want to make, then I should consider putting this in the URL, unless those data groupings are liable to change.

Back to our e-commerce example: will I need to group the products together? Yes, almost certainly. There clearly needs to be a way of grouping them in the URL, so we should put them in a /products/ folder.

Within this template, how might I need to group these URLs together? The most plausible grouping for products is the product category. Let’s take a black midi dress.

What about putting “little black dress” or “midi” as a category? Well, are they mutually exclusive? Our dress could fit in the “little black dress” category and the “midi dress” category, so that’s probably not something we should add as a folder in the URL.

What about moving up a level and using “dress” as a category? Now that is far more suitable, if we could reasonably split all our products into:

Dresses

Tops

Skirts

Trousers

Jeans

And if we were happy with having jeans and trousers separate then this might indeed be an excellent fit that would allow us to easily measure the performance of each top-level category. These also seem relatively unlikely to change and, as long as we’re happy having this type of hierarchy at the top (as opposed to, say, “season,” for example), it makes a lot of sense.

What are some common URL patterns people should use?

Product pages

We’ve banged on about this enough and gone through the example above. Stick your products in a /products/ folder.

Articles

Applying the same rules we talked about to articles, two things jump out. The first is top-level categorization.

For example, adding in the following folders would allow you to easily measure the top-level performance of articles:

Travel

Sports

News

You should, of course, be keeping them all in a /blog/ or /guides/ etc. folder too, because you won’t want to group just by category.

The second, which also obeys all our rules, is author groupings, which may be well-suited to editorial sites with a large number of authors they want performance stats on.

Location grouping

Many types of websites often have category pages per location. For example:

Cars for sale in Manchester – /for-sale/vehicles/manchester

Cars for sale in Birmingham – /for-sale/vehicles/birmingham

However, there are many different levels of location granularity. For example, here are 4 different URLs, each a more specific location than the one above it (sorry to all our non-UK readers — just run with me here).

Cars for sale in Suffolk – /for-sale/vehicles/suffolk

Cars for sale in Ipswich – /for-sale/vehicles/ipswich

Cars for sale in Ipswich center – /for-sale/vehicles/ipswich-center

Cars for sale on Lancaster road – /for-sale/vehicles/lancaster-road

Obviously every site will have different levels of location granularity, but a grouping often missing here is the level of granularity itself, included as its own folder in the URL (for example, a /county/ or /town/ folder before the location name).

This makes it very easy to assess and measure the performance of each layer so you can understand if it’s necessary, or if perhaps you’ve aggregated too much.

What other good (or bad) examples of this has the community come across? Let’s hear it!

It was once commonplace for developers to code relative URLs into a site. There are a number of reasons why that might not be the best idea for SEO, and in today’s Whiteboard Friday, Ruth Burr Reedy is here to tell you all about why.

Let’s discuss some non-philosophical absolutes and relatives

Howdy, Moz fans. My name is Ruth Burr Reedy. You may recognize me from such projects as when I used to be the Head of SEO at Moz. I’m now the Senior SEO Manager at BigWing Interactive in Oklahoma City. Today we’re going to talk about relative versus absolute URLs and why they are important.

At any given time, your website can have several different configurations that might be causing duplicate content issues. You could have just a standard http://www.example.com. That’s a pretty standard format for a website.

But the main sources that we see of domain-level duplicate content are when the non-www version of example.com does not redirect to the www version or vice versa, and when the HTTPS versions of your URLs are not forced to resolve to the HTTP versions or, again, vice versa. What this can mean is that if both of these scenarios are true, so that http://example.com, http://www.example.com, https://example.com, and https://www.example.com all resolve without being forced to a single canonical version, you can, in essence, have four versions of your website out on the Internet. This may or may not be a problem.

It’s not ideal for a couple of reasons. Number one, some people think that duplicate content is going to give you a penalty. It isn’t: duplicate content is not going to get your website penalized in the same way that you might see a spammy link penalty from Penguin. There’s no actual penalty involved. You won’t be punished for having duplicate content.

The problem with duplicate content is that you’re basically relying on Google to figure out what the real version of your website is. Google is seeing the URL from all four versions of your website. They’re going to try to figure out which URL is the real URL and just rank that one. The problem with that is you’re basically leaving that decision up to Google when it’s something that you could take control of for yourself.

There are a couple of other reasons that we’ll go into a little bit later for why duplicate content can be a problem. But in short, duplicate content is no good.

However, just having these URLs not resolve to each other may or may not be a huge problem. When it really becomes a serious issue is when that problem is combined with injudicious use of relative URLs in internal links. So let’s talk a little bit about the difference between a relative URL and an absolute URL when it comes to internal linking.

With an absolute URL, you are putting the entire web address of the page that you are linking to in the link. You’re putting your full domain, everything in the link, including /page. That’s an absolute URL.

However, when coding a website, it’s a fairly common web development practice to instead code internal links with what’s called a relative URL. A relative URL is just /page. Basically what that does is it relies on your browser to understand, “Okay, this link is pointing to a page that’s on the same domain that we’re already on. I’m just going to assume that that is the case and go there.”
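
To see what the browser (or a crawler) actually does with a relative href, here’s a small illustration using Python’s standard library. It also previews the duplicate-content problem discussed below, because the same relative link resolves against whichever hostname you happen to be on:

```python
# The same relative href produces a different absolute URL for every domain variant.
from urllib.parse import urljoin

for base in ["http://example.com/",
             "http://www.example.com/",
             "https://example.com/",
             "https://www.example.com/"]:
    print(urljoin(base, "/page"))
```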

There are a couple of really good reasons to code relative URLs

1) It is much easier and faster to code.

When you are a web developer and you’re building a site with thousands of pages, coding relative rather than absolute URLs is a way to be more efficient. You’ll see it happen a lot.

2) Staging environments

Another reason why you might see relative versus absolute URLs is some content management systems — and SharePoint is a great example of this — have a staging environment that’s on its own domain. Instead of being example.com, it will be examplestaging.com. The entire website will basically be replicated on that staging domain. Having relative rather than absolute URLs means that the same website can exist on staging and on production, or the live accessible version of your website, without having to go back in and recode all of those URLs. Again, it’s more efficient for your web development team. Those are really perfectly valid reasons to do those things. So don’t yell at your web dev team if they’ve coded relative URLs, because from their perspective it is a better solution.

Relative URLs will also cause your page to load slightly faster. However, in my experience, the SEO benefits of having absolute versus relative URLs in your website far outweigh the teeny-tiny bit longer that it will take the page to load. It’s very negligible. If you have a really, really long page load time, there’s going to be a whole boatload of things that you can change that will make a bigger difference than coding your URLs as relative versus absolute.

Page load time, in my opinion, not a concern here. However, it is something that your web dev team may bring up with you when you try to address with them the fact that, from an SEO perspective, coding your website with relative versus absolute URLs, especially in the nav, is not a good solution.

There are even better reasons to use absolute URLs

1) Scrapers

If you have all of your internal links as relative URLs, it would be very, very, very easy for a scraper to simply scrape your whole website and put it up on a new domain, and the whole website would just work. That sucks for you, and it’s great for that scraper. But unless you are out there doing public services for scrapers, for some reason, that’s probably not something that you want happening with your beautiful, hardworking, handcrafted website. That’s one reason. There is a scraper risk.

2) Preventing duplicate content issues

But the other reason why it’s very important to have absolute rather than relative URLs is that it really mitigates the duplicate content risk that comes up when you don’t have all of these versions of your website resolving to one version. Google could potentially enter your site on any one of these four versions. To you they’re the same page on the same domain; to Google, they’re four different pages on four different domains.

But they could enter your site, and if all of your URLs are relative, they can then crawl and index your entire domain under whatever version they came in on. Whereas if you have absolute links coded, even if Google enters your site on the www version and that version resolves, as soon as they crawl to a link that you’ve coded with the full, non-www URL, Google is not going to assume that all of the other pages and internal link juice on your website live at the www version. That really cuts down on different versions of each page of your website. If you have relative URLs throughout and you haven’t fixed this problem, you basically have four different websites.

Again, it’s not always a huge issue. Duplicate content, it’s not ideal. However, Google has gotten pretty good at figuring out what the real version of your website is.

You do want to think about inbound linking when you’re thinking about this. If you have basically four different versions of any URL that anybody could just copy and paste when they want to link to you or when they want to share something that you’ve built, you’re diluting your inbound links across four versions, which is not great. You basically would have to build four times as many links in order to get the same authority. So that’s one reason.

3) Crawl Budget

The other reason why it’s pretty important not to do this is crawl budget.

When we talk about crawl budget, basically what that is, is that every time Google crawls your website, there is a finite depth to which they will crawl. There’s a finite number of URLs that they will crawl and then they decide, “Okay, I’m done.” That’s based on a few different things. Your site authority is one of them. Your actual PageRank, not toolbar PageRank, but how good Google actually thinks your website is, is a big part of that. But also how complex your site is, how often it’s updated, things like that are also going to contribute to how often and how deep Google is going to crawl your site.

It’s important to remember when we think about crawl budget that, for Google, crawl budget costs actual dollars. One of Google’s biggest expenditures as a company is the money and the bandwidth it takes to crawl and index the Web. All of that energy that’s going into crawling and indexing the Web, that lives on servers. That bandwidth comes from servers, and that means that using bandwidth costs Google actual real dollars.

So Google is incentivized to crawl as efficiently as possible, because when they crawl inefficiently, it costs them money. If your site is not efficient to crawl, Google is going to save itself some money by crawling it less frequently and crawling a smaller number of pages per crawl. That can mean that if you have a site that’s updated frequently, your site may not be updating in the index as frequently as you’re updating it. It may also mean that Google, while it’s crawling and indexing, may be crawling and indexing a version of your website that isn’t the version that you really want it to crawl and index.

So having four different versions of your website, all of which are completely crawlable to the last page, because you’ve got relative URLs and you haven’t fixed this duplicate content problem, means that Google has to spend four times as much money in order to really crawl and understand your website. Over time they’re going to do that less and less frequently, especially if you don’t have a really high authority website. If you’re a small website, if you’re just starting out, if you’ve only got a medium number of inbound links, over time you’re going to see your crawl rate and frequency impacted, and that’s bad. We don’t want that. We want Google to come back all the time, see all our pages. They’re beautiful. Put them up in the index. Rank them well. That’s what we want. So that’s what we should do.

There are a couple of ways to fix your relative versus absolute URLs problem

1) Fix what is happening on the server side of your website

You have to make sure that you are forcing all of these different versions of your domain to resolve to one version of your domain. For me, I’m pretty agnostic as to which version you pick. You should probably already have a pretty good idea of which version of your website is the real version, whether that’s www, non-www, HTTPS, or HTTP. From my view, what’s most important is that all four of these versions resolve to one version.

From an SEO standpoint, there is evidence to suggest, and Google has certainly said, that HTTPS is a little bit better than HTTP. From a URL length perspective, I like to not have the www. in there because it doesn’t really do anything. It just makes your URLs four characters longer. If you don’t know which one to pick, I would pick HTTPS with no www. But whichever one you pick, what’s really most important is that all of them resolve to one version. You can do that on the server side, and that’s usually pretty easy for your dev team to fix once you tell them that it needs to happen.
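
Once those server-side redirects are in place, here’s a hedged verification sketch. It checks the redirects from the outside rather than showing the server configuration itself, and the hostnames are illustrative:

```python
# Verify that every protocol/hostname variant 301s to the version you picked.
import requests

CANONICAL = "https://example.com/"
for variant in ["http://example.com/",
                "http://www.example.com/",
                "https://www.example.com/"]:
    resp = requests.get(variant, allow_redirects=False, timeout=10)
    print(variant, resp.status_code, resp.headers.get("Location"))
    # expect: 301 and a Location header pointing at CANONICAL
```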

2) Fix your internal links

Great. So you fixed it on your server side. Now you need to fix your internal links, and you need to recode them from being relative to being absolute. This is something that your dev team is not going to want to do because it is time consuming and, from a web dev perspective, not that important. However, you should use resources like this Whiteboard Friday to explain to them that, from an SEO perspective, both from a scraper risk and from a duplicate content standpoint, having those absolute URLs is a high priority and something that should get done.

You’ll need to fix those, especially in your navigational elements. But once you’ve got your nav fixed, also pull out your database or run a Screaming Frog crawl or however you want to discover internal links that aren’t part of your nav, and make sure you’re updating those to be absolute as well.
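
If you’d rather script that discovery step, a minimal sketch for auditing a single page (the page URL is illustrative) might look like this:

```python
# List every internal link on a page that is coded as a relative URL.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlsplit

page = "https://example.com/"
soup = BeautifulSoup(requests.get(page, timeout=10).text, "html.parser")

for a in soup.find_all("a", href=True):
    href = a["href"]
    # No netloc means the href doesn't specify a host, i.e. it's relative.
    if not urlsplit(href).netloc and not href.startswith(("#", "mailto:", "tel:")):
        print("relative link:", href)
```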

Then you’ll do some education with everybody who touches your website saying, “Hey, when you link internally, make sure you’re using the absolute URL and make sure it’s in our preferred format,” because that’s really going to give you the most bang for your buck per internal link. So do some education. Fix your internal links.

Sometimes your dev team is going to say, “No, we can’t do that. We’re not going to recode the whole nav. It’s not a good use of our time,” and sometimes they are right. The dev team has more important things to do. That’s okay.

3) Canonicalize it!

If you can’t get your internal links fixed or if they’re not going to get fixed anytime in the near future, a stopgap or a Band-Aid that you can kind of put on this problem is to canonicalize all of your pages. As you’re changing your server to force all of these different versions of your domain to resolve to one, at the same time you should be implementing the canonical tag on all of the pages of your website so they self-canonicalize. On every page, you have a canonical tag saying, “This page right here, the one you’re already on, is the canonical version of this page.” Or if there’s another page that’s the canonical version, then obviously you point to that instead.

But having each page self-canonicalize will mitigate both the risk of duplicate content internally and some of the risk posed by scrapers, because if they are scraping your website and slapping it up somewhere else, those canonical tags will often stay in place, and that lets Google know it’s not the real version of the website.

In conclusion, relative links, not as good. Absolute links, those are the way to go. Make sure that you’re fixing these very common domain level duplicate content problems. If your dev team tries to tell you that they don’t want to do this, just tell them I sent you. Thanks guys.

In this week's Whiteboard Friday, we are going to be going through some different ways you can track down old URLs after a site migration. These tactics can be incredibly useful for new clients that have just performed a redesign with less than ideal preparation.

I'll be presenting eight ways for you to track down these old URLs, but I would love to see some of your own methods in the comments below. Happy Friday everyone!

Video Transcription

Greetings and salutations SEOmoz fans. My name is Michael King. I'm the Director of Inbound Marketing at iAcquire. I'm also iPullRank on the SEOmoz boards and on Twitter.

So today what we're going to talk about is eight ways to figure out old URLs after a failed site migration. I know you have this problem. You get a new client, they just redesigned, and you have no idea what the old URLs are. They didn't do 301 redirects. They have no idea what the social numbers are anymore, and you have no idea where to start. Well, I'm going to show you how.

Now one of the first tactics you want to use is the Wayback Machine. You just put the site in there, the URL, the domain, what have you, and see what it has in its index. Once you get that, you can easily pull those URLs off the archived site through its links using Scraper for Chrome or whatever tool you want to use. You can actually pull down the code and pull them out using Find and Replace, whatever you want to do. That's just one of the tactics that we're using.

A lot of times people will also not change or update their XML sitemap. So you can just download that XML sitemap and open it in Excel, which lays it out in a table. You can take that first column, copy and paste it into a text file, open it in Screaming Frog, and crawl it in list mode to see if those URLs still exist. Anything that's a 404 is a URL you can use, and you can easily map those to the new URLs on that site.
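
If you'd rather script it, here's a rough sketch of the same tactic (the sitemap URL is illustrative): parse the old sitemap, request each URL, and keep the ones that now 404.

```python
# Pull the old sitemap, request each URL, and collect the ones that now 404.
import requests
import xml.etree.ElementTree as ET

SITEMAP = "https://example.com/sitemap.xml"  # illustrative
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

root = ET.fromstring(requests.get(SITEMAP, timeout=10).content)
old_urls = [loc.text for loc in root.findall(".//sm:loc", NS)]

for url in old_urls:
    # HEAD keeps it light; some servers answer HEAD oddly, so fall back to GET if needed.
    status = requests.head(url, allow_redirects=False, timeout=10).status_code
    if status == 404:
        print("needs a redirect:", url)
```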

You also want to use your backlink profile. When I say that, I don't want you to use just one tool; I want you to use as many tools as possible. So definitely start with Open Site Explorer. Also use Majestic, Ahrefs, whatever you want to use, and collect as much link data as possible. Webmaster Tools also has your links, so use those as well. Then crawl all the targets of those links and make sure those pages still exist. The 404s, again, are old URLs that you can then redirect to new pages.

Then you also want to check the 404s from Google Webmaster Tools and map those pages to new pages as well. Then you can also use analytics. So pull your historic analytics from before the site redesign and find all those URLs and see which ones are still in existence. Again, go back to Screaming Frog with list mode and make sure that they're 404ing or 200ing. The ones that are 200, you don't have to worry about. The ones that are 404s are the ones that you need to remap.

Then you can also use your CMS change log. So, for example, when you make a change to a URL in WordPress, there's a record of that, and you can actually pull those URLs out and use them for mapping as well.

Then, for those of you that are a little more adventurous, you can go into your log files and see what URLs were driving traffic before the migration. It's the same thing you would do with the analytics, but from a server-side standpoint rather than just your click-path data.

And also social media. People share these URLs, and any shared URL has equity beyond just link equity. So you definitely want to make sure that you're pushing those social share numbers to the right URLs that you're mapping towards, and I wrote a post on Search Engine Watch about how you can do that. You can use the Facebook recommendations tool. It's not really a tool; it's a demo for a widget that goes on your site. But essentially, you can go through this tool and put in the domain name, and it's going to give you all the shared URLs, all the shared content. The way it comes out of the box, it's 300 pixels tall, but if you expand that to 1,000 pixels, you'll see the top 20 pieces of content that were shared. So you can really easily identify popular URLs that you can then redirect.

You can also use Topsy the same way. If people have tweeted these URLs, you can just put that domain name in there, it will search for them, and it will give you all the URLs that Topsy has indexed. You can use Social Mention, or any social listening tool, the same way. And then there are social bookmarks, things like Digg, Delicious, and such: look and see what people have actually shared and bookmarked for your site.

So that's a quick one. Hope you guys found that useful, and I'd love to know how you guys have found this to be worthwhile. So holler at me in the comments down there, and thanks very much. Peace.

I've got good news. Today marks a new Linkscape index (only 14 days after our previous index rollout) which means new data in Open Site Explorer, the Mozbar, the Web App and the Moz API. It's also more than 60% larger than our previous update in early January and shows better correlations with rankings in Google.com; I'm pretty excited.

For the past couple years, SEOmoz has focused on surfacing quality links and high quality, well-correlated-with-rankings metrics to help provide a large sample of the web's link graph. However, we've heard feedback that this isn't enough and may not be exactly what many who research links are seeking (or at least, it's not fulfilling all the functions you need). We're responding by moving, starting with today's launch, to a new, consistently larger link index.

Today's data is different from how we've done Linkscape index updates in the past. Rather than take only those pages we've crawled in the past 3-4 weeks, we're using all of the pages we've found since October 2011, replacing anything that's been more recently updated/crawled with a newer version and producing an index more like what you'd see from Google or Bing (where "fresh" content gets recrawled more frequently and static content is crawled/updated less often). This new index format is something that will let us expose a much larger section of the web ongoing, and reduces the redundancies of crawling web pages that haven't been updated in months or years.

Below are two graphs showing the last year of Linkscape updates and their respective sizes in terms of individual URLs (at top) and root domains (at bottom):

As you can see, this latest index is considerably larger than anything we've produced recently. We had some success growing URL counts over the summer, but this actually lowered our domain diversity (and hurt some correlation numbers of metrics) so we rolled back to a previous index format until now.

This means you'll see more links pointing to your sites (on average, at least) and to those of your competitors. Our metrics' correlations are slightly increased (I hope to show off more detailed data on that in a future post with help from our data scientist, Matt), which was something we worried about with a much larger index, but we believe we've managed to retain mostly quality stuff (though I would expect there'll be more "junk" in this index than usual). The oldest crawled URLs included here were seen 82 days ago, and the newest stuff is as fresh as the New Year.

Despite this mix of old + new, the percentage of "fresh" material is actually quite high. A histogram of the distribution of URLs from the various timeframes going into this new index shows that the most recent portion, crawled in the last two-thirds of December, represents a solid majority.

Let's take a look at the raw stats for index 49:

58,316,673,893 (58 billion) URLs

639,806,598 (639 million) Subdomains

135,392,083 (135 million) Root Domains

617,554,278,005 (617 billion) Links

Followed vs. Nofollowed

2.10% of all links found were nofollowed

56.50% of nofollowed links are internal

43.50% are external

Rel Canonical – 11.79% of all pages now employ a rel=canonical tag

The average page has 87.36 links on it

73.06 internal links on average

14.29 external links on average

In addition to this good news, I have some potentially more hilarious and/or tragic stuff to share. I've made a deal with our Linkscape engineering group that if they release an index with 100+ billion URLs by March 30th (just 72 days away), I will shave/grow my facial hair to whatever style they collectively approve*. Thus, you may be seeing a Whiteboard Friday with a beardless or otherwise peculiar-looking presenter in the early Spring.

As always, feedback is welcome and appreciated on this new index. If some of the pages or links are looking funny, please let us know.

If you read my previous Link Week entry, Why Canonicalization Matters From A Linking Perspective, you know that canonicalization – the process of selecting and using one specific URL for each page on your website for indexing in search – is vitally important for consolidating potential…