A search for Dmoz now brings up PhilC's article in the no. 1 position. Things have been going downhill since they set up the 301 to the www. Has Google really had enough of directories? Or has AOL seriously messed something up?

61 Comments

... and the reason that it doesn't rank is that the ODP added a set of site-wide 301 redirects from dmoz.(org|com) and from www.dmoz.com, all pointing to www.dmoz.org, just a few weeks ago. I expect things to be in turmoil until Google completely recalculates PR and backlinks. I already see some interesting effects, and some patterns, here and there. I guess this will take several, to many, months to right itself: http://www.webmasterworld.com/google/3437548.htm

Too funny to see the number of so-called SEOs that don't understand simple domain canonicalisation problems and mistake it for a ban. This is the real reason:
http://www.google.com/search?num=100&q=site%3Anewhoo.com+-inurl:www
http://www.google.com/search?num=100&q=site%3Awww.newhoo.com
http://www.google.com/search?num=100&q=site%3Admoz.com+-inurl:www
http://www.google.com/search?num=100&q=site%3Awww.dmoz.com
http://www.google.com/search?num=100&q=site%3Acore-n02.dmoz.aol.com+-inurl:chefmoz
http://www.google.com/search?num=100&q=site%3Admoz.org+-inurl:www
http://www.google.com/search?num=100&q=site%3Awww.dmoz.org
Beware of rogue SPACES that Sphinn has inserted into some of those search URLs.
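For anyone who would rather generate those site: searches than copy them out of a comment (and risk the rogue spaces), here is a minimal Python sketch. The helper name `site_search_url` and the `HOSTS` list are made up for illustration; the only thing taken from the post above is the query pattern. Note that `urlencode` also percent-encodes the colon in `-inurl:www`, which is equivalent to the hand-written URLs.

```python
from urllib.parse import urlencode


def site_search_url(host, exclude_www=False, num=100):
    """Build a Google site: search URL for one hostname (hypothetical helper)."""
    query = f"site:{host}"
    if exclude_www:
        # -inurl:www hides the www variant when checking a bare domain
        query += " -inurl:www"
    # urlencode escapes ':' as %3A and the space as '+'
    return "http://www.google.com/search?" + urlencode({"num": num, "q": query})


# Illustrative subset of the domains mentioned in the thread
HOSTS = ["newhoo.com", "dmoz.com", "dmoz.org"]

for host in HOSTS:
    print(site_search_url(host, exclude_www=True))  # bare domain only
    print(site_search_url("www." + host))           # www variant
```

Comparing the result counts for the bare and www variants of each domain is what shows which hostname the pages are actually indexed under.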

G1smd, you do not know what you are talking about. The 301 from the dmoz homepage was recognized weeks ago, and there is no reason why this would cause the homepage to get deindexed. Something is not right. BTW, I never said they were banned, at least not in this post. My blog post was hyped up for Digg and I never submitted it here.

*** you do not know what you are talking about *** Hate to disagree. You obviously did not try the searches posted above, or you failed to understand them. The post at "Digg" includes the word "ban", and is wrong.

Still, please check the search strings posted above and all will become very clear. *** Didn't I just say the blog post that was not submitted here by myself was hyped up for digg? *** Ah, that's the other Sphinn thread.... http://sphinn.com/story/6606

I have checked your URLs. All I have said in this thread is that the homepage does not rank number 1. The blog post was Digg bait that, despite being more popular than any other post this weekend, got buried.

Totally off-topic, and I haven't looked at the Digg post (I hate Digg), but... Forgive the stupidity on my part, but why on Earth would you post a fabrication on Digg? Is that the New Thing - posting falsehoods on Social Sites? I can see if it was a storytelling post - but if it's about a real, true event - why the lies?

"why the lies?" DLPerry, for exactly the same reason why every media organisation lies. If you don't get that, you shouldn't be in marketing. Or even waste your time at a site that deals with marketing. You didn't think the "Boy eats own head" headline was true, did you? ;)

<sigh> I guess I was hoping for something a bit more positive. Oh well - it seems the familiar old "because everyone else does it" and "because that's the way it is" mentality is still the way to go.</sigh> I must've missed the "Boy eats own head" headline - sounds like a tabloid piece - which I personally would never believe - mainly because of the sensationalist tactics they employ, and the Liar Liar Pants on Fire reputation they have earned. And you know - I do keep seeing posts complaining about how SEO and online marketing have been getting a bad reputation. Maybe this lie tactic has something to do with that? Just my old and idealistic .02 :)

Cross linking a few items. Also on Sphinn here: http://sphinn.com/story/6606 http://sphinn.com/story/6644 From what I can see, the home page is NOT in Google. Nor is Dmoz ranking for things it totally should be ranking for: http://www.google.com/search?q=open%20directory%20project http://www.google.com/search?hl=en&q=odp Like some others, I wondered if it had been hit by the Great Directory Ban Of Sept. 2007? http://searchengineland.com/070920-085657.php http://sphinn.com/story/4415 http://www.seomoz.org/blog/what-makes-a-good-web-directory-and-why-google-penalized-dozens-of-bad-ones Nope, internal pages aren't missing. But the home page is -- and why seems a mystery to me. I don't see Dmoz itself having any blocks on it. Other thoughts?

A boy ate his own head?? I really think it's just a dupe content issue brought up by all the redirects they've set up. Further complicating the issue is the fact that while all of their internal home page links are pointing to http://dmoz.org, they're all being redirected to http://www.dmoz.org. They should update those links ASAP. You don't use a 301 to fix a link when you can just change the link without causing any harm, especially when it's on every page of your domain.

It is a very simple domain canonicalisation problem, just like the many hundreds that I have written about many times before. It shouldn't take you very long to discover which URL the Root Page, and all the Top Level categories, are indexed under. I'll also guess that it won't take Google's indexing system much more than a month to realise what is going on and fix the problem. Heck, in one of the posts above, I even listed some Google searches that totally show the reason for the problem. Maybe I give people far too much credit in being able to understand site: searches and what they show you.

Hey all, I dug into this a little bit with the help of a couple crawl folks. It looks like when Googlebot tried to fetch http://www.dmoz.org/, we got a 301 redirect back to http://www.dmoz.org/ . It looks like that self-loop has been going on for several days. We were last able to fetch the root page successfully on Sept. 10th, but from that point on DMOZ was returning these 301-to-itself pages, and after a few days Googlebot gave up on trying to fetch the url. It looks like the rest of the site is fine, so I suspect that if DMOZ gets 301/redirects for their root page sorted out on their webserver, we'll recrawl and index the page pretty quickly. DLPerry, keep the faith. If you read back over the comments, several people (g1smd, jdevalk) suggested reasonable explanations instead of going right to "ZOMG! Google hatez da Moz!?!" :)
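To make the "301 back to itself" case concrete, here is a rough Python sketch of how a crawler-style check could spot it. It works on a plain lookup table rather than real HTTP requests, and the function name `trace_redirects`, the `max_hops` limit and the sample tables are all assumptions for illustration, not anything Googlebot or DMOZ is known to use.

```python
def trace_redirects(start_url, redirects, max_hops=10):
    """Follow 301 targets from a lookup table and report how the chain ends.

    `redirects` maps a URL to the URL its 301 points at. URLs missing from
    the table are treated as answering directly with content.
    """
    chain = [start_url]
    seen = {start_url}
    url = start_url
    while url in redirects:
        url = redirects[url]
        chain.append(url)
        if url in seen:
            # We have been here before: the redirects never resolve
            return chain, "loop"
        seen.add(url)
        if len(chain) > max_hops:
            return chain, "too many hops"
    return chain, "ok"


# Intended setup: aliases point at the canonical root page
healthy = {
    "http://dmoz.org/": "http://www.dmoz.org/",
    "http://www.dmoz.com/": "http://www.dmoz.org/",
}
# The reported fault: the canonical root page 301s to itself
broken = dict(healthy)
broken["http://www.dmoz.org/"] = "http://www.dmoz.org/"

print(trace_redirects("http://dmoz.org/", healthy))
print(trace_redirects("http://www.dmoz.org/", broken))
```

In the healthy table the chain ends at the canonical URL after one hop; in the broken table the root page is reported as a loop on the very first hop, which is the situation described above.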

JohnWeb, just from a cursory glance (i.e. take this with a grain of salt), it looked like we might have been able to fetch a valid page earlier today, so they might have already made a change. It's always possible that they were doing something specific for Google's IP range, but my hope is that folks on the DMOZ side have figured it out themselves and this will sort itself out without too much extra trouble.

That makes sense. 16 hours ago DMOZ (if that's his/her real name) said they were looking into it. I've seen dozens of sites in GWHG with their homepage removed and would never have thought of "Googlebot gave up" as the reason. Live and learn, I guess.

I was working with a client whose home page poofed. In his case, he had one URL 301 redirecting to the home page. He requested that the redirecting URL be removed. A few days later, the removed URL was showing 11K backlinks in Webmaster Tools, as if it was the home page. And then the home page was nowhere to be found in the SERPs. So I told him to remove the redirect and have the URL issue a 404 instead, and then I told him to request a URL inclusion. A few days after that, his home page came back. Can't really say what fixed the problem though, but that was the first time I saw the home page just disappear from the SERPs.

Thanks, Matt. Wow -- lesson learned, never 301 back to the same page. Not that I would. And remember -- never type Google into Google or you'll break the internet: http://googlified.com/2007if-you-type-google-into-google/

I am not sure how an infinite loop could possibly have happened. I have looked at the canonical (www) root page several times per day in the last few weeks, and it did not ever redirect for me. There were redirects set up for (www.)newhoo.com and for (www.)dmoz.com and for dmoz.org. All of those pointed to www.dmoz.org/ . No-one would have been able to access the Root Index page at all had it been redirecting. Users would have been simply presented with a "Redirection Limit Exceeded" error message from their browser. Such a redirect would have been noticed long ago. There have been no such reports.
Matt. Are you sure you didn't misread the logs and were actually looking at requests for www.dmoz.com/ being redirected to www.dmoz.org/ perhaps?
As you know, the ODP made some (hardware) infrastructure changes almost a month ago, and as a part of that, a non-canonical URL was accidentally exposed for indexing. I gave some very big clues in the search strings above. One of them returns the root page and 105 000 categories. The Root and the whole of the directory Top Levels are now all fully indexed under that alternative domain. The URL is that of one of the load-balancing servers.
See: http://www.google.com/search?num=100&q=site%3Acore-n02.dmoz.aol.com+-inurl:chefmoz
and that is an error. Google has been picking up those URLs since at least the beginning of August. AOL server techs are well aware of the issues and are working on various fixes. It just goes to show that when doing a large amount of work, server upgrades, and implementing various load-balancing changes, as well as starting to sort out various domain canonicalisation issues, something can easily go wrong if you do things in the wrong order or miss out a step. One issue that is already being addressed is that some internal links (mostly on informational pages) are hard-coded to point to dmoz.org (non-www) URLs and those are now all being edited to point to the www version instead. That will be ongoing for many weeks.
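As a rough illustration of the canonicalisation rule described here (aliases 301 to www.dmoz.org, and the canonical host itself never redirects), a Python sketch might look like the following. The function name `redirect_target` and the `ALIAS_HOSTS` set are assumptions built from the hostnames listed in this post; how the load-balancer hostname should be handled is not stated, so it is deliberately left out.

```python
from urllib.parse import urlsplit, urlunsplit

CANONICAL_HOST = "www.dmoz.org"
# Aliases named in the post; the real server configuration is not known
ALIAS_HOSTS = {
    "dmoz.org",
    "dmoz.com",
    "www.dmoz.com",
    "newhoo.com",
    "www.newhoo.com",
}


def redirect_target(url):
    """Return the URL a 301 should point at, or None to serve the page as-is."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host == CANONICAL_HOST:
        # Already canonical: answering with a 301 here would create a self-loop
        return None
    if host in ALIAS_HOSTS:
        # Keep path and query so deep category pages map one-to-one
        return urlunsplit(
            (parts.scheme, CANONICAL_HOST, parts.path or "/", parts.query, parts.fragment)
        )
    # Unknown host: this sketch makes no decision about it
    return None


print(redirect_target("http://dmoz.org/Arts/"))
print(redirect_target("http://www.dmoz.org/"))
```

The point of the first check is the ordering: the canonical host is tested before any alias rule, so a pattern that is too broad cannot catch www.dmoz.org itself.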

g1smd, another great find again. Canonical URLs & Google... so many miss this fix straight off that they spend years wondering why Google doesn't give them the love they think they should get. Keep up the good work and remember Matt C works for Google.com, so not all he posts is 100% factual. Peace!

g1smd's latest post looks strange in my Firefox. I have tried to gather it. g1smd wrote: I am not sure how an infinite loop could possibly have happened. I have looked at the canonical (www) root page several times per day in the last few weeks, and it did not ever redirect for me. There were redirects set up for (www.)newhoo.com and for (www.)dmoz.com and for dmoz.org. All of those pointed to www.dmoz.org/ . No-one would have been able to access the Root Index page at all had it been redirecting. Users would have been simply presented with a "Redirection Limit Exceeded" error message from their browser. Such a redirect would have been noticed long ago. There have been no such reports. Matt. Are you sure you didn't misread the logs and were actually looking at requests for www.dmoz.com/ being redirected to www.dmoz.org/ perhaps? As you know, the ODP made some (hardware) infrastructure changes almost a month ago, and as a part of that, a non-canonical URL was accidentally exposed for indexing. I gave some very big clues in the search strings above. One of them returns the root page and 105 000 categories. The Root and the whole of the directory Top Levels are now all fully indexed under that alternative domain. The URL is that of one of the load-balancing servers. See: http://www.google.com/search?num=100&q=site%3Acore-n02.dmoz.aol.com+-inurl:chefmoz and that is an error. Google has been picking up those URLs since at least the beginning of August. AOL server techs are well aware of the issues and are working on various fixes. It just goes to show that when doing a large amount of work, server upgrades, and implementing various load-balancing changes, as well as starting to sort out various domain canonicalisation issues, something can easily go wrong if you do things in the wrong order or miss out a step.
One issue that is already being addressed is that some internal links (mostly on informational pages) are hard-coded to point to dmoz.org (non-www) URLs and those are now all being edited to point to the www version instead. That will be ongoing for many weeks.

So essentially IF Matts post on what Googlebot encountered was not Googlebot getting it totally wrong, then DMOZ has to have been cloaking it to serve it only to Google (and possibly other search engines). Otherwise tons more people would have noticed the redirecting issue because it effectively closes down that page - Ive done it before when testing :) DMOZ/AOL, please weigh in on this. Itd be interesting to find out if googlebot just royally stuffed up a 301 or if it really was an incorrectly implemented cloaked 301.

What would happen if the Webmaster Tools preferences were set to "non-www" and the site then had the non-www to www redirect implemented a few months later, without changing the Webmaster Tools setting? Whatever was going on with (www.)?(dmoz|newhoo).(com|org), this search is the key http://www.google.com/search?num=100&q=site%3Acore-n02.dmoz.aol.com+-inurl:chefmoz to unravelling the overall effect.

"What would happen if the Webmaster Tools preferences were set to "non-www" and the site then had the non-www to www redirect implemented a few months later, without changing the Webmaster Tools setting?" Interesting question. I don't think cloaking 301s like this is standard practice, so it'll be interesting to find out if this was indeed a Googlebot bug or not....

So who is in control of 301s at DMOZ, the Metas? (Sorry, I couldn't resist.) The 301 loop sounds about on par for how DMOZ is run in general. I continue to stand amazed that no one touches AOL's unwanted stepchild, as she stands broken and abused by the "system". For those that speculated that Google is anti-DMOZ, it's far from the truth, and in fact, on the contrary, they are "in bed" with DMOZ... and that's the only divorce I will openly support. Matt's on this thread, so perhaps he can shed some light as to why Google would partner with DMOZ amidst all of the negative PR in the industry, and the fact that it is WAY outdated because no one is really working to keep things in order. I am just amazed that they would consider it a quality resource given that over the past few years, thousands of sites have not been added, or even worse, removed (not speculation, first-hand info here - I am a former editor).

JohnWeb, just to be 100% clear, "Googlebot gave up" is not the root reason. I was just introducing a bit of levity. The real reason was of course the infinite redirect loop that lasted for days. If I 301 page A to point back to page A and do that infinite loop for a week (or more), it's probably a bad user experience to return that infinite loop to users. But if the loop stops, then our system is set up to get the page again fairly quickly. iBrian, well said. Danny, it's pretty rare to see a site do an infinite redirect loop like that, but it does happen. g1smd, I'm pretty sure that I was looking at www.dmoz.org, not dmoz.com, but I was just doing a quick/lightweight check, so I won't claim to be 100% positive.

Matt: I see no evidence of any sort of loop being created for www.dmoz.org at any time in the last few months, so I am quite perplexed. I have been keeping a careful eye out for redirect and canonicalisation issues as the changes have been made, as you might imagine. I do see Google gobbling up alternative URLs for one of the load-balancing servers for the last couple of months though.

g1smd, I only did a cursory dig and that's what it looked like at that point. I've been asking about it more, and it looks like dmoz's 301 might have interacted badly with a heuristic on Google's side. I'm still keeping an eye on it and I'll bug the crawl team until everything looks good.

Out of curiosity, Ive managed to create a page that will return a 301 but not redirect so that the browser will show the content of the page. http://www.jlh-design.com/2007/09/googlebot-gave-up/#comment-5301 Could DMOZ screw up their code this much? Id imagine its possible. Curious enogh though a browser shows the page content (appearing normal to a user) the online header checker I used makes an "assumption" that the page should redirect to itself.This may not have been the exact mechanism for the DMOZ page dropping out of Google, but at least I can replicate it somewhat. Coming up next week on Myth Busters, Jamie and JohnWeb...

JLH, by "URL inclusion" I mean going into Webmaster Tools, going to the "Removed URLs" console and clicking on "reinclude" or something like that. I haven't tried it myself so I'm not sure what the UI is called exactly. "part of an index recognizing, adjusting and updating in real time" You really believe the people here are going to fall for that? What's amusing to me is the blog post is tagged "Truth" :)

Well, about 80 000 pages from www.dmoz.org reappeared overnight (UK time) in Google's index, starting about the time that Matt Cutts posted here... so how much more realtime do you wanna get with this stuff?

Matt, I think the issue here is bigger than what initially appears. An incredible number of pages whose URLs have been changed to http://www.dmoz.org/... from http://dmoz.org/... during their process of canonicalisation are currently not in Google's cache, and their links are not being considered by Google. As an example: a Google search for "Academy of Canadian Cinema and Television" site:www.dmoz.com brings no results, although these words are clearly on the page: http://www.dmoz.org/Regional/North_America/Canada/Arts_and_Entertainment/ So at the moment Google does not see this link from DMOZ, together with millions of other links in other pages on DMOZ. This is affecting millions of websites and of course an incredible amount of Google search results, until the canonicalised http://www.dmoz.org/... pages are indexed and cached in Google. I think that the engineers at Google should have a proper look into this, trying to index all the DMOZ pages with new URLs as soon as possible. In fact a huge number of searches, and even tests at Google on new algorithms, might be altered by this effect.

I would give them a couple of weeks or more to discover everything. At one page per second, they can spider 86 400 pages per day. The ODP has about 20 times that number of pages (also counting category descriptions, guidelines, FAQ pages, profiles, etc). See also: http://www.google.com/search?num=100&q=site%3Acore-n02.dmoz.aol.com+-inurl:chefmoz
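The arithmetic behind that estimate can be checked with a few lines of Python. The one-page-per-second rate and the "about 20 times" multiplier come straight from the post; the function name `days_to_crawl` is just for illustration, and the real crawl rate is of course unknown.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86 400 seconds in a day


def days_to_crawl(total_pages, pages_per_second=1.0):
    """Days needed to fetch every page once at a steady crawl rate."""
    pages_per_day = pages_per_second * SECONDS_PER_DAY
    return total_pages / pages_per_day


pages_per_day = 1 * SECONDS_PER_DAY  # one page per second
odp_pages = 20 * pages_per_day       # "about 20 times that" per the post

print(pages_per_day)             # pages fetched per day at that rate
print(odp_pages)                 # rough size of the ODP under that assumption
print(days_to_crawl(odp_pages))  # days for one full pass
```

At that rate one full pass takes about 20 days, which fits the "couple of weeks or more" guess.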

The ODP content was previously available through more than 30 different domains and direct IP addresses. These had been hosted at various times by Netscape, Mozilla, and AOL. A few months ago, along with some necessary hardware changes and upgrades, everything was reconfigured so that just www.dmoz.org became the canonical domain. At first, there were a few glitches showing in the listings within Google SERPs. Several domains were missed in the canonicalisation fixes, and were rapidly indexed in preference to www.dmoz.org by Google. Once those holes were plugged, Google began to slowly re-index the other non-canonical versions of the directory. Some of the URLs dropped into the Supplemental Index, but most of them were de-indexed. After just a few months, there are just a few hundred incorrect URLs showing up. Most of the problem URLs have now been completely de-indexed. The main listings for www.dmoz.org show almost one million URLs indexed in Google when using the site:www.dmoz.org search. The job is now just about complete.

Some of the comments above are now displayed in the wrong order after being edited to remove some formatting issues. The correct order can be deduced from the post number (behind the # link on each post) rather than from the post date.

Everything is now back on track. See that the Duplicate Content has fallen to almost zero URLs indexed:
http://www.google.com/search?num=100&q=site%3Anewhoo.com+-inurl:www
http://www.google.com/search?num=100&q=site%3Awww.newhoo.com
http://www.google.com/search?num=100&q=site%3Admoz.com+-inurl:www
http://www.google.com/search?num=100&q=site%3Awww.dmoz.com
http://www.google.com/search?num=100&q=site%3Acore.dmoz.aol.com
http://www.google.com/search?num=100&q=site%3Adirectory.mozilla.org
http://www.google.com/search?num=100&q=site%3A207.200.81.183
http://www.google.com/search?num=100&q=site%3A207.200.81.184
The Canonical Domain now has almost a million pages indexed:
http://www.google.com/search?num=100&q=site%3Awww.dmoz.org
Some Supplemental Results can hang around for a very long time:
http://www.google.com/search?num=100&q=site%3A207.200.81.154

Everything is now back on track for ODP site re-indexing. See that the Duplicate Content has fallen to almost zero URLs indexed:
http://www.google.com/search?num=100&q=site%3Anewhoo.com+-inurl:www
http://www.google.com/search?num=100&q=site%3Awww.newhoo.com
http://www.google.com/search?num=100&q=site%3Anewhoo.org+-inurl:www
http://www.google.com/search?num=100&q=site%3Awww.newhoo.org
http://www.google.com/search?num=100&q=site%3Admoz.com+-inurl:www
http://www.google.com/search?num=100&q=site%3Awww.dmoz.com
http://www.google.com/search?num=100&q=site%3Acore.dmoz.aol.com
http://www.google.com/search?num=100&q=site%3Adirectory.mozilla.org
http://www.google.com/search?num=100&q=site%3Agnuhoo.com+-inurl:www
http://www.google.com/search?num=100&q=site%3Awww.gnuhoo.com
http://www.google.com/search?num=100&q=site%3Agnuhoo.org+-inurl:www
http://www.google.com/search?num=100&q=site%3Awww.gnuhoo.org
http://www.google.com/search?num=100&q=site%3A207.200.81.135
http://www.google.com/search?num=100&q=site%3A207.200.81.139
http://www.google.com/search?num=100&q=site%3A207.200.81.140
http://www.google.com/search?num=100&q=site%3A207.200.81.175
http://www.google.com/search?num=100&q=site%3A207.200.81.183
http://www.google.com/search?num=100&q=site%3A207.200.81.184
http://www.google.com/search?num=100&q=site%3A207.126.111.202
http://www.google.com/search?num=100&q=site%3A207.126.111.231
The Canonical Domain now has almost a million pages indexed:
http://www.google.com/search?num=100&q=site%3Awww.dmoz.org
Some Supplemental Results can hang around for a very long time:
http://www.google.com/search?num=100&q=site%3A207.200.81.154
That IP address has been out of use for a long time. Including the direct IP address accesses, and various sub-domain and load-balancer URLs, there used to be ~34 ways to get to ODP content as hosted by Netscape/AOL servers. Now there is only one way.