Google News Archive Forum

Some people may not like this post, and criticism of it would not at all surprise me. However, I suggest that before reading it you ask yourself whether Google actually DESIRES the current fluctuating situation. Is it where they actually want to be? Is it what they want to project to webmasters and the public?

Against that background perhaps the following analysis and theory may fall into place more easily.

DATA ANALYSIS AND BACKGROUND

Last week I posted a message requesting members to sticky mail details of their own specific situations and sites with respect to the fluctuations.

After spending days analyzing, and watching the picture continue to change before my eyes, I eventually found a theory to hang my hat on. No doubt it will be challenged, but at least it currently fits the data bank I have (my own sites, plus a third party observation set I use, plus those that were submitted to me by the above).

Two general phenomena seem to be dominating the debate:

a) Index pages ranking lower than sub-pages for some sites on main keyword searches

b) Sites appearing much lower than they should on main keyword searches, yet ranking highly when the &filter=0 parameter is applied.

These problems are widespread and there is much confusion out there between the two (and some others).

The first has probably attracted most attention, no doubt because it is throwing up such obvious and glaring glitches in visible search returns (eg: contact pages appearing as the entry page to the site). The second is less visible to the searcher because it simply torpedoes the individual sites affected.

By the way, in case anyone is still unaware, the &filter=0 parameter switches off the filter which screens out duplicate content. Except the filter is doing more than that.... it is currently screening out many sites for no obvious reason (sites that are clearly clean and unique).
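For anyone who wants to compare the two result sets side by side, here is a minimal Python sketch that builds the default and unfiltered search URLs for the same query. The base URL and the helper name `search_urls` are assumptions made for illustration; the only thing taken from the discussion is the filter=0 parameter itself.

```python
from urllib.parse import urlencode

# Assumed base URL for illustration; swap in whichever host you are checking.
BASE = "https://www.google.com/search"

def search_urls(query):
    """Return (default_url, unfiltered_url) for the same query."""
    default_url = BASE + "?" + urlencode({"q": query})
    # filter=0 requests the results with the duplicate-content filter switched off
    unfiltered_url = BASE + "?" + urlencode({"q": query, "filter": "0"})
    return default_url, unfiltered_url

default_url, unfiltered_url = search_urls("blue widgets")
print(default_url)
print(unfiltered_url)
```

If a site only shows up via the second URL, it is probably in problem set (b).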

So why is all this happening? Is there a pattern, and is there a relationship between these two and the other problems?

Well at first I wrestled with all sorts of theories. Most were shot down because I could always find a site in the data set that didn't fit the particular proposition I had in mind. I checked the obvious stuff: onsite criteria, link patterns, WHOIS data... many affected sites were simply 'clean' on anyone's interpretation.

Throughout though, there was the one constant: none of the sites affected were old (eg: more than 2 years) or at least none had old LINK structures.

This seemed ridiculous. There would be no logic to Google treating newer sites in this manner and not older ones. It is hardly likely to check the date when crawling! But the above fact was still there.

I have been toying with all sorts of ideas to resolve it... and the only one that currently makes any sense is the following.

THE GOOGLE TWILIGHT ZONE

In addition to WebmasterWorld I read a number of search blogs and portals. On one of these (GoogleWatch) a guy called Daniel Brandt quotes GoogleGuy as stating: "That is, we wind down the crawl after fetching 2B+ URLs, and the URL in question might not have been in that set of documents".

Now, assuming that is true (and it's published on the website so I would imagine it isn't invented), or even partially true, all sorts of explanations emerge.

1) The 2BN+ Set: If you are in here, as most long standing and higher PR sites will be, it is likely to be business as usual. These sites will be treated as if they were crawled by the old GoogleBot DEEP crawler. They will be stable.

2) The Twilight Set: But what of the rest? It sounds like Google may only have partial data for these, because the crawlers 'wound down' before getting the full picture. Wouldn't THAT explain some of the above?

To answer this question we need to consider Google's crawling patterns. One assumes that they broadly crawl down from high PR sites. They could also crawl down from older sites, sites they know about and sites they know both exist and are stable. That too would make sense.

You can probably see where this is heading.

If your site or its link structure is relatively new, and/or say PR5 or below, you may well reside in the twilight zone. Google will not have all the data (or all the data AT ONCE) and you will be experiencing instability.

I have sites in my observation set that enter and exit both the problem sets above (a) and (b). It's as though Google is getting the requisite data for a period and then losing some of it again. As if the twilight zone is a temporary repository, perhaps populated and over-written by regular FreshBot data.

The data most affected by this is the link data (including anchor text) – it seems to retain the cache of the site itself and certain other data. This omission would also partially explain the predominance of sub-pages, as with the loss of this link data there is nothing to support the index above those sub-pages (Google is having to take each page on its totally stand-alone value).

IS IT A PROBLEM?

I also wonder whether Google sees all of this as a problem. I certainly do. Problem (a) is clearly visible to the searching public. They DON'T want to be presented with the links page for example when they enter a site! That is a poor search experience.

Do they see (b) as a problem? Again, I do. Sites are being filtered out when they have no duplicate content. Something isn't right. Google is omitting some outstanding sites, which will be noticeable in some cases.

The combination of (a) and (b) and perhaps other less well publicized glitches gives a clear impression of instability to anyone watching the SERPS closely (and that's a growing body of people). Together they are also disaffecting many webmasters who have slavishly followed their content-content-content philosophy. As I implied the other day, if following the Google content/link line gets them nowhere at all, they will seek other SEO avenues, which isn't good for Google in the long term.

WHY HAVE A TWILIGHT ZONE?

Some people speculate that there is a software flaw (the old 4 byte / 5 byte theory for URL IDs) and that consequently Google has a shortage of address space with which to store all the unique URL identifiers. Well... I guess that might explain why a temporary zone is appealing to Google. It could well be a device to get around that issue whilst it is being solved. Google though has denied this.
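To put rough numbers on the 4 byte / 5 byte theory, here is a sketch of the arithmetic. It assumes each URL gets one unsigned fixed-width integer ID, which is the speculation itself and not anything Google has confirmed. The function name is made up for illustration.

```python
def id_capacity(num_bytes):
    """Number of distinct IDs an unsigned integer of num_bytes bytes can hold."""
    return 2 ** (8 * num_bytes)

four_byte = id_capacity(4)   # 4,294,967,296 (about 4.29 billion)
five_byte = id_capacity(5)   # 1,099,511,627,776 (about 1.1 trillion)

print(f"4-byte IDs: {four_byte:,}")
print(f"5-byte IDs: {five_byte:,}")
```

If the theory were right, 4 byte IDs would top out a little past 4 billion URLs, while 5 byte IDs would leave plenty of headroom.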

However, it may equally be a symptom of the algorithmic and crawler changes we have seen recently. Ditching the old DeepBot and trying to cover the web with FreshBot was a fundamental shift. It is possible that for the time being Google has given up the chase of trying to index the WHOLE web... or at least FULLY index it at once. Possibly we are still in a transit position, with FreshBot still evolving to fully take on DeepBot responsibility.

If the latter is correct, then the problems above may disappear as Freshbot cranks up its activity (certainly (a)). In the future the 'wind down' may occur after 3BN, and then 4BN.... problem solved... assuming the twilight zone theory is correct.

At present though those newer (eg: 12 months+) links may be subject to ‘news’ status, and require refreshing periodically to be taken account of. When they are not fresh, the target site will struggle, and will display symptoms like sub-pages ranking higher than the index page. When they are fresh, they will recover for a time.

VERDICT?

Certainly evidence is mounting that we have a more temporary zone in play. Perhaps problem (b) is simply an overzealous filter (very overzealous indeed!). However, problem (a) and other issues suggest a range of instability that affects some sites and not others. Those affected all seem to have the right characteristics to support the theory: relatively new link structure and/or not high PR.

The question that many will no doubt ask is that, if this is correct…. how long will it last? Obviously I can’t answer that. All I have put forward is a proposition based upon a reasonable amount of data and information.

I must admit, I do struggle to find any other explanation for what is currently happening. Brett’s ‘algo tweak’ suggestion just doesn’t stack up against the instability, the site selection for that instability, or the non-application to longer established sites.

The above theory addresses all those, but as ever…. if anyone has a better idea, which accounts for all the symptoms I have covered (and stands up against a volume of test data), I’m all ears. Maybe GoogleGuy wishes to comment and offer a guiding hand through these turbulent times.

3. Filter of most optimized keyword? Maybe

4. Filter eliminating most used keyword on page? Maybe

I'd like to get some input on this, sticky mail would be fine too. A possible problem with one site that I'm promoting whose index page has dropped only for certain searches is there is a ton of text on the home page, 1406 words. To achieve a decent (yet still low) keyword density of 2.92%, we say our number one keyword 41 times.
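For anyone wanting to run the same check on their own pages, here is a minimal Python sketch of the density arithmetic. The tokenizing rule (letters, digits and apostrophes count as word characters) and the function name are assumptions; different density tools count words differently, so expect small discrepancies.

```python
import re

def keyword_density(text, keyword):
    """Return (occurrences, total_words, density_percent) for a single-word keyword."""
    words = re.findall(r"[A-Za-z0-9']+", text.lower())
    hits = sum(1 for w in words if w == keyword.lower())
    density = 100.0 * hits / len(words) if words else 0.0
    return hits, len(words), density

# Tiny made-up sample: 2 hits in 5 words
hits, total, density = keyword_density("Blue widgets and more widgets", "widgets")
print(hits, total, density)

# The figures from the post: 41 mentions in 1406 words
post_density = round(100.0 * 41 / 1406, 2)
print(post_density)  # 2.92
```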

Maybe we can either eliminate this as a possible cause of home page droppings by other examples or determine that this in fact is tripping a filter? TIA

I don't know if my story of my little piece of the internet means a hill of beans, but here goes:

What's happening to us isn't unstable. In fact, it's pretty damn predictable. It's been going on for 3 weeks now.

Google shuffles our index page out of the serps for our main keyword phrase 'Blue Widgets'. We retain the #3 spot for 'Blue' and the #12 spot for 'Widgets', but aren't in the serps at all for 'Blue Widgets'.

What happens is like clockwork. On Monday evening, 2 datacenters shuffle us out. Tuesday during the day, 2 more. Wednesday, it looks like all hope is lost, until we start reappearing in our rightful spot at the top of 'Blue Widgets'. By the time everything settles down, it's Friday (which is just in time for the weekend, when we don't really do any business).

Then, everything's peachy for the weekend. Monday is busy and the whole cycle starts again...this is the 3rd week in a row that is exactly the same...

"The owner of one forum has suggested the word is that it won't be sorted until the end of July."

I strongly agree with that statement and wouldn't be surprised if this fiasco lasts until labor day.

I don't see the logic of offering bad serps from May 5 to labor day in order to gain whatever benefit they are hoping to obtain. The serps are a joke. In all my kws there might be 2 relevant listings in the top 10. Millions of index pages are wiped out, and pages about blue widgets are showing up as #1 for red widgets when red widgets is only mentioned once or twice on the page. Of course, since the page is about blue widgets, it's nowhere to be found under a search for blue widgets.

Alltheweb is blowing them out of the water with far superior serps. Only the respectable, strong, well built, fed and cared for sites/pages are showing up in the top 20.

Many people here are trying to make sense of the algo. Although that's a noble pursuit, let's not forget that algo shakeups like this one are something Google tries once a year. Anyone remember the September algo change? It's almost been two months since they started this "thing" so I expect in another 6 months we'll see another major shake up of their algo. Having this "thing" sorted out by July is almost optimistic IMHO.

If I sound pessimistic, it's only because I've been watching this mess for almost 2 months. Many people here are hoping Google will "sort things out eventually." But I think we need to start preparing for the fact there is nothing to sort out and we'll see this serp junk until they decide to do the next annual algo shakeup. = Months from now. Maybe a year.

Probably the longest period of stability we have seen was from late 2001 up to Sep 02. But stability is never guaranteed, those who market strongly for other SE's as well have also seen numerous changes with a similar effect.

Through all of this the major sites with high PR haven't moved much at all, a place or two up or down is all I see between updates. For a category I monitor closely in the last 2 years the primary search term (used about 30,000 times per day) has seen no new sites hit page 1, no sites disappear and the biggest move for any site has been 4 positions. Dominic did nothing to these sites, they don't even know Esmeralda occurred, the first 10 results are identical to before the update.

I'm sure Google compares search results before it starts rolling the indices across data centers. I'm sure they have several thousand common terms they try out and an analysis program to warn them of drastic changes in SERPs. Their QA test is most likely based upon keeping something close to the status quo for the most popular terms. Logic dictates that if users liked the most popular historic results Google shouldn't rock the boat too much in any one update.

More specialist searches typically returning pages with PR5 or less have however moved considerably. They shifted a lot in Sep 02 and again with Dominic and some more with Esmeralda.

In most cases the returned results are now fairly relevant. Yes "spam" gets in, yes the occasional dead site shows up, yes the occasional OT site gets in because of manipulation, but all of these issues have always existed with Google in whatever type of results it has delivered.

The issue with index pages showing below interior pages is in my opinion simply because PR ranking weight has been lowered a little. Assume PR is not a whole number, before your index page was 5.54 and the contact us page was 5.14. However the contact us page has better on page ranking criteria than the index page for a particular term. Reduce the weight of PR, increase the weight of content and the contact page now appears first.
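Here is a toy illustration of that weighting argument in Python. The linear scoring formula, the weights and the on-page scores are all invented for the example; only the two PR values come from the paragraph above. Nobody outside Google knows how the factors are actually combined.

```python
def score(pr, on_page, pr_weight):
    """Toy linear blend of PageRank and an on-page relevance score."""
    return pr_weight * pr + (1.0 - pr_weight) * on_page

# PR values from the example above; on-page scores are made up (0-10 scale)
index_page = {"pr": 5.54, "on_page": 4.0}
contact_page = {"pr": 5.14, "on_page": 6.0}

for pr_weight in (0.9, 0.6):
    idx = score(index_page["pr"], index_page["on_page"], pr_weight)
    con = score(contact_page["pr"], contact_page["on_page"], pr_weight)
    winner = "index" if idx > con else "contact"
    print(f"PR weight {pr_weight}: index={idx:.3f} contact={con:.3f} -> {winner} page first")
```

With these made-up numbers the index page wins while PR carries 90% of the weight, and the contact page overtakes it once the PR weight drops to 60%, without either page's PR changing at all.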

Does anyone have a contact us page appearing higher where the actual content doesn't contain the search term? That would be some proof that Google is really broken! But I don't see any of that flavor.

I have this index situation with dozens of sites (not the contact page, but true content pages). It isn't really a big problem, often the visitor gets to where they want to be faster! That helps with conversion rates. In many ways Google is doing a better job. Some webmasters and SEOs may not like seeing pages that weren't intended to rank appearing at the top of SERPs, but personally I can live with it......all free traffic is good;)

To know why this is happening I think you have to look at the content of both your site and the competitors. For example if your index page has "Widgets" appearing 15 times, and your content page has "Widgets" appearing only once but it ranks higher, then take a look at the competition's pages.....do you see a pattern? I do, when it happens to my sites it seems to be related to what other sites have for on-page content. This is a bit like the theory of "over optimization". To me it makes perfect sense that Google should compare any one page with the competitor's on-page content and rank based upon a comparison, of course PR/back links need to be something similar to maintain the overall Google strategy of presenting the most "voted for" sites higher. It may also throw some light on the &filter=0 situation. When that filter is applied Google is ignoring duplicate content filters, it is also possibly ignoring a lot of its on-page content ranking algo at the same time.

I understand that no one really wants people arriving on a contact us page, but my suggestion is not to have your targeted terms in the tags and content on those types of pages if it really is a problem. When searching for a company name this is difficult to do, but I say live with it, if someone searching for your brand or company ends up on the contact page first it won't hurt them too much, it will probably reassure them they are in the correct site.

The whole concept of the web was for people to land in any position and move from thread to thread. I see nothing wrong with Google trying to help people find the most appropriate starting page based upon the term used. Trying to direct all traffic to an index page isn't really a good idea anyway IMHO, it severely limits the number of possible terms you can target.

In summary I think what we are seeing is a move back towards on-page content, a much better method of comparing that on-page content between pages/sites, and a lowering of the importance of anchor text and Page Rank.

I think Google needs to do some more crawling to get it exactly right. I am concerned that DeepFreshBot doesn't crawl as deep as DeepBot did, but unlike Dominic we are now heading in the right direction.:)

No they aren't. They are first pages. They have content themselves and links to other content.

Define "first pages". If "first pages" is defined as the first page that most visitors see when they enter your site, then I think we agree. If your site was a house, the "first page" would be the doorway.

Also note that I did not say anything about the content of such pages - of course they (can) have content! What I said was that the main purpose for many (not all) home pages is as an entry point to content contained elsewhere on the site. Your description of them as "first pages" suggests that you agree with this assertion.

And you're right, no one has yet suggested "lack of content" with regard to missing/poorly-ranked index pages -- and that's precisely why I am asking the question. People have been looking for a tie that binds these cases together, perhaps this is one. Napoleon, how does your research look in this context?

>> The owner of one forum has suggested the word is that it won't be sorted until the end of July. <<

I think that GoogleGuy has almost hinted as much on several occasions.

As for patterns over time; I posted this before in a longer version:

Main site at #1, and a pointer page at #2. Every 10 to 12 days or so, the pointer page moves down to #3 or #4 (once down to #14 or so), whilst the main site moves down to #60 or #64. This has happened about 6 times in the last couple of months. The drop lasts a couple of days, and several times occurred when everything else was fresh tagged up except for the site and pointer page. Both pages are called index.html, with the main site one being full of content, and the pointer page one having one paragraph of text and one link to the main site.

Things are so unstable right now, it is really hard to come to any conclusions at all. Napoleon's post is excellent, but on any given day, the parameters change. It is not flux as some people have said. It is something much different.

In my case, for some keyphrases that I'm #1 in anchor text for, the index page shows up where it should. For others that I'm #1 in anchor text for, the index page is nowhere to be found. One day they are there. The next they are not, yet I remain #1 for the anchor text.

If there was a penalty being applied for optimized anchor text, I think it would be across the board and not on specific phrases.

The lack of GoogleGuy's further input is curious as well.

I think the "more than weeks, less than months" quote will end up being a "few months" while they shake out the bugs from their new system. This will take us into mid July.

We all see the effects much more so than an average surfer since we watch our sites. While waiting for this to settle can be painful, there really is no other choice other than baiting a few more hooks and dropping them in to see what bites;)

>Then, everything's peachy for the weekend. Monday is busy and the whole cycle starts again...this is the 3rd week in a row that is exactly the same...

I've been seeing the same thing for at least 3 weeks. On a certain very non-competitive single keyword SERP, the home page of one of my sites cycles every few days from being in the top 3 positions, to below #40. That site just dropped from #1 to #67 a few hours ago. This site has been around a full year, so this has nothing to do with freshbot, etc. Based on how other search engines rank this page for this keyword, and the fact I have this home page well optimized for Google, I'd expect it to be somewhere on page 1 for that SERP. It is this bizarre periodic dropping below #40 that has me baffled. It looks to me like Google has 2 algos, and which is live changes every few days.

Assuming a site has some decent linking from external websites to the homepage and some decent linking from external websites to interior pages, what are the pros and cons of placing a Google No Follow on your homepage, especially considering the recent drop of index pages?

Not all index pages are affected, in fact, most aren't. I think it's a big mistake to make any changes to try to outguess the latest theory.

I agree with the view that there seem to be two algorithms. A new site I know of just received an ODP link. -ex and -cw show that link, while the other centers don't. But -ex and -cw don't have my index page, even though I seem to have an increased PR and more backlinks.

This smells to me like a database or server problem.

Anything from Googleguy since the 'see you in a couple of days' comment on Friday?

>I would agree, one applying a "main keyterm" filter, the other applying seemingly no filter - just good old fashioned quality SERPS!

Clarify what you mean by that? I also noticed on that SERP where my site is bouncing from #1 to nowheresville, the same is happening to a highly relevant, and well linked to, page on another site. Curiously enough, when I'm up they are down, and vice-versa.

If I were you I would forget all about trying to control where visitors arrive from Google or any other SE. Let them drop in on what Google considers the most relevant page and then give them some easy navigation so that they can easily find their way around.

I was just looking at the stats for one of my sites with 30,000+ pages. Less than 5% of SE referrals go to the home page, 90%+ of SE referrals find the products pages first.

I look at it this way. If the visitor always went to the home page first they may or may not find the page where I turn them into a sale. I think most people would, but why bother? If SE's want to take them straight to the product page it saves the visitor time, me bandwidth and certainly doesn't hurt the conversion ratio.

If a site is designed in a way that people are likely to be put off because they drop in half way through some type of sale spiel then maybe it is the site design that needs looking at, not how Google treats it?

Take WebmasterWorld for example....I wonder how many of Brett's new SE visitors find the home page first....my guess is very few, most probably arrive in a thread somewhere.

Maybe Brett will give us some actual stats.;)

This site is designed to allow people to drop in anywhere with no detrimental effects. You can "buy" from every page if you so wish! :)

2) I have one www.blue-widgets.ext site, and one www.red-widgets.ext site (where widgets is a major keyword, and where the sites are in different markets, albeit widgets). One is number 2 for its highly competitive search term, the other is page 9 or something. Both sites are the same design / layout, linking structure, etc., etc. This can only mean that site A made it on one algo, Site B lost out to another algo (and I think I know which one).

3) I have a www.mydomain.com with heavily linked to /blue.widgets/ and /red.widgets/ sub directories (Yahoo and Google link to both of these, as does every search engine). My root domain has NEVER really done that well, but the sub directories do just fine. So I do not see that missing index pages = broke. What I think is, Google have made a huge error in showing us which pages rank best for a search term and why (I always looked at the second page before with Google to determine what html they like to rank a site high).

Has Google lost data, gone off the rails, or is this just multiple algos delivering the results?

I would say it was the latter, but for one thing. I have another site that is really strongly linked to and should be #1 for its search term no problem. The link: shows a good percentage of the backlinks (although not all). With this site I was careful NOT to over optimize, in fact I left heading tags out. It should be very high just based on the backlinks I see, but it is not. Why? Because I believe Google are showing backlinks, but they have not yet computed the PR or keyword relevance that they should bring.

So, I think it is a real mixture. But I also believe you can find a lot of benefit from looking hard at the results right now. Just don't compare one site with another. Assume Google have a number of main ranking algos, maybe 2 or 3, and then compare ;-)

drewls - I as well am seeing the same cycle, and it has been occurring for the same 3 weeks.

I disagree that the update cycle is finished, as there is one other small feature that I almost missed while checking that cyclic effect.

1 - the -fi results have the index page in its weekend position, while the other data centers have it all over the place for the same keyphrases.

2 - the key phrases that are not affected by this swing have another anomaly..the size reported by Google is from the Feb results, while phrases that are affected are from the newest results. The cache page is exactly the same in both cases, the page from June 21st. The size difference is 4k which is something that would not have affected spidering or ranking when it was added.

>I also noticed on that SERP where my site is bouncing from #1 to nowheresville, the same is happening to a highly relevant, and well linked to, page on another site. Curiously enough, when I'm up they are down, and vice-versa.

Yes, I'm seeing the exact same thing.

I'm glad others are seeing some of the same things I am. I haven't felt that was the case through most of this, with Dominic and all.

So GG, now that we've spotted this pattern: Can you throw us a bone and give us a cryptic pseudo explanation or something. :D

I've been interested to read all the posts looking to Googleguy for an answer - he doesn't owe any of us an answer - in fact for all we know he may not be privy to the information we seek. In the same light, posting a message that suggests (or even insinuates) any member here "should" offer their thoughts is not the way to go. Every member here (including Brett, GG, Mods, Admin) offers their contributions voluntarily, and they are all no doubt aware of this thread. I am sure they would contribute if they wanted to, but I guess they don't, either because dozens of you would jump on and analyse their every word or because they just don't have any new information to offer.

Scott

You owe me nothing. Brett, Mods, and Admins owe me nothing. GoogleGuy and Google themselves owe me nothing. However, it is the right thing to do. Google has a responsibility. Google has the power to play with people's livelihoods. They should respect that power and respond to the masses that play this "game of ours" by the rules.

I wouldn't demand an answer from GoogleGuy. I agree that he owes me absolutely nothing. I'll ask for one though.

He can tell me to go fly a kite if he wants to, but I'll still ask.

Lots of people around here love to dismiss everyone who's watching this as 'crazy' and tell them to 'wait until the update is finished'. I'd be happy to wait until the update is finished...the only problem is I'm seeing an update that appears to be in week 3 of a seemingly neverending loop.

I agree with you anon27. When an organization like Google is responsible for a large scale operation affecting people's livelihood, it is ethical that they accept the responsibility. I am a new member and have been following this thread. It took me two years part time to build my site, and it was indexed by Google 6 months ago. Within a month, the site got a page rank of 6 (through dmoz, internal and external links). After the Dominic update, there are sub-pages that still hold the same page rank but the SERPs have gone from 1 down to 100. I did change the URL and the title for all the pages a month ago (with a 301 redirect), but if this was the reason then all the pages should have been affected. I noticed that most of the keywords that have been messed up are the ones that attract high traffic. Has anyone else noticed the same? Like others I also noticed that some of my pages have been alternating from being #1 to down in the dumps for some targeted keywords.

>If I were you I would forget all about trying to control where visitors arrive from Google or any other SE.

I would love to agree with you, except that instead of the sub-pages showing up where the index file would have, they are showing up in the 10+ results.

Besides, we are trying to figure out what is going on. The more we know, the better chance we have at #1 results. No matter what Google does, there has to be an algo to prioritize results - it's our job to pull together and figure out what that is, with or without their help.

For one second, maybe we can assume this is all still just a dance. I'd still like to be able to place higher during the dance too :-)