Preferred Member

Senior Member

joined:May 21, 2001
posts:2149
votes: 0

Hello all,

My first post here....

I have been studying Google closely, and it was acting weird all day yesterday. Searches returned half the number of results they normally do, then my new pages were listed and MSN's PageRank went down (hehehe), but now it is all back to where it was.

I love Google - just hope my pages end up a little better than they were. I had figured out what the PageRank should be for one of them, and it was off by way too much.

Anyway, glad to be here - hope everyone's pages work out well (well not everyone, but everyone here :) ).

One other interesting note. There is an error (actually several) in the Google web directory in the adult section, and it is pretty severe (categories empty). I hope they fix it. Oddly enough, the category button showed the correct category even though Google shows it as empty. I don't want to post where, because it is an adult section, but check out the A & B sections of any of your categories that use the alpha bar - they might be empty too.

Preferred Member

Google's cycle has definitely changed. Data from the crawl that ended for me on May 11 is now showing up in www2, and can be expected to be stable throughout Google by the end of this week.

This is a 10-day processing cycle, which is shorter than the 14-day I've come to expect. The crawl was a bit shallower than the previous crawl, even after disregarding the fact that they skipped one of my sites entirely.

There is also evidence that Google is doing overlapping cycles. My site has been crawled extensively the last few days, and half-heartedly about five days earlier. It used to be that there was a single cycle, but now it looks like smaller, overlapping cycles.

For any site where Google was not able to make it through all the pages before stopping under the old crawl pattern, the question for this new pattern is obvious:

For these mini overlapping cycles, is Google starting from square zero each time, and crawling shallower, or are they continuing where they left off the last time?

Too early to tell, but I have a bad feeling that it's the former -- more frequent, but also more shallow.

Preferred Member

joined:Feb 17, 2001
posts:409
votes: 0

Everyman, I think I remember your previous messages about the spidering problems of your "deep" web sites. How deep does Googlebot go before it gets lazy? Is there some fixed rule, or at least some decision pattern you can see?

Preferred Member

No fixed rule that I can detect. And everything I had observed from October (when Google started going deep) to April has to be revised in light of this new behavior.

The only thing I had concluded based on the consistency of the October to April pattern was that they had to stop crawling and start processing all at once. Typically it would crawl for a week, and get downright feverish by the end of the week, and then stop cold and never come back to crawl until the cycle restarted three or four weeks later.

I speculated on the basis of this that they had to turn off the crawl so that those PCs could start processing the data. My site was deep enough so that I noticed the cutoff. Other sites might not notice, because Google would get through their site before it became time to stop.

Junior Member

joined:Oct 12, 2000
posts:116
votes: 0

Everyman, when you say deep, would this also include subdomains as part of the main domain, or does the spider view subdomains as separate new domains which it has to crawl? Basically, is a subdomain treated as a new domain? Also, I think you mentioned your site is in the 100k range for documents. How far does the spider crawl in that many documents before it cuts off? Thanks, nube

Preferred Member

www.XXXXX.org/cgi-bin/YYYY.cgi?AAAA_BBBB_CCCC where AAAA_BBBB_CCCC is a proper name.

The XXXXX is always the same.

The YYYY alternates between two cgi programs, but I've locked Google out of one of these by returning a "Server too busy" because all the names are covered with the other cgi program.

The AAAA_BBBB_CCCC is always changing.

Each page returned from the above link has from several to several hundred additional links on it in the same form, but with new names in the links. Each of these also links to a page with from several to several hundred in the same form. The page itself is usually less than 50K bytes.

And so on, and so on. That's deep.
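
To put rough numbers on why a cutoff bites on a site like this, here is a minimal Python sketch. It assumes a breadth-first crawler that stops cold after a fixed number of fetches; nothing in this thread confirms that Googlebot works that way. The function names `crawl_with_budget` and `toy_links`, the page numbering, and the budget are all made up for illustration.

```python
from collections import deque


def crawl_with_budget(start, get_links, budget):
    """Breadth-first crawl that stops after `budget` page fetches.

    Returns (pages fetched, deepest link level fetched, pages still queued).
    """
    seen = {start}
    queue = deque([(start, 0)])
    fetched = []
    deepest = 0
    while queue and len(fetched) < budget:
        page, depth = queue.popleft()
        fetched.append(page)
        deepest = max(deepest, depth)
        for link in get_links(page):
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append((link, depth + 1))
    return fetched, deepest, len(queue)


def toy_links(page, total=1000):
    """Hypothetical site: page n links to pages 3n+1, 3n+2 and 3n+3."""
    return [child for child in (3 * page + 1, 3 * page + 2, 3 * page + 3)
            if child < total]


fetched, deepest, left_over = crawl_with_budget(0, toy_links, budget=40)
print(len(fetched), deepest, left_over)
```

On this toy site the crawler spends its whole budget on the first few link levels and leaves pages it has already discovered sitting unfetched in the queue. A site with hundreds of links per page would hit the same wall much sooner in terms of depth.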

It would be possible to run out of names after 115,000 pages if 1) Google got that far, 2) Google could detect on the fly whether it already got that name, and 3) Google stopped asking for that second cgi program that repeats the name and always comes back "Server too busy" because I've locked them out of this search that returns a Java applet.

As it stands now, Google would actually have to get 230,000 pages to run out of names (115,000 names times two cgi programs), assuming it can detect and skip duplicates on the fly. Half of these would be "Server too busy."

With six crawlers working at once, I don't think Google can detect duplicates on the fly, because I don't think the crawlers are talking to each other much, if at all. So I suspect that it's getting the same name several times, and these get purged later into just one page for each name. Very inefficient.
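
As a rough illustration of that suspicion, here is a small Python sketch comparing crawlers that share one "already fetched" set with crawlers that each keep their own. It is only a toy model under my assumptions: the `crawl` function, the example.org URLs, and the names in them are hypothetical, and it says nothing about how Google's crawlers are really built.

```python
def crawl(queues, shared_seen):
    """Simulate several crawlers, each working through its own queue of URLs.

    With shared_seen=True every crawler checks one common set of URLs that
    have already been fetched. With shared_seen=False each crawler only
    remembers its own fetches, so the same URL can be fetched several times.
    Returns the list of URLs actually fetched, in order.
    """
    common = set()
    fetched = []
    for queue in queues:
        seen = common if shared_seen else set()
        for url in queue:
            if url in seen:
                continue  # duplicate detected on the fly, skip the fetch
            seen.add(url)
            fetched.append(url)
    return fetched


# Hypothetical URLs in the same form as above, with made-up names.
a = "http://www.example.org/cgi-bin/one.cgi?Smith_John_Q"
b = "http://www.example.org/cgi-bin/one.cgi?Jones_Mary_A"
c = "http://www.example.org/cgi-bin/one.cgi?Brown_Lee_T"
queues = [[a, b], [b, c], [a, c]]  # three crawlers with overlapping work

talking = crawl(queues, shared_seen=True)
not_talking = crawl(queues, shared_seen=False)
purged_later = set(not_talking)  # duplicates collapsed after the crawl

print(len(talking), len(not_talking), len(purged_later))
```

In this toy case the crawlers that do not talk to each other make twice as many fetches for the same three pages, and the extra copies only disappear in the purge afterwards.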

Usually I end up with from 20,000 to 40,000 useful pages in the index before it quits. By "useful," I mean a page that isn't merely "Server too busy." (Actually these "too busy" pages aren't entirely useless, because the name is in the link and folks hit on it. It's just Google's cache copy that's useless.)

In all, Google often ends up with lots of obscure names when they ought to be going after the least obscure names.

That's why I'd just like to send them a CD-ROM once a year, with the best data, all laid out and Linux-ready per their specs. No response from Larry Page on this, and it's been six months.

I've been traveling and haven't had much chance to check online, but chanced to look at www.google last night and noticed a new site had been indexed. Checking again tonight, I find that it's not in the main Google index but is in www2.... Last time we went through this it took about a week for things to settle down.

Junior Member

New User

joined:Sept 3, 2004
posts:5
votes: 0

Re: post of May 21: "I have been studying Google closely, and it has been acting weird all yesterday. I get searches that return half the amount they normally do, then my new pages are listed, MSN's page rank went down (hehehe), but now it is all back to where it was."

I get weird results from Google today, May 23; sometimes my site shows up as it used to, ten minutes later it does not show at all, not even in all the pages Google lists. All gone. Even other sites I have that were not well-ranked, are sometimes now not there at all; other times they are. Any clues?

Senior Member

joined:May 21, 2001
posts:2149
votes: 0

I had a somewhat similar situation in that I had - let's say 5 pages - one was fairly well ranked and pointed to the other four. When that one went away, the other pages dropped all the way to the bottom of any searches. Are you sure it is not there at ALL?

Did the page(s) that pointed to your site go away?

It could just be an update thing. There is a part of the Google update procedure that goes like this:

(Drop all child links - redo something or other - and then add them back in)

At one point or another, those on the bottom of the food chain are dropped so Google can do some sort of iterative process, then they are added back in. Maybe that is what you are seeing.

I would be surprised if they totally got rid of your pages altogether after everything is said and done.

Here is a quote from one paper:

"Dangling links are simply links that point to any page with no outgoing links ... Often these are ... simply pages we have not downloaded yet ... we simply remove them until all the PageRanks are calculated. After all the PageRanks are calculated, they can be added back in ..."

This wasn't the quote I was looking for - but it is something like that.
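
To show roughly what "remove them, calculate, add them back" could look like, here is a minimal Python sketch. It is my own toy reading of the quote, not Google's actual procedure: the function names `pagerank` and `rank_with_dangling`, the damping value of 0.85, the iteration count, the one-pass way of adding pages back, and the four-page example site are all assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Plain iterative PageRank; every page in `links` must have outgoing links."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, targets in links.items():
            share = damping * rank[page] / len(targets)
            for target in targets:
                new[target] += share
        rank = new
    return rank


def rank_with_dangling(links, damping=0.85):
    """Drop pages with no outgoing links, rank the rest, then add them back.

    `links` maps every page to the list of pages it links to (possibly empty).
    """
    # Step 1: repeatedly drop dangling pages and the links pointing at them.
    core = {page: list(targets) for page, targets in links.items()}
    removed = []
    while True:
        dangling = [page for page, targets in core.items() if not targets]
        if not dangling:
            break
        removed.extend(dangling)
        for page in dangling:
            del core[page]
        core = {p: [t for t in ts if t in core] for p, ts in core.items()}
    if not core:
        raise ValueError("every page was dangling; nothing left to rank")

    # Step 2: calculate PageRank on what is left.
    rank = pagerank(core, damping)

    # Step 3: add the dropped pages back, last removed first, giving each
    # one a single pass of rank from the pages that link to it.
    for page in reversed(removed):
        incoming = sum(rank[src] / len(links[src])
                       for src in links
                       if page in links[src] and src in rank)
        rank[page] = (1.0 - damping) / len(links) + damping * incoming
    return rank


# Hypothetical site: "new" has no outgoing links, so it is dangling.
site = {"home": ["a", "b", "new"], "a": ["home"], "b": ["home"], "new": []}
for page, score in sorted(rank_with_dangling(site).items()):
    print(page, round(score, 3))
```

The point of the sketch is only that a page at the bottom of the chain can be out of the calculation for a while and still end up with a rank once it is added back.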

New User

joined:Sept 3, 2004
posts:5
votes: 0

Chris,

Thanks for the info. Yes they dropped every page. 3 sites, and every page on those 3. I can tell because if I do a keyword search for the brand product I sell (say Black and Decker), other sites which I am familiar with that sell the same brand, or sites that are linked to me, show. But not my site(s). But then, as I say, about 10 minutes later I will show, and the other listings that show on that page when I am shown are completely different than the listings that show when I am not coming up.