Sitemaps Interview

Sebastian has posted a good interview with the Sitemaps team. The most useful tidbit (which I didn’t know until now) is that Google treats a 404 HTTP status code (page not found, but it may reappear) and a 410 HTTP status code (page not found, and it’s gone forever) in the same way. I believe that we treat 404s as if they were 410s; once Googlebot has seen a 404 at that location, I think we assume that the document is gone forever. Given how many people use 404 instead of 410, that’s probably a good call for the time being.

Most of the interview is not about HTTP status codes though, I promise. The only thing I’d change (we’ll see if Sebastian reads this) is to make the questions a different color from the answers so it’s easier to browse. 🙂

Aaron, it may be related, but we have a “downranking to Google oblivion” problem that started Feb 2 LAST year and persists (1.5 million monthly Google visits down to 15k). It was actually discussed at length in the WebmasterRadio interview, plus Matt’s site reviews in Vegas, plus I’ve asked some of the best in the biz. It appears related to extensive 302 problems and accidental duplication of pages, but it remains unsolved after a massive site reconfiguration we did to consolidate several state-level domains into our oldest one using 301s. I’m not seeing improved results with BigDaddy; in fact our 301 pages are back in the index after I removed them with the removal tool!

Whoops… OT? Matt, this could be considered a Sitemaps post, right, since we’ve got a HUGE one for this massive site?

Irritating: Google has done some updates during the last week, and now the wrong information is in the index; nonexistent pages on my site are still listed.
I suppose it will correct itself, but since it was an update it is really strange. IMHO…

1. Permit hidden keywords on a page in the keywords meta tag. You can determine whether those keywords are on-topic for that page by comparing the visible page text with the corpus of web text you have. So suppose a website puts “chocolate” in the meta keywords and “candy” in the page; you can calculate the probability of the word “chocolate” being related to “candy”. Below a threshold it’s spam; above a threshold it’s the webmaster telling you what the page is about. You could even weight the relevance of the keywords tag relative to the page, so “chocolate” gets a 0.8 weighting and “viagra” gets a 0.00001 because it’s off-topic. That way sites can give you all the relevant text for a page without being forced to slap HTML text on it.
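A rough sketch of that thresholding idea (the relatedness table, numbers, and threshold are all made up for illustration; a real system would estimate relatedness from a large corpus, not a hand-written table):

```python
# Toy relatedness table standing in for corpus statistics: how likely a
# meta keyword is to co-occur with a visible page term on the web.
RELATEDNESS = {
    ("chocolate", "candy"): 0.8,
    ("viagra", "candy"): 0.00001,
}

SPAM_THRESHOLD = 0.01  # made-up cutoff for illustration

def keyword_weight(keyword, page_terms):
    """Best relatedness of a meta keyword to any visible page term."""
    return max(RELATEDNESS.get((keyword, term), 0.0) for term in page_terms)

def classify_keyword(keyword, page_terms):
    """Above the threshold: the webmaster describing the page. Below: spam."""
    weight = keyword_weight(keyword, page_terms)
    return ("on-topic", weight) if weight >= SPAM_THRESHOLD else ("spam", weight)
```

With the toy numbers above, “chocolate” on a candy page scores 0.8 and is kept; “viagra” scores 0.00001 and is flagged.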

“I believe that we treat 404s as if they were 410s; once Googlebot has seen a 404 at that location, I think we assume that the document is gone forever. Given how many people use 404 instead of 410, that’s probably a good call for the time being.”

Matt,
can you please double-check this? It seems like removed pages linger in the supplemental results for years.

“404 Not Found: The server has not found anything matching the Request-URI. No indication is given of whether the condition is temporary or permanent. The 410 (Gone) status code SHOULD be used if the server knows, through some internally configurable mechanism, that an old resource is permanently unavailable and has no forwarding address. This [404] status code is commonly used when the server does not wish to reveal exactly why the request has been refused, or when no other response is applicable.”

So it makes sense to treat a 404 as a 410, regardless of how commonly it’s done, because the HTTP specs allow 404 to replace 410 in any case.

That sounds a bit aggressive on the 404s, based on my experience. I’ve renamed stuff (and not bothered with redirects), but seen Googlebot come by again several times. Actually makes sense to me – seems like the spider should “check” a few more times before it decides things are gone.

And if there are inbound links to the (now) missing page, shouldn’t Googlebot come back around to see if it has re-appeared? I realize this is just the spidering activity and may not be reflected in the index, but you might want to clarify where the “assume it is gone forever” is applied.
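If a crawler did recheck before giving up, the bookkeeping might look something like this sketch (purely an assumption about how a crawler *could* behave, not a description of Googlebot; the recheck count is invented):

```python
RECHECKS_BEFORE_GONE = 3  # assumed number of confirming 404 visits

class CrawlState:
    """Track 404 sightings and only mark a URL gone after repeated checks."""

    def __init__(self):
        self.not_found_count = {}  # url -> consecutive 404 sightings
        self.gone = set()

    def record_fetch(self, url, status):
        if status == 404:
            count = self.not_found_count.get(url, 0) + 1
            self.not_found_count[url] = count
            if count >= RECHECKS_BEFORE_GONE:
                self.gone.add(url)
        else:
            # The page re-appeared (or never vanished): reset everything.
            self.not_found_count.pop(url, None)
            self.gone.discard(url)
```

Under this model a transient outage (one or two 404s followed by a 200) never marks the page gone.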

joe – thanks for clarifying; that sounds like a mess, dude, good luck with it. i am now concerned with scraper sites that have PR and trust; they are showing duplicates of my articles in the engines days before mine even appear, if they ever appear at all.

I’ve also found 404’s persistently remaining in the index, (for a while it was my fault – 404 header was incorrect) but I fixed it and they’re still there.
Also 301’s… I moved 120 pages 2 years ago, set up 301’s and removed ALL links to those pages.
18 months later I removed the 301’s and BANG – 50% of the old urls re-appeared in the index… so I reinstated all the 301’s and now (6 months later) the old urls are STILL showing up in site:mysite.com and repeatedly showing as errors in my sitemap stats!
(Makes me wonder about sites like the Wayback Machine – obsolete & incorrect URLs still have live links on the web at places like that!)

“Results 1 – 100 of about 2,240,000 for 404 “file not found”.”
😀 Keep saving those, Google, they might be good in a few years!

Please let us know, Matt — how can we tell Google that we really mean 404 (like go away) when we say 404? There are so many people with this problem – old pages never dropping out of the index, junk accumulating, Google’s search results filled with pages that just don’t work (for a long time now). Is there a trick to saying 404 in a way that Google will accept as a final answer?

With Sitemaps the webmasters now have a chance to look under the cover of Google’s search engine – and they see those everlasting 404s — old links or whatever that point to some URL that has been long gone. Google shows it as an “error” and the webmasters have no way of correcting it, or even finding the link that pointed to that URL (if that link even still exists)…

What can we do to make it easier for you to index our real content and forget about our past-life?

walkman, a supplemental result has to be recrawled for that 404 to go into effect. Since supplemental results go longer between recrawls, that’s the reason why some 404 pages linger. See below though; supplemental results need to be recrawled by the supplemental results Googlebot before they are processed.

alek, you make a good point. We might try to recrawl a site several times if the page never loads, for example. I’m not sure if we do that with a 404 though.

JohnMu, one thing to keep in mind is that we look for an HTTP status code of 404. For example, for the search [404 “file not found”], the #1 result isn’t a 404; it’s a gag site. See how it says “Please, for the love of god, try the following: … Go outside now” and it’s clickable to go back to the main site.

A lot of the time, a site will return a 404-looking page (not a gag like the site above), but the HTTP status code will be 200 (as if the page was found just fine). We call that a crypto-404. We may look for certain phrases like “file not found,” but if you want to be safe/conservative, I’d double-check that your 404 pages really return an HTTP status code of 404, not 200. After we recrawl the page with the appropriate spider (e.g. for a supplemental result page, it needs to be crawled by the supplemental result spider), the 404 or 301 or whatever will be processed.
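A sketch of that distinction in code (the phrase list is only an example; the key point is that a “real” 404 is decided by the status code alone, never by the page text):

```python
# A real 404 is identified by the HTTP status code. A crypto-404 is a
# not-found-looking page served with a 200, which can get indexed as if
# it were real content.
ERROR_PHRASES = ("file not found", "page not found", "404")

def classify_response(status_code, body):
    if status_code == 404:
        return "real-404"
    if status_code == 200 and any(p in body.lower() for p in ERROR_PHRASES):
        return "crypto-404"
    return "normal"
```

To check what your own error page actually sends, a header-only request works, e.g. `curl -I http://yoursite.example/no-such-page`, and the first line of the response should say 404, not 200 OK.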

I run a site that has (according to G’s seemingly inaccurate count) 580k indexed pages and that is merging with another site that has 81k pages indexed. The two sites will be merged into a new third site. I see two ways of going about this that *might* work.

1) As soon as the new site is up (its code and content are based largely on the existing sites), everything will be 301-redirected that can be (probably all of the pages from the larger site and a small number from the smaller site). The concern is that it will take many months for G to index most of the pages on the new site, and we will therefore lose most of our traffic for that period.

2) Create the third site and wait until it is indexed fully and then do the redirects (per above). The problem I see with this is that we could be hit hard with the duplicate content penalty as there will be the exact same content on at least two sites.

About a year after I first learned how to create custom 404s in ASP, I discovered that by default they return a 200 Status code. If you don’t code in the Response.Status = “404 File Not Found” in ASP, you’ll see a lot of those custom 404s in the index.

I’m sure there are 404s that don’t meet this description, but at least some of them do, and Google can’t be faulted for that. This should provide at least a partial explanation for the behaviour.

I use my 404 pages to randomly generate a never-ending chain of pages that randomly construct useless content from an array of over 800,000 sentences, into a random number of paragraphs for an end result of 300 – 600 words per page.

We’re doing another site (only for search engines) that has _exactly_ the same text/images as the AJAX site. The only differences are that you find links to pages that you do not find in the AJAX site, and that in the plain HTML site there is a JavaScript that redirects the user to the AJAX site.

I would like some direction from Google on how to have an AJAX site properly indexed and not delisted, given that I am not trying to boost my PageRank. I just want to be (fairly) indexed.

Sorry to post this here, but I could not comment on this in the other thread.

I’m not convinced that Google does understand 404s properly. I tried removing a URL with the Google removal tool about 4 months ago and got a message back saying the request for page removal was denied. When I emailed Google about this they replied:

“Thank you for your note. We’re sorry about any confusion you’ve experienced. We’d like to reassure you that because this page currently returns a true 404 error, we can remove it from our search results. This removal should be processed within three to five days. We appreciate your patience.”

Looking around the forums, I see a considerable number of comments stating ‘Since integrating Sitemaps into our website, we have lost all of our positions’ and ‘our site has disappeared’.

This makes one hesitant to create a sitemap, for fear of losing either positioning or the site altogether, as far as the SERPs go. On the other hand, if one doesn’t create a sitemap, you feel left behind and not keeping up with the available technology.

It’s a catch-22 situation, whereby you could win or you could lose, according to many webmasters in the forums. Any comments would be appreciated.

In an older post you mentioned that Google tries to contact webmasters when they violate the webmaster guidelines but still have good content – just someone on their website team did bad stuff.
Couldn’t this be combined into Sitemaps, just like the crawling errors?
Sitemaps could be turned into THE communication channel from Google TO webmasters, since you have already verified (by the empty-file thing) that the person with the Sitemaps account is responsible for that website.

I have been severely kicked by last autumn’s update, and I would kill for some feedback from Google on why my pages now rank behind notorious page spammers, people with keyword.domain.tld URLs, and sites with nothing on them except links to affiliate marketing networks.

It bothers the webmaster a tiny bit, but it really bothers the user who expects to find working links in a search engine like Google. Finally finding what you’re looking for, only to see it 404 in the browser is … depressing …

What can the webmaster do to speed up Google’s dead-page removal (without manually submitting each and every URL to the page-removal tool)? Or how about Google offering to go to the last cached page when it last saw that the URL 404’s? I’m sure there must be a way, with so many Google experts at work :-).

I know that domain isn’t really *that* important, but if Google really indexes 500+ URLs from there that are certainly returning 404 as a result code, there must be something slightly off-balance. Or is there a reason for a listing like that?

Paul, I’ve investigated lots of these “Sitemaps has tanked my site” reports, and they all have one thing in common: no proof, no evidence, 100% speculation, and always other causes which have nothing to do with Sitemaps. Don’t blame the sitemap when your site gets tanked. If a sitemap submits junk to Google, the standard procedure will handle it, in the same way as with add-URL-page submissions and links found elsewhere. OTOH, creating an XML sitemap is a chance to look at a site’s structure and its contents. Grab any tool to generate a sitemap; double-checking the collected URLs before you submit often leads to unexpected insights you can use to enhance your architecture.

I checked the headers of our custom 404 page returned from Tomcat. It’s returning a 404, phew 🙂 Won’t these crypto-404s result in a lot of duplicate content? We have plenty of 404 URLs in the index. They won’t have any title or description, but they are still listed. Is there a rough rule of thumb for how long these invalid URLs are going to stay in the index?

I am not getting a response from Matt, so maybe someone can explain: it appears that aggregated scraper “news” sites get credit for blogs without PR, like mine. I would like an answer on this, Matt; I really do not ask for much, seeing that I am being used by the engines and scrapers at the same time. Your content is also appearing on the same “news feed,” if that is what it is called. I call it lame. Is my content all now considered duplicate?

Ryan, that’s yesteryear’s spam :-). Those tags are all valid (look up “Dublin Core meta tags”) – but I doubt they’re doing them much good. If they’re ranking for any of that, then it will be because of other factors (see seomoz’s excellent http://www.seomoz.org/articles/search-ranking-factors.php for more ideas).

Back in September I set up a custom 404 error page that would redirect users to the home page if a page had been deleted. All the talk about using 301s instead of 302s made me use a 301.

Apparently Google had a lot of old URLs in its index even though these pages had probably been returning a 404 for a long time. I guess Google tried to crawl them once again and got a 301 back to the homepage.

The Jagger update dropped my site from the 1st page to the 10th page. I speculate I have a duplicate content penalty for using a 301 instead of a 404, and it still exists almost 4 months later 🙁 I read a thread somewhere about someone doing the same thing and being penalized, which is what made me realize I did something wrong and change it to respond with a 404 status.

My site also has many more pages indexed under the non-www version of my site, even though I am using a 301 to the www version. My www version starts getting more pages indexed and then drops back to 1 page, my homepage. I have seen it do this several times.
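For what it’s worth, the usual fix for www/non-www splits (an assumption on my part, not advice from this thread) is one unconditional 301 from the non-canonical host to the canonical one, so only one version of any URL can ever be indexed. A sketch, with a hypothetical domain:

```python
from urllib.parse import urlsplit, urlunsplit

CANONICAL_HOST = "www.example.com"  # hypothetical canonical host

def canonicalize(url):
    """Return (status, location): 301 to the www host, or 200 if already there."""
    parts = urlsplit(url)
    if parts.netloc == CANONICAL_HOST:
        return (200, url)
    fixed = urlunsplit(
        (parts.scheme, CANONICAL_HOST, parts.path, parts.query, parts.fragment)
    )
    return (301, fixed)
```

In practice this would be a server rewrite rule rather than application code, but the behavior is the same: every non-www request gets exactly one 301 to its www twin.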

Google needs to do a better job of updating its index when it sees a 404 or a 301. It seems they are spending too much time on showing a smaller URL when someone does a 302. MSN doesn’t have results as good, but it updates its index very frequently.

I just don’t see the reasoning in keeping results like that online. And I imagine Google’s results will be the first to have manual tweaks – so I don’t want to know how long other people’s 404s stay live…

NOW bookmark any of the Medscape links, or cut and paste one into a new browser… guess what, you will no longer see that content… you will be required to first create a free account.

Medscape is clearly tricking Google. Up until a few weeks ago, their cache was different from what you actually saw. But is this fair? Here they are clearly looking at the referrer: if it’s from Google they deliver the content, but when you try to re-access the page, well, you have to log in.

The point of this is that “cloaking” is in fact useful and widespread in Google. Google’s spiders are machines — and as such, webmasters increasingly build content JUST for them… the result is pages that don’t read well for users. Similarly, as in the Medscape example, sites need a way to let users know what lies behind their members’ area… such uses of cloaking are quite legit, as is the BMW case.

Aaron I’ve heard of that problem and was under the impression that when the bot sees the SAME content it has trouble determining the originating site. I’d guess it uses PR to make the call and since you are lower than the quoting sites your site is assumed to be the duplicator. But I’m not sure of this at all.

Perhaps this is a violation as well, but it’s my way of getting around things. In other words, I need to comment on the removal of BMW.de from Google’s index, and even if this post is removed, someone will have to read it first.
Looks like Google people are becoming too arrogant, behaving like Web police (don’t get me wrong, I believe in Google more than any other tech company, but some of their latest steps are quite surprising). I mean, if someone has a way of boosting their score, why not just try to be smarter than them and lower it when a trick is encountered (which is what Google’s unofficial policy has been so far, right?) – wouldn’t that be fairer than setting the score to 0? To put it bluntly, how dare you punish anyone?
Furthermore, and this is even more surprising, Matt said that they might be asking for information on WHO created the redirects at BMW.de. What about all the privacy stuff that Google is so concerned about? Doesn’t that apply in their case? If I understood this correctly, it is quite shocking to me.
Cheers everyone, and apologies for intruding on your topic.

Are there any hints you can give on how many times Google will check a 404 to see if the document has returned? Many times a 404 is an oversight or a problem that is being dealt with… Other times it’s from a client not paying for hosting, and it takes them a few days to realize their site is down.

I have often wondered how much time you have to correct a 404… Like many others, I’m sure, if I realize my site is 404’ing on a page, I’m panicking to get it fixed as soon as possible… It would be comforting to know just how many checks it takes until the page is removed from the index because of the 404…

John, I just checked the first couple of URLs on that Google query I posted — some of the 404s have been cached as the 404 page since last June :-).

I second Harith’s questions (except for the part about Frank’s sister). How does a spam report go into Google’s web-spam workflow? Does the number of reports for a site matter? Or can a single report cause a manual check?

It would be great if Sitemaps did evolve into a way for webmasters to communicate with Google as Sebastian suggested.

One area that springs to mind is where several domain names point to one site e.g. mysite.com and mysite.co.uk. A webmaster could use Sitemaps to tell Google which domain should be listed and which domains are just there to prevent domain squatters.

Well, it wasn’t devious intentionally. I was just trying to find a good spot to put them that wouldn’t interfere with the actual content. I don’t want to drive people away with a big fat old ad in the middle of a story. I personally hate that. I want to make a little $$ off of it, but not so much that I’m going to compromise the design. I tried ads at the bottom before, but they were never touched. The people keep coming back and telling their friends, and I’m getting lots of new members each day…. I got like 40 new members yesterday alone. I’ve got two volunteers translating the newsletters (and now working on the site) into Chinese, which is pretty cool. Now I just have to do some more updates to the site…and get working on our next monthly newsletter! This thing took off faster than I could have ever imagined. And I was just trying to come up with a better alternative to the “official” Google Librarian stuff, which if you ask me, is pretty sparse and not quite on target for librarians.

I am going to assume that my target demographic is a bit smarter than average (due to the fact that most have MLS degrees or higher, and many members are upper-level management). I find that the Google search box generates more clicks on ads than the content ads. Librarians are curious and inquisitive, so they like to search. They can’t resist the search box. lol. Sounds funny, but it’s true.

Going even more off topic 🙂 — Julie, you can improve your ads’ relevancy with small comment markers for the AdSense bot; see http://seside.net/google/2005/10/25/improving-googleads-relevancy. I saw that the tutorials page, for example, just showed ads for languages (probably catching the “view site in …” tags on the side). As always, better relevancy of the ads = better ad targeting = better visitor experience = more clicks = more $ = happy Google + happy webmaster 🙂

Interesting and certainly frustrating for you Aaron. Funny cuz I was also thinking it would be fun AND educational to start a blog/forum for people who are experiencing what appear to be technical difficulties with Google that hurt their legitimate rankings. By collecting complete contact info for legit sites one might help the site AND Google separate legit from spammy concerns.
But we drift OT here so I’ll end with a sitemap rap.
“da dot com be crappin,
so i go sitemappin,
but da G says
“iz juz not not gonna happen!”

About 3 months ago somebody took every article that I had ever submitted to article syndicates (like GoArticles), changed the name and bio to information about himself, and re-submitted them to the same article syndicates…

And now he kills me in the search results for the title of an article that I wrote.

Mistah – your point about several domains pointing to one server is a good one, and one I am struggling to use Sitemaps to deal with at present. My .com site is well indexed and ranks well for several important keywords, but I have recently purchased the .co.uk to try to get my .com to show up in .co.uk-only results too. I have the .com and .co.uk on the same server (parked?) and am using absolute .co.uk links within the sitemap for the first few .co.uk pages, so Google should see .co.uk links; then the rest of my site uses absolute .com links, to hopefully make G realize the .com is a UK-based site. Is there a better / recommended way to do this in Sitemaps / Google?

Sitemaps are really useful but this type of thing and a form of communication to the big G would be even better 🙂

I do not blame Google; I blame all the folks who are content spammers. It is not about feeding the engines crap; it is about presenting your passion to the world. If you are into “beach erosion,” blog about it. Just keep an eye on Mr. SEO News; he is still having his way with your content, I believe.

I shouldn’t complain. I have 2 sites that do nothing but take headlines from various other sites and aggregate them. However, I just display headlines and put an actual real link to the webpage my crawler found them on. I feel that’s fair… AND a great way to get content for some of my sites.

It’s the people who take the article word for word, or remove the links from it that piss me off.

I have experienced the same with one of my articles, though, Aaron. I posted it first on my blog, then syndicated it a month later. Do a Google search for the article title, however, and my blog is nowhere near the first page, but the syndicates without links are #1.

I would say Google needs to do something about “where was it first,” but given the nature of its spider, it won’t really know… and if they checked dates on the page, I’m sure people would easily start faking those.

With syndicated articles being all the rage, there really is no good way for google (or anybody for that matter) to tell where it originated… I almost suggested some sort of pinging system, but even that could be spoofed / faked.

Sitemaps continues to report several 404 errors on pages that never existed under this domain, but did under a previously used domain that we 301’ed to this new domain over 2 years ago. Could these 301 redirects prevent Googlebot from recognizing the 404 response on the current domain?
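One way to see why that could happen: a crawler that follows the old domain’s 301 ends up requesting a URL on the new domain, so any 404 it finds there is reported against the new domain even though the stale link points at the old one. A toy model of that chain (domain names are hypothetical):

```python
# Old-domain URLs that still 301 to the new domain.
REDIRECTS = {
    "http://old-domain.example/page": "http://new-domain.example/page",
}

def final_status(url, existing_urls, max_hops=5):
    """Follow 301s and return (final_url, status) at the end of the chain."""
    hops = 0
    while url in REDIRECTS and hops < max_hops:
        url = REDIRECTS[url]
        hops += 1
    return (url, 200 if url in existing_urls else 404)
```

So if a page never existed on either domain, the chain still terminates at the new domain, and that is where the 404 gets logged.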

Ryan, in your case I’d think you could file a DMCA complaint. Copyright rests with the author even if you don’t register it. Often just the threat of this will get the content removed or, sometimes better, attributed to you and your site.

Joe – Yes indeed, I am going after people I feel have done me wrong, and in just one day I see their sites are either down or the content has been removed. Do not be afraid to stand up for yourself. Just try to refrain from threatening people’s wives and children, but anything short of that is fine, hehe 😉

Ryan, if you found an article on Site X, why would it matter? Don’t you think the author deserves a link to his/her site?

By linking to the article, you’re not helping your fellow online marketer.

As for the DMCA, does that apply to thieves who live in other countries? How can we tell Google that someone stole our article and infringed upon our copyright? I think Google, not as a policing official but as a concerned citizen, should take complaints like this seriously.

Aaron Pratt is correct in his assertion that PR shouldn’t affect who Google thinks is the originating resource. The true originator should get the credits deserved regardless of how old a site might be or how many links point to that site.

Lee, the problem is that PR rules crawling, so high-PR aggregators get their stuff crawled more frequently than low-PR content sites. In Aaron’s case the aggregator got the “source bonus” simply because its page was fetched before Aaron’s.

It turns out that a network admin, while trying to create a “temporary” FTP access credential for a site, accidentally took it offline (yeah, I know) for about 42 minutes. If the 404 error and the 410 error are treated the same, and if Googlebot came around during those 42 minutes, then how bad could it be?

I’ve noticed that the site has dropped significantly in the rankings. Is this just a coincidence? Or is there something that I can do to rectify the misunderstanding??

I’ve got a real head scratcher for you, but first I want to thank Google for creating the new Sitemaps interface, for without it this problem would never have come to my attention.

In the Sitemaps interface I see what Google shows as our web pages’ most common words. This is a tremendous help. The number one word is “lasik,” and this makes sense as we certify Lasik doctors and talk a lot about Lasik surgery. The number three word is “surgery,” and a count of its occurrences on our web pages indicates this is about right. The word found second is “Viagra.”

Say what?

In natural language the word “Viagra” is found a total of two (2) times on our entire website. (People who use Viagra want to know if it is contraindicated for Lasik, and we have an article about it.) If I add all of the source code (viagra.gif, lasik-viagra.htm, menu, sitemap, navigation), “Viagra” comes up a total of 9 times on 6 pages. We have about 640 pages total, not counting image files, etc. Why would “Viagra” be number two when it should not even be in the top 20?

If that isn’t odd enough for you, the Sitemaps interface shows that “phentermine” is the sixth most common word and “xanax” is seventh. These words are NOT on our website anywhere. Not in natural language, and not in code.

How can Google believe that three of the seven most common words on our website are two words that do not exist and one that appears about 0.0001% of the time?

We use Google as our local search engine. I search for “phentermine” and “xanax” on our website and neither of them comes up. Google thinks they are very common, but Google can’t find one instance of their use.

We have no black hat here. We have nothing to do with these products and I’ve already told you the instance where “Viagra” is used.

What is worse is that these three terms are moving up. They were lower down the scale of common words just a few days ago. This week Google thinks we have a higher incidence of words we don’t even have on our website. I think this may explain why we have dropped from a high average of 26 (bumped to 16 one day) for a “lasik” search down to 47, depending upon the dance. We say we are about Lasik, but Google thinks we are about common “bad neighborhood” terms.

How do I get this resolved and what could possibly have caused this? It makes no sense to me at all.

Hello Matt,
I have a query. Recently I have noticed something very unusual in my Sitemaps account. One field is the web crawl report, and part of that field is Not Found pages, which shows error messages when crawlers do not find pages or for misspelled URLs. But I have noticed that these 404 errors come from links on pages that are not part of our website – they come from some article websites, where our URLs were typed with spelling mistakes in them. Maybe the referrer made a mistake in typing them. Does this affect my ranking? The number of Not Found errors gradually increases day by day – pages that are not part of my website domain but are linked from other websites. How can I resolve the problem? Is there a method, or is it a weakness of the Google Sitemaps tool?
Please give me a solution.
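One practical mitigation for the commenter above (my own suggestion, not something from this thread): 301-redirect the known misspelled inbound URLs, as they appear in the Sitemaps error report, to the correct pages, so crawlers stop recording them as Not Found. The paths here are hypothetical:

```python
# Map of misspelled paths (as seen in the crawl-error report) to the
# real pages they were meant to reach.
KNOWN_TYPOS = {
    "/artcles/widget.html": "/articles/widget.html",
    "/abuot.html": "/about.html",
}

def handle_request(path):
    """Return (status, path): a 301 for a known typo, otherwise serve as-is."""
    if path in KNOWN_TYPOS:
        return (301, KNOWN_TYPOS[path])
    return (200, path)
```

This both recovers the visitors arriving via the broken external links and lets the errors age out of the report.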