What should NOINDEX do?

Okay, this post will be colossally boring to some people. But I wanted to give you a peek at debates behind the curtain in Google’s search quality group. Here’s a policy discussion about NOINDEX and how Google should treat the NOINDEX meta tag. First, you’ll want to read this post about how Google handles the NOINDEX meta tag. You may also want to watch this video about how to remove your content from Google or prevent it from being indexed in the first place. Here’s the conclusion from my earlier blog post:

So based on a sample size of one page, it looks like the search engines handle the “NOINDEX” meta tag as follows:
– Google doesn’t show the page in any way
– Ask doesn’t show the page in any way
– MSN shows a url reference and Cached link, but no snippet. Clicking the cached link doesn’t return anything.
– Yahoo! shows a url reference and Cached link, but no snippet. Clicking on the cached link returns the cached page.

The question is whether Google should completely drop a NOINDEX’ed page from our search results, show a reference to the page, or do something in between. Let me lay out the arguments for each:

Completely drop a NOINDEX’ed page

This is the behavior we’ve followed for the last several years, and webmasters are used to it. The NOINDEX meta tag gives a good way — in fact, one of the only ways — to completely remove all traces of a site from Google (another way is our url removal tool). That’s incredibly useful for webmasters. The only corner case is that if Google sees a link to a page A but doesn’t actually crawl the page, we won’t know that page A has a NOINDEX tag and we might show the page as an uncrawled url. There’s an interesting remedy for that: currently, Google allows a NOINDEX directive in robots.txt and it will completely remove all matching site urls from Google. (That behavior could change based on this policy discussion, of course, which is why we haven’t talked about it much.)
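Since that robots.txt NOINDEX directive has never been officially documented, here is only a sketch of how a crawler might honor it; prefix matching is assumed to work like Disallow, and the function names are mine:

```python
# Minimal sketch of honoring the unofficial "Noindex:" robots.txt directive
# described above. Case-insensitive field names and prefix matching follow
# ordinary robots.txt conventions; the directive itself is undocumented.

def noindexed_prefixes(robots_txt: str) -> list:
    """Collect path prefixes listed on Noindex: lines."""
    prefixes = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # strip comments and whitespace
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "noindex" and value.strip():
            prefixes.append(value.strip())
    return prefixes

def is_noindexed(path: str, robots_txt: str) -> bool:
    """True if the URL path matches any Noindex: prefix."""
    return any(path.startswith(p) for p in noindexed_prefixes(robots_txt))

rules = """User-agent: *
Disallow: /private/
Noindex: /drafts/
"""
print(is_noindexed("/drafts/post.html", rules))  # True
print(is_noindexed("/blog/post.html", rules))    # False
```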

Webmasters sometimes shoot themselves in the foot by using NOINDEX, but if a site’s traffic from Google is very low, the webmaster will be motivated to diagnose the issue themselves. Plus we could add a NOINDEX check into the webmaster console to help webmasters self-diagnose if they’ve removed their own site with NOINDEX. The NOINDEX meta tag serves a useful role that’s different than robots.txt, and the tag is far enough off the beaten path that few people use the NOINDEX tag by mistake.
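A self-diagnosis check like the one proposed for the webmaster console could be as simple as scanning a page’s HTML for a robots meta tag; this stdlib-only sketch is an illustration, not Google’s implementation:

```python
# Sketch of the suggested self-diagnosis check: scan a page's HTML for a
# robots meta tag whose content asks for noindex.
from html.parser import HTMLParser

class RobotsMetaFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = {k.lower(): (v or "") for k, v in attrs}
        if a.get("name", "").lower() == "robots":
            tokens = [t.strip().lower() for t in a.get("content", "").split(",")]
            # "none" is shorthand for noindex,nofollow
            if "noindex" in tokens or "none" in tokens:
                self.noindex = True

def has_noindex(html: str) -> bool:
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.noindex

page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(has_noindex(page))  # True
```

A webmaster console could run this against a site’s homepage and warn the owner before they wonder where their traffic went.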

Show a link/reference to NOINDEX’ed pages

Our highest duty has to be to our users, not to an individual webmaster. When a user does a navigational query and we don’t return the right link because of a NOINDEX tag, it hurts the user experience (plus it looks like a Google issue). If a webmaster really wants to be out of Google without even a single trace, they can use Google’s url removal tool. The numbers are small, but we definitely see some sites accidentally remove themselves from Google. For example, if a webmaster adds a NOINDEX meta tag to finish a site and then forgets to remove the tag, the site will stay out of Google until the webmaster realizes what the problem is. In addition, we recently saw a spate of high-profile Korean sites not returned in Google because they all have a NOINDEX meta tag. If high-profile sites like

aren’t showing up in Google because of the NOINDEX meta tag, that’s bad for users (and thus for Google).

Some middle ground in between

The vast majority of webmasters who use NOINDEX do so deliberately and use the meta tag correctly (e.g. for parked domains that they don’t want to show up in Google). Users are most discouraged when they search for a well-known site and can’t find it. What if Google treated NOINDEX differently if the site was well-known? For example, if the site was in the Open Directory, then show a reference to the page even if the site used the NOINDEX meta tag. Otherwise, don’t show the site at all. The majority of webmasters could remove their site from Google, but Google would still return higher-profile sites when users searched for them.

What do you think?

That’s the internal discussion that we’ve been having about NOINDEX meta tags. Now I’m curious what you think. Here’s a poll:

[poll]

I’d also be interested in (constructive) suggestions in the comments about how Google should treat the NOINDEX meta tag. Try to step into both a regular user’s shoes as well as the position of a site owner before leaving a comment.

The webpage is the webmaster’s property. He placed the “noindex” on the page. That is a denial of content. Clearly communicated.

I think that it is disrespectful for a search engine to index what a webmaster asks not to have indexed. Google has “webmaster guidelines” and hopes that they are respected. What makes this any different?

I think it could be compared to a “no trespassing” sign on real estate.

I think Google should display a link; however, that link should not count towards PageRank for the page. From what I remember, this was used to help get rid of blog spam. If the link does not pass rank, it still cuts down on the spam benefit, but will allow the site to still show up in the index.

Our highest duty has to be to our users, not to an individual webmaster. When a user does a navigational query and we don’t return the right link because of a NOINDEX tag, it hurts the user experience (plus it looks like a Google issue).

How is this policy any different from when Google BANNED an entire domain because of even a small violation of the Webmaster guidelines?

Think about it!!! 😐

There was no concern about the user experience THEN – until SearchEnginesWEB started tenaciously protesting the policy.

_______________________________________________________

However, in reference to suggestions for the NOINDEX:

1- One solution would be to develop a new extension to the NOINDEX tags.
Allow webmasters to use a proprietary NOINDEX tag that would differentiate between a complete disappearance from Google’s SERPs and a referenced link. This would give webmasters the choice.

Perhaps it could be named NOINDEXDOMAIN

2- Another solution would be to ALLOW the site to be referenced in the index – BUT:

a- Do not LINK it –
b- Put a STRIKE through it
c- Make The Link GRAY instead of the Blue
d- Put ‘Expired’ after the domain

Imagine a world with 50 really important search engines. I don’t want my content included. I should be able to accomplish that with a simple noindex instruction. I should not be expected to go to each search engine that exists and each one that springs up to keep my content out.

Going back to the “no trespassing” sign comparison… if Google were a taxi service, would they drive their clients across posted boundaries without their knowledge? No, they would not.

However, that “no trespassing” sign is visible to the public. And, in the case of real estate anybody who walks past it can see it and there is no law against commenting to a friend…. “That cranky old EGOL has a no trespassin’ sign on his land”.

So, if a search engine user types a domain or a page URL into the query box, it would be OK for Google to have a special page that says… “The cranky old fart that owns this page says Keep Out”. However, that page should not contain a clickthrough link. That is like opening the gate to the posted land.

In the case of real estate, any person who isn’t blind can see across the property boundary and notice objects on my land. They can talk about them freely – no laws against it. This is the point where I believe that the technology of a search engine departs from the real estate example. When the spider arrives and requests the “noindexed” page, that noindex instruction means: “Don’t look at this.” So that page should not appear in any SERP – even if that SERP is relevant. This is the same as land with a privacy fence. Google should not know what is in there.

Don’t be evil… if a website uses the noindex tag, it means in my opinion that the webmaster doesn’t want the site in the search engine’s index.
If he set the meta tag to noindex by mistake… bad luck for him.
Google could send him an email like: “we’re interested in putting your website in our search index, would you like to be in it?” (possible spam mail?? 😉 )
So don’t force websites to show up in the Google SERPs when the meta tag is set.

This one seems pretty simple. If the page has a noindex, don’t include it in your index. While it is conceivable that someone accidentally inserts the meta noindex, it takes an overt act to do so, rather than an omission.

I suppose the rule might arguably be slightly different if the entire site has a noindex and someone is doing a navigational search. In that case, including a link with no description or title could be appropriate.

I agree with EGOL.
Since the webmaster placed a noindex tag on their website, that means they don’t want the website to show up on search engines.
You say that the numbers are small… OK, so maybe a good way would be to show some info about blocked (by noindex) websites in the webmaster console.
“Not returning the right link because of a NOINDEX tag hurts the user experience” – so don’t show it.
You cannot judge whether a website is accidentally blocked or whether the noindex was placed on purpose.
And going after more and more content regardless of your policies and your users’ agreement won’t get you anywhere.
So, I’d vote for not showing the page at all.
Regards

While 99% of me wants to agree with the majority on this one and say it should completely stay out of your index, I can see where Google is coming from with the accidental blocking.

Because of this I am in 100% agreement (that’s 1% more!) with SearchEnginesWEB: a new tag should be developed that IS a middle ground, a “don’t index me at the moment” or “revisit for index on…”, so that people who are about to put a site live can place this tag on pages, and people who do not want content indexed can use good ol’ NOINDEX.

If the noindex is there, that means they don’t want it to appear. Any time I use it, I don’t want the page to be seen in the indexes. But if that is the case, then they should also block it with the robots.txt file. I imagine this can happen by mistake: someone clicked a button in their content management system during development and forgot to turn it back on. Oops, the whole site is down. That is an easy thing to do, especially if you are developing in a live environment.

When a site has noindex, don’t show it at all. Or, if Google wants to, they can make phone calls to see if the owners really want it removed. I would give my number out for a call from the Google stars.

Yeah, this seems easy if you interpret NOINDEX to mean ‘don’t tell anyone the page exists.’ Semantically it could mean ‘don’t spider the page and cache the content’ while being neutral about whether the search engine can reference the URL in its results.

However, if the historical meaning has been ‘exclude this page from the index entirely’ then that’s what Google should do. The fact that some people forget to remove this from a staging site is THEIR issue. Google should not babysit them and, no, it doesn’t matter how high profile the site is.

Very pleased to see you guys addressing this issue, and I hope other search engines will follow suit, resulting in a single way all the engines treat noindex. Personally, my opinion is that no search engine should display a page when it has noindex on it. Even more, I find the fact that most search engines, including Google, seem to automatically assume a FOLLOW after the noindex if there’s no explicit nofollow to be very weird.

Try explaining to someone who doesn’t know about this stuff the following: “yes we prevented Google from showing that page in their index”. Two minutes later, going over the backlinks in GWT, client: “hey but there’s that page, you said you excluded it”, me: “yeah but…” a LONG explanation follows.

Furthermore: the removal tool works for 90 days only, which is not good enough in a lot of cases. Preventing a page from being spidered in robots.txt has the downside of having to put all your precious URLs in a file that anyone can see, OR having to cloak the robots.txt.

So, in all, I’d like the noindex meta tag and http header to prevent google / all search engines from displaying that page in the search results, and I’d like the default to be to assume a nofollow is present with all noindexes.

The fact that some websites get the noindex wrong by accident is a problem, I can understand, but you don’t solve a problem a minority of websites has by forcing a majority of people to change their ways. You solve that problem by educating the people maintaining those website.
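For completeness, the HTTP-header variant mentioned above is the X-Robots-Tag response header. A minimal sketch of checking it follows; plain dicts stand in for real response headers, and real-world values can also carry user-agent prefixes like “googlebot: noindex”, which this deliberately ignores:

```python
# Sketch: detecting noindex delivered via the X-Robots-Tag HTTP header
# rather than a meta tag. Header lookup is case-insensitive; "none"
# is treated as implying noindex.

def header_noindex(headers: dict) -> bool:
    """True if an X-Robots-Tag header asks for noindex (or none)."""
    value = next((v for k, v in headers.items()
                  if k.lower() == "x-robots-tag"), "")
    tokens = {t.strip().lower() for t in value.split(",")}
    return bool(tokens & {"noindex", "none"})

print(header_noindex({"X-Robots-Tag": "noindex, nofollow"}))  # True
print(header_noindex({"Content-Type": "text/html"}))          # False
```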

I think it is very important that webmasters have a way of completely removing all evidence of a content’s existence from search engines, and if noindex in conjunction with the standard robots.txt exclusions is the best way to do this, then that is good.

Options 2 or 3, where you show a link to the page, beg the question: what title and snippet would you display? I have serious reservations about your current treatment of this in regards to robots.txt exclusions, because currently it opens up the possibility of competitors using excluded pages as a way of spoofing content in the SERPs. If I can override that with noindex, then excellent.

But the treatment still seems to be the same, meaning that someone linking into excluded pages could spoof malicious titles into a competitor’s results, which opens up a can of legal worms, particularly for the finance industry.

It all went wrong for me here “Our highest duty has to be to our users, not to an individual webmaster.”

Of course it does, but time and time again Google takes this standpoint, and guys like us, who are making the sites that Google lists to best serve its users, end up feeling like second-class citizens.

If a webmaster adds NOINDEX your duty is to respect their wishes. Of course mistakes can be made and for me how to address that is the answer you should be looking for.

As someone else clearly points out, you show no duty to serving your end users if you ban a site. How about all those Germans who searched for BMW and found it wasn’t there? Of course, they broke guidelines and got the chop.

It’s just an example to point out that your statement “Our highest duty has to be to our users” is used only when it suits.

Webmasters MUST have the possibility to advise robots NOT to index a website. Google actually shows URLs forbidden in robots.txt, which surely isn’t known by a lot of webmasters, so please don’t change the behaviour of noindex as the only effective way to forbid robots to index a website!

If you really must decide that ‘well-known’ sites will have their noindex statements ignored, then find a better method than using the mediocre DMOZ resource to define ‘well-known’.
But I can’t condone doing so – you’d be like the paparazzi chasing celebrities. They are asked to go away, but they just won’t listen… 🙁

While I understand Google’s anxiety about accidental removal, NOINDEX (unlike nofollow!) does exactly what it says on the can, and does Google REALLY want to take on a policy of deciding what’s a webmaster mistake and what’s a webmaster intention?

It is quicker and cleaner than the Google removal tool – I wish all tags were as sensible.

Just re-read your list; what the other engines do actually LOOKS SILLY!

URL disallowed by robots.txt:

Google does not access the URL. Google can list the URL as a URL-only result (no snippet) in the SERPs if it sees other pages linking to the URL. Yahoo also lists it as a URL-only entry (no snippet) but sometimes goes one step further by crafting a title for the entry from anchor text found in a link (as long as the text is NOT some sort of generic “click here” or similar). I don’t like that effect.

URL with meta robots noindex:

Search engines fetch the page. They have to do that, in order to then see the meta robots noindex tag. However, Google does still cache the page internally (there was a bug a few years ago where meta noindex URLs in Supplemental index did show in SERPs for a few days under certain conditions), DOES assign PageRank to the URL, DOES follow links out from it, but does NOT show any reference to it in SERPs (except for the brief bug just mentioned). That’s expected behaviour I think.
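The split described here (fetch and remember the page internally, follow its links, but show nothing in results) can be sketched as a toy decision table; all field and key names are illustrative, not Google internals:

```python
# Toy model of meta-noindex handling: the crawler must fetch a page before
# it can see the tag, may keep an internal record and follow outlinks, but
# suppresses the URL from visible results.
from dataclasses import dataclass, field

@dataclass
class CrawledPage:
    url: str
    noindex: bool            # meta robots contained "noindex"
    nofollow: bool = False   # meta robots contained "nofollow"
    outlinks: list = field(default_factory=list)

def index_decision(page: CrawledPage) -> dict:
    return {
        "store_internally": True,              # needed to remember the tag at all
        "follow_outlinks": not page.nofollow,  # implicit FOLLOW unless told otherwise
        "assign_pagerank": True,
        "show_in_serps": not page.noindex,     # no URL, no snippet, no cache link
    }

page = CrawledPage("http://example.com/private.html", noindex=True)
print(index_decision(page)["show_in_serps"])  # False
```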

There’s a bit of a conundrum if I add a “noarchive” tag to a “noindex” meta tag, as you will have to keep pulling the page to remind yourself not to index it, because you haven’t had permission to keep a record of what I said before. I am sure that is easily solved in some way.

From this, it is hopefully apparent that what Google stores internally for later use, and what is shown in the SERPs are two different animals.

I have a question though.

I have seen a small change to the way that the robots.txt disallow directive is handled. The change happened somewhere around, or before, the 2007 October/November time frame. New URLs on a site that are already disallowed by robots.txt, now show up as URL-only entries for a few days and are then dropped. That didn’t use to happen. It appears to me, that maybe they are listed “by mistake” (and I am guessing they are from the Supplemental Index) and then another process “cleans them up” within days. I think this change occurred around the time that Matt talked a lot about “minty fresh indexing”.

The above search continues to show some URL-only entries in Google SERPs. These are all for URLs denied access by robots.txt and most are for new users or for new threads from the last few days (and which get dropped within a week or so), but there are always 30 to 60 URLs in this result.

The question is this:

If access is denied by robots.txt, and the page isn’t fetched by Google, how is Google able to judge that a page is not in English and add the translate tag to the URL-only entries for pages that are not in English?

This URL was “created” only a few days ago. The URL matches a robots.txt disallow pattern that has been in force for at least 18 months. I assume that Google has not fetched the URL, as it is robots.txt disallowed.

Is that content language guessing done on the basis of the language of the pages that link to this URL? If so, then that is a bit of a leap of faith, but I haven’t seen it guess wrongly yet.

I have just realised that the content for some of the URLs with an additional [translate] tag is only viewable to people who have logged in to the site. If you aren’t logged in, you get a “please log in” error message instead. So, even if Google had fetched the pages in violation of the robots.txt disallow, they still would not have seen the foreign language content either.

So, how is the “translate” results page able to tell me that the page was originally in Chinese, or Italian then?
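Since the question above invites speculation, here is the guess made concrete: one way an engine could estimate the language of a page it never fetched is a majority vote over the languages of the pages (or anchor text) linking to it. Everything in this sketch is a hypothetical illustration, not Google’s actual method:

```python
# Hypothetical sketch: guessing the language of an unfetched,
# robots.txt-blocked URL from the languages of its inbound links.
from collections import Counter

def guess_language(inbound_languages: list) -> str:
    """Majority vote over the languages of pages linking to the URL."""
    if not inbound_languages:
        return "unknown"
    language, _votes = Counter(inbound_languages).most_common(1)[0]
    return language

# 40 Chinese pages and 2 English pages link to a blocked URL:
print(guess_language(["zh"] * 40 + ["en"] * 2))  # zh
```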

“When a user does a navigational query and we don’t return the right link because of a NOINDEX tag, it hurts the user experience (plus it looks like a Google issue).”

Although I think the engineers in this camp have a good point with respect to the “hurting the user experience” side, I’m not sure that it “looking like a Google issue” is a reasonable criterion. When something “looks like a Google issue”, I believe this doesn’t really have much to do with user experience (save for those users who might waste some time contacting Google to tell them about the “error” — does this happen?) or webmaster interests.

Ultimately, I think you have to assume that the vast majority of instances of NOINDEX will be legitimate, and intentionally implemented. If a webmaster doesn’t want a page indexed, it’s probably very frequently precisely for user-experience reasons. Perhaps the webmaster doesn’t want a user fumbling around on an uncompleted site, or doesn’t want a user ending up at a page that is likely of little to no use to him or her if he or she ended up there based on a Google search.

Further, for the vast majority of users, does it _really_ improve experience to get a blank “reference only” result? Or does this just add to confusion?

When I started writing this post, I didn’t have a strong opinion, but I think I’ve just convinced myself that Google should leave things as-is.

I understand that Google wants to seem to have all the websites out there indexed, but NOINDEX means exactly that. I think you should respect it.

Why don’t you run an internal project, and if you’re getting a lot of NOINDEX pages from .ac, .edu, .gov, .mil domains during crawling, then simply send the webmaster an email asking if they intended to stay out of the Google index by using the noindex tag. If no, the email will serve to remind them to remove the noindex. If yes, then you have your answer and can behave respectfully to the webmaster by following their wishes. I’m sure this can all be automated too, which is true to the Google way.

Don’t show the link at all. Who is to say what is an “important” website? Does an important Government website that has noindex on some login only pages have those pages listed when it has specifically asked for noindex? If I have my cms admin/subscriber only pages hidden, nofollowed and noindexed etc, then I certainly do not want those pages to get indexed in any way even if there is password only access to them. If some random admin/mod links to them, I certainly do not want such pages showing at all, despite the inbound links.

I assume that such inbound links would be listed on the webmaster tools – make it that little bit easier to find and get rid of them.

I understand the argument about the Korean websites you mention. But I consider it a webmaster/SEO job to right the issue. If there are no results, anyone worth their salt can find the issue and correct it – if that is what they want. I have found and corrected many issues, even 404’s, and 500 headers being shown for all pages of cms driven websites. The line of webmaster responsibility has to be drawn somewhere, and I consider the purposeful inclusion of a tag should be respected.

I object to such an issue even being debated while Google is effectively no-indexing sites due to the duplicate snippet issue. Yet another client complained of substantially lower rankings today, and a simple search found websites that had copied the content. A previous client’s issue, to which no comment was received: http://www.mattcutts.com/blog/duplicate-content-question/#comment-122817

@Matt, I hope you leave it as is. Sure, I see situations when it is added by mistake, though not always by the blog owner; there was that major problem with many Blogspot blogs last year, for instance. And WordPress, by making it easy for people to add this, should also make it clear when it is switched on.

@g1smd

Google adds titles to URLs blocked by robots.txt based upon anchor text as well.
For an example, search for WordPress SEO, and you will see the paid review I recently blocked with robots.txt, with anchor text included.

I think it’s a pretty clear-cut case: webmasters put “noindex” in their pages because they don’t want the page indexed or shown. As you can’t know whether a webmaster perhaps accidentally put the noindex there, you have to err on the safe side and do what you’re told. Or how would you feel if people started to interpret Google’s terms of service as, “oh, maybe their lawyers just misspelled this and really mean something else, I’ll ignore it”?

Also, please do not try to push webmasters to always use a Google tool — like a URL removal tool — to do stuff; while Google search is close to a monopoly there are still other engines out there, and webmasters have better things to do than toggle a dozen tools’ configurations.

As far as a middle ground goes, I kind of like Egol’s suggestion to include a notice in the style of “some pages aren’t shown because they use a ‘noindex’ tag”, similar to how you disclose it when you censor e.g. the Human Rights Watch organization in China.

Like EGOL, I say respect the noindex tag; it’s a no-trespassing sign. The people that OWN the site placed it, or had it placed, there for a reason. This is the internet version of a fence, or of being placed on the do-not-call list. Violation of either can get you fines and/or jail time. While the end user is Google’s ultimate perspective and focus, the OWNER’s implicit statement of noindex means stay out, and it must be respected as a willful and knowledgeable decision about their site’s privacy. If any search engine is willing to violate a site owner’s privacy, then violating the end user’s privacy is sure to follow.

On a mass scale, Google (or any SE for that matter) cannot determine the reason a webmaster put the noindex tag there. With all the attention and importance being placed on privacy (see “E-mail (required, never displayed)” on your own comment form), Google has the duty to respect any and all noindex and nofollow tags. They’re there for a reason (excluding mistakes). If they put the tag there by mistake, they’ll know it soon enough and will correct it.
All search engines have to keep in mind that people will put up sites that are private and should not be available to the general public. I know of many sites and message boards that are populated only with family-related information, and they HATE anybody being on the boards who is not family. The information is private and should be kept that way. The thought that Google is toying with the idea of intentionally violating that privacy is scary and wrong.
If the noindex tag and the privacy it affords is not honored, what’s next? Indexing someone’s PC to improve end-user search results, like Microsoft seems to be considering? (I’m in the process of moving to Linux because of that statement alone.)
As a general policy, all search engines should respect the noindex tag, PERIOD. If it’s a large site and you think they did it by mistake, send them an email to inform them of it. That way the onus is on them, not Google.

If you choose the middle road, you may want to consider using a different metric than a listing in the ODP (possible alternatives: PageRank comes to mind, or comScore, or Compete).

As much as I want my site listed there, with a possible wait of over two years for a listing, and even then no sure thing, the directory just isn’t a reliable source on the web. It’s dated on any given day. It’s the internet of two years ago (OK, that’s a stretch).

>> Google does not access the URL. Google can list the URL as a URL-only result (no snippet) in the SERPs if it sees other pages linking to the URL.

Actually, Google behaves similarly to Yahoo!: if the search text matches some link text pointing at a blocked URL, Google uses the link text as a title. I think this is very bad practice. See my examples higher up this page.

>> If access is denied by robots.txt, and the page isn’t fetched by Google, how is Google able to judge that a page is not in English, and adds the translate tag to the URL-only entries for pages that are not in English?

I also spotted this on some of the blocked results for the AdWords professional status URLs coming from Google’s own website, but overlooked it. In the SERPs, the [ Translate this page ] link assumes there is Dutch content on some of the pages; however, it doesn’t work on all the foreign-language URLs: it missed quite a few Swedish and Spanish ones. Some of the professional status URLs contain an hl= interface-language parameter, but not all of them do, so my guess is it infers the language from the page or from link text pointing to the blocked page.

I’m confused why there’s even a discussion. It seems like it should tell robots not to index just like its name says. Same idea as excluding by robots.txt.

The only time I really use this tag is when I’ve got multiple URIs to the same page. For example, some ecommerce sites (osCommerce) put the path to the item as part of the query string, and that can lead to duplicates. The noindex tag is the easiest way I’ve found to keep certain variants of that from being indexed.
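As a sketch of that setup, assuming typical osCommerce-style parameters (cPath for the category path, osCsid for the session id; both names are just examples), one could compute a canonical variant and emit the noindex tag on everything else:

```python
# Sketch: noindex every non-canonical query-string variant of a page.
# DUPLICATE_PARAMS holds osCommerce-style examples, not a fixed standard.
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

DUPLICATE_PARAMS = {"cPath", "osCsid"}   # navigation-path / session params

def canonical_url(url: str) -> str:
    """Rebuild the URL with duplicate-producing parameters removed."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k not in DUPLICATE_PARAMS]
    return urlunparse(parts._replace(query=urlencode(kept)))

def should_noindex(url: str) -> bool:
    """Emit the noindex meta tag on every non-canonical variant."""
    return canonical_url(url) != url

print(should_noindex("http://shop.example/product.php?products_id=7&cPath=3"))  # True
print(should_noindex("http://shop.example/product.php?products_id=7"))          # False
```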

And I understand the argument of user experience, but shouldn’t it be up to the page author to decide if their content is included? If I don’t want a page, or a whole site, included in search engine indexes, that should be my decision, not the search engine’s.

Thanks for asking! Of course Google has to respect NOINDEX as is. The sole question is how to enhance navigational SERPs when the best result is NOINDEX’ed. A message like “the best matching result is unfortunately blocked by the site” would suffice IMO.

If you really can’t live with that, then support a robots.txt directive like
NOINDEX=NOREFERENCE: /
respectively
NOINDEX=REFERENCE: /
so that webmasters can decide whether or not they allow you to mention references to forbidden stuff on your SERPs.

Google’s handling of the implicit FOLLOW is perfectly in line with the standards, and you really do not want Google to change that. Think of internal hubs, site maps and such on large sites where you can’t output a gazillion links in human-readable format, for example. NOINDEX,FOLLOW is a perfect way to serve bot maps without bothering/annoying visitors/searchers, and there are many other use cases too.

While on the topic of NOINDEX: I had a blog whose index page was set with NOINDEX, as I wanted only real pages indexed, not the summary index page. However, that noindex setting prevented blogsearch.google.com from indexing my feed. I confirmed with a few other blogs as a test, and indeed, once we removed noindex from our home pages, blogsearch operated as expected, and almost immediately.

Actually, I’ve been thinking about this a lot this morning: It’s GOOGLE’S product, and Google is free to do with that product what they wish, as you keep telling us about paid links.

I do like that you ask webmasters’ opinions Matt, I really do. I respect that. But it seems like a one way street unfortunately. Webmasters are using tools and coming up with ideas to make the Google product better (in the hope that somehow it’ll remove the crap above them in the SERPs), but there doesn’t seem to be a lot of love going the other way, huh? We even create content to make it easier for Google to index and have better quality SERPs. However, if we play completely by Google’s rules, do we get a ranking benefit? Nope. If we step outside of the lines, do we get penalised? That’s the threat, although I never see any of my finance competitors getting busted over link schemes like DPA.

So, I’m sort of led to ask – what’s in it for the webmaster, exactly, to keep tweaking their settings in order to allow Google to have a better product offering? We always hear the “making the internet a cleaner place for users” argument, but webmaster suggestions and efforts help to make Google a multibillion-dollar industry. If webmasters all refused to set their preferred canonical domain, or put nofollows, noindexes, good robots docs, etc., would the Google SERPs be of the same quality? Would the very obvious spam that exists in the Google index be gone? Would you be prepared to give back to those people that help Google out by adhering to your rules?

I think NOINDEX should continue to work as it currently does. However, it might make sense to add a third possibility, LIST, which would not analyze the page but would allow the url to be listed in Google similar to the way Microsoft and Yahoo currently list NOINDEX sites and the way Google lists sites where it has found links but has not yet crawled the site.

Beyond that, if Google Webmaster Tools is not currently doing so, it should have a check for the NOINDEX tag and alert the webmaster to its existence. While not everyone who might accidentally add the NOINDEX tag uses Webmaster Tools, you can help at least some of them. Google Adsense and Analytics would also be great places to add checks and warnings, as I’m pretty sure far more people use one of those on a regular basis than Webmaster Tools.

The middle ground of using the Open Directory sounds like an interesting idea, but I am concerned that a site could get submitted to it when the website owner really meant NOINDEX. I know that this doesn’t help Google with its customers, but I don’t think the customers’ wishes should supersede those of the probable copyright holder.

Like other commenters, I don’t believe using the url removal tool is a realistic or reasonable alternative to the NOINDEX directive. The tool only works for Google and is not permanent. In addition, the website owner needs to use Google’s Webmaster Tools to reinclude content. Webmasters should have the ability to exclude and include content through the use of meta tags and should be able to make that information available to all search engines, not just Google.

My vote is that it should not index anything, as advertised. Maybe a noindexunlessinthedmoz is needed or noindexunlesslinkedto etc, but short of that the tag should work the way it is named.

As far as Google looking bad, I’m not so sure that is a real concern. Plenty of sites get removed all the time, not for quality issues but for spamming issues that are completely invisible to the average user, who must wonder where the site went. Sites that are not fully indexed give incomplete results from the site: search function. Daily, people don’t understand why Google of all engines cannot find their links in the GWHG. PageRank is shown as the “importance” of a page, yet only the SEO community knows it is rarely updated and often wrong. The last crawl date in GWT is often quite wrong. So having supposed inaccuracies is not a new problem at Google; as a matter of fact, it’s even designed into the algo (e.g. link:), and those don’t carry a warning like “Google has purposely withheld showing all of the links pointing to this site.”

If you want to tackle confusion on webmasters’ part, rather than mess with a directive which seems pretty clear to me (noindex == don’t index), I’d take a look at “none”. I’ve seen far more people accidentally using that one thinking it means “no restrictions” rather than its actual meaning of “noindex,nofollow”, as seen in this blog post by Vanessa: http://googlewebmastercentral.blogspot.com/2007/03/using-robots-meta-tag.html
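To make that confusion concrete, here is how those meta values look in a page’s head (a sketch; per Vanessa’s post linked above, “none” is shorthand for “noindex,nofollow”, not “no restrictions”):

```html
<!-- Keeps the page out of the index entirely -->
<meta name="robots" content="noindex">

<!-- Looks like "no restrictions" but actually means noindex,nofollow -->
<meta name="robots" content="none">

<!-- The actual "no restrictions" value (and the default anyway) -->
<meta name="robots" content="all">
```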

“…aren’t showing up in Google because of the NOINDEX meta tag, that’s bad for users.”

Granted, the user finding what they need is important. But if a site has deliberately and clearly told Google they do not wish to be indexed – ESPECIALLY big, high profile sites which have obviously made a decision in this regard – then that is their prerogative, regardless of anything else. I think it’s really dumb to do so, but people have the right to be dumb and not to have their wishes ignored. As many previous commenters have said, it’s like ignoring a no trespassing sign.

I vote don’t show the page but allow it to be crawled, as you do currently. This is handy for people who don’t want the website indexed (you can put the tag on every page) but want to check the Sitemaps files, 404 errors, etc. that Webmaster Tools offers, which robots.txt exclusion does not allow.

I basically use a staging domain with robots noindex,nofollow on every page, add the sitemaps to Google Webmaster Tools, and then you can see if there are any Sitemaps errors, 404 errors, etc.

I vote that the page should not be seen, as requested by the webmaster. This has been around for many years. The webmaster is the owner of a site, and this is (as someone mentioned) like putting a “No Trespassing” sign on the front lawn. Reasonable, lawful people are expected to obey it, and it is a reasonable expectation that the search engines are reasonable, rule-following online citizens.

Likewise, I don’t think it should be completely removed from Google’s internal indexing; it should just not be shown. Leaving it in an internal index, flagged as NOINDEX, would let Google know, when the page gets a link from another site, not to bother showing it. Likewise, the internal reference would allow the page to be rechecked, although not as frequently, to see if the NOINDEX was removed. (This allows other issues, like webmaster mistakes, to heal naturally over time.)

I would say a regular user has no real idea how a search engine works. They will often expect that as soon as a page is uploaded to the Internet, all of the search engines know about it. Likewise, they have a love/hate relationship with search because it doesn’t always give them the best result first, at the top of the page. Most users don’t understand detailed queries, or modifiers like the quote or minus sign. So how can we expect them to understand that someone doesn’t want a site/page indexed?

I think that a NOINDEX tag should do just that. I don’t feel it should be changed to help out sites that don’t really know or can’t figure out what they’re doing wrong – like the Korean sites you mentioned. Who is to judge whether a site is important enough to have it re-evaluated?

Hi Matt –
I think that Google should completely drop the page. With that being said, I like the idea of putting a warning/error in the Webmaster Tools that shows which pages are noindexed. If the homepage is noindexed, then the error can be prominently displayed in the overview for that domain in the Webmaster Tools. When a user sees that, they can dismiss the warning or keep it there until it’s fixed.
-Aaron

Matt, I naturally would prefer not to have meta Noindexed pages showing up in the results (as I voted).

But, it’d also be helpful IMHO if you crawled links on NOINDEXed pages, if they specify FOLLOW. In the past, it appeared to me that NOINDEX,FOLLOW might not actually be followed. Or, if the links are followed, the PR doesn’t seem to flow through the NOINDEXed page, so the followed pages might as well not be indexed, either…

This would be helpful to us when constructing sites where we need to link up large amounts of pages, though the intermediary branching pages are not worthwhile to have indexed. I think this would help in terms of quality, since the pages would be of low worth, while the destination content pages they link to may be entirely worthwhile.
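The combination described above would look like this on the intermediary branching pages (a sketch; whether PageRank actually flows through such pages is exactly the open question this commenter raises):

```html
<!-- Hub page: keep it out of the index, but crawl the links it contains -->
<meta name="robots" content="noindex,follow">
```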

Pratheep, Yahoo has proposed something: http://www.ysearchblog.com/archives/000444.html . However, the last time we checked, less than a thousand sites were using those special tags. Given the small number of sites involved versus other possible uses of engineering resources, so far we haven’t added support for Yahoo’s tags.

“Out of interest, why has it suddenly become a hot issue? Is it a ‘reputation’ thing – or a realisation that people in other languages may be getting it wrong disproportionately?”

Andrew Heenan, it’s not a hot issue, but every month or so we get a small trickle of sites that are shooting themselves in the foot. “Why didn’t Ben Harper’s site show up? Did bmw.dk remove itself on purpose? These Korean sites have dropped from our index.” Every time we see a site that appears to have made a mistake, it re-opens the conversation.

“As for the comment that it makes Google look bad.. No, it doesn’t. It would if Yahoo listed the page and Google didn’t, but since NOBODY is indexing the page it’s meant to not be found.”

Tim Linden, part of the issue is that Yahoo and MSN do return links to NOINDEX’ed pages, so people sometimes think that it’s a Google issue.

Don Macaskill, thanks for weighing in with a good point about SmugMug. I had thought of the case of parked domains, but not privacy controls on profiles.

If it has the tag, don’t index it. However, if you want to be nice to webmasters who have somehow put it in their code without knowing, why not send them an email telling them to check if they don’t want indexing?

Everyone, I really appreciate the thoughtful comments here. There is a fundamental tension in this case between an easy-to-use tool for site owners vs. more information for users. I agree with many people here that there should be an easy way to remove all traces of a site from Google. I do however also feel the pain of search ranking people at Google who see a navigational query for e.g. a well-known government website and we might not return the answer that the user was looking for because of the NOINDEX meta tag.

Surely that’s an issue with education of government webmasters? It’s exactly the same in the UK, maybe worse, because the Government hands out all these red-herring eGov metadata standards for them to obey, which puts them out of touch with the actual public user experience of their websites.

Wouldn’t it be easier and more consistent, if a page or a website should not be indexed, to just use the robots.txt file? Rather than adding NOINDEX to every page, everything can easily be removed from one file.

I’ve come across many sites that are not indexed when they should have been, only to find out that they have a NOINDEX tag on every page. Lately I’ve been finding the tag in the middle of the page’s code rather than at the top, where it should be.

Like said before in the comments: “NOINDEX means no index. Why confuse the matter with gray areas?” It is a computer command, and should do what it is programmed for. Google does have the right to ignore it, though, and make their own rules on how they read a website.

Consistency often equals quality. If Google is not consistent with their method of handling items like this, the lack of quality will quickly cripple the high horse.

It’s currently very difficult to keep a site off Google – please don’t make it harder. NOINDEX should not show up in results at all, no matter how many people link to it. All it takes is one Slashdot story, and suddenly lots of people are linking to some confidential information and now they can find it with Google.

I have a site I want private. I had NOINDEX on every page and a very restrictive robots.txt (told everyone to go away), yet the site still showed up on Google 🙁 Google said that since my robots.txt told them to go away, they couldn’t find my NOINDEX directives, so they displayed the site links even though they didn’t show any snippets. That was when I learned about the secret NOINDEX robots.txt rule, just for Google (how annoying!).
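For anyone in this situation, the Google-only robots.txt directive mentioned in the post would look something like this (a sketch; the path is illustrative, and Noindex in robots.txt is unofficial behavior that, as Matt notes, could change):

```
User-agent: Googlebot
# Unofficial, Google-only: removes matching URLs from the index entirely
Noindex: /private/
# Standard: stops crawling, but uncrawled URLs can still appear as bare links
Disallow: /private/
```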

Please help us to keep some of our content out of Google, we should have control over that in a standard way.

and don’t even get me started on there being no easy, officially supported way to do this with RSS feeds (not that feed search engines seem to honor it anyway)!

ark, if you want a site to be truly private, the best way is to just set HTTP basic authentication on it. If it showed up in Google at all, then that means there’s a link to it somewhere on the public internet; if Google can stumble across it, so can others.
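A minimal sketch of that suggestion for an Apache server; the file path and realm name here are illustrative, not prescriptive:

```
# .htaccess -- require a login before serving anything in this directory
AuthType Basic
AuthName "Private area"
AuthUserFile /home/example/.htpasswd
Require valid-user
```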

NOINDEX from my point of view is self-explanatory – that is: do not index!

Though Google does not show my NOINDEX pages in the SERPS, I still wonder why terms from those pages show up in [b]Google Webmaster Tools > Statistics > What Googlebot sees > In your site’s content[/b]. I said NOINDEX!

I was wondering if you will be writing soon about your take on the newest activity being promoted by a group of blog and website marketers. The specific new thing they are promoting is an extension of the do-follow blogs. Here is the twist: they are promoting a do-follow / no-follow plugin. This allows them to trade blog comments with each other and, in their words, link build, while putting nofollow tags on any of their competitors who choose to comment on their blog posts.

People here want NOINDEX to remove the pages because they understand the SEO implications and NOINDEX is a very helpful feature when used properly. However your Korean example indicates there is confusion out there which means some sites are using NOINDEX without realizing the implications, which are severe for sites and users who want to be found/find things.

IMHO the solution is the same as for other challenges with Google ranking issues – much more robust automated communication with sites. Both via Webmaster Console and an email to the whois contact Google should have a “Webmaster Alert” system that sends out basic obvious information about the status of a site that is not indexed, is perceived to be in violation of guidelines (which is often unintentional or due to confusion), or is downranked as part of the infamous algorithmic mysteriousness.

This scalable, human-free approach would save the webmaster community *millions* of hours each year trying to diagnose problems to please Google. It would also help Google because (again just IMO) the lack of broad based, effective communication with webmasters has more potential to hurt the company than is currently realized internally.

In the pre-Google days, we used NOINDEX,FOLLOW at ibm.com because we found search engines were ranking pages that listed products higher than the intended product page. You couldn’t use robots.txt because you wanted to mark “/products/” as non-indexed, but allow content under “/products/” to be indexed (e.g. “/products/thinkpad/”). I think it’s still valid in a PageRank-influenced era. If “important” sites are missing from search results because they’ve chosen to add NOINDEX, then they’re shooting themselves in the foot, but you can’t really know that you’re improving UX by surfacing those results.

Relevancy in the SERPs is the main question. How can you better produce the results users are looking for? That should be the fundamental question. Lately, I have noticed some great changes in the SERPs for queries; some results are good, some are bad. It behooves me to ponder on this quagmire that I, as both a user and a webmaster, have to seriously analyze the data that I find online, despite all these proclamations of how advanced the technology has become. I sometimes have to go past 5 or 6 pages to find simple, topic-specific results in Google. Until the search engine is perfect at assessing human behavior online and able to serve customized SERPs to each individual, we have to assume that this form of AI is inept at understanding complex human behavior; hence the machine should be taught, and we should give it specific instructions. 0 is OFF, 1 is ON.

I voted for “Show a link to the page”. The reason is that the site owner doesn’t own the links on other sites. So Google has the right to index the links. Links are, in the end, content too.

Of course you do have the issue that the content of the page could be unrelated to what the link suggests. That can even get you in trouble in case people (the spamming type) try to abuse this logic.

Therefore Google has to check the page with the noindex tag. Noindex means: don’t index this page. But it doesn’t mean: Don’t visit this page. So Google is allowed to get the content of the page to determine a relevancy score for the link that’s pointing at it. After that it can just forget about the page again.

I guess it’s like “Howard Hughes” You couldn’t “index” the man, but that didn’t mean you weren’t allowed to talk about him.

I look at it like this, and I know I’m not the only one…the problem isn’t the attribute, but the underlying logic behind it.

NOINDEX is an “opt-out” attribute…which is fine if there’s an “opt-in” clause in the first place. The very logic behind search engines crawling and indexing sites (and for that matter, other bots) is fundamentally flawed in that it’s done based on a single hyperlink, and not always with the prior approval of a webmaster. There are cases out there where sites can’t get into search engines because they don’t have at least one inbound link from a page that is crawled frequently; and there are those sites that have appeared in SERPs by accident (the example that comes immediately to mind is when a webmaster asks a question about a site under construction to try and get some help.)

In other words, take NOINDEX a step further, and don’t index anything before a webmaster/SEO uses the Webmaster Tools to sign up, say “I want this site indexed, but only Pages A, B, and C, not X, Y, and Z”. In other words, use the technologies that you have in place to create an opt-in environment and allow users to indicate what can and can’t be indexed…not the present “absence of a ‘no’ is a ‘yes’ answer.”

You’ve got the tools.
You’ve got the technology.
You’ve got the database size and algorithms to be able to endure the short-term hit that would happen.
And most importantly, people wouldn’t be able to hide as well…the ones that have something to hide, that is.

I think that NOINDEX works fine the way it is. I don’t like the idea of google making links to the noindex’ed pages mostly because it can serve useful purposes.

For example, say you wanted to test a marketing campaign and you set a url variable to track that. (ex: http://foo.com/index.php?tracker=google_adwords) Now, if the page (index.php) is already indexed, the tracked variant would show up as duplicate content. However, if you attach a NOINDEX to index.php when the tracker variable is set, then you don’t have dupe content issues anymore.

In the example above, if the Google index had links to the noindex’ed pages, it could create potentially hundreds, or thousands of duplicate pages that Google would be linking to. That would not be better for the user, and in my example, creating a robots.txt file to track every single index.php variation with a tracking variable would be inconvenient at best.
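The tracking-parameter scenario above can be sketched in a few lines. The parameter names below are illustrative, not a real list, and the server-side code (PHP in the commenter’s example, Python here) would emit the tag when rendering the page:

```python
# Sketch of the commenter's idea: pages reached with a tracking parameter
# get a robots noindex tag so they don't create duplicate content.
from urllib.parse import urlparse, parse_qs

# Illustrative parameter names, not an authoritative list
TRACKING_PARAMS = {"tracker", "utm_source", "utm_campaign"}

def needs_noindex(url: str) -> bool:
    """True if the URL carries any tracking parameter."""
    query = parse_qs(urlparse(url).query)
    return any(param in query for param in TRACKING_PARAMS)

def robots_meta(url: str) -> str:
    """Meta tag to emit when rendering the page at this URL."""
    if needs_noindex(url):
        return '<meta name="robots" content="noindex">'
    return ""  # no tag: the page stays indexable by default

print(robots_meta("http://foo.com/index.php?tracker=google_adwords"))
print(robots_meta("http://foo.com/index.php"))
```

This keeps the canonical URL indexable while every tracked variant excludes itself, with no need to enumerate variants in robots.txt.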

A better solution would be for Google to create a site that does a better job of explaining all of this technical stuff to webmasters. Webmaster central offers great tools, but perhaps a companion site that would offer best practices and explanations beyond the webmaster guidelines would help. Also, if Google notices some important sites that have noindex’ed themselves out of the index, perhaps Google could send them an email notification? Either way, educating webmasters with better information will go farther than changing the current system.

1. I like that Google is asking for information on this topic. There was a time in Google history that this type of open communication wouldn’t have happened. I feel you and your blog are a key reason Google is reaching out to webmasters. Thank you.

2. NOINDEX means NOINDEX. I have a lot of personal photos (no not those) on the internet and I don’t want people finding them when they Google my name or my son’s name or my family members. It’s meant for family and friends not for the world. I NOINDEX those because Google doesn’t need them, nor any other search engine. It’s my content not Google’s.

3. You need a tag for DUPLICATE CONTENT. I took on the role of SEO Manager at Tribune last week and we have a ton of duplicate content that isn’t going to go away do to business reasons. Example, L.A. Times writes a story and the Orlando Sentinel may pick it up. Both Tribune properties and both want the content on their site. I’d like to have a meta tag that tells Google that although the information on Orlando Sentinel is important and should rank higher in say Orlando for the article, if it is someone searching in Maine then the L.A. Times is the real source of the information and L.A. Times should get the priority not Orlando Sentinel. I’d like a ‘trackback’ option for web pages. It would tell Google that here is some good content that may be useful for people that prefer the Orlando Sentinel (browsing history, locale, other sites that they visit that point to Orland Sentinel, and a myriad of algorithmic possibilities) but for every neutral party . . . send them to L.A. Times because they wrote the article and they should get some extra special treatment for doing so. I’m going to attempt to accomplish this via HTML coding but it would be a better user experience if Google had a tag for ‘SOURCE’ or ‘DUPLICATE CONTENT SEE XYZ’ etc. I’d think it would make Google’s life easier too.

A note at the bottom of the SERPs if a site has been omitted due to the NOINDEX tag. You said yourself (to Eric Enge) that a NOINDEX page still collects PageRank and passes PageRank so I’d think a webmaster would want to know that they could’ve ranked #3 for a particular search had they not put a NOINDEX on the page.

So . . .

“Some results have been omitted per the request of the site owner. To view Google’s policies on this matter, please see our Webmaster Guidelines.”

I use this for pages that I DO NOT WANT INDEXED, so please keep it that way. For instance pages the content of which I have on two places, and I want the other page indexed. Or as an alternative for a 301-redirect, in cases of clients with servers that don’t give the option of a 301-redirect.

There is no reason to put this tag on anything one does want indexed, so please just respect the webmasters on this one.

It’s one thing I adore – Matt answering the first comments at a newborn post. 🙂
I once even wanted to parse the blog and make some stats available on digg.com, so people see and know what this G guy is about.
Asking for tips is a nice way. But nobody has answered user questions at the reinclusion request post for years now. And it’s not the way people should be treated, because you don’t get an answer at Google Groups either.
If the site was in the Open Directory, then show a reference to the page even if the site used the NOINDEX meta tag – it does make sense to me.

“Everyone, I really appreciate the thoughtful comments here. There is a fundamental tension in this case between an easy-to-use tool for site owners vs. more information for users. I agree with many people here that there should be an easy way to remove all traces of a site from Google. I do however also feel the pain of search ranking people at Google who see a navigational query for e.g. a well-known government website and we might not return the answer that the user was looking for because of the NOINDEX meta tag.”

Bolding by me.

I don’t agree at all that all traces of a site should be removable from Google, or from anyone else for that matter. A site owner cannot dictate what others show on their website. A link on the website http://www.google.com has as much right to be there as a link from http://www.whateversite.com. Why would a website owner have the right to tell anyone not to link to his content?

Google’s only concern should be the relevance of its results. And determining the relevance of a link doesn’t require the page it links to, to be indexed. All you need to do is visit the page to determine the relevance of the link. That’s not indexing; that’s just visiting the page and using its content as a base of information to determine the relevance of the link.

This issue is much more fundamental than just what webmasters want. A webmaster has no right to say to anyone to link or not link to their pages and they even more so can’t tell some yes and others no.

Pages online that are not password protected are simply publicly available, and you can’t tell anybody not to visit them. Even a noindex tag can’t forbid a search engine to visit the page; it’s Google’s choice not to add the page to its index. Which you don’t anyway. But that does not imply you can’t use the content of the page to determine the relevance of a link that points to it.

I don’t see the problem here to be honest. A webmaster would want to forbid Google to link to a page? Makes no sense at all. If it’s that important to them, they should program their server to redirect a visitor that came from Google back to Google, or to a page that says: “Sorry, you came from Google, you can’t see this page, why don’t you try to find another site that links to us. In that case you’re welcome.”

That’s so ridiculous that it’s obvious this whole issue is a non-issue.. 🙂

First off, I have no problem with NOINDEX acting exactly the way it currently does. It ain’t broke, IMO.

Next, you need to be very careful with statements like “Our highest duty has to be to our users, not to an individual webmaster”. That position is just not ethical, IMO. If a webmaster does not want her content to be found by Google’s users, then who is Google to say otherwise?

Having said all of that, it may be worth you considering a couple of things that are “middle ground”.

1) How closely have you considered the difference between a URL and the content at that URL? Perhaps it would be OK to index a URL, but just not the content at that URL … analogous to the Partially Indexed Pages that arise from current robots.txt treatment. i.e. NOINDEX could literally mean “Don’t index the content”, not “Don’t index the URL”. I think it’s very important that Google a) doesn’t index the content, b) doesn’t show snippets of the content, and c) doesn’t provide a cached link to the content. Of course, b and c are impossible assuming the first condition is a given.
2) Maybe option 1 is going a little too far in favour of the user, but you could apply the same thinking purely to the home page of a site. This would alleviate the problems seen by http://www.police.go.kr/main/index.do, http://www.nmc.go.kr and http://www.yonsei.ac.kr, all of which are home pages. (http://www.police.go.kr/ -302-> http://www.police.go.kr/index.jsp -302-> http://www.police.go.kr/main/index.do)

Send my regards to Adam, Luisella and Fili who I met at SES last week. They all seem like nice people. Shame you could not make it, though. 😉

The fact is that sometimes the meta robots tag offers the only means for preventing indexing. Simplest example? Suppose you don’t want the home page (/) to be indexed, but you do want every other page on the site to be indexed. You can’t do that with robots.txt.

Also, on some highly dynamic sites, the meta robots tag is extremely useful … e.g. on a site with three optional query parameters, each URL can appear in 2³ = 8 combinations of those parameters. The robots meta tag is invaluable for preventing most of those combinations being indexed, something that would be very difficult if not impossible with robots.txt.

In other words, sometimes robots.txt is just not an option. For this reason, it’s essential that noindex offers at least the same protection from indexing as robots.txt – which means partially indexed pages, at worst.
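To make the home-page example above concrete: standard robots.txt rules are prefix matches (some engines support nonstandard extensions like “$”), so there is no portable way to exclude only “/” without excluding everything beneath it, while the meta tag is naturally per-page (a sketch):

```html
<!-- On the home page only: this page is excluded, while /about/, /products/,
     etc. stay indexable. Standard robots.txt could not express this, since
     "Disallow: /" is a prefix match that would block every URL on the site. -->
<meta name="robots" content="noindex">
```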

How about adding a new variable in Webmaster Tools which allows site owners to reference some default content?
For instance, if a site owner applied a nofollow rule for a particular page, they could reference some default page/content; or if s/he decided to remove the entire site from G, then at least they could reference a link to some other SE where users can locate their site. As Matt C mentioned, G cares about satisfying the user experience, so it shouldn’t concern G which source a website owner wants to direct users to.

One thing I’d like noindex (or an additional method) to do is simply not follow the link at all.
robots.txt is a holdover from the old, old days (okok, 1994); it is not an official standard, and it fails in the modern dynamic world.

Now that we have pretty URLs, without ugly question marks and .cgi extensions, search engines should have a more intelligent method to understand which pages are good to visit, and which are not.

Think of sites that have /calculateMyExpensive report, then add 300 different languages, and end up having multiple different URLs. Describing that in robots.txt is overhead, forcing it into rigid structures is not flexible, and spiders/search engines still read links that nowadays are quite often dynamically generated.

So if search engines could decide on robots.txt 15 years ago, maybe it is time to decide on a dynamic way to suggest methods to crawlers? 🙂

I’ve encountered a few sites that use things like ‘rel=noindex’ and ‘rel=noindex nofollow’ and I’ve never understood how that works. Nofollow is pretty straightforward, but how would noindex work on an outbound link?

Matt,
I was listening to a commercial on XM about small business, which we are one, and they said something to the extent of “Qualify your customer’s needs, then do all you can to satisfy those needs.” The first thing that came to mind was Google. Kudos to the G.
Now, this may be a sidebar, but I think this falls in line with the discussion of NOINDEX and whether the SEs really adhere to the requests of the webmasters.
Over the weekend I found some pages that Yahoo has indexed in a directory that is clearly marked not to follow in the robots.txt file on the site. But they indexed some of those pages anyway.
IMHO, I think that we as webmasters are trying to satisfy the customers’ needs and control what the SERPs deliver up to your and our customer, because we really do share the customer, if you really think about it.
For us, when we say please don’t index, we have a good reason for it. It would be nice to have some standardization on how to handle these commands, agreed to by all of the major players, much like sitemaps.org came together. Wait… isn’t that what the W3C is for?
But what do I know, I’m really just the marketing director.
Let me know when you get back in the KY area and I’ll take you and yours out for some ribs or BBQ. I know you’ve been missing that. TTFN. 🙂

I can’t believe your lawyers are even letting you have this conversation?? If I put a NOINDEX tag on a site and you blatantly ignore it so that you can display the content to your users (justified by you putting their wants over my requests), that opens Google up to a whole host of lawsuits.

You have avoided those lawsuits in the past because the site owners could have put up a robots.txt, noindex, etc. and in most cases chose not to. If you start returning content that has been purposely blocked because you made a business decision to ignore the content owner, YOU ARE HOSED.

Also, I find it interesting that you are willing to break convention/standards because you are afraid of a few high profile sites not appearing. Does this explain why high profile sites can sell links with no fear of de-listing? You can’t very well drop forbes.com and still be considered to have good search results. But the little guy, well they can be squeezed quite easily. Tsk, tsk.

I am still fairly new to the internet community. So, I can sincerely say that mistakes happen. Unfortunately sometimes mistakes are not mistakes at all but intentional abuse of conventions created by the search engines to assist webmasters which become black hat tools of deception.

I feel that keeping a page indexed even though it has a noindex attribute attached to it in the robots.txt file would not only help newbies (like myself) but also help monitor link schemers who try to use the noindex tag to hide their black-hat link schemes.

So, I feel that from both the search engine and white hat webmaster standpoint, indexing or archiving a page in some form would be a win-win situation. The only instance where I could see that a page should not be considered for indexation is if it had an SSL certificate on it, or contained personal information that could potentially facilitate identity theft or other types of internet crime.

Frankly Matt, I think it’s absurd Google is considering adding links to pages that have NOINDEX on them, regardless of what the ODP, MS, Yahoo or anyone else is doing. I don’t care if there are a “few” webmasters that are doing it wrong by adding NOINDEX to pages that “should” be indexed. The simple matter is, MOST webmasters are using it the way it was intended. Are you really going to punish all the webmasters that care about how and where their content appears because of a few idiots?

And Travis has it right, too. If you do start indexing pages that have NOINDEX on them, and you show ads on those pages with their URLs in the SERPS, it becomes obvious that you’re refusing to comply with the webmasters’ wishes for your own financial gain. Good luck with those lawsuits.

To Peter: a NOINDEX tag very much is a no trespassing sign. That you don’t consider it to be doesn’t mean a hill of beans to the rest of us. Legitimate search engines are obligated to obey the directives of the webmasters of the sites they visit. If they choose not to obey they *will* ultimately be held responsible.

Finally, if ANYTHING, Google has the best webmaster tools in the world, bar none. If you really need to consider showing the pages somewhere, include them in one of the many reports in the webmaster console so the webmasters that USE the service can quickly identify and correct the issues with their sites. But please don’t punish the webmasters that are doing things right by ignoring their requests to have pages not be indexed.

Sorry about the double entry but I just had an idea. Rumor has it you did away with the supplemental index. What if you created a similar database to index “noindex”-tagged pages? A sort of “noindex” index that would not show up in, or corrupt, the regular search result relevancy?

Sure, people have the right to exclude themselves from search engine listings, but until someone puts a better AI into the search engine spiders, those people who MISTAKENLY put noindex on their pages will be left out of the index.

I’m not Korean, but can you imagine trying to find a way to contact your police authorities online and being blocked because someone may have mistakenly put a noindex in?

The search engines are there to help find what we need online. They help make heads or tails of the chaos that is the web. I think it is within their jurisdiction to verify whether a noindex directive was valid or not. Sure, every Tom, Dick and Harry can noindex their site for exclusion – but public service sites like police websites and hospitals should be included nonetheless.

If a website is added to Google through Google’s “add URL” tool, then show the link (no snippet). If a website is not added through Google’s add URL and is somehow found through other links, then don’t show it at all.

I think an index of noindex’ed sites that would not be displayed to the public, but that could be queried to look for high-traffic sites using a heuristic engine, would produce rapid, accurate results with precision and in a cost-effective manner.

This engine could sift through sites of importance and then, using the Google alert system, these webmasters could be notified of the issue and correct it if they wish to do so.

This would not only benefit the webmaster but also help the search engine retain relevance without opening the legal quagmire of accountability and liability that publicly displaying these results might produce.

If you want to help, then do what Andy Beard said.
[quote]Wouldn’t it be easy to send some kind of automated friendly email to a site owner warning about a significant change in your ability to index the site?[/quote]

I agree with most of the people here, a page with the tag noindex should not be returned in the search results.

I don’t know the numbers when it comes to navigational queries and not being allowed to return the right result because of a noindex tag. But when it really is a problem, why not mail webmasters just like you do when you ban them for 90 days? Just tell them Google received thousands of navigational queries and you weren’t allowed to show them the URL to the website because of the noindex meta tag. And let them decide whether they want to be (re)included or not.

When the results return an url that links to those highly relevant sites (which often may be DMOZ or the likes), add a ‘We found relevant links in this result’ text. People will go to that url and find the link themselves.

I think that the NOINDEX tag should remain as it is – why change something that everyone understands?

I do understand the usability issues, and that a webmaster could stop a site from being cached to potentially restrict a customer’s traffic if they had a falling out, etc., and the customer would be none the wiser.

Besides, indexing a noindex page could be treated as Copyright infringement. Usually, a webmaster knows that search engines crawl web pages and retrieve content such as text and images. But if there’s a label saying “don’t index”, this is a clear indicator for the SE not to touch content from this page.

In the end, it’s sort of “common law”: SEs may retrieve, store and even commercially use copyrighted content without explicit permission; that’s widely accepted. But they must obey standardized meta information which disallows them to retrieve that content. Noindex is clearly a case of the latter.

By the way, the copyright issue is not too far-fetched. I think Google themselves had the problem already.

I agree with the majority – NOINDEX should just be treated to mean that and webmasters should have control of what content from their site is being indexed by search engines (otherwise there is not much difference between the spam crawlers and decent search engine crawlers).

I do not think that showing NOINDEX pages will do users any good – in fact, we usually have to place NOINDEX on pages because the users request it.

Take our site, http://www.peopleperhour.com, as an example; we get emails from users on a daily basis requesting us to “remove their profiles from Google” (not from our site, but from Google). Luckily, because of the way NOINDEX is handled today by search engines, we can provide a facility for them to select in their account preferences that their profile not be indexed by search engines (amongst other things, that option places a NOINDEX on their profile).

I don’t see how we would deal with such requests if every search engine started to deal with NOINDEX in a different way…
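A per-user opt-out like the one described above is straightforward to implement as a template helper that conditionally emits the robots meta tag. This is only a hedged sketch with an invented function name, not PeoplePerHour’s actual code:

```python
# Hypothetical sketch: render a robots meta tag from a per-user privacy
# preference, as a site offering a "don't index my profile" option might do.
def robots_meta_for_profile(user_allows_indexing: bool) -> str:
    """Return the robots meta tag (if any) to embed in a profile page's <head>."""
    if user_allows_indexing:
        return ""  # no tag needed; pages are indexable by default
    return '<meta name="robots" content="noindex">'
```

The snippet assumes the page is indexable unless the user has opted out, matching the default behaviour of search engines.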

Dave (Original) said: “BTW, in case of emergency, I would ring my Countries emergency number and not boot up the PC and start surfing. Perhaps I’m the odd-one-out?”

I stand corrected.

But nonetheless, mistakes made by others should not affect all the other users of the net. Otherwise, I’m all for privacy being the webmaster’s responsibility. (So I stand by my vote for middle ground.)

There are any number of reasons for someone to make a page NOINDEX. To simply assume that webmasters do not know what they are doing is quite insulting.

Why would I want a page to not be indexed? Fairly quickly I can think of at least 4 good reasons.

1. It could be temporary content. It will make both my site and Google look bad if there is a search result for a page that no longer exists.

2. It could be a Contact Us page. As a rule, I NOINDEX those to help keep the spammers away. I can’t see where not having that page of my sites indexed reflects poorly on Google.

3. It has duplicate content on it.

4. It has information that I want to make freely available to my site visitors but not advertise publicly – a premium for visitors sort of thing. Perhaps I don’t want the entire world knowing they can download a free report/graphic/template/etc from my site.

Regarding the examples of government sites not wanting to be indexed: you know there are parts of the world that have a beef with Google. They should have every right to opt out of your service if they choose. Just as Google continually maintains that it has every right to list or not list a site, why should site owners not have the right to opt out?

As to the everybody else is doing it conundrum, didn’t you learn in grade school that such a claim is never a justification for doing what you know is wrong?

There have been legal discussions over whether failure of a system to obey a “robots.txt” file is “trespass to chattels”. I don’t think there’s any case law on this yet, though. I’d argue that all crawlers should obey robots.txt, “nofollow”, and similar tags. Please don’t get clever in this area.

We’ve noticed some sites blocking their “contact” page via the “robots.txt” file. One publicly traded company whose main business is directory pages does this. If they didn’t, Google would detect that the contact page is identical for hundreds of thousands of sites. From our perspective, as a site rating system, we obey the robots.txt file, then down-rate them because we can’t find contact info. Some days you just can’t win.

There’s some interaction between “robots.txt” and aliasing issues.
One case is where a site has “example.com” and “www.example.com”, and only wants one of them indexed. It’s common to see “robots.txt” used to block one or the other. The problem is that if, say, “example.com” redirects to “www.example.com”, and the robots.txt file for “example.com” blocks access, the crawler can’t see the redirect and may not find “www.example.com”. It seems appropriate in such cases to read the HTTP headers only, despite the robots.txt file, to check for a redirect. The real world equivalent is going up to a building, finding a door that says “Do not enter, use other entrance”, and going to the other entrance.
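The aliasing problem described above can be demonstrated with the standard library’s robots.txt parser; a minimal sketch, assuming the hostnames are just placeholders:

```python
from urllib.robotparser import RobotFileParser

# Sketch of the aliasing problem: if "example.com" blocks all crawling in its
# robots.txt, a compliant crawler never fetches any URL there, so it cannot
# observe that those URLs actually redirect to "www.example.com".
rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /"])

# The crawler is not allowed to fetch the page, hence never sees the redirect.
allowed = rp.can_fetch("MyCrawler", "http://example.com/")
print(allowed)  # False
```

A crawler that checked only the HTTP headers of the blocked URL, as suggested, would discover the redirect without ever reading the page body.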

Another case is when the robots.txt file is itself redirected. One very large computer company does this for their main site. When that happens, it’s not clear whether the robots.txt file applies to the site doing the redirection or to the target of the redirection.

Except for those somewhat obscure issues, I’d argue for taking the “Keep Out” message of “robots.txt” and the various meta tags quite seriously.

EGOL hit the nail on the head. a “NOINDEX” tag on a URI, as well as any URI being explicitly disallowed via robots.txt, should be the clearest, most unequivocal way for a webmaster to communicate that whatever content is on that URI is not intended for the search engine attempting to index it.

and since that URI has been explicitly disallowed, it shouldn’t accumulate any link authority, either. if an engine isn’t allowed access to a given URI, then they shouldn’t have any information about it, should they?

I don’t agree at all that all traces of a site should be removable by Google or any one else for that matter. A site owner can not dictate what others show on their website. A link in the website http://www.google.com has as much right to be there as a link from http://www.whateversite.com. Why would a website owner have the right to tell anyone not to link to his content?

Google’s only concern should be the relevance of its results. And determining the relevance of a link doesn’t require the page it links to to be indexed. All you need to do is visit the page to determine the relevance of the link. That’s not indexing, that’s just visiting the page and using the content as a base of information to determine the relevance of the link.

Wow, dude…you just swung for the fences, missed, twisted yourself around in the batter’s box, and whacked yourself right on the helmet with the barrel. Michele made a lot more sense than you just did; pay attention to what she said, because she actually gets it.

A webmaster, when (s)he uses the NOINDEX attribute, is explicitly stating that the page should not appear in any indices of any variety. In order for Google to include the page in a relevant position among its results, Google would first have to take a copy of that page (or INDEX it), which is exactly what the webmaster specifically told Google not to do in the first place. There are numerous cases whereby webmasters choose to use this attribute for this specific reason…they have content that they do not wish to be indexed, and they need to specify it at the page level (e.g. expired dynamic content) because specification via robots.txt would be less efficient (or in the case of free subfolder hosts, not really an option). How is a searcher going to benefit from a so-called “relevant link” within SERPs to content that the webmaster clearly didn’t want the searcher to see in the first place?

Yes, a savvy webmaster will use logins/passwords and protect his/her content that way and blah blah blah yadda yadda, but in most cases it’s either not permitted by the host or the webmaster simply doesn’t know how to do so effectively.

It’s not Google’s job to play judge and determine which pages should and shouldn’t be indexed just because the NOINDEX attribute was invoked. If they do that, then they have to do the same thing with robots.txt Disallow, which effectively accomplishes the same purpose. For those of you who think there’s a middle ground or that big G should ignore the attribute, consider what would happen if it ever ignored Disallow and the hell most of you would raise.

What this boils down to is that the sites belong to webmasters, and they should have the right to distribute the content in any manner they see fit to anyone they see fit (which again, goes back to “opt-in” vs. “opt-out”). If people don’t want traces or all of a site to show up in a search engine, that’s their right and it should be respected. Remember, Google doesn’t own the content here.

I like what EGOL said: “Going back to the “no trespassing” sign comparison… if Google was a taxi service would they drive their clients across posted boundaries – without their knowledge? No they would not.”

That is a pretty good argument against Mike’s claim of “customers first”. It would be best for the taxi customer to simply cross everybody’s estates…

The noindex attribute is an easy way to tell search engines which content is important and which isn’t.

Example: I recently finished a simple 9-page site using a popular CMS. I registered the 9 pages via sitemap. Nevertheless, Google somehow found all kinds of internal pages that, for example, belong to the gallery that I use to store the images for the articles.

It was pretty complicated to find all the wrong URLs that Google found and put them into the robots.txt – it would be easier if you could simply set parts of a CMS to not be indexed: login pages, legal documents and so on.

If the definition of NOINDEX is being rewritten, could you take the time to rewrite the definition of HTML, or maybe WWW? I know I’m dripping with sarcasm, but Matt, really, you have never posted something so obtuse.

As Joost said above teach those who don’t understand. Don’t throw out the entire curriculum because a few lousy students can’t follow directions.

This is the first time I have commented here but I have been reading for a while.

I guess I am in the minority even though I am a webmaster. I think first and foremost, Google should provide what is best for the user of their search engine. Of course that means if they are going to include noindex sites, they should also not ban sites.

There should be a clear black and white way in which Google handles all sites. Do not ban sites that could be of use to your users if the site violates some Google golden rule.

All sites that have something to offer users should be indexed. I can see not including sites of no value, such as duplicated crap sites that steal stuff from other sites, but if a site has something to offer the user, it should be indexed. That, in my opinion, includes noindex pages if they are the only reference to an important site.

Basically, I am saying there should be a middle ground, but it should be a clear policy on handling all sites, so we do not always have to second-guess what we do to please Google and can do what is right for our own users.

1) Do not index the content.
2) Do not link to the URL in the SERPs, UNLESS the URL is a “home page” (/, or redirected to by /).
3) If it is a home page with a NOINDEX tag, it’s OK to link to it in the SERPs, but do not index the content; do not provide a snippet; and do not provide a cached copy. Treat it like a “partially indexed page”.
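The three rules above amount to a small decision table. A minimal sketch of that proposed policy (function and return values are invented for illustration, not any engine’s actual behaviour):

```python
# Sketch of the commenter's three-rule proposal: NOINDEX drops a page from
# the SERPs entirely, except for a home page, which may appear as a bare
# URL-only reference with no snippet and no cached copy.
def serp_treatment(has_noindex: bool, is_home_page: bool) -> str:
    if not has_noindex:
        return "full result"   # normal indexing applies
    if is_home_page:
        return "url only"      # rule 3: "partially indexed" home page
    return "omit"              # rules 1-2: drop the page entirely
```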

> No, it’s not OK to link to ANY page in the SERPS that has a NoIndex tag.

What’s the big deal? It already happens. URLs appear in the SERPs that haven’t even been fetched. If they haven’t been fetched, the NOINDEX can’t have been seen.

You need to separate in your mind the indexing of content and the indexing of a URL. What does NOINDEX (and robots.txt) really mean – don’t index the content, or don’t index the URL, or both?

What you’re really asking for is not to create a link to a URL. I think there are lots of examples where a link would be undesirable. OTOH, I don’t see how you could really complain if somebody (including a search engine) created a link to the home page of your site. Not indexed the content, just created a link to the home page. Hence I think my vote is reasonable.

In light of recent changes to the Adsense Terms of Service, I’d like to add #9, another obvious reason that NOINDEX should mean DO NOT INDEX THIS PAGE!

Every Adsense publisher is being required to add a TOS to their site to remain in the program. Many, including myself, are waiting for someone like JenSense to post a “Google Approved” version that meets the new requirements as stated in the new policy.

I plan on making my new mandated TOS NOINDEX. Why? Because I don’t want to get dinged for duplicate content by the same company that is mandating that I have that content.

Why in the world would you want millions of pages in your index that have the same or nearly the same content? In this particular example, there is no benefit to anyone to include them in the results.

*** I recently finished a simple 9-page site using a popular CMS. ***

If the URL returns “200 OK”, does not have a noindex meta tag, and is not disallowed in robots.txt, then it is fair game to be found and indexed. It is down to you, the designer, to keep bots out of the parts of the site that you do not want indexed. That is, a sitemap file is merely a suggestion to look at “these” URLs on your site, not a list of the “only” URLs that are open for indexing. Your error.
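The “fair game” rule stated above reduces to a simple predicate. A hedged sketch (real crawlers handle redirects, caching and many more cases):

```python
# Sketch of the indexability rule: a URL is fair game only if it returns
# 200 OK, carries no noindex meta tag, and is not disallowed by robots.txt.
def is_fair_game(status_code: int, has_noindex_meta: bool,
                 disallowed_by_robots: bool) -> bool:
    return (status_code == 200
            and not has_noindex_meta
            and not disallowed_by_robots)
```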

*** *** No, it’s not OK to link to ANY page in the SERPS that has a NoIndex tag. *** ***

*** What’s the big deal? It already happens ***

No it doesn’t. URLs with a noindex meta tag do not appear in Google SERPs; not even as a URL-only entry.

URLs excluded by robots.txt can appear as URL-only entries in the SERPs if a link to it is found elsewhere.

However, yes, some people are confused as to whether we’re talking about indexing of content, or just showing that a URL merely [i]exists[/i].

Of course Google wants to serve its users. But the moment I place a noindex on MY site so MY content won’t be indexed, I do that for a reason (or I’m just stupid).

Of course a user wants to find as many pages as possible. But it still is the choice of the webmaster whether he wants to be found. It’s his content, and his decision. People will always make mistakes, but that’s their own risk.

No it doesn’t. URLs with a noindex meta tag do not appear in Google SERPs; not even as a URL-only entry.

Sure they can. I’ll quote three general situations:

1) When the URL is indexed before the NOINDEX is added to it, it will remain indexed … potentially for quite some time
2) When the page is protected by both robots.txt and NOINDEX, the NOINDEX will never be seen and the URL may still be indexed
3) When the page is new, the URL may be indexed before the content has been fetched.

> Let me guess, YOU dint have a NoIndex tag on your homepage and you would be arguing the other side if you did.

Er, no, because I recognise that if you put a site on the Web, it’s likely that somebody will link to it. That’s how search engines tend to find new sites, after all. So, what would your response be if you put a site on the Web and somebody else linked to the home page? Not a search engine, but just some regular site? Can you really complain if you put a site out there and somebody links to it, without even using a deep link?

– Keep things simple
– Do not design your sites for the minority of users. On top of it don’t mess up the semantics of NOINDEX because a few users misused it
– NOINDEX, in my opinion, means don’t index at all. Seems pretty straightforward. So when you haven’t indexed it, how can you show any reference to it? Doesn’t make sense.

Noindex-ed websites shouldn’t be indexed at all.
In addition, Google and other major search engines should work on an extension to the current robots and meta-index standards that is supported by all major search engines. This extension could give webmasters the possibility to give more detailed instructions (like ACAP intended to give, but ACAP is too complicated and not very good at all).

A site containing content that you’re testing or using for offline marketing purposes only.

An intranet site that you don’t want anyone even coming across the login page for.

In all three cases, robots.txt would be a far more appropriate means of blocking access. If you use a NOINDEX tag, you’re letting the spider in, only to tell it you didn’t want it there. With robots.txt, you don’t even let the spider in.

There are three perfectly valid reasons to complain. How many more do you want?

Dave, I’m not attempting to cloud the issue. I’m attempting to discuss the issue that Matt raised. The fact is that Googlebot would probably not discover a home page unless at least one other site had linked to that home page. And Google would almost certainly not rank a Web page for a high-traffic term unless lots of other Web sites had linked to that home page. So all I suggest is one extra link to that home page, from Google (not Googlebot).

In attempting to compare this to harvesting spambots, it’s you who are clouding the issue, in an emotive and misleading way. Nobody is talking about harvesting. Nobody is talking about indexing any content. I’m simply talking about one extra link to a home page, in addition to the other links to that home page which must have existed in order for Google to find and rank that home page in the first place.

“You can use a special HTML META tag to tell robots not to index the content of a page.”

That’s from http://www.robotstxt.org/meta.html, and it’s exactly the behaviour I’m proposing. The content of the page is not indexed. So please don’t compare this to spambots, Dave. The standard would be followed to the letter.
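The META tag behaviour quoted above can be checked with nothing more than the standard library’s HTML parser. A minimal sketch (a real crawler would be far more defensive about malformed markup):

```python
from html.parser import HTMLParser

# Sketch: collect the directives from any <meta name="robots"> tags on a page,
# as described at robotstxt.org. Directive tokens are comma-separated.
class RobotsMetaParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        d = dict(attrs)
        if tag == "meta" and d.get("name", "").lower() == "robots":
            for token in d.get("content", "").split(","):
                self.directives.add(token.strip().lower())

page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
parser = RobotsMetaParser()
parser.feed(page)
print("noindex" in parser.directives)  # True
```

Under that reading, the page’s content is excluded from the index, but the tag says nothing either way about whether a bare link to the URL may exist, which is exactly the ambiguity being argued over.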

Alan, Google clearly states on its own pages that the use of NOINDEX is to “Block or *remove* pages using meta tags”. If a site owner uses the Google-suggested method (from the horse’s mouth), they should expect Googlebot to comply, leaving only spam bots NOT to comply.

You can play semantics all you like, but NOINDEX, in the context of SE bots, means “Do NOT index my page in the SERPs”. That IS the standard.

Matt, Google and yourself even admit that, but are trying to cloud the issue by saying NOINDEX on home pages, important sites etc. should be treated *differently*. You cannot have it both ways.

IF you glided through a STOP sign / RED light at an intersection and a cop saw you, do you think you could worm out of the ticket by saying “But officer, I could clearly see there were no cars coming”? PLEASE ANSWER.

IF Google does as you suggest, they are violating Web standards, and it’s the thin edge of the wedge, in that Webmasters DO NOT retain the right to do as they please with their own sites.

“That’s from http://www.robotstxt.org/meta.html and it’s exactly the behaviour I’m proposing. The content of the page is not indexed. So please don’t compare this to spambots, Dave.”

More semantics. IF they don’t want the CONTENT indexed, why on Earth should they expect a link to the *content* in the SE index?

From the same page Google and yourself link to;

robots can *ignore* your tag. Especially malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers will pay no attention.

SEs know FULL WELL what noindex SHOULD do (they themselves have it in black and white), but believe there SHOULD be *exceptions* based solely on fear of the SE looking bad in the searcher’s eyes (profits). I say, bad luck; my business looks bad when I’m on page 100 of the SERPs, but I don’t resort to unethical practices for MY gain ONLY.

Regarding the “no-trespassing” sign (which I think is a good analogy): if I put a no-trespassing sign on my yard, does that mean that the city should remove my address from its records? Should there be a blocked-out space on all maps where my yard is? Or does it mean – stay off my property?

If I am a directory and you put a noindex on your site, can you sue me for having a link to your site? Can you sue me for content on my site?

Or does noindex mean “No Index”? Not erase me and every reference to me from existence?

If Google indexes a link to a page on your site from MY site, then that link and its anchor text are MY content, not yours. Therefore they are allowed to index that content.

Ian, the comparison between the “no-trespassing” sign and the NoIndex meta tag is worlds apart; the two cannot be compared, and as such, it’s not a good analogy. Unless apples and oranges are the same 🙂

A much better analogy is having a silent phone number, i.e., NOT indexed in ANY phone book.

Google states in black & white what NoIndex should do:

Block or *remove* pages

They themselves admit that, BUT believe there should be special situations where they treat the tag differently.

I’m all for Google doing as they please with THEIR index/site and retaining the right to remove pages as they see fit. BUT that same right MUST apply to ALL site owners.

It’s ludicrous for a Webmaster to comply with Google’s own guidelines and add NoIndex to a page, only to find that Google’s ignored it and added a link to it in the SERPs like ALL pages *without* NoIndex.

Dave, unlike your analogies, I think Ian’s analogy is a very good one. 🙂

If somebody links to your site, and Google indexes that link, then your gripe is with the person who linked to your site … not with Google. Following your logic, Dave, a directory can link to a site with a NOINDEX’ed home page, but a search engine cannot. That’s a contradiction I attempted to solve. 🙂

Both robots.txt and the robots meta tag were designed and introduced before Google was invented, at a time when far less emphasis (if any) was placed upon link analysis. robots.txt is focused on whether content may be fetched. The meta tag is focused on whether content may be indexed [or links followed]. Neither is focused on whether content may be linked to. At the time they were invented, that did not matter. Now, it does matter.

The way Google and other engines have treated robots.txt and the robots meta tag, with respect to linking to content, has therefore been a matter of interpretation. Currently, as Matt states, interpretations vary, and Matt’s concern is that Yahoo’s searchers may receive more relevant results than Google’s searchers because of this variation in interpretation.

Rather than just change the policy without consultation, Matt decided to solicit opinions here. Frankly, I think if Google changed the policy without consultation to the policy I have outlined above, nobody would have even noticed. It is really not a big deal.

It would become a bigger deal if it moved beyond the home page. For example

1) any static URL
2) any URL within one click of the home page
3) any URL

I’m not suggesting any of those should feature in SERPs if they contain a NOINDEX tag. Just a home page. But, according to the letter of the standards, Google (and its competitors) would be within their rights to index links to any URL.

Again, everyone’s looking at robots.txt vs. the meta robots tag as if it’s the real problem. It’s not. It only exposes a deeper problem, and a much simpler solution.

The deeper problem: that bot methodology is “opt-out”.
The solution: make indexing “opt-in, with opt-out options”.

If we say we want in, we want in. Otherwise, we want out. There’s no confusion. There’s no indexing of pages that shouldn’t be indexed against a site owner’s will. People would adapt to the new reality of having to submit a Sitemap as per the protocol (most of them submit their homepage, thinking it will actually do something, anyway). And the problem’s solved.

The rest of this is just a silly argument that doesn’t even need to take place. Use the improved tools that are already in place.

Crawling and indexing and grabbing in an opt-out manner is 1998. We’re in 2008. Let’s move there, shall we?

“If somebody links to your site, and Google indexes that link, then your gripe is with the person who linked to your site … not with Google. Following your logic, Dave, a directory can link to a site with a NOINDEX’ed home page, but a search engine cannot. That’s a contradiction I attempted to solve.”

Alan, meta tags are for robots NOT humans. IF Googlebot lands on a page that has a NoIndex meta tag, it should obey it as per Google’s OWN definition of what NoIndex does, regardless of how it found the page. IF a directory uses bots to locate pages, they SHOULD also honor the use of NoIndex.

SEs already index web pages by default; if they start to ignore NoIndex on OTHER people’s sites (which at least 2 do, based ONLY on self-serving interests), the tag becomes totally redundant.

You are TOTALLY missing the point that Matt & Google KNOW what NoIndex SHOULD do, but are contemplating making EXCEPTIONS where they ignore the tag. Why? All because the other 2 major SEs do just that, and *Google is afraid* that their searchers will perhaps think Google is at fault IF a certain page is not in the SERPs. Like I said, Google used to be a leader, not a follower.

“Rather than just change the policy without consultation, Matt decided to solicit opinions here. Frankly, I think if Google changed the policy without consultation to the policy I have outlined above, nobody would have even noticed. It is really not a big deal.”

There is VERY limited “consultation” that likely represents about 0.00001% of Webmasters. In regards to “no big deal”, it’s no big deal to most on the Planet that some Cities have a VERY high murder rate, but to those that experience it….

“It would become a bigger deal if it moved beyond the home page….. I’m not suggesting any of those should feature in SERPs if they contain a NOINDEX tag. Just a home page.”

“Just a home page” – are you kidding? It’s the start of a slippery slope to ANY page; you say “Just a home page” and SEs end up saying “Just a Web page”.

Even Matt himself believes NoIndex = NoIndex, with NO link in the SERPs.

First of all, it’s nice to see the question actually being asked, and I see the dilemma with the other SEs ignoring the webmasters. It’s a little like: if everybody runs around in the street shooting their guns, I might as well do it myself.

I would rather see that we have some moral principles left and not just do what we feel like all the time; however, money makes the world go round, and serving content is what SEs live by.

This argument makes me think of the news when Google started indexing books – a lot of angry people there. Google News as well. I have no problem understanding that a SE is not permitted to index my whole life!

NOINDEX === NOINDEX, unless we don’t care about words and their meaning anymore…

Adam and Dave, I think we have reduced our debate to a simple difference of opinion. 😀


I’m a web designer and always load sites I’m redesigning on my own website for the owner to critique before they go online. I put noindex on each page so they will not get indexed (along with disallow in robots.txt), because IF they do get indexed it could harm the original pages for duplicate content, so I would hope this doesn’t change in future.

This is obviously not your objective, but the fact you’re raising these questions brings concerns for the layman or ordinary Webmaster who needs to keep some content and pages away from the search engines’ index.
When it is not appropriate to use htaccess password protection, should we recommend using a combo of noindex, nofollow tags on the internal links pointing to the content, and moreover robots.txt to be on the safe side?
Removing content with Google’s index removal tool works fine, but it’s not instantaneous!
Cheers, Oscar

OK Dave, I’ll try to re-answer your question without using the citations I used previously. 🙂

“Alan, so you think the abolition of a long standing Web standard for certain pages based on self serving factors, at the expense of the page/business owner, is the future?”

No, I don’t think it is the future, but neither do I think that a minor change in the interpretation of the standard, whilst still operating within the standard, amounts to an abolition of the standard.

In other words, I do not think that you have fairly or correctly characterised the question. You have constructed a logical fallacy in an attempt to build a straw-man argument.

Matt – do not change this one.
It is clear that NOINDEX says “DO NOT INDEX”.

Yes, Google has to do everything it can for its users. BUT Google would be nothing if there were no webmasters. You wouldn’t have one single search result. And if I decide I do not want my content (including but not limited to the title tag, the URL and the description) in Google’s SERPs, then this has to be an option.

You use content from others to make money. You do not ask them if they want this; you just give them the opportunity to tell you to stop by placing “noindex” on their sites. Even if most people want to be indexed, this is not the way it should be. Just think about someone copying millions of webpages and making money with them. He only copies about 20 words from each site. He makes a lot of money with that (Google does!) and he never asked a single one of the webmasters. He just gives a rule: if you put a “NOCOPY” tag into the header of your site, then I will NOT copy you anymore. Illegal enough, what he does, but let’s say the others accept it.

Some years later he says: “There are valuable sites out there which use ‘NOCOPY’ and my readers want to read their content too. Uh – that’s the idea! I’ll just say: well, I’ll copy only headlines from them, and if they do not want this, I’ll put up a form which they can use to be removed.”

Webmasters shouldn’t need to do ANYTHING to avoid giving away their content. Well, that would be hard for Google, so I’d say OK, let’s accept the extra tag. But if I use it, YOU should accept that. No matter whether your users want to find my site or not: if I say I do not want you to show it, you shouldn’t show it.

You may as well create new tags:
“nodescription” | “duplicate” (or “duplicate=http://…”)

What they do should be clear:

nodescription means you shouldn’t show a description. You may show the title.

duplicate means this page may be a duplicate of another one; that other page may be explicitly set or left off. You know, lots of us are using noindex,follow to avoid duplicate content but still let the links be followed. Shouldn’t there be a way of telling Google that this may be a dup but may also hold some extra content? Which would say that only one of the two should be visible in the SERPs, but both may be indexed and returned to the user, while it may not matter which one.
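A sketch of how the proposed tags might look in a page’s head; these are hypothetical directives suggested above, not ones any search engine actually supports:

```html
<!-- Hypothetical tags proposed in the comment above;
     no search engine supports these -->
<meta name="robots" content="nodescription">
<meta name="robots" content="duplicate=http://example.com/original-page">
```

The example.com URL is just a placeholder for wherever the canonical version of the page lives.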

More bad analogies, Dave, but it seems you care deeply and passionately about this issue.

If that’s the case, I can’t understand why you aren’t equally as passionate about the fact that, already, URLs are indexed and may appear in Google’s search results when they are protected by robots.txt. Surely that’s just as big an issue as URLs being indexed and appearing in search results when protected by the robots meta tag? Yet you haven’t mentioned it at all.

Likewise, I believe that if, when Google had implemented that interpretation of robots.txt, they had also implemented a similar interpretation of the robots meta tag, then by now it would be a non-issue. It really is not a big deal.

Maybe if you had read all my comments you would see that I HAVE mentioned MANY times that more wrongs do NOT make a right.

If that’s the case, I can’t understand why you aren’t equally as passionate about the fact that, already, URLs are indexed and may appear in Google’s search results when they are protected by robots.txt. Surely that’s just as big an issue as URLs being indexed and appearing in search results when protected by the robots meta tag? Yet you haven’t mentioned it at all.

Your point? As you have now resorted to changing the subject and stating that 2 wrongs make a right, I take it you have realized how silly your “minor” change is 🙂

So when a page uses SE spam to rise above the rest, all the others targeting that SERP should start spamming as well.

It really is not a big deal.

Bet it would be if you read Google’s black & white statement of its “interpretation” of noindex and used the tag for what they state it IS for.

Current solution: “You can use a special HTML META tag to tell robots not to index the content of a page”

Dave, that does not change under my proposed solution. Very little does:

Current solution: “You can use a special HTML META tag to tell robots not to index the content of a page”
My proposed solution: “You can use a special HTML META tag to tell robots not to index the content of a page”

Current solution: “The URL of a page protected by robots.txt and the robots meta tag may be indexed”
My proposed solution: “The URL of a page protected by robots.txt and the robots meta tag may be indexed”

Current solution: “The URL of a non-home page not protected by robots.txt but protected by the robots meta tag will be removed from the index once the content is read”
My proposed solution: “The URL of a non-home page not protected by robots.txt but protected by the robots meta tag will be removed from the index once the content is read”

Current solution: “The URL of a home page not protected by robots.txt but protected by the robots meta tag will be removed from the index once the content is read”
My proposed solution: “The URL of a home page not protected by robots.txt but protected by the robots meta tag may remain in the index after the content is read”

I won’t presume to talk for Matt or Google about how much our respective opinions matter. But presumably they both matter to some extent, or Matt would not have bothered asking for them in the first place.

Dave, this is neither an ochlocracy nor a democracy. It’s a business decision on Google’s part. The voters in Matt’s poll are not the only people that Matt is concerned about and if it had been a simple case of looking at the poll results then there may as well not have been a debate.

Matt said Google’s highest priority is to its searchers. It’s what searchers want that really matters, as long as what searchers want allows Google to continue operating within robot guidelines.

Describing my family as being totally ignorant of anything is pretty low, Dave, especially when you don’t know them, and demonstrates the futility of further debate with you. So I will take your advice and go and spend Easter with them. 🙂

I haven’t finished reading through all the comments, but for myself I will say that what I am seeking is a way to keep all the flotsam and jetsam (archives of several species, and other secondary and likely duplicated content directories and files necessary for my site to run itself) out of sight. But I want (and I expect Google would want the same) to take advantage of any opportunity to discover new links that might be available on the site, as well as spider and index any unique content.

So here’s my 2 cents / suggestion:

– you don’t want your site / content seen / examined at all = disallow
– you don’t want the content appearing in the results = noindex, BUT either by default or by including e.g. follow (?) the spider will follow any links
– noindex / nofollow could be tantamount to disallow

…anyway, this is the way my mind works about this –

So my next question is how to put this to work? The syntaxes are necessarily strict and somewhat arcane to most hobby coders (like moi) and certainly any newbies, and mistakes can be costly… what needs to go in the robots.txt to do what I want to do, without a computer science degree?

noindex – will this *just* keep the specified content from appearing in the index?

I tried adding “noindex,follow” and “noindex, follow” to robots.txt; they both threw syntax errors.

Hmm, this is interesting – noindex: follow: DIDN’T throw an error, but what’s the Googlebot going to do with that?

All I really want to completely disallow is administrative / code / backend / internal stuff that nobody with good intentions is likely to be interested in – read: potentially security-sensitive areas…

Hope some of this makes sense. Is there anybody else out there with similar objectives?
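On the syntax errors above: robots.txt and the robots meta tag use different grammars, which is why meta-tag values like “noindex,follow” fail in a robots.txt checker. A robots.txt sketch (the Noindex: line is the undocumented, Google-only directive Matt mentions in the post, and could change; the paths are examples):

```
# robots.txt -- one directive per line
User-agent: *
Disallow: /admin/      # never crawl this at all
Noindex: /archives/    # experimental, Google-only; behavior may change
```

The “noindex, follow” combination, by contrast, belongs in each page’s HTML head as `<meta name="robots" content="noindex, follow">`, not in robots.txt.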

>>> I’d also be interested in (constructive)
>>> suggestions in the comments about how
>>> Google should treat the NOINDEX meta tag.

matt, this is a funny question …
read my lips: NOINDEX ^^

i don’t really see why this should be discussed in the first place. we had a similar discussion at the german google-webmaster-help-group about robots.txt, with the google webmaster trying to tell us something about how hard google’s task was in “interpreting” what publishers meant by excluding files from crawling, and i didn’t understand the point there either … so i wonder what you’re at? taking total control, leaving publishers no choice at all anymore?

With all of the discussions about NOINDEX, I’ve been unable to find an answer to a question I have. I’m currently programming a community-based web site that has several thousand pages (basically categories), but at the beginning, will not have any meaningful content for a good chunk of those pages.

If I put a NOINDEX robots tag on the pages without content, will the search engines come back and try to index the page again at a later date? If they will not come back, what is the best way to stop bots from spidering the pages until there is some meaningful content?

SAM:
As already mentioned, Google will not stop indexing your files when you use robots.txt. As this poll tells us (Google never *really* cared for webmasters’ wishes; they will not start to do so now!), this will be the same with the meta tag. You can simply leave your pages as they are. But it might be a genius idea to hide categories which are unfilled (at least for *not logged in* users), since your visitors will dislike empty pages anyway.

The Atomz web site search engine uses [noindex] and [/noindex] and the corresponding [nofollow] tags embedded in HTML to control its spider, so that only parts of a web page will be spidered. It also respects the robots.txt and meta tags. It would be great if Google could also support an inline [noindex] tag of some sort. My $0.02. ~Ray
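As described, the Atomz-style markup wraps sections of a page rather than whole files; roughly like this (syntax as the commenter describes it, not verified against Atomz’s documentation):

```html
<p>This paragraph is indexed normally.</p>
[noindex]
<p>Navigation or boilerplate the spider should skip.</p>
[/noindex]
```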

I’m surprised this debate is going on but hats off to you for giving us a say.
noindex should do what it says: don’t index. I feel MSN and Yahoo should follow suit. Google already provides the best user experience and I don’t think this issue affects that. It would be great to see an across-the-board approach to this from all search engines.
Most people reading your blog will have an interest in what’s happening with Google and other search engines, but there are many web developers who don’t keep up with what’s happening. How could you inform all those people of the change? What about wee Johnny, who has a hidden page dedicated to his childhood sweetheart? What if Google changed the rules on the noindex tag? Imagine the embarrassment at school when he’s back from his summer break! So for wee Johnny, please keep noindex.
I feel it’s common sense – sure, some people will make mistakes with it, but if a listing on Google is important to them they’ll work it out.

If I’ve specified NOINDEX for any reason then don’t index. Yes, I might have shot myself in the foot, but isn’t that every business (read: website) owner’s choice? We can’t tell you how to run your search engine, but we can ask you not to tell us how to run our websites.

Let me put it this way Matt, if I come into your home and take photos of you sleeping, can I post those anywhere? No? But you might be an exhibitionist wanting me to do so, right? I think you know where I’m going with this.

If the door is closed, if the sign says do not enter, then it should be respected by everyone, everybot, everything. No matter how good or bad the reason.

OK guys, so I just mistakenly pushed a site live that still had the noindex tag from the test site. It was there for about a week. So how long before my rankings return? Is there anything I can do to encourage the spiders to come back and crawl in the meantime?

In my country (Russia) the noindex tag means that the biggest search engine, Yandex, can’t see the information. So for example if I put the noindex tag around comments, that means Yandex can’t index those URL links.

“If somebody links to your site, and Google indexes that link, then your gripe is with the person who linked to your site … not with Google.”

no no no – when Google follows a link it must check the target domain’s robots.txt and the target page’s meta tag. Google must acknowledge the tag as intended, and regardless of how it found the page, it should not be indexed.

Thanks for the nice post. I have a little bit of confusion: if I use NOINDEX,FOLLOW, how is it handled by Google? As described, it completely disallows crawling, so how can it pass PR juice to the other pages linked from it, if the crawler doesn’t read the whole page?
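To the confusion above: NOINDEX does not stop crawling. The crawler has to fetch the page in order to read the tag at all, and with FOLLOW it can still discover the links on the page; only the page itself is kept out of the results. The tag in question:

```html
<!-- Page is fetched and its links can still be followed,
     but the page itself stays out of the index -->
<meta name="robots" content="noindex, follow">
```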

If someone has a no trespassing sign, you can still know that something is there, just not WHAT. You could use Google Maps to peer at the roof of it and guess what it might be, you could ask their neighbour, or you can perhaps get information about the lot or the building from city hall.

You have to read part of the page in order to parse the NOINDEX directive to begin with, so why not make it so any info that can be obtained up until the NOINDEX tag on the domain’s index page is fair game?

This way, you’re not “indexing” it, you’re just acknowledging that it exists, and not collecting any further data on it, so keyword searches can’t find it, etc.

Imagine a world with 50 really important search engines. I don’t want my content included. I should be able to accomplish that with a simple noindex instruction. I should not be expected to go to each search engine that exists and each one that springs up to keep my content out.

Going back to the “no trespassing” sign comparison… if Google was a taxi service would they drive their clients across posted boundaries – without their knowledge? No they would not.

However, that “no trespassing” sign is visible to the public. And, in the case of real estate anybody who walks past it can see it and there is no law against commenting to a friend…. “That cranky old EGOL has a no trespassin’ sign on his land”.

So, if a search engine user types a domain or a page URL into the query box, it would be OK for google to have a special page that says… “The cranky old fart that owns this page says Keep Out”. However, that page should not contain a clickthrough link. That is like opening the gate to the posted land.

In the case of real estate, any person who isn’t blind can see across the property boundary and notice objects on my land. They can talk about them freely – no laws against it. This is the point where I believe that the technology of a search engine departs from the real estate example. When the spider arrives and requests the “noindexed” page, that noindex instruction means: “Don’t look at this” So that page should not appear in any SERP – even if that SERP is relevant. This is the same as land with a privacy fence. Google should not know what is in there.