chalkywhite

8:36 pm on Jun 28, 2013 (gmt 0)

@Lucy

A WordPress blog of mine has that post format, but it has appended a /trackback/ to the end of every URL. So dupe content ahoy!

WordPress is a nightmare for these auto-generated pages; you really need your head screwed on to not get caught out. Additionally, if you turn off trackbacks it just turns them into 302s, meaning temporary redirects. Trackback.php needs editing to change them to 301s.
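For anyone hitting the same trackback issue, a server-level redirect sidesteps editing core files. This is only a sketch, assuming Apache with mod_rewrite and the default /postname/trackback/ permalink structure:

```apache
# Hypothetical .htaccess rule: 301 any /trackback/ URL back to its
# parent post, so the auto-generated duplicates drop out of the index.
RewriteEngine On
RewriteRule ^(.+)/trackback/?$ /$1/ [R=301,L]
```

A 301 here tells Google the trackback URL is permanently gone, unlike the 302 WordPress emits by default.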

lucy24

8:51 pm on Jun 28, 2013 (gmt 0)

I wonder if we are talking cross-wired.

No, I was responding to a post further up the line. If a page hits the googlebot with a 401, that's good. I was commenting on the depressing number of pages that apparently don't give the googlebot a 401, and therefore get indexed even if humans can't readily see them.

indyank

3:31 am on Jun 29, 2013 (gmt 0)

I understand why 401 URLs aren't shown in the SERPs. But even roboted-out URLs should not figure in their index in any form.

The reason most webmasters block bots that claim to obey robots.txt is to ensure those bots don't send any visitors to the site from their platforms. When a webmaster instructs Googlebot through robots.txt not to crawl the content of a page (though there is no foolproof way to ensure they aren't crawling it and keeping the content in their DB), Googlebot is supposed to ensure its visitors never find that page in any form, by any means, via the SERPs. How are they helping their genuine visitors by showing links with a boilerplate description? The fact that links to such pages exist on other sites in no way authorizes them to bypass or discard robots.txt for such pages and show them in the SERPs with a boilerplate description.

Convergence

10:38 pm on Jun 29, 2013 (gmt 0)

If the page is roboted-out, the search engine cannot see the "noindex" header.

Assuming that the bot reads and obeys the robots.txt before it visits each page. Including when following inbound links.

Which the bots do not do...

lucy24

12:09 am on Jun 30, 2013 (gmt 0)

Well, this part at least can be answered by a quick look at logs. Has anyone seen the googlebot-- the real thing, not a spoofer-- snuffling around where it doesn't belong?

The googlebot doesn't follow links in the strict sense*. It gets information that says "this page exists" and puts it on a shopping list for later.

* There may be exceptions involving non-page files, but that's an unrelated issue.

phranque

12:17 am on Jun 30, 2013 (gmt 0)

Assuming that the bot reads and obeys the robots.txt before it visits each page. Including when following inbound links. Which the bots do not do...

to which bots are you referring?

it's either a "well-behaved bot" or it's not respecting the robots exclusion. robots.txt is irrelevant to "inbound links" - it is purely about the impending request.

Convergence

12:35 am on Jun 30, 2013 (gmt 0)

Has anyone seen the googlebot-- the real thing, not a spoofer-- snuffling around where it doesn't belong?

Yes.

to which bots are you referring?

The Googlebot

We have specific pages that we do not want indexed, i.e. merchant profile pages. The Googlebot shows up in our logs all the time as having accessed those pages. These pages are properly blocked in robots.txt and have noindex in the headers. These pages show up in the SERPs from time to time with the "description blocked by robots.txt" statement.

The only way the Googlebot is accessing these URLs is from internal pages as we have yet to add rel=nofollow to them...

phranque

12:44 am on Jun 30, 2013 (gmt 0)

We have specific pages that we do not want indexed,

in this case you should allow the resource to be crawled and supply a meta robots noindex element or X-Robots-Tag noindex header.
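As an illustration of the approach above, the noindex directive can be supplied as a response header at the server level. This is a hedged sketch, assuming Apache with mod_headers, and it only works if the robots.txt Disallow for the directory is removed so the crawler can actually fetch the page and see the header:

```apache
# Hypothetical server config: let googlebot crawl /merchant/ pages
# but tell it not to index them via the X-Robots-Tag header.
<LocationMatch "^/merchant/">
    Header set X-Robots-Tag "noindex"
</LocationMatch>
```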

The Googlebot shows up in logs all the time as having accessed those pages. These pages are properly blocked in robots.txt

These pages show up in the SERPs from time to time with the "description blocked by robots.txt" statement.

this is precisely the expected behavior when googlebot is excluded from crawling, so if the real googlebot did actually crawl that url, it is ignoring that fact in the index.

[edited by: phranque at 12:56 am (utc) on Jun 30, 2013]

aakk9999

12:51 am on Jun 30, 2013 (gmt 0)

The only way the Googlebot is accessing these URLs is from internal pages as we have yet to add rel=nofollow to them...

I think you might have meant: The only way Googlebot is finding out about these URLs is from internal pages, as we have yet to add rel=nofollow to them...

But I am a bit confused, perhaps you could clarify:

1) Are you saying that you can see Googlebot requesting a URL which is blocked in robots.txt? OR

2) Is Googlebot requesting a URL that only exists as an internal link (one you plan to add rel="nofollow" to) on a linking page that is not itself blocked via robots.txt, while the target URL is blocked?

Because, if it is 1), then Googlebot is not behaving.

But if it is 2), then there could be other ways Googlebot could have found out about this URL.

Have you tried using "Fetch as Googlebot" in WMT for the offending URL (the one that is only linked internally, which you think should not be requested by Googlebot, but whose requests you are seeing in your logs) - does Googlebot fetch it?

Have you tried putting this URL in the WMT section "Health --> Blocked URLs", under "Specify the URLs and user-agents to test against", to see what Google thinks - whether crawling is allowed or not? What results do you get?

lucy24

1:13 am on Jun 30, 2013 (gmt 0)

We have specific pages that we do not want indexed <snip> These pages are properly blocked in robots.txt and have noindex in the headers.

OK, let's try this again from the beginning...

Convergence

8:53 am on Jun 30, 2013 (gmt 0)

You folks have me confused - lol.

We have a directory, /merchant/, which is listed in robots.txt as disallowed.

Disallow: /merchant/

All the pages in that directory have noindex in the meta tags.

<meta name='robots' content='noindex, nofollow' />

The only point of access to this directory, and the pages contained within, is through a link on the merchant's product page (located on our site). That link on the product page currently does NOT have rel="nofollow" in the link URL.

Yes, I admit it is possible that someone, somewhere, could have bookmarked a specific merchant and the Googlebot is following the link from there. However, with hundreds of merchants, we're pretty confident that people aren't doing this en masse. We will see heavy crawling by the Googlebot, then in a day or two those pages will be in the SERPs, with the aforementioned "description is blocked by robots.txt". Then there will be some sort of data update/refresh and the pages are gone.

As you can see, the Googlebot does not visit our robots.txt very often. 147 times while crawling 531K pages.

Googlebot 531,659+147

Testing in WMT yields the following:

Blocked by line 12: Disallow: /merchant/ Detected as a directory; specific files may have different restrictions

Am I missing something here?

phranque

10:43 am on Jun 30, 2013 (gmt 0)

then in a day or two those pages will be in the SERPs, with aforementioned "description is blocked by robots.txt". Then there will be some sort of data update/refresh and the pages are gone.

you won't see the pages - only the urls. and when you say "the pages are gone" are you sure they aren't filtered out? try adding &filter=0 to the google search url and see if those urls reappear.

i don't think you want googlebot requesting robots.txt first for every resource requested. iirc googlebot caches robots.txt for up to 24 hours. what was the elapsed time for those 147 requests of robots.txt?

Am I missing something here?

the part i see missing is where you have verified that googlebot has actually requested a url in the /merchant/ directory and if so that you checked the IP of the visitor to verify that it is in fact googlebot and not a spoofed user agent.

it has been mentioned numerous times in this thread that the noindex directive is irrelevant when you have excluded googlebot from crawling that url.
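the spoof check suggested above can be sketched in a few lines - reverse DNS on the logged IP, then a forward lookup to confirm the name resolves back to the same address. this is only an illustrative sketch (the function names are mine, not anything from the thread):

```python
import socket

# Hostnames Google's documented verification method expects.
GOOGLE_HOST_SUFFIXES = (".googlebot.com", ".google.com")

def is_google_hostname(hostname):
    """Check whether a reverse-DNS name falls under Google's crawler domains."""
    return hostname.rstrip(".").endswith(GOOGLE_HOST_SUFFIXES)

def verify_googlebot(ip):
    """Reverse-DNS the IP, check the domain, then forward-confirm the
    name resolves back to the same IP (rules out spoofed user agents)."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except socket.herror:
        return False
    if not is_google_hostname(hostname):
        return False
    try:
        return ip in socket.gethostbyname_ex(hostname)[2]
    except socket.gaierror:
        return False
```

the forward-confirm step matters: anyone can point reverse DNS for their own IP at a googlebot.com-looking name, but they can't make that name resolve back to their IP.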

lucy24

10:46 am on Jun 30, 2013 (gmt 0)

147 times while crawling 531K pages.

You can't set a per-page quota. Well-behaved small robots pick up robots.txt at the start of each separate visit. Large robots-- and you can hardly get bigger than the googlebot-- read robots.txt, spread it around to their fellow googlebots, and hold it for up to 24 hours.

As I understand it, "nofollow" doesn't mean "pretend you haven't seen this link". It just means "I make no claims about the quality of the material I'm linking to".

Convergence

8:43 pm on Jun 30, 2013 (gmt 0)

you won't see the pages - only the urls.

:) Semantics, I meant URLs/links, search results. (it was 3AM when I posted)

and when you say "the pages are gone" are you sure they aren't filtered out? try adding &filter=0 to the google search url and see if those urls reappear.

Next time we look for blocked directories in the Google, we'll give the &filter=0 a go.

i don't think you want googlebot requesting robots.txt first for every resource requested. iirc googlebot caches robots.txt for up to 24 hours.

No, we wouldn't. The point I was trying to make was that the Googlebot doesn't check robots.txt very often. It will go from internal link to internal link and so on...

what was the elapsed time for those 147 requests of robots.txt?

From 1st of June until I posted.

the part i see missing is where you have verified that googlebot has actually requested a url in the /merchant/ directory and if so that you checked the IP of the visitor to verify that it is in fact googlebot and not a spoofed user agent.

Yes. Have verified. Watched it live. Saw it with my one good eye. Checked the referrer, and IP. It's the Googlebot.

it has been mentioned numerous times in this thread that the noindex directive is irrelevant when you have excluded googlebot from crawling that url.

Point being?

You can't set a per-page quota. Well-behaved small robots pick up robots.txt at the start of each separate visit. Large robots-- and you can hardly get bigger than the googlebot-- read robots.txt, spread it around to their fellow googlebots, and hold it for up to 24 hours.

:) Yes, fully aware - see above response to phranque.

As I understand it, "nofollow" doesn't mean "pretend you haven't seen this link". It just means "I make no claims about the quality of the material I'm linking to".

That is one "definition". It can also mean "it's a paid link", or "don't pass on page juice", or "don't follow" - depending on which bot we're talking about.

From Matt Cutts:

How does Google handle nofollowed links?

In general, we don't follow them. This means that Google does not transfer PageRank or anchor text across these links. Essentially, using nofollow causes us to drop the target links from our overall graph of the web. However, the target pages may still appear in our index if other sites link to them without using nofollow, or if the URLs are submitted to Google in a Sitemap. Also, it's important to note that other search engines may handle nofollow in slightly different ways.

https://support.google.com/webmasters/answer/96569?hl=en

Convergence

8:55 pm on Jun 30, 2013 (gmt 0)

Googlebot 531,659+147

I can see how this is misleading - this is cumulative. Our site is broken up into four main sub-domains (each with their own hosting account) with hundreds of sub-categories and hundreds of product sitemaps broken down by merchant...

aakk9999

9:44 pm on Jun 30, 2013 (gmt 0)

So if I understood well:

- You have confirmed via WMT that URLs with the pattern /merchant/ are blocked via robots.txt
- However, you have positively identified in your logs (via IP address and user agent) that Googlebot has requested a URL with the pattern /merchant/, i.e. in your logs there was a line something like: GET /merchant/ with 200 OK, an IP address from Googlebot, and user agent Googlebot

Are you absolutely sure that this URL was requested by Googlebot and not some other bot from Google (e.g. AdsBot-Google treats robots.txt differently, see Note 2 below)

If so, how odd...

With regards to the results you are seeing in SERPs for URLs with /merchant/, which appear as a URL with "A description for this result is not available because of this site's robots.txt – learn more" - to me this would indicate that Googlebot knew the page was blocked via robots.txt and that it has not crawled it.

There is an important distinction between crawling and indexing. Robots.txt controls crawling, but not indexing [developers.google.com ]:

Note: Pages may be indexed despite never having been crawled: the two processes are independent of each other. If enough information is available about a page, and the page is deemed relevant to users, search engine algorithms may decide to include it in the search results despite never having had access to the content directly. That said, there are simple mechanisms such as robots meta tags to make sure that pages are not indexed.

(*) Note 2: AdsBot-Google ignores the robots.txt User-agent: * section; to block it, there has to be a dedicated user-agent section declared in robots.txt.
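As a concrete illustration of Note 2, a robots.txt that blocks both ordinary crawlers and AdsBot would need something like this (the directory name here is just an example):

```
User-agent: *
Disallow: /merchant/

# AdsBot-Google skips the wildcard section above, so it needs
# its own block to be excluded.
User-agent: AdsBot-Google
Disallow: /merchant/
```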

Convergence

9:56 pm on Jun 30, 2013 (gmt 0)

- You have confirmed via WMT that URLs with the pattern /merchant/ are blocked via robots.txt
- However, you have positively identified in your logs (via IP address and user agent) that Googlebot has requested a URL with the pattern /merchant/, i.e. in your logs there was a line something like: GET /merchant/ with 200 OK, an IP address from Googlebot, and user agent Googlebot

Yes.

Are you absolutely sure that this URL was requested by Googlebot and not some other bot from Google (e.g. AdsBot-Google treats robots.txt differently, see note below)

Yes. We also REFUSE to have Adsense on our web properties. That would compete with OUR PPC ad network :)

If so, how odd...

Yes. That's why I posted what I have.

There is an important distinction between crawling and indexing. Robots.txt controls crawling, but not indexing

However, saw plain as day, the 200 header response in the logs.

It is what it is, lol...

phranque

11:27 pm on Jun 30, 2013 (gmt 0)

have you verified that all preceding requests for robots.txt also got a 200 OK response?

and there's no chance that robots.txt was modified such that the /merchant/ directory was not excluded at some point?

lucy24

12:25 am on Jul 1, 2013 (gmt 0)

However, saw plain as day, the 200 header response in the logs.

The response is less significant than the fact that there was a request to respond to.

I've had to 403 the major search engines because they persist in trying for the piwik directory even though it's roboted-out. Meanwhile, other roboted-out directories with ordinary <a href> html links from accessible pages remain perfectly safe. Go figure.

Convergence

3:30 am on Jul 1, 2013 (gmt 0)

have you verified that all preceding requests for robots.txt also got a 200 OK response?

Yes. Also have seen it in other "forbidden" directories.

and there's no chance that robots.txt was modified such that the /merchant/ directory was not excluded at some point?

No.

Question, will look to grab a screenie next time we come across it. Providing I do not reveal personal site information, is it allowable to link to it or upload the image?

Next question:

Why does it appear to be so unbelievable that the Google talks out of both sides of its "face"?

Convergence

3:43 am on Jul 1, 2013 (gmt 0)

Just checked:

Have an entire page of the Googlebot going to another directory that is "blocked". Took a screenie of the first nine out of 168 URLs.

We block product reviews from the bots as they are for our visitors, not the bots...

lucy24

7:24 am on Jul 1, 2013 (gmt 0)

I think this is where you will have to move from the polite "Please Do Not Enter" sign to the locked and barred door :(

Or, of course, let them continue crawling but slap Noindex tags on everything in sight. It's an individual choice.

aakk9999

11:58 am on Jul 1, 2013 (gmt 0)

There is one thing I noticed some time ago: if robots.txt includes a pattern that occurs only at the very end of a long-ish URL, then sometimes the robots.txt exclusion does not work.

I noticed this when I blocked some URLs based on parameters that were at the end of a long-ish URL. When I changed the blocking pattern, the robots.txt exclusion worked.

@Convergence, on another note - even though Googlebot requested these URLs, from what I understood from your post, it still did not show the URL title/description from the fetched page, i.e. the SERPs showed the URL and a Google-created meta "A description for this result is not available.. etc"

So I wonder why it fetched the page if it did not then peek into it.

If you search for a unique sentence from one of these pages, do they show in SERPs?

nettulf

1:06 pm on Jul 1, 2013 (gmt 0)

It could be natural for them to make sure the URL exists when they choose to show it in the SERPs, even without a description.

Just speculation of course... but how else would they remove robot-blocked and indexed URLs that are deleted or never existed at all? They would just sit there in the results forever.

lucy24

4:19 pm on Jul 1, 2013 (gmt 0)

Were they full GET fetches or just HEAD? When {well-known hotlinking site} checks for pages, it just does a HEAD to ensure the page is still there. I think w3's link checker does the same.
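One way to answer that from raw logs is to pull the request method out of each line. A minimal sketch - the sample log line is invented for illustration, and real log formats vary:

```python
import re

# Matches the request portion of a combined-log-format line,
# e.g. "GET /path HTTP/1.1", and captures the HTTP method.
LOG_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+"')

def request_method(log_line):
    m = LOG_RE.search(log_line)
    return m.group("method") if m else None

# Invented example line for illustration only.
sample = ('66.249.66.1 - - [30/Jun/2013:12:00:00 +0000] '
          '"GET /merchant/widgets.html HTTP/1.1" 200 5120 "-" '
          '"Mozilla/5.0 (compatible; Googlebot/2.1)"')
```

A HEAD request also typically logs a response body size of "-" or 0, which is another quick tell.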

Convergence

5:58 pm on Jul 1, 2013 (gmt 0)

If you search for a unique sentence from one of these pages, do they show in SERPs?

We have seen visitors come in directly to the "reviews" page and the referring phrase was "widget reviews".

Were they full GET fetches or just HEAD? When {well-known hotlinking site} checks for pages, it just does a HEAD to ensure the page is still there. I think w3's link checker does the same.

Full GET, with page size shown in the logs.

If you search for a unique sentence from one of these pages, do they show in SERPs?

Will try that the next time we go hunting for them. However:

I can remember searching on the Google for some sort of "how to" related to website coding - came across a W3Schools result in the SERPs that said "A description for this result is not available because of this site's robots.txt – learn more". Upon clicking on the link, it was exactly what I was looking for.

It could be natural to make sure the URL exists when they choose to show it in the SERPs, even without description.

Just speculation of course... but how else would they remove robot-blocked and indexed URLs that is deleted or never existed at all? They would just sit there in the results forever.

Agree. Personally believe EVERYTHING gets crawled...

phranque

6:41 pm on Jul 1, 2013 (gmt 0)

Why does this appear to be so unbelievable

i've actually been paying attention to this issue for several years. i've read a fair number of threads in this forum and a few more in other forums and in every case i've seen where it was claimed that googlebot ignored robots.txt it turned out to be a misunderstanding or a technical issue.

/[REMOVED_BY_ME]/review/[REMOVED_BY_ME].html

what does the ruleset look like that excludes googlebot from this directory? do you have a googlebot-specific section with all exclusions intended for googlebot? i assume you realize that disallowed urls are matched left-to-right, starting with the root directory slash.
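the left-to-right matching can be illustrated with a toy matcher - this is only a sketch of the plain prefix rule, not a full robots.txt parser, and it ignores google's * and $ wildcard extensions:

```python
# Toy sketch: a Disallow rule matches as a prefix of the URL path,
# starting from the root slash. A rule like "/review/" therefore
# does NOT match "/widgets/review/page.html".
def is_disallowed(url_path, disallow_rules):
    # Empty rules ("Disallow:" with no value) match nothing.
    return any(url_path.startswith(rule) for rule in disallow_rules if rule)
```

this also bears on aakk9999's earlier observation about patterns at the end of long-ish urls: a plain Disallow value only matches from the start of the path, so a trailing pattern needs a leading wildcard (e.g. Disallow: /*pattern) under google's extensions.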

I can remember searching on the Google for some sort of "how to" related to website coding - came across a W3Schools result in the SERPs that said "A description for this result is not available because of this site's robots.txt – learn more". Upon clicking on the link, it was exactly what I was looking for.

that could easily be due to anchor text rather than on-page factors.

Personally believe EVERYTHING gets crawled...

i've also plowed through quite a number of log files and never seen a misbehavior by googlebot re:robots.txt

Convergence

4:47 am on Jul 2, 2013 (gmt 0)

Well, phranque - it's doing it right now. 175 pages in the /review/ directory. What is interesting is that it is one right after another. There is NO sitemap for reviews, so the Googlebot has a list of URLs found on our product pages and is following up, trying to scrape (yes, I said scrape) reviews for nothing but its own use.