WordPress Trac: Ticket #16893: Stop or reduce crawling of comment reply ?replytocom URLs
https://core.trac.wordpress.org/ticket/16893
<p>
(For full background, you can check out the off-topic comments at the ends of <a class="closed ticket" href="https://core.trac.wordpress.org/ticket/10550" title="defect (bug): nofollow attribute added to comment_reply_link function (closed: fixed)">#10550</a> and <a class="new ticket" href="https://core.trac.wordpress.org/ticket/16881" title="defect (bug): Remove all unwanted 'nofollow' attributes from 'reply to comment' links (new)">#16881</a>.)
</p>
<p>
<a class="changeset" href="https://core.trac.wordpress.org/changeset/16230" title="Remove nofollow on comment reply links. Fixes #10550">r16230</a> (quite appropriately) removed the <tt>rel="nofollow"</tt> attribute from the &lt;a href="?replytocom"&gt;Reply to this comment&lt;/a&gt; links in the comments display. Since then, users have reported that search engines are now crawling these pages (as one would expect). This adds unnecessary server overhead (these pages are almost always dynamically generated, even when a caching plugin is in use) and may reduce how often search engines crawl "real" pages, since there are so many of these dummy URLs to index.
</p>
<p>
Additionally, there may be SEO-related reasons why this is bad. Although that may be largely mitigated by <tt>rel="canonical"</tt> and the fact that the contents of these pages are 99.9% a duplicate of their canonical versions, search engines are also known to penalize sites for having many pages with duplicate content. Moreover, the specification for the canonical link relation states that its use in page-ranking calculations is at the discretion of indexers (i.e., using canonical is no deterministic guarantee of anything).
</p>
<p>
I'm attaching 2 patches to be considered/discussed separately:
</p>
<h2 id="a1:robotsmetatag">1: robots meta tag</h2>
<p>
The first applies the <a class="ext-link" href="http://www.robotstxt.org/"><span class="icon">​</span>robots exclusion standard</a> to these URLs.
</p>
<p>
For individual site admins, putting <tt>Disallow: *?replytocom</tt> in your robots.txt is the obvious fix. <a class="closed ticket" href="https://core.trac.wordpress.org/ticket/11918" title="enhancement: do_robots() Enhancement (closed: fixed)">#11918</a> would have added this rule to what do_robots() returns, but the rule was dropped (well before <a class="changeset" href="https://core.trac.wordpress.org/changeset/16230" title="Remove nofollow on comment reply links. Fixes #10550">r16230</a>). I would support putting it back in.
</p>
<p>
My patch general-template.php.17522.diff would add the <tt>&lt;meta name='robots' content='noindex,nofollow' /&gt;</tt> tag (already used when the WordPress privacy setting is enabled) to the ?replytocom URLs. The pages would still be hit by crawlers but would no longer be indexed -- so, 100% addressing the duplicate indexing issue and rendering moot debates over the effectiveness of "canonical."
</p>
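<p>
A minimal sketch of the idea, not the attached patch itself: the helper name <tt>replytocom_robots_meta()</tt> is hypothetical, and hooking it to <tt>wp_head</tt> is just one assumed way to print the tag. The tag string is the one quoted above.
</p>

```php
<?php
// Hypothetical helper (not from the patch): returns the robots meta tag
// when the request carries a replytocom parameter, and '' otherwise.
function replytocom_robots_meta( array $query_args ) {
    if ( isset( $query_args['replytocom'] ) ) {
        return "<meta name='robots' content='noindex,nofollow' />\n";
    }
    return '';
}

// Inside WordPress, print the tag in the document head.
// The guard lets this file also run outside WordPress.
if ( function_exists( 'add_action' ) ) {
    add_action( 'wp_head', function () {
        echo replytocom_robots_meta( $_GET );
    } );
}
```

<p>
Keeping the check in a small pure function makes it easy to exercise without loading WordPress.
</p>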
<p>
Compared to robots.txt, this is an imperfect solution to the crawling/server overhead problem because these pages would still have to be at least partially retrieved to read the meta tag. I expect that it will still have a beneficial effect in this area, because well-behaved crawlers will reduce how often they retrieve these noindex URLs.
</p>
<p>
I've got this code deployed on some live sites to test this idea out (but they previously had a robots.txt rule blocking ?replytocom, so I don't expect instantly useful info).
</p>
<p>
Even if it's not 100% effective, I think this is an improvement and a trivial enhancement that could go into core soon. (This is the dev-feedback item.)
</p>
<h2 id="a2:changelinkstoforms">2: change links to forms</h2>
<p>
Picking up on an idea filosofo had to replace the &lt;a&gt; links with forms, I've done that with comment-template.php.17522.diff.
</p>
<p>
Functionally, this is a drop-in replacement, since these are GET forms that produce the same HTTP request as the current &lt;a&gt; tags (please, try them out!).
</p>
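<p>
To illustrate the equivalence, here is a rough sketch (not the attached patch) of a GET form that requests the same URL as a <tt>?replytocom</tt> link. The function name, the <tt>#respond</tt> fragment on the action, and the example URL are assumptions, and the permalink is assumed to have no query string of its own, since a GET form submission replaces the action's query string.
</p>

```php
<?php
// Hypothetical helper: builds a GET form whose submission requests
// {permalink}?replytocom={id}#respond, like the existing reply link.
// Inside WordPress, esc_url() and esc_attr() would be the usual escaping calls.
function replytocom_form_html( $permalink, $comment_id, $label = 'Reply' ) {
    return sprintf(
        '<form method="get" action="%s#respond">'
        . '<input type="hidden" name="replytocom" value="%d" />'
        . '<button type="submit">%s</button></form>',
        htmlspecialchars( $permalink, ENT_QUOTES ),
        (int) $comment_id,
        htmlspecialchars( $label, ENT_QUOTES )
    );
}

// Example usage with a made-up permalink and comment ID.
echo replytocom_form_html( 'https://example.com/a-post/', 15 );
```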
<p>
Although Google has been <a class="ext-link" href="http://googlewebmastercentral.blogspot.com/2008/04/crawling-through-html-forms.html"><span class="icon">​</span>crawling forms</a> for some time, it's only been on an experimental basis and isn't widespread. It also does not affect a crawler's page selection or the search engine ranking in any way. So, changing these links to GET forms should sharply reduce (but possibly not eliminate) crawling of these URLs. I don't have any data for this, so it merits further discussion.
</p>
<p>
One implementation issue would be that since the &lt;a&gt; tags are now &lt;button&gt; elements, it's going to affect themes. That alone could be a deal-killer! I chose the button element over input because it's easier to style in a cross-browser way (i.e. to look just like a link), but defer to the UI folks on that.
</p>
<p>
So ... this 2nd idea isn't necessarily fully cooked and would need broader support.
</p>
<p>
And, someone may have an even better way to address the overall issue.
</p>
joelhardi, Sat, 19 Mar 2011 06:37:31 GMT: attachment set
https://core.trac.wordpress.org/ticket/16893
<ul>
<li><strong>attachment</strong>
set to <em>general-template.php.17522.diff</em>
</li>
</ul>
<p>
adds robots meta tag to ?replytocom pages
</p>
joelhardi, Sat, 19 Mar 2011 06:38:02 GMT: attachment set
https://core.trac.wordpress.org/ticket/16893
<ul>
<li><strong>attachment</strong>
set to <em>comment-template.php.17522.diff</em>
</li>
</ul>
<p>
changes "reply" links in comments list to form buttons
</p>
hakre, Sat, 19 Mar 2011 10:17:21 GMT
https://core.trac.wordpress.org/ticket/16893#comment:1
<p>
Thanks for the initiative.
</p>
<p>
Why nofollow?
</p>
joelhardi, Sat, 19 Mar 2011 19:18:43 GMT
https://core.trac.wordpress.org/ticket/16893#comment:2
<p>
Replying to <a class="closed" href="https://core.trac.wordpress.org/ticket/16893#comment:1" title="Comment 1 for Ticket #16893">hakre</a>:
</p>
<blockquote class="citation">
<p>
Why nofollow?
</p>
</blockquote>
<p>
I'm assuming you mean in the robots meta tag?
</p>
<p>
Well, a distinction should be made here between this and rel="nofollow" -- the robots version predates the rel attribute and has a different meaning. (The potential for confusion has been noted in the <a class="ext-link" href="http://microformats.org/wiki/rel-nofollow#open_issues"><span class="icon">​</span>rel=nofollow spec</a> since forever.) The <a class="ext-link" href="http://www.robotstxt.org/meta.html"><span class="icon">​</span>robots meta version</a> actually means "do not scan this page for links to follow" whereas the rel attribute means "leave this link out of your link-value calculations" (i.e. PageRank). You probably knew that, I just thought I'd explain for whoever comes along since these tickets have veered way OT.
</p>
<p>
So, in the robots version, we could do just "noindex" but leave out "nofollow" and the page would not be indexed, but the search engine crawler would still scan it for links to other pages to index.
</p>
<p>
Anyway, I thought about whether to include it, because right now there are no "bad" links on the ?replytocom pages. All the links are identical to those on the regular post/comment page, with the exception of cancel-comment-reply-link, and it only hrefs "${postURL}#respond".
</p>
<p>
But, for the same reason there's also no reason to <strong>not</strong> include it -- the links that would be followed have already been spidered one hop before getting to the ?replytocom page.
</p>
<p>
I decided to include it because the goal is to eliminate or reduce crawling of these pages -- so the hope is that if they're "noindex,nofollow," smart search engines like Googlebot will adjust their crawl frequency of these URLs down, since they're of zero value. Possibly they could also cancel page downloading midstream when the HTTP response is larger than the TCP window size, reducing transfer (although this is so minor I don't think it's worth a big investigation).
</p>
<p>
Whereas, if the pages aren't nofollow, bots can/should still scan them for links, so they're going to be crawled perhaps just as frequently as before (only, not be indexed).
</p>
<p>
Hypothetically, if in the future WordPress were to add some new, unique link to these ?replytocom pages (unlikely, I know), we might have to revisit this issue. However, I think such a link is more likely to be another functional, app-controller style URL (like "cancel comment reply") that we don't want crawled than something with unique high-value content that we do want followed. So, that also argues in favor of including "nofollow".
</p>
joelhardi, Wed, 06 Apr 2011 07:09:25 GMT
https://core.trac.wordpress.org/ticket/16893#comment:3
<p>
Reporting back on my running of <a class="attachment" href="https://core.trac.wordpress.org/attachment/ticket/16893/general-template.php.17522.diff" title="Attachment 'general-template.php.17522.diff' in Ticket #16893">attachment:general-template.php.17522.diff</a><a class="trac-rawlink" href="https://core.trac.wordpress.org/raw-attachment/ticket/16893/general-template.php.17522.diff" title="Download">​</a> (which adds the robots noindex,nofollow meta tag to ?replytocom URLs) on 2 live sites for the past couple of weeks since this ticket was added.
</p>
<p>
It's worked as well as (or better than) I expected and I'd recommend adding this functionality to a future release.
</p>
<p>
?replytocom pages have not been indexed by Google and there's been no increase in googlebot crawling of these sites (previously I'd had robots.txt block access to these URLs). So, even the hypothesis about googlebot intelligently not trying to recrawl these URLs once it encounters the meta tag has been borne out.
</p>
<p>
Also, in Google Webmaster Tools there's a "crawl errors" section which normally lists URLs blocked by robots.txt. These URLs aren't included (in fact they don't show up anywhere in Webmaster Tools) since they're blocked by the meta tag. So, the end-user goal of users not having these URLs litter their screen when they log into Webmaster Tools is also achieved. I think this is a good improvement to quiet the complaining on the other thread about Google now crawling these pages since the rel="nofollow" attrib was dropped from &lt;a&gt; tags, and don't see any potential downsides.
</p>
nacin, Wed, 06 Apr 2011 19:23:08 GMT
https://core.trac.wordpress.org/ticket/16893#comment:4
<p>
This patch seems a little low in the stack. Maybe this somewhere:
</p>
<pre class="wiki">if ( isset( $_GET['replytocom'] ) )
	add_filter( 'pre_option_blog_public', '__return_zero' );
</pre>
joelhardi, Wed, 06 Apr 2011 21:14:40 GMT
https://core.trac.wordpress.org/ticket/16893#comment:5
<p>
Thanks, I agree about it being too low in the stack, I just couldn't think of a better way.
</p>
<p>
I looked for someplace obvious to add a filter and didn't see one, and didn't know what filter to add or about __return_zero (how useful!). So you have solved 95% of it!
</p>
<p>
I had thought that, to group the filter addition with the other replytocom code, it would have to go into one of the functions in comment-template.php unless there was a more serious refactor. 'replytocom' is just a magic string in that file and there are about 3 funcs doing branches on <tt>isset($_GET['replytocom'])</tt>.
</p>
<p>
The problem is that none of these functions is called until after wp_head() so that doesn't work.
</p>
<p>
It would definitely work to put it in default-filters.php but then it's even lower in the stack. Could put the replytocom check in a new function and hook it to wp_head, but don't think that's better since it adds overhead and is basically equivalent to how noindex() is called.
</p>
<p>
Could make replytocom into a public query var, and then add a filter to 'query_vars' or similar so that it's called inside class WP when the request is parsed.
</p>
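<p>
A rough sketch of that last idea, under stated assumptions: both callback names are hypothetical, and running the check on the <tt>wp</tt> action (which fires once the request has been parsed) is an assumed choice, not something settled in this ticket. The <tt>pre_option_blog_public</tt> filter and <tt>__return_zero</tt> are the ones suggested in the previous comment.
</p>

```php
<?php
// Hypothetical callback: registers replytocom as a public query variable.
function example_register_replytocom_var( array $vars ) {
    if ( ! in_array( 'replytocom', $vars, true ) ) {
        $vars[] = 'replytocom';
    }
    return $vars;
}

// Hypothetical callback: after the request is parsed, report the blog as
// non-public for reply pages so the existing noindex logic applies.
function example_noindex_replytocom() {
    if ( get_query_var( 'replytocom' ) ) {
        add_filter( 'pre_option_blog_public', '__return_zero' );
    }
}

// Only register the hooks when WordPress is loaded.
if ( function_exists( 'add_filter' ) ) {
    add_filter( 'query_vars', 'example_register_replytocom_var' );
    add_action( 'wp', 'example_noindex_replytocom' );
}
```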
<p>
Anyway, those are my ideas, somebody like you who knows the code 10x better may have a much better one.
</p>
nacin, Sat, 09 Apr 2011 15:44:11 GMT: milestone changed
https://core.trac.wordpress.org/ticket/16893#comment:6
<ul>
<li><strong>milestone</strong>
changed from <em>Awaiting Review</em> to <em>3.2</em>
</li>
</ul>
<p>
Replying to <a class="closed" href="https://core.trac.wordpress.org/ticket/16893#comment:5" title="Comment 5 for Ticket #16893">joelhardi</a>:
</p>
<blockquote class="citation">
<p>
It would definitely work to put it in default-filters.php but then it's even lower in the stack. Could put the replytocom check in a new function and hook it to wp_head, but don't think that's better since it adds overhead and is basically equivalent to how noindex() is called.
</p>
</blockquote>
<p>
Actually, I was thinking default-filters.php. Putting it in a function and hooking it is no different, since undoing that is unhooking a filter, while undoing this is hooking a filter. Dropping it into default-filters simply avoids an extra layer. It's not lower in the stack, as it's keeping the code outside of noindex(), which would still feel like a generic function.
</p>
<p>
Moving to 3.2 for review.
</p>
nacin, Sat, 09 Apr 2011 15:45:54 GMT: attachment set
https://core.trac.wordpress.org/ticket/16893
<ul>
<li><strong>attachment</strong>
set to <em>16893.diff</em>
</li>
</ul>
joelhardi, Sun, 10 Apr 2011 01:23:46 GMT
https://core.trac.wordpress.org/ticket/16893#comment:7
<p>
Works for me!
</p>
nacin, Thu, 12 May 2011 03:59:17 GMT: status changed; owner, resolution set
https://core.trac.wordpress.org/ticket/16893#comment:8
<ul>
<li><strong>owner</strong>
set to <em>nacin</em>
</li>
<li><strong>status</strong>
changed from <em>new</em> to <em>closed</em>
</li>
<li><strong>resolution</strong>
set to <em>fixed</em>
</li>
</ul>
<p>
In <a class="changeset" href="https://core.trac.wordpress.org/changeset/17891" title="Don't allow indexing of replytocom URLs. fixes #16893.">[17891]</a>:
</p>
<div class="message"><p>
Don't allow indexing of replytocom URLs. fixes <a class="closed ticket" href="https://core.trac.wordpress.org/ticket/16893" title="enhancement: Stop or reduce crawling of comment reply ?replytocom URLs (closed: fixed)">#16893</a>.<br />
</p>
</div>
hakre, Thu, 12 May 2011 09:04:24 GMT
https://core.trac.wordpress.org/ticket/16893#comment:9
<p>
Thanks, I can't wait to see this in public testing!
</p>