At the moment, MediaZilla isn't indexed by search engines, which means that searching for MediaWiki bugs will *never* lead anyone here; at best it directs them to one of those many fishy websites that pair up the bug mailing list with advertisements. Even if one then reads off the bug number and searches for "mediawiki bug 4711", one still doesn't end up here. So please remove robots.txt.

Just a disclaimer: I do not get paid for the time I spend here. If WMF wants me to jump through some hoops, that's fine, but no, thanks.

There's a nice Google Tech Talk by Spolsky where he explains the design principles of stackoverflow.com and the road bumps that impede workflows. If WMF has some data that vandalism and load on the bugtracker outweigh the ease of use for and value of potential patches from MediaWiki users, so be it.

Fulltext, freeform search would be great, but a lot can be done with advanced
search.

Given the recent struggles here with vandalism and load, though, increasing our
problems by introducing bots isn't a high priority right now.

True about going there directly if you already know the URL, but most people wouldn't know that bug 1234 is actually [1].

I'm not sure what the issue is with having Google, among others, index our bugzilla instance. It doesn't open us up to any more spam.

Looking at [3], it seems we have the default BZ robots.txt installed.

A bit of searching around ([2] among others) seems to suggest we'll need to install the sitemap extension [4].

I also don't think a blanket removal of the robots.txt is a good idea. However, following the example at [5] and updating ours to something along those lines seems very sane; I'm not sure why the default is so limiting. The sitemap extension also includes an improved robots.txt.
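To illustrate, a less limiting robots.txt along those lines might look roughly like this (just a sketch, not the actual contents of [5] or of the extension's file, and it assumes crawlers honour Allow rules, which is questioned further down):

User-agent: *
Allow: /show_bug.cgi
Allow: /index.cgi
Disallow: /*.cgi

That shape keeps crawlers away from the expensive query and reporting CGIs while letting the individual bug pages be indexed.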

We can easily get ops to update the robots.txt, because it's a quick fix, but we might need a bit more time to get ops to actually install the extension, and then presumably a submission to Google Webmaster Tools as well.

I SERIOUSLY doubt that robots.txt is doing ANYTHING to help lower the issues we have here. Vandalism happens even though we have a robots.txt, so whoever is behind it is clearly ignoring it. And I know for a fact that e-mail addresses are already being harvested from our bugtracker, so robots.txt isn't helping there either.

The only thing robots.txt is doing is keeping out all the good bots; all we have left are the bad ones.

Are we sure we want this? I would imagine it's similar to why we don't really want the mailing lists indexed: the amount of cruft it could potentially introduce into search results is something we don't want.

> I SERIOUSLY doubt that robots.txt is doing ANYTHING to help lower the issues
> we have here. Vandalism happens even though we have a robots.txt, so whoever
> is behind it is clearly ignoring it. And I know for a fact that e-mail
> addresses are already being harvested from our bugtracker, so robots.txt
> isn't helping there either.

Just to be clear, I wasn't saying that we are keeping vandalism at bay by having a stricter robots.txt file. As pointed out in comment #1, there are plenty of links to the tracker all over the internet that vandals could follow if that was how they found bug trackers to play with.

In the past (perhaps less so currently?) "well behaved" spiders that respected robots.txt have routinely wreaked havoc on sites like this one that are, essentially, a bunch of cgi scripts that result in a process being forked for each request.

So, last week, we dealt with some apparent vandalism when someone brought the server to a halt by requesting a particular URL over and over.

My point was simply that if we suddenly make bugzilla visible to spiders that respect robots.txt, they would probably send a ton of queries to the server (e.g. several spiders from each search engine) to quickly discover the newly available data.

That sort of sudden visibility could very well look a lot like the vandalism we saw last week.
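If we do open things up, one partial mitigation would be a Crawl-delay directive in robots.txt, which spaces out requests for the crawlers that honour it (Bing and Yandex do; Google ignores it and takes its crawl rate from Webmaster Tools instead). A sketch:

User-agent: *
Crawl-delay: 10

where the value is, roughly, the number of seconds to wait between requests for the crawlers that support it.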

Sitemaps have already been actively submitted to Google; there was just a failure with Yahoo.

Replacing ./robots.txt. (The old version will be saved as
"./robots.txt.old". You can delete the old version if you do not need
its contents.)
Pinging search engines to let them know about our sitemap:

This statement isn't correct anymore. I get results from bugzilla.wikimedia.org on google.com (though not with perfect ranking).
I don't know exactly what the benefits of the aforementioned bugzilla-sitemap would be compared to the current situation.

(In reply to comment #15)
We haven't moved the bugzilla customizations to gerrit yet. We probably should.

See [[Site map]]. The current sitemap is broken: it's an empty file, which is also invalid XML.

Another benefit of a sitemap, apart from letting search engines know about all bugs, is having a "last modified" field on each page to be indexed. If a particular page (or bug, in this case) has already been indexed by the search engine, it won't be reindexed unless its last-modified date is newer than the cached copy: that should save some CPU and bandwidth because old bugs won't be re-crawled again.
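For reference, a minimal entry in the standard sitemaps.org format with such a field would look like this (the bug ID and date are just placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://bugzilla.wikimedia.org/show_bug.cgi?id=1234</loc>
    <lastmod>2012-06-15</lastmod>
  </url>
</urlset>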

Tim: It's not clear to me what needs to be done to fix this report. Could you
please clarify? Otherwise I might close this as WORKSFORME as I simply don't
know what's missing...

Essentially, as Jesús said, the sitemap extension seems to be broken as deployed, as it doesn't return any sitemap. Its installation was suggested by Sam in comment #3. For starters, it would be nice if someone could relay the status of RT #2198.

Without a working configuration, it's hard to assess whether the bad search rankings are due to this error.

There also seems to be an assumption that Allow rules can override previous Disallow rules. I'm not sure if this is actually the case. If *.cgi is disallowed, will *show_bug.cgi become allowed with a later directive?
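As far as I know, the original robots exclusion standard didn't define Allow at all. Google and Bing document that the most specific (longest) matching rule wins regardless of order, so for them something like

User-agent: *
Disallow: /*.cgi
Allow: /show_bug.cgi

would let /show_bug.cgi through while keeping the other CGIs blocked, but simpler crawlers may apply rules first-match or ignore Allow entirely, so this is worth verifying before relying on it.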

A sitemap is not needed for search engines to index MediaZilla, since all bugs are sent to wikibugs-l and end up listed on various web pages. But they aren't indexing MediaZilla because of this entry in robots.txt:

Disallow: /*.cgi

But without a sitemap, search engines don't know when a bug is updated, and end up reindexing the entire site every time, producing a lot of overhead on the servers and bringing the site down. With a sitemap, only bugs updated since the last crawl would (supposedly) be crawled again, reducing the overhead on the site, although I'm not sure to what extent.

From comment 32, it just needs to generate a sitemap; there's no need to ping search engines about its existence. They'll know about it when they fetch robots.txt again and find the sitemap's location there. I don't see why it's pinging search engines.
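For what it's worth, advertising the sitemap from robots.txt is a single line; the URL below is only a placeholder, since the exact path depends on where the extension actually publishes the file:

Sitemap: https://bugzilla.wikimedia.org/sitemap.xml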

From what I see, that patch doesn't ping search engines; it also saves the sitemap on the server and serves that saved copy to search engines instead of regenerating the sitemap *every time* the URL is requested, for a period of time defined in SITEMAP_AGE. This should be more convenient. Maybe we can get the extension that bmo is using from somewhere? Or at least consider using that patch if it looks sane.
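Bugzilla itself is Perl, but the caching behaviour described above amounts to something like this sketch (Python purely for illustration; the file path and the helper are hypothetical, only SITEMAP_AGE comes from the patch):

import os
import time

SITEMAP_AGE = 12 * 3600            # placeholder: regenerate at most every 12 hours
SITEMAP_PATH = "data/sitemap.xml"  # placeholder location on the server

def generate_sitemap():
    # Hypothetical stand-in: the real code would walk the bug table and emit <url> entries.
    return ('<?xml version="1.0" encoding="UTF-8"?>'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"></urlset>')

def get_sitemap():
    """Return the sitemap, regenerating it only when the saved copy is stale."""
    try:
        age = time.time() - os.path.getmtime(SITEMAP_PATH)
    except OSError:
        age = None  # no sitemap on disk yet
    if age is None or age > SITEMAP_AGE:
        xml = generate_sitemap()
        with open(SITEMAP_PATH, "w") as f:
            f.write(xml)    # save it so later requests serve the cached copy
        return xml
    with open(SITEMAP_PATH) as f:
        return f.read()     # serve the saved copy instead of rebuilding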

