
marpot writes "Recently, the 1st International Competition on Wikipedia Vandalism Detection (PDF) finished: 9 groups (5 from the USA, 1 affiliated with Google) tried their best at detecting all the vandalism cases in a large-scale evaluation corpus. The winning approach (PDF) detects 20% of all vandalism cases without misclassifying regular edits; moreover, it can be adjusted to detect 95% of the vandalism edits while misclassifying only 30% of all regular edits. Thus, by applying both settings, manual double-checking would only be required on 34% of all edits. It is not yet known whether the rule-based bots on Wikipedia can compete with this machine-learning-based strategy. Anyway, there is still a lot of potential for improvement, since the top 2 detectors use entirely different detection paradigms: the first analyzes an edit's content, whereas the second (PDF) analyzes an edit's context using WikiTrust."
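The 34% figure isn't derived in the summary, but it can be reproduced under one assumption: a base vandalism rate of about 7%, roughly the rate in the PAN-WVC-10 corpus (that rate is my assumption, not something stated above). A quick sketch:

```python
# Back-of-envelope check of the summary's 34% figure.
# ASSUMPTION: a vandalism rate of about 7% among all edits,
# roughly the rate in the PAN-WVC-10 evaluation corpus.
vandalism_rate = 0.07
regular_rate = 1 - vandalism_rate

recall = 0.95           # high-recall setting: 95% of vandalism flagged
false_positive = 0.30   # ...at the cost of flagging 30% of regular edits

# Fraction of all edits the high-recall setting flags for manual review.
flagged = recall * vandalism_rate + false_positive * regular_rate
print(f"{flagged:.1%} of all edits need manual double-checking")  # about 34.5%
```

With a higher or lower base rate of vandalism the flagged fraction moves accordingly, which is worth keeping in mind when reading the disputes over this number further down the thread.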

If the algorithm can detect 20% with perfection then that must constitute extremely low-hanging fruit. That type of vandalism is just an annoyance. It is so obvious that the end user readily recognizes it as such and can skip over it or revert the edit.

The real issue is disinformation, which is vastly more subtle. The only defense is fact-checking or seeking out references. If the algorithm is capable of recognizing that kind of vandalism then the developers should have the software writing all the articles in the first place, because it'd have to be pretty spectacular to manage that.

The major problem now is that 99% of all good edits submitted to wikipedia are reverted anyways as false positives.

The reason for this is that corrupt administrators do nothing to stop it, and corrupt idiots wanting to become admins just sit all day on the semi-automated tools like "Twinkle" or "Huggle" reverting anything in sight to get their edit counts up.

The real issue is disinformation, which is vastly more subtle. The only defense is fact-checking or seeking out references.

Depends on your definition of bad edits. Take, for example, an edit that adds useful information but is unsourced or poorly worded. Delete it, or keep it and try to improve it?

Increasingly WP seems to be going for the former. Ditto with useful articles that are about niche subjects, particularly software, which are deemed not notable enough to keep. Whatever happened to "Wikipedia is not paper"?

The major problem now is that 99% of all good edits submitted to wikipedia are reverted anyways as false positives.

That's just nonsense, and you know it. It does indeed happen that good edits are reverted, but it does not happen 99% of the time -- not even close.

It's hard to say which is the bigger problem -- good edits that are reverted, or bad edits that aren't. My guess is that the two are about equal at the moment, and neither of them is particularly large.

If the algorithm can detect 20% with perfection then that must constitute extremely low-hanging fruit. That type of vandalism is just an annoyance. It is so obvious that the end user readily recognizes it as such and can skip over it or revert the edit.

You have to consider that the people doing the vast majority of vandalism reversions aren't the end users; they're registered Wikipedians who maintain articles as a hobby. Automatically reverting 20% of the vandalism means contributors have that much more time to spend verifying uncited claims in other articles.

I'm sure it's relatively easy to find 20% of the incidents of vandalism when it's a blatant "rip out half the page and write profanities" sort of thing, but even those results aren't that great. They can "turn it up" a bit and catch a higher percentage, but that seems like a bad idea: if Wikipedia is built on information from the community at large, I really doubt the people contributing that knowledge will be thrilled when their edits are deleted immediately.

But I have seen pages where simple factual errors have been corrected, along with citations, AND even a note about the edit on the Talk Page, and they are still reverted. It most often happens on articles too obscure to be policed well that are likely to attract people with agendas. (For example, I've seen both left-wing and right-wing religious crazies peddling their incorrect historical/factual assertions on obscure religion-related pages.)

I've also seen a territorial admin who kept deleting things even after an academic familiar with the field did a survey of dozens of the standard textbooks in an area and posted the results on the Talk Page, proving that the admin's view on the subject was absolutely wrong.

Doesn't work. It's multiplicative, not additive, meaning that the second time you only get 20% of the 80% you had left, i.e. 16%, for a 36% cumulative cleanup, leaving you with 64% of the original, and so on.

To clarify: the sequence asymptotically approaches 100%, but you'll never get there in a finite number of steps. (That's the purely mathematical side of things. In practice the number of vandalism incidents is discrete, so if the percentage rounds to the nearest integer you _would_ reach 100% in a finite number of steps -- though not if it always rounds down. Also, each cleanup pass takes time, so even if the number of steps were finite, more vandalism would accumulate in the meantime.)
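For anyone who wants to see the convergence, here's the arithmetic as a quick sketch, assuming (as the comments above do) a constant 20% catch rate on whatever vandalism remains after each pass:

```python
# Repeated passes of a detector that catches 20% of the *remaining*
# vandalism each time. The cleaned fraction after n passes is
# 1 - 0.8**n, which approaches 100% but never reaches it.
catch_rate = 0.20
remaining = 1.0  # fraction of the original vandalism still present
for n in range(1, 6):
    remaining *= 1 - catch_rate
    print(f"after pass {n}: {1 - remaining:.0%} cleaned")
```

The second pass lands on the 36% cumulative figure mentioned above, and each further pass closes only a fifth of the remaining gap.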

I was going to point out that it relies on the assumption that it cleans 20% of any given article. I would say it cleans 20% of an average article, and once an arbitrary article is cleaned, you are now given that it isn't average, ie that 20% figure no longer applies. Don't know what it is, but I suspect reapplying would immediately or quickly converge to 0%. Especially machine learning approaches, I guess they would (eventually) get to the point of covering everything they can in the initial pass.

I don't know where that 34% figure comes from for the manual double-checking. The test set contains about 60% vandalism and 40% regular edits, so I'll assume this represents the rate of vandalism on Wikipedia. Now, consider a set of 1000 edits: 600 would be vandalism while 400 would be regular. The second filter would catch 570 instances of real vandalism along with 120 false positives. Even if you used the first filter to automatically remove the 120 instances of vandalism it finds, you would still be left with 570 edits to double-check -- 57% of the total, not 34%.
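A minimal sketch of the parent's arithmetic, under its assumed 60/40 split and the additional assumption that the first filter's catches are a subset of what the second filter flags:

```python
# The parent comment's numbers, worked through for a hypothetical
# batch of 1000 edits at an ASSUMED 60/40 vandalism/regular split.
total = 1000
vandalism, regular = 600, 400

# Second (high-recall) setting: 95% of vandalism caught, 30% FP rate.
caught = round(0.95 * vandalism)      # 570 real vandalism flagged
false_pos = round(0.30 * regular)     # 120 regular edits flagged
flagged = caught + false_pos          # 690 edits flagged in total

# First (high-precision) setting auto-reverts 20% of vandalism with no
# false positives; assume those are among the edits already flagged.
auto_reverted = round(0.20 * vandalism)   # 120 auto-reverted
to_check = flagged - auto_reverted        # 570 left for manual review
print(to_check, to_check / total)         # 570 edits, i.e. 57%
```

Under these assumptions the manual workload really is 57%, not 34%; the gap comes from the assumed vandalism rate, since the 34% figure only works out if vandalism is a small fraction of all edits.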

This comes from personally maintaining some 200+ wikis on Wikidot.com.

There are two kinds of vandals: those in the community of contributors, and those outside it. The first class of vandals cannot easily be detected automatically, but when a wiki is actively being built, the community will easily and happily fix the damage they do. The second class are usually spammers and come along when the wiki is stale. They are easily detected by the fact that a long static page is suddenly edited by an unknown person. It's very rare to find a real edit happening late after a wiki has solidified. We handle the second type of vandalism trivially by getting email notifications on any edits.

Trick is, wikis (maybe not Wikipedia but then certainly individual pages) don't have random life cycles but go through growth and stasis.

The second class are usually spammers and come along when the wiki is stale. They are easily detected by the fact that a long static page is suddenly edited by an unknown person. It's very rare to find a real edit happening late after a wiki has solidified.

Ah... now I know why people revert my generally anonymous but high quality edits on neglected articles. Anyone who edits a dormant article must be a spammer or vandal? I don't think this is true.

Trick is, wikis (maybe not Wikipedia but then certainly individual pages) don't have random life cycles but go through growth and stasis.

While I guess you're correct in general, I've seen quite a few situations on Wikipedia where a new user coming in and taking a look at an established article actually leads to a period of revision, reconsideration, and perhaps growth on a given page.

Anyway, there is still a lot of potential for improvement since the top 2 detectors use entirely different detection paradigms

This implies that the lower-scoring detectors are less valuable in terms of looking for sources of improvement. That's not true, and that wasn't stated in the paper's "Conclusions" section. If the lowest scoring detector finds 5% of the bad data, and it's a different slice from what the other detectors find, then that's quite valuable.

Wikipedia already has programs which detect most of the blatant vandalism. Page blanking and big deletions are caught immediately. Deletions that delete references generate warnings. Incoming text that duplicates other content on the Web is caught. That gets rid of most of the blatant vandalism. It's not a serious problem on Wikipedia.

The current headaches are mostly advertising, fancruft, and pushing of some political point of view. That's hard to deal with using what is, after all, a rather dumb machine learning algorithm that has no model of the content or subject matter.

I made a fake article that has been up for three years and ten months. It has even been brushed up a little bit by a few people. The article is full of fake companies, fake people and fake ideas. Do I win? I'd tell you what it is, but I want to see how long it stays up and if I post it someone will see to taking it down.

As the owner of the first vandalism reverting bot in mainstream use - http://en.wikipedia.org/wiki/User:Tawkerbot2 [wikipedia.org] I guess I have a bit of perspective on the whole problem.
Originally the bot was designed to auto-revert one very specific type of vandalism: a user who would put a picture of SpongeBob SquarePants (or Squidward or some other cartoon character) into pages while blanking them -- that was pretty easy to catch.
Next we went after stuff like full page blanking, ALL CAP LETTER UPDATES, and additions of a tonne of bad words, based on common vandalism trends (i.e., if a page had no profanity on it and someone added a few bad words, it would be reverted). Again, not too many false positives.
That basically caught the "dumb kid" type of vandalism, and it was amazing how much lower a percentage it caught of total edits when students went back to school.
The only problem: at the time, it was a resource pig. The bot was originally running on a P2 300MHz with a grand total of 256MB of RAM, and the load got to be so high that we had to move it about 5 times.
It's interesting to note that at first, many, many people were opposed to the idea of automated vandalism reversion; it was almost a contest to revert stuff first, and the bot would win the vast majority of the time. However, as time went on, my inbox started getting rather full whenever I had a power outage, the cat knocked the cord out of the box hosting it, etc. Community reaction to bots doing the grunt work in vandalism fighting really changed.
Anyways, just my 2c on it, and just for the heck of it to prove I'm actually the Tawker on wiki, http://en.wikipedia.org/w/index.php?title=User%3ATawker&action=historysubmit&diff=387163504&oldid=268687392 [wikipedia.org]
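The early heuristics described above (page blanking, all-caps additions, sudden profanity on a previously clean page) could be sketched roughly like this; the thresholds and word list are illustrative guesses, not Tawkerbot2's actual rules:

```python
# Toy version of early rule-based vandalism heuristics: page blanking,
# long all-caps additions, and profanity added to a clean page.
# BAD_WORDS and all thresholds here are placeholders for illustration.
BAD_WORDS = {"badword1", "badword2"}

def looks_like_vandalism(old_text: str, new_text: str) -> bool:
    # Full (or near-full) page blanking.
    if old_text.strip() and len(new_text.strip()) < 0.05 * len(old_text):
        return True
    # Isolate what was added, when the edit is a simple append.
    added = new_text[len(old_text):] if new_text.startswith(old_text) else new_text
    letters = [c for c in added if c.isalpha()]
    # A long, mostly upper-case addition ("ALL CAP LETTER UPDATES").
    if len(letters) > 20 and sum(c.isupper() for c in letters) / len(letters) > 0.9:
        return True
    # Profanity added to a page that previously had none.
    old_words = set(old_text.lower().split())
    new_words = set(new_text.lower().split())
    if not (BAD_WORDS & old_words) and (BAD_WORDS & new_words):
        return True
    return False
```

The "no profanity before, profanity after" condition is what keeps the false-positive rate down: articles that legitimately contain strong language never trip that rule.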

It looks like the winning entry [uni-weimar.de] uses all of those attributes plus a bunch more. From pages 3-4 of the paper.

Anonymous -- Whether the editor is anonymous or not.

Vandals are likely to be anonymous. This feature is used in one way or another in most working anti-vandalism bots such as ClueBot and AVBOT. In the PAN-WVC-10 training set (Potthast, 2010), anonymous edits represent 29% of the regular edits and 87% of the vandalism edits.
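Those two percentages only become useful once combined with a base rate. A sketch using Bayes' rule, assuming a prior vandalism rate of 7% (roughly the PAN-WVC-10 corpus rate; the prior is my assumption, not stated in the quoted passage):

```python
# How much the anonymity feature shifts the odds, via Bayes' rule.
# ASSUMPTION: a prior vandalism rate of 7% among all edits.
p_vandal = 0.07
p_anon_given_vandal = 0.87    # anonymous share of vandalism edits
p_anon_given_regular = 0.29   # anonymous share of regular edits

# Total probability that a random edit is anonymous.
p_anon = (p_anon_given_vandal * p_vandal
          + p_anon_given_regular * (1 - p_vandal))
# Posterior probability of vandalism given an anonymous editor.
p_vandal_given_anon = p_anon_given_vandal * p_vandal / p_anon
print(f"P(vandalism | anonymous) = {p_vandal_given_anon:.0%}")
```

Under that prior, seeing an anonymous editor raises the vandalism probability from 7% to roughly 18% -- a useful signal, but nowhere near decisive on its own, which is presumably why the winning entry combines it with many other features.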

Sure, stupid spammers think replacing an article with a badly spelled advert for ViAGRa is the way to go, and morons think that they gain something from inserting "I'M GAY!!!!!" into an article about someone they dislike, but why just do damage for no other purpose than destroying other people's hard work?

I just don't get it.

These trolls/vandals need to get their asses kicked - hard. Or maybe just have something of theirs broken, just for the fun of it, and see if they like it.