Privacy, openness, trust and transparency on Wikipedia

How the free encyclopedia project deals with sockpuppets

Wikipedia's enormous growth during this decade, which has made it a "poster child of Web 2.0", has been enabled by its "anyone can edit" philosophy – external credentials are not required, and one does not even need to set up a user account to change the content of one of the planet's most visited websites. This radical openness creates unsurprising vulnerabilities (to vandalism, libel, copyright violations, introduction of bias, organized PR activities, etc.), but it is balanced by an equally radical transparency, where even the most minuscule actions of editors are recorded indefinitely.

This talk will describe some of the structures, methods, and tools that the Wikipedia community has developed over the years to defend the project from these vulnerabilities, and to establish its internal reputation system.

The main focus will be on the investigation of "sockpuppets" (multiple accounts operated by the same person), or rather their abuse. For contributions made without logging into an account, the originating IP address is recorded publicly, so topics like open proxies, Tor, and geolocation have become important for Wikipedians, and many of them have come to recognize the IP ranges of certain ISPs on sight.
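To make the "recognizing IP ranges" idea concrete, here is a minimal sketch of how such a check could be automated with Python's standard `ipaddress` module. The range-to-label mapping is entirely hypothetical (it uses the reserved documentation networks from RFC 5737), not any actual Wikipedian watchlist:

```python
import ipaddress

# Hypothetical watchlist mapping CIDR ranges to labels. Experienced
# Wikipedians memorize such ranges for particular ISPs, schools, or
# open-proxy providers; these entries use RFC 5737 documentation blocks.
KNOWN_RANGES = {
    "198.51.100.0/24": "ExampleNet DSL pool",
    "203.0.113.0/24": "suspected open-proxy block",
}

def classify_ip(ip_string):
    """Return the label of the first known range containing the IP, else None."""
    ip = ipaddress.ip_address(ip_string)
    for cidr, label in KNOWN_RANGES.items():
        if ip in ipaddress.ip_network(cidr):
            return label
    return None
```

For example, `classify_ip("198.51.100.42")` would match the first range, while an address outside all listed networks returns `None`.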
However, the IP addresses used by logged-in editors are hidden due to privacy concerns, and can only be requested (together with additional data from the HTTP headers – user agents and XFF) by a few trusted users via the "CheckUser" function of the MediaWiki software. The edit history of an account, on the other hand, contains a wealth of public information which Wikipedians analyze in many ways. I will describe several of them and relate some of these home-grown methods to results from forensic linguistics and stylometry (research fields with a long history). I will also give a brief summary of statistical concepts – and known fallacies – related to sockpuppet investigations.
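As a toy illustration of the stylometric idea (not the community's actual tooling, and far simpler than the methods from the research literature), one standard approach compares character n-gram frequency profiles of two text samples – say, two editors' edit summaries – via cosine similarity:

```python
from collections import Counter
from math import sqrt

def char_ngrams(text, n=3):
    """Counter of overlapping character n-grams, a common stylometric feature."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine_similarity(a, b):
    """Cosine similarity between two n-gram profiles, in [0.0, 1.0]."""
    dot = sum(a[g] * b[g] for g in set(a) & set(b))
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

Two samples by the same writer would be expected to score closer to 1.0 than samples by different writers – though, as the talk's discussion of statistical fallacies suggests, a high score alone is weak evidence given how many editors share common phrasing.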

At the same time, these tools and techniques can reveal a great deal of sensitive information (I will give concrete examples), highlighting the privacy issues that Wikipedia's transparency creates for its contributors.