Sunday, December 6, 2015

Today, I'm writing briefly about a problem that I expect to be studying and trying to fix over the course of the next few weeks.

The problem: The damage detection models that ORES supports seem to be overly skeptical of edits by anonymous editors and newcomers.

I've been looking at this problem for a while, but I was recently inspired by the framing of disparate impact. Thanks to Jacob Thebault-Spieker for suggesting I look at the problem this way.

In United States anti-discrimination law, the theory of disparate impact holds that practices in employment, housing, or other areas may be considered discriminatory and illegal if they have a disproportionate "adverse impact" on persons in a protected class. (via Wikipedia's Disparate Impact, CC-BY-SA 4.0)

So, let's talk about some terms and how I'd like to apply them to Wikipedia.

Disproportionate adverse impact. The damage detection models that ORES supports are intended to focus attention on potentially damaging edits. Still, human judgement is not perfect, and there's a lot of fun research suggesting that "recommendations" like this can affect people's judgement. So, by encouraging Wikipedia's patrollers to look at a particular edit, we are likely also making them more likely to find flaws in that edit than if it had not been highlighted by ORES. Having an edit rejected can demotivate the editor, but it may be even more concerning that the rejection of content from certain types of editors may lead to coverage biases, as the editors most likely to contribute to a particular topic may be discouraged or prevented from editing Wikipedia.

Protected class. In US law, it seems that this term is generally reserved for race, gender, and ability. In the case of Wikipedia, we don't know these demographics. They could be involved, and I think they likely are, but I think that anonymous editors and newcomers should also be considered a protected class in Wikipedia. Generally, anonymous editors and newcomers are excluded from discussions and are therefore subject to the will of experienced editors. I think that this has been having a substantial, negative impact on the quality and coverage of Wikipedia. To state it simply, I think that there is a collection of systemic problems around anonymous editors and newcomers that prevents them from contributing to the dominant store of human knowledge.

So, I think I have a moral obligation to consider the effect that these algorithms have in contributing to these issues and to work toward rectifying them. The first and easiest thing I can do is remove the features user.age and user.is_anon from the prediction models. So I did some testing. Here are fitness measures (see AUC) for all of the edit quality models, with and without the user features included.

wiki     model      current AUC   no-user AUC   diff
dewiki   reverted   0.900         0.792         -0.108
enwiki   reverted   0.835         0.795         -0.040
enwiki   damaging   0.901         0.818         -0.083
enwiki   goodfaith  0.896         0.841         -0.055
eswiki   reverted   0.880         0.849         -0.031
fawiki   reverted   0.913         0.835         -0.078
fawiki   damaging   0.951         0.920         -0.031
fawiki   goodfaith  0.961         0.897         -0.064
frwiki   reverted   0.929         0.846         -0.083
hewiki   reverted   0.874         0.800         -0.074
idwiki   reverted   0.935         0.903         -0.032
itwiki   reverted   0.905         0.850         -0.055
nlwiki   reverted   0.933         0.831         -0.102
ptwiki   reverted   0.894         0.812         -0.082
ptwiki   damaging   0.913         0.848         -0.065
ptwiki   goodfaith  0.923         0.863         -0.060
trwiki   reverted   0.885         0.809         -0.076
trwiki   damaging   0.892         0.798         -0.094
trwiki   goodfaith  0.899         0.795         -0.104
viwiki   reverted   0.905         0.841         -0.064

So to summarize what this table tells us: we'll lose between roughly 0.03 and 0.11 AUC per model, which takes us from beating the state of the art to not. That makes the quantitative glands in my brain squirt some anti-dopamine. It makes me want to run the other way. It's really cool to be able to say "we're beating the state of the art". But on the other hand, it's kind of lame to know that we're doing it at the expense of the users who are most sensitive and most necessary. So, I've convinced myself. We should deploy these models that look less fit by the numbers but reduce the disparate impact on anons and new editors. After all, the practical application of the model may very well be better despite what the numbers say.
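For readers who haven't bumped into AUC before: it's the probability that a randomly chosen damaging edit gets a higher score than a randomly chosen good edit, so 0.5 is coin-flipping and 1.0 is perfect separation. Here's a minimal pure-Python sketch of the measure (the toy scores below are invented for illustration; the real models are evaluated with standard ML tooling over much larger test sets):

```python
def auc(scores, labels):
    """Area under the ROC curve: the probability that a randomly chosen
    positive example is scored higher than a randomly chosen negative
    one (ties count as half)."""
    positives = [s for s, l in zip(scores, labels) if l]
    negatives = [s for s, l in zip(scores, labels) if not l]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# The same labels scored by two hypothetical models
labels = [True, True, False, False, False]
with_user = [0.9, 0.8, 0.7, 0.2, 0.1]     # invented scores
without_user = [0.9, 0.5, 0.7, 0.2, 0.1]  # invented scores

print(auc(with_user, labels))     # 1.0 -- perfect separation
print(auc(without_user, labels))  # lower: one negative outranks a positive
```

A drop of 0.03 to 0.11 in this number means noticeably more good edits getting flagged and more damage slipping through, which is why losing it stings.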

But before I do anything, I need to convince my users. They should have a say in this. At the very least, they should know what is happening. So, next week, I'll start a conversation laying out this argument and advocating for the switch.

One final note. This problem may be a blessing in disguise. By reducing the fitness of our models, we gain a new incentive to redouble our efforts toward finding alternative sources of signal to increase the fitness of our models.

Sunday, November 22, 2015

So I've been working on a blog post for blog.wikimedia.org about ORES. I talked about ORES a few weeks ago in ORES: Hacking social structures by building infrastructure, so check that out for reference. Because the WMF blog is relatively high profile, the Comms team at the WMF doesn't want to just lift my personal bloggings about it -- which makes sense. I usually spend 1-2 hours on this, so you get typos and unfinished thoughts.

In this post, I want to talk to you about something that I think is really important when communicating about what ORES is to a lay audience.

Visualizing ORES

The WMF Comms team is pushing me to make the topic of machine triage much more approachable to a broad audience. So, I have been experimenting with visual metaphors that would make the kinds of things that ORES enables easier to understand. I like to make simple diagrams like the one below for the presentations that I give.

The flow of edits from The Internet to Wikipedia is highlighted by ORES quality prediction models as "good", "needs review" and "damaging".

ORES vision

But it occurs to me that a metaphor might be more appropriate. With the right metaphor, I can communicate a lot of important things through implications. With that in mind, I really like using Xray specs as a metaphor for what ORES does. It hits a lot of important points about what using ORES means -- both what makes it powerful and useful and also why we should be cautious when using it.

A clipping from an old magazine showing fancy sci-fi specs.

ORES shows you things that you couldn't easily see before. Like a pair of Xray specs, ORES lets you peer into the firehose of edits coming into Wikipedia and see potentially damaging edits stand out in sharp contrast against the background of probably good edits. But just like a pair of sci-fi specs, ORES alters your perception. It implicitly makes subjective statements about what is important (separating the good from the bad), and it might bias you towards looking at the potentially bad with more scrutiny. While this may be the point, it can also be problematic. Profiling an editor's work by a small set of statistics is inherently imperfect, and the imperfections in the prediction can inevitably lead to biases. So I think it is important to realize that, when using ORES, your perception is altered in ways that aren't simply more truthful.

So, I hope that the use of this metaphor will help ORES users calibrate the level of caution they employ as we carry on this socio-technical conversation about how we should use subjective, profiling algorithms as part of the construction of Wikipedia.

Sunday, November 1, 2015

So I've been working on this project on and off. I've been trying to bring robust measures of edit quality/productivity to Wikipedians. In this blog post, I'm going to summarize where I am with the project.

Basically, I see the value of Wikipedia as a simple combination of two hidden variables: quality and importance. If we focused on making our unimportant content really high quality, that wouldn't be very valuable. Conversely, if we were to focus on increasing the quality of the most important content first, that would increase the value of Wikipedia most quickly.

Value = Quality × Importance

But I want to look at value-adding activities, so I need to measure progress towards quality. I think a nice term for that is productivity.

Value-added = Productivity × Importance

So in order to take measurements of value-adding activity in Wikipedia, I need to bring together good measures of productivity and importance.

Measuring importance

I'm going to side-step a big debate purely because I don't feel like re-hashing it in text. It's not clear what importance is. But we have some good ways to measure it. The two dominant strategies for determining the importance of a Wikipedia article's topic are (1) view rate counts and (2) link structure.

With view rate counts, the assumption is made that the most important content in Wikipedia is viewed most often. This works pretty well as far as assumptions go, but it has some notable weaknesses. For example, the article on Breaking Bad (TV show) has about an order of magnitude more views than the article on Chemistry. For an encyclopedia of knowledge, it doesn't feel right that we'd consider a TV show to be more important than a core academic discipline.

Link structure provides another opportunity. Google's founders famously used the link structure of the internet to build a ranking strategy for the most important websites. See PageRank. This also seems to work pretty well, but it's less clear what the relationship is between the link graph properties and the nebulous notion of importance. At least with page view rates, you can plainly imagine the impact that a highly viewed article has.

Fun story though: Chemistry has 10 times as many incoming links as Breaking Bad. It could be that this measurement strategy can help us deal with the icky feeling us academics get when thinking that a TV show is more important than centuries of difficult work building knowledge.
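For readers who haven't seen PageRank up close, the core idea fits in a few lines: a page's rank is the share of rank flowing in from the pages that link to it, iterated until it stabilizes. This is a toy power-iteration sketch on an invented five-article link graph (the node names and links are made up for illustration, not real link counts):

```python
# A toy link graph: several articles link to "Chemistry", none to
# "Breaking Bad". Node names and edges are invented for illustration.
links = {
    "Chemistry": ["Science"],
    "Breaking Bad": ["Chemistry"],
    "Science": ["Chemistry"],
    "Biology": ["Chemistry", "Science"],
    "Physics": ["Chemistry", "Science"],
}

def pagerank(links, damping=0.85, iterations=50):
    """Basic PageRank by power iteration (dangling-node mass is
    ignored for simplicity)."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for t in targets:
                    new[t] += share
        rank = new
    return rank

rank = pagerank(links)
# "Chemistry" ends up ranked well above "Breaking Bad"
```

The same intuition carries over to the fun story above: a topic that centuries of knowledge-building link into accumulates rank that raw view counts don't reflect.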

Measuring productivity

Luckily, there is a vast literature for measuring the quality of contributions in Wikipedia. Much of which I have published! There are a lot of strategies, but the most robust (and difficult to compute) is tracking the persistence of content between revisions. The assumption goes: the more subsequent edits a contribution survives, the higher quality it probably was. We can quite easily weight "words added" by "persistence quality" to get a nice productivity measure. It's not perfect, but it works. The trick is figuring out the right way to scale and weight the measures so that they are intuitively meaningful.
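The weighting scheme can be sketched in a few lines: credit each added word by the fraction of subsequent revisions it survives. This is a toy illustration of the idea only (invented example data), not the real content-persistence algorithm, which tracks token movement between revisions far more carefully:

```python
def productivity(added_words, subsequent_revisions):
    """Toy persistence measure: each added word is weighted by the
    fraction of subsequent revisions in which it still appears."""
    if not subsequent_revisions:
        return 0.0
    score = 0.0
    for word in added_words:
        survived = sum(1 for rev in subsequent_revisions if word in rev)
        score += survived / len(subsequent_revisions)
    return score

added = ["cats", "are", "mammals"]
later = [
    {"cats", "are", "mammals", "and", "carnivores"},  # all three survive
    {"cats", "are", "carnivores"},                    # "mammals" removed
]
print(productivity(added, later))  # 2.5: "mammals" survives only 1 of 2
```

A contribution that is reverted immediately scores near zero; one that survives untouched scores its full word count.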

The real trick here was making the computation tractable. It turns out that tracking changes between revisions is extremely computationally intensive. It would take me 60 days or so to track content persistence across the entire ~600m revisions of Wikipedia on a single core of the fastest processor on the market. So the trick is to figure out how to distribute the processing across multiple processors. We've been using Hadoop streaming. See my past post about it: Fitting Hadoop streaming into my workflow. It's been surprisingly difficult to work with memory issues in Hadoop streaming that don't happen when just using unix pipes on the command line. I might make a post about that later, but honestly, it just makes me feel tired to think about those types of problems.

Bringing it together

I'm almost there. I've still got to work out some thresholding bits for productivity measures, but I've already finished the hard computational work. My next update (or paper) will be about the who, where, and when of value-adding in Wikipedia. Until then, stay tuned.

Monday, October 26, 2015

So, the title of this post is a little bit extreme, but I chose it specifically because I think it will help you (gentle reader) in thinking in the direction that I'd like you to think.

Back when I was a young & bright-eyed computer scientist who was slowly realizing that social dynamics were a fascinating and complex environment within which to practice technology development, I was invited to intern at the Wikimedia Foundation with a group of 7 other researchers. It turns out that we all had very different backgrounds going in. Fabian, Yusuke, and I had a background in computer science. Jonathan had expertise in technology design methods (but not really programming). Melanie's expertise was in rhetoric and language. Stuart was trained in sociology and philosophy of science (but he'd done a bit of casual programming to build some bots for Wikipedia). I think this diverse set of backgrounds enabled us to think very well about the problem that we had to face, but that's a subject for another blog entry. Today I want to talk to you about the technology that we ended up rallying around and taking massive advantage of: the Structured Query Language (SQL) and a Relational Database Management System (RDBMS).

Up until my time at the Wikimedia Foundation, I had to do my research of Wikipedia the hard way. First, I downloaded Wikipedia's 10 terabyte XML dump (compressed to ~100 gigabytes). Then I wrote a Python script that used a streaming p7zip decompressor and a unix pipe to read the XML with a streaming processor. This workflow was complex. It tapped many of the skills I had learned in my training as a computer scientist. And yet, it was still incredibly slow to perform basic analyses.

It didn't take me long to start using this XML processing strategy to produce intermediate datasets that could be loaded into a postgres RDBMS for high-speed querying. This was invaluable to making quick progress. Still, I learned some lessons about including *all the metadata I reasonably could* in this dataset, since going back to the XML was another headache and week-long processing job. As a side-note, I've since learned that many other computer scientists working with this dataset went through a similar process and have since polished and published their code that implements these workflows. TL;DR: This was a really difficult analysis workflow even for people with a solid background in technology. I don't think it's unreasonable to say that a social scientist or rhetoric scholar would have found it intractable to do alone.

When I was helping organize the work we'd be doing at the Wikimedia Foundation, I'd heard that there was a proposal in the works to get us researchers a replica of Wikipedia's databases to query directly. Honestly, I was amazed that people weren't doing this already. I put my full support behind it, and thanks to others who saw the value, it was made reality. Suddenly I didn't need to worry about processing new XML dumps to update my analyses; I could just run a query against a database that was already indexed and up to date at all times. This was a breakthrough for me, and I found myself doing explorations of the dataset that I had never considered before because the speed of querying and the relevancy of the data made them possible. Stuart and I had a great time writing SQL queries for both our own curiosity and to explore what we ought to be exploring.

For my coworkers who had no substantial background in programming, SQL was yet another language that the techies were playing around with. So, they took advantage of us to help them answer their questions by asking us to produce excel-sized datasets that they could explore. But as these things go when people get busy, these requests would often remain unanswered for days at a time. I've got to hand it to Jonathan. Rather than twiddling his thumbs while he waited for a query request to be resolved, he decided to pick up SQL as a tool. I think he set an example. It's one thing to have a techy say to you, "Try SQL! It's easy and powerful." and a totally different thing for someone without such a background to agree. By the end of the summer internship, I don't think we had anyone (our managers included) who wasn't writing a bit of SQL here and there. All would agree that they were substantially empowered by it.

Since then, Jonathan has made it part of his agenda to bring SQL and basic data analysis techniques to larger audiences. He's been a primary advocate (and volunteer product manager) of Quarry, our new open querying tool for Wikimedia data. That service has taken off like wildfire -- threatening to take down our MariaDB servers. Check it out: https://quarry.wmflabs.org/ Specifically, I'd like to point you to the list of recent queries: https://quarry.wmflabs.org/query/runs/all Here, you can learn SQL techniques by watching others use them!
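To give a flavor of the kind of question that becomes a one-liner once you have a replica: "how many edits has each person made?" is just a GROUP BY over the revision table. The sketch below runs the query against a toy in-memory SQLite table with a drastically simplified schema (the real replicas are MariaDB and the real revision table has many more columns):

```python
import sqlite3

# A toy, heavily simplified stand-in for MediaWiki's revision table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE revision (rev_id INT, rev_user_text TEXT)")
db.executemany(
    "INSERT INTO revision VALUES (?, ?)",
    [(1, "Jonathan"), (2, "Stuart"), (3, "Jonathan"), (4, "Jonathan")],
)

# The kind of query we found ourselves writing all summer:
# how many edits has each user made?
rows = db.execute(
    "SELECT rev_user_text, COUNT(*) AS edits "
    "FROM revision GROUP BY rev_user_text ORDER BY edits DESC"
).fetchall()
print(rows)  # [('Jonathan', 3), ('Stuart', 1)]
```

The SQL itself is the same whether you run it from Python, the mysql client, or Quarry's web form, which is exactly what made it learnable for the non-programmers on the team.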

Sunday, October 4, 2015

So, I just crossed a major milestone on a system I'm building with my shoe-string team of mostly volunteers and I wanted to tell you about it. We call it ORES.

The ORES logo

The Objective Revision Evaluation Service is one part response to a feminist critique of power structures and one part really cool machine learning and distributed systems project. It's a machine learning service that is designed to take a very complex design space (advanced quality control tools for Wikipedia) and allow for a more diverse set of standpoints to be expressed. I hypothesize that systems like these will make Wikipedia more fair and welcoming while also making it more efficient and productive.

Wikipedia's power structures

So... I'm not going to be able to go into depth here, but there are some bits I think I can say plainly. If you want a bit more, see my recent talk about it. TL;DR: The technological infrastructure of Wikipedia was built through the lens of a limited standpoint, and it was not adapted to reflect a more complete account of the world once additional standpoints entered the popular discussion. Basically, Wikipedia's quality control tools were designed for what Wikipedia editors needed in 2007, and they haven't changed in a meaningful way since.

Hacking the tools

I had some ideas on what kind of changes to the available tools would be important. In 2013, I started work in earnest on Snuggle, a successor system. Snuggle implements a socialization support system that helps experienced Wikipedia editors find promising newcomers who need some mentorship. Regretfully, the project wasn't terribly successful. The system works great and I have a few users, but not as many as the system would need to do its job at scale. In reflecting on this, I can see many reasons why, but I think the most critical one was that I couldn't sufficiently innovate a design that fit into the social dynamics of Wikipedia. It was too big of a job. It requires the application of many different perspectives and a conversation of iterations. I was a PhD student -- one of the good ones, because Snuggle gets regular maintenance -- but this work required a community.

When I was considering where I went wrong and what I should do next, I was inspired by the sudden reach that Snuggle gained when the HostBot developer wanted to use my "promising newcomer" prediction model to invite fewer vandals to a new Q&A space. My system just went from 2-3 users interacting with ~10 newcomers per week to 1 bot interacting with ~2000 newcomers per week. Maybe I got the infrastructure bit right. Wikipedia editors do need the means to find promising newcomers to support after all!

Hacking the infrastructure

So, lately I've been thinking about infrastructure rather than direct applications of experimental technology. Snuggle and HostBot taught me to ask the question, "What would happen if Wikipedia editors could find good new editors that needed help?" without imagining any one application. The question requires a much more system-theoretic way of reasoning about Wikipedia, technology, and social structures. Snuggle seemed to be interesting as an infrastructural support for Wikipedia. What other infrastructural support would be important, and what changes might that enable across the system itself?

OK. Back to quality control tools -- the ones that haven't changed in the past 7 years despite the well known problems. Why didn't they change? Wikipedia's always had a large crowd of volunteer tool developers who are looking for ways to make Wikipedia work better. I haven't measured it directly, but I'd expect that this tech community is as big and functional as it ever was. There were loads of non-technological responses to the harsh environment for newcomers (including the Teahouse and various WMF initiatives). AFAICT, the tool I built in 2013 was the *only* substantial technological response.

Why is there not a conversation of innovation happening around quality control tools? If you want to build a quality control tool for Wikipedia that works efficiently, you need a machine learning model that calls your attention to edits that are likely to be vandalism. Such an algorithm can reduce the workload of reviewing new edits in Wikipedia by 93%, but standing one up is excessively difficult. To do it well, you'll need an advanced understanding of computer science and some substantial engineering experience in order to get the thing to work in real time.

The "activation energy" threshold to building a new quality control tool is primarily due to the difficulty of building a machine learning model.
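To make the workload-reduction arithmetic concrete: once edits arrive with a damage probability attached, a patrolling tool only surfaces the ones that cross a threshold. A toy sketch (the edit IDs, scores, and threshold are all invented for illustration; real tools tune the threshold against precision/recall targets):

```python
def review_set(scored_edits, threshold=0.5):
    """Return only the edits a patroller needs to look at: those whose
    predicted damage probability meets the threshold."""
    return [edit for edit, p in scored_edits if p >= threshold]

# Invented damage probabilities for 8 incoming edits
scored = [("e1", 0.02), ("e2", 0.01), ("e3", 0.97), ("e4", 0.10),
          ("e5", 0.05), ("e6", 0.88), ("e7", 0.03), ("e8", 0.07)]

flagged = review_set(scored)
print(flagged)                         # ['e3', 'e6']
print(1 - len(flagged) / len(scored))  # 0.75 -- share of review work avoided
```

Getting the probabilities, not writing this filter, is the hard 93%-of-the-mountain part.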

So, what would happen if Wikipedia editors could quickly find the good, the bad, and the newcomers in need of support? I'm a computer scientist. I can build up infrastructure for that and cut the peak off of that mountain -- or maybe cut it down entirely. That's what ORES is.

What ORES is

ORES is a web service that provides access to a scalable computing cluster full of state-of-the-art machine learning algorithms for detecting damage, differentiating good-faith edits from bad, and measuring article quality. All that is necessary to use this service is to request a URL containing the revision you want scored and the models you would like to apply to it. For example, if you wanted to know if my first registered edit on Wikipedia was damaging, you could request the following URL.
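The URL follows a simple pattern: a wiki, a revision ID, and a model name. A minimal sketch of building one (the exact path scheme here follows the later v3 API and the revision ID is a placeholder, so treat both as illustrative rather than as the canonical endpoint):

```python
def ores_score_url(wiki, rev_id, model):
    """Build a URL for requesting a model's score for one revision.
    The path shape follows the ORES v3 API; the scheme has evolved
    over time, so treat this as illustrative."""
    return "https://ores.wikimedia.org/v3/scores/{0}/{1}/{2}".format(
        wiki, rev_id, model)

url = ores_score_url("enwiki", 1234567, "damaging")  # placeholder rev_id
print(url)
# https://ores.wikimedia.org/v3/scores/enwiki/1234567/damaging
```

The response is a small JSON document with the model's prediction and probabilities, which is what makes it so easy for tool developers to consume.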

We use a distributed architecture to make scaling up the system to meet demand easy. The system is built in python. It uses celery to distribute processing load and redis for a cache. It is based on the revscoring library I wrote to generalize machine learning models of edits in Wikipedia. This same library will allow you to download one of our model files and use it on your own machine -- or just use our API.

Our latest models are substantially more fit than the state of the art (0.84 AUC vs. our 0.90-0.95 AUC), and our system has been substantially battle-tested. Last month, Huggle, one of the dominant yet unchanged quality control tools, started making use of our service. We've seen tool devs and other projects leap at the ability to use our service.

Sunday, September 13, 2015

Today, I want to talk about a specific type of research output that, I feel, adds a substantial amount of value beyond a single research project. I'm talking about open sourced research code, but not of the type that you normally see -- the type that's actually intended to be used by other people to serve new purposes.

Before I dig into software libraries I want to talk to you about, I must make a distinction:

CRAPL quality code -- This is the code that a researcher builds ad-hoc in order to get something done. There's little thought spent on generalizability or portability. With this code, it's usually better and faster to fix a problem by adding a step in the workflow rather than fixing the original problem. So, you end up with quite a mess usually. While good for documenting the process of data science work, this is not useful to others.

Library quality code -- This code has been designed to generalize to new problems. It's intended to be used as a utility by others who are doing similar work -- but not the exact same work. It's usually well documented and well tested. With this code, it is sacrilegious to add more code on top of broken code to fix a problem. This is just one of the many disciplines that must be applied to the practice of writing software to have good, library quality code.

I've been analyzing Wikipedia data for nearly a decade (!!!) -- and I can tell you that it was never easy. The English Wikipedia XML dumps that I have done most of my work with are on the order of 10 terabytes uncompressed. The database, web API, and XML dumps all use different field names to refer to the same thing. In each one, the absence of a field -- or NULLing of the field -- can mean different things. Worse, the MediaWiki software has been changing over time, so in order to do historical analyses, you need to take that into account. In the process of working out these details and getting my work done, I've produced reams of CRAPL quality code. See https://github.com/halfak/Activity-sessions-research for an example. In this case, I have a Makefile that, if executed, would replicate my research project. But if you look inside that Makefile, you'll see things like this:

That's a commented out Makefile rule that calls my local database with my local configuration hardcoded and runs some SQL against it. This is great if you want to know what SQL produced which datafile, but not very useful if you want to replicate the work. And why is it commented out!? Well, the database query takes a long time to run and I didn't want to accidentally overwrite the data file as I was finishing off the research paper. Gross, right? This isn't all that useful if you wanted to perform a similar analysis.

But in producing this CRAPL code, there are some nice, generalizable parts that occur to me so I write them up for others' benefit. I've gone through a few iterations of this and learned from my mistakes.

Back in 2011, I released the first version of wikimedia-utilities, a set of utilities that made the work I was doing at the Wikimedia Foundation easier. The killer feature of this library was the XML processing strategy. It changed the work of processing Wikipedia's terabyte scale XML dumps from a ~2000 line script to a ~100 line script. But the code wasn't very pythonic, it lacked proper tests, and it did not integrate well into the python packaging environment.

In 2013, I decided to make a clean break and start working on mediawiki-utilities, a super-set of utilities from wikimedia-utilities that were intentionally generalized to run on any MediaWiki instance. I had learned some lessons about being pythonic, implementing proper tests and integrating with python's packaging environment.

But as I had been working on new projects and realizing how they could generalize, I ended up expanding mediawiki-utilities to a monolith of loosely related parts. And it gets worse. Since I focused on those parts as I needed them, there were certain modules that were ignored. Since I did most of my work with the databases directly, it was rare that I spent time on the 'database' module of mediawiki-utilities. I ended up with a monolith that was inconsistently developed!

So, in thinking about monoliths and how to solve problems that they impose, I was inspired by the Unix philosophy of combining "small, sharp tools" to solve larger problems. I realized that the primary modules of mediawiki-utilities could be split off into their own projects and combined in interesting ways -- and that this would enable a more distributed strategy to management. So I've been hard at work to bring this vision into the light.

This library contains a collection of utilities for efficiently processing MediaWiki’s XML database dumps. There are two important concerns that this module intends to address: complexity and performance of streaming XML parsing.

This library provides a set of basic utilities for interacting with MediaWiki’s “action” API – usually available at /w/api.php. The most salient feature of this library is the mwapi.Session class that provides a connection session that sustains a logged-in user status and provides convenience functions for calling the MediaWiki API. See get() and post().

This library provides a set of utilities for grouping MediaWiki user actions into sessions. mwsessions.Sessionizer and mwsessions.sessionize() can be used by python scripts to group activities into sessions, or the command line utilities can be used to operate directly on data files. Such methods have been used to measure editor labor hours.
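The core idea behind sessionization is small enough to sketch without the library: sort a user's action timestamps and split them wherever the gap between consecutive actions exceeds a cutoff (an hour is a common choice in the literature). This toy version is independent of the mwsessions API, which handles users, event payloads, and streaming data properly:

```python
def sessionize(timestamps, cutoff=3600):
    """Group Unix timestamps into sessions, starting a new session
    whenever the gap to the previous action exceeds `cutoff` seconds."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= cutoff:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions

# Three edits in quick succession, then one the next day
events = [0, 600, 1500, 90000]
print(sessionize(events))  # [[0, 600, 1500], [90000]]
```

Summing the within-session spans (plus a small per-session constant) is one way such groupings get turned into labor-hour estimates.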

It's my goal that researchers who haven't been working with wiki datasets will have a much easier time building off of my work to do their own. I think that a good set of libraries can make a huge difference in this regard.

I'll be making a more substantial announcement soon. In the meantime, I'm cleaning up and extending documentation and putting together some examples that demonstrate how a researcher can compose these small, sharp libraries together to perform powerful analyses of Wikipedia and users in other MediaWiki wikis. Until then, please use these utilities, let me know about bugs, and send me your pull requests!

Thursday, August 27, 2015

So, I didn't have much time this week and I'm doing this Iron Blogging thing. If you got here looking for a cool discussion of the life of a traveling technologist, I regret to inform you that this will only be a meta-discussion. Once I've completed the proper discussion, I'll put the link right below this paragraph.

What?

So my life is pretty weird in a lot of ways. I travel a lot. I generally don't see my teammates at the Wikimedia Foundation for months at a time. Worse, I have a group of friends who are extremely geographically distributed that my geographically local friends don't know about. We only see each other during conferences and other academic events.

Another sort of interesting aspect of my professional life is that I straddle the line between industry and academia. When it comes to the meat of knowledge & knowledge production, there's no conflict. But the timescales are amazingly different.

But through dealing with this, I've worked out some hacks. Some have to do with communication channels and making it feel like you are present even when you are not in the office. Others are my folding bike and the amazing experience I get visiting European cities.

So, I conclude with a promise of future bloggings with photos and insights. I just don't have the time right now!

TL;DR: We ran an experiment where we gave a WYSIWYG editor to newly registered editors in Wikipedia and monitored the effect it had on their productivity. We found that it didn't affect productivity either way. I think that this is because the barriers to entry in Wikipedia primarily consist of social/motivational issues, so reducing the technical literacy barriers that VE targets did not have a meaningful effect.
The talk is embedded below. My talk is first. There's a second talk by some students looking at building a knowledge graph with Wikipedia and some google tech too.