Misogyny, machines, and the media, or: how science should not be reported

Yesterday (26 May 2016), the thinktank Demos released a blog post entitled The Scale of Online Misogyny in which the author Jack Dale discussed “new research by Demos’ Centre for the Analysis of Social Media”. This research, in a nutshell, intimates that around half of misogynistic abuse on Twitter is sent by women. I’m going to go through that post and put my thoughts here as I progress, with the first side-point that a proportion of what I say for the blog post can also be said for Demos’ 2014 report by Jamie Bartlett et al.

In this research, Dale says that they analysed their Twitter data using “a Natural Language Processing Algorithm”. This is remarkably under-explained. For context, it’s a bit like saying that they used “a surgical procedure”. Well yes, but could we have more details? (Edited to add: I later discovered a press release that contains the following: “Demos conducts digital research through its Centre for the Analysis of Social Media (CASM), using its own in-house technology – Method 52, which is a Natural Language Processing tool.” Still not much to go on there though.) Whatever the case, this Natural Language Processing (NLP) algorithm is then described as a process “whereby a computer can be taught to recognise meaning in language like we do.”

Fundamentally, no.

Computers can work their way through the simpler aspects of language and make some reasonable approximations, but there is a reason that even software built purely to convince people that they are talking to another human still struggles to succeed. This is because computers are largely closed off from the complexities and nuances of human language, and especially those that reside outside of language itself. Face an algorithm with messy features like sarcasm, threats, allusions, in-jokes, novel metaphors, clever wordplay, typographical errors, slang, mock impoliteness, and so on, and it will invariably make mistakes. Even supposedly cut-and-dried tasks such as tagging a word for its meaning can fox a computer. If I tell you that “this is light” whilst pointing to the sun you’re going to understand something very different than if I say “this is light” whilst picking up an empty bag. Programming that kind of distinction into software is nightmarish.

That aside, having described NLP thus, Dale goes on to say: “This means that the different uses of ‘slut’ and ‘whore’ can be classified, allowing us to further illustrate the nature and extent of misogyny online.” Well, as the above hopefully shows, it’s not as easy as that. It can classify them. Given time computers will classify anything. But whether it will do so correctly is a different issue, and we can guarantee that the classifier will be unable to take into account important factors like the relationships between the people using those words, their intentions, sarcasm, mock rudeness, in-jokes, and so on. A computer doesn’t know that being tweeted with “I’m going to kill you!” is one thing when it comes from an anonymous stranger, and quite another when it comes from the sibling who has just realised that you ate their last Rolo. Grasping these distinctions requires humans and their clever, fickle, complicated brains.
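To make that concrete, here is a deliberately naive sketch of a text-only classifier. It is emphatically not Demos’ method, which the blog post does not describe; the function name `naive_aggression_label` and the phrase list are invented purely for illustration. The point is simply that a classifier which only sees the words has no way of telling the stranger from the sibling.

```python
# Hypothetical phrase list, invented for illustration only.
THREAT_PHRASES = ("kill you", "shut up")


def naive_aggression_label(tweet_text: str) -> str:
    """Label a tweet 'aggressive' if it contains any listed phrase.

    The function only ever sees the text, so it cannot know who sent the
    tweet, to whom, or in what spirit.
    """
    lowered = tweet_text.lower()
    if any(phrase in lowered for phrase in THREAT_PHRASES):
        return "aggressive"
    return "not aggressive"


# Same words, two very different situations; the sender is not an input.
from_stranger = naive_aggression_label("I'm going to kill you!")
from_sibling = naive_aggression_label("I'm going to kill you! That was my last Rolo")

print(from_stranger, "|", from_sibling)
```

Both tweets receive the same label. A more sophisticated model might weigh more features, but unless the relationship between the two users is somehow part of its input, it is still guessing.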

When it comes to the data, Dale states that: “we collected just under 1.5m tweets from around the world over a period of 23 days.” What period? Last year? This year? Some other time? (Edited to add: it turns out that it spans “23 April – 15 May 2016”. That was buried in this press-release which I only found today – 27 May.) Data from “around the world” entails a morass of cultural sensitivity issues. And, given everything I have to say about gender shortly, it would have been very useful here to know whether the data was streamed directly from Twitter’s API using software such as R or FireAnt, or whether it came via a service such as Datasift, Gnip, or similar.

Moving on a bit, Dale tells us that: “Two types of language was classified: ‘aggressive’ and ‘self-identification’.” However, we are given minimal information about how this classification was done. “Aggressive language” seems to have been based on the presence of second-person pronouns (you) and commands (shut up), though we only get a single example of each – perhaps third-person (he, she) and possessive pronouns (your/s, his, her/s) were also considered? We just don’t know.

Similarly, “self-identification language” seems to be based on first-person pronouns (I) and “a jovial manner”. How on earth an algorithm detects a jovial manner I do not know. (Indeed, plenty of humans can’t detect a jovial manner and they’re supposed to be better at this.) In fairness, Dale notes at the bottom of the post the error rates, as follows: “The classifier identifying aggressive tweets was 82% accurate, and the classifier identifying self-identifying tweets was 85% accurate.” Despite one in five or six-ish being incorrectly classified, though, we have no information on how the algorithm was tested for its precision/recall. What was the sample size? Did they refine the process in light of their results? We simply don’t have the information, so, as with most of this study, replicating any of this is going to be impossible – based on this blog post at least. (I have been wondering for the past twelve hours whether there is perhaps a proper 2016 report buried somewhere. If someone has spotted it, do let me know.)
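Why does the missing precision/recall matter when we already have an accuracy figure? Because one accuracy figure is compatible with wildly different behaviour. The sketch below uses two entirely hypothetical sets of test results (the counts are invented, not Demos’ data) that both come out at 82% accuracy. The helper name `summarise` is likewise just for illustration.

```python
def summarise(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision and recall from confusion-matrix counts.

    tp/fp = tweets labelled aggressive, correctly/incorrectly
    fn/tn = tweets labelled not aggressive, incorrectly/correctly
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Two hypothetical test sets of 1,000 tweets each (invented counts).
balanced = summarise(tp=410, fp=90, fn=90, tn=410)
lopsided = summarise(tp=20, fp=0, fn=180, tn=800)

for name, scores in (("balanced", balanced), ("lopsided", lopsided)):
    print(name, {metric: round(value, 2) for metric, value in scores.items()})
```

In the second scenario the classifier is still “82% accurate”, yet it misses nine out of every ten genuinely aggressive tweets. Without the test set size and the breakdown of errors, we cannot tell which kind of 82% we are looking at.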

Dale goes on to say:

“Interestingly, this study reflects the findings of our 2014 report, in which women were as comfortable using misogynistic language as men; the 2016 findings show that 50% of the total aggressive tweets were sent by women, while 40% were sent by men, and 10% were sent by organisations or users whose genders could not be classified.”

And this is where it all comes apart at the seams for me. My very first question is: how has the gender categorisation been carried out? Bear with me, because this one will require some details to explain. Neither the 2016 blog post nor the 2014 report provides any insight into how they determined the gender of the users. This is fairly crucial since this nugget about women supposedly being responsible for half the abuse is precisely the finding that most of the media went wild over. (NB. I know that over the coming days there will be some excellent critiques of the whole underpinning argument of women allegedly being as comfortable using misogynistic language as men, so I will stick here to the methodological issues. If I have time, I will also undertake my own analysis of “slut”/“whore” UK tweets gathered from a 23 day span in something like March or April 2016, and I’ll see what I come up with in terms of results. Watch this space.)

The trouble is, gender on Twitter is not a simple matter of looking on profiles for a flag, or even for pronoun use. For instance, at registration, Facebook asks users to specify their gender, and then uses this information to propagate sentences like, “It’s her birthday today!” (And to sell you things, but that’s another story.) When creating a Twitter account, users don’t specify a gender, and when collecting tweets from Twitter’s API, e.g. via FireAnt or R, gender is not included. That doesn’t mean that Twitter doesn’t keep its own record of its guesses at gender – it has done so since at least 2012, but that information is used for advertising purposes. What we do know is that you don’t get this through the API stream, and that’s where Dale supposedly got his data.

The above said, some data-providers give the gender of Twitter users as part of their service, and Dale might have used a service like this rather than streaming directly from the API himself. Since I’ve had Twitter data from a service known as Datasift before, I decided to test some that was provided in 2015. That sample covered 14,181 tweets sent from 8,217 unique accounts. It spanned the whole of August 2014 and, as it happened, it concerned the Gaza/Israel conflict. The output from Datasift was in BSON format (crudely, binary JSON) which is readable in software such as FireAnt, and crucially, it comes with extra fields that are not available from the Twitter API alone, including Klout score, language detection, and gender. The latter is obviously of most interest here, and the values available are the following six. They’re fairly self-explanatory, and for ease, I’ve ranked them in order of their frequency in this BSON dataset:

Not found

Male

Female

Unisex

Mostly_male

Mostly_female

So where does this information about a user’s gender come from? Even in my BSON dataset, it’s unclear. It may be that Datasift produces these extra layers of analysis itself (edited to add: it turns out they do), or it may be that Twitter grants large license-paying organisations such as Datasift and its own recently-acquired service, Gnip, access to extra information that ordinary API users don’t get. It’s clear that other services provide this sort of data but some only do so in an extremely summary format. The following is from another little dataset of English tweets purchased in 2016 from a different provider:

Total users: 89,928
Total gender detections: 43,430
% Of detections: 48.29%
Total male: 31,623
Total female: 11,807
% Male: 73%
% Female: 27%

In this data, gender isn’t even ascribed to individual users so it’s impossible to check its accuracy, and it is kept simply to the classic male/female dichotomy – more on this later. One key aspect in both this brief summary above and in my BSON data that stands out quite vividly, however, is how often gender isn’t ascribed.

In the above summary, over half (51.71%) of the users do not have a gender ascribed to them, and in my August 2014 BSON, out of 8,217 unique accounts, 4,381 (53%) are “not found”. That’s before we even consider the unisex, mostly_male, and mostly_female categories that are available in that data. They account for 7.30%, 4.05%, and 2.45% respectively, or a further 13.8%, in case you wondered. In other words, 67% of the BSON data is not given a “definite” male/female gender classification. It looks, then, as though Datasift is being very cautious about its classification, and rightly so.
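For anyone who wants to check my sums, the percentages in the paragraph above can be reproduced from the figures already quoted. This is just the arithmetic; the variable names are my own and nothing here comes from Demos.

```python
# Figures quoted in the text above.
summary_total_users = 89_928
summary_gender_detections = 43_430

bson_unique_accounts = 8_217
bson_not_found = 4_381
# Percentages of the BSON accounts in the three "hedged" categories.
bson_hedged_pct = {"unisex": 7.30, "mostly_male": 4.05, "mostly_female": 2.45}

# Share of users in the summary dataset with no gender ascribed.
summary_undetected_pct = 100 * (1 - summary_gender_detections / summary_total_users)

# Share of BSON accounts marked "not found", then everything short of a
# definite male/female label.
bson_not_found_pct = 100 * bson_not_found / bson_unique_accounts
bson_hedged_total_pct = sum(bson_hedged_pct.values())
bson_no_definite_pct = bson_not_found_pct + bson_hedged_total_pct

print(f"Summary data, no gender ascribed: {summary_undetected_pct:.2f}%")
print(f"BSON data, 'not found':           {bson_not_found_pct:.0f}%")
print(f"BSON data, hedged categories:     {bson_hedged_total_pct:.1f}%")
print(f"BSON data, no definite gender:    {bson_no_definite_pct:.0f}%")
```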

Contrast this with the Demos report, in which only 10% of the tweets “were sent by organisations or users whose genders could not be classified.” That’s more than a little remarkable. How is it, then, that Demos has been able to classify gender 90% of the time where Datasift and other providers have been significantly more cautious?

Guessing games

The gender for Demos’ 2016 blog post is almost certainly being identified on the basis of algorithms. One simple mechanism is to build a library of names, classify them by gender (e.g. Claire = female, Matthew = male), and then let that do the bulk of the work. That might do reasonably well, but how does it handle ambiguous cases such as Charlie, Alex, Sam, and Harper? How does it categorise Nick’s News or Shelby Fans or Going Miles or Mac Tweets? I figured that I would put it to the test, so using my August 2014 BSON data, I exported the gender, bio, screen name, and username categories, cut the results down to the first 200 unique accounts classed as female, and the same again for accounts classed as male, and then checked each one manually. (Yes, this was very boring and took a long time.)
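To show what I mean by a name library, here is a minimal sketch of that sort of mechanism. I stress that this is my guess at one simple approach, not a description of what Demos or Datasift actually do. The dictionary `NAME_GENDER`, the function `guess_gender`, and every gender assignment in it are assumptions made up for the example.

```python
# Tiny hypothetical name library; the gender assigned to each name is an
# assumption for illustration, and a real library would be far larger.
NAME_GENDER = {
    "claire": "female",
    "matthew": "male",
    "nick": "male",
    "shelby": "female",
    "miles": "male",
    "mac": "male",
    "charlie": "unisex",
    "alex": "unisex",
    "sam": "unisex",
    "harper": "unisex",
}


def guess_gender(display_name: str) -> str:
    """Return the library's gender for the first recognised word in a name."""
    # Strip possessives so that "Nick's" is matched as "nick".
    words = display_name.lower().replace("'s", "").split()
    for word in words:
        if word in NAME_GENDER:
            return NAME_GENDER[word]
    return "not found"


for name in ("Claire", "Charlie", "Nick's News", "Shelby Fans", "Going Miles", "Mac Tweets"):
    print(f"{name!r:16} -> {guess_gender(name)}")
```

The lookup happily assigns a gender to a news feed, a fan account, and whatever Going Miles turns out to be, because all it can do is match strings. Whether a given account is a person at all is a question it never asks.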

The results were not promising.

When analysing the accounts supposedly belonging to females, out of 200, five should have been classified as unisex (e.g. a skateboard shop, a small-time search engine, a cultural movement, etc.) whilst a further fifteen should have been categorised as male. That is if I’m to believe bios that read things like “father”, masculine-sounding names like Nathan, and/or profile pictures featuring men – more on deception later. Whatever the case, that doesn’t sound much, right? But twenty out of 200 is an error rate of 10%. If that extrapolates (and I have no plans to check a further 800 because I have a life outside of this) then across a million accounts, that’s 100,000 mistakes.

When the steps for accounts classed as male are repeated, the results are better but still not perfect: I found two that should have been unisex and one that should have been female, which is an error rate of 1.5% or, if it extrapolates, 15,000 in every million accounts. Incidentally, the 10% error rate for accounts classed as female roughly reflects Twitter’s own error rate: “A panel of human testers has found our predictions [for gender] are more than 90 percent accurate for our global audience.” In other words, at the risk of stating the blindingly obvious, around 10% of the predictions are wrong. Naturally, though, Twitter don’t say whether any of their categories perform particularly better or worse.
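The sums behind those two error rates are simple enough to write down. The counts are the ones from my manual check above; the function names are mine, and the per-million figures assume, as I said, that a sample of 200 extrapolates at all.

```python
def error_rate_pct(misclassified: int, checked: int) -> float:
    """Percentage of manually checked accounts that were given the wrong label."""
    return 100 * misclassified / checked


def per_million(misclassified: int, checked: int) -> int:
    """Scale the observed error count up to a million accounts."""
    return misclassified * 1_000_000 // checked


SAMPLE_SIZE = 200

# Accounts classed as female: 5 should have been unisex, 15 male.
female_errors = 5 + 15
# Accounts classed as male: 2 should have been unisex, 1 female.
male_errors = 2 + 1

print(f"Classed as female: {error_rate_pct(female_errors, SAMPLE_SIZE):.1f}% wrong, "
      f"{per_million(female_errors, SAMPLE_SIZE):,} per million")
print(f"Classed as male:   {error_rate_pct(male_errors, SAMPLE_SIZE):.1f}% wrong, "
      f"{per_million(male_errors, SAMPLE_SIZE):,} per million")
```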

As an added aside, the above findings shouldn’t be taken to reflect badly on Datasift. Their own suggested usage for their gender analytics is very clearly not this kind of research. In their example use case, they suggest that you could “collect sentiment about your advertising campaign for a new range of women’s clothing”. That’s certainly not in the same league as trying to influence governmental policy about online abuse. Likewise I have no objections to people inferring gender and doing analysis on that basis. My issue is that very serious research requires very serious rigour.

Anyway, how do my findings affect Dale’s conclusion that 50% of the aggressive tweets were sent by women, versus 40% by men and 10% by accounts of unknown gender? Well, I’m not sure, because I don’t know where he got his gender data from, and as I mentioned above, his certainty about gender seems to be far higher than that in two differently acquired sets of data that I have. He may have used his own methods and algorithms to judge gender, and those methods could be wrong by a far greater margin than the ones in my BSON data, but without access to any of that information, we will simply never know. One thing I am sure of, however, is that if I can’t test it myself, then I will tend to be extremely skeptical. (Heavy hint, here, Demos: if you would like to share your datasets and subsets with me – or in fact all the details of the study that are missing – then I would be delighted to look at them.)

In summary, it should be obvious by this point that algorithms running on Twitter accounts and classifying them by gender are making (sometimes highly educated) guesses, but as my very rough and ready test on my own data shows, those guesses appear to be wrong significantly more often for accounts deemed to be female, where one in ten appears to be misjudged, versus less than two in every hundred for those accounts deemed to be male.

Let’s imagine, though, that Dale built the perfect algorithm that could somehow weigh up the choice of name, the picture, the bio, the tweets, and many other kinds of information, and then make a flawless classification of gender every time. That would be great, wouldn’t it?

Well, no, because people aren’t always themselves online…

Deception

Plenty of Twitter accounts don’t use “proper” names. Those names can look real enough to an algorithm – as far as the computer is concerned, there’s no difference between Mickey Mouse, Peppa Pig, Walter Wolf, or Susie Swan, but a human with the correct cultural knowledge will recognise problems with the first two. And that’s before we get onto an even greater issue at stake: Susie Swan from Milton Keynes who loves needlework and Whitesnake may actually be Barry Bear from Texas who is trying out a new identity online. (NB. There is no moral judgement here. People create accounts that are unlike their offline selves for all kinds of reasons, both innocent and malicious, just as people lie face-to-face, in some cases to be kind, and in others, to be cruel.) However perfect an algorithm, it is also perfectly credulous, and it will believe what you tell it. It will also be as limited and as socially insensitive as you make it, which leads me into my final point…

The gender “binary”

Gender isn’t binary. It’s a spectrum. Unfortunately, this is largely unwelcome news for many people who want to create and train classifiers with nice, easy, dichotomous or trichotomous categories. Corpus linguistics has its own demons to face here with sampling frames for metadata categories that are equally unsympathetic, and just as we now have more options than either “single” or “married”, the sooner we start writing the full range of genders into the code, the better.

Final word

For the record, I have no axe to grind when it comes to NLP and Machine Learning methods. They can do truly extraordinary things and I use them – or, er, make use of people who use them – in my research all the time. With greater automation, though, comes greater potential risk. Computers can get an awful lot right in a short space of time, but as a saying somewhere probably goes, they also enable us to make colossal screw ups with a speed, efficiency, and level of ease heretofore only accessible to those in possession of both strong alcohol and light artillery. When sending a classifier to judge reams of tweets for sensitive issues, and then using the results to make broad societal observations or support movements or inform policy, it is best to be both extremely cautious and extremely transparent. Rigour, replicability, robustness, and all that.

I would finish by observing that I am as able to make mistakes as any human, and possibly even better than any machine, so if you spot errors, misunderstandings, or omissions, feel free to tweet me at DrClaireH.