Fighting Words Not Ideas: Google's New AI-Powered Toxic Speech Filter Is The Right Approach

This morning, Alphabet’s Jigsaw (formerly Google Ideas) officially unveiled Perspective, its aptly named new tool for fighting toxic speech online. Perspective is powered by a deep learning model trained on more than 17 million manually reviewed reader comments provided by the New York Times. The model assigns a given passage of text a score from 0 to 100% reflecting how similar it is to statements that human reviewers have previously rated as “toxic.” What makes Google’s new approach so different from past approaches is that it largely focuses on language rather than ideas: for the most part, you can express your thoughts freely and without fear of censorship as long as you express them clinically and clearly, but if you resort to emotional diatribes and name calling, you will be flagged regardless of what you are talking about. What does this tell us about the future of toxic speech online and the notion of machines guiding humans to a more “perfect” humanity?

One of the great challenges in filtering out “toxic” speech online is first defining precisely what counts as “toxic” and then determining how to remove such speech without infringing on people’s ability to freely express their ideas. While much of the online world is the equivalent of private property, where traditional legal rights like “free speech” do not apply, companies nonetheless want to give their users as much freedom as possible to express themselves – if users feel a platform is censoring them, they will simply go elsewhere.

Historically, most attempts to address “hate speech,” “abusive speech,” or “toxic speech” online have conflated ideas with the language used to express them. One person can hold an unpopular idea but express it clearly and clinically, supported by copious evidence, while another might hold a common idea but express it through a profanity-laden diatribe. The former happens every day in academia and scientific research – a scientist runs an experiment that challenges the current orthodoxy and publishes a series of papers in academic journals documenting the new work. The latter is what we most commonly think of as “hateful speech,” in which logic and reason drown beneath a sea of emotional attacks that bear no relation to the topic being discussed.

Imagine an argument between two strangers over the “best” brand of soft drink. If the argument centers strictly on factual information (how much sugar is in each, the chemical sensitivities of the average human tongue to the ingredients of each drink, the additives in each drink, the role of soft drinks in the human diet, etc.), the discussion takes on an informative and constructive tone in which the two sides may vehemently disagree but focus their efforts on the facts surrounding their respective arguments. In the end, both sides may still come away in complete disagreement, but the conversation itself has yielded a body of citations and statements, much like an ad hoc academic literature review, that can benefit many others in the future.

Now imagine instead that the first person says “Sprite is the best soda and anyone who thinks otherwise is the stupidest idiot alive,” while the second responds “Actually, Coke is the best, but your feeble mind and lack of a college education prevent you from understanding the complex science that proves it.” Lobbing such emotional attacks shifts the conversation away from the actual question of which soft drink is “best,” and neither side offers any evidence to support its argument. Personal attacks also create a psychological distance that makes it difficult to convince the other side even when they might otherwise have agreed. In short, once personal attacks are introduced, the discussion ceases to be about the topic in question and devolves into a common flame war. In the end, all that’s left is an hour’s worth of profanity and pejoratives that neither informs nor contributes anything to society more broadly.

It is human nature to resort to emotional attacks over logical discourse, and the goal of tools like Perspective is to shift us back towards logic, essentially using machines to make us better humans. Yet this raises the question of what the role of the Internet should be: are we better humans when we use cold, rational logic and reason to argue our viewpoints, or is it simply human nature to tear each other apart online, in which case we should step back and allow it? One user quoted in Wired, who has herself been the target of online harassment, argued “People need to be able to talk in whatever register they talk. … Imagine what the internet would be like if you couldn’t say ‘Donald Trump is a moron.’”

It is a fascinating argument that we must protect the ability to say “Donald Trump is a moron” online, since such a statement is no more constructive than saying “Hillary Clinton is a liar” and merely polarizes society rather than informing it. Calling someone names is the classic schoolyard go-to of a kindergartener who has yet to develop the ability to make or defend rational arguments. Part of growing up is evolving from saying “You’re stupid” to “I disagree with your viewpoint and here are my reasons why.” In short, our modern educational system exists to elevate young minds from schoolyard taunts to reasoned arguments.

Toxic speech online today is essentially the verbal equivalent of the schoolyard bully shoving someone off the swing set because the bully doesn’t like them. Part of the educational experience growing up is that when the bully shoves someone off the swings, teachers intervene and work to guide the bully away from physical and emotional outbursts and towards constructive verbal discourse. Should machines do the same online?

One interesting example illustrating the potential of such tools to guide us towards more constructive discourse is how Google’s tool handles a comment claiming that an elected official has committed a crime – something Twitter’s new anti-abuse tool currently struggles with. If someone writes “XYZ is a traitor,” Google’s tool assigns a toxicity similarity score of 72%. However, if one writes a longer statement that includes supporting arguments and evidence, such as “XYZ has committed the criminal act of treason by virtue of aiding a foreign power to undermine U.S. interests in Syria by appearing at a news conference with Assad and legitimizing him in a way that Russia has been unable to,” the tool assigns a score of just 11%.

In short, while it may not have been specifically designed to do so, Google’s tool intriguingly penalizes short, emotionally laden attacks while rewarding longer statements that make the exact same argument in more clinical language and provide supporting evidence. Offering detail and evidence for an argument allows the other side to respond meaningfully, whereas simply calling someone a “traitor” offers little for others to engage with.

Google’s tool is not yet perfect. It struggles with some complex negations, as well as with social media abbreviations and misspellings like “idi0t” (with the number zero replacing the letter “o”) or “stuuuuuuuuuupid” (with certain letters repeated many times), which reflect the freewheeling speech of social media rather than the more formal speech of the New York Times commenters used to train the system. Even so, it is a remarkable step forward by virtue of shifting the conversation from ideas to language. In my conversations with many in the community over the past few years, I have argued long and hard for us to shift how we think about toxic speech from censoring ideas (which is precarious territory and opens the door to governments censoring criticism of themselves) to focusing on how we express those ideas – in short, how we express our beliefs rather than what those beliefs are.
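To make the obfuscation problem concrete, here is a minimal Python sketch of the kind of text normalization a preprocessing step might apply before a comment is scored. It is purely illustrative: the digit-to-letter table, the three-letter repeat threshold, and the function names `normalize` and `normalize_token` are hypothetical assumptions, not anything Jigsaw has described as part of Perspective.

```python
import re

# Hypothetical table of digits commonly substituted for letters.
# This mapping is an illustrative assumption, not part of Perspective.
LEET_MAP = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t"})

# Matches a letter repeated three or more times in a row, e.g. "uuuuuu".
REPEAT_RE = re.compile(r"([a-z])\1{2,}")


def normalize_token(token: str) -> str:
    """Undo two common obfuscations in a single token (output is lowercase)."""
    token = token.lower()
    # Only translate digits in tokens that also contain letters,
    # so plain numbers such as "2017" are left untouched.
    if re.search(r"[a-z]", token) and re.search(r"\d", token):
        token = token.translate(LEET_MAP)
    # Collapse runs of three or more identical letters to a single letter.
    return REPEAT_RE.sub(r"\1", token)


def normalize(text: str) -> str:
    """Normalize every whitespace-separated token in a comment."""
    return " ".join(normalize_token(t) for t in text.split())


print(normalize("You are an idi0t"))  # you are an idiot
print(normalize("stuuuuuuuuuupid"))   # stupid
```

Rules like these are blunt instruments: under this table “1st” would become “ist,” which hints at why hand-written normalization can be brittle compared with a model trained on data that already contains such spellings.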

In doing so, we recognize that toxic speech can break out anywhere over the simplest of topics and that even otherwise rational, logic-seeking university professors are all too quick to devolve into name calling in an era of anonymous speech. We also recognize that while it is very hard to change what someone believes, it is very possible to change how and where they express those views.

Of course, as the Wired interviewee asked, what would social media be without hate speech and freewheeling attacks on each other? If machines eliminated emotional diatribes and forced every online conversation to focus on clinical statements rather than blind emotion, turning us from schoolyard bullies punching those we disagree with into learned scholars having enlightened intellectual discussions, what would this mean for the future of global society? Many a science fiction book has explored the notion of machines guiding humans to a more perfect humanity in which toxic speech and profane attacks cease to exist and are replaced by constructive engagement in which people disagree through civilized discussion. Will pushing us away from toxic speech and screaming matches make us more perfect humans, or will it diminish our humanity and make us more like machines pursuing logical conclusions? Such questions are best left to the philosophers, but at the very least we can say that toxic speech has reached the point where many people feel uncomfortable joining the online conversation today, and thus some form of action is needed.

Jigsaw is quick to note that it views its current system as a first step, a combined technology and social experiment rather than a production system ready to be deployed this afternoon. It is releasing the system as an API service that other organizations can use to experiment with it and explore how it performs in their own communities. This is itself immensely noteworthy in a world in which companies increasingly roll out filtering systems without warning, without any insight into how they function, and without any way for the community to offer feedback. Jigsaw’s API also prominently offers the ability to flag a score the user believes is wrong; these corrections will eventually be used to retrain the models.
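For readers curious what experimenting with such an API might look like, the Python sketch below builds a request asking for a toxicity score, converts the returned score (a probability between 0 and 1) into the percentage style used in this article, and builds a request suggesting a corrected score. It assumes the request and response shapes of Perspective’s `v1alpha1` `comments:analyze` and `comments:suggestscore` methods as publicly documented; endpoints and field names may change, so treat it as a sketch to check against Jigsaw’s current documentation rather than a reference client. The helper names are hypothetical, `post_json` needs a real API key and network access, and the sample response is invented.

```python
import json
import urllib.request

# Endpoints assumed from Perspective's public v1alpha1 documentation;
# verify against current docs before relying on them.
BASE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments"
ANALYZE_URL = BASE_URL + ":analyze"
SUGGEST_URL = BASE_URL + ":suggestscore"


def build_analyze_request(text: str) -> dict:
    """Request body asking for a TOXICITY score for one English comment."""
    return {
        "comment": {"text": text},
        "languages": ["en"],
        "requestedAttributes": {"TOXICITY": {}},
    }


def build_feedback_request(text: str, corrected_score: float) -> dict:
    """Request body suggesting the score (0 to 1) a user believes is correct."""
    return {
        "comment": {"text": text},
        "attributeScores": {
            "TOXICITY": {"summaryScore": {"value": corrected_score}}
        },
    }


def toxicity_percent(response: dict) -> float:
    """Convert the summary TOXICITY probability into a 0-100% score."""
    value = response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
    return round(value * 100, 1)


def post_json(url: str, api_key: str, body: dict) -> dict:
    """POST a JSON body to the API (requires a real key and network access)."""
    request = urllib.request.Request(
        f"{url}?key={api_key}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# A hard-coded response shaped like an analyze reply; the value is made up
# to mirror the 72% "traitor" example quoted earlier.
SAMPLE_RESPONSE = {
    "attributeScores": {
        "TOXICITY": {"summaryScore": {"value": 0.72, "type": "PROBABILITY"}}
    }
}

print(toxicity_percent(SAMPLE_RESPONSE))  # 72.0
```

In practice one would call `post_json(ANALYZE_URL, key, build_analyze_request(comment))` and pass the result to `toxicity_percent`; the feedback request is how a community could send back the kind of corrections the article describes being used to retrain the models.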

Putting this all together, Jigsaw’s approach to the issue of toxic speech online represents in my mind the perfect model: a focus on language over ideas, an open API available to others, transparency in how the model was built and what it looks for, and an open feedback loop for users to help refine the model over time.