The Washington Post carried a story earlier this week about a new experiment in data-driven policing being run by the Fresno Police Department. While algorithmic and predictive policing has become a hot topic of late, the Post article touched on three areas of particular relevance to the big data world: data recency, the burgeoning use of sentiment mining and dashboard/interface design.

Almost 90% of all local police departments in the US now employ some form of surveillance technology, from “cameras and automated license plate readers [to] handheld biometric scanners, social media monitoring software, devices that collect cellphone data and drones.” Yet with this much data pouring into centralized archives that pool it all together, police departments are running up against traditional data warehousing issues like data recency.

The Post article describes how, when police ran local councilman Clinton J. Olivier through the software at his request, the system rated his personal threat level as green, but his residence was scored yellow. Police suggested the score was likely caused by a previous resident of the address who may have had a police interaction or record. Yet, as Olivier pointed out, “even though it’s not me that’s the yellow guy, your officers are going to treat whoever comes out of that house in his boxer shorts as the yellow guy.”

In other words, if police are ever dispatched to an incident at his address, they won’t know that the person coming out of the front door is Olivier, who has a green score; they will only know that the address has a yellow score and will treat whoever walks out of the door as a potential threat.

There is not enough detail in the article to understand why Olivier’s address was given a yellow score, and the company apparently declined to comment on the specific data points that triggered it, but one likely reason is data recency. If, as the police suggested to Olivier, the score was due to a past resident of his house, the database was likely not merging in real estate sales data, USPS change-of-address data or other feeds that would denote a change in residence.

As with any warehouse, data from different sources often arrives at different rates and with different data quality issues that affect how well it can be merged. In this case it is possible that the database has historical arrest and dispatch records, but not the change-of-address records needed to recognize that the subject of the previous police interactions at that address no longer resides there. Alternatively, it is also possible that the database is aware that the previous subject has moved, but is designed to keep flagging the address for a period of time to provide situational awareness for officers in case the person returns.
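
To make the recency problem concrete, here is a minimal sketch, assuming hypothetical incident and change-of-residence feeds (the field names, dates and matching logic are my own illustration, not details from the Post article or the vendor’s system), of how an address flag built from historical records goes stale the moment the change-of-address feed lags or fails to merge:

```python
from datetime import date

# Illustrative, simplified feeds; in practice arrests, dispatch logs, USPS
# change-of-address data and property sales arrive at different rates.
incidents = [
    {"address": "123 Main St", "subject": "J. Doe", "date": date(2013, 6, 1)},
]
residency_changes = [
    # If this feed is delayed or never merged, the flag below goes stale.
    {"address": "123 Main St", "subject": "J. Doe", "moved_out": date(2014, 2, 15)},
]

def address_flag(address, incidents, residency_changes):
    """Flag an address if any past incident subject still appears to live there."""
    for incident in incidents:
        if incident["address"] != address:
            continue
        has_moved = any(
            change["address"] == address
            and change["subject"] == incident["subject"]
            and change["moved_out"] > incident["date"]
            for change in residency_changes
        )
        if not has_moved:
            return "yellow"  # a prior incident subject is not known to have left
    return "green"

print(address_flag("123 Main St", incidents, residency_changes))  # green
print(address_flag("123 Main St", incidents, []))                 # yellow (feed missing)
```

The point is not the particular join logic; it is that the color an officer sees is only as current as the slowest feed behind it.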

The challenge is that by resolving all of this information down to a single red/yellow/green score, all of the data points, interpretations and decisions that led up to the score are lost. It is therefore impossible to know whether an incorrect score is due to an out-of-date database, whether it was by design, or whether it is due to other past adverse information available about the address.
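
The information loss is easy to see in miniature. In the toy example below (the evidence items, weights and thresholds are invented purely for illustration), three very different underlying situations all collapse to the same yellow, and nothing in the output lets an officer tell them apart:

```python
def collapse_to_color(evidence):
    """Reduce weighted data points to a single color (illustrative thresholds)."""
    total = sum(weight for _, weight in evidence)
    if total >= 10:
        return "red"
    if total >= 4:
        return "yellow"
    return "green"

# Three very different situations...
stale_record = [("2013 arrest record for a previous resident", 5)]
by_design    = [("prior subject moved, flag retained for 12 months", 5)]
adverse_info = [("two recent disturbance calls at the address", 5)]

# ...all collapse to the same output; the reason is unrecoverable from the score.
for evidence in (stale_record, by_design, adverse_info):
    print(collapse_to_color(evidence))  # yellow, yellow, yellow
```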

Yet it is the tool’s use of sentiment analysis of social media posts that poses perhaps the greatest risk of dangerous errors. As the company puts it, “to the extent that there is information that is in the public domain, regardless of where the input was derived, it could potentially be surfaced,” and it specifically touts its ability to scour social media as part of its risk scoring system.

However, the local ABC affiliate in Fresno reported on “a Fresno woman whose score went up for posting on Twitter about her card game that happens to have ‘rage’ in its title.” Sentiment mining is an incredibly complex and nuanced field, with the majority of current systems coming from the computer science world rather than from the field’s roots in psychology and communications. As I chronicled for Wired in 2014, laying out the state of the field and the major stumbling blocks of most commercial and research systems, assessing emotional tenor from text is an extremely difficult and error-prone task, and most current systems have significant limitations on their accuracy.
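
The “rage” incident shows how easily keyword-driven approaches misfire. The deliberately naive lexicon scorer below is not the vendor’s algorithm, which has not been disclosed; it is a sketch of the kind of context-free word matching that many commercial tools still resemble, and it flags an innocuous tweet on a single out-of-context word:

```python
# A deliberately naive lexicon-based scorer (illustrative only; the actual
# commercial algorithm is undisclosed).
THREAT_LEXICON = {"rage": 3, "destroy": 4, "blow": 4, "bomb": 5}

def naive_threat_score(text):
    """Sum lexicon weights per word, with no handling of context, sarcasm or idiom."""
    words = text.lower().replace("!", "").replace(".", "").split()
    return sum(THREAT_LEXICON.get(word, 0) for word in words)

print(naive_threat_score("Great night playing Rage with friends"))     # 3 -> flagged
print(naive_threat_score("Free this week before we destroy America"))  # 4 -> flagged
# Both tweets score as threatening even though neither expresses any intent to harm.
```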

Sarcasm and idiomatic expressions are particularly troublesome, even for human analysts. In 2014 the Secret Service put out a call for proposals specifically highlighting the requirement that proposed systems correctly handle sarcastic statements, while the FBI at the time maintained a detailed social media guide that included definitions of more than 3,000 acronyms to help its analysts make sense of online conversation.

In 2012 two British tourists were detained for 12 hours and interrogated by Homeland Security counter-terrorism officials and ultimately denied entry to the US over two tweets, one referencing “destroy[ing] America” and the other saying he would be “diggin’ Marilyn Monroe up.” It turned out that the DHS analysts who saw the tweets were unfamiliar with pop culture and did not recognize the British colloquial use of “destroy” as a euphemism for partying, or that digging up Marilyn Monroe was a quote from an episode of the American comedy show Family Guy. In fact, the DHS paperwork states “Mr. Bryan confirmed that he had posted on his Tweeter [sic] website account that he was coming to the United States to dig up the grave of Marilyn Monroe. Also … that he was coming to destroy America.”

Social media is also rife with exasperated commentary posted in moments of anger. A British traveler in 2010 became the first Briton convicted of a crime on Twitter when, in a fit of anger after having his flight canceled, he tweeted that he would be “blowing the airport sky high!” While the courts agreed that, given his background and lack of any preparatory activity, it was highly unlikely that he actually intended to carry out the threat, it was nevertheless ruled a criminal act.

If human intelligence analysts are unable to recognize sarcasm, to distinguish between a British slang term for partying and an actual threat to attack the United States, or to recognize a line from an American comedy show, and when people routinely tweet very realistic-sounding threats in moments of anger, what hope do machines have, and how should such posts be factored into social media-based threat scores?

As the Secret Service RFP reflects, sentiment analysis, whether performed by humans or machines, has great difficulty understanding intent at the level of an individual when it comes to potential future violence. Few contemporary computer algorithms would have recognized “they take 1 of ours, let’s take 2 of theirs” or “putting wings on pigs” as statements of imminent violence against a police officer, while the hashtag “#shootthepolice” would likely also have been missed by many platforms due to the way most sentiment analysis tools treat hashtags.
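
The hashtag problem is largely one of tokenization. In the simplified sketch below (my own illustration, not any particular product’s pipeline), a preprocessor that strips hashtags as metadata never even sees the words inside “#shootthepolice”, while a crude segmentation step recovers them:

```python
import re

THREAT_WORDS = {"shoot", "kill", "bomb"}

def tokens_dropping_hashtags(text):
    """Common shortcut: strip hashtags as metadata before running sentiment analysis."""
    return re.sub(r"#\w+", "", text).lower().split()

def tokens_splitting_hashtags(text, vocabulary=("shoot", "the", "police")):
    """Crude segmentation of hashtag bodies into known words (illustration only)."""
    tokens = []
    for token in text.lower().split():
        if token.startswith("#"):
            body = token[1:]
            tokens.extend(word for word in vocabulary if word in body)
        else:
            tokens.append(token)
    return tokens

tweet = "#shootthepolice"
print(any(t in THREAT_WORDS for t in tokens_dropping_hashtags(tweet)))   # False: missed
print(any(t in THREAT_WORDS for t in tokens_splitting_hashtags(tweet)))  # True: caught
```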

Putting all of this together, the challenges raised in the Post article and these other cases come not from the data itself, but from how that data is being interpreted. When a black box system synthesizes large amounts of highly diverse data and incorporates complex and nuanced techniques like sentiment analysis, but outputs only a simple red/yellow/green score, it masks all of that complexity and can leave an officer overly confident in a risk assessment. A better interface might instead display to the officer a handful of the data points judged by the algorithm to be most significant, such as the tweets or arrest records in question. The human officer would likely instantly recognize that a “rage” tweet about a card game is a false positive, while an outdated arrest record for a property that was just sold would suggest that the risk flag on a given house is no longer relevant. Moreover, officers in the field, who are most familiar with the local terminology, street slang and gang code of their area, can likely make better determinations of risk from the underlying information than a generic algorithm applying blind, generalized probabilities.
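
To picture what that interface change looks like, the hypothetical sketch below reuses the illustrative weights from the earlier example but returns the top contributing data points alongside the color, rather than the color alone:

```python
def score_with_evidence(data_points, top_k=3):
    """Return the color plus the top contributing data points, not the color alone."""
    total = sum(weight for _, weight in data_points)
    color = "red" if total >= 10 else "yellow" if total >= 4 else "green"
    top = sorted(data_points, key=lambda item: item[1], reverse=True)[:top_k]
    return color, top

data_points = [
    ("tweet mentioning 'rage' (a card game)", 3),
    ("2013 arrest record for a previous resident", 5),
]
color, evidence = score_with_evidence(data_points)
print(color)  # yellow
for description, weight in evidence:
    # An officer seeing these items can judge each one directly, e.g. dismiss the
    # card-game tweet or note that the arrest record predates a property sale.
    print(f"  [{weight}] {description}")
```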

In the end, the issues faced by Fresno’s data-driven policing appear to be a classic case of user interface design: a dashboard that masks the uncertainty of the underlying data. By eliminating the black box scoring component and instead providing officers a concise, actionable summary of the key data points identified by the software, the database’s real-time indicators can be coupled with the street knowledge and human intelligence of officers to yield the kind of intelligent policing envisioned by the Fresno Police Department, with a much lower risk of false positives. Indeed, it appears that is precisely what the city intends to do.