Internet bots—those automated scripts that do everything from gathering stock prices to commandeering innocent computers to launch cyberstrikes—have recently come under attack as threatening the web, democracy and our very way of online life. During the 2016 presidential election, Russia unleashed an army of bots to troll Facebook and other sites, amplifying political division in support of Donald Trump and Bernie Sanders. Twitter reported that in May alone it found nearly 10 million bots each week. Some of these Twitter bots posed as fans to enhance the popularity of celebrities (when the bots were stricken from the rolls, former President Barack Obama, Katy Perry and Oprah suddenly became about 2 percent less popular). Ever ready to wield a legislative remedy, California is considering laws to formally define and regulate all manner of online bots.

If these bots are so terrible, why not simply outlaw them?

Well, for starters, the internet could barely function without them. Google, for example, can only index the web and present search results through the use of bots—it calls the process “Googlebot”—which it describes as a spider that crawls nearly every website on the internet, often every few seconds. Say what you will about the fairness of algorithms that order search results, but you’re not going to have much patience for a search engine that depends on humans laboriously copying data from individual websites to craft its rankings. Businesses commonly use bots to crawl and scrape the websites of competitors for real-time pricing information. But there’s another argument in favor of bots that gets far less attention, at least outside of a courtroom in Washington, D.C. And it is challenging the notion that all bots—even fake accounts—are evil.

The case unfolding in the federal district courtroom of Judge John D. Bates was filed by a group of civil rights researchers who depend upon crawling and scraping bots, along with thousands of fake accounts, to uncover persistent and pernicious discrimination—based on race, gender or age—on employment and housing platforms, and across the very web itself. In their hands, internet bots are a potentially unparalleled tool for social justice, albeit one that happens to run afoul of the terms of service of platforms like Facebook and Twitter that prohibit bots and fake accounts.

In a preliminary ruling in March, Bates held that these researchers could well enjoy a First Amendment right to create fake accounts, along with their attendant bot automation, to crawl web platforms, scrape their contents and use the data to statistically measure discrimination. He might as well have said, when it comes to bots, we must learn to tell good from bad.

If there’s such a thing as a good guy with a bot, it’s someone like Christian Sandvig. He’s one of a number of researchers in this new field of “algorithmic accountability,” and he’s a plaintiff in the pending D.C. litigation. Sandvig’s mission is to detect online discrimination in housing or employment opportunities on online platforms and on the web writ large. Do women see fewer ads for high-paying CEO jobs than men? Do white couples see ads for apartment rentals that black couples do not? Preliminary studies suggest the answer to both questions could quite possibly be yes, and often the cause might be an algorithm rather than a deliberate choice by an employer or a landlord. As Sandvig puts it, what do we do “when the algorithm itself is a racist”?

We have known for a little while now that on certain social media platforms there was a potential for people to do what people have done from time immemorial—discriminate. In a series of recent articles, ProPublica showed how Facebook allowed advertisers to discriminate on the basis of race, gender, age, or status as a parent—all categories protected by law—in placing ads for housing. In 2016 and part of 2017, an advertiser could exclude African-Americans from seeing housing ads; when Facebook fixed this setting, ProPublica reported in 2018 it was still possible to exclude based upon gender by checking a box to exclude moms with children, for example.

But these earlier ProPublica studies showed only that one could check these boxes, not that employers, landlords or real estate agents actually did so. Moreover, these studies, telling as they are, do not address the big data algorithms that now dominate advertising. They do not tell us whether, for reasons that are not directly attributable to a person, the algorithm has determined based upon past data to show CEO jobs more to men than women.

Andrew Selbst, a scholar of online discrimination and big data algorithms, explains the problem. If you train an algorithm on past data about who holds the top CEO jobs, the data will include far more men than women. The algorithm, detached from all concern for workplace fairness, will conclude that maleness is a qualification for the job, and therefore show ads for those jobs to men more than women. Maleness ends up “being coded as merit. But it’s baked into centuries of discrimination. You’re tech-washing this old claim of seeing merit as this neutral idea.”

***

To test whether algorithms are racist, researchers adapt old-school civil rights testing—“audit testing”—to the vast and ever-shifting expanses of the web. Ordinary civil rights testers will send a white couple and a black couple—identical in every way except race—to apply for apartments. An online audit proceeds the same way, but with the repetitive speed of a modern processor operating at billions of cycles per second. Such “pair-audit tests” are a “really critical part of testing,” says Rachel Goodman, a civil rights lawyer with the ACLU who represents the plaintiffs in the D.C. litigation.

But when applied online, these pair-audit tests must often be automated with bots and fake accounts because, as Sandvig points out, web pages and their ads are entirely personalized, different for each person visiting a site, and even different each time the same person visits the site. “Each of us is seeing a webpage no one else sees and will never be seen again,” he says. This discrimination is hidden because no one knows what they don’t see. Women will never know they weren’t shown that ad for CEO of a company because the ad was personalized for them by an algorithm that concluded they were less qualified than a similar man.

To detect this hidden and fleeting discrimination—as fleeting as when a person leaves a webpage—researchers need to create fake accounts on major platforms for housing and employment. They need to create bots, automated computer scripts that will visit these websites, thousands or hundreds of thousands of pages, and record what they find, before they evaporate.

For example, in a foundational 2015 study, researchers at Carnegie Mellon (among other universities) created 1,000 personas by starting a fresh web browser and clicking a setting that allowed them to set the gender. They set half male, half female. Each fresh browser became a new, virtual online person, and they built these virtual beings, using automated bot scripts, by having the browsers visit the same sites—in this case, the top 100 employment websites. This behavior primed the internet advertising universe—the so-called “persistent tracking cookies” advertisers use to identify a person’s interests—to recognize them as job-seekers. These web browsers, these 500 Johns and 500 Marys, then visited several websites, including The Times of India—useful because its site contained so many text ads—and the researchers recorded the ads each browser was shown. (This account greatly simplifies but captures the thrust of the study, according to co-author Michael Carl Tschantz.)

The study found that the Google ads treated the genders differently, showing “women” fewer ads for high-paying jobs than “men.” In one finding, Google showed an ad for a career coaching service for jobs paying more than $200,000 to the “men” 1,852 times versus only 318 times to “women.” The study did not show that anyone acted intentionally. It did not even attribute blame for the discrimination. “We can’t be 100 percent sure why it happened,” said Anupam Datta, another of the study’s authors. It could arise from numerous sources, such as the algorithm used to generate the ads, the data set upon which that algorithm was trained, or even intentional discrimination by at least some of those placing the ads. But to really determine many of the causes would require “insider access,” Datta said. But it is this inside access that many platforms are unlikely to grant, according to Datta, “because of IP considerations.” Which brings us back to outsiders using bots—a realization that may have led the Knight First Amendment Institute to send Facebook a public letter last week requesting an exception from its ban on certain bots and research accounts often used by journalists.
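To make the arithmetic behind such a finding concrete, here is a minimal Python sketch. It is not the study's own code or necessarily its procedure. The function names are hypothetical, and the per-browser counts in the usage note are invented purely for illustration; only the totals 1,852 and 318 come from the study as reported above. The first function computes how lopsided the ad impressions were. The second is a generic permutation test, a standard way to ask how often a gap this large would appear if the "male" and "female" labels were shuffled at random.

```python
import random
from statistics import mean


def disparity(shown_to_men, shown_to_women):
    """Return (men-to-women ratio, men's share of all impressions)."""
    total = shown_to_men + shown_to_women
    return shown_to_men / shown_to_women, shown_to_men / total


def permutation_p_value(group_a, group_b, trials=10_000, seed=0):
    """Estimate how often randomly relabeled groups show a gap in
    average impressions at least as large as the observed gap."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)  # reassign the group labels at random
        gap = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if gap >= observed:
            hits += 1
    return hits / trials


# Totals reported in the article: 1,852 impressions vs. 318.
ratio, share = disparity(1852, 318)
print(f"ratio {ratio:.1f} to 1, men's share {share:.0%}")
```

With made-up per-browser counts such as `permutation_p_value([9, 8, 10, 9, 11, 10, 9, 10], [1, 2, 1, 0, 2, 1, 1, 2])`, nearly every random relabeling produces a smaller gap than the observed one, so the estimated p-value is close to zero. A test like this can say that a gap is unlikely to be chance, but, as Datta notes, it cannot say what caused it.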

This 2015 ad study measured cross-platform discrimination facilitated by tracking cookies that allow advertisers to follow a person from site to site. But it did not create fake accounts or use bots to crawl and scrape data from the employment or housing platforms themselves. The researchers didn’t need to. Since that study, interestingly, Google has changed its settings to prevent anyone from creating an anonymous browser and setting the gender—one must set up an account. Today, to perform the same research, researchers would need to create fake accounts, according to Datta. Moreover, researching discrimination in housing and employment would likely require fake accounts on those platform sites themselves. But that’s where researchers fear they might run up against federal criminal law.

Indeed, Sandvig was so concerned that the research he would like to conduct would violate federal law that he teamed up with other researchers and lawyers at the ACLU to bring a lawsuit against the Department of Justice seeking a judgment that the First Amendment protects their research into discrimination online. Two of these researchers told the court that they would like to create fake accounts—“sock puppets,” as they called them—at an employment website, half male and half female, but otherwise with identical attributes, to uncover discrimination, particularly in the ads shown by the platform algorithms.

But these techniques, the use of fake accounts and automated bots to scrape the results, violate the express terms of service of nearly every major platform website. LinkedIn, for example, says, anyone using the service “agrees” that the account “must be in your real name.” Facebook—a major advertiser for jobs and housing—also says in its terms that any user must “use the same name as in real life.” Worse, the federal hacking statute, the Computer Fraud and Abuse Act, under at least some interpretations, incorporates these terms of service. To violate the terms of service is to gain “unauthorized access” to a computer and thus to commit a crime. As Bates put it in the D.C. litigation, “to knowingly violate some of those terms, the Department of Justice tells us, could get one thrown in jail.” While the DOJ argued such criminal prosecutions are quite rare, and that it won’t prosecute harmless terms of service violations, Bates concluded there was a credible threat of prosecution against the plaintiffs.

Bates’ decision is preliminary only—denial of a motion to dismiss—but its language runs broadly. He ruled, essentially, that the researchers’ use of bots to scrape information from platforms would not violate federal law, and that their proposed creation of fake accounts through deception is an activity likely protected by the First Amendment. The government has little interest in criminalizing such activity, he wrote, and the harm to the target platforms from such fictitious accounts, created for research only, is minimal, assuming the facts in the complaint are true.

But his decision sets forth broader principles about the nature of online platforms. The DOJ argued in the Sandvig case, and online platforms such as LinkedIn or Facebook themselves have often argued, that these platforms are merely private property. A platform can deny access to anyone it wishes, for any reason, including if a visitor creates fake accounts or uses automated bots. If the government criminalizes a person who accesses such a website in violation of the website’s rules, it has merely criminalized a trespass, analogous to the criminal trespass that would occur in the real world if a person refused to leave the premises after being told to leave.

Bates rejected the argument that the researchers are trespassing; he wrote, essentially, that the public portions of platforms, including the profiles of their users, constitute a public place of sorts. Researchers, at least, may have rights to enter these public portions of the platform, even with fake accounts and bots. After all, he reasoned, a bot merely does what an individual could do in visiting a website: it goes to the site, goes to a particular page, and records what it finds there. As Rachel Goodman, an ACLU lawyer representing the plaintiffs, put it, “It’s not an argument about them as public forums in the traditional sense that they have to accept any comments. What the judge was saying is the internet is a ‘critical medium of communication,’ [and that] the analogy to private property doesn’t hold up.”

***

Judge Bates’ preliminary protection for researchers falls within a very specific context. It applies to legitimate researchers who visit the public portions of major platforms. But how far should that view extend, and how can we draw the lines when it comes to other uses of scraping, as for economic competition? How do we tell a good bot from a bad? And is it fair to say that a bot does no more than what an individual human being might do, only on a larger scale?

After all, bots allow businesses to scrape information from competitors’ sites at a scale far beyond what an individual human being, or a team of human beings, could hope to accomplish. For example, in a closely watched lawsuit in California, the entire business model of a startup called hiQ revolves around scraping data from LinkedIn and using that information to perform high-level analytics of trends for its corporate customers. Its “Keeper” function uses big data analytics to alert an employer if one of its valued employees is about to jump ship. HiQ uses bots to visit the public portions of LinkedIn, the public portions of individuals’ profiles, so one could argue that these bots are simply viewing what any ordinary person signed onto LinkedIn could observe. On the other hand, the bots crawl over hundreds of thousands more profile pages than a human could, and scrape the information there with a speed and accuracy beyond any human’s, and return that information to hiQ for high-level processing. Basically, hiQ would not exist if it couldn’t use bots.

LinkedIn configured its platform to block hiQ, arguing that it was merely protecting its private property. HiQ sued, arguing that LinkedIn’s public-facing profiles of users are essentially public property. The court sided with hiQ, writing that LinkedIn is more like “a storefront window visible on a public street.” It issued a preliminary injunction requiring LinkedIn to allow hiQ to continue scraping its site for this public information. Like Bates, the California court held (preliminarily) that the Computer Fraud and Abuse Act did not criminalize hiQ’s use of bots, even over LinkedIn’s objections; but going far further, it held that California’s unfair competition laws required, for now, that hiQ be allowed bot-access to LinkedIn. The case remains on appeal.

These cases raise the fundamental question: To what extent are dominant platforms such as LinkedIn public spaces, and to what extent do the platforms have the power to control access as if they are private property? When those seeking access are civil rights researchers, their cause is somewhat sympathetic. But when those seeking access are Russian bots posing as Americans seeking to sow discord in elections, we may want and even demand that the platforms patrol their access. When the example lies somewhere in the middle, as in the hiQ case against LinkedIn, we will have to await the decisions of the courts.