The Declassification Engine: Reading Between the Black Bars

In late August, the Obama Administration released a trove of documents detailing the government’s collection of information under the Foreign Intelligence Surveillance Act, including a heavily redacted, fifty-two-page report on the National Security Agency’s FISA compliance for part of 2012. “These documents were properly classified,” James Clapper, the Director of National Intelligence, noted in a letter accompanying their release, “and their declassification is not done lightly.” That’s certainly true, by the look of it: black bars appear on all but seven of the pages in the compliance report. When the government touts the release of such heavily redacted documents as an act of transparency, leaving us to guess what we might be missing, the question inevitably comes up: Are there ways we can peer behind the black bars? According to a number of researchers, there often are.

It’s well known that the U.S. government has a tendency toward over-classification. A 2012 report by the Public Interest Declassification Board, a government-funded advisory group, found that a “culture of caution” among executive-branch agencies had led to chronic over-classification and, in turn, had “compromised” the entire classification system. Government officials frequently perpetuate this culture by invoking national security, but Marc Trachtenberg, a Cold War historian at U.C.L.A., told me that “the function of declassification is much broader than keeping information from the enemy.” Often documents remain classified simply to save face. (Think of the cables released by WikiLeaks in 2010, some of which didn’t reveal sensitive information but were merely unflattering.)

Most government agencies that handle classified information have dedicated sanitizers. For an act so often associated with the anonymous, passionless churning of the government machine, redaction betrays a striking individualism in the choices about what to leave visible and what to obscure, and in the shapes of the black bars themselves. The black marker pen is the sanitizer’s most basic tool. But it can be sloppy, and the sheen of a photocopy sometimes reveals the letters beneath the ink. Sanitizers also employ opaque tape and razor knives, cutting out the sensitive content from a copy of the page. You can usually identify the tool by the marks it leaves behind: the pen-redacted page is filled with heavy, imperfect lines, while the razor knife and opaque tape both leave sharp edges (though the photocopier gives a knifed-out block a mottled, grayish hue). Occasionally, a sanitizer simply covers text with another piece of paper when photocopying the document. The rise of born-digital documents has brought new challenges: in 2009, the N.S.A. released an updated version of “Redacting with Confidence,” its how-to guide for the declassification of digital documents. The manual emphasizes a new set of actions: when working with a word processor, sanitizers must delete sensitive content, replace it with “innocuous text” to preserve formatting, and only then cover the innocuous text with a digitally drawn black—or, as it recommends, gray—box. “Complex file formats offer substantial avenues for hidden data,” it warns. “Once a user enters data into the document, effectively removing it can be difficult.”

Trachtenberg, who has refined old methods for analyzing the redacted segments in declassified texts, utilizes a sort of comparative reading to see beneath the black. “The basic idea is to exploit the fact that documents—the same documents—are declassified in different ways in different repositories,” revealing different information in each version, he said. The guidelines for declassification vary across agencies and offer room for interpretation, so that a sanitizer responding to a Freedom of Information Act request in 1992 might redact a document differently from a sanitizer in 2012, creating advantageous inconsistencies. Slowly, an attentive researcher can chip away at the blacked-out parts of a document, building context and fuelling further excavation.
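The comparative idea can be shown in miniature. The sketch below is only an illustration, not Trachtenberg’s actual procedure: it assumes two releases of the same document have already been split into word lists of equal length, with a block character standing in for each redacted word, and the function name and sample sentence are invented for the example.

```python
REDACTED = "█"  # placeholder for a blacked-out word (an assumption of this sketch)


def merge_versions(*versions):
    """Combine differently redacted copies of one document, word by word.

    Each version is a list of words of the same length. A word that is
    visible in any copy is kept; a position stays redacted only if every
    copy hides it.
    """
    if len({len(v) for v in versions}) != 1:
        raise ValueError("versions must be aligned to the same length")

    merged = []
    for words_at_position in zip(*versions):
        visible = {w for w in words_at_position if w != REDACTED}
        if len(visible) > 1:
            # Two copies show different words here, so they are not the same text.
            raise ValueError(f"versions disagree: {sorted(visible)}")
        merged.append(visible.pop() if visible else REDACTED)
    return merged


# Invented sample: two releases of the same sentence, redacted differently.
release_1992 = ["The", "envoy", "met", REDACTED, "in", "Cairo", REDACTED]
release_2012 = ["The", REDACTED, "met", "officials", "in", REDACTED, REDACTED]

print(" ".join(merge_versions(release_1992, release_2012)))
```

Real documents rarely line up this neatly; pagination, retyping, and redactions of varying length mean the alignment step is where most of the labor goes.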

But Trachtenberg’s techniques, though fundamentally sound, are slow, and naturally other researchers have taken up the task of trying to automate the process, at least in part. On a cloudless afternoon not long ago, I met with Matthew Connelly, a Columbia history professor, outside the National Archives in Washington, D.C. Together with a group of historians, computer scientists, and statisticians, Connelly is developing an ambitious project called the Declassification Engine, which, among other things, employs machine-learning and natural language processing to study the semantic patterns in declassified text. The project’s goals range from compiling the largest digitized archive of declassified documents in the world to plotting the declassified geographical metadata of over a million State Department cables on an interactive global map, which the researchers hope will afford them new insight into the workings of government secrecy. Though the Declassification Engine is in its early stages, Connelly told me that the project has “gotten to the point where we can see it might be possible to predict content of redacted text. But we haven’t yet made a decision as to whether we want to do that or not.”

An attempt at automated un-redaction would not be without precedent. In April, 2004, Claire Whelan, then a doctoral candidate in computer science at Dublin City University, used a suite of established document-analysis technologies to decrypt a blacked-out word in the infamous “Bin Ladin Determined to Strike in U.S.” brief that George Bush received on August 6, 2001. Whelan ran a digitized version of the memo through optical-character-recognition software, which determined the font type (Arial) and helped to “estimate the size of the word behind the blot,” Whelan’s adviser, David Naccache, told Nature at the time. “Then you just take every word in the dictionary and calculate whether or not, in that font, it is the right size to fit in the space, plus or minus three pixels.” A second dictionary-reading program offered a few hundred fits for the “blot”—ranging from “acetose” to “Ukrainian”—which Whelan whittled down to about a half-dozen likely adjectives and country names. Eventually, she and Naccache identified the government that had warned the United States of an impending terrorist attack just a month shy of 9/11—and unless the Ukraine or Uganda had secret intelligence on bin Laden, all clues pointed toward Egypt.
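Naccache’s description amounts to a simple filter, and a toy version fits in a few lines. The sketch below is illustrative only and is not Whelan’s software: the per-character widths are made-up stand-ins for real Arial metrics, the function names are hypothetical, and the candidate words and the blot width are invented for the example. Only the tolerance of three pixels comes from the quotation above.

```python
def char_width(ch):
    """Return a made-up pixel width for one character.

    Real work would read these values from the font's metrics; the
    numbers here are assumptions chosen only to make the example run.
    """
    if ch.isupper():
        return 11
    if ch in "fijlt":
        return 4
    if ch in "mw":
        return 13
    return 9


def rendered_width(word):
    """Estimate how many pixels a word would occupy when typeset."""
    return sum(char_width(ch) for ch in word)


def words_that_fit(words, blot_width, tolerance=3):
    """Keep the words whose width is within `tolerance` pixels of the blot."""
    return [w for w in words if abs(rendered_width(w) - blot_width) <= tolerance]


# Invented candidate list and blot width, for illustration.
candidates = ["Egyptian", "Ugandan", "Iraqi", "Jordanian", "allied"]
print(words_that_fit(candidates, blot_width=64))
```

Even in this toy case, more than one word survives the width test, which is why the shortlist still has to be narrowed by grammar and context, as Whelan did by hand.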

Whelan and Naccache’s analyses relied heavily upon the revealing shapes of pen-drawn black bars in the memo to narrow down the number of possible words. When I spoke to Naccache, nine years after the experiment, he said that their technique had effectively “died,” due to the difficulty of applying it to digitally redacted documents, which the Declassification Engine’s archive will eventually include. The Declassification Engine, though, wouldn’t face the same barriers as the two human researchers, since its technologies focus less on the redaction bars themselves than on the spaces and words around them.

The Declassification Engine researchers are not approaching the matter without trepidation. When Richard H. Immerman—a historian at Temple University who, as a former Assistant Deputy Director of National Intelligence, has a top-secret security clearance—heard about the project’s potential for un-redaction, he started to worry about the mosaic theory, a precept that the intelligence community often invokes in the alleged and legally tenuous interest of national security. The theory’s thesis is clear-cut: pieces of banal, declassified information, when pieced together, might provide a knowledgeable reader with enough emergent detail to uncover the information that remains classified. Or, as Immerman put it: “If you can find A, somehow you can connect the dots to a really big Z.” In one 2003 case, the Center for National Security Studies sued the Justice Department after it denied a FOIA request for documents relating to the secret detention of hundreds of individuals after 9/11. The court ruled largely in favor of the D.O.J., in the end, justifying the government’s denial of the FOIA request on the grounds that the detainees’ information made up “a comprehensive diagram of the law enforcement investigation after September 11.”

“If they thought it was possible or likely that people could figure out what was behind the black bar and it was significant,” Immerman said of government agencies, “they would stop redacting altogether. They would withhold the document.” Many of the people I spoke with voiced similar concerns about exacerbating the intelligence community’s “culture of caution.” The concerns are based not in present reality but in interpretation and promise; they are about what the government thinks somebody might recover someday.

For his part, Connelly is conscious of the mosaic theory and has formed a “steering committee” of about a dozen historians (including Immerman), computer scientists, and experts familiar with classification and declassification to help guide the Declassification Engine and eye ethical and legal lines. Ultimately, the committee, which convenes in January, will help decide whether the project should try its hand at un-redaction. “No one on this project thinks there isn’t a proper place for official secrecy,” Connelly said, adding that the researchers do “want to explore what’s possible, if we can manage the risks.” The researchers hope the project will help illuminate the space between necessary secrets and over-caution, opening sanitizers’ minds to a less conservative approach to redaction. “I think what we all want,” Immerman said, “is a declassification process that we could be confident withholds material that really does have serious security or privacy implications, in contrast to the over-classification that we experience now.”