While this blog post provides a description of a data exposure discovery involving the Department of Defense, this is no longer an active data breach. As soon as the UpGuard Cyber Risk Team notified the Defense Department of this publicly exposed information, immediate action was taken, securing the open buckets and preventing further access.

The UpGuard Cyber Risk Team can now disclose that three publicly downloadable cloud-based storage servers exposed a massive amount of data collected in apparent Department of Defense intelligence-gathering operations. The repositories appear to contain billions of public internet posts and news commentary scraped from the writings of many individuals from a broad array of countries, including the United States, by CENTCOM and PACOM, two Pentagon unified combatant commands charged with US military operations across the Middle East, Asia, and the South Pacific.

The data exposed in one of the three buckets is estimated to contain at least 1.8 billion posts of scraped internet content over the past 8 years, including content captured from news sites, comment sections, web forums, and social media sites like Facebook, featuring multiple languages and originating from countries around the world. Among those are many apparently benign public internet and social media posts by Americans, collected in an apparent Pentagon intelligence-gathering operation, raising serious questions of privacy and civil liberties.

While a cursory examination of the data reveals loose correlations of some of the scraped data to regional US security concerns, such as with posts concerning Iraqi and Pakistani politics, the apparently benign nature of the vast number of captured global posts, as well as the origination of many of them from within the US, raises serious concerns about the extent and legality of known Pentagon surveillance against US citizens. In addition, it remains unclear why the data was accumulated, presenting the overwhelming likelihood that the majority of posts captured originate from law-abiding civilians across the world.

With evidence that the software employed to create these data stores was built and operated by an apparently defunct private-sector government contractor named VendorX, this cloud leak is a striking illustration of just how damaging third-party vendor risk can be, capable of affecting even the highest echelons of the Pentagon. The poor CSTAR cyber risk scores of CENTCOM and PACOM - 542 and 409, respectively, out of a maximum of 950 - are a further indication that even the most sensitive intelligence organizations are not immune to sizable cyber risk. Finally, the collection of billions of internet posts in several unsecured data repositories raises further questions about online privacy, as well as about the right to freely express your beliefs online.

The Discovery

On September 6th, 2017, UpGuard Director of Cyber Risk Research Chris Vickery discovered three Amazon Web Services S3 cloud storage buckets configured to allow any AWS global authenticated user to browse and download the contents; AWS accounts of this type can be acquired with a free sign-up. The buckets’ AWS subdomain names - “centcom-backup,” “centcom-archive,” and “pacom-archive” - provide an immediate indication of the data repositories’ significance. CENTCOM refers to the US Central Command, based in Tampa, Fla. and responsible for US military operations from East Africa to Central Asia, including the Iraq and Afghan Wars. PACOM is the US Pacific Command, headquartered in Aiea, HI and covering East, South, and Southeast Asia, as well as Australia and Pacific Oceania.
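The kind of misconfiguration described above can be illustrated with a minimal sketch. It assumes an access control list shaped like the response of the S3 GetBucketAcl operation (a "Grants" list of grantee/permission pairs) and flags any grant made to the predefined "AuthenticatedUsers" or "AllUsers" groups. The function name `risky_grants` and the sample ACL are hypothetical and written for illustration; they are not taken from the exposed buckets.

```python
# URIs of the two predefined S3 grantee groups that reach beyond a single account.
AUTHENTICATED_USERS = "http://acs.amazonaws.com/groups/global/AuthenticatedUsers"
ALL_USERS = "http://acs.amazonaws.com/groups/global/AllUsers"


def risky_grants(acl):
    """Return (group, permission) pairs for grants made to broad S3 groups.

    `acl` is assumed to be a dict shaped like an S3 GetBucketAcl response,
    i.e. {"Grants": [{"Grantee": {...}, "Permission": "READ"}, ...]}.
    """
    findings = []
    for grant in acl.get("Grants", []):
        grantee = grant.get("Grantee", {})
        if grantee.get("Type") != "Group":
            continue  # grants to individual accounts are not flagged here
        uri = grantee.get("URI")
        if uri == AUTHENTICATED_USERS:
            findings.append(("AuthenticatedUsers", grant.get("Permission")))
        elif uri == ALL_USERS:
            findings.append(("AllUsers", grant.get("Permission")))
    return findings


# Hypothetical ACL for illustration only.
example_acl = {
    "Grants": [
        {"Grantee": {"Type": "CanonicalUser", "ID": "owner-id-placeholder"},
         "Permission": "FULL_CONTROL"},
        {"Grantee": {"Type": "Group", "URI": AUTHENTICATED_USERS},
         "Permission": "READ"},
    ]
}

print(risky_grants(example_acl))
```

In this sketch, a READ grant to the "AuthenticatedUsers" group is what would let anyone with a free AWS account list a bucket's contents; an ACL that contains only the owner's grant returns an empty list.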

There are further clues as to the provenance of these data stores. A “Settings” table in the bucket “centcom-backup” indicates the software was operated by employees of a company called VendorX, complete with a listing of the details of a number of developers with access. While public information about this firm is scant, an internet search reveals multiple individuals who worked for VendorX describing work building Outpost for CENTCOM and the Defense Department:

Descriptions of VendorX’s work on Outpost, via employee LinkedIn pages.

This external reference to “Outpost” as a Pentagon social engineering effort built by VendorX appears to be corroborated by the contents of “centcom-backup,” which, besides the references to VendorX in the “Settings” table, contains a folder titled “outpost.” Within this folder are the development configurations and API for Outpost, and while this content’s exact relationship to the “Outpost” program described on former employees' profiles remains unclear, some indication of its purpose may be provided by a number of very large compressed files also within the bucket. Decompressed, these files are revealed to contain indexes built with Lucene, a search engine library used to look for search terms throughout massive amounts of data, including keywords, partial words, and combinations of words, in a number of different languages. These Lucene indexes, which are optimized to interact with Elasticsearch, seem to parse internet content similar to that contained in the other buckets.
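To give a sense of how an index of this kind makes bulk text searchable, here is a minimal sketch of an inverted index in plain Python. It is a toy illustration of the general technique, not Lucene's actual implementation or file format; the function names and the sample posts are invented for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())


def build_index(posts):
    """Map each term to the set of post ids that contain it."""
    index = defaultdict(set)
    for post_id, text in posts.items():
        for term in tokenize(text):
            index[term].add(post_id)
    return index


def search(index, *terms):
    """Return ids of posts matching every term; a trailing '*' means prefix match."""
    result = None
    for term in terms:
        term = term.lower()
        if term.endswith("*"):
            prefix = term[:-1]
            matches = set()
            for indexed_term, ids in index.items():
                if indexed_term.startswith(prefix):
                    matches |= ids
        else:
            matches = set(index.get(term, set()))
        # Combine terms with AND semantics.
        result = matches if result is None else result & matches
    return sorted(result or [])


# Invented sample posts for illustration only.
posts = {
    1: "Match report from the weekend football league",
    2: "New patch notes for the video game forum",
    3: "Football video highlights",
}
index = build_index(posts)

print(search(index, "football"))           # keyword
print(search(index, "foot*"))              # partial word
print(search(index, "video", "football"))  # combination of words
```

Because each term points directly at the posts that contain it, a lookup does not need to rescan the raw text, which is what makes searching very large collections practical.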

Taken together, this disparate collection of data appears to constitute an ingestion engine for the bulk collection of internet posts - organizing a mass quantity of data into a searchable form. The former employee's reference to “high-risk youth in unstable regions of the world” is further corroborated by an examination of another folder within “centcom-backup.”

This folder, titled “scraped,” contains an enormous number of XML files consisting of internet content “scraped” from the public internet from 2009 to 2015; the other CENTCOM bucket, “centcom-archive,” would be found to contain more such data, collected from 2009 to the present day. With a number of information fields describing the origins, nature, contents, and web address of the post, thousands of examples of such scraped content are listed in plaintext - a smaller example of the massive stores of such data contained in the other two buckets.

Screencaps of sample posts, with information fields visible.

Also contained in “scraped,” however, is a folder titled “Coral,” which likely refers to the US Army’s “Coral Reef” intelligence software. This folder contains a directory named “INGEST” that contains all the posts scraped and held in the “centcom-backup” bucket. The Coral Reef program “allows users of intelligence to better understand relationships between persons of interest” as a component of the Distributed Common Ground System-Army (DCGS-A) intelligence suite, described as “the Army's primary system for the posting of data, processing of information, and dissemination to all components and echelons of intelligence, surveillance and reconnaissance information about the threats, weather, and terrain.” Such a focus on gathering intelligence about “persons of interest” would be even more clear-cut in the other two buckets, starting with “centcom-archive.”

The bucket “centcom-archive” contains more scraped internet posts stored in the same XML text file format as seen in “centcom-backup,” only on a much larger scale: conservatively, at least 1.8 billion such posts are stored here. This vast repository ingested content from a broad array of webpages; while Facebook is a popular, recurring host, everything from soccer discussion groups to video game forums is a source for scraped web posts. The posts themselves are in many different languages, but with an emphasis on Arabic, Farsi (spoken in Iran and Afghanistan), and a number of Central and South Asian dialects spoken in Afghanistan and Pakistan. The most recent indexed files were created in August 2017, right before UpGuard’s discovery, and consist of posts collected in February 2017. Not present are any Lucene index files of the sort seen in “centcom-backup” - the contents of this bucket are purely the input (or, perhaps, also the output) of an internet-scouring machine. There are few indications as to the level of importance afforded to these posts.

Given the CENTCOM buckets’ focus on the collection and organization of millions of internet posts, largely from the Middle East and South Asia - a focus that would certainly also be of interest to a program like Coral Reef - it is perhaps unsurprising to see hints at why some of these posts would be of significance. Arabic posts criticizing or mocking ISIS, posted to Facebook pages for Iraqi anti-jihadi groups, or Pashto language comments made on the official Facebook page of Pakistani politician Imran Khan, who has drawn scrutiny from both the Taliban and the US government, give some indication of content that might be of interest to CENTCOM in its prosecution of regional wars and its operations against Islamic extremists.

The bucket “pacom-archive” is very similar in contents and structure to “centcom-archive,” but skews toward Southeast and East Asian posts, as well as some by Australians. The buckets “centcom-archive” and “pacom-archive” appear to store raw ingested (or even possibly raw egested) internet content on a massive scale, perhaps to be run through text extraction programming. This data’s relationship to the searchable Lucene indexes discovered in “centcom-backup” remains unclear. Taken together, however, the data suggests that there is well-crafted interplay between the “Coral” social media and commentary scraping project, an ingestion engine dubbed “Thor,” and a public-influence initiative referred to as “Outpost.”

Given how massive these data stores are, it is difficult to state exactly how or why these particular posts were collected over the course of almost a decade. Even a cursory search reveals a number of foreign-sourced posts that appear entirely benign, with no apparent ties to areas of concern for US intelligence agencies, as well as posts that originate from American citizens, including a vast quantity of Facebook and Twitter posts, some stating political opinions. Among the details collected are the web addresses of targeted posts, as well as other background details on the authors, which provide further confirmation that the posts originate from American citizens.

It remains unclear on what basis this data was collected.

What is more clear is the significance of these data repositories’ contents. The collection of public internet posts in massive repositories by the Defense Department for unclear reasons is one matter; the lack of care taken to secure them is another. The CENTCOM and PACOM CSTAR cyber risk scores of 542 and 409 provide some indication of gaps in the armor of two major military organizations’ digital defenses. The possible misuse or exploitation of this data, perhaps against internet users in foreign countries wracked by civil violence, is a troubling possibility, as is the presence of US citizens’ internet content in buckets associated with US military intelligence operations. The Posse Comitatus Act restricts the military from “being used as a tool for law enforcement, except in situations of explicit national emergency based on express authorization from Congress,” but as seen in recent years, this separation has been eroded.

Despite all of this, the same issues of cyber risk driving insecurity across the landscape are present here, too. A simple permission settings change would have meant the difference between these data repositories being revealed to the wider internet and remaining secured. If critical information of a highly sensitive nature cannot be secured by the government - or by third-party vendors entrusted with the information - the consequences will affect not only whichever government organizations and contractors are responsible, but anybody whose information or internet posts were targeted through this program, potentially resulting in unfair bias or unwarranted actions against the post creator.