Leaking Toolkit Description

This is a detailed description of a flexible and extensible leaking and disclosure toolkit for easy release of useful information. This toolkit can be used by independent leaking organizations, investigative journalists, government free information initiatives, activist groups, or many other types of organizations.

Transparency is necessary for citizens to be informed and educated about the actions of their government and other institutions. Leaking organizations like WikiLeaks are essential because they represent an independent check on government secrecy and authority, a proactive alternative to freedom of information policies, and a safe and easy outlet for whistleblowers. While leaking organizations are a great way of promoting transparency, they also have many flaws and challenges in their approach. It is difficult to regularly redact, fact check, and summarize documents for release, especially in large releases. This means that in some releases, leaking organizations cannot live up to their own goals and policies. Even with good summaries and background information, it can be hard to get people to read documents and spend enough time to understand the release, let alone trigger reform as a result of a leak. Finally, leaking organizations naturally arbitrate access to information and so can become powerful institutions that use this authority to release or frame information for their own gain.

A leaking toolkit will make leaking easier, safer, and more effective while limiting and distributing the power of leaking organizations. Some parts of the process, like redaction and summarization, could be mostly automated using machine learning and natural language processing algorithms. This would drastically reduce the time it takes to run a leaking organization and thus make it feasible for more people to do it well. Additionally, the toolkit will be flexible, open, and easily extensible so it can apply to many types of documents and situations. This would eliminate the need to create new tools to analyze a particular set of documents. Graphical and narrative timeline analysis tools that connect to outside sources, with content added both automatically and by hand, could help readers better understand the document sets. Crowdsourcing that incorporates gamification principles could be used to tweak automated output, check facts, and perform other difficult-to-automate tasks. Crowdsourcing is also beneficial because involving more people leads more directly to discussion of the released information and to mobilization around the wrongdoing it reveals. It also distributes the power of leaking organizations.

I am currently building and testing an initial set of tools with leaking organizations. I will then continue to expand and improve the toolkit and test it with other institutions as I study its effectiveness. To start this process, I have made LeaksWiki, a wiki to study the leaking process and identify effective strategies and areas that need improvement. My tools and others will also be tested, with the results posted on LeaksWiki. LeaksWiki and my research on it form the basis for the direction and approach taken in my tools. I will work with leaking organizations to get feedback on tool designs, test the tools, and tweak my approach. Some of the tools I release will be built from scratch while others will be proven open source tools that I optimize for leaking and my toolkit. After I have a solid initial toolkit, I will work with institutions with freedom of information policies and other organizations.

My hope is that this toolkit will help build a culture of transparency and accountability enforced by decentralized networks of informed citizens. Such a culture would drastically improve society by encouraging civic engagement and critical thinking in citizens and identifying the flaws of institutions.

What follows is a description of the toolkit and its modules. This document is constantly under revision and I will add new developments, so check back!

The toolkit will be browser-based at first and flexible. It will have a base system with lots of modules that organizations can add to their toolkit installation. Each module will integrate with the toolkit in a manner similar to Firefox addons or WordPress plugins. Additionally, every module will function on its own without the rest of the toolkit and have an underlying API to make it easily adaptable for other purposes.
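As a rough illustration of the plugin idea, the sketch below shows a base system that modules register with, while each module remains an ordinary function that also works on its own. The names ToolkitBase, install, run, and word_count_module are hypothetical and not part of any existing toolkit code.

```python
from typing import Callable, Dict


class ToolkitBase:
    """Hypothetical base system: keeps a registry of installed modules."""

    def __init__(self) -> None:
        self.modules: Dict[str, Callable[[str], object]] = {}

    def install(self, name: str, handler: Callable[[str], object]) -> None:
        # Installing a module only records it under a name.
        self.modules[name] = handler

    def run(self, name: str, document: str) -> object:
        # Dispatch a document to an installed module by name.
        return self.modules[name](document)


def word_count_module(document: str) -> int:
    """Stand-in module; usable directly, with or without the base."""
    return len(document.split())


base = ToolkitBase()
base.install("word_count", word_count_module)
print(base.run("word_count", "three short words"))
```

Keeping each module a plain callable is one way to satisfy the goal that a module functions without the rest of the toolkit; a real design would also need settings, permissions, and an API layer.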

The toolkit base is the bare system available without any modules. Most of the toolkit is based on the modules, but the toolkit base determines the structure of the installation (centralized or distributed) and provides ways for users to register, administrators to add modules, and administrators to configure settings and access privileges. It also allows administrators to set up a web front-end at a .onion, on the open web, or both. While most of the toolkit base gives everyone the same options, administrators must choose one of two structures for the installation, centralized or distributed.

The centralized toolkit installation keeps the data and installation details only on the server the toolkit is installed on. Users will be able to access the information, but no mirror servers or additional nodes can be automatically created.

The distributed toolkit allows others to use their servers to host encrypted copies of some or all of the data. This data may just be an encrypted backup/insurance file or it could be where the server pulls the data from when users view it. One model for this could be a system based on Tahoe-LAFS or a similar system. In either case, it will be harder to take the site down entirely.

Setups that use the distributed toolkit could also allow mirror sites with the same data and the same or similar set of modules and settings. Instead of full mirrors, they could also choose to have additional nodes dedicated to specific details or locations that share data with a central node or nodes with similar topics. This results in more of a leaking network.

Generally, administrators will also want to install at least one of the core modules, which give the toolkit its basic functionality. Other modules interface with one or both of these core modules to help users complete tasks and provide additional tools. There are two main types of core modules, focused and collaborative. Both of these are highly customizable with a variety of options, so the descriptions below cover the most important parts of the basic installation with no customization.

The focused core module is a system for tracking and browsing documents that is only accessible to approved users. This works well with the current processes of most governments, media organizations, and centralized leaking organizations.

In the focused system, documents are uploaded directly by users or through an additional submission module. These documents can then be tracked and managed through an integrated document tracking system. The tracking system displays information about the document and the status of the document. The document statuses are determined by administrators and may represent steps in the disclosure process, like redaction, summarization, and release.

The users of the system can then be assigned or choose to complete these steps. After completing the step, the user changes the status of the document and uploads any information from that step, like a redacted version of the document or links to additional information. This information is then displayed along with the document in the system. Administrators can control what information each user sees.
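The tracking workflow described above could be modeled roughly as follows. This is a minimal sketch that assumes statuses form an ordered list chosen by administrators; TrackedDocument, complete_step, and the example status names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TrackedDocument:
    title: str
    statuses: List[str]                 # ordered steps set by administrators
    step: int = 0                       # index of the current status
    attachments: List[str] = field(default_factory=list)

    @property
    def status(self) -> str:
        return self.statuses[self.step]

    def complete_step(self, attachment: Optional[str] = None) -> str:
        """Record any output of the current step, then move to the next status."""
        if attachment is not None:
            self.attachments.append(attachment)
        if self.step < len(self.statuses) - 1:
            self.step += 1
        return self.status


doc = TrackedDocument("memo.pdf", ["received", "redaction", "summarization", "released"])
doc.complete_step()                      # received -> redaction
doc.complete_step("memo-redacted.pdf")   # redaction -> summarization
print(doc.status, doc.attachments)
```

Per-user visibility of this information, which administrators control, is left out of the sketch.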

The focused core module also allows people to browse documents in a few ways. The first way is through a search page with a variety of options. The second option is to view all documents with a particular tag. Documents may be tagged automatically or manually by users. The third way to view documents is through a complete list of documents clustered by file type, size, or other attributes. The fourth viewing option is to look at a random document.

The collaborative core module allows teams of people to work together to investigate certain issues and release leaked information. By default, participation in toolkits using the collaborative core module is open. The collaborative system works best for activist groups, citizen journalists, and distributed leaking organizations. It could also be used for additional investigation of the releases of more centralized and closed organizations.

There has been a huge movement for transparency and open government. This has led to governments releasing data sets while programmers and designers make pretty data visualizations so people can better use the data. Unfortunately, the transparency process often stops there. No one checks this data, finds additional information, or actually uses it to advocate for reform. This is unfortunate because the data may be wrong or incomplete, making the data visualizations useless. Even if the data is correct, however, there is little benefit to it being public if it does not lead to positive change.

One possible solution to this is for people with different skill sets to work together to investigate issues. The release of government data is a part of this, as is data visualization, but people also need to file Freedom of Information Act (FOIA) requests, interview the people involved, blog and write articles about the findings, and advocate for reform. With the right group of people who care about the issue at hand, investigative strike teams can lead to meaningful understanding and reform of the government or other institutions.

The collaborative core module is an attempt to provide a platform for people to form investigative strike teams around leaked or released information. Users can register to participate in the investigative process. When they register, users select their skills and the issues they care about.

Anyone can then add a topic for investigation. For each topic, people then can add questions to answer and/or reform goals to achieve. Finally, each question or goal has actionable tasks needed to answer the question or achieve the goal. Each of these tasks is tagged with the skill needed to complete it. Some of the tasks may depend on others, so tasks that depend upon another task are grouped with the task they depend upon.

Each topic, question or goal, and task is also tagged with areas or issues to which it relates. These issue-based tags and the skill tags on each task determine which users may want to complete which tasks or follow certain topics, questions, or goals. The module then displays tasks appropriate for each user. This means that every user will know exactly what they personally can do to help. The users can then choose which tasks they want to work on and assign themselves to the task. If the user does not complete the task after a certain time period, he or she will be unassigned from the task so other users can work on it.
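One possible way to express the matching and unassignment rules is sketched below. It assumes a task suits a user when the user has the required skill and shares at least one issue tag with the task, and it assumes a 14-day time limit; both rules, and the names User, Task, tasks_for, assign, and release_stale, are illustrative only.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Set


@dataclass
class User:
    name: str
    skills: Set[str]
    issues: Set[str]


@dataclass
class Task:
    title: str
    skill: str                          # skill needed to complete the task
    issues: Set[str]                    # issue tags the task relates to
    assignee: Optional[str] = None
    assigned_at: Optional[datetime] = None


def tasks_for(user: User, tasks: List[Task]) -> List[Task]:
    """Unassigned tasks matching the user's skills and at least one issue."""
    return [t for t in tasks
            if t.assignee is None and t.skill in user.skills and t.issues & user.issues]


def assign(task: Task, user: User, now: datetime) -> None:
    task.assignee, task.assigned_at = user.name, now


def release_stale(tasks: List[Task], now: datetime,
                  limit: timedelta = timedelta(days=14)) -> None:
    """Unassign tasks held longer than the (assumed) time limit."""
    for t in tasks:
        if t.assignee is not None and now - t.assigned_at > limit:
            t.assignee, t.assigned_at = None, None
```

An installation might prefer a looser rule, such as showing tasks that match on skill alone and ranking them by issue overlap.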

After completing the task, the assigned user can post the findings for each task through file uploads and a text box or integrated modules related to the task. Users who were not assigned to the task may also edit or add to the findings. Every topic, question or goal, and task also has a discussion thread, so users can talk about findings and determine best next steps.

Each topic, question or goal, and task can be upvoted or downvoted based on how important users think it is to the investigation. This vote count determines the place of the item on the list. Completed items are generally displayed below incomplete items. A task is complete when a user completes it, and a question or goal is complete when all the tasks inside it are complete.
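The ordering and completion rules can be sketched in a few lines. The class and function names here (Item, Question, ranked) are hypothetical, and the sketch assumes a single net vote count per item.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    title: str
    votes: int = 0          # net of upvotes and downvotes
    done: bool = False


@dataclass
class Question:
    title: str
    tasks: List[Item] = field(default_factory=list)
    votes: int = 0

    @property
    def done(self) -> bool:
        # Complete only when it has tasks and every task is complete.
        return bool(self.tasks) and all(t.done for t in self.tasks)


def ranked(items):
    """Incomplete items first, then by vote count from highest to lowest."""
    return sorted(items, key=lambda i: (i.done, -i.votes))
```

Because both Item and Question expose done and votes, the same ranked function can order tasks, questions, or goals.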

In a way, this system is a more involved alternative to signing a survey or retweeting a link. Instead of just educating the viewer, it provides a clear path to understanding and change where everyone knows tasks of varying difficulty they can complete to help.

In addition to the toolkit base and the two core modules, there are additional modules to help people complete certain tasks or automate parts of the process. Below are descriptions of some of the planned modules. Some of these I will make from scratch while others can be modifications of existing open source projects.

One of the key parts of many leaking or disclosure organizations is a submission system for whistleblowers to send in documents. Some receive documents by mail or email, but many organizations need a browser-based submission framework that protects the anonymity of their sources.

GlobaLeaks is one such secure and anonymous submission framework. It is highly customizable and integrates a variety of existing security tools. For the secure submission module, it probably makes the most sense to use a modified version of GlobaLeaks.

After documents are received, sometimes names and other identifying information need to be removed to protect people. The exact standards of what to remove vary by organization, but many redact documents by hand and could benefit from some redaction being done automatically to speed up the process.

Automatic redaction is probably best accomplished by removing all words that are not dictionary words. This means names of people and places will be redacted by default. Accuracy of what to redact could also be improved with language processing and machine learning algorithms.

Redaction is tricky though and programs may not do it correctly. Ideally, someone will also manually review all documents to remove unidentified names, unredact details they want in the release, and redact other information they do not want to release. Alternatively, the actual redaction could happen during the review and the module would only highlight suggestions for redaction. This check will also help train the learning algorithms, thus increasing their future accuracy.
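A minimal sketch of the dictionary-based approach, covering both automatic redaction and the suggestion-only mode, might look like this. The tiny word set stands in for a real dictionary, and the function names and the "[REDACTED]" marker are assumptions made for illustration.

```python
import re
from typing import List, Set, Tuple

WORD = re.compile(r"[A-Za-z']+")


def suggest_redactions(text: str, dictionary: Set[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, word) for every word not found in the dictionary."""
    return [(m.start(), m.end(), m.group())
            for m in WORD.finditer(text)
            if m.group().lower() not in dictionary]


def redact(text: str, dictionary: Set[str], mask: str = "[REDACTED]") -> str:
    """Replace every non-dictionary word with the mask."""
    return WORD.sub(
        lambda m: m.group() if m.group().lower() in dictionary else mask,
        text,
    )


# Stand-in dictionary; a real installation would load a full word list.
dictionary = {"the", "memo", "was", "sent", "by", "to"}
text = "The memo was sent by Alice to Bob."
print(redact(text, dictionary))
# The memo was sent by [REDACTED] to [REDACTED].
```

This approach misses names that are also dictionary words (for example "Rose" or "Bill") and flags unusual but harmless terms, which is one reason the manual review step matters.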

Names and information in the document are not the only things that can reveal the identity of the whistleblower or others involved. Files can contain metadata with details on the creator of the file, where it came from, etc. A number of tools exist for removing metadata for different file types, but many organizations still remove metadata by hand.

A module with metadata removal tools for most file types could remove metadata with the click of a button and perhaps be set to do so on submission.

While it would be ideal if everyone could read the released documents in full, the documents are often too long for many people to read them. Some organizations release summaries of the documents to deal with this issue. Unfortunately, these summaries can take a while to write and significantly delay the release.

A module could automatically summarize documents. This would help both the readers and the initial reviewers of the document. There are already many different approaches for automatic summarization. I need to conduct a full review of these to determine which will work best. The module will likely have a variety of summarization options though, as each option may work best on a different type of document.
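To make the idea concrete, here is a sketch of one simple extractive technique: score each sentence by the average frequency of its words in the whole document and keep the top few in their original order. It is only one of many possible approaches and is not necessarily the one the module would use; the function name and the rule of ignoring words of three letters or fewer are assumptions.

```python
import re
from collections import Counter


def summarize(text: str, max_sentences: int = 2) -> str:
    """Keep the sentences whose words are most frequent in the document."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    # Crude stop-word filter: ignore words of three letters or fewer.
    freq = Counter(w for w in words if len(w) > 3)

    def score(sentence: str) -> float:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    top = sorted(range(len(sentences)),
                 key=lambda i: score(sentences[i]), reverse=True)[:max_sentences]
    # Emit the chosen sentences in document order.
    return " ".join(sentences[i] for i in sorted(top))
```

For example, summarize("The budget report was leaked. The budget report shows hidden budget spending. It rained.", 1) keeps the second sentence, since "budget" and "report" dominate the word counts.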

If a toolkit installation uses the automatic summarization module, the users can still replace, edit, or add to the automatically constructed summary. For some documents, it may be overkill to use this module in the first place.

Sometimes restricted documents are released multiple times through FOIA or other means and each of these released documents has different parts redacted. Alexa O’Brien suggested I make a tool to cross-reference and combine these documents into the most complete version possible.

To use this module, someone could upload a copy of a restricted document with redactions. A bot could then retrieve any additional copies of the document from other websites that release documents like this, or someone could upload other copies. All of these copies would then be transcribed and checked against each other or otherwise overlaid until there are as few redactions as possible.
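The overlay step could work roughly as sketched below. The sketch assumes the hard parts are already done: every copy has been transcribed into a list of tokens, the copies are aligned position by position, and each redaction appears as a placeholder token. The placeholder and the name merge_copies are hypothetical.

```python
from typing import List

REDACTED = "[REDACTED]"


def merge_copies(copies: List[List[str]]) -> List[str]:
    """Overlay aligned copies, preferring any visible token over a redaction."""
    if len({len(c) for c in copies}) != 1:
        raise ValueError("copies must be non-empty and aligned to the same length")
    merged = []
    for tokens in zip(*copies):
        visible = [t for t in tokens if t != REDACTED]
        merged.append(visible[0] if visible else REDACTED)
    return merged


copy_a = "Agent [REDACTED] met the source in Vienna".split()
copy_b = "Agent Smith met the source in [REDACTED]".split()
print(" ".join(merge_copies([copy_a, copy_b])))
# Agent Smith met the source in Vienna
```

Real releases rarely align this cleanly, since one redaction can hide several words, so a practical module would need a sequence alignment step and a way to flag positions where copies disagree.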

Leaking organizations and activists often need to conduct an investigation of a situation or analyze documents. In complicated situations, part of this investigation may involve understanding how different events and documents fit together. Unfortunately, there is so much information available today that it can be difficult to collect and understand it all.

One way to manage this is a collaborative editable timeline module used to track documents and events. In the Eventline module, each event is also a timeline with sub-eventlines. Anyone can add eventlines with descriptions. On some eventlines, the sub-eventlines could be pre-populated based on closely related Wikipedia articles as well.

Each eventline has a list of links to relevant articles, documents, and other information as a supplement to the description, along with a place to discuss the evidence. Each piece of evidence could also carry indications of the position it represents or the information it provides.

Even with sub-eventlines, the view may still be cluttered with events. To deal with this issue, each eventline can be rated by users on its importance to the parent eventline (and some may have multiple parents). Viewers can then change the level of detail displayed and the time range of events shown for each eventline.
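The nested structure and the detail and time-range controls could be modeled along these lines. For simplicity the sketch assumes each eventline has one date and one importance rating, so it does not cover eventlines with multiple parents; the names Eventline and visible are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class Eventline:
    title: str
    when: date
    importance: float = 0.0             # e.g. average user rating for this parent
    children: List["Eventline"] = field(default_factory=list)


def visible(parent: Eventline, min_importance: float,
            start: date, end: date) -> List[Eventline]:
    """Sub-eventlines to draw at the chosen level of detail and time range."""
    return [c for c in parent.children
            if c.importance >= min_importance and start <= c.when <= end]
```

Supporting multiple parents would likely mean storing the importance rating on the parent-child link rather than on the eventline itself.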

In analysis of sets of documents, it can be helpful to view the connections between documents with similar contents or origins. The Associated Whistleblowing Press (AWP) has a tool that does this for WikiLeaks cables. A module based on AWP’s graph tool, but adapted for more document sets could be useful. This module should also allow users to draw their own connections between documents, add notes to the connections, and add other documents to the graph.

Information-Collecting Bots

Some additional research does not need to be done manually. To save time, bots can gather additional documents with content related to documents already uploaded. These bots can gather sources generally or specifically for modules. They could also pre-populate background information and Eventline descriptions with information from Wikipedia. Depending on configuration settings, this bot-collected information could be added to modules and release pages directly or approved by a user and then added to the pages.

Translation

Ideally, leaked documents and supporting information will be translated into several languages. A translation module for a combination of automatic and crowd-sourced translation could help with this.

Documents and information could first be translated using Google Translate or a similar system. For some languages this is quite accurate, but for others it is nearly useless. The rough translation could then be posted to a collaboratively editable platform for tweaking and rewriting. One good tool for this could be Wirite, software for collaborative document creation where people can vote on edits.

After any document processing or investigations are complete, a release module allows users to post documents with all supplementary information and findings on central pages. These pages may be for a single document or a set of documents.

Each release page includes a list of documents with summaries, links to additional sources, links to articles about the release, and links or embeds to other modules with additional details or discussions about the release. The pages may also be collaboratively editable or in wiki format, in which case a master copy will be kept between all modules with the same supplementary data.

The release system can be integrated with the tracking or task system for the automatic addition of information after completion of processing and investigation tasks. It also provides search functionality similar to the core modules.

Hopefully users participate in the investigations and releases because they care about the issues and transparency. Even if this is the case, a point system may still encourage users to be more involved.

A points module could integrate with other modules to track which tasks each user completed and assign points for each of those tasks. Some of these points could be specific to the type of tasks, skill, or topic. With enough points in a specific area, the user could be marked as an expert in that area. Administrators could configure the toolkit to allow experts to get priority on claiming certain tasks, weight their up-votes on tasks or questions related to their expertise, etc. This system would also provide an incentive for users to learn more about topics of interest and put that knowledge to use.
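A sketch of how area-specific points, expert status, and weighted votes might fit together is shown below. The threshold of 100 points, the doubled vote weight, and the PointsLedger name are all assumed values that administrators would presumably configure.

```python
from collections import defaultdict
from typing import Dict


class PointsLedger:
    """Hypothetical per-user, per-area point tracker."""

    def __init__(self, expert_threshold: int = 100) -> None:
        self.expert_threshold = expert_threshold
        self.points: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))

    def award(self, user: str, area: str, amount: int) -> None:
        self.points[user][area] += amount

    def is_expert(self, user: str, area: str) -> bool:
        # .get avoids creating empty entries for unknown users.
        return self.points.get(user, {}).get(area, 0) >= self.expert_threshold

    def vote_weight(self, user: str, area: str, expert_weight: int = 2) -> int:
        """Experts' votes count more in their own area; everyone else counts once."""
        return expert_weight if self.is_expert(user, area) else 1
```

Priority on claiming tasks could be built on the same is_expert check.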

The modules above and the ones I make are not the only ones organizations can use. Anyone can make and use their own modules. Additionally, the toolkit base will provide a place for people to upload their modules for others to use.

Ideally, modules provided for others to use will meet a few standards. First, they should be free and open source. Second, they should have an API for accessing data and some functions of the module. Third, the developers need to specify how the module integrates with the base system, core modules, and/or other modules. Fourth, where possible, modules should run on the bare base system without other modules. Fifth, the module should be somewhat customizable with settings or ways to easily modify it. Sixth, and most importantly, the module should work as described. These standards are ideals and not requirements. The more of these a module meets, the better, but it will not be possible to meet all the standards in all cases.

Modules available from the base toolkit will be discoverable through a search function. Each module will be installable at the press of a button. A guide and API for the base toolkit will help developers create modules that work with the toolkit. The installation page for each module will have a description, ratings, and reviews to help people decide whether they wish to install it.

All of this is just a start to the toolkit design. Creating the toolkit itself will take years. For now, I am starting with the centralized version and collaborative core module. I will go from there as I extend the toolkit and add modules.

If you have suggestions, questions, or comments or want to test the toolkit, please contact me. I am always open to discuss the toolkit and the future of transparency. I also need feedback to make this work.