1) Mimi and I are constantly discussing what it means to be a nonprofit organization, whether it’s a legal definition or a philosophical one. We both agree, though, that our current system is pretty narrow, which is why it’s interesting to see states considering new kinds of entities, like the low-profit LLC.

2) This graphic of who’s online and what they’re doing isn’t going to tell you anything you don’t already know, but I like the way it breaks down the different ways to be online. (via FlowingData) At CDP, as we work on creating a community for the datatrust, we want to create avenues for different levels of participation. I’d be curious to see this updated for 2010, and to see if and how people transition from being passive users to more active users of the internet.

3) CDT has filed a complaint against Spokeo, a data broker, alleging, “Consumers have no access to the data underlying Spokeo’s conclusions, are not informed of adverse determinations based on that data, and have no opportunity to learn who has accessed their profiles.” We’ve been wondering when people would start to look at data businesses, which have even less reason to care about individuals’ privacy than customer-facing businesses like Google and Facebook. We’re interested to see what happens.

4) The Data Portability Project is advocating for every site to have a Portability Policy that states clearly what data visitors can take in and take out. The organization believes “a lot more economic value could be created if sites realized the opportunity of an Internet whose sites do not put borders around people’s data.” (via Techcrunch) It definitely makes sense to create standards, though I do wonder how standards and icons like the ones they propose would be useful to the average internet user.

The Common Data Project is looking for a partner organization to develop and test a pilot version of the datatrust: a technology platform for collecting, sharing and disclosing sensitive information that provides a new way to guarantee privacy.

Funders are increasingly interested in developing ways for nonprofit organizations to make more use of data and make their data more public. We would like to apply with a partner organization for a handful of promising funding opportunities.

We at CDP have developed technology and expertise that would enable a partner organization to:

Collect sensitive data from members, donors and other stakeholders in a safe and responsible manner;

Open data to the public to answer policy questions, be more transparent and accountable, and inform public discourse.

We are looking for an organization that is both passionate about its mission and deeply invested in the value of open data to provide us with a targeted issue to address.

We are especially interested in working with data that is currently inaccessible or locked down for privacy reasons.

We can imagine, in particular, a few different scenarios in which an organization could use the datatrust in interesting ways, but ultimately, we are looking to work out a specific scenario together.

A data exchange to share sensitive information between members.

An advocacy tool for soliciting private information from members so that organizational policy positions can be backed up with hard data.

A way to share sensitive data with allies in a way that doesn’t violate individual privacy.

If you’re interested in learning more about working with us, please contact Alex Selkirk at alex [dot] selkirk [at] commondataproject [dot] org.

Our Mission

We live in a world where data is obviously valuable — companies make millions from data, nonprofits seek new ways to be more accountable, advocates push governments to make their data open. But even as more data becomes accessible, even more valuable data remains locked up and unavailable to researchers, nonprofit organizations, businesses, and the general public.

We are working on creating a datatrust, a nonprofit data bank, that would incorporate new technologies for open data and new standards for collecting and sharing personal data.

Our Work

We’ve been working in partnership with Shan Gao Ma (SGM), a consultancy started by CDP founder Alex Selkirk that specializes in large-scale data collection systems, to develop a prototype of the datatrust. The datatrust is a new technology platform that allows the release of sensitive data in “raw form” to the public with a measurable and therefore enforceable privacy guarantee.

In addition to this real privacy guarantee, the datatrust eliminates the need to “scrub” data before it’s released. Right now, any organization that wants to release sensitive data has to spend a lot of time scrubbing and de-identifying it, using techniques that are frankly inexact and possibly ineffective. Because that slow, manual scrubbing step disappears, the datatrust could make real-time data possible.

Furthermore, the data that is released can be accessed in flexible, creative ways. Right now, sensitive data is aggregated and released as statistics. A public health official may have access to data that shows how many people are “obese” in a county, but she can’t “ask” how many people are “obese” within a 10-mile radius of a McDonald’s.
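As a purely illustrative sketch (the records, coordinates, and `haversine_miles` helper are all invented here; a real datatrust query would also pass through the privacy guarantee), a direct query over raw records could express exactly that kind of question:

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in miles."""
    r = 3959  # Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Invented example records: (latitude, longitude, is_obese)
records = [
    (40.71, -74.00, True),
    (40.75, -73.99, False),
    (41.50, -73.50, True),  # well outside the radius
]

mcdonalds = (40.72, -74.01)  # hypothetical restaurant location

# "How many obese individuals live within 10 miles of this location?"
count = sum(
    1 for lat, lon, obese in records
    if obese and haversine_miles(lat, lon, *mcdonalds) <= 10
)
```

The point is not the geometry but the interface: the official poses an arbitrary predicate against raw rows instead of choosing from pre-computed county totals.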

We’ve also started outlining the governance questions we have to answer as we move forward, including who builds the technology, who governs the datatrust, and how we will monitor and prevent the datatrust from veering from its mission. We know that this is an organization that must be transparent if it is to be trusted, and we are working on creating the kind of infrastructure that will make transparency inevitable.

We’ve also started researching the issues we need to address to develop our own privacy policy. In particular, we’ve been working on figuring out how we will deal with government requests for information. We did some research into existing privacy law, both constitutional and statutory, but in many ways, we’ve found more questions than answers. We’re interested in watching the progress of the Digital Due Process coalition as they work on reforming the Electronic Communications Privacy Act, but we anticipate that the datatrust will have to deal with issues that are more complex than an individual’s expectation of privacy in emails more than 180 days old.

Who has your data? And how can the government get it?

The questions are more complicated than they might seem.

In the last month, we’ve seen Facebook criticized and scrutinized at every turn for the way it collects and shares its users’ data. Much of that criticism was deserved, but what was missing from that discussion was any mention of the companies that have your data without even your knowledge, let alone your consent.

The relationship between a user and Facebook is at least relatively straightforward. The user knows his or her data has been placed in Facebook, and legislation could be updated relatively easily to protect his or her expectation of privacy in that data.

But what about the data consumer service companies share with third parties?

So much of the data economy involves companies and businesses that don’t necessarily have you as a customer, and thus have even less incentive to protect your interests.

What about data that’s supposedly de-identified or anonymized? We know that such data can be combined with another dataset to re-identify people. Could the government seek that kind of data and avoid getting even a subpoena? Increasingly, the companies that have data about you aren’t even the companies you initially transacted with. How will existing privacy laws, even proposed reforms by the Digital Due Process coalition, deal with this reality?

These are all questions that consume us at the Common Data Project for good reason. As an organization dedicated to enabling the safe disclosure of personal information, we are committed to talking about privacy and anonymity in measurable ways, rather than with vague promises.

If you read a typical privacy policy, you’ll see language that goes something like this:

Google only shares personal information with other companies or individuals outside of Google in the following limited circumstances:…

We have a good faith belief that access, use, preservation or disclosure of such information is reasonably necessary to (a) satisfy any applicable law, regulation, legal process or enforceable governmental request

We think the datatrust needs to do better than that. We want to know exactly what “enforceable government request” means. We want to think creatively about what individual privacy rights mean when organizations are sharing information with each other. We’ve written up the aspects that seem most directly relevant to our project here, including 1) a quick overview of federal privacy law; 2) implications for data collectors today; and 3) implications for the datatrust.

The datatrust has always been a big-tent project, but over the last few months, we’ve done a lot of paring down. We’re getting closer to something that feels like a product and less like a vague hope for a better future!

The following is an attempt to describe the datatrust “technology product” by way of comparison with existing websites and services. The “Governance and Policies” aspect of the datatrust was covered in a separate post.

Sensitive information about us. They have it, we don’t.

Today, most of the sensitive data about us (e.g. medical records, personal finance data, online search history) is inaccessible to us and to those who represent the public: elected officials, government agencies, advocacy groups, researchers.

Our Mission: Democratizing Access to Sensitive Data

While a significant movement has grown up around opening up government data, there are few efforts to gain public access to sensitive “personal information” data, most of it held in the private sector.

Our goal for the datatrust is to create an open marketplace for information to democratize access to some of the most sensitive and valuable data there is, to help us answer difficult policy and societal questions.

Which brings us to the question: What is a datatrust?

A datatrust will be an online service that allows organizations to make sensitive data available to the public and provides researchers, policymakers and application developers with a way to directly query that data.

The datatrust will include a data catalog, a registry of queries and their privacy risks, and a collaboration network for both data donors and data users.

We realize that as a new breed of service, the datatrust is difficult to conceptualize. So, we thought it might be helpful to compare it to some existing websites and services.

A Data Catalog

Like data.gov, the datatrust will provide ways to browse and search a “catalog” of available data.

A Query-able Database of “Raw Data”

Unlike data.gov, datatrust data will be released in “raw” form, not in pre-digested aggregate reports.

Unlike data.gov, datatrust data will not be viewable or downloadable.

Instead, the datatrust will provide a way to directly query raw data.

An “Automated” Privacy Filter

Unlike most open government data releases, the datatrust will not rely on labor-intensive and subjective anonymization methods. Existing methods like scrubbing, swapping or synthesizing data limit the accuracy and usefulness of the data.

By contrast, the datatrust will make use of new privacy technologies to provide a measurable and enforceable privacy guarantee, one that treats individual privacy as a valuable asset with a quantifiable limit on re-use.

As a result, the datatrust will keep track of the amount of privacy risk incurred by each query.
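A minimal sketch of that bookkeeping, assuming a differential-privacy-style epsilon budget (the class name, interface, and numbers are our own invention, loosely modeled on how PINQ meters queries): each query spends part of a fixed budget, and queries that would exceed it are refused.

```python
import math
import random

class PrivacyBudget:
    """Track the cumulative privacy risk (epsilon) spent by queries."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def noisy_count(self, data, predicate, epsilon):
        """Answer a counting query with Laplace noise scaled to 1/epsilon."""
        if self.spent + epsilon > self.total:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon
        true_count = sum(1 for row in data if predicate(row))
        # Sample Laplace(0, 1/epsilon) noise; a count has sensitivity 1
        u = random.random() - 0.5
        noise = -(1.0 / epsilon) * math.copysign(1, u) * math.log(1 - 2 * abs(u))
        return true_count + noise
```

With a total budget of 1.0, two queries at epsilon 0.5 each exhaust the budget and a third is refused, which is the sense in which the privacy guarantee is "measurable and therefore enforceable."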

An Open Collaboration Network

Because the datatrust will maintain an open history of all queries and data users, it will also become an important open registry of how data is being used and analyzed. This in turn can become the foundation for a community of data donors and data users, who will collaborate on collecting and analyzing data for research and data-driven software applications.

Like Amazon, the datatrust will do a better job of describing and browsing data sets as well as eliciting user feedback and data-mining actual usage (as opposed to self-reported usage) to help users find relevant data sets.

Like Wikipedia, the datatrust will depend on an invested and active community to curate and manage the data.

Unlike Wikipedia and Yelp (but like Facebook and LinkedIn), the datatrust will require its users to maintain real and active identities in order to build a quality rating system for evaluating data and data use, based on actual usage and individual reputations (as opposed to explicit user ratings).

Not A Generic Set of Tools for Working With Data

Unlike Swivel, the datatrust is not a generic tool set for working with and visualizing data.

Unlike Ning (a consumer platform for creating your own social network), the datatrust is not a consumer platform for creating your own data-sharing networks. It is also not a developer toolkit for building data-driven services.

Not A Data-Driven Service for Consumers

You should not expect to come to the datatrust to find out if people like you are also experiencing worse than average allergies this year.

Unlike Mint or Patients Like Me, the datatrust is not a personal data-sharing service focused on offering a consumer service (personal finance management in the case of Mint) or sharing a specific kind of data (tracking chronic diseases in the case of Patients Like Me).

But application builders like Mint.com as well as researchers may find the datatrust useful in allowing them to provide services and collect data in new ways from larger groups of people, due to the measurable privacy guarantee provided by the datatrust.

The datatrust is just about data.

The datatrust is a sensitive-data release engine, and we will build tools insofar as they help our Data Donors get more data to Data Users. However, it stops short of directly serving consumers. We think that is better left to those with a passion for a specific cause and the domain expertise to serve their constituents well.

1) Infochimps launched their API. People often ask, are you guys doing something similar? Yes, in that we are also interested in democratizing access to data, but we’re focusing on a narrower area — information that’s too sensitive and too personal to release in the usual channels. In any case, we’re excited to see more movement in this direction.

2) Wikipedia began a trial of a new tool called “Pending Changes.” To deal with glaring inaccuracies and vandalism, Wikipedia made certain entries off-limits for off-the-cuff editing. The trade-off, however, was that first-time editors to these articles couldn’t get that immediate thrill of seeing their edits. Wikipedia’s trying out a compromise, a tab in which these edits are visible as “pending changes.” It’s always fascinating to see all the different spaces in which people in a community can interact online — this is a new one.

3) The Info Law Group posted various groups’ reactions to the privacy bill proposed by Representative Rick Boucher. Here’s Part I, here’s Part II. Fairly predictable, but it still never ceases to amuse me how far apart industry groups are from consumer advocates.

It is worth remembering: We didn’t build libraries for an already literate citizenry. We built libraries to help citizens become literate. Today we build open data portals not because we have a data or public policy literate citizenry, we build them so that citizens may become literate in data, visualization, coding and public policy.

“At some point data retention laws can be reasonable, but highly-personal information such as browsing history is a step too far,” Jacobs said. “You can’t treat everybody like a criminal. That would be like tapping people’s phones before they are suspected of doing any crime.”

3) Wikipedia is adding two new executive roles. In the process of researching our community study, it really struck me how small Wikipedia’s staff was compared to the staff of more centralized, less community-run businesses like Yelp and Facebook. Having two more staff members is not a huge increase, but it does make me wonder, is a larger staff inevitable when an organization tries to assert more editorial control over what the community produces?

Last month CCICADA hosted a workshop at Rutgers on “statistical issues in analyzing information from diverse sources”. For those curious, CCICADA stands for Command, Control, and Interoperability Center for Advanced Data Analysis. Though the specific applications did not necessarily deal with sensitive data, I attended with an eye towards how the analyses presented might fit into the world of the datatrust. Here’s a look at a couple of examples from the workshop:

Exploding Manholes!

Cynthia Rudin from MIT gave a talk on her work “Mitigating Manhole Events in New York City Using Machine Learning”. Manholes provide access to the city’s underground electrical system. When the insulation material wears down, there is risk of a “manhole event”, which can range from minor smoking up to a fiery explosion. The power company has finite resources to investigate and fix at-risk manholes, so her system predicts which manholes are most at risk based on information in tickets filed with the power company (e.g. lights flickering at X address, manhole cover smoking at Y).

Preventing exploding manholes is interesting, but how might this relate to the datatrust? It turns out that when the power company is logging tickets, they’re not doing it with machine learning for manhole events in mind. One of the biggest challenges in using this unstructured data for this purpose was cleaning it—in this case, converting a blob of text into something analyzable. While I’m not sure there’s any need to put manhole event data in a datatrust, naturally I started imagining the challenges around this. First, it’s hard to imagine being able to effectively clean the data once it’s behind the differential privacy wall. The cleaning was an iterative process that involved some manual work with these text blobs.

For us, the takeaway was that some kinds of data will need to be cleaned while you still have direct access to it, before it is placed behind the anonymization wall of the datatrust. This means that the data donors will need to do the cleaning and it can’t be farmed out to the community at large without compromising the privacy guarantee.

Second, the cleaning seemed to be somewhat context-sensitive. That is, for their particular application, they were keeping and discarding certain pieces of information in the blob. Just as an example, if I were trying to determine the ratio of males to females writing these tickets, I might need a different set of data points extracted from the blob. So, while we’ve spent quite a few words here discussing the challenges around a meaningful privacy guarantee, this was a nice reminder that all of the usual challenges in dealing with data also apply to sensitive data.
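A toy illustration of that context sensitivity (the ticket text and field names are invented): the same raw blob yields a different structured record depending on what the downstream analysis needs.

```python
import re

# An invented example of an unstructured ticket blob
blob = "6/12 10:42PM caller rpts lights flickering at 125 MAIN ST; mh cover smoking"

def extract_for_manhole_model(text):
    """Keep only the signals a manhole-risk model might need."""
    match = re.search(r"at (\d+ [A-Z ]+?)(?:;|$)", text)
    return {
        "address": match[1] if match else None,
        "smoking_cover": "smoking" in text.lower(),
        "flickering": "flickering" in text.lower(),
    }

record = extract_for_manhole_model(blob)
```

A gender-ratio study would instead want something like a caller-name or pronoun extractor and could discard the address entirely, which is why the cleaning step can't be designed once and for all before the data goes behind the privacy wall.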

Anonymizing Relationships

Of particular relevance to CDP was Graham Cormode from AT&T Research and his talk on “Anonymization and Uncertainty in Social Network Data”. The general purpose of his work, similar to ours, is to allow analysis of sensitive data without infringing on privacy. If you’re a frequent reader, you’ve noticed that we’ve been primarily discussing differential privacy and specifically PINQ as a method for managing privacy. Graham presented a different technique for anonymizing data. I’ll set up the problem he’s trying to solve, but I’m not going to get into the details of how he solves it.

Graham’s technique anonymizes graphs, particularly social network interaction graphs. In this case, think of a graph as having a node for every person on Facebook, and a node for each way they interact. Then there are edges connecting the people to the interactions. Here is an example of a portion of a graph:

Graham’s anonymization requirement is that we should not be able to learn of the existence of any interaction, and we should be able to “quantify how much background knowledge is needed to break” the protection.

How does he achieve this? The general idea is by some intelligent grouping of the people nodes. I’ll illustrate the general idea with an example of simple grouping—we’ll group Grant and Alex together, meaning we’ll replace both the “Grant node” and the “Alex node” with a “Grant or Alex node”, and we’ll do the same for the “Mimi” and “Grace” nodes. (We would also replace the names with demographic information to allow us to make general conclusions.)

Now, this is reminiscent of one of those logic puzzles, where you have several hints and have to deduce the answer. (One of Mimi and Grace poked Grant or Alex!) Except in this case, if the grouping is done properly, the hints will not be sufficient to deduce any of the individual interactions.
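Here is a toy version of that grouping (the names, interactions, and pair-sized groups are invented, and this is far simpler than the algorithm in Graham’s paper): each person node is replaced by its group node, so a published edge only reveals that someone in the group took part.

```python
# Edges: (person, interaction) pairs from a toy social graph
edges = [
    ("Grant", "poke-1"),
    ("Mimi", "poke-1"),
    ("Alex", "msg-7"),
    ("Grace", "msg-7"),
]

# Group people in pairs; each person maps to a group label
groups = {
    "Grant": "Grant-or-Alex",
    "Alex": "Grant-or-Alex",
    "Mimi": "Mimi-or-Grace",
    "Grace": "Mimi-or-Grace",
}

# The published graph only links group nodes to interactions
anonymized = sorted({(groups[person], interaction) for person, interaction in edges})
```

From the anonymized edge list you can tell that one of Grant and Alex was poked by one of Mimi and Grace, but not which, which is exactly the logic-puzzle ambiguity the grouping is designed to leave unresolvable.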

You can find a much more complete explanation of the method here in Graham’s paper, but I thought this was a good example to contrast with PINQ’s strategy:

PINQ acts as a wall to the data only allowing noisy aggregates to pass through, while this technique creates a new uncertain version of the dataset which you can then freely look at.

The big problem is that these business models are not very stable. Companies set out privacy policies, consumers disclose data, and then the action begins…The business model changes. The companies simply want the data, and the consumer benefit disappears.

It’s not enough to start with compensating consumers for their data. The persistent, shareable nature of data makes it very different from a transaction involving money, where someone can buy, walk away, and never interact with the company again. These data-centered companies are creating a network of users whose data are continually used in the business. Maybe it’s time for a new model of business, where governance plans incorporate ways for users to be involved in decisions about their data.