Open Data and Privacy

Introduction

The Canadian Government is moving towards treating its data as “open by default”. An exception to this default is data that is “personal information”, which must be removed or masked before the data is disclosed as open data to any third party. However, these steps are not always enough to protect privacy, and individuals can be reidentified after the open data is released. This FAQ examines this conflict between privacy and open data.

What is Open Data?

The Open Knowledge Foundation defines “open data” as data that can be freely used, shared and built on by anyone, anywhere, for any purpose. The Canadian government refers to open data as data that is freely accessible to the public and published in such a manner that anyone can use and manipulate it without restriction.

For data to be open it must:

(a) be available in whole, and easily accessible (i.e., downloadable or easily requested);

(b) be reusable and redistributable; and

(c) be universal (i.e., anyone can access, reuse, or redistribute it).

Generally, to be considered open, data must be released in a machine-readable format under an open, permissive license, with the intention that the data be used, combined, rereleased and built upon by others.

The City of Ottawa provides a good example of open data: the City releases publicly available real-time GPS data of OC Transpo bus locations and estimated arrival times of the next three buses for a given stop. This data has been integrated into an iPhone application called “MyTransit”, so that people can plan their trips more efficiently.

There are hundreds of examples of open data, many of which can be found on the Open Data 500 website. Open Data 500 is a website on which different countries publish information about open data within their respective nations, for a study of how open data is used and what impact it has.

There are three main open data requirements. First, the data must be in the public domain or provided under an open license that allows any part of the work to be freely used, distributed, or modified. Second, the data must be readily accessible as a whole and should be downloadable via the Internet without charge. Finally, the data must be provided in a form readily processable by a computer (“machine readable”). For example, .PDF files are not considered machine readable because their contents cannot be readily processed using database or spreadsheet software, whereas formats such as .XML or .XSLT are machine readable.
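To illustrate what “machine readable” means in practice, the following is a minimal sketch in Python that uses only the standard library. The sample data, field names and function names are hypothetical and are not taken from any real transit feed; CSV is included as a second common format that the text above does not mention.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data; stop names, routes and times are illustrative only.
CSV_TEXT = "stop,route,minutes_to_arrival\nBank/Slater,95,4\nBank/Slater,97,11\n"
XML_TEXT = """<arrivals>
  <arrival stop="Bank/Slater" route="95" minutes="4"/>
  <arrival stop="Bank/Slater" route="97" minutes="11"/>
</arrivals>"""


def read_csv_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def read_xml_rows(text):
    """Parse XML text into a list of attribute dictionaries, one per <arrival>."""
    root = ET.fromstring(text)
    return [dict(node.attrib) for node in root.findall("arrival")]


csv_rows = read_csv_rows(CSV_TEXT)
xml_rows = read_xml_rows(XML_TEXT)

# Because the data is structured, a program can compute with it directly.
next_bus = min(int(row["minutes_to_arrival"]) for row in csv_rows)
```

In both cases a few lines of code turn the file into records that can be filtered, combined or loaded into a database. The same table embedded in a PDF would typically need error-prone text extraction first.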

Government open data is gathered in various ways including research and surveys, and it is ultimately the product of the services that the Government of Canada provides. Departments prioritize the release of data based, in part, on requests filed via the 'Dataset Suggestion' form on open.canada.ca.

The majority of open data in Canada emanates from the Federal government and can be accessed via its online portal: http://open.canada.ca/data/en/dataset. However, there are many other groups that publish open data, including:

Provincial and municipal government bodies;

Scientific researchers;

For example, the data collected from the Human Genome Project can be downloaded in its entirety from the National Center for Biotechnology Information’s (NCBI) website.

Businesses that release information about their own operations;

Users that write reviews and/or tweets online;

Marketing companies are able to classify publicly available reviews or tweets as either positive or negative based on the language that is used. One European consulting firm has used this technique to monitor the frequency of negative comments about a competitor’s telecom service. During periods with a high frequency of negative comments, it may be advisable for a company to run ads touting its own network’s reliability.
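The classification of comments described above can be sketched in a few lines of Python. This is a deliberately simple illustration and not the method used by any particular firm: the word lists and function names are hypothetical, and real systems rely on much larger lexicons or trained models. Comments with no clear signal are labelled “neutral” here, a category the text above does not mention.

```python
import re
from collections import Counter

# Hypothetical word lists; real systems use far larger lexicons or trained models.
POSITIVE = {"great", "reliable", "fast", "love", "good"}
NEGATIVE = {"dropped", "slow", "outage", "terrible", "bad"}


def classify(comment):
    """Label a comment by counting positive and negative words."""
    words = re.findall(r"[a-z']+", comment.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def negative_share(comments):
    """Return the fraction of comments classified as negative."""
    if not comments:
        return 0.0
    counts = Counter(classify(c) for c in comments)
    return counts["negative"] / len(comments)
```

Tracking `negative_share` over successive time windows gives the kind of frequency signal described above: a spike in negative comments about a competitor could prompt an advertising campaign.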

Open government data from other jurisdictions is available through their respective online portals, including: the United States of America, the United Kingdom, Australia, Germany, and France.

Governments: The Federal, Provincial and Municipal governments are the primary publishers of open data in Canada. They use open data as a means to communicate with the public and to improve the accountability and transparency of government operations. In addition, governments use open data to inform their decisions in an effort to improve efficiency, and to deliver smarter, more personalised public services.

Businesses: Innovators and businesses use open data to develop services that are more targeted to users’ needs than those developed in-house by governments. Open data is often integrated into mobile phone applications, which can generate revenue through advertisements, pay-for-play, etc.

Marketing companies: Marketing firms use open data to create “profiles” for potential customers so that companies can target a particular type of person for their products and services. For example, marketing companies may combine aggregate demographic data from Statistics Canada with telephone directory data to infer personal characteristics about certain individuals.

Non-profit organizations: Open information on the location of resources, such as schools, hospitals and roads, and on the quality of health, education, and economic systems can be used to identify areas of greatest need and calculate the additional resources required. Non-profit organizations use this type of information when developing strategies for day-to-day operations.

An example of non-profit organizations making use of open data occurred following the earthquake that struck Haiti in 2010. In the aftermath, volunteers across the world combined data from sources such as satellite maps, World Health Organization maps of health facilities, locations of police facilities from the Pacific Disaster Center, and data from mobile phone companies. The OpenStreetMap project became a source of information to guide both governments and private aid workers. This map helped responders better match the supply of and demand for various resources to help those who needed it most.

Another example is College Abacus, which allows students and families to compare the “net cost” of attending different colleges in the US using information that schools are required to report to the federal government.

1. Open data ensures government accountability. Open government data can be used to help the public better understand what the government does and how well it performs. The data enables people to hold the government accountable for wrongdoing or unachieved results.

2. Open data fosters innovation and the creation of apps that bring convenience to everyday life. At the municipal level, open data applications make use of real-time GPS transit data feeds, restaurant inspection reports, and permit data. At other levels of government, open data may include geospatial data, statistical data, public health information, etc. The federal government encourages the use of open data through its annual competition, called the Canadian Open Data Experience (CODE). This “hackathon” brings together more than 900 developers, students, and open data enthusiasts from across Canada to develop innovative applications using federal open data.

3. Open data aids in “improved personal decision-making”. Information collected about individuals can be shared with those same individuals to inform their decisions. For example, to encourage conservation, consumers can compare their use of certain household utilities with that of their neighbours.

4. Open data promotes “online consultation/crowdsourcing”. In pursuit of more effective use of data, some businesses and public interest groups share their data with the public. For example, Goldcorp published 400 megabytes of geological survey data about its Red Lake, Ontario, property in 2000. Goldcorp offered prizes for public submissions that identified likely locations of gold within the mine. The submissions that Goldcorp received resulted in the extraction of more than $3 billion in gold.

5. Open data helps enhance education. Adaptive learning systems use online lessons to teach specific concepts and monitor student progress by recording indicators such as the number of times the student pauses a lecture, the time needed to answer a question, and a student’s reliance on “hints”. This data on student performance can give teachers the insight they need to intervene and further personalize student interactions. An example is a teaching program called “Knewton”, which stores each student’s learning tendencies so that the lessons become more personalized as the student progresses through the course.

The Federal Government of Canada has released over 245,000 data sets. These data sets can be filtered into 46 categories based on organization (Aboriginal Affairs, Environment Canada, Health Canada, Transport Canada, Western Economic Diversification Canada, etc.). Canada’s strategies to ensure privacy is maintained are not explicitly described. Furthermore, no new legislative framework has been created to specifically guide the move towards open government in Canada.

The simple phrase “personal identifying information” (or PII) has taken on different definitions depending on who is defining it (e.g. government versus academic) and the purpose for which the definition is used (e.g. PII for data privacy versus PII for privacy law in general). The Government of Canada agrees with the judicial interpretation that PII is information about an individual where there is a real possibility that the person can be identified. The information can range from a name to a biological sample. On the other hand, California Senate Bill 1386 defines PII as information that includes social security numbers, driver’s license numbers and financial account numbers, but not email addresses or phone numbers. The California definition therefore covers data commonly used to authenticate people (identifying data from ID cards) but excludes data whose disclosure is known to violate privacy (contact information).

Anonymization is the act of destroying all links between deidentified data sets (data sets from which people’s identities have been removed to prevent the data from being used to identify an individual) and the original data sets. Data are labelled as anonymous once all PII is removed. The term implies that the data can no longer be reidentified. Anonymization strategies include: replacing data with other values (e.g. numbers in place of names); suppressing or omitting data from a set; generalizing data (e.g. specific dates become years); or perturbing the data by making random changes. Each treatment offers a different level of protection against attackers. Data anonymization is not foolproof: there are many examples of attackers defeating anonymization efforts.
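The four strategies listed above (replacement, suppression, generalization and perturbation) can be sketched as follows. This is a minimal illustration under stated assumptions: the records, field names, noise range and the `anonymize` function are all hypothetical, and nothing here reflects how any government body actually treats its data.

```python
import random

# Hypothetical records; names, dates, postal codes and incomes are invented.
records = [
    {"name": "Alice Tremblay", "birth_date": "1984-03-17",
     "postal_code": "K1A 0B1", "income": 52000},
    {"name": "Bob Singh", "birth_date": "1991-11-02",
     "postal_code": "K2P 1L4", "income": 61000},
]


def anonymize(rows, noise=1000, seed=0):
    """Apply replacement, suppression, generalization and perturbation."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    pseudonyms = {}
    out = []
    for row in rows:
        # Replacement: each distinct name is swapped for an arbitrary number.
        person_id = pseudonyms.setdefault(row["name"], len(pseudonyms) + 1)
        out.append({
            "person_id": person_id,
            # Generalization: keep only the year of the birth date.
            "birth_year": row["birth_date"][:4],
            # Suppression: the postal code is simply not copied over.
            # Perturbation: add random noise to the income.
            "income": row["income"] + rng.randint(-noise, noise),
        })
    return out


anonymized = anonymize(records)
```

Each choice trades usefulness for protection: a birth year is less identifying than a full date but also less precise, and a wider noise range protects income figures better while making them less accurate.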

An unfortunate truth is that data is either useful or completely anonymous, but rarely (if ever) both. Because people expect open data to be useful, releasing it carries privacy risks. First, anonymous data can be reidentified when combined with other data sets. Second, personal data can be released directly in “anonymized” data sets, though often accidentally. Finally, third parties can use personal data they already hold to reidentify anonymous open data.

Reidentification occurs when individuals are identified from information gleaned from a supposedly anonymized data set. Reidentification succeeds when an attacker finds hidden PII in a data set, or when two or more data sets are combined to identify people. The latter involves combining anonymous data sets with non-anonymous data sets (for example, public voter lists) to reidentify individuals present in the anonymous data.
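Combining an anonymous data set with a non-anonymous one, as described above, amounts to a join on the fields the two sets share. The sketch below assumes, purely for illustration, that both sets contain a postal code, birth date and sex; the records, field names and the `link_records` function are hypothetical.

```python
# Hypothetical "anonymized" records: names removed, other fields kept.
anonymous_rows = [
    {"record_id": "r1", "postal_code": "K1A", "birth_date": "1984-03-17",
     "sex": "F", "diagnosis": "asthma"},
    {"record_id": "r2", "postal_code": "K2P", "birth_date": "1991-11-02",
     "sex": "M", "diagnosis": "diabetes"},
]

# Hypothetical public list (for example, a voter list) that includes names.
voter_rows = [
    {"name": "Alice Tremblay", "postal_code": "K1A",
     "birth_date": "1984-03-17", "sex": "F"},
    {"name": "Bob Singh", "postal_code": "K2P",
     "birth_date": "1991-11-02", "sex": "M"},
    {"name": "Raj Singh", "postal_code": "K2P",
     "birth_date": "1991-11-02", "sex": "M"},
]


def link_records(anonymous, identified,
                 keys=("postal_code", "birth_date", "sex")):
    """Return {record_id: name} for records that match exactly one person."""
    index = {}
    for person in identified:
        index.setdefault(tuple(person[k] for k in keys), []).append(person["name"])
    matches = {}
    for row in anonymous:
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # only a unique match reidentifies someone
            matches[row["record_id"]] = candidates[0]
    return matches


reidentified = link_records(anonymous_rows, voter_rows)
```

In this sample, record `r1` matches exactly one voter and is reidentified, while `r2` matches two voters and stays ambiguous. This is why generalizing or suppressing shared fields, so that more people share each combination of values, reduces the risk.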

There are many examples of individuals reidentified from anonymized data sets. However, before providing examples, it is important to note that reidentification is likely underreported, and is most often reported in the academic literature only when novel techniques are involved. Real-life examples include:

The “AOL Data Release” was an incident in which AOL released the search histories of hundreds of thousands of its users, comprising millions of search queries, on its website. The users’ identities were anonymous; however, one user was easily identified through her internet searches by a New York Times reporter.

The “Netflix Prize Data Study” arose from a competition in which Netflix released an anonymized data set of its users’ movie ratings and invited participants to improve its recommendation algorithm, which recommends specific movies and shows to users based on their ratings. Researchers studying the released data showed that users could be reidentified by linking their anonymized ratings with publicly available ratings posted elsewhere.

The release of New York City’s taxi logs is another example: the anonymized data on taxi trips and drivers was easily reidentified. Vijay Pandurangan stated bluntly that the anonymization of the data set was poor enough that anyone could, in two hours or less, work out which driver drove every single trip in the entire data set.
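The text above does not say how the taxi data was anonymized. Public write-ups of the incident report that the taxi and driver identifiers were replaced with unsalted MD5 hashes, which can be reversed whenever the set of possible identifiers is small enough to enumerate. The sketch below illustrates that weakness under stated assumptions: the identifier format (a digit, a letter, then two digits) and the sample value are used only for illustration.

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits


def md5_hex(text):
    """Return the hexadecimal MD5 digest of an ASCII string."""
    return hashlib.md5(text.encode("ascii")).hexdigest()


def build_lookup():
    """Hash every possible identifier of the assumed format.

    Assumed format (illustrative): digit, letter, digit, digit,
    which gives only 10 * 26 * 10 * 10 = 26,000 possibilities.
    """
    table = {}
    for d1, letter, d2, d3 in product(digits, ascii_uppercase, digits, digits):
        identifier = f"{d1}{letter}{d2}{d3}"
        table[md5_hex(identifier)] = identifier
    return table


lookup = build_lookup()

# A "pseudonymized" value as it might appear in a released data set.
hashed_value = md5_hex("7B42")

# Reversing the hash is a single dictionary lookup.
recovered = lookup[hashed_value]
```

Hashing hides a value only when an attacker cannot guess the inputs. With a small, predictable identifier space, the full lookup table is built in well under a second. Random identifiers unrelated to the original values would avoid this particular weakness.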

Data from 1700 Facebook profiles from a particular American university were publicly released. Theoretically, the data was anonymous; however, similar to the AOL incident, the source of the data was easily identifiable, which placed the identities of the students at risk. The anonymity of the data set was debatable even before publication given that, like fingerprints, no two “friend networks” are exactly the same.

While all PII must be removed from open data before it is disclosed, governments may accidentally publish personal information with open data. The Government of Canada acknowledges that the data it releases sometimes (unintentionally) includes personal information, such as income, education or occupation.

A website in the United Kingdom (UK), “WhatDoTheyKnow” (WDTK), enables the public to ask what information the government holds. In 2015, a blogger with the name “FOIMonkey” posted an article outlining numerous cases in which personal data was accidentally released on the WDTK website when government information was released. Some notable examples include:

“A local council accidentally included car number plate information belonging to 31,378 people who had been issued with parking tickets in a PDF”;

A “local council also accidentally included the names of 275 council tenants in a spreadsheet, along with other personal information about them”; and

“A police force included names and offence details of 199 individuals who had been arrested for indecent assault in an Excel spreadsheet”.

The blogger goes on to list 47 more incidents, stating that his list represents a third of the data loss incidents of which he is aware.

There is a need for a balance between privacy interests and the interest in an open and transparent government. Governments most often use data to improve services or plan for the future; however, in pursuit of transparency, governments sometimes cross the line. There are very few documented examples of third parties associating open data with personal information, simply because third parties that do so have little reason to publicize it, given the privacy implications.

One example from the US is the creation of the “Proposition 8 map”. Proposition 8 was an amendment proposed for the California constitution seeking to ban gay marriage. Unfortunately for donors, election campaign contributions are a matter of public record. The information left by donors on the public record was combined with data sets of addresses, and an interactive map showing where each donor lived was then created.

Another example is the Conservative Party of Canada’s approach to campaigning. The Conservative Party combined data on family traits (number of children, education, age, whether a person is divorced or widowed, etc.) with known data from voter information (name, address, phone number). The Party used different techniques to obtain the data. This approach enabled the Party to microtarget the people most likely to be swayed to vote in its favour.

“What is College Abacus”, College Abacus. The non-profit organization College Abacus gives prospective college students the ability to compare tuition and associated costs between different American colleges using open data.

Academics

Ira Rubinstein and Woodrow Hartzog, “Anonymization and Risk” (2015) (New York University Public Law and Legal Theory Working Papers, Paper 15-36). This paper outlines several variables that affect the risk level of a company’s data sets, including: volume, sensitivity, class of recipient, and use of data.

Teresa Scassa, “Privacy and Open Government”, (2014) 6 Future Internet 2. Scassa looks at the tensions between an open, transparent government, and the public’s interest in maintaining a level of privacy.

Government Sources

House of Commons, Standing Committee on Government Operations and Estimates, Open Data: The Way of the Future, (June 2014), 41st Parliament, 2nd Sess. This report discusses open data in the Canadian government and makes recommendations for protecting confidential information.

Office of the Privacy Commissioner of Canada, Interpretation Bulletins. The Office of the Privacy Commissioner of Canada publishes bulletins providing its non-binding interpretation of court decisions and findings related to PIPEDA.

Joel Gurin, “Secrets of Sentiment Analysis” Open Data Now (24 October 2013). This article discusses the uses of sentiment analysis to derive value from publicly available tweets, reviews, etc.

Linda Tischler, “He struck gold on the Net (really)” Fast Company (31 May 2002). This article discusses Goldcorp’s plan to make their geological survey data publicly available in an effort to improve their gold extraction efficiency.