21-3 Data Mining
- Multilevel databases were not a commercial success
  - Clients were mainly military; finding all possible inferences is NP-complete
- However, combining (sensitive) information stored in multiple (possibly huge) databases, as is done in data mining, raises similar concerns and has received a lot of attention recently
- So far, a single entity has been in control of some data
  - It knows what kind of data is available
  - It knows who has accessed it (ignoring side channels)
- This is no longer the case in data mining: data miners actively gather additional data from third parties

21-6 Confidentiality
- Data mining can reveal sensitive information about humans (see later) and companies
- In 2000, the U.S. National Highway Traffic Safety Administration combined data about Ford vehicles with data about Firestone tires and became aware of a problem with the Ford Explorer and its Firestone tires
  - The problem started to occur in 1995, and each company individually had some evidence of it
  - However, data about product quality is sensitive, which makes sharing it with other companies difficult
- A supermarket can use loyalty cards to learn who buys what kind of products and sell this data, maybe to manufacturers’ competitors

21-7 Data Correctness and Integrity
- Data in a database might be wrong
  - E.g., input or translation errors
- Mistakes in data can lead data miners to wrong conclusions, which can negatively impact individuals
  - From receiving irrelevant mail to being denied boarding on a flight
- Privacy calls for the right of individuals to correct mistakes in stored data about them
  - However, this is difficult if data is shared widely or if there is no formal procedure for making corrections
- In addition to false positives, there can also be false negatives: don’t blindly trust data mining applications

21-8 Availability
- Mined databases are often created by different organizations
  - Different primary keys, different attribute semantics, ...
    - Is attribute “name” last name, first name, or both? US or Canadian dollars?
  - Makes combining databases difficult
- Must distinguish between inability to combine data and inability to find a correlation

21-9 Privacy and Data Mining
- Data mining might reveal sensitive information about individuals, based on the aggregation and inference techniques discussed earlier
- Avoiding these privacy violations is an active research area
- Data collection and mining is done by private companies
  - Privacy laws (e.g., Canada’s PIPEDA or the U.S. HIPAA) control collection, use, and disclosure of this data
  - Together with PETs (privacy-enhancing technologies)
- But also by governments
  - Programs tend to be secretive, with no clear procedures
  - Phone tapping in the U.S., no-fly lists in the U.S. and Canada

21-11 Another Example (by L. Sweeney)
- 87% of the U.S. population can be uniquely identified based on a person’s ZIP code, gender, and date of birth
- Massachusetts’ Group Insurance Commission released anonymized health records
- Records omitted individuals’ names, but gave their ZIP code, gender, and date of birth (and health information, of course)
- Massachusetts’ voter registration lists contain these three items plus individuals’ names and are publicly available
- Enables re-identification by linking
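
The linking step can be sketched in a few lines of Python. This is only an illustration: the records, names, and the `link` helper below are hypothetical and are not taken from the actual Massachusetts data. The sketch joins a "released" table and a "public" table on the three shared attributes and reports a match only when exactly one public name fits.

```python
from collections import defaultdict

# Hypothetical toy data, invented for illustration only.
health_records = [
    {"zip": "02138", "gender": "F", "dob": "1950-03-14", "diagnosis": "hypertension"},
    {"zip": "02139", "gender": "M", "dob": "1962-11-02", "diagnosis": "asthma"},
    {"zip": "02139", "gender": "M", "dob": "1971-07-30", "diagnosis": "diabetes"},
]
voter_list = [
    {"name": "Alice Example", "zip": "02138", "gender": "F", "dob": "1950-03-14"},
    {"name": "Bob Example", "zip": "02139", "gender": "M", "dob": "1962-11-02"},
    {"name": "Carol Example", "zip": "02140", "gender": "F", "dob": "1980-01-01"},
]

# Attributes shared by both tables.
SHARED_ATTRIBUTES = ("zip", "gender", "dob")


def key(record):
    """Project a record onto the shared attributes."""
    return tuple(record[a] for a in SHARED_ATTRIBUTES)


def link(released, public):
    """Return (name, diagnosis) pairs for released records that match exactly one public name."""
    names_by_key = defaultdict(list)
    for person in public:
        names_by_key[key(person)].append(person["name"])
    matches = []
    for record in released:
        names = names_by_key.get(key(record), [])
        if len(names) == 1:  # unique match: the record is re-identified
            matches.append((names[0], record["diagnosis"]))
    return matches


reidentified = link(health_records, voter_list)
print(reidentified)
```

In this toy data set, two of the three "anonymized" records are tied back to a name; the third has no counterpart in the public list.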

21-12 k-Anonymity
- Ensure that for each released record, there are at least k-1 other released records from which the record cannot be distinguished
- For the health-records example, release a record only if there are at least k-1 other records that have the same ZIP code, gender, and date of birth
  - Assumption: there is only one record for each individual
- Because of the 87% number, this won’t return many records; some pre-processing of records is needed
  - Strip one of { ZIP code, gender, date of birth } from all records
  - Reduce granularity of ZIP code or date of birth
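
A minimal Python sketch of the release rule and of the granularity-reduction step follows. The data and the helper names (`generalize`, `k_anonymous_release`) are hypothetical, and the particular generalization (3-digit ZIP prefix, year of birth only) is one assumed choice among many. As on the slide, the sketch assumes one record per individual.

```python
from collections import Counter

ATTRIBUTES = ("zip", "gender", "dob")


def generalize(record, zip_digits=3):
    """Reduce granularity: keep a ZIP prefix and only the year of birth."""
    out = dict(record)
    out["zip"] = record["zip"][:zip_digits] + "*" * (len(record["zip"]) - zip_digits)
    out["dob"] = record["dob"][:4]  # assumes ISO dates, YYYY-MM-DD
    return out


def k_anonymous_release(records, k):
    """Release only records whose (zip, gender, dob) combination occurs at least k times."""
    counts = Counter(tuple(r[a] for a in ATTRIBUTES) for r in records)
    return [r for r in records if counts[tuple(r[a] for a in ATTRIBUTES)] >= k]


# Hypothetical toy data, one record per individual.
records = [
    {"zip": "02138", "gender": "F", "dob": "1950-03-14", "diagnosis": "hypertension"},
    {"zip": "02139", "gender": "F", "dob": "1950-09-01", "diagnosis": "asthma"},
    {"zip": "02139", "gender": "M", "dob": "1962-11-02", "diagnosis": "diabetes"},
]

raw_release = k_anonymous_release(records, k=2)
coarse_release = k_anonymous_release([generalize(r) for r in records], k=2)
print(len(raw_release), len(coarse_release))
```

Without pre-processing, every combination is unique and nothing can be released for k = 2. After generalization, the first two records become indistinguishable on the three attributes and can be released, while the third is still withheld.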

21-13 Discussion
- In the health-records example, the attributes ZIP code, gender, and date of birth form a “quasi-identifier”
- Determining which attributes are part of the quasi-identifier can be difficult
  - Should health information be part of it?
  - Some diseases are rare and could be used for re-identification
  - However, including them is bad for precision
- The quasi-identifier should be chosen such that released records do not allow any re-identification based on any additional data that an attacker might have
  - Clearly, we don’t know all this data

21-14 Value Swapping
- Data perturbation based on swapping values of some (not all!) data fields for a subset of the records
  - E.g., swap addresses in a subset of records
- Any linking done on the released records can no longer be considered necessarily true
- Trade-off between privacy and accuracy
- Statistically speaking, value swapping will make strong correlations less strong, and weak correlations might go away entirely
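
The effect on correlations can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the synthetic age/income data, the `swap_values` helper, the 30% subset size, and the use of a random shuffle within the subset as the swapping rule.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def swap_values(records, field, fraction, rng):
    """Return a copy in which `field` is shuffled among a random subset of the records."""
    out = [dict(r) for r in records]
    chosen = rng.sample(range(len(out)), int(fraction * len(out)))
    values = [out[i][field] for i in chosen]
    rng.shuffle(values)
    for i, value in zip(chosen, values):
        out[i][field] = value
    return out


rng = random.Random(0)
# Synthetic data with a perfect linear relationship between the two fields.
records = [{"age": a, "income": 1000 * a} for a in range(20, 70)]
released = swap_values(records, "income", fraction=0.3, rng=rng)

before = pearson([r["age"] for r in records], [r["income"] for r in records])
after = pearson([r["age"] for r in released], [r["income"] for r in released])
print(before, after)
```

The set of income values in the released table is unchanged, so column-wise statistics such as the mean are preserved. Only the pairing between age and income is disturbed, which is why the correlation cannot increase from its original perfect value.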

21-15 Adding Noise
- Data perturbation based on adding a small positive or negative error to each value
- Given the distribution of the data after perturbation and the distribution of the added errors, the distribution of the underlying data can be determined
  - But not its actual values
- Protects the privacy of individual values without sacrificing the accuracy of distribution-level results
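
A simplified Python sketch of this idea follows. It recovers only the mean and variance of the underlying data rather than the full distribution, and all parameters (the synthetic Gaussian data, uniform noise of half-width 10, the sample size) are assumptions chosen for illustration. It relies on the noise being zero-mean and independent of the data, so the mean is unchanged in expectation and the variances add.

```python
import random
import statistics

rng = random.Random(42)

NOISE_HALF_WIDTH = 10.0
# Variance of Uniform(-a, a) is a^2 / 3; the data miner is assumed to know this.
NOISE_VARIANCE = NOISE_HALF_WIDTH ** 2 / 3

# Synthetic "true" values that the data owner keeps private.
true_values = [rng.gauss(40, 12) for _ in range(10_000)]

# Released values: each true value plus an independent zero-mean error.
released = [v + rng.uniform(-NOISE_HALF_WIDTH, NOISE_HALF_WIDTH) for v in true_values]

# The miner sees only `released` and the noise distribution.
estimated_mean = statistics.mean(released)
estimated_variance = statistics.pvariance(released) - NOISE_VARIANCE

print(estimated_mean, statistics.mean(true_values))
print(estimated_variance, statistics.pvariance(true_values))
```

The estimates track the statistics of the private data closely, while each individual released value may be off by up to the noise half-width.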