Welcome to my Security Blog

Welcome to the security blog section of my site. I've been in IT for over 25 years, 15 of those in the security field. Beyond the usual goals (reducing fraud and risk, driving costs lower while improving coverage, etc.), some of my notable accomplishments in the field are:

- ArcSight SIEM as an antifraud tool (2005); this was later turned into the product FraudView by ArcSight

- Various leaps in device printing technology (2005-2017); details will be published in my device print blog

I've been in security a long time, and the topic of PII comes up more frequently than you can imagine. The typical business user of information wants a 'list' of data points that are PII. That list is easy to provide, but it changes over time. What's more troublesome is explaining how some combinations of data points turn a data set into PII.

The following data points are often used for the express purpose of distinguishing individual identity. Therefore they are clearly classified as PII under the definition used by the National Institute of Standards and Technology (NIST):

- Full name (if not common)
- Home address
- Email address (if private from an association/club membership, etc.)
- National identification number
- Passport number
- IP address (when linked, but not PII by itself in US)
- Vehicle registration plate number
- Driver's license number
- Face, fingerprints, or handwriting
- Credit card numbers
- Digital identity
- Date of birth
- Birthplace
- Genetic information
- Telephone number
- Login name, screen name, nickname, or handle

The following data points are traits shared by many people and cannot be used on their own to distinguish individual identity. However, they are potentially PII, because they may be combined with other personal information to identify an individual.

- First or last name, if common
- Country, state, postcode or city of residence
- Age, especially if non-specific
- Gender or race
- Name of the school they attend or workplace
- Grades, salary, or job position
- Criminal record
- Web cookie

When a person wishes to remain anonymous, descriptions of them will often employ several of the above, such as "a 21-year-old white female who works at Starbucks". Note that information can still be private, in the sense that a person may not wish for it to become publicly known, without being personally identifiable. Moreover, sometimes multiple pieces of information, none sufficient by itself to uniquely identify an individual, may uniquely identify a person when combined; this is one reason that multiple pieces of evidence are usually presented at criminal trials. It has been shown that, in 1990, 87% of the population of the United States could be uniquely identified by gender, ZIP code, and full date of birth.
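The way individually harmless fields combine into an identifier can be checked directly on a data set. The sketch below is a minimal illustration, not a method from this post: the records are invented toy data, and `unique_fraction` is a hypothetical helper that reports what share of rows is singled out by a given combination of fields.

```python
from collections import Counter

# Toy records, invented purely for illustration (not real data).
records = [
    {"gender": "F", "zip": "02139", "dob": "1990-04-12"},
    {"gender": "F", "zip": "02139", "dob": "1991-07-23"},
    {"gender": "M", "zip": "02139", "dob": "1985-11-30"},
    {"gender": "F", "zip": "60614", "dob": "1990-04-12"},
    {"gender": "M", "zip": "60614", "dob": "1972-01-05"},
]


def unique_fraction(rows, fields):
    """Return the fraction of rows whose combination of `fields` occurs exactly once."""
    keys = [tuple(row[f] for f in fields) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(rows)


for fields in (["gender"], ["gender", "zip"], ["gender", "zip", "dob"]):
    print(fields, unique_fraction(records, fields))
```

In this toy set no one is unique by gender alone, but every row becomes unique once ZIP code and date of birth are added, which is the same effect the 87% figure describes at national scale.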

So the question is: how do we determine whether the data points being requested constitute PII? I'd like to propose a simple way to calculate what I call Bits of Identity [1]. The formula can be adjusted for a city, state, country, the world, or anything in between.

After working in the device print area (see this site) for years, I realized the same issue exists there in reverse: which data points can I use to uniquely identify a device? In the identity game the question becomes: how many bits of identity does this data point add?

If you have data pertaining to the United States, whose population in 2010 was 309.3 million, then you require -Log2(1/309300000) bits of data to identify an individual. This equates to 28.2 bits. Now that we know our target, we can measure each data point against it.
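As a quick check of the target figure, here is a minimal Python sketch. The function name `bits_to_identify` is my own label for illustration; the population figure is the one quoted above.

```python
import math

US_POPULATION_2010 = 309_300_000  # population figure quoted in the text


def bits_to_identify(population: int) -> float:
    """Bits needed to single out one person: -log2(1/population) = log2(population)."""
    return math.log2(population)


print(round(bits_to_identify(US_POPULATION_2010), 1))  # 28.2
```

Passing a different population (a city, a state, the world) gives the target for that scope.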

For example, if a user needs to work with just gender, there are 2 choices (male and female). That single data point has -Log2(1/2) bits, or 1 bit of identity. So far so good. Now, what if your user wants ZIP code as well? There are ~43,000 ZIP codes in the US, so a ZIP code picked at random adds -Log2(1/43000) bits, or 15.39 bits of identity. The two data points together get you to 16.39 bits, still far short of the 28.2-bit target. However, some data points, ZIP code among them, are tricky because their values are not evenly distributed. For instance, some ZIP codes have only about 200 people in them. If you had one of those, you would be looking at -Log2(200/309300000), or 20.56 bits of identity. Add in gender and you are at 21.56 bits, which is getting close.
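The arithmetic in this example can be reproduced with a short sketch. The helper names are hypothetical, and summing the bits assumes the data points are statistically independent and, for the "choices" form, evenly distributed.

```python
import math

US_POPULATION_2010 = 309_300_000  # population figure quoted in the text


def bits_from_choices(choices: int) -> float:
    """Bits of identity for a data point with `choices` equally likely values."""
    return math.log2(choices)


def bits_from_group(group_size: int, population: int) -> float:
    """Bits of identity when `group_size` people out of `population` share a value."""
    return -math.log2(group_size / population)


gender = bits_from_choices(2)                             # 2 choices
average_zip = bits_from_choices(43_000)                   # ~43,000 ZIP codes
small_zip = bits_from_group(200, US_POPULATION_2010)      # ZIP code with 200 residents

# Adding bits is only valid if the data points are independent (an assumption).
print(round(gender + average_zip, 2))  # 16.39
print(round(gender + small_zip, 2))    # 21.56
```

The second form is the more general one: it uses the actual number of people sharing a value rather than assuming every value is equally common.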

Let us calculate some common data points below:

First or last name: this varies, but using the site HowManyOfMe we can estimate the following, assuming an even distribution:

- Last name: -Log2(1/151671) results in 17.21 bits of identity
- First name: -Log2(1/5163) results in 12.33 bits of identity

Age, especially if non-specific:

- Assuming an even distribution over ages 0 to 100: -Log2(1/100) results in 6.64 bits of identity
- For a specific age the group size matters. Taking 74-year-olds as an example (see Demography of the US), with roughly 2,000,000 people of that age, you are closer to -Log2(2000000/309300000), or 7.27 bits of identity

Gender or race:

- Gender: -Log2(1/2) results in 1 bit of identity
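To see how a requested set of data points compares with the target, the per-field estimates can be summed and checked against the 28.2-bit figure. This is a rough sketch under two assumptions: each field is evenly distributed and the fields are independent. The names `CHOICES` and `total_bits` are illustrative only; the counts are the ones used above.

```python
import math

US_POPULATION_2010 = 309_300_000  # population figure quoted in the text
TARGET_BITS = math.log2(US_POPULATION_2010)

# Number of distinct values per data point, taken from the estimates above.
CHOICES = {
    "last_name": 151_671,
    "first_name": 5_163,
    "age": 100,
    "gender": 2,
}


def total_bits(fields):
    """Sum of log2(choices) over the requested fields (assumes independence)."""
    return sum(math.log2(CHOICES[field]) for field in fields)


for requested in (["gender", "age"], ["first_name", "last_name"]):
    bits = total_bits(requested)
    verdict = "meets or exceeds" if bits >= TARGET_BITS else "falls short of"
    print(f"{requested}: {bits:.2f} bits, {verdict} the target")
```

Because real names and ages are far from evenly distributed, the sum is best read as a rough upper estimate rather than a guarantee that the combination identifies someone.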

[1] As I don't believe this idea has been published before, if you are using this idea or using it to create a derivative work, please give attribution to John Kula and this site.