Data Anonymization

Recently, talk of “anonymizing” or “pseudo-anonymizing” data has been picking up, both publicly online and in private conversations with our clients.

There have been questions on what these terms mean, what they mean for user privacy, and the pitfalls around the practice.

Currently, “anonymizing” is not defined or clearly addressed in TRUSTe’s privacy program requirements. However, we have developed an understanding of the practice over time that we apply evenly to all of the participants in our privacy programs. We also give clients best-practice guidance on this and other topics that our program requirements do not cover.

TRUSTe defines anonymizing as taking information that is currently Personally Identifiable Information (PII) and permanently turning it into non-identifying data. We define pseudo-anonymizing as taking data that is currently PII and turning it into non-identifying data that can later be returned from its anonymized state to PII.

One of the simplest forms of anonymization takes place every day on nearly every website: analytics. Services like Google Analytics take PII, such as an IP address combined with other detailed information, then anonymize and aggregate the data to provide useful graphs such as the percentage of site visitors that use Mozilla Firefox. In this situation, anonymization increases user privacy, because the site does not need to retain any PII to get the information it requires.
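As a rough illustration of aggregation as anonymization, the sketch below reduces per-visitor records to browser percentages. It is a minimal example under stated assumptions, not a description of how any analytics service actually works: the record layout, the sample values, and the `browser_share` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical raw page-view records; field names and values are illustrative only.
raw_hits = [
    {"ip": "203.0.113.5", "browser": "Firefox"},
    {"ip": "203.0.113.9", "browser": "Chrome"},
    {"ip": "198.51.100.7", "browser": "Firefox"},
    {"ip": "198.51.100.8", "browser": "Safari"},
]


def browser_share(hits):
    """Return the percentage of hits per browser.

    Only the browser field is read; IP addresses never reach the output.
    """
    counts = Counter(hit["browser"] for hit in hits)
    total = sum(counts.values())
    return {browser: 100.0 * n / total for browser, n in counts.items()}


report = browser_share(raw_hits)
print(report)
```

Once the report has been produced, the raw records can be discarded, and nothing in the report points back to an individual visitor.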

By contrast, data that can be easily de-anonymized, that is, turned back into PII, does not protect privacy in the same way. For that reason, TRUSTe does not recognize or recommend pseudo-anonymizing data. In fact, it may do more harm than good. Consumers may be confused or misled if you were to tell them that you retain no PII, when in fact you could at any moment (presto-changeo!) recover PII that has been pseudo-anonymized. TRUSTe believes that anonymization must be one-way in order to be effective.
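The difference between the two practices can be sketched in a few lines. This is an illustration only: `TokenVault` and `anonymize_one_way` are hypothetical names invented for the example, and real systems will look different.

```python
import secrets


class TokenVault:
    """Hypothetical pseudo-anonymizer: swaps PII for random tokens but keeps the mapping."""

    def __init__(self):
        self._token_to_pii = {}

    def pseudonymize(self, pii):
        # The token looks meaningless, but the mapping back to PII is retained.
        token = secrets.token_hex(8)
        self._token_to_pii[token] = pii
        return token

    def recover(self, token):
        # This is what makes the scheme reversible.
        return self._token_to_pii[token]


def anonymize_one_way(records):
    """Hypothetical one-way anonymizer: keep only a count, discard individual values."""
    return {"subscriber_count": len(records)}


emails = ["alice@example.com", "bob@example.com"]

vault = TokenVault()
token = vault.pseudonymize(emails[0])
print(vault.recover(token))       # the original email address comes back
print(anonymize_one_way(emails))  # only an aggregate remains
```

As long as the mapping inside the vault exists, the service still holds PII, whatever the stored tokens look like. The one-way version has nothing to recover.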

One of the most popular methods of anonymizing data (other than the aggregation described above) is hashing. It is important to note the concerns expressed by researcher Ed Felten (@EdFelten) on his blog, which question whether hashing is an effective way to anonymize data at all. Whether or not hashing is the best technological means to anonymize data, in many cases it does not have the privacy-protective effect many online service providers expect.
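One widely discussed weakness of hashing is offered here as general background rather than as a summary of the blog post cited above. When the set of possible inputs is small or guessable, anyone can hash every candidate and compare. The sketch below assumes an unsalted SHA-256 hash and a made-up range of 100 phone numbers; both are assumptions chosen to keep the example short.

```python
import hashlib


def sha256_hex(value):
    """Hash a string with unsalted SHA-256 (an assumption made for this example)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def reverse_by_enumeration(target_hash, candidates):
    """Return the candidate whose hash matches target_hash, or None if none does."""
    for candidate in candidates:
        if sha256_hex(candidate) == target_hash:
            return candidate
    return None


# Hypothetical stored value: the hash of a phone number from a small, known range.
stored_hash = sha256_hex("555-0142")

# Anyone holding the hash can try every number in that range.
candidates = (f"555-{n:04d}" for n in range(100, 200))
recovered = reverse_by_enumeration(stored_hash, candidates)
print(recovered)
```

The hash function itself is never broken here. The input space is simply small enough to search, which is why hashing alone may not deliver the one-way property it appears to promise.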

This is because a pitfall of anonymizing data is that in some circumstances, the anonymized (or pseudo-anonymized) data itself can be PII. For example, a web service may store a hash of a user’s email address and name, along with some other data associated with the hashed PII. The service believes the data has been anonymized and that they have retained no PII because an attacker obtaining the hash would not know what the hash means, and the service cannot recover the user’s email address and name. The associated data will only be recovered when the user next enters their email address and name. While this may be a good security rationale (although again see Ed Felten’s blog on why that may not be the case), it fails to understand the privacy implications by ignoring the definition of PII.

The definition of PII is not merely a list of important pieces of information such as a phone number, address, social security number, etc. TRUSTe defines Personally Identifiable Information (PII) as “any information or combination of information that can be used to identify, contact, or locate a discrete Individual”. In this situation, the entire reason for keeping the hashed data is to be able to identify a discrete user the next time they return to the site. Therefore, in this case, the hash is PII. So while it may still be a good practice to hash data in this way, online services need to understand that their obligations in how they treat this data (including notice and choice requirements) may not change simply because they believe the data has been anonymized.
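The scenario above can be sketched as follows. This is a minimal illustration, assuming an in-memory dictionary as the data store and SHA-256 as the hash; `user_key`, `save_preferences`, and `load_preferences` are hypothetical names, not any real service's API.

```python
import hashlib

# Hypothetical in-memory store: hashed identifier -> associated data.
store = {}


def user_key(email, name):
    """Derive a stable key from email and name; no plaintext is kept."""
    normalized = f"{email.strip().lower()}|{name.strip().lower()}"
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def save_preferences(email, name, prefs):
    store[user_key(email, name)] = prefs


def load_preferences(email, name):
    # The same person always produces the same key, so the hash singles
    # out one discrete individual each time they return.
    return store.get(user_key(email, name))


save_preferences("Alice@Example.com", "Alice Smith", {"newsletter": True})

# On a later visit the user types the same details and is recognized.
print(load_preferences("alice@example.com", "alice smith"))
print(load_preferences("bob@example.com", "Bob Jones"))
```

The store never holds a readable email address or name, yet it reliably recognizes a returning individual. That recognition is the behavior the definition of PII describes.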

Anonymization is a useful tool, but as with everything in the privacy world, context is the key. Online services need to take the time to fully understand their decision to anonymize data, and what it does and does not mean for user privacy.