What exactly is a GDPR Taxonomy and how can Azure Data Catalog help?

This blog contains a commentary on the GDPR, as Microsoft interprets it, as of the date of publication. The tools and services referenced herein are not designed to ensure GDPR compliance but to assist you and your organization with your data classification and categorization, an important step in the journey to compliance. The application of GDPR is highly fact-specific and not all aspects and interpretations of GDPR are well-settled. As a result, this blog is provided for informational purposes only and should not be relied upon as legal advice. We encourage you to work with a qualified legal professional to discuss the meaning and applicability of GDPR and how best to ensure compliance for you and your organization.

MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED, OR STATUTORY, AS TO THE INFORMATION IN THIS BLOG. This blog is provided “as-is.” Information and views expressed in this blog, including URL and other Internet website references, may change without notice. This blog does not provide you with any legal rights to any intellectual property in any Microsoft product.

About the Authors: Alice Kupcik is a Senior PM in the SQL Database Security team at Microsoft and is passionate about all things around data privacy & protection. Tony Smith is an Azure Data Solution Architect with a primary focus on analytics & data platforms.

GDPR is on the minds of IT managers & compliance officers and will be an ongoing requirement for many organizations. One of the initial challenges, however, is simply the identification & categorization of GDPR impacted data sets in disparate locations across the enterprise. Furthermore, it may not simply be a case of flagging data as being in or out of scope for GDPR. Instead, organizations may decide to define multiple levels of categorization and, in turn, may choose to apply these categories to data at the source, table & even attribute level. Finally, how do we scale the process of identification, logging and categorization of the potential multitude of relevant data stores?

Enter Azure Data Catalog, Microsoft’s solution to understanding & cataloging the data estate of any organization. Designed to promote self-serve data discovery, it has several out-of-the-box capabilities which intersect with & support parts of the GDPR process. Furthermore, as a SaaS solution, it has a minimal footprint and a low start-up cost, requires minimal training and can therefore be rapidly deployed at scale.

The previous post in this series of three explains the GDPR Discovery & Categorization issue and outlines how Azure Data Catalog is ideally positioned as a low-cost, rapidly deployable answer to that problem. This post will explain in brief what is involved in GDPR classification and will present a simplified taxonomy as an example. The final post will cover implementing that taxonomy in Azure Data Catalog following a step-by-step process.

Please note these posts are not intended to provide advice on the legal requirements of GDPR. Organizations will need to assess for themselves how they will meet their GDPR responsibilities.

Some organizations will have relatively simple & clear cut GDPR requirements. However, others may have more complex data sets which fall into a number of different categories. Those categories, in turn, may carry different levels of responsibility in terms of GDPR treatment. The logical result of this is a GDPR category taxonomy & a set of dependent policies for the categories within the taxonomy.

With this in mind, this post will introduce a simplified GDPR taxonomy to illustrate the types of categorization required. It should be clear that there cannot be a one-size-fits-all approach. Different business models will require different taxonomies. The example provided should not be treated as a recommendation for any specific organization. However, it should help illustrate some of the concepts potentially involved in GDPR classification which businesses may come across in their own investigations. Although this post is primarily about the categorization example, it will also highlight the areas where Azure Data Catalog can help as well as those areas needing additional repositories.

Why have a GDPR taxonomy?

GDPR is largely identified with personally identifiable information (PII) & there is certainly an overlap. However, GDPR classifications may be more complex. For example, a business address & email might not be considered as sensitive as a personal address & email. Some data might be fully anonymized & obfuscated, other data only partially so. Equally, some data may be anonymous in isolation but not when combined with other data. Broadly speaking, there are many categories of GDPR data and each set will have separate levels of sensitivity & applicable policies.

Depending on the particular context, implementing a data taxonomy may allow businesses to lay the groundwork for reducing their risk and/or compliance overhead. The most sensitive data can be identified as such & treated accordingly, allowing less rigid and more appropriate controls to be put in place for other data sets.

Taxonomy basis – categories

As mentioned previously, there is no definitive list of data categories. The following is put forward as an example and not an exhaustive representation.

The categories highlighted here are not complete. An organization may split some categories and merge others. However, they illustrate that categories can be more complex than they might first appear. For example, “anonymized” data may fall into two or more separate categories.

GDPR Example – Policies

There are also significant policies & responsibilities that will be created by GDPR, which might include items such as those below:

Again, this is not a complete or recommended set of policies but can be treated as an example of the work needed. Many of these policies may be identical across several categories. However, there will likely be specific policies required in some areas. In any case, some level of policy will likely need to exist.

As mentioned above, this post is primarily about GDPR concepts & categorizations and how Azure Data Catalog (ADC) can help in broad terms. ADC carries metadata and categorizations at a data source level and below. As such, it can store and apply the GDPR classifications which we will build up during this post (again, see the previous post for a high level explanation of ADC & GDPR). Technically, it could also hold links to the various policies which apply to individual data sets or attributes. However, this may not be particularly efficient, as multiple policies will apply to multiple data sets. Instead, it may be more useful to hold a link between GDPR categories & the GDPR policy versions which apply to those categories. The policies themselves will essentially be documents (or at least forms of some kind) and will need to be held on a website, SharePoint site or other central repository. ADC should not be a repository for those policies.
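The indirection described above, where data sets point to categories and categories point to policy versions, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the category names, asset names, policy names and URLs are all hypothetical, and the plain dictionaries simply stand in for wherever each mapping is actually held. No Azure Data Catalog API is used here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PolicyVersion:
    """A pointer to a policy document held in a central repository."""
    name: str
    version: str
    url: str  # hypothetical location, e.g. a SharePoint or intranet page


# Hypothetical link between GDPR categories and the policy versions
# that apply to them. Maintained once per category, not per data set.
CATEGORY_POLICIES: Dict[str, List[PolicyVersion]] = {
    "personal": [
        PolicyVersion("Retention", "1.0", "https://example.invalid/retention-v1"),
        PolicyVersion("Access control", "1.0", "https://example.invalid/access-v1"),
    ],
    "pseudonymized": [
        PolicyVersion("Retention", "1.0", "https://example.invalid/retention-v1"),
    ],
}

# Hypothetical category assignments for individual data assets
# (the kind of classification a catalog would carry).
ASSET_CATEGORIES: Dict[str, str] = {
    "crm.customers.email": "personal",
    "analytics.events.user_hash": "pseudonymized",
}


def policies_for_asset(asset: str) -> List[PolicyVersion]:
    """Resolve an asset's policies through its category."""
    category = ASSET_CATEGORIES.get(asset)
    if category is None:
        return []  # asset has not been categorized
    return list(CATEGORY_POLICIES.get(category, []))
```

With this shape, publishing a new version of a policy means updating one entry per category rather than revisiting every tagged data set.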

GDPR Example – Category Attributes

Although policies will apply at a detail level, it may also be useful to define certain characteristics at a high level.

For example, it’s likely useful to classify data which might at first glance appear to be GDPR scoped but which actually is not. Communication preferences (email, phone) may only be in scope for GDPR if contact information is referenced. Alternatively, although pseudonymized data may not need to carry communication preferences information, it may still be considered part of the customer’s profile and therefore potentially still subject to the “right to be forgotten”. Finally, some data may be subject to exceptions to those rights. An example would be sales transactions that are part of the business’s financial history, which may change their suitability for deletion.

The overall message here is that categorization and classification for GDPR may be complex for any given business depending on its particular model. The above are limited examples of questions organizations will have to decide upon.
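As a rough sketch of how such category-level attributes might be recorded and used, the Python below models three hypothetical categories loosely based on the examples above. The field names, category names and the resulting actions are assumptions made for illustration only. They are not a recommended rule set and do not represent legal guidance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CategoryAttributes:
    """High-level characteristics recorded once per category."""
    in_scope: bool               # treated as GDPR scoped by this organization
    erasure_applies: bool        # potentially subject to the "right to be forgotten"
    erasure_exception: str = ""  # reason deletion may not be suitable, if any


# Hypothetical categories and values, for illustration only.
CATEGORIES = {
    "comm_prefs_no_contact_info": CategoryAttributes(False, False),
    "pseudonymized_profile": CategoryAttributes(True, True),
    "sales_transactions": CategoryAttributes(True, True, "financial history"),
}


def erasure_action(category: str) -> str:
    """Suggest how an erasure request might be routed for a category."""
    attrs = CATEGORIES[category]
    if not attrs.in_scope or not attrs.erasure_applies:
        return "no action"
    if attrs.erasure_exception:
        # An exception does not settle the question; it flags it for review.
        return "review: " + attrs.erasure_exception
    return "delete"
```

Recording these decisions per category keeps them in one place, so each data set only needs to carry its category rather than its own copy of the rules.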

GDPR Example – Data Attributes

Finally, in our example we’ll present some sample attributes that might apply for each category. Some of these are fairly well known, such as “Name” & “Email” but others are not necessarily obviously GDPR relevant.

Again, these are examples for illustration only.

Conclusion

In this post we’ve built up a sample GDPR classification for a “typical” organization. We started with the base categories, which are the distinct areas into which GDPR type data will fall. We then presented some sample policy headings which would apply to those categories in part or whole. This was followed by some example additional attributes which can be assigned to categories, and we ended by presenting some examples of physical data attributes which would actually be tagged as GDPR relevant data. Although each organization will differ, the above should give a grounding in some of the areas to consider.

Having put this example together, the next post will walk step-by-step through the application of this taxonomy within Azure Data Catalog.