Solving GDPR Discovery via Azure Data Catalog

This blog contains a commentary on the GDPR, as Microsoft interprets it, as of the date of publication. The tools and services referenced herein are not designed to ensure GDPR compliance but to assist you and your organization with your data classification and categorization, an important step in the journey to compliance. The application of GDPR is highly fact-specific and not all aspects and interpretations of GDPR are well-settled. As a result, this blog is provided for informational purposes only and should not be relied upon as legal advice. We encourage you to work with a qualified legal professional to discuss the meaning and applicability of GDPR and how best to ensure compliance for you and your organization.

MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED, OR STATUTORY, AS TO THE INFORMATION IN THIS BLOG. This blog is provided “as-is.” Information and views expressed in this blog, including URL and other Internet website references, may change without notice. This blog does not provide you with any legal rights to any intellectual property in any Microsoft product.

About the Authors: Alice Kupcik is a Senior PM in the SQL Database Security team at Microsoft and is passionate about all things around data privacy & protection. Tony Smith is an Azure Data Solution Architect with a primary focus on analytics & data platforms.

GDPR is on the minds of IT managers & compliance officers and will be an ongoing requirement for many organizations. One of the initial challenges, however, is simply the identification & categorization of GDPR impacted data sets in disparate locations across the enterprise. Furthermore, it may not simply be a case of flagging data as being in or out of scope for GDPR. Instead, organizations may decide that they want to define multiple levels of categorization, and in turn they may choose to apply these categories to data at the source, table & even attribute level. Finally, how do we scale the process of identification, logging and categorization of the potential multitude of relevant data stores?

Enter Azure Data Catalog (ADC), Microsoft’s solution to understanding & cataloging the data estate of any organization. Designed to promote self-serve data discovery, it has several out-of-the-box capabilities which intersect with & support parts of the GDPR process. Furthermore, as a SaaS solution it has a minimal footprint and low start-up cost, requires minimal training, and can therefore be rapidly deployed at scale.

This subject will be covered over 3 related posts. Firstly, we’ll present the GDPR discovery problem, introduce Azure Data Catalog in more detail and illustrate how its capabilities can be applied to GDPR. As GDPR categorization is itself potentially complex, the following post will discuss GDPR taxonomies and offer a simplified GDPR example. Finally, the 3rd post will go step by step through the process of implementing that GDPR taxonomy in Azure Data Catalog.

Please note these posts are not intended to provide advice on the legal requirements of GDPR. Organizations will need to assess for themselves how they will meet their GDPR obligations.

GDPR – the data discovery problem statement

There are separate categories of GDPR relevant data. Some may be anonymized and therefore unidentifiable. Other data may be held in security logs and therefore need to be retained for audit & security reasons.

These varying forms of GDPR data can lead to the need for a taxonomy, a hierarchy of GDPR data classifications, with each classification potentially requiring a distinct policy for retention, correction or removal. In any case, to comply with, for example, a “request to forget” (a customer asking to have all personal records deleted), an organization may want to create an inventory of every potential sink of customer data which could be covered by such a request, as well as a view into whether that data is GDPR relevant.

Personal Data just got more complex…

The definition of “personal data” has evolved in the past few years. Instead of referring to the obvious identifiers of name and address, it can potentially now include additional data that is linked or linkable to an individual, for example via a random identifier (such as a GUID or hash value) assigned by an internal system.

Data sets that have had the obvious identifiers removed but still contain a random unique identifier are referred to as “pseudonymous” data sets, and they are very much in scope for GDPR. Unfortunately, due to the lack of “obvious identifiers”, these data sets are even more difficult to find & classify.
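To make the idea of a pseudonymous data set concrete, here is a minimal Python sketch. The record layout, the salt and the use of a salted SHA-256 hash as the internal identifier are all assumptions made for illustration; the post does not say how such identifiers are generated in practice.

```python
import hashlib

# Hypothetical salt; a real system would manage this as a secret.
SALT = b"example-salt"

OBVIOUS_IDENTIFIERS = {"name", "email", "address"}


def pseudonymize(record: dict) -> dict:
    """Drop the obvious identifiers and replace them with a stable hash key."""
    normalized_email = record["email"].lower().encode("utf-8")
    key = hashlib.sha256(SALT + normalized_email).hexdigest()
    remainder = {k: v for k, v in record.items() if k not in OBVIOUS_IDENTIFIERS}
    return {"customer_key": key, **remainder}


# Hypothetical order records.
orders = [
    {"name": "A. Example", "email": "a@example.com", "address": "1 Main St", "item": "book"},
    {"name": "A. Example", "email": "A@example.com", "address": "1 Main St", "item": "lamp"},
    {"name": "B. Sample", "email": "b@example.com", "address": "2 High St", "item": "pen"},
]

pseudonymous = [pseudonymize(order) for order in orders]

# No name, email or address remains, yet rows for the same person share a key.
linked = {}
for row in pseudonymous:
    linked.setdefault(row["customer_key"], []).append(row["item"])
```

Even with every obvious identifier gone, the rows can still be grouped per individual through customer_key, which is why such data sets remain in scope and why they are hard to spot by looking for familiar column names.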

Azure Data Catalog – a Primer

Before outlining how ADC can help, it is worth spending some time introducing the tool and how it operates. Some of these concepts are covered in more detail in a previous blog post which shows how ADC can help make sense of your data lake.

ADC is a SaaS service offered by Microsoft aimed at categorizing and surfacing the entire data estate for an enterprise. Although offered on Azure, it’s intended to be used across almost any data source in the cloud or on premises and has connectors for most of the major data sources available (with more being added).

ADC can hold tags, glossary terms (more on this later), friendly names, descriptions, experts and other metadata for a given data set within a data source. It is also designed to follow a “crowdsourcing” model by default in that many users in different roles (Business Analyst, DBA etc.) can add their own metadata for a given data source without overwriting each other – all versions are held against the data source distinctly & simultaneously. That said, it has (and is adding) additional role based capabilities to limit update capability to certain users (such as a data steward).

The primary purpose of ADC is to provide rich support for self-serve data discovery as users can now search for data sources, find SMEs and identify the process to request access for data sets. However, its rich metadata capabilities, particularly its support for tags & glossaries, make it eminently suitable for supporting parts of the GDPR categorization problem.

ADC Glossaries

Any user can add a tag to an element in ADC. However, like all tagging, this is subject to potential duplication – 3 different users can use “Customer Info”, “Cust Inf.” & “Customer” as distinct tags and all mean the same thing. Glossaries can address this issue as well as the related issue of the same business term having different meanings depending on context.
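The duplication problem can be illustrated with a small Python sketch. This is not how ADC implements glossaries; the glossary contents and the function names are hypothetical and only show how a controlled list of terms collapses the three free-form tags above into one.

```python
from typing import Optional

# Hypothetical glossary: each approved term lists the free-form tags it replaces.
GLOSSARY = {
    "Customer Information": {"Customer Info", "Cust Inf.", "Customer"},
}


def _normalize(tag: str) -> str:
    """Lower-case the tag and collapse runs of whitespace."""
    return " ".join(tag.lower().split())


# Reverse lookup from normalized free-form tag to approved term.
LOOKUP = {
    _normalize(alias): term
    for term, aliases in GLOSSARY.items()
    for alias in aliases | {term}
}


def to_glossary_term(tag: str) -> Optional[str]:
    """Return the approved term for a tag, or None if a steward should review it."""
    return LOOKUP.get(_normalize(tag))


user_tags = ["Customer Info", "Cust Inf.", "Customer", "Shipping"]
resolved = {tag: to_glossary_term(tag) for tag in user_tags}
```

Tags that do not resolve (here, "Shipping") come back as None, which is the natural point for a data steward to decide whether a new glossary term is needed.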

Example GDPR Taxonomy entry in Azure Data Catalog

Glossaries provide a preformed taxonomy or hierarchy of terms which can be defined up front and independently of any data sources, perhaps by a nominated role such as a data steward. This has particular relevance for GDPR, which naturally leads to a taxonomy of data.

Sample GDPR Taxonomy

Glossaries provide the structure around which GDPR classifications can be defined and supported. Further, because the same term can have different meanings in different contexts, each meaning can be defined with a separate description within a distinct taxonomy.
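As a rough illustration of what such a hierarchy looks like as data, the Python sketch below models a taxonomy as a map from each term to its parent. The term names are assumptions loosely based on the categories mentioned in this post ("PII Data", "Address", pseudonymous data, security logs); they are not the taxonomy presented in the follow-up posts, and this is not ADC's internal representation.

```python
from typing import Dict, List, Optional

# Hypothetical taxonomy: each term maps to its parent (None marks the root).
TAXONOMY: Dict[str, Optional[str]] = {
    "GDPR": None,
    "PII Data": "GDPR",
    "Address": "PII Data",
    "Email": "PII Data",
    "Pseudonymous Data": "GDPR",
    "Security Log Data": "GDPR",
}


def path_to_root(term: str) -> List[str]:
    """Return the term followed by each of its ancestors, ending at the root."""
    path = []
    current: Optional[str] = term
    while current is not None:
        path.append(current)
        current = TAXONOMY[current]
    return path


def children(term: str) -> List[str]:
    """Return the direct children of a term in alphabetical order."""
    return sorted(child for child, parent in TAXONOMY.items() if parent == term)
```

Because every term knows its parent, a policy attached to a higher-level node (for example, a retention rule on "PII Data") can be looked up for any leaf term by walking path_to_root.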

Azure Data Catalog & GDPR

So with that introduction to ADC, what can it offer and how can it help in GDPR initiatives? Below are some key examples.

Create and enforce a rich GDPR classification taxonomy

At the lowest level, individual attributes can be tagged as “Email”, “Social security number” or “Mobile number”. GDPR categorizations, however, are more complex. ADC can support the development of these complex taxonomies up front and as they evolve, and can then apply them to data sets as needed.

Capture every data set from any location in the Enterprise

ADC has an extensive set of connectors reaching across virtually all major data sources. With these connectors it can interrogate data repositories and pull back column names, types and even a preview of the data held in some cases (depending on the capabilities of the connector). As such it can very quickly build an outline view of all major tables and data sets across the organization, down to the column level, depending on the source. However, even if a data source is not supported, ADC allows for manual entries to be set up for those sources.
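Once column names have been pulled back, a first pass over them can suggest where to look. The Python sketch below is a simple name-matching heuristic and is not an ADC feature; the patterns and the column listing are hypothetical, and the listing format is an assumption rather than anything ADC exports.

```python
import re

# Hypothetical name patterns; tune these for your own naming conventions.
PATTERNS = {
    "Email": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "Mobile number": re.compile(r"mobile|cell|phone", re.IGNORECASE),
    "Social security number": re.compile(r"ssn|social[-_]?sec", re.IGNORECASE),
}

# Hypothetical (table, column) listing gathered from a catalog.
columns = [
    ("crm.customers", "EmailAddress"),
    ("crm.customers", "MobilePhone"),
    ("hr.staff", "SSN"),
    ("sales.orders", "OrderTotal"),
    ("web.sessions", "UserGuid"),
]


def suggest_terms(column_name):
    """Return the terms whose pattern appears in the column name."""
    return [term for term, pattern in PATTERNS.items() if pattern.search(column_name)]


# Keep only the columns for which at least one term was suggested.
candidates = {}
for table, column in columns:
    terms = suggest_terms(column)
    if terms:
        candidates[(table, column)] = terms
```

Note that UserGuid is not flagged: a name-based pass finds the obvious identifiers but misses pseudonymous ones, so its output is a starting list for human review rather than a classification.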

Support Governance & Ownership

Azure Data Catalog provides placeholders for stakeholders of the GDPR taxonomy, owners of tables/data sources and the list of experts for tables/data sources (based on entries in the organizational Active Directory). Although there are no specific placeholders for individual policies such as export, correction and deletion processes, ADC does provide a “freeform” documentation tab at the table and data source level which can be used to expose links to these processes.

Categorization at multiple levels

ADC supports tags or glossary terms at the data source, table or individual attribute level, and multiple tags & terms can be applied. In the 3rd post, we’ll show how “leaf node” categorizations (Address) and higher level nodes (PII Data) can both be applied, such that the data set is marked “contains GDPR data” and individual attributes within that data set are flagged with actual GDPR categorizations.
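The relationship between attribute-level and data-set-level categorization can be sketched in a few lines of Python. The catalog entries and term names below are hypothetical, and in ADC itself the terms are applied through the portal or API rather than derived like this; the sketch only shows the logic of rolling column terms up to a table-level marker.

```python
# Hypothetical glossary terms that count as GDPR categorizations.
GDPR_TERMS = {"Address", "Email", "PII Data"}

# Hypothetical catalog entries: table -> column -> terms applied to that column.
tables = {
    "crm.customers": {"Street": {"Address"}, "Contact": {"Email"}, "Tier": set()},
    "ref.currencies": {"Code": set(), "Name": set()},
}


def table_level_terms(column_terms):
    """Derive the table-level marker from the terms applied to its columns."""
    used = set().union(*column_terms.values()) & GDPR_TERMS
    return {"contains GDPR data"} if used else set()


# Table-level view: which data sets carry the marker.
summary = {name: table_level_terms(cols) for name, cols in tables.items()}
```

Keeping both levels means a search for “contains GDPR data” finds the data set quickly, while the column terms still say exactly which attributes are affected.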

Cost effective, minimal footprint crowd sourcing model

Azure Data Catalog is extremely cost effective. There is a fully functional free tier, and the standard tier, at the time of this writing, is in the region of $1 per user per month. As such it can be rolled out very quickly to hundreds or thousands of users, who in turn can be tasked with cataloging any and all instances of Customer/GDPR data. Rather than tasking a team of data stewards with tracking down every GDPR data source, the whole enterprise can be leveraged, with the data stewards tidying up and qualifying new entries as they come in.

Flexible REST API

ADC has a functional and straightforward interface and it also provides a complete RESTful API (everything available in the UI is available via an API). This makes it extremely extensible.

Bulk load tools can be built quickly to export existing entries into a spreadsheet or similar. These entries could then be retagged in the spreadsheet (if, for example, the taxonomy entries have changed) and then written back into ADC via the API.
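The retagging step in the middle of that round trip can be sketched in Python using only the standard library. The CSV layout (table, column, term) and the rename map are assumptions for illustration; the export from ADC and the write-back through its API are deliberately left out, since their request formats are not covered in this post.

```python
import csv
import io

# Hypothetical rename map after a taxonomy change: old term -> new term.
RENAMES = {"Cust Info": "Customer Information", "PII": "PII Data"}


def retag(exported_csv: str) -> str:
    """Rewrite the 'term' column of an exported sheet using RENAMES."""
    reader = csv.DictReader(io.StringIO(exported_csv))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Terms without a rename entry (including blanks) pass through unchanged.
        row["term"] = RENAMES.get(row["term"], row["term"])
        writer.writerow(row)
    return out.getvalue()


# Hypothetical export, as it might look after being saved from a spreadsheet.
exported = (
    "table,column,term\n"
    "crm.customers,Email,PII\n"
    "crm.customers,Tier,\n"
    "sales.orders,Buyer,Cust Info\n"
)

updated = retag(exported)
```

Doing the rename as a pure transformation on the exported file makes it easy to review the result in a spreadsheet before anything is written back to the catalog.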

ADC entries can be used to check against audit logs, to determine who is accessing GDPR relevant data and what they are doing with it (for example the data source connection string held in ADC can be used to monitor or even prevent exports). While not a direct feature of ADC, once categorizations of data sets have been applied, it can act as a rich supporting repository for tracking, logging or even blocking questionable activity against GDPR data stores (such as exports to secondary sources).
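A minimal Python sketch of that cross-check follows. Both the catalog extract and the audit log schema are hypothetical (real audit logs will have their own fields and action names), and the matching on source and table name is an assumption about how the two would be joined.

```python
# Hypothetical catalog extract: data source -> tables tagged as GDPR relevant.
gdpr_tables = {
    "sql-prod-01": {"crm.customers", "hr.staff"},
}

# Hypothetical audit log entries.
audit_log = [
    {"user": "dana", "source": "sql-prod-01", "table": "crm.customers", "action": "EXPORT"},
    {"user": "lee", "source": "sql-prod-01", "table": "ref.currencies", "action": "EXPORT"},
    {"user": "sam", "source": "sql-prod-01", "table": "hr.staff", "action": "SELECT"},
]


def flag_events(events, watched_actions=frozenset({"EXPORT"})):
    """Return events that perform a watched action on a GDPR relevant table."""
    return [
        event
        for event in events
        if event["action"] in watched_actions
        and event["table"] in gdpr_tables.get(event["source"], set())
    ]


flagged = flag_events(audit_log)
```

Here only the export from crm.customers is flagged: the export from ref.currencies touches no GDPR relevant table, and the SELECT on hr.staff is not a watched action. Widening watched_actions changes what counts as questionable activity.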

Conclusion

GDPR is a hot topic and high on the priority list for many organizations. While other tools exist on the market, ADC represents a low-cost (even free) option which can be deployed essentially overnight with minimal training. Usage can be scaled up to address the short-term GDPR priority and then scaled down again. The entire data set can be exported, meaning there is no lock-in. It should also be noted that at the end of the ADC GDPR process, not only will the organization have moved towards addressing some key GDPR burning questions, but it will also have generated a rich picture of the entire data estate, greatly promoting and accelerating self-serve data discovery, all for about $1 per user per month.

In the next post we’ll take a sample GDPR taxonomy and walk through its components. The final post will take that taxonomy and provide step by step guidance on how to apply it via Data Catalog.