As enterprises around the world bring more of their sensitive data into Hadoop data lakes, democratizing access to data without sacrificing strong security principles becomes paramount. According to a recent research report by Securosis, “Hadoop has (mostly) reached security parity with the relational platforms of old, and that’s saying a lot given their 20-year head start.” Despite these advances, data anonymization and data security continue to pose interesting challenges to enterprises that need to protect sensitive information for regulatory, competitive, or privacy reasons.

Hortonworks continues to invest in functionality that gives our Hortonworks Data Platform customers seamless and robust data security capabilities, easing their security administration burden while empowering them with effective data access. Data lakes require new patterns to access and manage the sheer volume of data. Data masking and filtering capabilities offer a viable approach to managing thousands of individual data views for consumers throughout the organization. This approach provides an agile way of presenting customized views of data, as compared to an inflexible mapping of data to user groups.

In this two-part blog, we will review two new data protection capabilities in Apache Ranger: dynamic column masking and row-level filtering of data stored in Apache Hive. These features were introduced as part of our latest HDP 2.5 platform release. In this first part, we will review the rationale, usage scenarios, and background; in Part 2, we will walk through the technical implementation of these new features.

Dynamic Data Masking

“Without wearing any mask that we are conscious of, we have a special face for each friend” – Oliver Wendell Holmes

Data masking technologies aim to protect data while preserving its usefulness. While data is the lifeblood of the modern data-driven enterprise, the ubiquity and scale of data access across enterprises can pose serious risks: credit card numbers, personally identifiable information (PII) such as email addresses, social security numbers, and national identification numbers, health care information, and good old-fashioned intellectual property (IP) are all targets for attackers. Sensitive data sets are appearing and expanding ever more rapidly as enterprises grow their Hadoop data lakes by ingesting data from other business applications and centralizing it in their Hadoop platform. Additionally, democratizing access to such data lakes adds significantly to the attack surface within an enterprise.

To reduce the risk of compromising such data, it has become imperative not only to eliminate unwanted access via authorization policies but also to provide suitably anonymized versions of sensitive data, depending on the context in which a user accesses it. Dynamic data masking provides this layer of data security by replacing sensitive information with a non-sensitive pseudonym or proxy at runtime (i.e. upon access or use), in such a way that the proxy looks and acts like the original. This means that the non-sensitive proxy data can be used in business processes without impacting the supporting applications or the underlying data storage. Dynamic data masking functions as a data shepherd that applies protection to sensitive information as it is accessed by the Hadoop applications that implement such business processes.

Dynamic data masking via Apache Ranger enables security administrators to ensure that authorized users see only the data they are permitted to see, while for other users or groups the same data is masked or anonymized to protect sensitive content. The process of dynamic data masking does not physically alter the data or make a copy of it. The original sensitive data does not leave the data store; rather, the data is obfuscated when it is presented to the user. Apache Ranger 0.6, included with HDP 2.5, introduces a new type of authorization policy called “Masking Policy” that defines which specific data fields are masked and the rules for how to anonymize or pseudonymize that data. For example, a security administrator may choose to mask credit card numbers when they are displayed to customer service personnel, such that only the last four digits are rendered, in the form XXXX-XXXX-XXXX-0123. The same applies to other sensitive data, such as social security numbers or email addresses, which can be rendered in different formats based on data masking rules.
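To make the credit card example concrete, here is a minimal Python sketch of masking applied at read time. It only illustrates the idea and is not Ranger's API or implementation: the function names (`mask_show_last_4`, `present`) and the group names are assumptions made up for this example.

```python
def mask_show_last_4(value: str) -> str:
    """Replace every digit except the last four with 'X', keeping separators."""
    total_digits = sum(ch.isdigit() for ch in value)
    digits_seen = 0
    masked = []
    for ch in value:
        if ch.isdigit():
            digits_seen += 1
            # Keep a digit only if it is one of the final four digits.
            masked.append(ch if digits_seen > total_digits - 4 else "X")
        else:
            masked.append(ch)
    return "".join(masked)


def present(value: str, user_groups: set, unmasked_groups: set) -> str:
    """Return the stored value for authorized groups, a masked copy otherwise.

    The stored value itself is never modified; masking happens only on the
    way out to the caller. Group names are hypothetical.
    """
    if user_groups & unmasked_groups:
        return value
    return mask_show_last_4(value)


if __name__ == "__main__":
    card = "4111-1111-1111-0123"
    print(present(card, {"customer_service"}, {"fraud_analysts"}))
    print(present(card, {"fraud_analysts"}, {"fraud_analysts"}))
```

The stored value is never rewritten. The same record yields a different rendering depending on who asks, which is the behavior a masking policy describes.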

Row Level Filtering

“No two persons ever read the same book in the same way and have the same perspective!” – Edmund Wilson

As enterprise data is consolidated into data lakes, there is increased co-mingling of data with different usage contexts, patterns of usage, and levels of sensitivity. Security administrators are increasingly challenged to devise a rational strategy for giving each user access to only the slice of data in this mixed-use environment that fits that user's specific usage context. This means they need to control access at a very granular level, restricting not only which columns but also which sets of rows in a database a specific user is permitted to see and process, based on external regulations, internal control policies, and privacy considerations.

Row-level security enables security administrators to define precisely which rows in a database table can be accessed by users, based on user attributes such as their membership in a specific group or the runtime context in which they are querying data. This functionality enhances the reliability and robustness of HDP by providing increased security for data access to Apache Hive tables. Row-level security reduces the surface area of security systems by moving access restrictions to the Hive layer. Hive applies these restrictions reliably and in a performant fashion every time the data is accessed from Hive tables, irrespective of the access tier. Row-level security simplifies the authoring and administration of row-level data segmentation and provides fine-grained control over which rows a particular filter will operate on. It also simplifies security for applications running on Hive by enforcing row-level segmentation seamlessly behind the scenes, without having to add this logic to the predicate of each query.

Row-level filtering has cross-industry application for data protection. Some example application scenarios are outlined below.

Healthcare: A hospital can create a security policy that allows doctors to view data rows only for their own patients, and that allows insurance claims administrators to view only specific rows for their specific site.

Financial Services: A bank can create a policy to restrict access to rows of financial data based on the employee’s business division, locale, or role (for example: only employees in the finance department are allowed to see customer invoices, payments, and accrual data; only European HR employees can see European employee data).

Information Technology: A multi-tenant application can create logical separation of each tenant’s data so that each tenant can see only their own data rows.

Apache Ranger 0.6, included with HDP 2.5, introduces a new type of authorization policy called “Row Filtering Policy”. Such policies run as a behind-the-scenes query filter condition that narrows the set of Hive table rows displayed in a Hive query output to the authorized set defined by the policy. In effect, these policy filter conditions are always on and are evaluated automatically upon access, which removes the need for security administrators to add filtering predicates manually or to create multiple views to enforce row-level security in Apache Hive.
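Conceptually, an always-on filter condition behaves as if each query read from a pre-filtered version of the table. The Python sketch below imitates that idea with an in-memory SQLite database standing in for Hive. Everything in it is an assumption for illustration: the `ROW_FILTERS` mapping, the group names, the `customers` table, and the choice to return no rows for a group without a filter are not taken from Ranger or Hive.

```python
import sqlite3

# Hypothetical policy store: group name -> row filter expression.
ROW_FILTERS = {
    "us_support": "region = 'US'",
    "eu_support": "region = 'EU'",
}


def filtered_source(table: str, group: str) -> str:
    """Return a subquery that exposes only the rows the group may see."""
    # Sketch-only choice: a group with no filter defined sees no rows.
    predicate = ROW_FILTERS.get(group, "1 = 0")
    return f"(SELECT * FROM {table} WHERE {predicate}) AS {table}"


def run_as(conn: sqlite3.Connection, group: str, select_list: str, table: str):
    """Run a query on behalf of a group; the caller never writes the filter."""
    sql = f"SELECT {select_list} FROM {filtered_source(table, group)}"
    return conn.execute(sql).fetchall()


def demo_connection() -> sqlite3.Connection:
    """Build a small in-memory table with made-up rows for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (name TEXT, region TEXT)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [("Alice", "US"), ("Bob", "US"), ("Bernd", "EU")],
    )
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    print(run_as(conn, "us_support", "name", "customers"))
    print(run_as(conn, "eu_support", "name", "customers"))
```

Both callers issue the same query text; only the group they belong to changes which rows come back. This is why no per-group views or hand-written predicates are needed.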

Example Use Case

Consider a hypothetical medical insurance company, HealthFul Inc., that operates out of the United States and the European Union. In HealthFul’s data lake, data for all of their customers across all regions is co-mingled in a single Hive table. The data includes many sensitive facets such as their customers’ national IDs (Social Security Numbers for US persons, or the German Personalausweis for German nationals), credit card numbers, Medical Record Numbers (MRN), and insurance policy IDs. Due to the business context in which this firm operates, it is subject to EU privacy regulations as well as HIPAA. EU privacy laws mandate that EU person data should only be accessed by EU personnel. HIPAA requires suitable encryption and anonymization of sensitive medical and PII information. This scenario and how Apache Ranger’s capabilities are used to provide data security are illustrated in the graphic below:

HealthFul can set up masking policies in Apache Ranger to dynamically mask sensitive content in the relevant Hive columns for the US customer support personnel who service their customers’ health policies. The company can additionally use Apache Ranger’s row filtering capabilities for Hive tables to segment the rows so that US support teams can view only the records of US customers, while EU personnel can view only the records of EU customers.
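The combined effect for HealthFul can be sketched in a few lines of Python: filter the rows first, then mask columns in the rows that remain. This is a toy model, not how Ranger stores or evaluates policies. The `GroupPolicy` class, the `redact` and `view_for` helpers, the column names, and the sample records are all made up for illustration, and leaving the EU policy without masks is simply a choice of this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Record = Dict[str, str]


@dataclass
class GroupPolicy:
    """Hypothetical per-group policy: one row filter plus per-column masks."""
    row_filter: Callable[[Record], bool]
    column_masks: Dict[str, Callable[[str], str]] = field(default_factory=dict)


def redact(value: str) -> str:
    """Replace every letter and digit with 'X', keeping separators."""
    return "".join("X" if ch.isalnum() else ch for ch in value)


def view_for(records: List[Record], policy: GroupPolicy) -> List[Record]:
    """Filter rows first, then mask columns in the rows that remain."""
    visible = []
    for record in records:
        if not policy.row_filter(record):
            continue
        # Build a new dict so the stored record is left untouched.
        visible.append({
            column: policy.column_masks.get(column, lambda v: v)(value)
            for column, value in record.items()
        })
    return visible


# Made-up sample rows; the IDs are placeholders, not real identifiers.
CUSTOMERS = [
    {"name": "Alice", "region": "US", "national_id": "123-45-6789"},
    {"name": "Bernd", "region": "EU", "national_id": "T22000129"},
]

US_SUPPORT = GroupPolicy(
    row_filter=lambda r: r["region"] == "US",
    column_masks={"national_id": redact},
)
EU_SUPPORT = GroupPolicy(row_filter=lambda r: r["region"] == "EU")

if __name__ == "__main__":
    print(view_for(CUSTOMERS, US_SUPPORT))
    print(view_for(CUSTOMERS, EU_SUPPORT))
```

Each group gets its own view of the same co-mingled table, and the underlying records are never copied into per-region tables or altered in place.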

In Part 2 of this blog, we will walk through the technical details of how to implement row filtering and dynamic column masking security policies for Hive data in Apache Ranger, and how these can be set up and validated via an example scenario.