Implementation of the First Version of the Data Gatekeeper

The Data Life Cycle is the sequence of stages that a unit of data goes through from its creation to its deletion. It specifies four operations on data, namely Create, Read, Update, and Delete (CRUD).
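As a minimal sketch, the four Data Life Cycle operations can be modelled as an interface over an in-memory store; the class and method names here are illustrative, not part of the Data Gatekeeper API:

```python
# Minimal illustration of the four Data Life Cycle operations (CRUD)
# over an in-memory store. All names are illustrative.

class DataStore:
    def __init__(self):
        self._records = {}

    def create(self, key, value):
        if key in self._records:
            raise KeyError(f"{key} already exists")
        self._records[key] = value

    def read(self, key):
        return self._records[key]

    def update(self, key, value):
        if key not in self._records:
            raise KeyError(f"{key} does not exist")
        self._records[key] = value

    def delete(self, key):
        del self._records[key]
```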

Management of the Data Life Cycle is a crucial requirement for businesses that handle personal data. When these businesses deploy their data store on a cloud platform owned by a third party, management of the Data Life Cycle has to be shared between the stakeholders.

Under the European Union's General Data Protection Regulation, in force since May 2018, the Data Subject has a major role in deciding how his/her personal data is used. To ensure that the Data Subject's requirements are taken into account, a dedicated entity, the Data Controller, has to take the necessary actions to comply with them [1].

Our objective in the Rest Assured Horizon2020 project is to provide facilities for a Data Controller to manage the life cycle of data owned by Data Subjects, taking their requirements into account. To do so, we provide the Data Gatekeeper, a framework that collects planned usages of personal data from Service Providers, and requirements from Data Subjects about the access, processing, storage, and usage of their personal data.

Based on these inputs, the Data Gatekeeper generates security policies.

These security policies are then enforced each time an operation is requested on data, by a component called the Data Protection Decision Point. The Data Protection Decision Point is the core component of the Data Gatekeeper. When it receives a processing request, it evaluates the security policies and generates a decision, granting or denying the request.
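A highly simplified sketch of this decision step, assuming a policy is just a set of (data subject, usage) pairs that have been allowed; the real Data Protection Decision Point evaluates RDF Sticky Policies as described later in this section:

```python
# Hypothetical sketch: a decision point granting or denying a request.
# A "policy" here is simply a set of allowed (data_subject, usage) pairs;
# the actual component evaluates RDF Sticky Policies instead.

PERMIT, DENY = "PERMIT", "DENY"

def decide(policy, data_subject, usage):
    """Return PERMIT if the requested usage is allowed, DENY otherwise."""
    return PERMIT if (data_subject, usage) in policy else DENY
```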

Figure 1: High Level Architecture of the Data Gatekeeper

A Sticky Policy is a Security Policy associated with a piece of data. Any processing on the data has to comply with its Sticky Policy.

The association between the piece of data and its Sticky Policy can be a hard bound. In this case, the data itself is modified to embed the Sticky Policy. This often makes it difficult to process the data or to update the security policy.

Another solution is to use a logical bound (soft bound) between the piece of data and its Sticky Policy. In this case, the data is left untouched and the security policy can be updated easily. In return, enforcement of the security policy is more complex and depends on the structure of the system and the technologies used. In some cases, it also has to take into account the movement and replication of data.
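A soft bound can be sketched as a separate registry mapping data identifiers to policies, so that attaching or updating a policy never modifies the data itself; the identifiers and structures below are illustrative:

```python
# Sketch of a soft (logical) bound: the Sticky Policy lives in a
# separate registry, keyed by the data item's identifier (e.g. a URI),
# so the data itself is never modified. Names are illustrative.

policies = {}   # data URI -> sticky policy
data = {}       # data URI -> raw data, stored untouched

def attach_policy(uri, policy):
    policies[uri] = policy          # updating a policy never touches the data

def policy_for(uri):
    return policies.get(uri)
```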

Linked Data technologies are used in the Semantic Web for publishing structured data and their dependencies. In particular, a piece of data can be uniquely identified by a Uniform Resource Identifier (URI). The Resource Description Framework (RDF) is a data model for representing this structured data [2]. It uses triples: subject – predicate – object. The subject and the object are both resources, linked by the predicate. An object in one triple can be the subject of another triple. Ontologies can be defined on top of RDF datasets; they allow reasoning and specific querying over the datasets.
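As an illustration, triples can be represented as (subject, predicate, object) tuples; note how the object of the first triple appears as the subject of the second, which is what links resources into a graph (the URIs below are illustrative):

```python
# RDF triples sketched as (subject, predicate, object) tuples.
# The object of one triple ("ex:bob") is the subject of another,
# which links resources into a graph. URIs are illustrative.

triples = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob",   "foaf:name",  "Bob"),
    ("ex:alice", "foaf:name",  "Alice"),
]

def objects(subject, predicate):
    """All objects linked to `subject` by `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]
```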

SPARQL Query [3] and SPARQL Update [4] are query languages over RDF datasets. They support SPARQL Property Paths, which filter an RDF dataset based on an ordered list of predicates. A SPARQL Property Path traverses an RDF dataset, starting from a subject, following the list of predicates, and leading to a set of objects.
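The traversal performed by a property path can be sketched by hand: start from a subject, and at each step replace the current set of resources with the objects reached through the next predicate (the triples and URIs here are illustrative):

```python
# Sketch of a SPARQL Property Path evaluated by hand: starting from a
# subject, follow an ordered list of predicates and collect the objects
# reached at the end. Triples and URIs are illustrative.

triples = [
    ("ex:alice", "ex:registeredOn", "ex:svc1"),
    ("ex:svc1",  "ex:allowsUsage",  "ex:analytics"),
    ("ex:svc1",  "ex:allowsUsage",  "ex:marketing"),
]

def property_path(subject, predicates):
    """In spirit, the SPARQL path `ex:registeredOn/ex:allowsUsage`."""
    frontier = {subject}
    for pred in predicates:
        frontier = {o for s, p, o in triples if s in frontier and p == pred}
    return frontier
```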

Apache Jena [5] is an open-source Java framework, released under the Apache License, Version 2.0. It provides a Triple Store and implementations of SPARQL Query and SPARQL Update.

In the Data Gatekeeper, Sticky Policies are implemented in an RDF format, providing a soft link between a policy and the set of data that has to comply with it. Sticky Policies follow an ontology. In particular, the ontology specifies on which services the Data Subject is registered and which usages by each service he/she allows on his/her personal data. For each usage, a set of references to the personal data that can be delivered is specified. Sticky Policies are stored in a TripleStore, a data store specific to RDF data.
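A Sticky Policy with this structure could be sketched as the following triples; note that every vocabulary term and identifier below is invented for illustration and does not come from the actual ontology:

```python
# Hypothetical sketch of a Sticky Policy as RDF-style triples, following
# the structure described above: the Data Subject's registration on a
# service, the usages he/she allows, and the references to personal data
# delivered for each usage. All vocabulary terms are invented here.

sticky_policy = [
    ("ds:alice",         "gk:registeredOn", "svc:healthApp"),
    ("svc:healthApp",    "gk:allowedUsage", "usage:statistics"),
    ("usage:statistics", "gk:deliversData", "data:alice-heart-rate"),
    ("usage:statistics", "gk:deliversData", "data:alice-steps"),
]

def data_for_usage(policy, usage):
    """References to personal data that may be delivered for a usage."""
    return {o for s, p, o in policy if s == usage and p == "gk:deliversData"}
```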

When a query involving an operation of the data life cycle is made, the Data Protection Decision Point has to be consulted before any action is taken. The Data Protection Decision Point then determines whether the query is valid, i.e. compliant with the Sticky Policy generated from the Data Subject's requirements.

Using an ontology with human-understandable terminology to structure the Sticky Policies ensures that a Data Subject is able to understand what consent he/she gives for the processing of his/her personal data.

During the processing phase, a Service Provider wants to process some data held in a Data Store. Before delivering the data to the service, a Policy Enforcement Point, a component on top of the Data Store acting as a reverse proxy, queries the Data Protection Decision Point, specifying the usage the service plans to perform on the data and which data are concerned by the processing. To this end, the Data Protection Decision Point exposes a web service with a REST API that waits for such queries.
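The body of such a query could look like the following sketch; the field names are assumptions for illustration, not the actual Rest Assured API:

```python
import json

# Hypothetical sketch of the request body a Policy Enforcement Point
# could send to the Data Protection Decision Point's REST API. The
# field names are assumptions, not the actual Rest Assured API.

def build_decision_request(service, usage, data_refs):
    """Build a JSON body describing the planned processing."""
    return json.dumps({
        "service": service,
        "usage": usage,
        "data": sorted(data_refs),
    })
```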

The Data Protection Decision Point extracts from the TripleStore of Sticky Policies the Data Subjects who have agreed to the specified usage of the defined set of data. It then forwards the list of their Key Identifiers to the Policy Enforcement Point, which filters the dataset using this list. As a result, only data owned by Data Subjects who have given their consent is processed.
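The filtering step on the Policy Enforcement Point side can be sketched as follows, assuming the dataset is keyed by the same Key Identifiers the Decision Point returns (the data layout is illustrative):

```python
# Sketch of the Policy Enforcement Point's filtering step: keep only the
# records whose key identifier appears in the list returned by the Data
# Protection Decision Point. Data layout is illustrative.

def filter_by_consent(dataset, consented_keys):
    """dataset: mapping of key identifier -> record."""
    allowed = set(consented_keys)
    return {k: v for k, v in dataset.items() if k in allowed}
```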

A first version of this Data Protection Decision Point is deployed in the Rest Assured environment, offering a REST API for evaluating Sticky Policies and forwarding decisions.