
Abstract

This article proposes an innovative approach, and its implementation as an expert system, to achieve the semi-automatic detection of candidate attributes for scrambling sensitive data. The approach is based on semantic rules that determine which concepts have to be scrambled, and on a linguistic component that retrieves the attributes that semantically correspond to these concepts. Because attributes cannot be considered independently from each other, it also addresses the challenging problem of the propagation of the scrambling process through the entire database. One main contribution of this approach is to provide a semi-automatic process for the detection of sensitive data. The underlying knowledge is made available through production rules operationalizing the detection of the sensitive data. A validation of the approach on four different databases is provided.


Introduction

In an ever-changing competitive environment, organizations are under increasing pressure to find ways of protecting sensitive data regarding both individuals and organizations. Companies face the challenge of creating and/or updating non-production environments for testing purposes. In general, a company creates a copy of the production system, which may include the data repository and the administrative settings, to provide a test environment for improving the application delivery process. However, there are many risks associated with test environments open to external consultants; in some cases, testing can become a liability. There is thus a need to provide realistic test data, which requires masking techniques for sensitive data. By masking the data, users see only a representation of the data without having access to the sensitive values. Data related to customers, products, materials, and financial accounts may be sensitive and should be masked or anonymized; sensitive information such as addresses, telephone numbers, and contact details has to be de-identified. By scrambling the data, we substitute sensitive information on customers, orders, products, order profitability, etc., with fictitious but still consistent data, preserving the overall structure and semantics of the test database.

Data masking, also referred to as data obfuscation, data de-identification, data depersonalization, data scrubbing, etc., is a solution for protecting data from both internal and external security threats. It enables the creation and/or updating of data in non-production environments without the risk of exposing sensitive information to unauthorized users, such as external consultants in environments like ERP systems. Note that, unlike encrypted data, masked data maintain their usability in testing environments. Data masking encompasses several techniques, such as generalization, mutation algorithms, and customization. Shuffling techniques can protect names, and a related technique called linked shuffling can de-identify addresses. Phone numbers can be scrambled using random number generators, a date transformer can obfuscate dates, and an account generator de-personalizes account numbers. Data masking thus provides several benefits, such as supplying realistic data for off-site and offshore software testing.

Even though techniques and tools are available, scrambling huge databases is a tedious process. Surveys reveal that companies often neglect to scramble their datasets, generating business and financial risks. ERP systems rely on thousands of tables, each composed of more than twenty columns; deciding which columns are sensitive and have to be scrambled is an enormous task.
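The masking techniques listed above can be illustrated with a small sketch. The function names and data formats below are illustrative assumptions, not the implementation described in this article: name shuffling reassigns values across rows, phone scrambling replaces digits via a random number generator, and a date transformer shifts dates by a bounded random offset.

```python
import random
from datetime import date, timedelta

def shuffle_names(names, seed=0):
    """Shuffling: reassign each name to another row, keeping the value set."""
    rng = random.Random(seed)
    shuffled = list(names)
    rng.shuffle(shuffled)
    return shuffled

def scramble_phone(phone, seed=0):
    """Random number generator: replace digits, keep formatting characters."""
    rng = random.Random(seed)
    return "".join(str(rng.randint(0, 9)) if c.isdigit() else c for c in phone)

def transform_date(d, seed=0, max_days=180):
    """Date transformer: shift by a bounded random offset to obfuscate dates."""
    rng = random.Random(seed)
    return d + timedelta(days=rng.randint(-max_days, max_days))

masked_phone = scramble_phone("+41 21 555-0187")
masked_date = transform_date(date(2020, 6, 15))
```

Each routine preserves the shape of the original value (value set, formatting, rough temporal locality), which is what keeps the masked data usable in test environments.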

Protecting data privacy through scrambling comprises three steps. The first step deals with the choice of the data to be hidden, notably anonymized, randomized, swapped or, more generally, obfuscated (Bakken, Parameswaran, Blough, Franz, & Palmer, 2004). The second step consists in choosing, for each sensitive part of the database, the adequate scrambling technique, particularly but not exhaustively among those mentioned above (Fung, Wang, Chen, & Yu, 2010). The third step applies data sanitization to the entire dataset while preserving data integrity. To the best of our knowledge, the literature and industrial solutions concentrate on this third step (Ravikumar et al., 2011; Askari, Safavi-Naini, & Barker, 2012). This paper contributes to the first step by proposing an innovative technique that automates the detection of sensitive attributes. By semantically modeling the different data, we enable the semi-automatic detection of data sensitivity. This technique encompasses two functionalities: (1) automatic detection of the values to be scrambled, which have to be validated by a domain expert, and (2) automatic propagation to other semantically linked values.
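The two functionalities can be sketched minimally, under the assumption that sensitive concepts are expressed as patterns over attribute names and that semantic links between columns are known in advance. All rule names, patterns, and schema elements here are hypothetical, not the article's actual rule base:

```python
import re

# Hypothetical production rules: sensitive concept -> attribute-name patterns
RULES = {
    "person_name": [r"name", r"surname"],
    "contact":     [r"phone", r"email", r"address"],
    "account":     [r"iban", r"account_?(no|num)"],
}

# Hypothetical semantic links: (table, column) -> (table, column) it derives from
LINKS = {
    ("orders", "recipient"): ("customers", "full_name"),
}

def detect_sensitive(schema):
    """Functionality 1: flag columns whose names match a sensitive concept."""
    hits = {}
    for table, columns in schema.items():
        for col in columns:
            for concept, patterns in RULES.items():
                if any(re.search(p, col, re.IGNORECASE) for p in patterns):
                    hits[(table, col)] = concept
                    break  # first matching concept wins
    return hits

def propagate(hits, links):
    """Functionality 2: extend sensitivity along semantic links between columns."""
    extended = dict(hits)
    for child, parent in links.items():
        if parent in extended:
            extended[child] = extended[parent]
    return extended

schema = {
    "customers": ["id", "full_name", "phone"],
    "orders": ["id", "recipient", "amount"],
}
candidates = propagate(detect_sensitive(schema), LINKS)
# candidates now marks customers.full_name, customers.phone and, via the
# semantic link, orders.recipient as candidates for expert validation
```

Note that `orders.recipient` matches no name pattern and is caught only through propagation, which is exactly why the second functionality is needed: attributes cannot be considered independently from each other.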

Our contribution is original in the sense that it encapsulates general and domain knowledge into rules. We propose a rule-based approach implemented under an expert system architecture. The rules are devoted to the selection of sensitive data with regard to their semantics. Furthermore, we present a deduction mechanism, modeled by a semantic graph, to ensure the propagation of sensitivity to nearby values and consistency with the other relations. Moreover, we propose a prototype with a set of dedicated interfaces to capture the rules. Let us mention three important aspects of our approach: