Many organizations I speak with are increasingly concerned about the privacy of their information. The reasons are obvious... information leaks, regulations, lawsuits, and the risk of penalties top the list. However, even beyond those, companies now recognize that protecting their customers' information can be a key differentiator against competitors and help prevent situations that would lead to customer loss. As these organizations adopt policies to control their risk, they find that data masking requirements appear across many areas of their IT organization, including the data integration space.

To help our customers respond to these challenges, IBM has released the new InfoSphere DataStage Pack for Data Masking. You can find that announcement at the following link:

In this blog, I'll preview some of the primary features of this new stage and then provide some sample illustrations of how it may be used for in-line masking to protect data in an ETL process.

Capabilities

The Data Masking Pack is fully integrated with DataStage and operates just like any other stage on the DataStage job canvas. This makes it very simple to insert into any process where data elements such as customer names, addresses, national identifier numbers, credit card account numbers, email addresses, and the like must be protected.

The pack has been built upon the same obfuscation technologies used in the market-leading InfoSphere Optim products. Combined with the features of DataStage, it provides some key capabilities to help prevent the exposure of sensitive business data, including:

capabilities to simulate realistic data in situations where data type and format must be preserved

the performance, scalability, and reusability of InfoSphere DataStage

support for masking complex file types, including mainframe EBCDIC data

Sample Masked Data

If you're new to this topic, an example of masked information may give you better context for the rest of the conversation. In the image below, we are masking with the email address policy. The original values are on the left of the screen and the masked values just to the right. The three rows illustrate the options available for this particular policy... preserve the domain name, preserve the user name, or generate both parts. Depending on where this data will be reused, you may prefer one of these options over the others.
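To make those three options concrete, here is a minimal Python sketch of the idea. This is purely illustrative and not the product's implementation; the function names, the random-token scheme, and the ".example" placeholder domain are my own assumptions.

```python
import random
import string

def _random_token(length=8):
    """Generate a random lowercase token to stand in for a real value."""
    return "".join(random.choices(string.ascii_lowercase, k=length))

def mask_email(address, option="generate_both"):
    """Mask an email address under one of three policy options:
    'preserve_domain' keeps the part after '@', 'preserve_user'
    keeps the part before it, and 'generate_both' replaces both."""
    user, domain = address.split("@", 1)
    if option == "preserve_domain":
        return f"{_random_token()}@{domain}"
    if option == "preserve_user":
        return f"{user}@{_random_token()}.example"
    return f"{_random_token()}@{_random_token()}.example"
```

For instance, `mask_email("jane.doe@ibm.com", "preserve_domain")` would keep the `@ibm.com` suffix while replacing the user portion, which is useful when downstream systems route or filter on domain.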

Job Illustration

Once installed, the Data Masking (DM) stage appears like any other on the Designer palette. The screen shot below illustrates how it may look in a sample job. In this case, we are simply moving data from one file to another and applying the masking rules as the only transformation, but the job could include any series of transformations, aggregations, pivots, and the like. The DM stage can set the masking policy for any number of fields, so in most use cases you would need only one such stage in a job. Of course, not every field requires a mask, and those fields can simply pass through this stage unaffected. The DM stage also includes validation checking. For instance, if you are masking a Social Security Number, you may want to reject any data that doesn't conform to a standard SSN pattern. In those cases, the user can set a property to send that data down a reject link (not drawn in this particular job), to abort the job, or to simply pass the data through unaffected. This provides very robust handling for exceptions.
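The three exception-handling choices can be sketched in a few lines of Python. Again, this is a conceptual sketch under my own assumptions (the function name, the simple `ddd-dd-dddd` SSN pattern, and the returned tags are illustrative), not the stage's actual logic:

```python
import re

# A common textual SSN layout: three digits, two digits, four digits.
SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")

def handle_ssn(value, on_invalid="reject"):
    """Validate a candidate SSN and apply one of the three behaviors
    described above: route non-conforming rows to a reject link,
    abort the job, or pass the value through unmasked."""
    if SSN_PATTERN.match(value):
        return ("mask", value)       # conforms: apply the masking policy
    if on_invalid == "reject":
        return ("reject", value)     # send the row down a reject link
    if on_invalid == "abort":
        raise ValueError(f"non-conforming SSN: {value}")
    return ("passthrough", value)    # leave the value untouched
```

In a real job these decisions are made per-stage via properties rather than code, but the control flow is the same: validate first, then mask, reject, abort, or pass through.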

To configure the stage, the user enters the Masking Policy Editor (shown below) where they are presented with three separate sections:

output column: lists all columns in the record stream and allows the developer to choose which ones require a masking policy

masking policy: any of a series of policies for the obfuscation of data, including National ID for a variety of countries, Credit Card Number, Random, Repeatable, etc.

mask policy options: depending on the policy selected, the relevant options for configuration of that policy

The developer simply works through the drop-down lists to select the columns and policies that are required. In the screen shot to the right, the masking policy is set to "Hash Lookup", which draws substitute values for one or more columns from predefined lookup tables. This feature is important where the customer requires that a particular data value always map to the same substitute value. The pack includes substitute data for several reference sets, including first name, last name, company name, and address.

You can see from these few illustrations that the masking policy is very straightforward to configure. It's also important to note that the runtime components scale across the DataStage engine like any other parallel stage. So, regardless of whether you're running on an SMP, MPP, or Grid, you can leverage the entirety of that compute power to mask huge volumes of data in-line.

Summary

If your organization is challenged with data privacy issues related to moving data throughout your organization, I'd enjoy discussing with you the unique benefits DataStage can introduce into those scenarios. As always, feel free to drop me a line anytime.