The Five Laws Of Data Masking

Tomorrow I’ll be giving a webcast over at ZDNet (sponsored by Oracle) on the Top 5 Database Security Resolutions for 2008. The resolutions have changed a bit since I first posted about them over here, and I decided to swap in data masking for the last one. I almost pulled it back out after I found out my sponsor (Oracle) just released a data masking product (I try to avoid being too promotional in my webinars), but it’s something I’ve been talking about for a while and it’s too important to pull just because a few people might think I was being biased.

We’re up to nearly 600 people registered for the event, making it one of the largest webcasts I’ve done.

But enough self-promotion; it’s time to talk about data masking.

Data masking started popping up as an issue about 3 years ago. At the time I was covering database security, but client calls were bouncing around between me on the security team and someone over in application development. It’s one of these annoying security issues that crosses organizational boundaries and ends up the responsibility of those with little security experience. It’s an issue that grew organically- first popping up in some audits related to GLBA (a financial services regulation), and now something we see required for PCI and a few other regulations.

Data masking is really a bad term for what we’re talking about. We can technically mask data anywhere, but when we use the term data masking we usually mean “test data generation” or “analytical data generation”. It’s the conversion of production data into either test and development data or data for a data warehouse (OLAP). For this post we’ll focus on test data generation, but the same techniques can be used for an OLAP where you want data that represents production data, but still protects the sensitive stuff.

And that’s our goal- to take sensitive data from a production system and convert it into non-sensitive data suitable for testing or analysis. We can do this through substitution, transposition, obfuscation, de-coupling, scrambling, hashing, or even encryption.

I’m going to quickly eliminate hashing and encryption from the discussion- those techniques are very effective at protecting data, but the result breaks the second rule of data masking- that the data is still representative of the source, without being sensitive.
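To make that concrete, here is a minimal sketch of why a plain hash fails the "representative" test. The card number is the widely used test value, not real data, and SHA-256 is just my choice for illustration; any hash or cipher with a fixed-format output behaves the same way.

```python
import hashlib

card_number = "4111111111111111"  # well-known test card number, not real data

# A hash protects the value, but the result no longer looks like a card number.
hashed = hashlib.sha256(card_number.encode("ascii")).hexdigest()

print(len(card_number))  # 16
print(len(hashed))       # 64
print(hashed)            # hex string; an application expecting 16 digits would reject it
```

A test system that validates field length or format would choke on the hashed value, which is the whole problem.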

Organizations are increasingly finding that data masking is mandated for regulatory compliance. It’s also an extremely effective way to reduce enterprise risk. Development and test environments are rarely as secure as production, and there’s little reason developers should have access to sensitive data. Analytical systems are often accessed by a wide variety of users, most of whom shouldn’t see sensitive data, with only a fraction of the access and other security controls found in transactional systems.

With that, and since I get way more hits if I have the “x laws” in the title, here are the Five Laws of Data Masking:

1. Masking must not be reversible. However you mask your data, it should never be possible to use it to retrieve the original sensitive data.

2. The results must be representative of the source data. The reason to mask data instead of just generating random data is that masking allows you to protect sensitive information that still resembles production data for development and testing purposes. This could include geographic distributions, credit card distributions (e.g., leaving the first 4 digits unchanged, but scrambling the rest), or maintaining human readability of (fake) names and addresses.

3. Referential integrity must be maintained. Your masking solution should maintain referential integrity- if a credit card number is a primary key, and scrambled as part of masking, then all instances of that number linked through key pairs must be scrambled identically.

4. Only mask non-sensitive data if it can be used to recreate sensitive data. It isn’t necessary to mask everything in your database, just those parts that you deem sensitive. But remember, some non-sensitive data can be used to either recreate or tie back to sensitive data. For example, if you scramble a medical ID but the treatment codes for a record could only map back to the original record, you also need to scramble those codes. This is called inference analysis, and your masking should protect against it.

5. Masking must be a repeatable process. One-off masking is not only nearly impossible to maintain, but it’s fairly ineffective. Development/test data needs to represent constantly changing production data as closely as possible. Analytical data may need to be generated daily, or even hourly. If masking isn’t an automated process it’s inefficient, expensive, and ineffective. I know of some organizations that centralize masking and offer it as an internal service to the enterprise.
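As a rough illustration of how several of these laws fit together, here is a minimal Python sketch. It is not how any particular product works. The names `MASKING_KEY`, `mask_card_number`, and `unique_combinations`, the sample tables, and the choice of HMAC-SHA256 are all my own assumptions. The idea: keep the first 4 digits so the data stays representative, derive the remaining digits from a keyed hash so the same input always masks to the same output (referential integrity, and repeatable from a script), and flag combinations of "non-sensitive" fields that occur only once as candidates for inference. Note that the sketch does not preserve the card's check digit, and the key has to stay out of the test environment, since anyone holding it could brute-force the scrambled digits.

```python
import hashlib
import hmac
from collections import Counter

# Hypothetical secret; it would need to be stored outside the test environment.
MASKING_KEY = b"example-masking-key"
DIGITS = "0123456789"


def mask_card_number(card_number, keep=4, key=MASKING_KEY):
    """Keep the first `keep` digits and replace the rest with digits derived
    from a keyed hash of the whole number. Separators are left in place."""
    digits_only = "".join(ch for ch in card_number if ch in DIGITS)
    digest = hmac.new(key, digits_only.encode("ascii"), hashlib.sha256).digest()
    masked = []
    seen = 0  # number of digits processed so far
    for ch in card_number:
        if ch not in DIGITS:
            masked.append(ch)  # keep dashes and spaces as they are
            continue
        if seen < keep:
            masked.append(ch)  # leading digits stay unchanged
        else:
            # Map one digest byte to one digit (slightly biased, fine for a sketch).
            masked.append(str(digest[(seen - keep) % len(digest)] % 10))
        seen += 1
    return "".join(masked)


def unique_combinations(rows, fields):
    """Return the value combinations that occur in exactly one row. Such rows
    could be tied back to an individual even after the ID is masked."""
    counts = Counter(tuple(row[f] for f in fields) for row in rows)
    return [combo for combo, n in counts.items() if n == 1]


# Two made-up tables that share the card number as a key.
customers = [{"card": "4111-1111-1111-1111", "city": "Boulder"}]
orders = [
    {"card": "4111-1111-1111-1111", "amount": 25},
    {"card": "4111-1111-1111-1111", "amount": 40},
]

masked_customers = [dict(r, card=mask_card_number(r["card"])) for r in customers]
masked_orders = [dict(r, card=mask_card_number(r["card"])) for r in orders]

# Same input and key give the same output, so the join on card still works.
assert all(o["card"] == masked_customers[0]["card"] for o in masked_orders)

# Made-up treatment records: the last combination appears only once.
treatments = [
    {"zip": "80301", "code": "A10"},
    {"zip": "80301", "code": "A10"},
    {"zip": "80302", "code": "Z99"},
]
print(unique_combinations(treatments, ["zip", "code"]))  # [('80302', 'Z99')]
```

Because the masking is a pure function of the input and the key, rerunning the script against a fresh production extract gives consistent results, which is what makes it practical to automate.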

These “laws” are just to start the discussion on masking. In future posts I’ll discuss my recommended data masking process and what features to look for in tools.

Comments:

By Adrian Lane on 01/25 at 05:56 AM

I find it ironic that Oracle would use the term ‘Mask’ when people like Hector Garcia-Molina use the term ‘obfuscation’ in relation to research projects into this field.

I have been using the term ‘data obfuscation’ since I first heard it in 1997. I like the term because it encompasses the transformation into the obscure state. Mask implies concealment, but not an alteration, and this needs to be considered a form of hash function. But enough of the geek wordsmithing ...

In full agreement with your definition of the required components ... looks like you put a lot of thought into this. I would not have even considered making the process repeatable ... in your view is it really a requirement or an option?

By Techdulla on 01/25 at 06:56 AM

Just wanted to say good job on the webcast today, so much info in such a short period of time.

By rmogull on 01/25 at 07:06 AM

@techdulla, thanks!
@alane: It’s not Oracle- I was using obfuscation for the first few years I covered it, but it became clear the market (end users) was using masking. I agree it’s not the right term, but I’ve learned the hard way that those aren’t battles I usually win despite my best attempts.

That’s what people Google, so that’s what the vendors have to call it.

I really do think it *has* to be a repeatable process in most cases. It’s so rare to only need it once, and people fall into the rat hole that ends with someone assigned to do nothing but re-write scripts to obfuscate all the time.

Probably not as important for small orgs, but mandatory for the big ones.

By 41% Of Enterprises Mask Test And Development Data on 01/31 at 04:18 AM

[...] And don’t forget data masking law number 5. [...]

By J Doherty on 06/24 at 08:04 AM

Understanding Data Masking, Obfuscation, and De-Identification has become more complex than one could ever consider. Most large companies can’t figure out where to start, never mind how to accomplish securing data across the enterprise.

In the past five years I have seen companies enter this market via the archiving/subsetting space, applying that technology to a data masking solution and trying to market it as a complete test data management solution. But in reality data masking is not about shrinking databases; it is about security.

Vendors use terms like scrambling, substitution, nulling, etc., but true masking is about de-identifying using a complex combination of masking techniques. But what needs to be realized is that masking needs to be consistent over time and across data sources to make testing and development environments effective and allow for repeatability of use cases and full integration testing.

Oracle has thrown their hat in the ring to compete against 3rd party vendors, but it does not support complex data masking, ease of use, and cross-database platforms.

So as more companies identify the need for securing non-production databases and begin to eliminate the use of real data, they need to look at finding a solution that protects all NPI and PII, because data privacy is not an application-specific issue, it is a corporate issue.

J

By rmogull on 06/24 at 08:08 AM

J-

Full disclosure, I believe you are a vendor in this space. You can still post, but when criticizing other approaches it’s important to disclose your position since it comes with bias (for better or worse). It means you have experience in the area, but you also have a stake in the game.

By J Doherty on 06/24 at 08:18 AM

I did not mean to criticize but to clarify what is going on in the data masking and security space. I have worked with companies in many aspects of data security over the past 15 years. I apologize if I was misunderstood.

By rmogull on 06/24 at 08:21 AM

No problem, just letting people know…

By On Oracle World and Inference Attacks | securosis. on 09/25 at 05:47 AM

[...] This year I was invited to speak on a panel on data masking/test data generation. As usual, it’s something we’ve talked about before, and it’s clearly a warming topic thanks to PCI and HIPAA. I’ve covered data masking for years, and was even involved in a real project long before joining Gartner, but it’s only VERY recently that interest really seems to be accelerating. You can read this post for my Five Laws of Data Masking. [...]

By Anil on 01/12 at 06:17 AM

It’s really very good and informative.
Recently I took up a data masking initiative to implement on Sybase ASE and IQ databases.
I am interested in knowing the data masking process and the features to look for in tools.
As I have just started looking into this, can you please recommend which processes and features to look for?
If you could help me with any notes, white papers, or comparisons between tools, it would be greatly appreciated.

By Manmeet on 03/15 at 05:20 PM

I really like your rules. We have built our product and do cover these and more rules. We are also stepping into dynamic and related unstructured data.

But as I see it, I don’t think 41% of organizations mask their data. Can you tell me the source?

By Matteson on 11/17 at 07:36 PM

There is another alternative to data masking, which is the use of synthetically generated, realistic, even longitudinal data sets created as a service by such companies as ExactData and utilized to avoid the risk of compromising sensitive data. Thanks for the information.

**Editorial addition: the person submitting this comment is a member of the recommended company. We ask that vendors please identify themselves as such when discussing their own products and services in comments**

By AFarber on 01/21 at 10:55 AM

So, why are you still using data masking?

Most of us have no idea when it comes to figuring out ways to acquire the right kind of data we need for any type of test or development project. We

By Rbhill on 01/24 at 08:51 PM

If a company is guilty of one of these the data should be leaked:

By Urvashi Saxena on 10/12 at 04:48 PM

I don’t work for, but have worked with, software supported in Eclipse which follows these rules. IRI FieldShield for source data masking, and IRI RowGen for test data creation, both preserve referential integrity with de-identified, irreversible field-level masks in databases and flat files.