This article describes a lightweight copy-and-generate approach for making a sanitized version of a production database available to development teams with SQL Clone and SQL Data Generator.

We build the latest database version (schema only), then copy into it, from the production database, the data for any tables and columns that don’t require masking. For columns containing sensitive or identifying data, we generate the data instead.

The sanitized version of the database is then made available to the development team through SQL Clone, meaning that deploying each sanitized clone will take just seconds and require only tens of MB of disk space on each server.

The technique doesn’t rely on in-place modification of sensitive data, so it avoids the possibility that someone with access to the transaction log, which contains details of how the data was modified, could reconstruct the real data. However, since it does copy a lot of the other production data verbatim, it may still be possible to make deductions about the masked data, such as the real identity of a person or organization, based on the data that isn’t masked.

Why Use Real Databases in Development and Testing?

Over 70% of respondents to our recent database provisioning survey told us that they regularly copy production databases for use in non-production environments, such as development and testing. Some of those people who copy production databases subsequently run scripts to sanitize the sensitive data, but more still tell us that they should.

Some will cry out in horror, but assuming that the data can be sanitized in a way that complies with the current legislative framework in force, the results are particularly useful for testing. Organizations such as London Electronic Exchange LMAX, the birthplace of Continuous Delivery, have long considered that running early builds and tests against sanitized copies of production data is crucial to the overall quality and reliability of their deployments.

It’s part of the shift left culture of DevOps: if developers can write, test, and run their own code on servers and databases configured to match as closely as possible what’s present in the production environment, then…

“…most of the work to successfully integrate our codes and environments happens during our daily work, instead of at the end of the release.” (The DevOps Handbook)

Of course, not all database development work requires production data. Much of it is well-served by specially created sets of data that support effective unit testing. Test data sets that include the likes of "Mrs. O’ Brien" and "Mr. Null" are very effective at flushing out bugs.

However, there are times when you need to work with real data, meaning data that is as close as possible in its character and distribution to the production data. By running tests with this data, development teams can gain confidence that their test results reflect as accurately as possible the behavior that will be observed immediately after deploying a change to production. Occasionally, there are also times when only working with the real production data will do, such as when a production bug cannot be replicated in the development and test environments.

Database Provisioning Strategies

The strategies for getting the real data into the development and test cells depends on the legislative frameworks controlling the use of the data that your organization collects. If the data is essentially public domain, you can simply restore the latest production backup. For personal, financial or medical data, it may not be possible, legally, to copy production data, so you’ll need to use a tool to create test data sets. It is a broad spectrum, though, and it may be possible to copy most of the data and generate, or mask, only the sensitive data.

Whatever the strategy, it’s often a hard and time-consuming task to automate reliably the process of restoring a copy of a large production database to each development and test machine that needs it. A database restore process can take many hours for large production databases, and it can put a strain on storage capacity to make multiple copies of these databases. The obligation to mask or sanitize that data first adds further complexity to the task. In my experience, scripted solutions to automate the provisioning of sanitized databases have tended to be very brittle and hard to maintain.

If certain data can’t be copied outside the production environments, or developers and testers shouldn’t be able to view real values for certain types of sensitive information, but the team needs to development test against a full database, what are the options for masking that data?

What Is Data Masking?

What exactly do we mean by data masking? Perhaps you use Dynamic Data Masking in your production databases, whereby the database engine masks sensitive data, as it is retrieved from the database, to prevent unauthorized users from viewing it. It uses masking rules applied to each column that contains sensitive data, but users with full administrative access (and let’s face it, that often includes your developers when the database is in a non-production environment) can still circumvent the controls and look behind the mask, as can others in some circumstances, as Phil Factor points out in his informative article on Simple Talk.

Perhaps you use Transparent Data Encryption to ensure that only those servers that have the right encryption keys and certificate can view the data. However, if developers need to work with copies of this production data, then the development server will need to have all these. In other words, you’ll need to take the mask off, and the data won’t be protected.

Many people use the term data masking at an organizational level, meaning that the data should be available to some people, and in some places but not others. In such cases, they see the need to enable database testing with realistic data, but this means that they need to replace sensitive data when it is moved, for example from production into the development environment.

How do we replace this sensitive data though? One person’s masking means obfuscating (XXXX-ing out) sensitive fields using techniques such as character masking and numeric variance, as described by John Magnabosco. However, this form of in-place data modification might not prevent someone with access to the LDF or MDF files of the database being able to recreate the real values.

One alternative is to generate the sensitive data, rather than copy it, which is the technique I’ll use here.

Copy-and-Generate Data Masking

I’m going to use some Redgate products to perform a copy-and-generate data masking operation. The advantage of this approach over an in-place update is that the sensitive data I want to scrub has never existed in my new database copy. Attacking the MDF and LDF files with a hex editor will not yield evidence of the sensitive data. I’ll use Brent Ozar’s download of StackOverflow for my example.

First, I’ll create an empty copy of my schema (represented by the white cells in Figure 1), then run routines to populate it, either with the data from my source database or with generated data. In Figure 1, we’ve copied the top right cell of data from the source to the target and generated the bottom right, since the phone number is identifying data. Next, we’ll copy the next two cells, and so on.

Figure 1.

Finally, I’ll create a SQL Clone image from which the team can create and deploy clone databases to the SQL Server instances on their development and test servers.

All this would need to be scripted for reuse; I want to run this in my maintenance window overnight so that I always have a sanitized copy of recent production data.

1. Prepare an Empty Copy of the Schema

The first step is to create an empty copy of my source database; all the schema but none of the data. If you have Redgate’s DLM Automation (and if you’re a Toolbelt owner, you do), you can do so using the built-in New-DlmDatabaseRelease PowerShell cmdlet, as shown in Listing 1.

Alternatively, you might just use the Redgate SQL Compare command line and PowerShell. Alternatively, if you are on SQL Server 2014 SP2 or SQL Server 2016, you could use DBCC CLONEDATABASE. This addition to the DBCC instructions creates a data-free copy of a schema, as a new database. It was designed for diagnostic purposes, and do heed the warning in the output; a cloned database should be used for diagnostic purposes only and is not supported for use in a production environment (and SQL Clone databases should not be used in Production either).

Another point to note is that the default behavior of DBCC CLONEDATABASE is to include the statistics, which may give rise to other confidentiality concerns, as outlined by SQL Server Program Manager Parikshit Savjani in a recent post. To avoid these worries, I’d use the additional NO_QUERYSTORE and NO_STATISTICS options which were added in 2016 SP1, then set the database to READ_WRITE, and recovery mode to SIMPLE (although StackOverflow is already in that state).

This will create the Stackoverflow-Obfuscated database (schema only) on the production server, and this will be the image source. If you can’t access production directly, to create the image, you can simply generate it from a backup.

2. Prepare a Copy-and-Generate Data Load Task

We could now start writing INSERT INTO…SELECT * FROM T-SQL scripts, or an SSIS job, to populate our empty copy of the schema, and then insert generated values for the non-sensitive. However, that’s an awful lot of bespoke coding work. Instead, I’m going to use an existing tool which can do both the copy and the generation.

Redgate’s SQL Data Generator (part of the SQL Toolbelt) has been around a while now but has recently undergone some tweaks that make it more amenable to the copy-and-generate task. Specifically, it will now copy sequential identifiers from the source, instead of recreating the sequence. You’ll need to be on a recent version if you want to follow along.

I’ll configure SQL Data Generator by pointing at the SQL Server instance where both the Stackoverflow and Stackoverflow-Obfuscated databases are hosted, in this simple example.

Figure 2.

Note, I’m pointing the tool at the empty schema I produced. After a quick scan, it will show me the tables. All I need to do is step through them, and at the top level define the source to be my original StackOverflow database.

I’ll also need to make sure the number of rows to insert is set to Same as mapped data and the Copy IDENTITY column values selection is checked (as a general practice, although StackOverflow doesn’t use them).

Figure 3.

For any columns that are a source of sensitive values, within each table, I choose an appropriate generator to create those values, instead of copying them from the source.

The first one that struck me was LastEditorDisplayName on the Posts table. I’ll choose the pre-configured regular expression and text file source for that, then pick the percentage which should be null (I took a guess, but generally you’d just find this out with a quick query).

Figure 4.

That’s basically it for configuring SQL Data Generator. Next, I’ll save the SDG project file, based on this definition, and invoke it using a PowerShell script:

This takes a little while: 3 hours 45 minutes for me. StackOverflow is 95 GB or so, and has a simple table structure, so clearly some larger or more complex databases wouldn’t allow this in an overnight batch. My output looked as shown below:

3. Create a SQL Clone Image From the Sanitized Copy

SQL Clone’s PowerShell cmdlets make this straightforward. I can run this from any machine which has the cmdlets installed; it doesn’t need any other SQL Clone components installed. The executing user must have permissions to access the SQL Clone Server. See the worked examples for other variations.

Before making it into an image ready for cloning, it may be wise to perform some basic database maintenance, such as rebuilding indexes and updating statistics, so that each clone performs adequately, although I haven’t covered that here.

This wasn’t too painful: 47 minutes. I think this will just about fit in my nightly maintenance window, although I might have to keep a close eye on it if my database is growing (whose isn’t?). I can now see the image in SQL Clone:

Figure 5.

4. Creating the Sanitized Database Clones

Finally, we use SQL Clone to create and deploy clones from this image, making the full database available in multiple SQL Server instances across your development and test workstations. All the long-running operations have taken place overnight, and now making each clone will require only a few seconds, and each one will take up only about 70 MB of local disk space.

All that’s left is to put those scripts in a batch (SQL Agent would do), pick a schedule, and keep an eye on it. You won’t want to allow images to accumulate forever, so you’ll want to clear them out after a defined period. I won’t cover that here, but there are examples to follow for that as well.

Further Data Masking Challenges

Using SQL Clone and a simple copy-and-generate technique, I’ve made copies of my production database available for development and testing, having first replaced the sensitive fields in my database, in a way that prevents end users from accessing the original values. What’s more, if I build this into a scheduled deployment process, for example using Octopus Deploy, then I’ve got a repeatable process, with logs I can show to an auditor, if need be.

However, there are plenty more challenges. I deliberately kept this example simple so, for example, I only generated data for two columns that contained obviously sensitive data. Sometimes, though, sensitive data crops up in unexpected places; you might mask the LastName column but then find that real names and other sensitive data also crop up in the “Call Notes” column of a table. You’ll need to ensure your masking strategy accounts for all sensitive data.

I also haven’t replicated the data characteristics very rigorously when replacing the sensitive fields, which may be a problem for certain types of testing. You might require techniques to replace related identification data with generated content that is coherent and plausible and which makes sense to an end user (so if City=New York, then State=California is not acceptable). A tool like SDG can do a lot better job than I demonstrated here, in generating realistic data.

In some ways, I chose an easy sample database; none of the sensitive values are key values, foreign keys are absent, and so are other database objects or features which would complicate the process, like triggers, cross-database references, and replication.

All these challenges make data masking more interesting, but the obstacles aren’t insurmountable. Others are harder to overcome. I recall working on Financial Services databases, where only one database would have ValuationEventid, as a key column, but the values in this column were used in non-key columns in other databases. That would have meant needing to copy-and-generate across databases, at the same point in time, so that these ValuationEventid values were properly synchronized.

It can also be hard to generate data in a way that reproduces the inconsistencies, the cruft, the strange anomalies, which gather over time in all databases and are the source of many issues that appear only in production. I’ve always said that exploring a production database is often like going on an architectural dig; it is the living output of all versions of your systems that have ever existed. I recall with a strange wistfulness conversations along the lines of: “How did those values get in there?” “Ah, that’s probably from when we used to use Excel to input them…”

Data Masking and the GDPR

Finally, it’s worth reiterating that since I’ve only masked some of the data, it may be possible to infer from the unmasked data values, the true identities and values of the data that is masked. This raises the question: To what extent does the data need to be masked or anonymized? It’s a topic that is addressed directly by the new General Data Protection Regulations (GDPR), with its references to pseudonymization:

“The principles of data protection should therefore not apply to anonymous information, that is […] data rendered anonymous in such a way that the data subject is not or no longer identifiable.”

The process of anonymization has been neatly defined by the ICO as “Turning data into a form which does not identify individuals and where identification is not likely to take place“. Note the use of the words not likely. It doesn’t necessarily have to be completely impossible for someone to identify real data values, just unlikely.

However, the big step change is that under the GDPR, all organizations who primarily process data, or are over a certain size, are now legally obliged to enforce the data protection processes that will make identification unlikely, if they’re going to use identifiable data outside the tasks for which consent has been given. They will need to be able to prove that they are doing so, regardless of where in the world that data is stored or processed.

All such organizations need to consider, urgently, how this data is used within their organization, and whether the proper processes are in place to mask and protect it.

Summary

The intent of this article was to provides teams with a lightweight, self-service process for developing and testing on real databases while ensuring that applications don’t use, and developers can’t see, the real data values for columns that contain sensitive data.

Developers working closely with data are right to demand early access to up-to-date and realistic data sets. The DevOps imperative to ‘move problems left’ in the pipeline is making those demands harder to resist, while regulatory forces are requiring ever more control of exactly what data is moved around the organization.

Ultimately, the severity of the data masking challenge depends on the complexity of both your databases and your data. A simple approach like the one shown won’t be appropriate for many, but it may suit some and could mark a good starting place for others.