Even small extracts of data need to be created with caution, if they are for public consumption. Sensitive data can 'hide' in unexpected places, and apparently innocuous data can be combined with other information to expose information about identifiable individuals. If we need to deliver an entire database in obfuscated form, the problems can get harder. Phil Factor examines some of the basic data masking techniques, and the challenges inherent in masking certain types of sensitive and personal data, while ensuring it still looks like the real data, and preserving its referential integrity, and distribution characteristics.

Guest post

This is a guest post from Phil Factor. Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 30 years of experience with database-intensive applications.

Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.

This article describes the practicalities of data masking, the various methods we can use, and the potential pitfalls. In subsequent articles, I’ll demonstrate how we can mask, or sanitize, different types of data using tools such as SQL Provision (which integrates SQL Clone and Data Masker for SQL Server), and SQL Data Generator.

Basics of data masking

Data masking, sometimes called data sanitization or data protection, is a term for the technology and processes that are used to anonymize or pseudonymize personal, private or sensitive data. Anonymization refers to the process of entirely removing all real data values, while maintaining the overall verisimilitude of the data, so that it looks real. Pseudonymization attempts to maintain a part of the data, while anonymizing the rest of the attributes because they are sensitive or irrelevant.

Why mask data?

The custodian of data has a difficult balancing act between conforming with privacy laws, discretion and the obligations of confidentiality, on one hand, and the need to make certain parts of that data available for development work, training, testing, medical research, open government, security, and the law. It is no longer legally possible to do this if the data contains information that must be protected. The data professional is faced with the practical task of providing data, legally, in a way that cannot possibly reveal, by inference attacks, private information about individuals, or confidential business information.

There wasn’t much call for data masking until recently. If you were required to provide a data extract for research, legal evidence, public information or security, it was sufficient to create a view, or set of views, that included only the required information. Sensitive information, and information beyond the remit of the information request, simply wasn’t included. It was this view that was exported. One of the main drivers for masking has been the increasing demand from developers, to use all the data from the live database of the application to develop with, as part of a DevOps process.

What can go wrong?

It is relatively simple to pseudonymize the contents of a single table, or even a handful of tables, so that the real data cannot be gleaned by casual inspection. It is more difficult to achieve a robust pseudonymization that is proof against decoding by a skilled person.

It is a frighteningly complex task to anonymize or pseudonymize the data of an entire database, in order to do development or testing, and there have been many cases where the personal information about individuals has been extracted from public, ‘pseudonymized’ data. You also need to mask data that, by itself, seems innocuous but, when used with other publicly-available data, can identify individuals.

It is relatively easy to entirely anonymize data, but this doesn’t meet many requirements. To be useful for development or test work, for example, the masked data must have the same distribution and characteristics as the original data. For medical, epidemiological research, it is often vital that the medical information and locations are correct. For example, if the long-term effects of a nuclear reactor accident are being investigated, the locations cannot be masked. We cannot tamper with information that is required for court proceedings. Data for staff training must be as realistic to casual inspection as possible. Generally, data must be pseudonymized, and partially anonymized, to fit the purpose for which the data is required.

What sort of data needs to be masked?

Although personally-identifiable data is currently the focus of legislation, there are plenty of other types of data that need to be masked, before it can be made public, such as business and financial information, or information that can aid the carrying-out of a criminal offense. The organization that processes or stores the data must determine which data elements need to be masked, and how, based on the legislative framework within which it operates.

Personally-identifiable information (PII) is common to most data masking requirements. PII is any data that can be used to identify a living person, and includes such elements as name, date of birth, National Identification Number, address details, phone numbers or email addresses, disabilities, gender identity or sexual orientation, court orders, electronic wage slips, union affiliations, biometric and ‘distinguishing feature’ information, references to the serial number of devices such as laptops that are associated with, or assigned to, a person.

However, though a single data set can be successfully anonymized, the more data sets that are combined, the easier it is to identify individuals. An example of this is data that identifies where an individual was at a given time. This can be a hotel or travel booking, internet access record, meeting appointment, or GPS data. By combining two or more of these data sets, we can often identify a record as belonging to an individual.

To protect data, we need to know in which columns, of which tables, and in which databases, it is held. Data masking is relatively easy when each item of data is held in just one column, but real life isn’t like that. If the data has been denormalized, to speed up reporting, then we must find every table to which the data we want to mask has been added.

Where does masking take place?

Data Masking software either does its obfuscation in-situ, in passage, or uses the ‘on the fly’ approach, sometimes known as the ‘Server-To-Server’ technique.

On-the-fly data masking

In the ‘on the fly’ technique, the obfuscation is done as part of copying, or cloning, the source database. The pseudonymization rules are applied as part of the process of moving the data from the source database to the target. This has the advantage that the data is never present in an un-masked form in the target database, so there is no danger of the original data being recoverable from the target database’s transaction log. If the changes were audited, that part of the database that does the audit can, and should be, withheld.

Although there are great security advantages to the obfuscation being performed on the original, security-fenced server, the process becomes ‘all-or-nothing’, and must be repeated if it proves to be inadequate. Dynamic data masking can sometimes be used to copy versions of the data, via a login for which dynamic masking has been applied, but this type of masking does not produce realistic data and can easily fall foul of constraint checks, when the data is imported into the target.

In-passage data masking

In-passage data masking takes place after the data is exported from the source and before it is imported. This usually works on the securely-held text files of the contents, in comma-separated value (CSV) format, XML, JSON or other transfer format. The process can involve Regex, at its simplest, but is more usually read into some form of document database.

This avoids the problems of artefacts of the data appearing in logs and audit trails. It also side-steps the interference of constraints, rules and triggers. Unfortunately, this will only be a postponement of the consistency checks on the masked data because the masked data will be checked on import and will need to comply with database rules once these checks are re-enabled.

In-situ data-masking

In-situ data-masking takes place after the database is copied, so that the software simply operates on the copied database. The masking rules are executed and controlled by software that then connects to the target and controls the execution of the masking rules. Although this allows incremental application of masking rules over time, the original data will remain in the logs and on the database pages.

Masking techniques

There are several methods for masking data depending on the type of data being masked, and on the masking requirements.

Complete substitution

Data in one or more columns of a table can be randomly substituted with values from an appropriate list, or by using a data generator that can supply credible values that pass check constraints. Credit Card numbers can, for example, be substituted with numbers that obey the validation rules of the provider. This preserves the look and feel of the existing data but requires a library of credible values. The disadvantage of this is that it is bound to alter the way that the data is distributed, unless the data generation technique can check the original distribution and conform to it.

Data shuffling

Data in one or more columns can be shuffled randomly against the key fields, much like a deck of cards. This might require certain rows to be respected or some other condition to be applied. Although this technique gets over the problem of data distribution, the obvious problem is that the individuals with unusual surnames such as Jack O’Diamonds are still there. Shuffling would not have cheered the victims of Ashley Madison if they had unusual names. Also, unless the shuffling is truly random, card-sharp techniques can be employed to unshuffle the pack.

Search and replace masking

This is used to modify text data by means of rules and regular expressions. It is often used to remove names or identifiers from text, substituting random length ‘###’ strings.

Number variance

Each number or date value in a column can be varied by some random percentage, whilst maintaining the original variance, range and distribution. This can useful where it would otherwise be possible to identify individuals by an exact match, such as date of birth, but is otherwise more of a blurring than an obfuscation. It will, for example, give an approximate value for salary information.

Masking out

A technique commonly used by dynamic data masking within the RDBMS. Certain fields are replaced with a mask character (such as an ‘###’). This removes the actual content while preserving the same formatting. For example, dates such as these:

12/03/2012

28/10/2017

01/08/2015

If accessed by a login for whom masking is appropriate, would appear as:

XX/XX/20XX

XX/XX/20XX

XX/XX/20XX

The masking characters effectively remove much of the sensitive content from the record while still preserving as much as possible of the look and feel. However, dynamic data masking can be defeated, easily, if the user has direct access to the database to execute queries, because the masking rules are held at table-level and the masking software cannot track the data from the original table. However, it is useful where users must be trained on the live system, and can interact with the data only via an application.

Practical difficulties with masking

All this might sound straightforward to achieve, but the detail of data masking is as complicated as the database that is being obfuscated and depends also on the purpose of data masking.

Denormalization

If a database has been denormalized, the sensitive date will be stored in several tables, and isn’t always likely to be in an obvious place. The name of a customer, for example, will appear against addresses, phone numbers, invoice headers, correspondence, JSON documents, XML fragments, references, logs, transcriptions of conversations and so on. To mask even a simple customer name could be nearly impossible. Even a well-normalized database can accidentally reveal personal information if an XML or text field is stored.

In some cases, before changing a value there must exist several arcane rules that specify what else needs to be altered and where.

Row-level data consistency

Column data is very rarely unrelated. An obvious example is that a male customer is unlikely to have the title of ‘Lady’ or ‘Mrs’, or to be called ‘Joyce’. The last names of customers will vary greatly according to where they live. Company names vary considerably with their line of business: The Red Lion, White Hart, Eato Burger Bar or Wheatsheaf are unlikely to be Banks, for example. Context is important.

A single column cannot even be shuffled without harming the verisimilitude of data. To ensure that data is correct within its context, data masking software must relate columns both internally, within a table and externally, table to table. Successful masking requires that you can modify columns in groups of rows, within a table, to contain identical values to the one you’ve changed, or to modify columns in another table to contain identical values to the one you’ve just changed. You sometimes need to substitute different values, according to a value in another column.

Constraints

Databases have many constraints, rules and triggers that are there to ensure that data is consistent and reliable. In other words, they are there to restrict the very activity you are attempting, the direct alteration of data in the database tables.

A CHECK constraint can do basic checks on the value in a column, but can also ensure that there is consistency at the table level. By altering one or more values, these CHECK constraint rules can be violated.

You can also run into problems if the column containing the value you are altering is participating in a PRIMARYKEY or FOREIGNKEY constraint. This can mean that many masking rules can only by executed in a particular order, or in a particular way.

You can, of course, temporarily disable triggers, unique keys, check constraints or foreign key constraints while you perform the masking. This can be extremely useful, but the disabling of triggers can conceivably result in inconsistency, depending on the operation that the triggers needed to perform. Also, of course, you’ll have to reenable the constraints and keys at some point, and this can be the time you realize that there are a lot of inconsistencies that need mending.

Faced with the difficulties of altering data within a database, you might think that a better approach is to apply in-passage masking to text versions of the base tables, using scripts and Regexes, or by reading the files into a document database and processing it there.

This gets over the problem of log artefacts, which can reveal the original sensitive data, but the problems begin when the sanitized data is imported into the target, the empty copy of the database that is destined to be the masked version. This is because at some point, triggers, rules and constraints will have to be enabled and the chances of the data being gracefully absorbed into the new database can be frighteningly remote.

Distributed databases

Another problem can happen of your extracted data set originates in more than one database or instance. Masking software tends to work only on a single database, and you can get problems with masking the data within several databases in a way that yields consistent data.

Masking data that is part of a primary key

If you attempt to alter the data in a column that participates in a PRIMARYKEY, then you are likely to destroy the referential integrity of the database. To do this effectively, using a substitution strategy in a database, you will need to create a correlation table, which contains copies of the before- and after- values of the column to be masked. The correlation table is then used to relate the masked, or substituted, key values to the original ones, making sure that the new values are unique.

Having temporarily disabled all the relevant UNIQUE and FOREIGNKEY constraints in the target table, the correlation table is then used to alter the values in the PRIMARYKEY of the target table, and then to replace the original key values in the FOREIGNKEY columns of any referencing tables, with the new value. Then, we can re-enable the constraints in the target table. Finally, the temporary correlation table must be dropped for security reasons, since it forms a lookup mechanism between the new values and the old values.

For each referencing table, the FOREIGNKEY columns will generally have an index, which will, to speed the process, need to remain in place during the update. Likewise, the columns containing the original PRIMARYKEY values, in the correlation table, should be indexed.

The most useful approach is likely to be the use of markov chains to produce meaningless text that mimics the style of the original. Homer can, for example, be obfuscated to:

Nor yet e’en so Achilles let his counsel take:
who should strike the bird?
As less in skill not one was absent;
nor of his lofty vessel’s prow. The dark whirlwind’s force
as this wine we pour their hearts’ best blood
Theirs and their fam’d Allies who freely in thy house
receiv’d, for twenty days, the heav’nly Maid; Daughter of Cisseus.

Structured text (XML/JSON)

Structured text, such as in an XML column, can contain the very values that are being obfuscated, so they cannot be left alone. There is no point altering only the name column in a table, if the names also appear in an XML document that is supposed to hold supplementary information.

This structured text can contain data that is used by the application under test, training or development, so it will need to be parsed and, if necessary, subject to field-by-field masking, just as if it were normal relational data. There is no short-cut to this chore.

Intelligent keys

Some data items such as postal codes, employee numbers, credit card numbers and bank codes, have a structure containing several attributes. Unless the substitution is done with the same rules, they are likely to fail any validation checks performed within the application, and to violate CHECK constraints within the database.

The problem arises in the front-end application data-entry screens, which will check the content of these ‘intelligent keys’ prior to update. If we’ve masked data in these columns in a way that violates the internal rules governing the attributes of this data, then any screen that displays the value will never permit an update because the validity checks fail.

You can, of course, try shuffling the data in these columns, but a postal code, for example, could easily be checked against an address in the same row, so it is possible that the shuffled data will be rejected as inconsistent.

Training and demonstration

Sometimes we need to anonymize a database for training or demonstration purposes, in such a way that certain data is consistent enough to support a training ‘scenario’. If you need to demonstrate an application process involving a certain user with, for example, certain accounts or other records, then the obfuscation process must take this into account. The easiest approach to this problem is to introduce consistent records into the obfuscated version of the database, after the obfuscation is completed.

Specialist data

There is certainly data that is common to many databases and can therefore be served by a list, such as names of countries, currencies and regions. Likewise, we can perform substitution from lists containing typical forenames or last names, valid credit card numbers from the most popular providers, and so on.

However, substitution of data soon gets complicated when we consider how to mask specialized words, such as the names of manufacturing parts, zoological taxonomies, because now we must deal with the diversities of culture. Even in Europe, lists would need to be in the sixteen most common languages.

There comes a time when lists need a supplement, which in its simplest form might be a reverse RegEx expression that can generate random numbers or strings according to a regex expression.

This technique is most effective if the proportion of NULL values introduced into the output can be specified, and the distribution of random values can be altered to fit the distribution of the data. Ideally, there should be a way of scripting random data generators to accommodate any sort of condition constraints or data consistency.

Summary

Quite often, there is no need at all to mask data. The most effective way to protect sensitive data is simply not to include it in the data set that you wish to distribute. Many requirements for masked data, such as for distributing reports publicly, can be accommodated far better by creating a view, or set of views, that gives the required data without any of the unnecessary sensitive data fields. Why extract a credit card number, for example, if it must be obfuscated anyway? Why disguise people’s names when they need not be in the data extract?

Even data extracts need to be created with great caution. Apparently innocuous data can be combined with other information to expose information about identifiable individuals. Location information, identifiers, or descriptive information, such as racial origins, can be used to unmask data. Fields containing compound identifiers, such as bank codes, can contain location information. XML fields can occasionally compromise a data extract by containing some or all of the original record, in denormalized form.

Also, the referential nature of data in the average database means that the apparently simple matter of changing the data in databases can get complicated quickly. Where data has been denormalized, a record can store the same value in several places, and so changing the value is a matter that needs caution and research. Replacing a value in a column that is participating in a primary key can be almost impossible, without a great deal of skill and knowledge.

There are times when development and testing requires data that is as close as legally -possible to the real production data. It is likely to be a regular demand. If, therefore, the creation of a masked or anonymized database is done regularly, it is best to automate the process so the pain of doing so isn’t wasted, and can be reused. Also, the routine can be tested alongside the database as it is extended and developed, to catch the problems that are occasionally introduced to the obfuscation process during development work.

If you found this article useful you might also enjoy this article by Grant Fritchey

Guest post

This is a guest post from Phil Factor. Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 30 years of experience with database-intensive applications.

Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.

You may also like

With a database, many people can view the data at the same time. If someone makes a change and commits it, then the change is visible not only for that connection, but all the others as well. Everyone sees the effects of that change; that is what databases are for. But what if you want

Developers, when working on databases rather than the application code, often find they have less freedom to experiment than they are used to. It’s not necessarily easy to set up a database for testing, especially if the process isn’t automated. They’ll need to dig around in source control, build the database at the correct version,

Microsoft Future Decoded is your essential guide to disruptive technology and how it will impact people and businesses, today and tomorrow. Back for a fourth year, Microsoft Future Decoded will focus on current and emerging trends at the cross over of business and technology. With cloud, big data and artificial intelligence already rewriting the rules

In this webinar Redgate's Arneh Eskandari will highlight the importance of provisioning as a part of your Compliant Database DevOps story and how it can accelerate your delivery processes while remaining in line with relevant legislation.