Protecting Privacy in the Development and Testing of Human Resources Applications


Published: November 2010

Internal policies and
governmental regulations dictate how Microsoft must handle and store personally
identifiable information (PII) about employees. The Human Resources Information
Technology (HR IT) team at Microsoft created a strategy for using live data
feeds from applications that, by function, contain PII. With this strategy, HR IT can test
the solutions that it is developing under real-world
conditions without risking inadvertent disclosure of PII.

This paper is intended for technical decision
makers and IT professionals who are interested in learning how to help keep
sensitive data safe but still allow it to be used during development and
testing.

Windows Server 2003

Internet Information Services

Microsoft .NET Framework

Introduction

HR IT must access live HR systems to develop and test the
custom applications that it builds. To help ensure that the data that HR IT
uses in the development and test environment is as safe as it is in the
production environment, HR IT developed a strategy for sanitization (removal) of PII from this data. This
article details the strategy, discusses how the team piloted and tested the data
sanitization solution, and describes techniques that the team uses to help
ensure the success of the sanitization process.

To begin the data sanitization project, HR IT needed to
answer the following questions:

What data is PII or may be associated with PII?

What is the correct tool or solution to sanitize data when multiple
applications rely on specific data points and key data elements?

Will the sanitization strategy affect development or disrupt the
testing framework?

Will the sanitization strategy create overhead in terms of time and
performance?

Identifying Personal Data

Identifying data that contains PII can be complex. Although
some data values are obviously PII, such as last names, Social Security numbers
(SSNs), driver’s license numbers, and financial information, the attributes of other
data points may be less clear. Information such as hire dates, job titles, and
leave dates may not appear to be inherently PII when taken as a single value,
but when used in combination, they could reveal an employee’s identity.

For example, if a malicious user of Microsoft systems knows
that a leave of absence occurred on certain days, he or she could use the date
for the beginning of the leave and the date for the end of the leave, together
with items such as profession, discipline, and job title, to identify the
individual. The malicious user could then associate
information such as stock level, salary, and other PII with that individual. HR IT had to determine
which combinations of data points it should sanitize to make sure that a malicious
user cannot infer an identity.
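
One way to explore which combinations of data points could single out an individual is to count how many records share each combination of values. The following is a minimal sketch of that idea, not the method HR IT used; the function name, the column names, the threshold k, and the sample records are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def risky_combinations(rows, columns, k=2, max_size=3):
    """Return (combination, smallest group size) for every column
    combination where some set of values is shared by fewer than k rows."""
    risky = []
    for size in range(1, max_size + 1):
        for combo in combinations(columns, size):
            # Count how many rows share each tuple of values.
            counts = Counter(tuple(row[c] for c in combo) for row in rows)
            smallest = min(counts.values())
            if smallest < k:
                risky.append((combo, smallest))
    return risky


# Hypothetical sample records; no single column is unique on its own.
employees = [
    {"job_title": "Engineer", "hire_year": 2005, "leave_start": "2010-03-01"},
    {"job_title": "Engineer", "hire_year": 2005, "leave_start": "2010-06-15"},
    {"job_title": "Recruiter", "hire_year": 2007, "leave_start": "2010-03-01"},
    {"job_title": "Recruiter", "hire_year": 2007, "leave_start": "2010-06-15"},
]

risky = risky_combinations(employees, ["job_title", "hire_year", "leave_start"])
```

In this sample, no column identifies a record by itself, but job title combined with the leave start date does, which mirrors the leave-of-absence example above.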

Choosing the Sanitization Strategy

The HR IT team reviewed three strategies for sanitizing data.

Sanitize PII and non-PII. This approach
includes sanitizing all data points (PII data points and non-PII data points
that might lead to identification of an individual). However, many business
rules may be associated with non-PII points, and the sanitization of them might
make testing the business scenarios impossible. For example, an employee’s job title
is used in several stages of HR IT applications, such as applications for
recruiting, staffing, and performance management. If this information is
sanitized, it will cause tests that rely on it to fail.

Sanitize only PII data. This
approach sanitizes only high business impact (HBI) PII data points such as SSN,
personnel number, license number, and e-mail address, but does not sanitize
non-PII data points such as job title, work phone, and salary. Sanitizing HBI
and PII data points reduces the chance of making an association of personal
information with the database record values. However, this approach still
carries risks. For example, if the chief executive officer (CEO) of the company
earns the highest salary, a malicious user can simply look for the highest
salary or job title to identify the record of the CEO and determine other details. An
organization can mitigate this risk by running scripts to eliminate these
special cases from the data.

Sanitize non-PII data values. This approach
works when business logic exists on the primary key that contains PII data
points. With this approach, all data points are sanitized, except the primary-key
PII data point. For example, an application may have business logic related to
SSNs, so the SSN cannot be sanitized. In this scenario, the remaining data
points like first name, full name, and last name can be sanitized, so the SSN cannot
be connected to an identifiable person. The drawback to this approach is that
the high volume of data that must be sanitized may affect performance.

HR IT chose the strategy of sanitizing non-PII data
values for relevant applications. Advantages to this approach include the
preservation of data relationships and constraints. Additionally, because the
data points used for external systems are not sanitized, external application
interfaces such as SAP continue to work as expected.
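
The chosen approach can be pictured as a per-record transformation that leaves the key data points intact and scrambles everything else. This is a minimal sketch under stated assumptions: the field names are hypothetical, all values are assumed to be strings, and the mask function is only a placeholder for whatever scrambling method the real tool applies.

```python
def mask(value):
    """Placeholder scrambler (an assumption): replace every character with 'X'."""
    return "X" * len(value)


def sanitize_record(record, preserved_keys, scramble=mask):
    """Scramble every field except the key fields that other systems rely on."""
    return {
        field: (value if field in preserved_keys else scramble(value))
        for field, value in record.items()
    }


# Hypothetical record; all values are strings to keep the sketch simple.
record = {
    "personnel_number": "100234",
    "email": "test01@example.com",
    "career_stage": "Senior",
    "stock_level": "62",
}

sanitized = sanitize_record(record, preserved_keys={"personnel_number", "email"})
```

Because the key fields pass through unchanged, joins and lookups that depend on them behave the same before and after sanitization.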

Choosing the Sanitization Technology

There are many ways to sanitize PII, such as scrambling
data values, using data substitution methods, masking data, clearing data,
shuffling records, and employing encryption and decryption. Hardware encryption
is another method for securing data. Unfortunately, hardware encryption cannot occur
on virtual servers.

Multiple third-party tools are available to sanitize data,
and custom tools can be developed through application programming interfaces
(APIs) for hashing algorithms. HR IT chose to use a Microsoft internal tool for
several reasons, including the ability to use the stronger SHA-1 hashing
algorithm and to customize the workflow of the sanitization process.
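
To illustrate how a hashing algorithm combined with a hashing key can produce stable scrambled values, the following sketch uses HMAC with SHA-1 from the Python standard library. The article does not describe how the internal tool combines the key with each value, so the use of HMAC, the function name keyed_scramble, and the mapping of digest bytes to uppercase letters are assumptions made for illustration.

```python
import hashlib
import hmac
import string


def keyed_scramble(value: str, hashing_key: bytes) -> str:
    """Map a value to uppercase letters derived from an HMAC-SHA-1 digest.

    The output has the same length as the input, and the same key and
    value always produce the same output.
    """
    digest = hmac.new(hashing_key, value.encode("utf-8"), hashlib.sha1).digest()
    letters = string.ascii_uppercase
    # Reuse digest bytes cyclically if the value is longer than the digest.
    return "".join(
        letters[digest[i % len(digest)] % len(letters)] for i in range(len(value))
    )


key = b"hypothetical-shared-key"
first_run = keyed_scramble("SMITH", key)
second_run = keyed_scramble("SMITH", key)
```

Using one key for every table and environment means that a value scrambled in one place matches the same value scrambled elsewhere, which is what keeps related records aligned.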

The Microsoft Data Sanitization Tool is an internally
used application built on the Windows Server® 2003 operating system, Internet
Information Services (IIS), and the Microsoft® .NET Framework. The Data
Sanitization Tool uses a three-part workflow with pre-sanitization,
sanitization, and post-sanitization phases. A three-part process helps ensure that
the administrator can control the fields to be sanitized and can rebuild data
relationships and constraints.

In the pre-sanitization phase, the Data Sanitization Tool
scans the database schema and creates XML files that represent the schemas for
each database to be included in the sanitization. The administrator then
chooses the data points to be sanitized and begins the sanitization phase. The
sanitization phase removes all indexes, primary and foreign keys, triggers, and
default constraints from the tables being sanitized. In the post-sanitization phase,
the administrator rebuilds the indexes, keys, triggers, and constraints.
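
The three phases can be sketched against a small SQLite database. This is an illustration only and does not reflect the internals of the Data Sanitization Tool: it handles indexes but not keys, triggers, or default constraints, it reads index definitions directly instead of writing XML schema files, and the table, column, and function names are hypothetical.

```python
import sqlite3


def user_indexes(conn, table):
    """Pre-sanitization: record the CREATE INDEX statement of each index."""
    rows = conn.execute(
        "SELECT name, sql FROM sqlite_master "
        "WHERE type = 'index' AND tbl_name = ? AND sql IS NOT NULL",
        (table,),
    ).fetchall()
    return dict(rows)


def sanitize_table(conn, table, columns, scramble):
    """Drop indexes, rewrite the chosen columns, then rebuild the indexes."""
    saved = user_indexes(conn, table)                # pre-sanitization
    for name in saved:                               # sanitization
        conn.execute(f'DROP INDEX "{name}"')
    conn.create_function("scramble", 1, scramble)
    for column in columns:
        conn.execute(f'UPDATE "{table}" SET "{column}" = scramble("{column}")')
    for statement in saved.values():                 # post-sanitization
        conn.execute(statement)
    conn.commit()
    return sorted(saved)


def mask(value):
    """Placeholder scrambler (an assumption): keep NULLs, mask the rest."""
    return None if value is None else "X" * len(value)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (personnel_number INTEGER PRIMARY KEY, last_name TEXT)"
)
conn.execute("CREATE INDEX idx_last_name ON employee (last_name)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)", [(1, "SMITH"), (2, "JONES")]
)

rebuilt = sanitize_table(conn, "employee", ["last_name"], mask)
rows = conn.execute(
    "SELECT personnel_number, last_name FROM employee ORDER BY personnel_number"
).fetchall()
```

Dropping the indexes before the update avoids maintaining them row by row while values change, and rebuilding them afterward restores the original schema.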

Implementing and Testing the Solution

The HR IT team used a four-phased process in implementing
its data sanitization solution: envisioning, triage, implementation, and stabilization.

Envisioning. In this phase,
a technical privacy manager (TPM) provided the Data Sanitization Tool to the
adoption team. The TPM also explained what information would be required to
start the process. For example, the adoption team needed to supply the names of the
databases to be sanitized, the frequency of sanitization, cross-database
dependencies, and target location.

Triage. In this phase, the TPM and one lead person from each
discipline (development, testing, and project management) met and decided on the
PII data points to be sanitized, as well as the hashing methods to be applied
on the PII data points.

Implementation. This phase began after the PII points, hashing method,
and cross-database dependencies were identified. The same hashing key must be
used to sanitize across environments. The HR IT team used the Data Sanitization
Tool and then developed an application wrapper that calls the Data Sanitization
Tool executable files and completely automates the sanitization process. The team used two internal applications, a leave-of-absence reporting
tool and a career management roadmap, to pilot the data sanitization solution
before applying it to all relevant HR applications at Microsoft.

Stabilization. In
this phase, a TPM or operations team member who had permissions to view
unsanitized data compared the sanitized and unsanitized data by using
analysis tools. These tools validated that
the data sanitization had been properly implemented and that all the identified PII
data points had been sanitized. Then, the test team received the sanitized
database to do more rigorous testing to make sure that data integrity and the
user interface (UI) were maintained after sanitization. If the test team found
that test scenarios failed, they consulted with the TPM to do triage and
evaluated another way to sanitize the data that would not disrupt the testing
scenarios.
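
The comparison performed during stabilization can be approximated by pairing each original row with its sanitized counterpart and reporting any value that survived unchanged. The article does not name the tools used for this comparison, so the following is only a sketch of the idea; the function name, the row layout, and the sample data are hypothetical.

```python
def find_leaks(original_rows, sanitized_rows, sanitized_columns, key):
    """Return (key value, column) pairs whose value was not changed.

    A column of None marks a row that is missing after sanitization.
    """
    sanitized_by_key = {row[key]: row for row in sanitized_rows}
    leaks = []
    for row in original_rows:
        after = sanitized_by_key.get(row[key])
        if after is None:
            leaks.append((row[key], None))
            continue
        for column in sanitized_columns:
            # Empty (None) values carry no information, so they are skipped.
            if row[column] is not None and row[column] == after[column]:
                leaks.append((row[key], column))
    return leaks


# Hypothetical data: the second row was not sanitized correctly.
original = [
    {"id": 1, "last_name": "SMITH", "title": "Engineer"},
    {"id": 2, "last_name": "JONES", "title": "Recruiter"},
]
sanitized = [
    {"id": 1, "last_name": "QWXRT", "title": "Engineer"},
    {"id": 2, "last_name": "JONES", "title": "Recruiter"},
]

leaks = find_leaks(original, sanitized, ["last_name"], key="id")
```

A check like this needs access to unsanitized data, so it belongs with the TPM or operations role described above and not with the general test team.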

Testing the Results

HR IT found that several generic high-priority test cases
should be part of the testing kit for the sanitization process. The HR IT team
is working toward automating the test process to cover all
of the following scenarios:

Field overflow. Data type and
data length should be preserved before and after sanitization.

User interface. The UI should
not look distorted after the data is sanitized, and string spaces must be
preserved for the values that are sanitized.

Domain integrity at the column level.
If a value in a column such as first name appears many times in a table,
scrambling should produce the same sanitized value for every occurrence.
For example, if "SMITH" is scrambled to "SSSKKKK,"
every row of that table that contains "SMITH" should produce "SSSKKKK."

Entity integrity at the row level.
Integrity must be maintained at the row level—for example, a first name, last
name, and full name. If the first name is scrambled to “XXX” and the last name
is scrambled to “YYY,” the full name should be scrambled to “XXX YYY.”

Referential integrity. A self-referential
key must be maintained when the data value references itself in another column
of the same table. Passing the same hashing key while running the sanitization
process helps maintain data integrity when primary-key and foreign-key
relationships exist across tables on a particular data column. Sanitizing across
databases also requires use of the same hashing key.

Issues related to sanitizing e-mail
fields. In many applications, an e-mail address may be used for performing
Windows® authentication or as part of distribution security groups on which
role-based permissions are granted. If the e-mail address is sanitized through random
string values, the application authentication logic will fail. The solution is to
replace e-mail addresses with valid test-only e-mail accounts and make them
part of the security groups.

Repeatable sanitization. Each
import and run of the sanitization process should yield the same sanitized
values to facilitate successful and repeatable test cases.

Sequential data sanitization.
The data sanitization tool must respect the sanitization order that the
administrator chooses, if a workflow depends on the database. For example, first
name and last name may need to be sanitized before full name is sanitized.

Verification of constraints
and indexes. Indexes and constraints must be intact before and after
sanitization of the database.
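
Several of the scenarios above lend themselves to small automated checks. The following sketch covers field overflow, domain integrity, entity integrity, and repeatability. It is not the HR IT testing kit; the function names, the column lengths, and the sample rows are hypothetical, and toy_sanitize merely reverses each name so that the checks have something to run against.

```python
def fits_columns(rows, max_lengths):
    """Field overflow: every sanitized value fits its column length."""
    return all(
        len(row[column]) <= limit
        for row in rows
        for column, limit in max_lengths.items()
    )


def is_consistent(before, after, column):
    """Domain integrity: equal originals map to the same sanitized value."""
    mapping = {}
    for old, new in zip(before, after):
        if mapping.setdefault(old[column], new[column]) != new[column]:
            return False
    return True


def full_name_matches(rows):
    """Entity integrity: full name is the sanitized first and last name."""
    return all(
        row["full_name"] == f'{row["first_name"]} {row["last_name"]}'
        for row in rows
    )


def is_repeatable(sanitize, rows):
    """Repeatable sanitization: two runs give identical output."""
    return sanitize(rows) == sanitize(rows)


def toy_sanitize(rows):
    """Demo-only stand-in for a sanitizer: reverses first and last names."""
    result = []
    for row in rows:
        first, last = row["first_name"][::-1], row["last_name"][::-1]
        result.append(
            {"first_name": first, "last_name": last, "full_name": f"{first} {last}"}
        )
    return result


before = [
    {"first_name": "ANN", "last_name": "SMITH", "full_name": "ANN SMITH"},
    {"first_name": "RAJ", "last_name": "SMITH", "full_name": "RAJ SMITH"},
]
after = toy_sanitize(before)

results = {
    "field_overflow": fits_columns(after, {"first_name": 20, "last_name": 20}),
    "domain_integrity": is_consistent(before, after, "last_name"),
    "entity_integrity": full_name_matches(after),
    "repeatable": is_repeatable(toy_sanitize, before),
}
```

Each check returns a Boolean, so the same functions could be called from whichever test framework a team already uses.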

Best Practices

During the course of planning and implementing the data
sanitization project, the HR IT team learned lessons that yielded the following
best practices.

Use an Appropriate Strategy for Data Sanitization

Choosing the right strategy is based largely on the nature
of the data elements and how applications use that data. The HR IT team found
that the primary key within the data was frequently necessary to integrate with
other applications. In most cases, HR IT strongly advocates leaving the primary key unsanitized,
because doing so eliminates the need to perform additional testing on
important business rules that rely on the primary key. For example, a primary
key of e-mail address might be used as a unique identifier across applications.
This data cannot successfully be sanitized without losing the key.

Use a Proof of Concept

Using a proof of concept was important for the success of
the project. By using only two applications, the HR IT team limited the scope
of the effort while gaining insight into best practices for sanitization. The approach
that HR IT used for the proof of concept was to sanitize non-PII data values. HR
IT sanitized the data values associated with the employee and the data that the
employee provided. It would still be possible to identify the person by his or
her name, but it would not be possible to match that employee with the data associated
with the employee, because the data itself would be scrambled. For example, employee
career tracking, competency assessments, stock level, and similar data points
are now sanitized. Employee names are known and available from other systems
for regular business use. The method described, however, does not apply to
customer-related PII.

The strategy of sanitizing non-PII data values works well
for HR-related applications. Data points like personnel number and e-mail
address are the key elements used within the business logic for such
applications. Leaving these primary-key data points intact enabled the HR IT
applications to continue operating as usual.

Create Small Test Samples for Process Testing

Some applications are tightly integrated with other applications.
HR IT recommends creating a small sample of data that originates from internal
databases for process testing when sanitizing all non-PII values is not
feasible. Using samples of production data as input for the data sanitization
is also an alternative if the cost and time for sanitizing the full database are
unacceptable because the database is large
(100 gigabytes or larger).
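
A small sample is most useful when related rows stay together. The following sketch picks a fixed number of parent rows and keeps only the child rows that reference them. The article does not describe how HR IT built its samples, so the function name, the table layout, and the use of a fixed random seed for repeatable samples are assumptions.

```python
import random


def sample_with_children(parents, children, key, foreign_key, size, seed=0):
    """Pick `size` parent rows and the child rows that reference them."""
    rng = random.Random(seed)  # fixed seed, so the same sample is drawn each run
    chosen = rng.sample(parents, min(size, len(parents)))
    chosen_keys = {row[key] for row in chosen}
    related = [row for row in children if row[foreign_key] in chosen_keys]
    return chosen, related


# Hypothetical tables: employees and their leave records.
employees = [{"personnel_number": n} for n in range(1, 11)]
leaves = [{"leave_id": n, "personnel_number": (n % 10) + 1} for n in range(1, 31)]

sample_employees, sample_leaves = sample_with_children(
    employees, leaves, key="personnel_number", foreign_key="personnel_number", size=3
)
```

Sampling the parent table first and then filtering the child table keeps foreign-key relationships valid inside the sample.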

Conclusion

Many Microsoft HR IT applications use common access
points to data that previously contained PII. When developers create
applications, they frequently need access to relevant, realistic test data. HR
IT undertook an effort to provide data that was free of PII but still remained
useful for developers.

As part of the process, HR IT examined three approaches
for data sanitization and chose to sanitize information that was not personally
identifiable while still maintaining the important key relationships that many
applications need. HR IT used an internal application for data sanitization and
enhanced the application by automating the sanitization process. HR IT
identified several testing scenarios as high priority to help ensure that the
sanitization was successful and did not affect any applications.

For More Information

For more information about Microsoft products or
services, call the Microsoft Sales Information Center at (800) 426-9400. In
Canada, call the Microsoft Canada Order Centre at (800) 933-4750. Outside the
50 United States and Canada, please contact your local Microsoft subsidiary. To
access information via the World Wide Web, go to:

This document is for informational purposes only.
MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY. Microsoft,
Windows, and Windows Server are either registered trademarks or trademarks of
Microsoft Corporation in the United States and/or other countries. The names of
actual companies and products mentioned herein may be the trademarks of their
respective owners.