Many companies keep a mirror of their production platform in a QA environment; the only difference is that the QA environment connects to QA datasources (which generally contain fake data entered by QA engineers), whereas the production environment connects to production datasources (which hold real data).

Recently, we have had a couple of production issues due to data edge cases that were not present in the QA data. In response to these incidents, our QA team would like to regularly copy over production data to the QA datasources to try to catch more of these issues. Our development team has balked at this for a variety of reasons.

Is this commonly accepted practice in the industry? Does it have a name? Is it something we should avoid?

5 Answers

In my company, we use a separate test environment where we copy production data daily. This environment is periodically used to detect issues like the ones you have encountered.

The vast majority of our testing is carried out with synthesized, non-production data. Some of this is produced by hand, but most is produced by scripts we build. We periodically analyze the data in production to improve the way we create our test data. Still, there are occasions where copies of production data are better.

It's not simply a matter of copying data, though.

We set up production jobs to copy the data to a common test environment at a pre-determined time each morning.

We scrub the data to modify all personally identifiable information (PII), along with any other sensitive data. This is often called obfuscating or "greeking" the data. This PII is replaced with logically correct, but non-personal data. This task isn't always easy to do, but is required by our security and privacy standards.
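To make that concrete, a scrub step might look roughly like the minimal sketch below (Python against SQLite). The table and column names are hypothetical; a real pipeline would be driven by your data dictionary and your security and privacy standards.

    import hashlib
    import sqlite3

    # Columns treated as PII, per a (hypothetical) data dictionary.
    PII_COLUMNS = {
        "customers": ["full_name", "email", "phone"],
    }

    def greek(value, column):
        # Replace a real value with a logically valid but non-personal one.
        # Deterministic, so the same input always maps to the same fake value
        # and joins on these fields still line up after scrubbing.
        digest = hashlib.sha256(f"{column}:{value}".encode()).hexdigest()[:8]
        if column == "email":
            return f"user_{digest}@example.test"
        if column == "phone":
            return "555-01" + str(int(digest, 16) % 100).zfill(2)
        return f"Greeked {column} {digest}"

    def scrub_table(conn, table, columns):
        rows = conn.execute(f"SELECT rowid, {', '.join(columns)} FROM {table}").fetchall()
        assignments = ", ".join(f"{c} = ?" for c in columns)
        for rowid, *values in rows:
            fakes = [greek(v, c) for v, c in zip(values, columns)]
            conn.execute(f"UPDATE {table} SET {assignments} WHERE rowid = ?", (*fakes, rowid))
        conn.commit()

    if __name__ == "__main__":
        conn = sqlite3.connect("qa_copy.db")  # the freshly copied production snapshot
        for table, columns in PII_COLUMNS.items():
            scrub_table(conn, table, columns)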

We remove any transactional data that would get in the way of our testing. Some in-flight data gets copied that isn't relevant to our tests; we remove it to keep testing simpler.

We end up with a fresh copy of production data each morning before the testers come in. Testers can use this common data in read-only mode, or they can copy it to their individual test environments, modify it as needed and use it during their testing.

That sounds a lot like what we've done at my last two jobs. It took the better part of a day to scrub a production backup at my last job, but the data was invaluable. For example, by analyzing the database I was able to determine which of hundreds of paths were actually getting used in a complicated workflow, and therefore which were the most important for us to test. That analysis converted an unmanageable test problem into a manageable one.
– user246, Feb 22 '13 at 17:47

I haven't heard of a standardized term for this technique, but it is commonly used. We do this pretty regularly at my current company, but avoided it almost completely when I worked at Microsoft, even though there were a number of cases where it could have greatly improved our ability to troubleshoot issues. At Microsoft, they considered the risk of exposing private customer data too great compared to the benefit we would get from increased flexibility in our testing and troubleshooting. If you do it, here are some things to consider.

What data are you storing in production? Is any of it private or sensitive, or owned by a third party? If so, you may need to have a way of fuzzing the data or simply removing or filtering out the sensitive data.

How much space is your production data going to consume compared to your test data set? Do you have the capacity in your test environment to handle that much data? If not, do you have a way of breaking up the data based on date or something else?
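If the full data set won't fit, one common approach is to copy only a recent slice. A minimal sketch, in Python against SQLite with hypothetical table and column names (the same idea applies with any database driver or bulk export tool):

    import sqlite3
    from datetime import date, timedelta

    # Hypothetical cutoff: keep only the last 90 days of transactions in the QA copy.
    CUTOFF = (date.today() - timedelta(days=90)).isoformat()

    prod = sqlite3.connect("prod_snapshot.db")  # read-only snapshot of production
    qa = sqlite3.connect("qa_copy.db")

    rows = prod.execute(
        "SELECT id, account_id, amount, created_at FROM transactions WHERE created_at >= ?",
        (CUTOFF,),
    ).fetchall()

    qa.execute(
        "CREATE TABLE IF NOT EXISTS transactions "
        "(id INTEGER PRIMARY KEY, account_id INTEGER, amount REAL, created_at TEXT)"
    )
    qa.executemany("INSERT OR REPLACE INTO transactions VALUES (?, ?, ?, ?)", rows)
    qa.commit()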

Does your data persist in multiple states? For instance, we have log files that are parsed and processed in multiple ways and ultimately stored in various data stores (SQL, Hadoop, Cassandra, etc.). Do you want to pull down the initial log, or the data from the data stores, or both? What kind of process would you need for each of these?

Assuming you pull down production data, run tests, and modify that data in your test environment, what happens when you need to pull down from production again and it overwrites those changes? Will that break your tests? You'll need a process to handle this, since you will have to refresh from production periodically: what will the frequency be, and how disruptive will it be?

Does your data depend on other data? Sometimes we will pull certain data from production, and still not be able to reproduce the problem because the problem only occurs under certain circumstances, or with certain account or user settings which we also need to pull from Production.

I've done this in many places myself, copying over production data because it was so useful for troubleshooting issues that were only apparent in customer data, and because it provided additional test scenarios and data structures that we did not then need to recreate ourselves. It's extremely useful, and if your development team is balking at this you may want to find out why; perhaps they've been burned in the past, or don't see the value. I've let developers use test data in the past for checking issues, and in my current job we copy production data down not just to the test environment but to the developer environment as well, though not as often.

I have scripts that seed the data our automation needs into the copied production data, so we don't have to keep test accounts in production. This supports our automation testing. To be safe about notifications, we also make sure that SMTP is configured so that nothing is sent outside our domain; that way we don't have rogue emails going out from data we do not, or have no need to, scrub.
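As an illustration of the idea (not our actual scripts), a seeding step might look like the Python sketch below. The account table, config file, and mail-sink port are all hypothetical; a local debugging SMTP server such as MailHog or aiosmtpd can act as the sink so no mail ever leaves the machine.

    import sqlite3

    # Hypothetical accounts our automated tests expect to exist in the QA copy.
    AUTOMATION_ACCOUNTS = [
        ("qa_automation_1", "qa_automation_1@example.test"),
        ("qa_automation_2", "qa_automation_2@example.test"),
    ]

    def seed_automation_accounts(db_path="qa_copy.db"):
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE IF NOT EXISTS accounts (username TEXT PRIMARY KEY, email TEXT)")
        conn.executemany(
            "INSERT OR REPLACE INTO accounts (username, email) VALUES (?, ?)",
            AUTOMATION_ACCOUNTS,
        )
        conn.commit()

    def point_smtp_at_local_sink(config_path="qa_app.cfg"):
        # Point the application at a local mail sink instead of the real relay so that
        # notifications triggered by copied production data never leave our domain.
        with open(config_path, "w") as cfg:
            cfg.write("smtp_host = localhost\nsmtp_port = 1025\n")

    if __name__ == "__main__":
        seed_automation_accounts()
        point_smtp_at_local_sink()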

Pulling down production data is useful, but you definitely need a process around it to clean up the data, put in your test data, and verify that it was imported correctly. I also find it a useful way to learn about aspects of the production system that you normally may not come into contact with.

It really depends on the type of company you are and the products you are testing. It also comes down to your testing approach: are you basing your tests on the data available, or are you creating the data your tests require?

IMHO, the most effective way is to get a copy of production data, and perform analysis on it.

De-duplicate the data.

Perform "equivalence partitioning" on the data to identify test data records that, whilst different test exactly the same thing.

Then use that analysis to generate a base set of data that covers the production scenarios, and on top of that add any additional data you need for your test scenarios.
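A minimal sketch of the de-duplication and partitioning steps, in Python over a CSV export; the file name and the fields chosen as partition keys are hypothetical and would come out of your own analysis:

    import csv

    # Fields assumed to actually drive behaviour; records that agree on all of them
    # fall in the same equivalence class and test exactly the same thing.
    PARTITION_FIELDS = ["country", "payment_method", "order_status", "has_discount"]

    def representatives(path="production_orders.csv"):
        classes = {}
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                key = tuple(row[field] for field in PARTITION_FIELDS)
                # Keep the first record seen in each class; later ones are effectively
                # duplicates as far as the tests are concerned.
                classes.setdefault(key, row)
        return list(classes.values())

    if __name__ == "__main__":
        base_set = representatives()
        print(f"{len(base_set)} representative records cover the observed scenarios")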

The main difference between regular test data and production data is that testers and developers tend to use the system the way it was designed, whereas real users tend to leave fields blank, mis-key values, or insert junk data. So production data is a lot "dirtier" than most test data suites.

This is easy enough to uncover if you pull it into a spreadsheet and run a few macros on it.
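The same check can be scripted. For example, a rough pandas profile that counts blanks and obvious junk per column; the file name, string handling, and list of "junk" placeholders are illustrative:

    import pandas as pd

    # Read the (scrubbed) production extract as strings; file and columns are illustrative.
    df = pd.read_csv("production_extract.csv", dtype=str)

    JUNK = {"n/a", "N/A", "xxx", "???", "-", "."}  # obvious placeholder values

    summary = {}
    for col in df.columns:
        values = df[col].fillna("").str.strip()
        summary[col] = {
            "blank": int((values == "").sum()),
            "junk": int(values.isin(JUNK).sum()),
            "distinct": int(values.nunique()),
        }

    print(pd.DataFrame(summary).T)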

So to completely answer your question:

Testers absolutely must have access to analyse production data.

You may have data privacy issues that prevent it being used in raw form in test environments; if so, you should look into requiring obfuscated and anonymised data, and invest in tools to do that.

If testers can't be trusted with the data, I sure as heck wouldn't trust developers with it :-)

For highly sensitive data which can't be obfuscated and where raw data is required (such as general ledger information linked to stock trades in highly complex financial and regulatory instruments, where the dollar values and individual, sequenced entries are critical to the testing), we would treat the test environments with the same data protection and privacy controls as any production environment.

One alternative I have seen is to have the customer set up a "test" environment within their domain and then give specific personnel from your company access to it. That setup provides two benefits: a pre-deployment test environment to evaluate new versions directly against production data in a protected setting, and a platform to analyse defects directly in a production environment.

That solution allows a level of interaction with the customer that would not be available otherwise and benefits the seller, the buyer, and the end customers.

Interesting approach. One concern is that if QA relies too heavily on this environment you will start seeing buggy code run against production data.
– smp7d, Feb 22 '13 at 16:24

From a philosophical standpoint, I agree completely. Testing, questioning, and checking should be performed as early in the process as possible. However, I have been often surprised at the ability of users to create ingenious data scenarios. Having that environment available at the customer site as a final pre-deployment check has uncovered some issues that we were able to resolve prior to "going live".
– Jeff_Lucas, Feb 27 '13 at 13:21