Generating test data is the only reliable way to run tests repeatedly and consistently, because you know the input data hasn’t changed between runs.

Some applications rely upon specific data which is either hard to find, or hard to fake. For example, the web application I am working on displays different promotions based upon which day of the week you are using the system, and also changes prices depending on the day of week and time of day.

If you were using production data for testing, you would either have to run tests at specific dates/times to cover the different promotions/prices, or you would have to change the server date/time. Changing the date/time on the server will affect anyone else using that server, so it should be avoided. It also means that if your automated tests run continuously against new check-ins without a known set of generated test data, you will get different results depending on the time of day.
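One way to make the day-of-week case repeatable is to pass the date into the pricing logic and run it against a small, fixed set of generated promotions. The sketch below is only an illustration: `Promotion`, `TEST_PROMOTIONS`, `promotion_for` and `price_at` are hypothetical names, and the rule of one discount per weekday is an assumption, not the actual application’s logic.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Promotion:
    name: str
    weekday: int  # 0 = Monday ... 6 = Sunday, as returned by datetime.weekday()
    discount_percent: int


# Generated test data: a fixed set that is identical on every test run.
TEST_PROMOTIONS = [
    Promotion("Monday Madness", weekday=0, discount_percent=10),
    Promotion("Weekend Special", weekday=5, discount_percent=25),
]


def promotion_for(now: datetime, promotions: list[Promotion]) -> Optional[Promotion]:
    """Return the first promotion active on the weekday of `now`, if any."""
    for promo in promotions:
        if promo.weekday == now.weekday():
            return promo
    return None


def price_at(base_price: float, now: datetime, promotions: list[Promotion]) -> float:
    """Apply the active promotion (if any) to the base price."""
    promo = promotion_for(now, promotions)
    if promo is None:
        return base_price
    return round(base_price * (100 - promo.discount_percent) / 100, 2)


# The test supplies the date, so the result does not depend on when it runs.
monday = datetime(2024, 1, 1, 9, 0)     # a Monday
saturday = datetime(2024, 1, 6, 20, 0)  # a Saturday
print(price_at(20.0, monday, TEST_PROMOTIONS))
print(price_at(20.0, saturday, TEST_PROMOTIONS))
```

Because the date is an argument rather than read from the server clock, nobody else on the server is affected and the same check-in always produces the same result.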

When developing an entirely new feature, there won’t be production data that you can use for testing, so you will need to generate some in this case.

Generating specific test data will often take longer than sourcing production data, but the investment pays off over time as tests are run consistently against a known data set.

You should use production data for testing

When you’re testing a web application, you’re as much testing the data as testing the application behavior. Using production data will ensure that what you are testing will be as close as possible to the actual behavior once the feature is released to production users.

If you generate test data and use it to test, who is to say that this test data is actually valid? If you generate test data through lower level means (such as SQL insert scripts), you may introduce data that isn’t representative of production. That can either hide errors in functionality that only appear when running against production data, or cause test failures that would never occur in production. As your database schema evolves, you will also need to keep your data generation scripts up to date so they reflect production at all times.

If you do use production data, you need to be clever about how to source data. Querying the database using SQL scripts is an effective approach as it will enable you to quickly find real data that you can use to verify a story has been implemented correctly.

It will also allow you to identify outliers and edge cases that can be tested using real production data against the system in development.
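As a rough sketch of this kind of query, the example below looks for outliers in an `orders` table. The table, its columns and the thresholds are hypothetical, and the in-memory SQLite database with a few illustrative rows merely stands in for a read-only copy of production.

```python
import sqlite3


def find_edge_case_orders(conn: sqlite3.Connection, limit: int = 5) -> list:
    """Return orders with unusual values: non-positive totals or very long names."""
    query = """
        SELECT id, customer_name, total
        FROM orders
        WHERE total <= 0 OR LENGTH(customer_name) > 50
        ORDER BY id
        LIMIT ?
    """
    return conn.execute(query, (limit,)).fetchall()


# Stand-in for a production copy (hypothetical schema, illustrative rows only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_name TEXT, total REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "Ann Lee", 19.99),
        (2, "Bob Ray", 0.0),    # zero-value order
        (3, "X" * 60, 5.0),     # unusually long name
        (4, "Cy Moe", -3.5),    # negative total, e.g. a refund
    ],
)

edge_cases = find_edge_case_orders(conn)
for order_id, name, total in edge_cases:
    print(order_id, name[:20], total)
```

The rows returned can then be used as inputs when verifying the story against the system in development.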

If there are any concerns about using production data for testing, these can be mitigated by obfuscating the data so that real customers cannot be identified from it.
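A minimal sketch of one obfuscation approach is shown below: personal fields are replaced with a keyed hash, so the same customer always maps to the same token while the real values are not stored. The field names (`name`, `email`) and the helpers `pseudonymize` and `obfuscate_customer` are assumptions for illustration, and whether this level of masking is sufficient depends on your own data and requirements.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret: bytes) -> str:
    """Map a value to a stable 12-character token using HMAC-SHA256."""
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:12]


def obfuscate_customer(row: dict, secret: bytes) -> dict:
    """Return a copy of `row` with the personal fields replaced by tokens."""
    masked = dict(row)  # copy, so the original row is left untouched
    token = pseudonymize(row["email"], secret)
    masked["name"] = f"Customer {token}"
    masked["email"] = f"{token}@example.invalid"
    return masked


# Hypothetical record; the secret should be kept out of the test environment.
secret = b"keep-this-outside-the-test-database"
record = {"id": 7, "name": "Jane Smith", "email": "jane@example.com", "total": 42.5}
print(obfuscate_customer(record, secret))
```

Non-personal fields such as `id` and `total` are left as they are, so the data keeps the shape and values the tests rely on.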

4 thoughts on “Should you use production data or generate test data for testing?”

There is a term for this: seed data. There is nothing wrong with it. You may need it when:
* a new feature is developed. There is no production data yet. Seed data may even become a part of deployment.
* production data is not complete enough for full test coverage at the moment. A good example: a sale promotion type that is rarely active in production.

…Also, for very large data sets, production-like or production data is essential for performance testing, so you have a more accurate indication of how the database will perform; 6 million ‘smiths’ will behave differently to prod data lol.

It purely depends on what application/module is under test, what stage of the development lifecycle you are in, and what the purpose of the testing actually is. In no way can you “paint all surfaces” with the “same brush”, so to speak (I hate cliches, believe me, but couldn’t find another way to put it for a Friday!)… My two cents