Data Is a Cost Problem

One of the major challenges that software developers and testers face every day is the inability to get realistic data. As a developer, you often interact with a downstream service and must use whatever data happens to be available in that environment, because the process for obtaining actual, usable data for your scenario is extremely time-consuming. Often the data you need cannot be found at all and must be taken from production, which introduces a whole new set of challenges.

To complicate matters, personal data from production cannot be used as-is because it increases an organization’s risk of theft, loss, or exposure. Take the recent breach at Yahoo, where 500 million email accounts were exposed, or the more than 100 million LinkedIn accounts whose data was recently compromised. These breaches took place at the production level, where security is high. It is not uncommon for production data to be used in development environments, where security tends to be lower. Operating in this way presents a significant risk to an organization’s brand reputation. Thus, sensitive data must be scrubbed or masked, which is a time-consuming process requiring data expertise.

Using Service Virtualization to Overcome Data Costs

No matter what, data is a cost problem because it slows you down. With service virtualization, you can not only take control of a dependent application’s behavior and functionality to stabilize your test environments, but also completely control that dependency’s data sources and supply whatever data your effort requires that day. At this point, the rules change because you are now in control of both the data and the logic. You can create services that behave the way you want them to, rather than adhering strictly to their normal behavior patterns.

In a previous blog post, I discussed defect virtualization, which rests on the same basic principles, but there we were talking about service logic. This post takes the next step and talks about data control. To begin, let’s focus on the data challenge that testers and developers currently face day to day.

A Typical Day of Data in the Life of a Developer

At the beginning of an application's development process, the data required for testing is usually simple because the full functionality of the service has not yet been realized. As development continues to add functionality, testing maturity increases, and so does the data complexity.

Let’s reuse the example from a previous blog post. Say I am an airline, developing functionality on my tickets page. I need to verify that users can get tickets for their flights. Depending on how far in the future a flight is, the user will get one of several responses, and the response will change as the departure date gets closer. At the beginning of the development process, I could simply generate a batch of simple data with flights three months in the future, which would allow me to do all of the testing I need for the moment. But of course, the problem is that I just lit the fuse on a time bomb. In three months, this beautiful data will expire, and chances are I will have forgotten about it. Suddenly all of my tests will start failing at exactly the wrong time, because the release will be coming up and I simply will not have time to regenerate the data... Sound familiar?

Forge a Sustainable Path

By introducing service virtualization early in the development process, you can lay the foundation for solving these data challenges. A virtual service’s data can be derived from numerous locations, but at the beginning, simple virtual services start with fixed data. You create these “fixed assets” or mocks to support the early what-if testing stages and keep things very simple. The idea here is, “I just need a service that will respond with this particular payload.”
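To make the “fixed asset” idea concrete, here is a minimal sketch in Python. It is not tied to any particular service virtualization tool; the function name, the payload fields, and the flight number are all made up for illustration.

```python
import json

# Hypothetical canned payload for a ticket lookup. Every field name and
# value here is an assumption made for the sake of the example.
FIXED_TICKET_RESPONSE = {
    "flightNumber": "XX100",
    "departure": "LAX",
    "arrival": "JFK",
    "status": "CONFIRMED",
}


def fixed_ticket_service(request: dict) -> str:
    """Return the same JSON payload no matter what the request contains."""
    return json.dumps(FIXED_TICKET_RESPONSE)
```

Whatever the caller sends, it gets the same payload back, which is all an early what-if test usually needs.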

As virtual services mature, it becomes necessary to separate the data from the service so that if you want to add logic into the simulation, you don’t actually have to open up the virtual service to manipulate the data. In fact, mature users create a virtual service in such a way that the data source handles the bulk of the logic. They can then hand the data source off to a tester or test data management team to insert any data that this service might need in the future. Adding new functionality to the service is as simple as adding a row to the data source. This allows the virtualization effort to be shared and one virtual service can accommodate multiple teams. Virtual services become living organisms that grow and change as needed.
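One way to picture the separation of data from service is the sketch below. It assumes the data source is a simple CSV table with hypothetical column names; a real virtualization tool would have its own data source format, so treat this only as an illustration of the principle.

```python
import csv
import io

# Hypothetical data source. In practice this could be a CSV file or
# spreadsheet owned by the test data team. Column names are assumptions.
DATA_SOURCE = """flightNumber,departure,arrival,status
XX100,LAX,JFK,CONFIRMED
XX200,SFO,ORD,CONFIRMED
"""


def load_rows(text: str) -> dict:
    """Index the data source rows by flight number."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["flightNumber"]: dict(row) for row in reader}


def ticket_service(request: dict, rows: dict) -> dict:
    """Answer a request purely from the data source; the code holds no data."""
    row = rows.get(request.get("flightNumber"))
    if row is None:
        return {"error": "FLIGHT_NOT_FOUND"}
    return dict(row)
```

Supporting a new scenario means appending a row such as `XX300,BOS,LHR,CONFIRMED` to the data source; the service code itself does not change.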

Where Does This Data Come From?

Once development has created the initial simple service, it is time for the testing team to take over. Testing teams will have more complex data requirements. Where does this data come from? Typically, you derive this data from record and playback. This is often the first step when creating a virtual service. You record the transactions between an application and the dependent backend systems and use this recording to create your virtual service. This allows you to create a very usable baseline data source that can be extended whenever the need arises. In my airline example, this would allow us to get realistic flight numbers and destinations. The data would have all of the complexity necessary, including multi-segment and international flights. The data source correlation handles all of the complex request/response relationships, and since subsequent changes to the “real” data can simply be re-recorded and merged into the existing virtual service, getting new data becomes trivial.
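The record, playback, and merge cycle can be sketched roughly as follows. This is a simplified illustration, not how any specific tool implements it: the class names are invented, the backend is a stand-in function, and matching is done on the exact request, whereas real correlation is usually more flexible.

```python
import json


def request_key(request: dict) -> str:
    """Build a stable lookup key from a request (exact-match correlation)."""
    return json.dumps(request, sort_keys=True)


class RecordingProxy:
    """Sits in front of a backend and stores each request/response pair."""

    def __init__(self, backend):
        self.backend = backend
        self.recordings = {}

    def call(self, request: dict) -> dict:
        response = self.backend(request)
        self.recordings[request_key(request)] = response
        return response


class PlaybackService:
    """Replays recorded responses without contacting the backend."""

    def __init__(self, recordings: dict):
        self.recordings = dict(recordings)

    def merge(self, newer: dict) -> None:
        """Fold a later recording in; newer responses replace older ones."""
        self.recordings.update(newer)

    def call(self, request: dict) -> dict:
        return self.recordings.get(request_key(request), {"error": "NO_RECORDING"})


def stand_in_backend(request: dict) -> dict:
    """Hypothetical backend used only to demonstrate the recording step."""
    return {"flightNumber": request["flightNumber"], "segments": 2}
```

A typical flow would be to route test traffic through `RecordingProxy` once, build a `PlaybackService` from its `recordings`, and call `merge` whenever a fresh recording captures changed data.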

The data that we record does not come from production, and this protects us against a data breach in the lower environments. The challenge with this data is that since it doesn't come from production, it’s not as complete or up-to-date. This is where data generation and manipulation become a powerful function of service virtualization.

Missing data can be filled in with simple generated data to accomplish exactly what we need. In my airline example, the flight dates in the responses can always be today’s date offset by three months. By using data generation, this task becomes trivial, and the data never expires.
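A rolling flight date like this takes only a few lines to generate. The sketch below approximates “three months” as 90 days, which is my assumption; the function and field names are also hypothetical.

```python
from datetime import date, timedelta
from typing import Optional


def generated_flight_date(today: Optional[date] = None, offset_days: int = 90) -> date:
    """Return a flight date a fixed number of days after 'today'.

    'Three months' is approximated as 90 days (an assumption).
    """
    if today is None:
        today = date.today()
    return today + timedelta(days=offset_days)


def ticket_response(flight_number: str, today: Optional[date] = None) -> dict:
    """Build a response whose flight date is computed at request time."""
    return {
        "flightNumber": flight_number,
        "flightDate": generated_flight_date(today).isoformat(),
    }
```

Because the date is computed each time the virtual service responds, the test data stays three months out no matter when the tests run. The optional `today` parameter exists only so the behavior can be checked against a known date.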

We can continue to massage and manipulate the data by providing dynamic data to manage any “non-defined” request/response relationships. These are the types of relationships that could never exist in a static dataset. In the airline example, let’s say that a request to the downstream component includes the user’s current location, which is then used in the response as the departure city. Since our test cases would be constantly changing, a real service would have to maintain all of the current locations so that they can be supplied in the response. With a virtual service, you don’t need to maintain all of the locations; you can simply return the user’s current location dynamically as the departure city.
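Echoing a request value back into the response might look like the sketch below. The request and response field names (`currentLocation`, `departure`, and so on) and the fixed arrival value are assumptions for illustration only.

```python
def ticket_service(request: dict) -> dict:
    """Build a response whose departure city is copied from the request.

    No table of locations is kept anywhere; the value is taken from
    whatever the caller sent.
    """
    return {
        "flightNumber": request.get("flightNumber", "XX100"),
        "departure": request["currentLocation"],  # echoed from the request
        "arrival": "JFK",  # fixed value, hypothetical
    }
```

Any location a test case sends, including ones nobody anticipated, produces a consistent response.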

Finally, negative data can be supplied either statically or inserted into the data source to ease negative or abnormal testing. In my airline example, this would mean inserting a random canceled or delayed flight to validate that the user is notified before they leave for the airport.
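As a rough sketch of inserting negative data into a data source, the function below marks one randomly chosen flight as canceled or delayed. The status values and record layout are assumptions, and the optional random generator is there only so a test run can be made repeatable.

```python
import random
from typing import List, Optional

# Hypothetical status values for abnormal flights.
NEGATIVE_STATUSES = ("CANCELED", "DELAYED")


def inject_negative_flight(flights: List[dict], rng: Optional[random.Random] = None) -> List[dict]:
    """Return a copy of the flights with one random flight made abnormal.

    The input list is left untouched so the clean baseline data survives.
    """
    if rng is None:
        rng = random.Random()
    result = [dict(flight) for flight in flights]
    victim = rng.choice(result)
    victim["status"] = rng.choice(NEGATIVE_STATUSES)
    return result
```

Passing a seeded generator, for example `random.Random(1)`, makes the same flight fail on every run, which helps when a negative test needs to be reproduced.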