
One of my biggest problems in software testing has always been the test environment, and more specifically the fact that I had little control over it. I could not deploy whenever I liked, I could not upgrade it, and I was not allowed to access its internals. I had to ask someone else (most likely a system administrator) to change or update the environment, or to extract data from it.

In my early days the test environment had to be upgraded by hand — it was installed on real hardware after all. A single upgrade would usually take days.

Then came the virtual machines. You still had to do stuff by hand, but now at least the host OS enabled some level of automation. At one of the companies I worked for, we had a nightly build that compiled all the code changes from the previous day into a binary file. I created a script that took this binary file, built a virtual machine image with it and set up the software under test with a basic configuration. When people came to work in the morning they could just download the new image and start testing on the latest build. This saved us hours, as before that everyone was performing the same overlapping setup operations on their own test environment.

Then came the containers, and they streamlined the process of creating and configuring test environments even further. A single container runs a single process and boots very fast compared to a virtual machine.

Nowadays it is really important to be able to run high level automated tests after every commit. If a test breaks you’ll know exactly which commit was the culprit. For this reason, running tests on a predefined schedule (e.g. every night, or every 4 hours) is suboptimal, as the run will include changes from more than one commit, which further complicates finding the cause of a failure.

Having a fast, dedicated, created-on-demand test environment makes the above and even more possible. It is the gift that keeps on giving, for these reasons:

Being able to trace, play and pause a single test across different services

Even if all tests pass, to continue to probe for unexpected/unseen problems and anomalies

Have the test environment described as code

Exclude the environment as a cause when investigating a flaky test (because it's used for automated tests only, can be recreated from code the same way every time, and its state and data are reset every time the tests start)

During the past few years I spoke at conferences about fast tests, deep oracles and super stable tests. None of those advancements would be possible without a fast-to-boot, dedicated test environment that we have full control over. Such an environment can be created with regular servers (although it is going to be slower), but currently the best way is to use containers. Maybe in the future, unikernels will be the norm. I’ll try to keep my recommendations as technology agnostic as possible. I’ll also be writing from the perspective of high level tests.

The high level plan for the creation of such an environment is below; a minimal shell sketch follows the list:

Clear any previous state and data

Download the latest artifacts to be tested

Start the containers and do app specific setup

Set up service virtualization for anything outside of your control

Run the tests

Collect and analyze the generated data

On the next commit go to step 1.
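
A minimal shell sketch of this loop, assuming a docker-compose based setup; the helper script names are hypothetical:

#!/usr/bin/env bash
set -euo pipefail
docker-compose down --volumes --remove-orphans  # 1. clear any previous state and data
docker-compose pull                             # 2. download the latest artifacts to be tested
docker-compose up -d                            # 3. start the containers
./app_specific_setup.sh                         # 3. app specific setup (hypothetical script)
./setup_simulator.sh                            # 4. service virtualization for external dependencies (hypothetical)
./run_tests.sh                                  # 5. run the tests
./collect_and_analyze.sh                        # 6. collect and analyze the generated data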

Those points are discussed below.

Start with one Service

Even if you have a fully containerized setup of your app, start your test environment with a single container. This will be the walking skeleton of the test environment. It’s one service only, but in order for it to run you’ll need more than the service itself: there will be databases that need to be attached, an API authentication service of some sort. You’ll need to figure out the mechanism to download the latest binary artifacts on each commit and to configure them. Run a few tests to “smoke test” (pun intended) the environment, and once this is ready proceed to add more services to your test environment setup. Also, it’s going to be easier to debug infrastructure problems with one container instead of seventeen.

Fully deployed test environment with 18 containers

Reset

Before the tests start, you need to reset any leftovers from the previous run. They may include artifacts, data, logs and configurations. This is needed for two main reasons — to start each test run on a clean slate with exactly the same conditions each time, and to be able to extract meaningful information from the system after the run completes and correlate it to a specific code revision. At a minimum you should restart the containers (or the OS). The next thing is to remove any old source code and logs.

Depending on how your app is structured, deleting the log files may not be an option. In this case the best thing you can do is to keep the log file, but truncate its size to zero.

sudo truncate -s 0 $(docker inspect -f '{{.LogPath}}' docker_nagual)

For database resets, check the section below.

Start Order

If your app contains multiple services, their starting order can be important. For example, one service may check upon start whether it has a connection to a Redis cluster and shut down if such a connection cannot be established.

If you’re using Docker, docker-compose has a mechanism to explicitly order the way the containers start. This is accomplished with the depends_on keyword. However, this solution is not bulletproof. Docker considers a container to be started when the boot process concludes. The container may be up, but it may take additional seconds for the main process to actually start, to open a network port, or to respond to a query. In such cases you need to take additional measures to ensure that the starting of containers does not continue unless the process is really in a running state.

One inelegant but effective way is to not start the process in the current container until a port in another container is open. One simple way to do that is by running netcat in an endless loop that breaks only when a three-way handshake concludes successfully. Here is an example:
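
# A reconstruction sketch: the container hostnames, the default Zookeeper/Kafka
# ports (2181, 9092) and the app entrypoint are illustrative.
while ! nc -z zookeeper 2181; do sleep 1; done   # wait for the Zookeeper port
while ! nc -z kafka 9092; do sleep 1; done       # then wait for the Kafka port
exec ./start-app.sh                              # only then start the main app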

In the example above, first the Zookeeper port needs to be open, then the Kafka port needs to be open, and only then does the main app start. Both Zookeeper and Kafka are running in separate containers and it takes a while for them to start. Our app requires them both to already be running, otherwise it shuts down.

In order to start from a clean slate each time, besides restarting the servers/containers, replacing the code with the newest revision and removing the old logs, you also need to have fresh and empty databases.

One of the principles that define a good automated test is that the test should create all the data that it needs, in a prelude, and then continue with the actual requests to the app. This enables the tests to run on an empty database and also to run safely in parallel and in random order.

Even with an empty database, you still need the latest DB schema before the tests start. All the tables should be there, with the proper column types (also triggers and indexes, if any). Depending on how the backend is written, maybe some specific indexes need to be installed in a NoSQL DB.

The point is that you need the latest schema applied. There are two ways to do that. The best one is to be able to recreate the schema from a script. It assumes that every time, even for the smallest schema change, this script needs to be updated. It’s a really good practice, as this DB creation script is under version control, can be reviewed, its history tracked etc. Unfortunately, in the companies I’ve worked for so far, even if such a script existed, it was not kept up to date. In this case the other option is to dump the existing DB schema from a running server and apply it to the testing environment.
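
For MySQL, such a dump might look like this (host, user and database names are placeholders):

mysqldump --no-data -h stage-db -u dbuser -p app_db | sed 's/ AUTO_INCREMENT=[0-9]*//g' > schema.sql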

The sed command at the end just removes the auto increment, and it’s optional.

Before the start of the tests, you can dump the latest schema (w/o the data) directly from a stage or production environment and then restore it on an empty DB instance. This approach has two distinct disadvantages: the first one is that it may be a bit slow, depending on the size and the complexity of the schema. The second one is that you actually need to have the credentials to dump the DB schema (which may be problematic for security reasons), and you need network access to do this. It somewhat defeats the goal of having a self sufficient, hermetic test environment that you can set up without internet.

Depending on the design of your app, you may actually need data in some of the SQL tables or NoSQL collections. These are the so-called configuration tables: currency and country codes, language constants, mapping coordinates etc. In this case you need to specifically dump those tables along with their data.

If you subscribe to the principle that each test creates the data that it needs, you’ll need one more piece of data. You’ll need at least one user in the system to authenticate against, to get the whole ball rolling. I call this user the God user, as it is the first in the empty test environment universe. It’s OK to create this user with a single DB command, as it usually affects only a few rows.
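
For example, with MySQL this could be a one-liner (the table and column names are hypothetical):

mysql app_db -e "INSERT INTO users (email, password_hash) VALUES ('god@example.com', SHA2('secret', 256))"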

The data that is created in the database is synthetic and has very little value after the tests pass. In order to speed up test execution, we actually run the database not from the filesystem but entirely from memory. For MySQL, the simplest way to do this is to append an option to the command line that starts the server:

mysqld some_options --datadir /dev/shm

Running a database from memory only also speeds up the time to delete this data. As it is stored only in memory, when the container is restarted the data evaporates. There are more details in this presentation or watch the short version video.

Stub External Dependencies

A modern application will most likely have some sort of 3rd party integration (social networks, payment providers, storage etc.). In order to have a fully hermetic test environment, these should not require an internet connection while the tests are running. As mentioned above, you may be required to have an internet connection during the setup phase - for example to download a repository from GitHub or to dump a DB schema from a running instance. Generally, however, it’s always a good idea to stub/mock/fake any external API requests that are not under your control. This also includes resources internal to your organization that are not easy to run as part of your hermetic test environment: e.g. a mainframe server or a Google BigQuery instance.

The three green boxes are external dependencies that are stubbed. The rest have their own containers.

At my previous gig, we developed an HTTP simulator to help us stub the social networks (think of it as creating a mini-Facebook that we have full control over). In order to create such a setup, we needed to append a line to the /etc/hosts file in each of the app’s containers to intercept and influence the traffic to the 3rd party services. Think of this as a man-in-the-middle attack.

echo $SIMULATOR_IP api.instagram.com >> /etc/hosts

Deep Oracles

One of the great advantages of having a fully independent and dedicated test environment is that you can use it to find unknown and unexpected defects. As automated tests detect only the defects that they were programmed to detect, with a dedicated test environment you can continue probing and looking for clues of defects. There are three main areas to look for defects in (a post-run log check sketch follows the list):

Logs. Look for any unexpected error or exception in the application or services logs

Data. Look for bad or incorrect data, anything out of the ordinary.

Metrics. Collect application metrics from the OS, database, filesystem, while the tests are running and then compare them to a baseline or threshold limits.
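
As an example of the first area, a minimal post-run log sweep might look like this (the log directory and patterns are illustrative):

# fail the run if any application or component log mentions an error or exception
if grep -rilE 'exception|error' ./logs; then
  echo 'Unexpected errors found in the logs'
  exit 1
fi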

Here are some miscellaneous tips that did not fit any of the categories above:

If parts of your app run interpreted code (PHP, Ruby, etc.), you can save some time and download/clone only the latest revision, because you don’t need the history. If you’re using Git: git clone --depth=1

If you use configuration services such as Zookeeper or etcd, consider whether you can substitute them (for automated test purposes) with plain text based config files. This can be done with an environment variable switch in the code. The result will be significant time savings, as loading such configuration every time (e.g. into Zookeeper) leads to very slow setup times.

If you use some sort of dedicated service for API authentication (like Keycloak), then in order for your tests to be hermetic, you either need to create such a dedicated service/container in your test environment, or to simulate it like any 3rd party dependency outside of your control (we’ve successfully simulated the Keycloak authentication mechanism with Nagual). There is a third way, however: use an env variable to turn the API authentication on and off. It will not affect the functional tests in any way, but you need to be careful not to mess up the deployments in production and leave the APIs unauthenticated.

If your app uses message queues, consider that for testing purposes you can bypass them and execute the requests synchronously. As this requires changes to the backend code, it is more intrusive to the functional tests than the bypass of the API authentication mentioned above. However, you’ll configure one less container and will speed up test execution. Like all the tips in this section, this is highly context specific.

If you have company created base images for containers, by all means — use them. The same is true if you already have codified the way that your infrastructure is created. Utilize the infrastructure as code if you have such, don’t reinvent the wheel.

In order to avoid setting up development environments, make sure that you also containerize the environment in which you run your automated tests. Our tests used to be written in Ruby and we’d have to install RVM on the machine that runs the tests to manage the dependencies. Nowadays it’s easier to spin up a dedicated Ruby container and leave the host OS untouched (besides having Docker installed).
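
A sketch of such a containerized run, assuming an RSpec based suite mounted from the current directory:

docker run --rm -v "$PWD":/suite -w /suite ruby:2.6 sh -c 'bundle install && bundle exec rspec'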

In 2016, I gave a talk at the ExpoQA conference in Madrid. The talk was called Need for Speed[1]. In that talk, I omitted an important piece of information which I thought was obvious to everyone. In hindsight, however, it looks like it was obvious only to me.

One of the slides in this talk is about how we decreased the execution time of 600 API tests from 3 hours to (currently less than) 3 minutes.

After the ExpoQA conference I was reviewing the notifications on Twitter and saw someone live tweeting at the same time as this slide was up. The text was something along the lines of: “Wow, Emanuil, what kind of slow system are you working on, so that it takes 3 hours to execute 600 API tests?”. I didn’t give it much thought back then, but in the following years, whenever I gave a talk, I started clarifying what I mean by API tests[2].

Currently, most people understand “API test” as a single request to a web based API (REST or SOAP) and then an evaluation of the response. Such a test can be performed by tools like Postman, SOAPUI or similar.

What I mean by an “API test” is something different. It is a high level test[3] which consists of more than one request to an API endpoint. In fact, in our current tests, a single “API test”, on average, makes between 15 and 20 requests to the backend. Such a test typically creates all the data it needs through the API, exercises the feature under test, and verifies the results with further requests, as sketched below.
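
A rough sketch of what such a multi-request test can look like at the HTTP level; the endpoints, payloads and the use of jq are invented for illustration:

BASE=https://test-env.example.com/api
# prelude: authenticate as an existing user
TOKEN=$(curl -s -X POST "$BASE/auth" -d '{"user":"god@example.com","pass":"secret"}' | jq -r .token)
# exercise the feature under test
curl -s -X POST "$BASE/posts" -H "Authorization: Bearer $TOKEN" -d '{"text":"hello"}'
# verify the result with a further read
curl -s "$BASE/posts?author=me" -H "Authorization: Bearer $TOKEN" | jq '.items | length'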

You may think that this test resembles more or less an end-to-end test, and you may be right. Those API tests are using the same requests as the ones that a web based UI (or mobile app) would make to the backend.

A browser making a web service call when the 'Publish' button is pressed. The same result can be achieved with just an API test

So these tests are essentially the same as the automated UI tests[4], but they are skipping the unstable UI part.

The main advantage of such tests is that while they still exercise the backend logic of an application, they are orders of magnitude faster and way more stable. The drawback is that they do not test the frontend. However, the frontend can be tested separately from the backend by mocking the real backend.

Those frontend tests can be performed with the real UI assembled - e.g. for functionality that does DOM manipulation - or they can test functions only, ones that do not interact with the UI. In the case of the former, the mocking of the backend can be done by a wide variety of tools, including one that I’ve developed for a similar purpose - Nagual.

I think having separate API tests and dedicated UI tests with a mocked backend is a superior approach for modern applications compared to having only full UI based end-to-end tests. Why do people continue to write only the latter? For a bunch of reasons, the main ones being:

They are used to working that way. A while ago[5], all the HTML plus all the data was constructed in the backend and sent to the frontend only to be visualized. The only way to interact with the backend of an app was through the UI of a browser (or a desktop application). But apps have evolved. Currently the backend sends only data in JSON or XML format and it’s up to the frontend to construct the HTML and bind the data to it. We have a shorter path to reach the backend[6] — the web services.

In the old days there were lots of differences between the web browsers. Even basic HTML could be rendered differently. Back then it made sense to use automation to run the same test on different browsers. Nowadays, the differences are negligible and they don’t matter for most apps[7].

The management loves to watch a running Selenium UI test — it feels like magic. Clicking, selecting from drop down menus, entering text, navigating — all without human interaction. The work that a Selenium UI test performs is highly visible. A web service test run is invisible, save for the occasional data dump in the console and the end status of the test.

“The only way to test a system is the same way an end user would use it.” Not true, as we can decompose the user actions into smaller, more manageable chunks of work. API and unit tests are enough. Occasionally we’d still need some high level Selenium UI tests, but those will mostly verify that all the moving parts of an app are wired correctly — e.g. that the frontend talks to the right endpoint.

There is a large industry focused solely on UI automation. It constantly boasts that the only way to do test automation is via the UI: commercial tools instead of the open source Selenium; on demand cloud environments where you can rent a dazzling array of browser + operating system or mobile OS combinations; and, to top it all, a wide range of UI automation test consultants offering their services.

Most of the people who write UI automated tests are either not developers or are very disconnected from the developers. As the developers are usually writing JS unit tests only, they know that there is not much sense in testing the same functionality with the slower end-to-end UI tests. However, some companies have QA engineers that write only end-to-end UI tests. They may not work at the same cadence as the developers. Or the QA team is totally separate from the development team - e.g. an outsourced QA department. In this case one of the easiest things to measure[8] is how many test cases are created (raw count).

1. Later that year I was invited by Google to present it at GTAC.
2. And for that matter, all test related terminology - e.g. 'unit tests' - means different things to different companies.
3. It means that the backend system (or large parts of it) is fully deployed. There is a real database and network connections between the different components. The connections to the 3rd party services are either established or stubbed at a higher level.
4. Usually automated with a tool like Selenium.
5. It depends on the industry, but for most 'a while ago' is 7-10 years.
6. This is where the majority of the business logic resides.
7. The most notable exclusion may be e-commerce apps, where a few pixels or a different color can make a big difference.
8. But also the most inappropriate, as this is purely a vanity metric.


(This blog post is expanding on a presentation I gave recently. Slides are here.)

Modern software is a complex beast. We need to be sure that even after the smallest change is introduced, our application continues to function as intended (also, small does not equal safe). This type of testing is called regression testing. 15 years ago, when I started working as a tester, the regression tests were performed by hand. I was part of a testing team, sitting in a crammed room with one small window, getting paid $160/month after taxes to do just that. Performing the same boring, mindless, repetitive regression tests before each software release. Everyone in the team was thrilled when they got the chance to test a brand new feature (exploratory testing). Unfortunately, the majority of the work was regression testing.

Performing manual regression testing works up to a certain point. As more features are completed, more tests move from the ‘exploratory’ type to the ‘regression’ type. Like a Ponzi scheme, in order to continue operating, you need to hire lots of testers for cheap. At some point either the market drives up the wages, or the regression tests become way too many to manage effectively.

This is the point at which automated regression tests start to make sense.[1] They are great because they do exactly what they are programmed to do — no more, and no less. They don’t get tired, they don’t need sleep, they don’t get distracted, they don’t get nervous. In one word, they are consistent. Companies can now hire fewer testers to do manual regression testing.[2]

Automated regression tests can be divided into two types depending on what level they operate at in the application stack. Low level, small tests operate at the code level — unit tests. High level, larger tests operate on the external interfaces of the application — usually at the API or UI level. They also require the full deployment of the application. In this blog post I’ll focus on the high level tests.

There are three basic problems with high level tests[3] — they are slow, they are flaky[4], and when they fail it is usually hard to locate where the problem is. I’ve given a number of presentations about how these problems can be addressed.

In this post I want to highlight one more problem with the automated regression tests. Their rigidness, their inability to spot problems that they were not programmed to spot, their inability to detect the unanticipated.

A couple of years ago I was presenting at a well known software testing conference on the East coast. I was very surprised that the talking point from the previous paragraph was brought up by the majority of speakers as a way to completely denounce automated tests. Their main idea was that we went too far with the automated tests and we should go back to manual testing. The investment in test automation is just not worth it and it will never be able to replace a smart human.

I think we’re a long way from replacing humans in any creative and problem solving activity, but I want to highlight how we can make our tools smarter in order to augment our abilities.

The following are six simple approaches we use to expand the reach and multiply the value of our automated tests, so that they are able to detect defects that they are not specifically programmed to detect. The best part is that no change to the tests themselves is needed, and they work with every programming language and testing framework.

Flaky Tests

Let’s start with something simple. Flaky tests are a reality for everyone who writes functional high level automated tests. Flaky tests are caused by three major factors: the tests themselves, how the application is set up to run, or the application itself. How can you find out if a single test is flaky? You can, for example, run it 100 times against an application that does not change (deployed on a dedicated testing environment). If the test passes 100 times in a row, then it has less than 1% chance of being flaky[5].
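
Such a check is easy to script; a minimal sketch, assuming an RSpec suite (the spec file name is hypothetical):

for i in $(seq 1 100); do
  bundle exec rspec spec/login_spec.rb || { echo "failed on run $i - the test may be flaky"; exit 1; }
done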

To speed up test execution, the functional tests are usually run in parallel, which puts additional strain on the application under test.

In my experience, the majority of the flaky tests are caused by the tests themselves (around 80%). But in order to gain more knowledge about an application we need to also pay attention to the flaky tests that are caused by the setup of the application and by the application itself (around 20%).

If you are sure that the test itself is not the cause of the flakiness, then it means that it’s the application. Don’t despair, as this is an excellent opportunity to learn more about it and to improve. The following are some of the problems that we have discovered when investigating flaky tests.

Setup causes

Load balancer or pool issues. Tests may fail when the load balancer points all the requests to a few backend machines (because of resource constraints). Tests will fail if a machine from the pool is malfunctioning

Write concerns. Hardcoded reads from a replica that is not geographically the closest (fastest) one

Lack of, or insufficient, retries when a problem occurs in a distributed application.

DB connections are not closed when a write operation completes

Random Test Data

I used to work for a company that had a great set of automated UI tests. There were around 800 of them. Before the start of each test, it would create all the data it needed — users, clients, transactions etc.[6] And then the actual test would commence. The tests were reading test data from an XLS file. One row - one test case. The problem was that the data was always the same. Each user had the same first and last name. It was always John Payer.[7]

Using random test data generation instead of static test data goes a long way. Instead of Payer as a last name, why not iterate through different names? How about the famous O’Conner case (a single quote in a name, used to find problematic SQL queries)? Or someone named Müller (an ISO/IEC 646 character), or Славов (Cyrillic UTF-8 characters)? Instead of having separate tests for each of the supported encodings, a better approach is to have the names automatically generated by a library. This test data generation function will be used by all existing tests at no extra cost. Such libraries are widely available for all programming languages. We’re using one for JavaScript and another for Ruby.

The company I currently work for develops SaaS that collects data from the biggest social networks (Facebook, Twitter, Instagram, YouTube, LinkedIn). To Facebook alone, we currently send more than 10 million API requests per day. In order for our tests to be stable and reliable, we developed a second generation HTTPS traffic simulator. Initially, the simulator would return the same post text when we tried to simulate new Facebook activity: ‘automated tests message’. As our clients started to discover bugs in the code related to the parsing of Facebook posts, we started using the ‘problematic’ posts in the traffic simulator.

The three most problematic components of a Facebook post text for us were: special characters, non-ASCII encoded text, and mentions[8]. We constructed a simple function that returns a ‘random’ text containing any of those three characteristics in a different order.
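
A minimal sketch of such a generator (the sample fragments are illustrative, not the ones from our simulator):

random_post_text() {
  # shuffle special characters, non-ASCII text and a mention into a random order
  local parts=('<&">%' 'Müller Славов' '@TestPage mention')
  shuf -e "${parts[@]}" | tr '\n' ' '
}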

Here is another example of the random data returned by the simulator. This is how the data is generated when a request to Facebook is being made to retrieve a picture post. More real life examples can be found here.

Having an automated test that generates all the data it needs is by itself a great leap forward. Go one step further: randomize the test data and you will multiply the value of that test.

Attack Proxy

At present, the majority of high level automated tests use the HTTP protocol to communicate with the application under test. UI tests use HTML/XHR sent over HTTP, and web service tests use JSON/XML sent over HTTP as well. As HTTP traffic can easily be sent through a proxy server, a variety of tools exist to alter it[9]. Those tools can be used to cycle through the input parameters of the HTTP requests in order to trigger security vulnerabilities in the tested application. Note that this technique is a subset of the previous one (randomization of test data) — but with the sole purpose of finding security vulnerabilities.

In the previous technique, we expect all of our tests to pass and investigate any failures. In this technique the majority of the tests will fail, because we inject random data that will not be accepted by the application most of the time. What is valuable are the responses we get back. They are interpreted by the attack proxy, which decides whether our application is vulnerable to a certain attack.

Some of the major types of vulnerabilities that can be detected with this technique are: shell command injection, SQL injection, Cross Site Scripting (XSS), Cross Site Request Forgery, and information disclosure.

Having automated tests run through an attack proxy is way more effective at finding vulnerabilities than just pointing an attack tool at your SaaS application and expecting it to crawl and make requests on its own. Your tests know how to log in, how to navigate, how to insert the right data (e.g. a credit card number, or a valid customer identifier) in order to reach all the deep corners of your application. At a company that I worked for, we uncovered a pretty nasty XSS vulnerability three screens down, on the payments confirmation page. This page was never discovered by the automatic vulnerability detection tool, because it did not know what values to enter to reach that deep. But an automated UI test could. We found this vulnerability in the first days after we started passing all automated tests through an attack proxy.

The attack proxy works by analyzing the incoming request and replacing valid data at specific points with attack strings. There are many attack scripts to trigger different vulnerabilities for one entry point. This means that the execution time of those tests will be longer. Depending on the tool you use and its settings, expect 3 to 7 times longer execution time. This fact makes automated tests using an attack proxy unsuitable to run after every code change. If you cannot make them run faster (without cutting corners, limiting the number of checks performed etc.), the only sensible time to run them is during the night.

Some of the tools we’ve used are ZapProxy, Accunetix, Netsparker and Burp Suite. Of those, our favorite is the last one. One point to consider - if your high level tests run in headless mode (this is also valid for the UI tests), it’s better to choose a tool that can work from the command line only, without a UI. Some of the attack proxies have only a UI for setup, which is really hard to work with if you want to automate the nightly headless run.
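
Routing an existing suite through the proxy usually requires no test changes, as most HTTP clients honor the standard proxy environment variables (the proxy address is illustrative):

HTTP_PROXY=http://attack-proxy:8080 HTTPS_PROXY=http://attack-proxy:8080 bundle exec rspec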

Dedicated Testing Environment

The next three techniques are possible only when the automated tests run in their own dedicated environment. If you don’t have that set up already, stop reading this blog post and come back when you have it. Using a dedicated testing environment will also increase the stability and predictability of your automated tests.

Application Exceptions

An exception is thrown when an application does not know how to handle an unexpected condition. It means that we, as developers, have not anticipated a certain action, event or condition. It means that we don’t (yet) have enough understanding of the problem we’re trying to solve. As our applications are getting more and more complex, not being able to anticipate every outcome becomes the norm. What matters is how we react to those unexpected events.

Let’s assume you have 700 automated tests that run on a dedicated test environment. You run them and they all pass — no failures. Case closed? It turns out that you can do a lot more even after your tests complete. If you reset your dedicated testing environment[10] to a clean state before every batch of tests starts, then those tests will leave their unique marks on the pristine testing environment.

In order to be useful, the automated tests have some sort of assertion, most likely in the last step[11]. They may check for a desired entry in a database, or whether a specific response is returned by a web service, or whether a certain element is present in the UI. But the count of those assertions is quite small. A single test is programmed to check only for a handful of conditions that will certify that the tested functionality works correctly. During the test execution, an exception may be thrown, but for various reasons it may not bubble up to the interface the high level test is using (API or UI). Thus this exception will not cause the test to fail[12]. Sometimes the developers will catch an unexpected exception, log a message such as “Something Wrong Happened” or “You should never see this error”, and continue the program execution as normal.

Even if your automated tests pass successfully you still have work to do. Check all the application logs (the code that you’ve written) and all the component logs (databases, message brokers, configuration services) for exceptions or errors. If there are none, then declare your test run as successfully completed. Any exception found means that something is wrong. In this case fail the test run and start investigating.

By default, PHP will log all its errors/exceptions in the error.log file.

A Java application will most likely use whatever file is defined in the log4j.xml file.

Here is an exception from an Elasticsearch log file that is due to a change in the mapping configuration. The indexing was performed by an async task, triggered indirectly by the automated test. It was a fire and forget type of task, so it never bothered to check whether it succeeded. We caught this problem only because we were monitoring the Elasticsearch logs for exceptions.

How you search your log files for exceptions depends on their format. In a text file, a simple grep for the case insensitive string ‘exception’ might be enough. You may have to parse any JSON/XML log files, and you’ll most certainly need to parse binary format log files.
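
For plain text logs, the sweep plus a list of known, expected exceptions can be as simple as (the excluded exception name is hypothetical):

grep -ri 'exception' /var/log/app/ | grep -v 'ExpectedMaintenanceException'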

As you progress with the search for exceptions in your logs, you’ll find it useful to exclude certain exceptions — based on the type of exception, on the message or on the stack trace. Those exceptions may be expected, for example the developers throwing an exception when simple information logging would be enough. Or exceptions caused by factors outside of your control.

Sometimes it’s really useful to know which of your 700 passing tests caused a specific exception. If you’re running your tests in parallel (to speed up execution) this is not a trivial task. One approach, when you encounter an exception in the logs, is to start running the tests sequentially and, after each one completes, check the log files for the given exception. This type of automatic detection requires a bit of coding and it may look slow, but it will save you time; if you try to find the culprit by hand, you will most likely end up using the same technique anyway.

Monitoring your logs for exceptions after the tests complete is a great way to extend their value beyond the simple, high level assertions in the last step.

Bad Data

An exception is a great manifestation of a condition that cannot be handled by the application. It shows exactly where the problem lies. Some error conditions are more subtle and do not reveal themselves that easily. Take for example bad/invalid data. It can take many forms, including: duplicated, inconsistently formatted, missing, conflicting, unsynchronized, unrealistic. It may not trigger an exception at first, and it can be recorded and go unnoticed for a long period. Problems start to arise when this data is being used, but by then the code that produced it may not even exist.

Some of the world’s most expensive defects are caused by the mishandling of bad data. One of the latest examples is the Mars Schiaparelli lander. Around 3 kilometers above the surface of the planet, the internal sensors reported a negative altitude, which was interpreted as a successful landing. Because of this, the parachute was released and the lander began a free fall. The bad data in this case was the negative altitude, which is unrealistic.

Back on Earth, at our company we have our share of bad data. Every morning we review the exceptions that happened in the last 24 hours in production. It turns out that 19% of them are caused by bad data. We had NoSQL collections that contained up to 5% invalid data.

One of the more glaring examples of bad data consequences caused an application in production to crash three times with out of memory errors until we figured out what happened. We develop a SaaS application that collects lots of data from the biggest social networks. Twitter is among those networks. The account IDs that we want to collect data for are stored in an SQL table. A Twitter account ID is usually a large number.

The backend at the time was written in PHP. There was a cronjob that ran periodically, read the Twitter IDs from the SQL table and then sent API requests to Twitter. The job was written so that it could handle data collection for a single Twitter account, or for all of the Twitter accounts if a single Twitter ID was not set.

At some point, unknown to us, a new row appeared in the Twitter accounts table. Its account ID was set to 0 (zero). When the cronjob ran, the original code interpreted 0 as null[13], so the if condition was not satisfied. Instead of collecting information from Twitter for a single account only, the execution jumped to the else branch that was collecting data for all the Twitter accounts. When the loop circled back to that row with the Twitter ID set to zero, it started the full data collection once more. There was no break from this infinite loop. All of the system memory was consumed and the application crashed. It happened three times until we finally figured out why.

The point is that 0 is a valid integer. The SQL column type was ‘bigint’, so the DB was not complaining upon row insert. And yet, given the context that we operate in (Twitter), this was bad data of the unrealistic type. Same as the negative altitude, it should not have happened.

What we developed was a piece of code that periodically checks the database for bad data. To generate the list of checks, we brainstormed what bad data would look like in our context. We also used bad data that had caused exceptions in production.

Whenever the high level tests pass successfully, we run this same bad data check to make sure that the newly pushed application code is not generating any invalid data. The test suite fails if bad data is found.
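
A minimal sketch of one such check, using the Twitter ID story above (the table and column names are hypothetical):

# count rows with unrealistic account IDs; fail the suite if any are found
BAD=$(mysql -N app_db -e "SELECT COUNT(*) FROM twitter_accounts WHERE account_id <= 0")
[ "$BAD" -eq 0 ] || { echo "bad data: $BAD unrealistic Twitter IDs"; exit 1; }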

Performance Metrics

Compared to the previous two techniques, this one is not binary. Most of the time, you cannot fail the build just because a metric reached a threshold. What is important is the trend over time. But let’s not get ahead of ourselves.

Besides checking for exceptions and bad data, after the test run you can do one more thing - take different metrics related to the performance of the application. Since the test environment is isolated and used only by the automated tests, the results are very accurate. Here is an example of what you can measure during the test run and after it completes.

Log file: size in bytes, number of lines, number of errors and exceptions, regular lines to exceptions ratio
Network: DNS requests, number of packets, calls to external 3rd party services
SQL server: the number of read/write queries, average transaction time, the number of threads, pool reads/requests, pool pages/utilization
NoSQL server: the number of read/write queries, total connections, network requests
JVM: objects created, threads, heap size, garbage collection
Server/Container OS: average/top CPU, memory consumption, swap size, disk i/o, network i/o

This is an example of the application log file size plotted over six months, calculated after each commit. After a single commit the size grew by 54%. If the commit is small enough, you can even pinpoint the line responsible for the spike. The tests did not fail, but this can cause problems in production if disk space is limited, or CPU/memory problems if this log is parsed somehow.

The next example is the number of DB queries, in this case the sum of read and write operations. Again, this number is calculated after every commit and the values in the plot are from the last six months. A series of three commits was responsible for a 26% spike in the queries. This may be completely normal behaviour, but to be certain a proper investigation is due.

Those metrics are not enough to fail a build the way the presence of exceptions or bad data is. But they are very helpful to track and plot over time. They are a great way to know that the newly pushed code does not flood the logs with four times more data, or that it doesn’t produce 50x the DNS requests[14].

The best usage of those metrics would be to set an alarm at a certain threshold. For example, send a notification when a new commit causes the SQL read requests to jump more than three times compared to the previous commit. If you're sure that a 3x increase is totally unacceptable, then why not just fail the build?
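
A sketch of such an alarm, assuming the per-commit query counts are stored in plain files by the metrics collection step:

prev=$(cat previous_query_count.txt)
curr=$(cat current_query_count.txt)
if [ "$curr" -gt $((prev * 3)) ]; then
  echo "SQL queries jumped from $prev to $curr" | mail -s 'metrics alarm' team@example.com
fi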

Bottom line: listen to the weak signals.

At least one high level automated test per story

All of the techniques listed above depend on one thing — having good high level coverage by automated tests. Every feature, every corner of the application should have at least two test cases: one positive (the happy path) and one negative. Those tests should go through as many layers of the application as possible, including communication with internal services over the network and database operations. The six techniques listed above will extend and amplify from thereon.

15 years ago we were not concerned with the speed of testing that much because we were releasing to customers 2-3 times a year. ↩︎

This is also good for the employees, as no one likes to work on boring, repetitive tasks. ↩︎

Note that this will never happen with unit tests as any exception on that level will cause the tests to fail immediately. ↩︎

There is a lot of magic in how PHP interprets true/false conditions (link) ↩︎

Those are real life examples from before we had this system. We found out the hard way about them — in production. ↩︎


In April 2017 I gave a talk at CraftConf titled The Ultimate Feedback Loop[1]. In this talk I shared the results of an investigation of almost 200 customer reported bugs over a span of two and a half years. Lots of insights came out, but one piece of information stood out in particular. Maybe because it challenged a deeply entrenched belief about the test automation pyramid. According to this belief, you should have a wide base of unit tests, fewer API tests[2] and just a small number of UI tests on top. However, our data shows that, for our product, API tests give us the biggest bang for the buck. By writing integration tests we can detect 180% more defects than by writing unit tests for the same features. How is this possible?

The Testing Pyramid

According to the testing pyramid, the automated tests can be divided into three large groups — unit, API and UI.

Unit tests run only in memory; usually they call a single method, and they don’t interact with the outside world. They don't send data over the network, don’t write anything to disk and don’t store anything in a database. And because they don’t interact with the outside world, they are so fast and deterministic.

API tests run when the whole application is deployed and operational — e.g. it has a database running, a network and so on. Modern applications expose a web services layer[3]. The API tests exercise the whole application as it is intended to be used by the end customers, operating one level below the UI. API tests interact with the application the way another application would — not the way a human would.

UI tests also need a fully deployed application to operate. However, they interact with the application the way a human would — via HTML (web browser) or, in the case of a mobile app, via the touch interfaces. Under the skin, those interfaces are using API calls to send data to and read from the backend. Those tests are the most realistic, however they are also the most brittle[4] and require high maintenance, because there are so many more moving parts.

Unit Tests Are Not Enough

When we investigated each customer reported defect, we could link it back to the location in the codebase where the fix for it was made. This is easy because developers put the bug ID in the commit message.

Looking at the methods where the fixes were made, we found out that 7% of them were already covered with unit tests. And yet, those tests were not able to catch the defects. Something was obviously wrong. This prompted us to dig a little deeper. If a method had 100% code coverage, why was there a bug? What other kind of test would have helped us detect this bug earlier?

Consultants

In hindsight, almost every defect can be reproduced with a unit test. External conditions can be simulated, and so can various failure modes, by using test doubles. However, how realistic is this in a developer’s day job? Some consultants will try to convince you that unit tests are the only true way. But this is far from our reality (and the data we collected). Obviously, consultants also have to eat and will try to convince you that what they preach is the best. Just take it with a grain of salt and run your own experiments to gather information.

Our Methodology

In order to determine which types of test (unit, API or UI) had the highest chance of catching a defect, we collected two pieces of crucial information.

The first one is the place where the defect manifests itself. This is the location where the defect is observed. We use the defect description, but usually dig a little deeper to see where the defect can be observed initially. Most of the defects manifest themselves to the customer in the UI, but the cause may be a corrupted database entry, for example. So we look a little deeper. The location could be a method where an exception is thrown, or a database table row with incorrect data, or a misaligned UI element.

The second piece of information that we collect for every defect is the place where the developer fixed it. In the majority of cases this is a single method/function in the codebase. We gather this information by looking at the commit for the fixed bug (as noted above, developers put the bug ticket ID in the commit message).

The simplest case is when the defect manifestation location and the bug fix location are actually in the same method or function, or even in the same class. We consider that these defects could have been detected if only a unit test had been written for the given method/function. 13% of the customer reported defects fall into this category. In the presentation there is an example of a missing condition in an if statement that was causing an exception later in the same method. Had a unit test been written, this defect would have been trivial to spot.

If a defect’s manifestation location is in one class, but the actual fix is in another, then we consider that this bug could have been detected only by writing an integration test[5]. Those tests touch a number of classes and methods, and interact with the network and the database. The higher you go in the test automation pyramid, the more combinations you have. We also had to apply some common sense (we excluded corner cases and unlikely-to-think-of scenarios) about how realistic it would be for an engineer to write a test that would highlight the defect. As with the unit tests example above, in hindsight you may speculate that the majority of the defects can be reproduced with high level tests, but — how realistic is this (we found that 30% of the defects cannot realistically be detected by any type of automated test)? By writing a simple API test, 36% of the customer reported defects could have been easily detected.

In the presentation there is an example of missing text encoding in a method that leads to a search function not working at all. The manifestation of this defect is first observed in the database, where unencoded data is recorded. The fix is not in the database but in the method that should have applied the text encoding before the data is written to the database. This is a clear example of defect manifestation and bug fix located in different parts of the system.

Some types of defects require the UI in order to be reproduced, so only that type of automated test could detect them. For us they were 21% of all the cases. It is important to note two observations:

If the defect manifestation and the bug fix for a UI defect are located in the same JavaScript function, we consider that writing a unit test for that function is enough to detect the defect. There is no need to create a full UI test (using Selenium) to detect this bug.

If the defect is in the backend and it is realistic for it to be detected by an automated API test (had one been written), then we consider this defect an API detectable one.

For us, API tests are a clear winner. Why? It’s mainly because of the type of software that we write. It is SaaS that collects lots of data from the biggest social networks[6]. It then labels it, filters it and calculates various statistics. Our software does not contain many algorithms for data parsing, text extraction etc. (those are very easy to test with unit tests). Integration tests that touch most of the system components are way more valuable than unit tests that cover only a single class/method and rely on test doubles. The majority of the bugs that we discover lie in the seams of the system, in the interactions between the different components. It is impossible to cover those cases with unit tests.

Modern Software

Modern day software applications have the following characteristics:

They are not monoliths. Current applications are broken down into smaller pieces, which makes them easier to develop and deploy. The rise of the microservices architecture is a testament to this trend. For example, our current application has more than 20 components that can be deployed separately (these are discrete components, and the count does not include cluster nodes). These components use web services to talk to each other. In order to test the whole system we need to have all of them deployed. And that's where the integration tests come in very handy.

They use 3rd party services. Literally for just pennies, you can get lots of data via different service providers - FX rates, weather forecasts, social media feeds, payment processing etc. Another current trend is using functions as a service (lambda functions) that are hosted by someone else. In order to produce working software, those 3rd party functions need to be glued together. Unfortunately, to fully test this type of code you need a working end-to-end system.

They are complex. Software started as single programs, able to accomplish a specific task. With the invention of high level languages, 3rd party libraries and open source code, the speed of development increased. With event driven architectures and multi-threaded programming, the interactions between different components can explode exponentially and in unexpected ways. Those interactions are very hard to test in the carefully manicured environment of the unit tests bubble.

It's only natural then, that for some companies investing in integration tests makes a lot of sense. Integration tests usually use one of two interfaces: web services or UI. Due to the fast changes in the frontend development frameworks (jQuery -> Backbone -> Angular -> React) and the changes in the user interfaces (desktop -> web -> mobile), maintaining reliable UI based tests is very complex and time consuming. But until recently we had no choice. Web services tests were not an option, because most of the web based applications were rendering HTML in the backend and serving it assembled to the clients. The only way to interact with an application was through its UI.

Ironically, the rise of the frontend frameworks that consume only data from the backend and construct the HTML on the client side made the more reliable web services tests possible. The detachment of the UI rendering from the backend opened up not just the possibility to test below the HTML layer. It also enabled complete testing of the rendered UI with a detached backend (e.g. when the whole backend is simulated).

As long as you do not change the web service endpoints in a significant way, the API tests require surprisingly little maintenance. At the current company that I work for, the API tests that were developed 4 years ago are still running. We’ve completely rewritten the frontend once, developed two different versions of a mobile app and replaced the backend language (from PHP to Java) using the strangler approach. Unit tests for the PHP backend were no longer useful, but the same API tests, testing the business logic that did not change, continued to provide value and detected bugs that we introduced when we rewrote the backend.

The Problems with Integration Tests

The purpose of any type of automated test/check is to help us identify a potential problem as fast as possible. But if the integration tests are so awesome at helping us with this, then why aren’t they widely used? Well, for starters there are three major problems that need to be resolved.

Integration tests are slow. To get the maximum value from any type of automated test, it needs to run after every system change (code, configuration, database). Unit tests are the ideal candidate for that because they run in milliseconds. Integration tests run orders of magnitude slower. Of course, API tests run faster than UI tests, but if specific measures are not taken, they are still not suitable for execution after every system change. You have to aim to run all your automated tests (not just unit), after every single change, in less than 3 minutes. More than that, and the developers get distracted and fall out of the flow.

Integration tests are unreliable. Unit tests run in a bubble, in a predefined and deterministic environment. Integration tests, on the other hand, run on a fully operational system with lots of factors outside of our control. A number of moving parts may interact in unpredicted ways. To be useful, those tests need not only to be fast but to be deterministic. They should not fail for random reasons. Run them 100 times on a system that does not change, and they should pass 100 times. Shameless plug: I’ll be giving a talk at Agile Testing Days 2017 on how to make your integration tests more reliable.

Integration tests cannot pinpoint the location of a defect. When a unit test fails, you have a pretty good idea where and why the failure occurs. A good unit test will execute a single method, and if it fails, it’s either because an exception is thrown or because an assertion fails. You have a pretty good idea where this happens in the codebase. Now compare that to what happens when an integration test fails. If it’s a UI test, then an HTML element might be missing; if it is an API test, then you may get a 500 internal server error. Either way, that is hardly the information we need to pinpoint what and where the problem is.

How to Solve Them

In 2016, at the Google Test Automation Conference, I gave a presentation on how we solved the above-mentioned integration tests problems. Here you can find the video and the slides. An extended slides version from VelocityConf is here. Below you can find some additional notes.

Integration tests can be divided into two kinds depending on the interface they use - API or UI. API tests are inherently more stable, as the medium they use is made for machine consumption (HTTP and JSON/XML). The UI tests' medium is made for human consumption. The majority of UI tests can be broken down into API and unit tests.

This division is very helpful because for most applications, the backend is where the business logic is located. No need to fire up a heavy Selenium test if you can achieve the same result with a lightweight API test. The same is true for testing a dropdown menu in the UI. No need for a heavy Selenium test that expects the full system to be deployed. You can either test it with a unit test or stub the backend completely if you still wish to use Selenium. Only a really small number of full UI tests should remain, to make sure that all the parts of the system are wired correctly.

Almost all modern day applications connect to some 3rd party web service (social networks, payment services, realtime notifications). To achieve fast and reliable integration tests when you have a system outside of your control, you need to be able to stub/mock/fake its responses. The industry adopted term for this is service virtualization (I don't like it and think it is a fancy title for something so simple). At my current company we developed our own tool to completely isolate us from the slow internet, social network limitations and outages.

Stable and reliable integration tests require a dedicated environment to run on. This way, you're shielding the running tests from random events such as cron jobs, developers pushing new code, changes to the DB, etc. If possible, use containers to emulate your production environment. Our test environment consists of 20 containers, each having a single purpose. They are restarted before every test suite, and all of the test data is also cleared. Only the source code of the application is replaced with the one from the current commit that we want to test. Obviously, it's not the same hardware, and not suitable for performance tests, but it has all the components of the production environment and they are configured the same way. If some of the systems you're integrated with are too old or otherwise impossible to containerize (e.g. a mainframe), use service virtualization to simulate them.

The Unexpected Benefits

Since integration tests require a fully operational system, we can use them as an early indicator, as a weak signal amplifier, for all sorts of potential problems. After each test suite execution completes, if there are no test case failures, we can do a number of additional checks:

All the application and container logs are examined for exceptions and errors. A test case may cause a backend exception but, if that exception is not manifested at the API/UI level, the test will not know about it and will happily pass. Most of the time, this behavior is caused by bad programming practices — e.g. catching an exception, logging it, but then continuing as usual. Also note that since those integration tests run in parallel, it is not possible to check for exceptions after each test case concludes. It's impossible to determine which one of them caused the exception while they are running. So the exceptions check needs to happen after all the tests complete.
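Here is a minimal sketch of such a post-suite log scan. It only illustrates the idea: the log directory and the error patterns are assumptions, and a real check would be tuned to your own log format.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.List;
import java.util.regex.Pattern;

public class LogScan {
    // Assumed markers of trouble; adjust to your logging conventions.
    private static final Pattern PROBLEM =
            Pattern.compile("\\b(ERROR|FATAL|Exception)\\b");

    public static void main(String[] args) throws IOException {
        Path logsDir = Paths.get(args.length > 0 ? args[0] : "logs");
        List<String> hits;
        try (var files = Files.walk(logsDir)) {
            hits = files.filter(Files::isRegularFile)
                    .flatMap(p -> readLines(p).stream())
                    .filter(line -> PROBLEM.matcher(line).find())
                    .toList();
        }
        if (!hits.isEmpty()) {
            hits.forEach(System.err::println);
            System.exit(1); // a passing suite with dirty logs still fails the build
        }
    }

    private static List<String> readLines(Path p) {
        try {
            return Files.readAllLines(p);
        } catch (IOException e) {
            return List.of(); // unreadable (e.g. binary) file: skip it
        }
    }
}
```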

All databases are examined for unexpected/wrong data. Once we had a problem with our production environment. Out of the blue we started running out of memory, and once that happened our application crashed. This happened three times in a row before we figured out what the problem was. In our database we have a column with Twitter IDs for which we collect data. The ID is an integer with value 1 or greater. 0, while a valid integer, is not a valid Twitter ID. However, it turned out that one of the entries in that column was 0. The backend was written in PHP and, because of its magic type conversion capabilities, 0 equals false. This would cause an eternal loop which would consume all the machine's memory. Long story short, we have now implemented a system self-check and repair for such incorrect data (as SQL data types are not fine grained enough). And while this check is straightforward for SQL databases, we get the biggest bang for our buck with unstructured data in NoSQL databases.
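Sticking with the Twitter ID story, a minimal version of such a self-check could be a scheduled query for impossible values. The table and column names below are invented for illustration:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class DataIntegrityCheck {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(System.getenv("DB_URL"));
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(
                     "SELECT id, twitter_id FROM monitored_profiles "
                     + "WHERE twitter_id <= 0")) {
            while (rs.next()) {
                // Flag (or repair) the row before a 0 value can send the
                // backend into an endless loop again.
                System.err.printf("Bad Twitter ID %d in row %d%n",
                        rs.getLong("twitter_id"), rs.getLong("id"));
            }
        }
    }
}
```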

The status of the containers. After the test suite passes, we check the status of each container. Sometimes there is a crash, but no errors or exceptions in the container logs.

Measure the performance impact of every commit. The only thing that changes between test suite runs is the source code. We can take various performance metrics for each commit and compare them to the previous ones. Does this commit improve the performance or make it worse? Some of the metrics to collect are: the time it takes to complete all the test cases, the size of the log files, max memory or CPU consumed, swap usage size, disk I/O operations count, network traffic sent in/out, garbage collection stats, the number of database operations, as well as various database performance statistics and cache hits/misses.
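As a toy illustration of that comparison, the sketch below keeps one metric per commit in a properties file and warns when the suite gets noticeably slower. The file name, the metric and the 20% threshold are all invented:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

public class PerfTrend {
    public static void main(String[] args) throws IOException {
        Path store = Paths.get("perf-history.properties"); // invented name
        Properties prev = new Properties();
        if (Files.exists(store)) {
            try (var in = Files.newInputStream(store)) {
                prev.load(in);
            }
        }
        long suiteMillis = Long.parseLong(args[0]); // measured by the CI job
        long last = Long.parseLong(
                prev.getProperty("suiteMillis", String.valueOf(suiteMillis)));
        if (suiteMillis > last * 1.2) { // 20% slower than the previous commit
            System.err.println("Possible performance regression: "
                    + last + " ms -> " + suiteMillis + " ms");
        }
        prev.setProperty("suiteMillis", String.valueOf(suiteMillis));
        try (var out = Files.newOutputStream(store)) {
            prev.store(out, "per-commit suite metrics");
        }
    }
}
```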

Conclusion

For a while I thought something was not right with our results - preferring integration over unit tests. It turns out that other people are starting to think this way too. Similar articles came out in recent weeks - here and here.

But take my words with a grain of salt. There are no universal best practices. Go and see. Experiment and measure for yourself. Think.

For example, we make more than 10M requests to Facebook APIs every day ↩︎

No Country for Old QA

In March 2017 I gave a presentation at a QA conference: QA Challenge Accepted 3.0. The title was No Country for Old QA, and in it I summarized my experience from the last 15 years working as QA. It also included my thoughts on where our industry is going and what the future of this profession is. The presentation slides are in English, but I spoke in Bulgarian. After the conference was over, I received lots of requests to publish a summary of the talk in English, as people wanted to share bits and pieces with their teams. You can find the highlights below.

The Current State of QA

Different people have different opinions about what the QA role is and does. But until recent years it was a role almost universally present in every software company. Recently, however, a breed of companies started emerging that do not have a dedicated QA role. This trend usually begins with startups. It does not make economic sense to invest heavily in quality software until a product market fit is found. Github, Stripe and Airbnb are examples of such companies. They are all private companies, and not small ones at that. As of March 2017, Airbnb's valuation was $31B with ~5,000 employees. For comparison, Bulgaria's GDP is $50B with a population of ~7,000,000.

So the QA role does not exist in startups, but when the company grows it has to have QA, right? Here are some examples of big, publicly traded software companies that do not have such a role[1]: Yahoo!, Facebook, Microsoft, Google. Facebook was a startup, but they never had a dedicated QA. Even as they grew, they found ways to keep it that way[2]. Even Microsoft, once proud of its 2:1 QA to developer ratio[3], is now phasing out the formal QA role.

OK, but in some industries you absolutely need QA, like in gaming? Wrong. Here is one example:

This is a screenshot from a game called No Man's Sky. The purpose of the game is to travel around the galaxy in a spaceship, collect resources, fight other explorers etc. The unique thing about this game is that whenever you approach a planet for the first time, the game autogenerates it, including everything on it — the geography, flora, fauna and resources. The number of planets is 18 quintillion. So each player gets to experience a unique, one of a kind journey. How do you test all those planets in your lifetime? How do you test something nondeterministic? The short answer is: you can't. What the developers of this game did was create bots that fly around, land on planets, take some screenshots and a short video, and fly away to the next planet. The bots can't visit all the planets, because our Sun will be long gone before they do. So they pick a small sample to land on. The screenshots and the videos are fed to TV screens in the developers' room. The developers check the images from time to time for irregularities — e.g. a creature with 7 heads and 18 legs, or oversized vegetation. In case of a problem they adjust the algorithms appropriately.

The Cost of a Defect

Why do some companies not have a formal QA role while others do? Some of the reasons have to do with the cost of defects. If the cost is low, there is not much sense in investing in heavy (and expensive) testing upfront.

Free vs Paid Product. If a product is free and it has bugs, who are you going to complain to? You don't pay a dime, you don't have a support contract. If a product is free, it means that you are the product. You don't pay with money, but you pay with your time, eyeballs, actions, information, social interactions etc. On the other hand, if a product is paid, it usually comes with an SLA and financial penalties for not meeting the terms. The cost of a defect is high in the latter case.

Startup. As mentioned above, the most important task of a startup is to find the right product market fit. Everything else comes after that. The cost of a defect is low, as there is a high chance that the startup will run out of funding before delivering something useful. The worst part (for current QA engineers) is that as startups mature and become public companies, they learn how to operate without dedicated QA. I expect this process to continue.

Monopoly. If your company is a monopoly, then the price of a defect may be quite low. When you are the only game in town (Facebook, government, the internal IT department of a company), the customers have no other choice but to tolerate you. This is pretty clear with Facebook: your drunk photo could not be uploaded for some reason, and in your rage what are you going to do? Use MySpace instead? On the other hand, if you are in a highly competitive market, the cost of defects may make or break your company's public image.

Significant Impact. Think about the software that you're writing. Can a defect cause significant money loss? I worked for a company that processes electronic money transactions. Each defect was literally costing us money. The most expensive one that I've seen cost us 100,000 EUR, but we made one of our customers very happy. Can a defect set you back a significant amount of time? Consider the Mars Schiaparelli lander crash. A software defect caused the crash, setting the European agency back the time it took to build the lander as well as the time it took to reach Mars. Can a defect make you non-compliant? To operate in certain industries, compliance guidelines need to be followed: HIPAA and PCI are two examples. A defect can cause non-compliance, and in some cases you may not be able to operate in such industries, or you may pay heavy penalties to the regulators. Can a defect cause loss of life? In the case of one X-ray machine it did. On the other hand, let's go back to the Facebook picture example: if the upload did not succeed on the first try, there is no significant impact.

Deploy Frequency. I used to work at a company that produced software distributed on compact disks. Every defect cost us a lot because, even if we issued a patch for it, it was up to the customers to decide when it would be applied. We also had to support 'rolling' upgrades (this also included data migrations) - meaning that you could upgrade from version 3.0 directly to 7.0 without going through the in-between versions. All this required heavy pre-release testing, and there was no other way. Now, with SaaS and continuous delivery, there is only one place where your code resides, and it's easy to upgrade and apply a fix. After a fix, the only thing your customers need to do is reload their browser. Being able to quickly deploy a fix to all customers means that in most cases the defect cost is low[4].

What Happened in the Last 15 Years

I’ve worked as QA for the last 15 years and these are the most significant developments that happened during that period.

Salaries got higher. When I started working as a QA, my salary was 160 USD (after taxes). This is not much now, but in 2002 it was a pretty good chunk of money for me, given that the standard of living in Bulgaria was low. Today, the salaries are 10-20 times higher.

Fewer QAs. 15 years ago it was so cheap to hire a QA that some companies had more QAs than developers. There was no need to invest in test automation or any time/effort saving activity, as you had so many QA drones willing to work for peanuts, doing the same repetitive tasks over and over again. Today, as a result of the higher salaries, there are significantly fewer QAs compared to developers[9]. And it makes sense — you can have a product without QA, but not without a developer. The focus is on hiring developers; a QA may never get hired.

A lot more is expected from a QA. 15 years ago, hiring QAs consisted mostly of checking if they had any computer skills at all, e.g. at least MS Office literacy. Today, to get hired even as a Junior QA, you need to have at least one of the following skills: relational DB knowledge, programming experience, networking knowledge or experience with hardware.

No dedicated QA teams. When I started as a QA, most companies were working with the waterfall development methodology. There were big and independent teams - developers, QAs, product owners. Today pretty much no company works like that (except some outsourcers). The teams are now combined. In the majority of cases, the people who are promoted to higher positions have either a development or a product owner background. Regular QA engineers do not see a career path on those teams and as a result tend to choose other careers. The constantly shrinking size of QA teams also contributes to the fact that QAs do not see themselves moving towards a management position. There are just not enough employees in a QA team for a rigid hierarchy (junior, regular, senior, lead, manager). In most cases one QA lead is enough.

Moving to Other Positions

When I started working as QA, our team consisted of 12 engineers. Now only 40% of the original team still work as QA (in lead or management positions). The other 60% have moved on to other endeavors. The two most common positions they moved to were development[5] and product owner[6]. It is fair to say that most of the people working as QA now will not retire working as QA. What's more — most QA engineers consider this role a stepping stone, a foot in the door, to some other position in the IT industry with more potential for career growth.

Shrinking Cycles

There is another important development that happened in the last 15 years. When working with waterfall, the planning, development and testing cycles lasted 2-4 months each. I used to plan what each member of the QA team would do, day by day, 6 months in advance, in an Excel Gantt chart. Needless to say, this plan was never accurate, but at the time we didn't know any better. Today, almost every company works with some sort of iterative development process with short release cycles — 1-2 weeks for SaaS, 3-4 weeks in the case of applications that need on-premise installation, or a mobile app. In order to meet those deadlines, companies rely more and more on fast-feedback quality related activities.

This time is taken from manual testing. In order to secure more time for development, manual testing is being squeezed from the left by early defect detection activities (performed by the developers): static code analysis, code review, pair programming, automated tests. Since the pressure to release faster to the customers is huge, once a feature is ready, manual testing is also squeezed from the right[7]. Usually by activities performed by operations: analyzing customer reported defects, monitoring for errors and exceptions, mitigation techniques or even full rollback in case of a catastrophic failure. All those 'shift' activities mean less time for manual testing. Since the required manual/exploratory testing is not much[8], in some cases those test activities are performed by the product owner or by the developers themselves. One can argue that shrinking the manual testing process leads to higher quality overall and faster cycle time. All of this paints a pretty bleak picture for the average QA engineer.

The PDCA Cycle

In classic project management theory there is the notion of the 'holy trinity'. It consists of high product quality, low manufacturing price and short development time. The theory states that you can have only two of the three at the same time. However, if you want to continue to work as a QA, you need to help your organization achieve all three at the same time. What's more - you need to be flexible and respond to changes. 15 years ago, your best bet to achieve the 'holy trinity' was to fill a room with a bunch of QA engineers and pay them 160 USD a month. Today, your best bet is the 'shift' left/right activities. But tomorrow we may require new thinking to achieve the 'holy trinity' - possibly using artificial intelligence.

If you’ve studied the classic management theory you may think that I’m full of bullshit. But I want to introduce you to Williams Edward Deming. Credited (also with Joseph Moses Juran) at least to a degree for what we now call The Japanese Economic Miracle after the second world war. A bankrupted country in 1945, with inflation of 100% for three consecutive years and destroyed infrastructure. Yet in 1967 it rose as the second largest economy in the world. This was accomplished in part by Deming insisting that by focusing on quality first, the other two parts of the holy trinity will fall in place.

Deming postulated 14 points for improving any system, and some of them we can directly relate to software engineering:

“Cease dependence on inspection to achieve quality. Eliminate the need for massive inspection by building quality into the product in the first place.”

We’ve already figured out that testing after the fact does not yield great results. We should put more efforts in detecting and preventing defects in planning and development phases.

“Improve constantly and forever the system of production and service, to improve quality and productivity, and thus constantly decrease costs.”

By focusing on quality, cost reduction and speed (productivity) will naturally follow.

“Break down barriers between departments.”

More than 50 years ago Deming was preaching what we are discovering just now with so called ‘agile’ development methodologies.

“The responsibility of supervisors must be changed from sheer numbers to quality.”

It’s always better to produce less with higher quality. Don’t rate people based on fallible metrics — number of bugs found/fixed, code coverage percentage achieved. Forget about premature optimization. Will the feature solve a customer problem? Is she willing to pay for this solution? What is the most optimal way to produce it?

There is a chart that sits in almost every Japanese factory. A chart used for product development as well as for problem solving. It was not originally created by Deming, but it was improved and popularized by him. It’s called the PDCA cycle - plan, do, check (study), act (adjust). It can also fit nicely with any software development methodology, as we also have the same stages.

I’ve added three more components to the chart above that also affects quality - People, Product and Process. Listed at every stage are some of the quality related activities that can be performed.

Now look at the chart above. If you're a regular QA, your responsibilities are to participate in the planning meeting (if you're lucky), where you can give your opinion, and to perform manual/automated testing during the development phase. There are lots of activities that can influence quality, but for various reasons most of us never participate in them.

We can draw three conclusions from this graphic:

QAs can never be the only ones responsible for the quality of the product.

If you want to improve product quality you need to perform various activities at different stages.

Quality does not equal testing (manual or automated). There are lots of other activities, some even more important (and cheaper) than testing after a feature is completed.

Seven Steps

If you want to continue to work and grow as a QA you need to study and apply different quality related activities at different stages in the software development lifecycle. Get out of your comfort zone and start learning.

Here are seven steps to start your journey:

1. Increase the feedback loops. Figure out from what activities (other than testing) you can get information about the quality of the product. Where are the weakest links? For example: study customer reported defects, monitor for errors and exceptions in the production environment, perform customer quality surveys, check how flaky your automated tests are and what causes the flakiness.

2. Track and visualize trends. Don't keep the data you collect to yourself, plot it and show it to everyone in your organization. You’ll be able to see if a solution works by monitoring if the stats show improvement or not after you’ve implemented it.

“In God we trust, all others [must] bring data.” W. Edwards Deming.

3. Fix defects immediately. I'm a big advocate of a zero bugs policy and no defects backlog. If something is important, fix it right away. It impacts your customers, the ones who are paying to use your product. When you find out about a defect, the knowledge is still fresh in your mind and you can fix it quickly. If you postpone the fix, you lose this advantage, people leave the company and the fix always takes longer. Make sure to implement mechanisms to prevent this bug from happening again (automated tests, alerting, database checks, purging of old/unused data).

4. Eliminate classes of defects. Sometimes when you spot a defect, it's part of a larger class of defects. Analyze each defect and figure out if you can eliminate the parent class of similar defects. For example, we had a PHP backend that contained SQL injections. We and our customers would find them on a regular basis and fix them one by one. We got tired of this whack-a-mole game and wrote a custom tool that would scan the PHP code and detect all SQL injections (a toy version of such a scanner is sketched after this list). We got rid of that problem reliably, once and for all.

5. Create mitigation strategies. Defects will happen; it's not possible to eliminate all of them forever. The smart thing to do is to expect the defects and build mitigation strategies. As Todd Conklin says — safety is the presence of capacity (to fail). When a defect occurs, the consequences should not be catastrophic. The examples are highly domain specific, but here are some of them: use try/catch/finally blocks to gracefully handle an exception and recover, use database transactions, run periodic checks for the integrity of your data, restart a service in case of a failure, retry a failed network connection (see the retry sketch after this list). Some of these techniques are described in great detail in the book Release It! by Michael Nygard.

6. Eliminate waste. Waste is everything that the customer is not paying for — it may be code (an unused feature), documentation that nobody reads, or an unnecessary activity performed constantly (e.g. testing before release that the software runs on Windows XP, or on Internet Explorer 9). Constantly ask why we need to do an activity and how it helps the customer. "We've always done it this way" is your enemy.

7. Share knowledge. Whatever you learn, it's your obligation to quickly share it with your company. The knowledge will spread rapidly and will be of use to everyone. Even if it is not directly applicable, it may spark other ideas for improvements. If the knowledge is not company confidential, share it with a wider audience. Write a blog post or present it at a conference. This is the way to grow and improve the QA community.
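Here is a deliberately naive sketch of the "eliminate a class of defects" idea from step 4: a scanner that flags PHP lines which appear to build SQL by concatenating variables. The pattern is a crude assumption; a real tool would be considerably more thorough.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Pattern;

public class SqlInjectionScan {
    // Very rough heuristic: an SQL keyword followed by string
    // concatenation with a PHP variable, e.g. "... WHERE id = " . $id
    private static final Pattern SUSPECT = Pattern.compile(
            "(?i)(SELECT|INSERT|UPDATE|DELETE)[^;]*\"\\s*\\.\\s*\\$");

    public static void main(String[] args) throws IOException {
        try (var files = Files.walk(Paths.get(args.length > 0 ? args[0] : "src"))) {
            files.filter(p -> p.toString().endsWith(".php"))
                 .forEach(SqlInjectionScan::scan);
        }
    }

    private static void scan(Path file) {
        try {
            List<String> lines = Files.readAllLines(file);
            for (int i = 0; i < lines.size(); i++) {
                if (SUSPECT.matcher(lines.get(i)).find()) {
                    System.out.printf("%s:%d: %s%n", file, i + 1, lines.get(i).trim());
                }
            }
        } catch (IOException e) {
            // unreadable file: skip it
        }
    }
}
```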
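And a minimal sketch of one mitigation from step 5, retrying a failed operation a bounded number of times; the attempt count and the delay are illustrative only:

```java
import java.util.concurrent.Callable;

public final class Retry {
    private Retry() {}

    public static <T> T withRetries(int attempts, long delayMs, Callable<T> op)
            throws Exception {
        Exception last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e; // remember the failure and try again after a pause
                Thread.sleep(delayMs);
            }
        }
        throw last != null ? last : new IllegalArgumentException("attempts < 1");
    }
}
```

Usage would look like `Retry.withRetries(3, 500, () -> fetchProfile(id))`, where `fetchProfile` is any flaky operation.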

Further Reading

At the end of each talk I’d like to share books for further reading. For this presentation there are two books.

The first one is "Average is Over" by Tyler Cowen. The author predicts that in the near future, 15-20% of people will reap the benefits of AI and task automation. Those are the people who put in the effort to learn and adapt. The rest will most likely fall by the wayside, working jobs that require no specific knowledge or skill. Those jobs come with great job insecurity (think the sharing economy), because anyone can do them and it's easy to replace the worker. And precisely because it is easy to replace the worker, those jobs are low paid. This book is not a pleasant one to read and I don't agree with some of the conclusions, but it's hard to argue with it. You can already see the effects of the economic divide between the rich and the poor in North America and Western Europe. The author gives some ideas on how to get into the top 15-20%, but a better read on that topic is "So Good They Can't Ignore You" by Cal Newport.

The second book is a bit more technical for those of you who think that this talk has become too philosophical — “The DevOps Handbook” by Gene Kim, Patrick Debois, John Willis and Jez Humble. It does not have the word ‘testing’ in the title or the subtitle, but it contains all the current practices that a modern software development process should use. Again quality ≠ testing.

Footnotes:

Those companies still need humans to do parts of the application testing for tasks that cannot be easily automated but do not require in-depth technical knowledge. Those tasks can be performed by low paid, low skilled workers - e.g. checking if the UI is rendered correctly, or testing app installation on various Android devices. It does not make sense to pay high Silicon Valley salaries for such work, so it is outsourced to companies like uTest, Rainforest QA or 99Tests.↩︎

You might argue that the lack of QA is obvious with some companies like Facebook. But hey, it’s a free product, who are you going to complain to if there is a bug? Would they listen to one of their 1,000,000,000 users?↩︎

This QA to Developers ratio was the highest in the company and was for the operating systems division.↩︎

At least you don’t have to recall the whole batch of CDs when a defect is found in the last moment.↩︎

This is especially true for the QA Engineers that write automated tests.↩︎

QA Engineers have a wide view of the whole system, whereas developers are usually focused on a subset of the system.↩︎

According to some estimates, the number of developers worldwide doubles every 5 years. I can't say the same thing about the QA engineers.↩︎

How We Decreased Customer Reported Defects by 80%

In April 2017 I was at CraftConf 2017 presenting The Ultimate Feedback Loop. InfoQ noticed this presentation and wanted to do a short interview about our results. When I sent the replies, they decided that it’s a valuable enough information and asked me to write a full article which can be found here.

I'm putting the article on this site to collect all my writings in one place.

Key Takeaways

Analyzing the most expensive types of bugs will save companies time, money and resources

The collected data will question widely believed dogmas in software development

In service oriented architecture (or microservices), integration tests will uncover more defects than unit tests

The majority of defects are concentrated in a small number of easily identifiable functions

Simple actions can greatly reduce the defects that reach the end customers

Introduction

Software defects found by customers are the most expensive ones. Lots of people are involved in debugging (hard to do in production), fixing and testing them. All those people need to get paid, and time and resources need to be allocated away from new feature development. Customer reported defects are also an embarrassment for an organization — after all, they have bypassed all the internal defenses. It's no wonder that software maintenance costs are typically between 40% and 80% of the project cost (according to some studies they may reach up to 90%: How to save on software maintenance costs), and a big chunk of those expenses is directly related to fixing defects. It's easy to calculate the exact cost of a bug fix, but one thing that is hard to measure is the reputation loss. Customers will not recommend an app, or will downright trash it, because of bad quality.

Our Situation

Most companies are not investigating the root cause of any defect (even the most expensive ones). And at Komfo we were no different. We accepted defects as the cost of doing business — never questioning or trying to improve. Since we couldn't find any industry data related to customer reported defects to benchmark against, initially we just wanted to see where we stand.

Here is an example: our customers can report a defect through a number of channels: email, phone call, social media. Of all the reports we get, only 12% end up as actual bug fixes in the code base. The other 88% are also interesting, but for other reasons: maybe our product is not intuitive to use, or maybe our customers need more training. 12% bug fixes — is this good or bad? Until other companies start publishing such data, there is no way to know.

A while ago, I read a book called "The Toyota Way to Lean Leadership". In it, there is a story about how Toyota North America lowered warranty costs by 60% by investigating the causes and the fixes of vehicle breakages within the warranty period. Inspired to do something similar, we started gathering data to investigate how we can improve.

Data Collection

All of our defects are logged in Jira. The defects are also tagged depending on the phase in which they were found — in-house or reported by a customer. We gathered all the defects in the second group, ignoring those that were marked as "will not fix" or were considered improvements. We were interested purely in the defects. We then searched the git log for their Jira IDs (we already had a policy to put the Jira ID in the commit message).

In the end, we found 189 defects and their fixes in the code base, spanning a period of two and a half years. For each defect we gathered more than 40 statistics: when it was reported and fixed, in which part of the application, by what kind of test or technique we could have detected it earlier, what the size and the complexity of the method/function where the defect was located were, and so on (you can check the sanitized version of the stats we collected and use them as a guideline here).

The data collection process was slow, as we were gathering everything by hand. We already had our daily work to do, and investigating 189 defects, gathering 40+ stats for each of them, took us more than 6 months. Now that we know exactly what we're looking for, we're automating the tedious data collection.

Initial Analysis

One of the first things we noticed was that 10% of all the defects we were interested in were actually not caused by our developers. Our product is a SaaS that collects lots of data from the biggest social networks (we make more than 10 million requests a day to the Facebook API alone). Sometimes the social networks change their APIs with no prior notification, and then our customers notice a defect. All we can do is react and patch our product. We excluded those defects from further analysis, as there is no way to notice them early.

The frontend to backend defect ratio was almost 50/50 - we have to pay close attention to both. The backend distribution was interesting though. Almost 2/3 of those defects were in the PHP code, 1/3 in the Java code. We had a PHP backend from the beginning, and a year and a half ago we started rewriting parts of the backend in Java. So PHP was around for a long time, accumulating most of the defects in the two and a half year period we investigated.

There are lots of discussions about which programming language causes fewer defects. We decided to find out empirically for our application - PHP or Java. Only 6% of the defects could have been avoided if the code they were found in had been Java in the first place instead of PHP. In the PHP codebase we have lots of places where we don't know the type of a variable. There are extra variable checks in case it is a string, a date, a number or an object. In the Java codebase we know the variable type and no extra checks are needed (each such check is a potential source of defects).

However, the 6% "Java advantage" is reduced by the fact that when rewriting parts of the backend from PHP to Java, we simply forgot to include some functionality, and this resulted in defects. Also (and we have only anecdotal evidence for this), the developers feel that they are 'slower' developing in Java compared to PHP.

We started investigating customer reported defects two and a half years ago. Back then, our backend (no pun intended) was written 100% in PHP. One year after that we started rewriting parts of it in Java. The new backend went live after 6 more months. We did not immediately see a decrease in incoming defects (see the last screenshot). Switching from PHP to Java did not automatically mean fewer defects. We started implementing the various other improvements described below, and we had to wait 6 more months until the defects started to decrease. The rewrite was done by the same developers (we have very little turnover).

What all this means, according to our data, is that in the long run, the quality of a product depends primarily on the developers involved and the process used. Quality depends to a lesser degree on the programming language or frameworks used.

The Surprises

There were three main things that we did not expect and that quite surprised us.

The first one was that 38% of all the customer reported defects were actually regressions. This means that there was a feature that was working fine, then we made a change in the codebase (for a fix or a new functionality), and then the customers reported that the feature they were using had stopped working. We did not detect this in-house. We knew that we had regressions, but not that their number was that high. This means that we didn't have any sort of automated tests that would act as a detection mechanism to tell us: "that cool new code you just added is working, but it broke an old feature, make sure you go back and fix it before release". By writing automated tests you kind of cement the feature logic in two places. This is a double-edged sword though. It is effective in catching regressions, but it may hinder your ability to move fast. Too many failing tests after a commit slow you down, because you have to fix them before continuing. It's a fine balancing act, but for us the pendulum had swung too far towards fast development, so we had to reverse the direction.

The second surprise was the fact that the test automation pyramid guidelines were not helping us catch more defects early. Only 13% of the customer reported defects could have been detected early if unit tests had been written. Compare this to the 36% 'yield' of the API level tests and the 21% 'yield' of the UI level tests. A diamond shape (where the majority of the tests are API level) is better for us than a pyramid. This is due to the nature of our software. It's SaaS, and the bulk of what we do is gather lots of data from the internet, then put it in different databases for later analysis. Most of the defects lie somewhere in the seams of the software. We have more than 19 different services, all talking over the network constantly. The code for those services is in different repositories. It is impossible to test this efficiently with unit tests only (we consider unit tests to run only in memory, not touching the network, the database or the filesystem, and using test doubles if they have to). And we think that with the rise of microservices and lambda functions, high level integration tests executed on fully deployed apps will be way more effective in detecting defects than simple unit tests. The majority of the defects lie somewhere in the boundaries between the services. The only way to detect them is to exercise a fully deployed application. Those defects cannot be detected by testing a piece of code in isolation (with unit tests).

Unit tests are still useful, but they don't need to cover 100% of the code base. We found that it was sufficient to cover methods with cyclomatic complexity 3 and above (72% of the defects were in such methods) and methods with a size of 10 or more lines of code (82% of the defects were in such methods).

The third surprise was that no matter what kind of testing we do in-house, we cannot detect 100% of the defects. There will always be the 30% of defects that only our customers can find. Why is that? Edge cases, configuration issues on production, unintended usage, incomplete or just plain wrong specifications. Given enough time, money and resources, we could detect those 30%, but for us it does not make economic sense. We operate in a highly competitive market and we need to move fast. Instead of waiting forever to find those mythical 30%, we expect them to happen, and we try to optimize for early detection and fast recovery. When such a defect is reported and fully investigated, we learn from it and improve our system further.

Actions

We made four process changes for developers and testers:

Write at least one automated test per feature. If possible, focus on API tests because they help us detect most of the defects.

Even for the smallest fixes, do a manual sanity check and visual verification.

Mandatory code reviews performed by the team leads. No one can push code to production without an OK review.

When testing, use boundary values where possible: too much data, too little data, the end and the beginning of the year and so on, because 8% of the defects were caused by such values.

There were also four technology changes:

Every morning we review the production logs from the past 24 hours, looking for errors and exceptions. If we notice something unusual, we fix it with high priority. This way we end up noticing problems very early, and sometimes, by the time a customer calls us to report a problem, we are already working on the fix.

We had long running API integration tests that used to run for 3 hours. After some improvements (the major ones being: a dedicated test environment, test data generation, simulating external services, running in parallel), the same tests now run in 3 minutes. We were invited to present how we did this at the Google Test Automation Conference in 2016: Need for Speed - Accelerate Automation Tests From 3 Hours to 3 Minutes. This helps us tremendously in detecting defects, because we run all the automated checks (static code analysis, unit tests, API tests) after each commit. We don't have nightly test executions or smoke tests anymore. If all the checks pass, we can confidently and immediately release to production.

Defects in production sometimes cause exceptions that can be found in the log files of the application. Upon investigation we found that the same exceptions were present in the log files of our testing environments even before the release to production. It turns out that we had the opportunity to detect some defects before they reached production, if only we had monitored the logs in the testing environments. Since then, we made the following change to the automated test execution: even if all the automated API tests pass successfully, we still check the log files for errors and exceptions after each test suite execution. A test may pass but still cause an internal exception, which will not manifest itself (as a failing test) because of bad coding practices (e.g. silently logging an exception and not propagating it further). In such cases, we fail the build and investigate the exception.

Around 10% of the defects were caused by unexpected data that we did not handle properly: special or unicode characters, binary data, malformed images. We started collecting this data, and whenever our automated tests need to create test data, they draw from this pool of 'weird' data.
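As a tiny illustration of the idea, such a pool can be as simple as a list that the tests draw from at random. The values below are generic examples, not our actual pool:

```java
import java.util.List;
import java.util.Random;

public class WeirdData {
    private static final List<String> POOL = List.of(
            "",                               // empty string
            " \t\n",                          // whitespace only
            "naïve тест 試験",                 // mixed-script unicode
            "🐛🔥",                           // emoji (4-byte UTF-8)
            "Robert'); DROP TABLE users;--",  // classic injection string
            "\0",                             // NUL byte
            "0"                               // the infamous Twitter ID
    );
    private static final Random RANDOM = new Random();

    // Tests call this when generating names, comments, captions, etc.
    public static String pick() {
        return POOL.get(RANDOM.nextInt(POOL.size()));
    }
}
```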

The Outcome

In the screenshot below, you can see the result of all the actions that we took. One bar corresponds to one quarter. Notice that in the last four quarters, the customer reported defects are constantly decreasing. In the last quarter we had fewer defects than in any other quarter since we started collecting the statistics two and a half years ago. The difference between the last quarter and the one with the most defects is more than a factor of 4.

One last thing to note. Usually, the more lines of code (LoC) a product has, the more defects it contains; the defect count to LoC ratio remains roughly constant. In our case, even though we kept adding more code, the defect count continued to go down in the last four quarters.

How to Start

The process of reviewing customer reported defects, finding the root cause, correlating it with the fix and gathering additional information is very tedious for most people. My advice would be to set aside dedicated time (e.g. an hour a day) for defect investigation. Once you get over the initial hump (assuming you want to investigate defects found since a certain date in the past), you'll get better at it and automate some of the data collection. Even with that, however, I still don't have the mental capacity to investigate defects eight hours a day.

Make sure that you have a way to separate the customer reported defects from the ones found in-house. When a fix is made, put the ID of the defect in the commit message so that you can correlate the two later.

Investigate defects as soon as you can — the longer you wait, the more time it takes, as human memory fades quickly.

Have someone outside of the team investigate the defects. It should be done by a person who does not have an emotional connection with the code. Otherwise it's very likely that his/her judgment will not be objective.

You can quickly get overwhelmed tracking too many metrics; figure out which ones will be useful and actionable for your organization.

Collecting all this information may seem like (and it is) a lot of work, but I promise you, it is worth every second you put into it. The learning opportunities are tremendous.

Conclusion

Investigating the root causes of customer reported defects will have a great impact on your organization. The data collection is not easy when you start, but the learning opportunities are tremendous. It is amazing how many companies are not doing it. In times when most organizations are competing for the same people on the job market, or have access to the same hardware resources (AWS), how do you differentiate? The best way to ensure customer satisfaction, lower costs and increase employee engagement is to look inside — you already have the data. In the end, it's all about continuous improvement.

Here are four book recommendations that will help you with your journey:

Project Nagual

This is a short post to announce a project that I've been working on for a while now. It's an HTTP simulator called Nagual that can be used to fake responses from 3rd party services outside of your control. We use it to simulate services that we integrate with, mostly the biggest social networks - Facebook, Twitter, LinkedIn, Youtube, Instagram, Google+. With such a simulator, our automated tests are faster, more reliable and encapsulated. And the execution time is down from 3 hours to less than 3 minutes.

Which services should you simulate? As a general rule, if a service is under your control, you should not simulate it; instead, run an instance of it (in a container). As an example, currently our application fits in 17 containers that can all run on a laptop. Sometimes this is not possible - in the case of legacy mainframes, even though you control the service, you should simulate it, as you can have only one instance of it.

When creating Nagual we had some specific features in mind that we could not find in any other public tool (paid or free):

Transparent: No code changes should be required to run this tool. Developers should not even be aware that it exists. The tool should be able to distinguish between legitimate traffic that should be forwarded to the real 3rd party service and traffic that should be simulated.

Fake SSL: The tool should be able to create fake SSL certificates on the fly in order to impersonate legitimate services such as graph.facebook.com and api.twitter.com. Most of the time, these addresses are hard coded in the 3rd party libraries and there is no config option to change them to a server that we desire.

Dynamic Responses: The tool should be able to generate the response on the fly. It should have access to the request (read data if necessary) and programmatically construct a response.

Return Binary Data: Sometimes, we want to simulate downloading a large picture, or video. For this purpose, the tool should be able to return binary data.

Regex URL match: Each of our tests generates all the data that it needs before the start of the test. No data is shared between the tests. This makes the tests independent of each other and allows for parallel runs. In the case of a Facebook test, this means that the test should create a random, used-only-for-this-test-case Facebook page. Most of the API endpoints of the social networks are similar, with the exception of an ID in the URL. Since the responses are similar, the tool should be able to regex match URLs, to reuse the same function for generating responses.

Local Storage: Most of the social networks' APIs are complex, and to achieve a single action you need to issue multiple requests. The state and data between the requests should be kept locally by the tool, for usage in any subsequent requests.
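To make two of those features concrete (dynamic responses and regex URL matching), here is a toy stub built only on the JDK's built-in HTTP server. It is nothing like Nagual's actual implementation, which also handles fake SSL, binary data and local storage, and the Facebook-style path is an invented example:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

public class MiniStub {
    // Matches any page ID, e.g. /v2.8/12345/feed
    private static final Pattern PAGE_FEED =
            Pattern.compile("^/v2\\.8/\\d+/feed$");

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/", exchange -> {
            String path = exchange.getRequestURI().getPath();
            // Dynamic response: constructed per request, not replayed
            String body = PAGE_FEED.matcher(path).matches()
                    ? "{\"data\": []}"            // simulated, empty feed
                    : "{\"error\": \"unknown\"}"; // everything else
            byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, bytes.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(bytes);
            }
        });
        server.start();
    }
}
```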

Nagual is not a record and replay tool (due to our application's specifics we could not use such a tool), as we don't believe such tools are viable in the long run. They create more coupling than you need, and this comes back to bite you in the future.

This image is a comparison from March 2016, when we started developing Nagual.

Nagual has one more usage — as a monitoring tool. Since it essentially acts as a man-in-the-middle proxy, it decrypts all the SSL traffic. Normally, this traffic is seen only by the client and the server. This feature allows you to monitor and log unusual activity on your 3rd party services. We've built StatsD support into Nagual. Here is how the 4xx error rate looks when we compare all the social networks:

Nagual is not perfect for every HTTP simulation use case; we wrote it to scratch our own itch. We tried to make it as robust as possible, but if it does not suit you, there may be better alternatives. We also tried to make Nagual as small as possible: the core is only 700 lines long. Feel free to use it as an example to roll your own tool.

I’ve presented Nagual to Agile Testing Days 2016, here is the link to the presentation. Unfortunately there is no video recording.

If you need more information, check the source code and the documentation on GitHub, and let me know if you have any questions.

All Bugs Are Not Created Equal (or why you should optimize for low MTTR)

All bugs are not created equal. Ideally, we want to catch them all before releasing software to our customers, but this is a pipe dream.

In Toyota Kata, Mike Rother explains the improvement kata, which in essence is an ideal goal (true North) that we make small, incremental steps to achieve. We can get asymptotically close to it but will never reach it.

So why is there no bug free software? Because we're (imperfect) humans and we make mistakes. Because our software is complex. Because it operates in less than ideal conditions. Because, unlike with machines and buildings, a single unexpected character can cause catastrophic consequences. Because the price of building highly reliable software is too high for most of our purposes. And because what we, as developers, think of as a feature may be considered a bug by our customers.

Every piece of software has defects, no matter how perfect it looks. They are just not found yet. And for the life of this software, they may never be found. In that case, can we consider them defects? If a tree falls in a forest and no one is around to hear the sound, how do we know that it really made a sound?

That why it’s really important to investigate and learn from the bugs that are reported by our customers. We may never find the other ones anyway.

Let me show you two types of bugs that escaped undetected and were reported by our customers. Both are very recent, in backend Java code. First, the preventable one:
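The original code screenshot is not reproduced here, so below is a hypothetical reconstruction of the typo, assuming the legacy MongoDB Java driver; the class and field names are invented:

```java
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;

class CounterDao {
    private String fields; // object property, never assigned, so always null

    void incrementField(String field, DBCollection collection, DBObject query) {
        // BUG: 'fields' (the null property) is passed instead of the
        // 'field' method parameter, causing a NullPointerException at runtime.
        BasicDBObject update =
                new BasicDBObject("$inc", new BasicDBObject(fields, 1));
        collection.update(query, update);
    }
}
```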

The problem was that the first argument used when initializing BasicDBObject is 'fields', which is a property of the object. When this method executes, this property is null, so a null pointer exception occurs. Instead, the first argument when initializing BasicDBObject should be 'field' (singular), which is the first argument of the incrementField method. Although this is clearly a typo, the code is perfectly valid and neither the IDE nor the compiler complains.

And here is the problem: if this code had been executed at least once — by a unit, API, or manual test — it would have immediately thrown an exception, and the developer would have noticed. I've already written about the importance of writing any type of automated test for the code we've just developed (or are about to develop, in the case of TDD). This is standard developer work.

The first bug was completely preventable in-house, and it’s very bad that our customers found it.

This is the second bug, the undetectable one:

I’ll spare you the domain specific details. The gist of it is that we should update a specific type of object with additional information. The problem is that no matter how many times we execute this code, we will not catch this bug.

This bug is due either to missing functionality, an incomplete specification, or just customers using the software in an unintended way. Whatever the cause, the customers do not care, and the bug required a fix outside the scheduled releases.

So here is the philosophical difference between those two bugs: the first was easily preventable with simple preventive techniques. The second one, realistically, could only be caught by real customers. These are the two basic types of bugs in all software.

So you may wonder, what is the ratio between those two types of bugs? If the second type is a negligible percentage of all the customer reported bugs, why care? First let me say that there is no exact science that can precisely measure the second type of bug. It can pretty easily become a political mess, with lots of blame thrown around. Blame the product owners for missing data in the specification, the developers for failing to know your complex system inside out, or the QA for not catching the bug in the 'testing phase', for not acting like a 'real customer'. However, if you care to learn and not to blame, by applying common sense you can easily identify the second type of bugs.

At Komfo, we investigate every bug that is reported by our customers in order to constantly improve our system. We found out that 21% of the bugs are 'undetectable' (this number is suspiciously close to the 80/20 Pareto principle).

Those 21% may actually be preventable, but it does not make economic sense to release 100% bug free software. The tricky part is to find the balance between how much it costs you to detect bugs in the pre-release phase vs what the cost of the bugs detected by your customers is.

The preventable ones (79%) can easily be caught if we do the right things in the pre-release phase. People are now calling it Shift Left (at BGPHP15 I gave a PHP tools specific presentation about this topic).

But there is nothing we can do to detect the second type of bugs in the pre-production phase. They will always be there and we should always expect them. As Todd Conklin says, there will always be incidents; we can't prevent them, but we can learn from them, detect them early and limit their consequences (by the way, you should subscribe to his podcast, it's really good). Think of things like proper exception handling, periodic health checks and repair, even periodic restarts (e.g. the US Navy had a procedure to restart a Windows NT system in a weapon control system every 24 hours to combat memory leaks).

A common misconception (if you believe that all defects are preventable) is to measure only MTBF (Mean Time Between Failures). Which is exactly this — how long the system runs before a bug occurs. We will never eliminate 100% of all the bugs, so there is no point trying to increase MTBF. Instead we should measure MTTR (Mean Time To Recover): how fast, once a bug is reported, we can identify the root cause and fix it.

One way to achieve lower MTTR is to 'Shift Right' and do proactive monitoring of your live environment for exceptions. Check them every day and fix all unexpected errors with the highest priority. By the time your customers report an error, there is a good chance that you will have noticed and fixed it already. Your customers will be delighted that you're on top of things, and all they need to do is restart their browser or download the new version of your app.

Another way to achieve lower MTTR is to structure your code properly, so that the developers can easily debug locally, with production data, and once the fix is ready you have fast automated tests to make sure you have not broken anything. Regarding MTTR at Komfo: 60% of the bugs reported by customers are fixed within the same working day.

If you liked this blog post and want to learn more about how to accelerate your software development by analyzing the reported defects, join me at CraftConf 2017. I’ll be giving a talk about the patterns that have emerged (and how we’re applying them in practice) when analyzing almost 3 years worth of defects.

Running Tests in Parallel: The Optimal Number of Threads

In June this year I presented at the expo:QA conference. It was a case study on how we reduced the execution time of our high level automation tests by more than 60 times. Last week I received an email from one of the conference attendees, asking for additional details on two specific topics from that talk. I think the reply will benefit a larger audience as well, so I decided to post it publicly. It's a bit long, so it is split into two logical parts. This is part one. The question was: "When running tests in parallel, how do you decide what is the optimal number of threads?".

Before The Start

When most companies decide to speed up their automation tests, the first thing they do is run them in parallel. However, this should be the last thing you do, and only when you have exhausted all the other improvement options. Why? By running tests in parallel, the execution time shrinks. All the other improvements seem insignificant compared to the gain from parallelization. You may mistake the small improvements for random variation, so it's not easy to track them. Their 'gain' is split between the threads. Essentially, running tests in parallel first makes it really hard to spot other potential improvements (such as those that lead to more stable tests).

Self-sufficient Tests

In order to run tests in parallel, you need to do some upfront work. The tests should be independent of each other: they should be able to run in random order and pass each time. Each test should be idempotent (run it 100 times and the result should be the same every time). Each test must create all the test data it needs: users, accounts, transactions, etc. Ideally this is done through the official API services of your application before the test starts. If those are not available, you might use unofficial APIs created and used only for testing purposes. Your last resort is to insert data into the DB, either directly or via stored procedures. The tests should not rely on any shared data. The only exception is configuration data that is considered static, e.g. the list of countries in the world, the list of currencies, the timezones.

Tests Not Suited For Parallel Runs

In some cases, a test may require a change to configuration data that is shared with a number of other tests. For example, one test may require that all transactions coming from Germany be blocked, while the rest of the tests require that they be allowed. This situation is normal, and one solution is to separate the tests that modify global configuration into a new test suite. You run the "parallel safe" suite in one go and, when it finishes, start the "parallel unsafe" suite, whose tests run sequentially only. At the end of its execution, each such test should reset the global configuration back to its original state.

Running Experiments

Only when all of the prep work is done are you ready to run your tests in parallel. To figure out the optimal number of threads, you need to run some experiments. Run all the tests in one thread and record the execution time. Then run them in two threads, then in three, and so on: increment the thread count by one and write down the result each time. We got pretty conclusive results when testing with between 1 and 20 threads. Once you have the results, it is easy to spot the minimum execution time and the corresponding thread count. If you graph the results they should look like this:

The Law Of Diminishing Returns

At some point past the minimum execution time, adding more threads makes the execution time grow again. The law of diminishing returns kicks in: you hit a bottleneck in the system, because all of the threads compete for limited resources. Something has to give at such speeds. For us this was the database. Lots of threads needed to read from and write to a single database, which caused locks and delays at higher thread counts.

Memory Bottleneck

One of the first bottlenecks you'll hit is the amount of memory on the machine where the tests are executed. Our tests run entirely in memory (even the databases; a separate blog post will describe this in detail) and do not touch the disk. At some point, however, the threads will eat all of the available memory and the operating system will start using swap space on the hard disk. When this happens, tests slow down significantly; you'll definitely notice it. On Linux, you can monitor the swap size with the 'top' command.

In this case, there is a small amount of disk swap used. Ideally you’d want to see this:

Just because there is swap on disk does not mean it is being used by your tests. To be sure, on Linux, run the 'vmstat 1' command during test execution and monitor the 'si' and 'so' columns (swap in and swap out). If they remain 0, no swap is being used.

If you see swap usage during your tests, you need to add more memory. Nowadays pretty much no one runs tests on bare metal machines; virtual infrastructure makes it easy to provision the machine with more memory when needed. Then just reboot it so the change takes effect.

Hardware Matters

The threads/execution time graph is valid only for a specific hardware configuration. If you move to another machine, or upgrade the CPU or memory, you need to run the experiments again to determine the new optimal thread count. When we started, the optimal number of threads for our tests was 10. Some months later we moved the tests to a more powerful machine and, to our surprise, when we ran the experiments again the optimal number of threads was 12.

Heijunka

One last note on running tests in parallel: the number of tests in each thread should be approximately equal. When we started running the tests in parallel, we were using parallel_tests, thinking it would divide the number of tests equally among the threads. However, it divides the number of feature files equally between the threads (if you're not familiar, check here what those files are). Each feature file can contain an arbitrary number of tests; we had some feature files with 3 tests and others with over 50. To get a similar number of tests per thread, we had to manually split some of the big feature files into smaller ones. Why is equal batch size important? Because parallel_tests waits for the thread with the highest number of tests to finish before declaring the execution over. If you have one big feature file, all the tests from that file execute in a single thread, even while the other threads are idling.

There is actually a Lean term for this, heijunka (production leveling), although we use it in a very straightforward and simple case.

For the initial leveling, we counted the number of test cases by hand. However, because the tests are constantly updated, this approach did not scale. So we wrote a small tool to calculate the number of test cases in Cucumber feature files. It uses methods built into the Gherkin gem to parse the files (if you are using Cucumber, you already have this gem installed). It also has the capability to ignore certain tags.

Imagine the following line of PHP code:

$db->GetRow("SELECT * FROM users WHERE id = $user_id");

This line is vulnerable to SQL injection if an attacker controls the $user_id variable.

This is the safe version (using parametrized query):

$db->GetRow("SELECT * FROM users WHERE id = ?", array(‘$user_id’));

Options

You have three options to deal with this problem:

Option #1: Before each release, you manually check for SQL injections in your applications. You insert a single quote into text fields and look for errors. This is tedious work and you may get bored pretty quickly. You can use tools to assist you, like SQLMap or Netsparker. An automated scanner will save some time but will not uncover all problems: tools that test through the UI cannot reach certain pages that are buried three levels deep, sometimes the authentication fails, and sometimes your application needs to be configured in a certain way to produce meaningful results for the scanners. You add a checkbox to your pre-release checklist to scan for SQL injections before every release. This option is also the slowest of the three. I've worked at companies where an extensive SQL injection scan with an automated tool takes 8-10 hours.

Option #2: To save manual work before every release, you can write a unit test for the method that contains the line above. Although the unit test will not call the DB (it runs only in memory), you can verify that the GetRow method is called with two arguments instead of one. You can also check that the first argument is a string equal to "SELECT * FROM users WHERE id = ?". Job done, no? The problem remains: the unit test covers only one vulnerable location in your code base. Writing unit tests to check specifically for SQL injection is overkill and it couples code and tests tightly. If you change the table name or the SQL query parameters, you need to change the test too.
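
For illustration, here is a minimal sketch of such a test (PHPUnit with a hand-rolled fake DB object; the UserRepository class and its findById() method are hypothetical stand-ins for the code containing the query):

use PHPUnit\Framework\TestCase;

// Hand-rolled fake that records how GetRow was called (no real DB involved).
class FakeDb
{
    public $lastSql;
    public $lastParams;

    public function GetRow($sql, $params = null)
    {
        $this->lastSql = $sql;
        $this->lastParams = $params;
        return array(); // canned result; the test only cares about the call
    }
}

// Hypothetical class containing the query under test.
class UserRepository
{
    private $db;

    public function __construct($db)
    {
        $this->db = $db;
    }

    public function findById($userId)
    {
        return $this->db->GetRow("SELECT * FROM users WHERE id = ?", array($userId));
    }
}

class UserQueryTest extends TestCase
{
    public function testUserQueryIsParametrized()
    {
        $db = new FakeDb();
        $repo = new UserRepository($db);
        $repo->findById(42);

        // Two arguments, and the SQL string holds a placeholder, not the value.
        $this->assertSame('SELECT * FROM users WHERE id = ?', $db->lastSql);
        $this->assertSame(array(42), $db->lastParams);
    }
}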

Side note: if you decide to go with this option, it is better to use high level automated tests (API or UI), rather than unit.

Option #3: You decide that option #1 and option #2 are unreliable and overkill. You bite the bullet and write a tool that scans all of your backend code for possible SQL injection vulnerabilities, to remove them once and for all. At Komfo, we did exactly this and developed a tool that removes a whole class of software and security defects from our code. We never have to worry about this type of error again. Whenever you can, you should pursue this option.

The table below summarizes the three approaches:

Our tool worked so well that we decided to create another one to eliminate a different class of defects: unintended table locks. This tool is called PHP-Unlocker.

Custom Tool Advantages

Although the initial cost of developing such a tool is high, the ongoing maintenance effort is low or nonexistent. You set the tool up once and it does not need constant adjustment. Static code analysis tools are also lightning fast and can run in your CI pipeline after every code change, and they analyze 100% of your code base (a unit test analyzes only the method it's written for). You can either write a tool from scratch, the way we did with PHP-Reaper and PHP-Unlocker, or (depending on your language of choice) write a plugin for a popular static code analysis tool, e.g. FindBugs for Java.

Your custom tool does not have to be a static code analyzer, because the class of bugs you need to eradicate may be different in nature. It can be a shell script run at the end of the test environment setup to check the logs for errors. Or it can be an nmap scan of your application for open ports after all automated tests have completed.

The Three Pillars of Automated Testing

A while ago I wrote a blog post about the three pillars of automated testing: static code analysis, unit tests and high level tests. Do not ignore the first of those pillars; it pays off, and static code analysis tools offer tremendous benefits in some cases. Learn how to use them. As a rule of thumb, push the tests as close to the code as possible (high level tests -> unit tests -> static code analysis -> code). The higher the level, the more expensive the tests become (in time, effort, reliability and maintenance).

Your End Goal

Strive to remove a whole class of defects instead of a single one. Whenever you encounter a problem, don't just eliminate it quickly on the spot. Think about how you can detect and remove the whole class of similar defects when they occur in the future. What is their common denominator? What is the root cause? Then automate the process by writing a tool and running it on every code change. Time to move on to more creative work; don't fight the same battles constantly.

PHP-Unlocker is a static analysis tool that detects potential, unintended DB table locks in PHP applications using ADOdb. It searches your code for improper usage of the StartTrans() and CompleteTrans() methods.

So why write this tool? To scratch our own itch: we had an application with bad coding practices, and we would constantly see timeout errors in production because a DB write operation could not obtain a lock on a table.
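
The problematic pattern looked roughly like this (a minimal sketch, not the actual application code; $db is an ADOdb connection and doSomeStuff() stands in for arbitrary business logic):

$db->StartTrans();
$result = $db->Execute("SELECT balance FROM accounts WHERE id = ?", array($id));
doSomeStuff($result); // if this throws, the next line is never reached
$db->CompleteTrans();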

Imagine what happens when doSomeStuff() throws an exception. If there is no try/catch block somewhere up the call stack (one that also handles the error properly), CompleteTrans() will never be called.

Then, depending on how your transactions are set up and how the database is configured, a lock may remain on the table that blocks write operations until CompleteTrans() is called or a certain timeout expires.

For example, this situation happens when the transaction access mode is set to 'READ ONLY' and innodb_lock_wait_timeout is set to any value (for MySQL, the default is 50 seconds).

The Solution

OK, so what is the solution? The whole DB operation needs to be wrapped in a try/catch block that properly cleans up after itself in case of an exception.
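
Here is how the code above could look (a minimal sketch, under the same assumptions as before):

$db->StartTrans();
try {
    $result = $db->Execute("SELECT balance FROM accounts WHERE id = ?", array($id));
    doSomeStuff($result);
    $db->CompleteTrans();
} catch (Exception $e) {
    $db->FailTrans();     // mark the transaction as failed...
    $db->CompleteTrans(); // ...then complete it, releasing the table lock
    // log, handle or re-throw the exception as appropriate
}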

Sometimes, because of ignorance, laziness or simply schedule pressure, developers skip these 'extra' details. In the corrected version, an exception is properly handled in the catch block: first FailTrans() is called, and then CompleteTrans().

What PHP-Unlocker does is check where in the code there are calls to StartTrans(). It then goes back to the beginning of the method and checks whether this call is wrapped in a try/catch statement. If it is, the tool inspects the catch block, which needs to contain at least two calls: FailTrans() and CompleteTrans(). If any of these conditions is not met, PHP-Unlocker flags the line of code as problematic.

Usage

Miscellaneous

What if a flagged line is a false positive? I've never seen this happen in our codebase so far (160,000 lines of code, counting no empty lines, no comments and no test code). If it happens to you, let me know.

I've written another tool that uses the same PHP parser: PHP-Reaper, which detects SQL injections in ADOdb code. That tool has an option to silence a false positive by adding a specific line comment. PHP-Reaper is more complicated, though (e.g. it tries to figure out where a variable used in string concatenation comes from); on the same codebase we noticed some false positives, and that's why we added the silencing option to it.

PHP-Unlocker is simpler and more reliable. I also think that if you have a problematic code construct like the one above, you should just fix it right away.

Future Work

Currently we're moving away from PHP and gradually migrating functionality to Java, so I probably will not be able to support this project much longer. But if you have a burning need, let me know.

On that note, Java has better tooling around static code analysis. FindBugs has warnings about failing to close resources in case of an exception. Also, with the introduction of try-with-resources in Java 7, the manual labour of closing resources in case of an exception is becoming a thing of the past.

TL;DR It all comes down to economics. What is the cheapest possible way for a company to find defects in its application? For some it's letting the users find the problems; for others, extensive in-house testing before each release. Apples and oranges.

Recently I happened upon a book called "The Leprechauns of Software Engineering". Its premise is that, in software, there are things we take for granted and believe in blindly, and for some of those assumptions there are no (reliable) studies or scientific data to back them up.

You can also listen to an episode of the Ruby Rogues podcast featuring the book's author.

One of those assumptions is that a defect is more expensive to fix the later it's found in the development cycle (planning, development, testing, release to customer). This is known as Boehm's curve.

The author goes to great lengths to dispel this "myth". He investigates a number of papers on the subject, starting with the original one by Boehm, and finds none of them "plausible" enough. Basically, he dislikes the type of participants in the surveys (students versus working developers), the size of the projects (small versus large) and incorrectly cited data from other sources. To me this looks like an overstretch: a bit sensational, a bit tabloid-like.

Maybe the original Boehm data is wrong. Maybe the curve is not exactly exponential. Maybe it varies from industry to industry. But none of that changes the fact that it is definitely more expensive to fix a bug found by your customers than to fix the same bug before it reaches them. There may not be comprehensive studies about this, but it is painfully obvious to everyone with more than a couple of years of development experience.

I didn't want to write a post about something as obvious as water being wet; we know this and feel it every day. But I want inexperienced and young developers to get the right impression when reading this book. They should not produce crap. They should do what they can to produce quality software the first time.

Some Theory

“It is always cheaper to do the job right the first time.”

Phil Crosby

“One gets a good rating for fighting a fire. The result is visible; can be quantified. If you do it right the first time, you are invisible. You satisfied the requirements. That is your job. Mess it up, and correct it later, and you become a hero.”

W. Edwards Deming

Moving to the software world now: there are different industries and companies, and the cost of a late defect depends on a number of factors:

Is the product free or paid? As a customer, you don't pay for using Facebook with money; you pay in other ways, with your time, your data, your actions. If there is a problem with Facebook, who are you going to call? It's free, after all. Facebook boasts that they don't have a testing role (and it shows), but for them it does not make economic sense to have one: so many users and partners report bugs every day. On the other hand, think of a paid service like PayPal: if there is a defect, money is directly lost (more on working in the financial industry later). So, all things being equal, free services do not require high quality out of the door.

Is the company a startup or an established one? In a startup, you're trying to prove your idea first. Bugs don't matter much as long as you find the right product/market fit; as long as you have runway you can pivot to other solutions and worry about bugs later. If customers love your idea, they will not reject your app because of a few bugs. If you're an established company, on the other hand, you have a reputation to keep and can't afford many defects. Customers will run to your competitors.

How often can you deploy your product? If your product is SaaS, you can deploy whenever you like, even to fix a single typo. If you release once a year (say you're developing software distributed on CDs), the price of mistakes is high: you may not get a chance to release a fix until the next year. Mobile apps also fall into this category. You can release whenever you like, but currently there is no way to force your customers to upgrade, so you're stuck supporting the older versions. Software written for the Internet of Things is even more problematic: how often do you update the firmware of your smart TV? To sum up, bugs are cheaper to fix if you are able to deploy frequently.

Are you the only game in town? Are you a monopoly? You are a monopoly if you are Facebook, NASA, specialized software used only in-house, The Guardian, etc. What do Facebook's customers do if there is a bug? Use MySpace? No, they just sit and wait for a fix.

Can your software directly endanger human lives? The price to pay for a defect in X-ray software is way, way higher than, say, in a Twitter application. The same is true for avionics software, medical equipment and nuclear power plants.

As you can see, there is no way the cost of a defect is equal across all types of software. But one thing is certain: the price is always higher when fixing late, never cheaper.

The True Cost

Here is what happens when a bug is found by one of your clients. They pick up the phone or write an email to your customer facing department to complain. This department logs a ticket to the appropriate team. In some cases the support ticket passes through multiple layers: the first line of defense, technical level 1, technical level 2. The defect description ends up in the backlog of the development team. Usually the product owner decides when the bug will be fixed and schedules it. Then a developer picks up the task and fixes the bug. If there is a designated testing role, the fix is tested and the bug is marked as fixed. Then, through the same chain, the customer is notified (most of the time). All of those people are being paid. If the quality had been right the first time, you would not need that big a customer service department, or that many developers, or testers. You could keep the same number of people, but they could do value-added work instead.

The cost of bad quality in the above case can be calculated pretty easily: the number of people involved, times how much time they are involved, times their hourly rate. For example, if a customer-reported defect occupies five people for four hours at an average rate of 50 EUR per hour, that single defect costs 1,000 EUR before you even count the fix itself. For more details on how much bad quality costs in the long run, you can read Phil Crosby's book "Quality is Free".

But there is one cost that is hidden and not easy to calculate: the loss of reputation. If you ship a defect, the customer will complain, which wastes not only your time but also theirs. Besides complaining to you, they will say bad things about your product to their friends, coworkers, family and social network, and in the end those people will not buy from you. Missed sales opportunities: how can you know how many there are? How can you measure them? The only way to make sure this does not happen is to get the quality right from the beginning.

If a defect is found while still in development, it can be fixed faster. Why? Because the problem is still fresh in the developer's mind. Once you release the product to your customers, the developers move on to other features. Two weeks later, when a defect report comes in, the developer needs to shift focus away from the current task and remember the problem domain, or learn the functionality altogether (if it is a different developer than the one who originally created it).

Most of the time it is also hard to debug and test in a production environment. The audit logs may be inadequate, or you may not be able to debug at all due to performance and security concerns.

The cost of the actual fix may be the same: delete one line and commit it, as in the case of the 'goto fail;' bug. But the consequences are quite different depending on the point in time at which you fix it.

Examples

Still not persuaded that the price of fixing a bug after the product is released is higher? Here are some real life examples. Of course, of all the software written in the whole world, not every bug causes that much trouble, but here are a few outliers just to make the point.

Oh, but you'll say: I'm not developing such software. That's fine, and here is one example from my own career (there are others, but this may be the most costly).

A while ago I was working for a finance company. Every defect cost us money. Literally. 400 EUR here, 1,200 EUR there. A slightly off FX rate, a small rounding error. This was the cost of doing business and it was considered normal. One day, however, due to a bug, one of our customers woke up to find 100,000 EUR extra in his account. He immediately started to withdraw the money out of our system. We figured out what had happened and managed to stop 20,000 EUR from being withdrawn, but for the rest it was too late. This bug cost us 80,000 EUR.

Conclusion

There is no need for comprehensive studies on the exact cost of defects found later in the development cycle. We, as practitioners, know that the price is higher. We know it because we see it every day.


Two months ago we were moving production servers from one datacenter to another. It was supposed to be a boring, no-thrills event: there was no new code to be deployed, just the same code running from a different location. We were pretty confident there would be no problems, partly because we had migrated our internal test environment to the same location, and tested it, a couple of weeks earlier. Nothing could go wrong, right?

So we were "mildly" surprised when all hell broke loose after we started the servers in the new datacenter. The system responded normally for the first 5 minutes, then started slowing down more and more. Eventually it halted completely, and we had to switch back to the old datacenter until we figured out what the problem was.

After a couple of hours of diagnostics, we found the culprit. We use StatsD extensively to track all sorts of statistics: each second we log between 100 and 500 events. Because of this high volume, we had hardcoded the IP address of the StatsD server in the /etc/hosts files, so that no DNS queries were needed to resolve it (which makes the whole process faster).

When we moved to the new datacenter, we did not want hardcoded IP addresses in the /etc/hosts files, because they make maintenance harder. What happened is that we accidentally DDoS-ed the DNS server with requests to resolve the StatsD server's address. This in turn slowed down all the other resolution requests made to the DNS server, and hence the whole system slowed down significantly. As the old saying goes:

The operations team figured out how to solve this problem, but for us engineers a red light went on: in case of a similar emergency in the future, we had no way to temporarily stop logging to StatsD. All calls to StatsD were made by using the library API directly:

StatsD::increment('sys.methods.twitter.comments');

To make matters worse, we had 300+ such calls scattered across the codebase.

There was no way to disable StatsD logging all at once, if need be. So we decided to set up a simple thin wrapper around the external StatsD API call. Now we have the ability to switch logging on and off from one place in the codebase.
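
A minimal sketch of what such a wrapper could look like (the Metrics class name and its flag are illustrative, not the actual Komfo code):

// Thin wrapper: the only place in the codebase that talks to StatsD.
class Metrics
{
    private static $enabled = true; // flip from one place, e.g. via config

    public static function setEnabled($enabled)
    {
        self::$enabled = (bool)$enabled;
    }

    public static function increment($metric)
    {
        if (!self::$enabled) {
            return; // emergency stop: skip the external call entirely
        }
        StatsD::increment($metric);
    }
}

// Call sites use the wrapper instead of the library directly:
Metrics::increment('sys.methods.twitter.comments');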

Wrapping external dependencies in thin wrappers to gain tighter control over them is not a new concept, of course. Depending on how it's implemented, this pattern has different names: adaptor, facade, shim or just wrapper function. It is mostly used to mock/stub the external call for unit testing purposes, but it has even bigger benefits in terms of application resilience, maintenance and testability. If you haven't wrapped all of your external dependencies yet, I think the reasons listed below will persuade you to do so.

Unit Testing

This is the original purpose the pattern was invented for. Remember that your unit tests should not touch the network, databases or the filesystem. Let's assume that, in PHP, you want to unit test a method that uses file_get_contents() to read a file over the network. The only way to write a reliable unit test is to wrap file_get_contents() in a small method and call that method instead. Then, in your unit tests, you overwrite the small method with your stub/mock and simulate whatever behavior you want: file found, file not found, internal error, timeout.
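
A minimal sketch of the pattern (the class and method names are made up for illustration):

class RemoteFileReader
{
    // The only method that touches the network; tests overwrite it.
    protected function fetch($url)
    {
        return file_get_contents($url);
    }

    public function countLines($url)
    {
        $contents = $this->fetch($url);
        if ($contents === false) {
            throw new RuntimeException("Could not read $url");
        }
        return substr_count($contents, "\n");
    }
}

// In a unit test, subclass and stub the wrapper method:
class StubbedReader extends RemoteFileReader
{
    protected function fetch($url)
    {
        return "line 1\nline 2\n"; // simulate 'file found'; return false to simulate failure
    }
}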

Replacing Libraries

Imagine that in the scenario described above (the 300+ direct calls to the StatsD PHP library) we want to replace StatsD with another logging framework. Without an adaptor, you need to change the code in 300+ locations. With an adaptor, you change the code in only one place.

Logging

When all the external requests go through one adaptor, it is easy to keep track of and log them. For example, you might use Facebook's library to post messages. With the adaptor, you can track how many calls you made, how much time each one took, how many HTTP errors you got (4xx, 5xx), whether you have reached your API limit, etc.

Emergency Stop

In case something goes wrong, you can immediately stop all the calls to the external services in the adaptor, instead of everywhere in the codebase.

High Level Testing Stubs

Let's suppose your app is using an external service to make payments with credit cards. If you have an adaptor over your payment provider's library*, it can greatly benefit your high level tests (any tests that are not unit tests), whether manual or automated. For example, you may not want to contact the real payment provider in a test that would spend real money. In your adaptor you can create 'magic' credit card numbers for testing purposes:
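
A minimal sketch of the idea (the card numbers and class names below are made up):

class PaymentAdaptor
{
    const MAGIC_SUCCESS = '4111111100000001'; // always approved, never charged
    const MAGIC_DECLINE = '4111111100000002'; // always declined

    public function charge($cardNumber, $amountCents)
    {
        // Magic test numbers never reach the real provider.
        if ($cardNumber === self::MAGIC_SUCCESS) {
            return array('status' => 'approved', 'test' => true);
        }
        if ($cardNumber === self::MAGIC_DECLINE) {
            return array('status' => 'declined', 'test' => true);
        }
        return RealPaymentProvider::charge($cardNumber, $amountCents);
    }
}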

Be careful with those 'magic' numbers, though. You need a safeguard to ensure they cannot be used in production.

Similarly, if your app uses social media, you might want to run high level tests of what happens when Facebook is down, or when it rejects your new post.

Note: most payment providers these days offer a test environment, sometimes called a sandbox, to play with. The problems I've seen with such environments are that they may not support all the responses you need, and that they are treated as second class citizens and not maintained well (sometimes inaccessible for hours). They are also slow for test automation purposes, because all the traffic goes over the internet. Finally, they require you to switch between environments in order to test your app with real transactions.

Error Detection Logic

Adaptors help you keep all the error detection logic for an external library in one place. For example, you might want to catch and/or re-throw exceptions based on the HTTP error code you get. Or maybe you want to perform some specific action every time an empty response is received. Again, all this error handling logic lives only in the adaptor rather than being scattered everywhere in the codebase, avoiding code duplication.
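
A minimal sketch (ExternalApiClient and the exception classes below are made up for illustration):

class AuthorizationException extends RuntimeException {}
class TransientApiException extends RuntimeException {}
class EmptyResponseException extends RuntimeException {}

class ApiAdaptor
{
    public function get($endpoint)
    {
        $response = ExternalApiClient::get($endpoint); // assumed to expose ->code and ->body

        // All HTTP-code-to-exception translation happens in this one place.
        if ($response->code === 401 || $response->code === 403) {
            throw new AuthorizationException('Check tokens and permissions');
        }
        if ($response->code >= 500) {
            throw new TransientApiException('Remote service error, safe to retry');
        }
        if ($response->body === '') {
            throw new EmptyResponseException("Empty response from $endpoint");
        }
        return $response->body;
    }
}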

Converting Incoming Data

Most of the time, you need to do some kind of data conversion: the data you receive from the outside world needs to match your internal domain representation. You may want to filter some of the received data out, convert binary to text, XML to JSON and so on. These are technical details that belong in the adaptor. They should not leak into the upper abstraction layers that deal with business logic; low level technical details and high level business logic should not mix.

Retry Logic

If the external library you're using touches the network to send or receive data, you can greatly benefit from retry logic in your adaptor. The internet may be slow or unreliable, and the remote service may be flaky and sometimes throw an internal server error. The majority of the time these are transient errors: they go away if you retry later. You will increase your app's resilience if you implement some sort of retry logic around your external communication. For example, your app might send SMS through a gateway and retry on any HTTP code other than 200 or 403.
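
A simplified sketch of such retry logic (SmsGateway is a made-up client, assumed to return an HTTP status code):

function sendSmsWithRetry($phone, $text, $maxAttempts = 3)
{
    $code = 0;
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $code = SmsGateway::send($phone, $text);

        // 200 is success; 403 is permanent, so retrying won't help.
        if ($code === 200 || $code === 403) {
            return $code;
        }
        sleep($attempt); // simple backoff before the next attempt
    }
    return $code; // still failing after all attempts
}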

You need to be careful to distinguish which errors are transient and which are permanent, and retry only the former. It makes no sense to retry authentication or forbidden errors, as you most likely need to change access tokens or permissions. It does make sense to retry internal server errors and timeouts. Whether to retry 'not found' errors depends on what your application is doing: if it uploads a file and then verifies that the upload is correct, retrying makes sense. However, if your application is checking the weather for an airport by its three letter code (e.g. MAD for Madrid), a 'not found' error most likely means a wrong airport code, and retrying the request will only make your app slower.

Given the widespread adoption of node.js, it's surprising that there is not much synthesized information about the specifics of writing unit tests on this platform. Recently I open sourced Nagual, an HTTP simulator for faster and more reliable automated tests. These are the challenges I faced writing unit tests for a node.js application.

Linter

Linters are part of the three pillars of automated testing. It does not make sense not to use one, given how easy it is to set up. Linters are fast and incredibly useful when working with interpreted languages, and they will catch corner cases and bugs that unit tests miss. For JavaScript you have two main options (excluding the father of JS linters, JSLint, as everyone finds it too opinionated): JSHint and ESLint. ESLint is newer and is what the cool kids are using these days.

Test Framework

There are a couple of actively supported unit test frameworks for node.js. I chose Jasmine, mainly because it has all the basic functionality built in. For example, it has its own spy functionality:

spyOn(foo, 'setBar');

If you choose any other unit test framework, say Mocha, you need to use an external spy library like Sinon.JS:

var sinon = require('sinon');
sinon.spy(foo, "setBar");

It's not a big deal, but I'm more productive when all the needed parts are in the same package: no need to do research, pick and choose, and the paradox of choice does not kick in.

Directory Structure

Whatever unit test framework you choose, it is widely adopted practice to mirror the directory structure of your main application in the tests directory.

As you can see, the directory tests/source/contactTheRealServer contains all the tests for the four functions exposed by the contactTheRealServer module. The test names need to end with 'Test' or 'test' so that Jasmine can recognize and run them.
This is how the config file looks:

Small Modules

Set up your linter to alert you if a function's complexity is above 10. This will make you break big modules into smaller ones, which are easier to unit test. The number of tests you need to write for a function is the same as the cyclomatic complexity of that function. When I started Nagual as a proof of concept, it was one big function, so to unit test it I had to break it down into smaller pieces.

You can also export only the functions that other modules need. This is similar to making a method public in OO languages. If you do not export a function to the outside world, there is no way to test it (directly).

Writing basic test cases for a node.js application is pretty much like writing them in other languages, so I'll focus only on the corner cases below.

Spying on Imported Functions

Suppose you want to write a unit test that exercises handleRequest() and, in particular, verifies that returnResponse() has been called. With the current code you can't do this, because the unit test does not have any control over the returnResponse() function. One way to fix this is to not use returnResponse() directly, but instead assign it to module.exports and call it through that interface. Here is how the code looks after the change:

Stubbing require()

In Nagual's case there is a function that loads objects from the file system via require(). You cannot directly control require(), but you can wrap it in a function that you can overwrite, and then use this new function to load your objects (the full example is here):

In case the request is either GET or HEAD, there is no body and it is trivial to test. In case the request has a body (POST, PUT or DELETE methods), an event listener is involved: the on() function that listens for 'data' and 'end' events. Testing this is not straightforward because these are asynchronous operations. One way to test it is the following:

Node.js has a stream called PassThrough (all streams are instances of EventEmitter). It acts as a test dummy that can emit 'data' and 'end' events, and it is used instead of the real request. The last assertion makes sure that the POST buffer is assembled correctly (which also means that the 'end' event was emitted).

This is the last part of a collection of seven blog posts about how to write reliable software. When a new developer joins the company we go over this list, so I've decided to organize my thoughts on the subject a bit and share them with a wider audience.

"Survival is optional. No one has to change.”

W. Edwards Deming

Continuous Improvement

After the Second World War, Japan's industry was devastated. The country was occupied by the United States until 1952. After the occupation ended, the US retained a heavy presence, partly to counter any Soviet Union desire for expansion into the Far East (remember that the Korean War was in full swing). The US was also worried that if Japan's population remained poor and jobless for too long, it might naturally turn to communism.

In 1947, Dr Deming, as a statistician, was asked to help with Japan's census. In 1950 he was invited to speak to the Japanese Union of Scientists and Engineers, and so began his long relationship with Japanese industry leaders. He is credited by many as the man responsible for the Japanese post-WWII economic miracle.

Perhaps ironically, he was better known and respected in Japan than in his home country. This began to change, slowly, around the time NBC aired the documentary 'If Japan Can... Why Can't We' in 1980. In his book 'Out of the Crisis' he outlined 14 key principles to transform any business. Principle #5 reads:

“Improve constantly and forever the system of production and service, to improve quality and productivity, and thus constantly decrease costs.”

You might think that Dr Deming was influenced by his stay in Japan and by the self-improvement philosophy in Buddhism. However, this was not the case, as noted in 'Thinking About Quality: Progress, Wisdom, and the Deming Philosophy'. While Buddhism deals with improving the self, Dr Deming was talking about improving the system as a whole (and thus ourselves). Before Dr Deming, continuous improvement was not practiced at large scale in factories and enterprises, in Japan or anywhere else.

Self Improvement

You need to start with the system you have the most control over: yourself. Constant improvement, especially in our line of work (software development), is mandatory. Technology changes all the time, customers demand more, and there is a constant push to deliver better, faster and cheaper software. It all means that you need to stay fluent with the latest tools and techniques. Read blogs, follow influencers on Twitter, attend conferences, read books. Constantly evolve and embrace lifelong learning. This is also good job security advice.

System Improvement

System improvement means improving the environment where you work and operate; the system also encompasses your customers and your suppliers. For example, when you notice an error, fix it immediately. Need to add a line in a method, but you notice a refactoring opportunity? Do it (remember the old boy scout rule: leave the camp cleaner than you found it). You can improve everything, indefinitely. This is one way to ensure job satisfaction and not get bored if you stay in a job for more than three years (which counts as a long tenure in the software development industry).

The New View

The old Tayloristic approach to management was that workers just need to do what they are told by their superiors and shut up; managers do all the planning and thinking for them. Dr Deming teaches us that the workers are the best positioned to notice problems and suggest solutions. So if you're in a leadership role, go to the gemba. On a regular basis, gather the team and ask each member to make one suggestion for how a piece of the system can be improved (think kaizen, quality circles). I'd even go as far as to say that if you're having retrospective meetings, this should be their most important part: list small improvement suggestions and act on them. Many small improvements amass over time into big benefits. There is no silver bullet, no insta-fix-in-two-days solution of the kind most managers crave.

Here are some actual examples of things that are important to engineers to be improved:

The build is taking 10 minutes, can we make it less than 5?

Setting up the development environment on a new employee's PC is done by hand every time. Can this be automated?

Data for testing in the staging environment comes as a subset of production data, but it is manually scrubbed of PII by the operations team. Can this be scripted?

We have to fill out an online time sheet each month, listing the projects we've worked on, and it takes 15 minutes per person every week. Can this information be taken automatically from the tasks assigned to us in Jira?

When you make an improvement, show and teach others in the company what you've done and how they can do the same. This way your company learns rapidly and the feedback process is constantly amplified. This is described in detail in the book The High-Velocity Edge.

Leadership

Company leadership needs to realize that these activities will not result in any immediate revenue, but they will pay off in the long run. In the book 'Drive: The Surprising Truth About What Motivates Us', Daniel Pink writes that there are three main drivers that keep us engaged in the long run in whatever we do: autonomy, mastery and purpose. Continuous improvement serves the second one. Allow your employees to constantly pursue higher mastery of their craft: they will not get bored and leave, because they will see the results of their work and their improvements implemented. It will also keep your company in business in the long run. Don't believe me? Just start reading W. Edwards Deming; his are probably the only books on management you'll ever need.