Last winter, I wrote and released a service for a client I am still working with. Overall, the service has met business needs and performance requirements. However, one of the teams that consumes the service told me they were periodically running into an issue where the service would return 500 errors and not return to normal until the service was restarted. I asked when this was occurring and put on my detective’s hat.

In this article, I will walk through the process I used to diagnose the bug and to build the integration test needed to fix it the right way. Doing so meant creating a test that accurately reproduced the scenario my service was experiencing in PROD, then creating a fix that took that test from failing to passing. Finally, I worked to increase confidence in the correctness of the code for all future releases, which is only possible through automated testing.

Diagnosing the Bug

I read through my service’s log files around the time the 500 errors started happening. They quickly showed a pretty serious problem: a little before midnight on a Saturday, my service would start throwing errors. At first there was a variety of errors, all SQLException, but eventually the root cause was always the same:

org.springframework.jdbc.CannotGetJdbcConnectionException: Could not get JDBC Connection; nested exception is java.sql.SQLRecoverableException: IO Error: The Network Adapter could not establish the connection
    at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:80)

This went on for several hours until early the following morning, when the service was restarted and went back to normal.

Checking with the cave trolls, er, DBAs, I found the database I was connecting to went down for maintenance. The exact details escape me, but I believe the database was down for a roughly 30-minute window. So, clearly, my service had an issue reconnecting to a database once the database recovered from an outage.

Fixing the Bug the Wrong Way

The most straightforward way of fixing this bug (and one I have often gone to in the past) would have been to Google “recovering from database outage,” which would likely have led me to a Stack Overflow thread that answers my question. I would then have copied and pasted in the provided answer and pushed the code to be tested.

If production were being severely affected by a bug, this approach might be necessary in the short term. That said, time should be set aside in the immediate future to cover the change with an automated test.

Fixing the Bug the Right Way

As is often the case, doing things the “right way” means a significant front-loaded time investment. This adage is definitely true here.

The return on investment, however, is less time spent fixing bugs later and increased confidence in the correctness of the code. In addition, tests can be an important form of documentation of how the code should behave in a given scenario.

While this specific test case is a bit esoteric, that documentation role is an important factor to keep in mind when designing and writing tests, be they unit or integration: give tests good names, make sure test code is readable, and so on.

Solution 1: Mock Everything

My first crack at writing a test for this issue was to try to “mock everything.” While Mockito and other mocking frameworks are quite powerful and are getting ever easier to use, after mulling over this solution I quickly concluded that I would never be confident I was testing anything beyond the mocks I had written.

Getting a “green” result would not increase my confidence in the correctness of my code, which is the whole point of writing automated tests in the first place! On to another approach.

Solution 2: Use an In-Memory Database

Using an in-memory database was my next attempt at writing this test. I’m a pretty big proponent of H2; I’ve used it extensively in the past and was hoping it might address my needs here once again. I probably spent more time here than I should have.

While this approach ultimately didn’t pan out, the time spent wasn’t entirely wasted: I did learn a decent bit more about H2. One of the advantages of doing things the “right way” (though often painful in the moment) is that you learn a lot. The knowledge gained might not be useful at the time, but could prove valuable later.

The Advantages of Using an In-Memory Database

Like I said, I probably spent more time here than I should have, but I did have my reasons for wanting this solution to work. H2 and other in-memory databases have a couple of very desirable traits:

Speed: Starting and stopping H2 is quite fast, sub-second. While a little slower than using mocks, my tests would still be plenty fast.

Portability: H2 can run entirely from an imported jar, so other developers can just pull down my code and run all the tests without performing any additional steps.

Additionally, my eventual solution had a couple of non-trivial disadvantages, which I will cover as part of that solution below.

Writing the Test

Somewhat meaningfully, up to this point I still hadn’t written a single line of production code. A central principle of TDD is to write the test first and the production code later. This methodology, along with ensuring a high level of test coverage, also encourages the developer to make only the changes that are necessary. This goes back to the goal of increasing confidence in the correctness of your code.

Initially, I felt I was on the right path with this solution. There is the question of how I start the H2 server back up (one problem at a time!). However, when I run the test, it fails with an error analogous to what my service is experiencing in PROD.

But if I modify my test case and simply attempt a second connection to the database:

conn = DataSourceUtils.getConnection(dataSource);

The exception goes away and my test passes without me making any changes to my production code. Something isn’t right here…

Why This Solution Didn't Work

So, using H2 won’t work. I actually spent quite a bit more time trying to get H2 to work than the above would suggest. Troubleshooting attempts included connecting to a file-based H2 server instance instead of just an in-memory one, and connecting to a remote H2 server. I even stumbled upon the H2 Server class, which would have addressed the server shutdown/startup issue from earlier.

None of those attempts worked. The fundamental problem with H2, at least for this test case, is that attempting to connect to a database will cause that database to start up if it isn’t currently running. There is a bit of a delay, as my initial test case shows, but the database does come back on its own. In PROD, when my service attempts to connect to a database, it does not cause the database to start up (no matter how many times I attempt connecting to it). My service’s logs can certainly attest to this fact. On to another approach.
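The behavior the test needs from its database can be sketched without any database at all. The following is a minimal illustration, not the test from my project: FakeDatabaseServer, canConnect, and OutageDemo are hypothetical names, and a plain TCP listener stands in for a real database server. It only shows the property H2 lacked here: once the server is stopped, connection attempts keep failing until something explicitly restarts it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

/** Hypothetical stand-in for a database server: a TCP listener that can be stopped and restarted. */
final class FakeDatabaseServer {
    private ServerSocket listener;
    private int port; // 0 asks the OS for a free port on the first start

    void start() {
        try {
            ServerSocket socket = new ServerSocket();
            socket.setReuseAddress(true); // allow rebinding the same port after a stop
            socket.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), port));
            port = socket.getLocalPort(); // remember the port so a restart reuses it
            listener = socket;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    void stop() {
        try {
            listener.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    int port() {
        return port;
    }

    /** Makes one connection attempt, as a client would, and reports whether it succeeded. */
    static boolean canConnect(int port) {
        try (Socket client = new Socket()) {
            client.connect(new InetSocketAddress(InetAddress.getLoopbackAddress(), port), 1000);
            return true;
        } catch (IOException unreachable) {
            return false;
        }
    }
}

final class OutageDemo {
    public static void main(String[] args) {
        FakeDatabaseServer database = new FakeDatabaseServer();
        database.start();
        int port = database.port();
        System.out.println("up: " + FakeDatabaseServer.canConnect(port));

        database.stop(); // the maintenance window begins
        for (int attempt = 1; attempt <= 3; attempt++) {
            // Repeated attempts do not bring the server back.
            System.out.println("down, attempt " + attempt + ": " + FakeDatabaseServer.canConnect(port));
        }

        database.start(); // only an explicit restart brings it back
        System.out.println("restarted: " + FakeDatabaseServer.canConnect(port));
        database.stop();
    }
}
```

A real test for this bug would go through the service’s DataSource rather than a raw socket, but the start, stop, retry, restart shape would be the same.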

Solution 3: Connect to a Local Database

Mocking everything won’t work. Using an in-memory database didn’t pan out either. It looks like the only way I will be able to properly reproduce the scenario my service was experiencing in PROD is by connecting to a full database server. Bringing down a shared development database is out of the question, so this database needs to run locally.

The Problems With This Solution

Everything before this should give a pretty good indication that I really wanted to avoid going down this path. There are some good reasons for my reluctance:

Decreased portability: If another developer wanted to run this test, he or she would need to download and install a database on the local machine. He or she would also need to make sure the configuration details matched what the test is expecting. This is a time-consuming task and would lead to at least some amount of “out-of-band” knowledge.

Slower: Overall, my test still isn’t too slow, but it does take several seconds to start up, shut down, and then start up again, even against a local database. While a few seconds doesn’t sound like much, the time can add up with enough tests. This is less of a concern for integration tests, which are allowed to take longer (more on that later), but the faster the integration tests, the more often they can be run.

Organizational wrangling: To run this test on the build server means I would now need to work with my already-overburdened DevOps team to set up a database on the build box. Even if the ops team wasn’t overburdened, I just like to avoid this when possible as it’s just one more step.

Licensing: In my code example, I am using MySQL as my test database implementation. However, for my client, I was connecting to an Oracle database. Oracle does offer Oracle Express Edition (XE) for free; however, it does come with stipulations. One of those stipulations is that two instances of Oracle XE cannot be running at the same time. The specific case of Oracle XE aside, licensing can become an issue when connecting to specific product offerings; it’s something to keep in mind.

Success! (Finally)

Originally this article was a good bit longer, which also gave a better impression of all the blood, sweat, and tears, er, work that went into getting to this point. Ultimately, such information isn’t particularly useful to readers, even if cathartic for the author to write about. So, without further ado, a test that accurately reproduces the scenario my service was experiencing in PROD:

The underlying problem my service was experiencing was that when a connection from the DataSource’s connection pool failed to connect to the database, it became “bad.” The next problem was that my DataSource implementation would not drop these “bad” connections from the connection pool. It just kept trying to use them over and over.

The fix for this is luckily pretty simple. I needed to instruct my DataSource to test a connection when the DataSource retrieved it from the connection pool. If this test failed, the connection would be dropped from the pool and a new one attempted. I also needed to provide the DataSource with a query it could use to test a connection.

Finally, and this is not strictly necessary but is useful for testing: by default, my DataSource implementation would only test a connection every 30 seconds. It would be nice for my test to run in less than 30 seconds, and the length of this period isn’t really meaningful, so I made the validation interval configurable through a property file.
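To make the idea concrete, here is a minimal sketch of validate-on-borrow. It is a toy, not my DataSource implementation or any real pool library: FakeConnection, ValidatingPool, and ValidatingPoolDemo are hypothetical names, a simple predicate stands in for the validation query, and the validation interval is left out, so every borrow is validated.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Hypothetical stand-in for a pooled JDBC connection. */
final class FakeConnection {
    final int id;
    boolean alive = true; // flipped to false to simulate a database outage

    FakeConnection(int id) {
        this.id = id;
    }
}

/** Toy pool: optionally validates each idle connection before handing it out. */
final class ValidatingPool<C> {
    private final Deque<C> idle = new ArrayDeque<>();
    private final Supplier<C> factory;    // opens a brand-new connection
    private final Predicate<C> validator; // plays the role of the validation query
    private final boolean testOnBorrow;

    ValidatingPool(Supplier<C> factory, Predicate<C> validator, boolean testOnBorrow) {
        this.factory = factory;
        this.validator = validator;
        this.testOnBorrow = testOnBorrow;
    }

    C borrow() {
        while (!idle.isEmpty()) {
            C candidate = idle.pop();
            if (!testOnBorrow || validator.test(candidate)) {
                return candidate; // healthy, or validation is switched off
            }
            // Validation failed: the candidate is dropped and the next one is tried.
        }
        return factory.get(); // nothing usable left in the pool, so open a new one
    }

    void giveBack(C connection) {
        idle.push(connection);
    }

    int idleCount() {
        return idle.size();
    }
}

final class ValidatingPoolDemo {
    public static void main(String[] args) {
        for (boolean testOnBorrow : new boolean[] {false, true}) {
            AtomicInteger ids = new AtomicInteger();
            ValidatingPool<FakeConnection> pool = new ValidatingPool<FakeConnection>(
                    () -> new FakeConnection(ids.incrementAndGet()),
                    connection -> connection.alive,
                    testOnBorrow);

            FakeConnection first = pool.borrow();
            pool.giveBack(first);
            first.alive = false; // the "database" goes down while the connection sits idle

            FakeConnection second = pool.borrow();
            System.out.println("testOnBorrow=" + testOnBorrow
                    + " -> usable connection: " + second.alive);
        }
    }
}
```

With validation off, the pool hands back the same dead connection, which mirrors what my service kept doing in PROD. With validation on, the dead connection is discarded and a fresh one is opened.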

One final note for writing integration tests. Initially, I created a test configuration file that I used to configure the DataSource to use in my test. However, this is incorrect.

The problem is that if someone were to remove my fix from the production configuration file but leave it in the test configuration file, my test would still pass, but my actual production code would once again be vulnerable to the problem I spent all this time fixing! This is a mistake that would be easy to imagine happening. So, be sure to use your actual production configuration files when writing integration tests.

Automating the Test

The end is almost in sight. I have a test case that accurately reproduces the scenario I am experiencing in PROD. I have a fix that then takes my test from failing to passing. However, the point of all this work wasn’t to just have confidence that my fix works for the next release, but for all future releases.

Maven users: hopefully you are already familiar with the surefire plugin. Or, at least hopefully, your DevOps team already has your parent pom set up so that when a project is being built on your build server, all those unit tests you took the time to write are being run with every commit.

This article, however, isn’t about writing unit tests, but about writing integration tests. An integration test suite will typically take much longer to run (sometimes hours) than a unit test suite (which should take no more than five to 10 minutes). Integration tests are also typically more subject to volatility. While the integration test I wrote in this article should be stable (if it breaks, it should be cause for concern), when connecting to a development database, you can’t always be 100% confident the database will be available or that your test data will be correct or even present. So a failed integration test doesn’t necessarily mean the code is incorrect.

Luckily, the folks behind Maven have already addressed this with the failsafe plugin. Whereas the surefire plugin, by default, will look for classes whose names are prefixed or suffixed with Test, the failsafe plugin will look for classes prefixed or suffixed with IT (Integration Test). Like all Maven plugins, you can configure when the plugin’s goals should execute. This gives you the flexibility to have your unit tests run with every code commit, but your integration tests run only during a nightly build. This can also prevent a scenario in which a hotfix needs to be deployed, but a resource that an integration test depends upon isn’t present.

Final Thoughts

Writing integration tests can be a time-consuming and difficult task. It requires extensive thought about how your service will interact with other resources in PROD. The task is even more difficult and time-consuming when you are specifically testing for failure scenarios, which often requires more control over the resource your test is connecting to, as well as drawing on past experience and knowledge of which scenarios to test for.

Despite this high cost in time and effort, the investment will pay for itself many times over. Increasing confidence in the correctness of code, which is only possible through automated testing, is central to shortening the development feedback cycle.