Monday, 12 December 2016

One of the things you quickly discover when moving from building services on-premise to “the cloud” is quite how many more bits of wire and kit suddenly sit between you and your consumer. Performance-wise this already elongated network path can then be further compounded when the framework you’re using invokes unintuitive behaviour by default [1].

The Symptoms

The system was a new REST API built in C# on the .Net framework (4.6) and hosted in the cloud with AWS. This AWS endpoint was then further fronted by Akamai for various reasons. The initial consumer was an on-premise adaptor (also written in C#) which itself had to go through an enterprise grade web proxy to reach the outside world.

Naturally monitoring was added in fairly early on so that we could start to get a feel for how much added latency moving to the cloud would bring. Our first order approximation to instrumentation allowed us to tell how long the HTTP requests took to handle along with a breakdown of the major functions, e.g. database queries and 3rd party requests. Outside the service we had some remote monitoring too that could tell us the performance from a more customer-like position.

When we integrated with the 3rd party service some poor performance stats caused us to look closer into our metrics. The vast majority of big delays were outside our control, but it also raised some other questions as the numbers didn’t quite add up. We had expected the following simple formula to account for virtually all the time:

HTTP Request Time ~= 3rd Party Time + Database Time

However we were seeing a 300 ms discrepancy in many (but not all) cases. It was not our immediate concern as there was bigger fish to fry but some extra instrumentation was added to the OWIN pipeline and we did a couple of quick local profile runs to look out for anything obviously out of place. The finger seemed to point to time lost somewhere in the Nancy part of the pipeline, but that didn’t entirely make sense at the time so it was mentally filed away and we moved on.

Serendipity Strikes

Whilst talking to the 3rd party about our performance woes with their service they came back to us and asked if we could stop sending them a “Expect: 100-Continue” header in our HTTP requests.

This wasn’t something anyone in the team was aware of and as far as we could see from the various RFCs and blog posts it was something “naturally occurring” on the internet. We also didn’t know if it was us adding it or one of the many proxies in between us and them.

We discovered how to turn it off, and did, but it made little difference to the performance problems we had with them, which were in the order of seconds, not milliseconds. Feeling uncomfortable about blindly switching settings off without really understanding them we reverted the change.

The mention of this header also cropped up when we started investigating some errors we were getting from Akamai that seemed to be more related to a disparity in idle connection timeouts.

Eventually, as we learned more about this mysterious header someone in the team put two-and-two together and realised this was possibly where our missing time was going too.

The Cause

Our REST API uses PUT requests to add resources and it appears that the default behaviour of the .Net HttpClient class is to enable the sending of this “Expect: 100-Continue” header for those types of requests. Its purpose is to tell the server that the headers have been sent but that it will delay sending the body until it receives a 100-Continue style response. At that point the client sends the body, the server can then process the entire request and the response is handled by the client as per normal.

Yes, that’s right, it splits the request up so that it takes two round trips instead of one!

Now you can probably begin to understand why our request handling time appeared elongated and why it also appeared to be consumed somewhere within the Nancy framework. The request processing is started and handled by the OWN middleware as that only depends on the headers, it then enters Nancy which finds a handler, and so requests the body in the background (asynchronously). When it finally arrives the whole request is then passed to our Nancy handler just as if it had been sent all as a single chunk.

The Cure

When you google this problem with relation to .Net you’ll see that there are a couple of options here. We were slightly nervous about choosing the nuclear option (setting it globally on the ServicePointManager) and instead added an extra line into our HttpClient factory so that it was localised:

We re-deployed our services, checked our logs to ensure the header was no longer being sent, and then checked the various metrics to see if the time was now all accounted for, and it was.

Epilogue

In hindsight this all seems fairly obvious, at least, once you know what this header is supposed to do, and yet none of the people in my team (who are all pretty smart) joined up the dots right away. When something like this goes astray I like to try and make sense of why we didn’t pick it up as quickly as perhaps we should have.

In the beginning there were so many new things for the team to grasp. The difference in behaviour between our remote monitoring and on-premise adaptor was assumed to be one of infrastructure especially when we had already battled the on-premise web proxy a few times [2]. We saw so many other headers in our requests that we never added so why would we assume this one was any different (given none of us had run across it before)?

Given the popularity and maturity of the Nancy framework we surmised that no one would use it if there was the kind of performance problems we were seeing, so once again were confused as to how the time could appear to be lost inside it. Although we were all aware of what the async/await construct does none of us had really spent any serious time trying to track down performance anomalies in code that used it so liberally and so once again we had difficulties understanding perhaps what the tool was really telling us.

Ultimately though the default behaviour just seems so utterly wrong that none of use could imagine the out-of-the-box settings would cause the HttpClient to behave this way. By choosing this default we are in essence optimising PUT requests for the scenario where the body does not need sending, which we all felt is definitely the exception not the norm. Aside from large file uploads or massive write contention we were struggling to come up with a plausible use case.

I don’t know what forces caused this decision to be made as I clearly wasn’t there and I can’t find any obvious sources that might explain it either. The internet and HTTP has evolved so much over the years that it’s possible this behaviour provides the best compatibility with web servers out-of-the-box. My own HTTP experience only covers the last few years along with few more around the turn of the millennium, but my colleagues easily cover the decades I’m missing so I don’t feel I’m missing anything obvious.

Hopefully some kind soul will use the comments section to link to the rationale so we can all get a little closure on the issue.

Thursday, 17 November 2016

In my recent post “Tautologies in Tests” I adapted one of Einstein’s apocryphal sayings and suggested that tests should be “as precise as possible, but not too precise”. But what did I mean by that? How can you be too precise, in fact isn’t that the point?

Mocking

One way is to be overly specific when tracking the interactions with mocks. It’s very easy when using a mocking framework to go overboard with your expectations, just because you can. My personal preference (detailed before in “Mock To Test the Outcome, Not the Implementation”) is to keep the details of any interactions loose, but be specific about the outcomes. In other words what matters most is (usually) the observable behaviour, not necessarily how it’s achieved.

For example, rather than set-up detailed instructions on a mock that cover all the expected parameters and call counts I’ll mostly use simple hand-crafted mocks [1] where the method maps to a delegate where I’ll capture only the salient details. Then in the assertions at the end I verify whatever I need to in the same style as the rest of the test. Usually though the canned response is test case specific and so rarely needs any actual logic.

In essence what I’m creating some people prefer to call stubs as they reserve the term “mocks” for more meatier test fakes that record interactions for you. I’d argue that using the more complex form of mock is largely unnecessary and will hurt in the long run. To date (anecdotally speaking) I’ve wasted too much time “fixing” broken tests that overused mocks by specifying every little detail and were never written to give the implementation room to manoeuvre, e.g. during refactoring. In fact an automated refactoring tool is mandatory on code like this because the methods are referenced in so many tests it would take forever to fix-up manually.

I often feel that some of the interactions with dependencies I’ve seen in the past have felt analogous to testing private methods. Another of my previous posts that was inspired by mocking hell is “Don’t Pass Factories, Pass Workers”. Naturally there is a fine line here and maybe I’ve just not seen enough of it done well to appreciate how this particular tool can be used effectively.

White-Box Testing

The other form of overly specific test I’ve seen comes from what I believe is relying too much on a white-box testing approach so that the tests express the output exactly.

The problem with example based tests is that they are often taken literally, which I guess is kind of the point, but as software engineers we should try and see passed the rigid examples and verify the underlying behaviour instead, which is what we’re really after.

For example, consider a pool of numbers [2] up to some predefined limit, say, 10. A naïve approach to the problem might test the pool by asserting a very specific sequence, i.e. the starting one:

for (var number in expected) Assert.That(pool.Acquire(), Is.EqualTo(number)); }

From a white-box testing approach we can look inside the NumberPool and probably see that it’s initially generating numbers using the ++ operator. The implementation might eagerly generate that sequence in the constructor, add them to the end of a queue, and then divvy out the front of the queue.

From a “programmer’s test” point of view (aka unit test) it does indeed verify that, if my expectation is that the implementation should return the exact sequence 1..10, then it will. But how useful is that for the maintainer of this code? I’d argue that we’ve over-specified the way this unit should be allowed to behave.

Verify Behaviours

And that, I think, lies at that heart of the problem. For tests to be truly effective they should not describe exactly what they do, but should describe how they need to behave. Going back to our example above the NumberPool class does not need to return the exact sequence 1..10, it needs to satisfy some looser constraints, such as not returning a duplicate value (until re-acquired), and limiting the range of numbers to between 1 and 10.

With these two tests we are free to change the implementation to generate a random sequence in the constructor instead if we wanted, and they would still pass, because it conforms to the looser, albeit still well defined, behaviour. (It may have unpredictable performance characteristics but that is a different matter.)

Once again we are beginning to enter the realm of property based testing which forces us to think harder about what behaviours our code exhibits rather than what it should do in one single scenario.

This does not mean there is no place for tests that take a specific set of inputs and validate the result against a known set of outputs. On the contrary they are an excellent starting point for thinking about what the real test should do. They are also important in scenarios where you need some smoke tests that “kick the tyres” or you are naturally handling a very specific scenario.

Indicative Inputs

Sometimes we don’t intend to make our test look specific but it just turns out that way to the future reader. For example in our NumberPool tests above what is the significance of the number “10”? Hopefully in this example it is fairly obvious that it is an arbitrary value as the test names only talk about “a limit”. But what about a test for code that handles, say, an HTTP error?

In this test we have a mock (nay stub) HTTP server that will return a non-2XX style result code. Now, what is the significance of the InternalServerError result code returned by the stub? Is it a specific result code we’re handling here, or an indicative one in the 5XX range? The test name uses the term “service unavailable” which maps to the more specific HTTP code 503, so is this in fact a bug in the code or test?

Unless the original author is around to ask (and even remembers) we don’t know. We can surmise what they probably meant by inspecting the production code and seeing how it processes the result code (e.g. a direct comparison or a range based one). From there we might choose to see how we can avoid the ambiguity by refactoring the test. In the case where InternalServerError is merely indicative we can use a suitably named constant instead, e.g.

A clue that there is a disconnect is when the language used in the test name isn’t correctly reflected in the test body itself. So if the name isn’t specific then nor should the test be, but also vice-versa, if the name is specific then expect the test to be. A corollary to this is that if your test name is vague don’t surprised when the test itself turns out equally vague.

Effective Tests

For a suite of tests to be truly effective you need them to remain quietly in the background until you change the code in a way that raises your awareness around some behaviour you didn’t anticipate. The fact that you didn’t anticipate it means that you’ll be relying heavily on the test rather than the code you just changed to make sense of the original intended behaviour.

When it comes under the spotlight (fails) a test needs to convince you that it was well thought out and worthy of your consideration. To be effective a guard dog has to learn the difference between friend and foe and when we write tests we need to learn how to leave enough room for safe manoeuvring without forgetting to bark loudly when we exceed our remit.

[1] When you keep your interfaces simple and focused this is pretty easy given how much a modern IDE can generate for you when using a statically typed language.

[2] This example comes from a real one where the numbers where identifiers used to distinguish compute engines in a grid.

Tuesday, 15 November 2016

As I pointed out in my blog post that collates Season One of my In The ToolboxC Vu column I never intended to write more than a couple of introductory articles before handing it over for others to share their experiences. Yet now, three years later, I’m still plugging away at it and Season Three is already in the making with a couple of episodes already under my belt.

Just as before I also strongly advise you to become a member of the ACCU so you can get this, plus loads of much better content, which may or may not be published online by their respective authors. As I write this post it’s still only a measly £45 per year and is one of the last remaining printed journals about programming.

We have so many ideas for our products but only so many hours in the day to develop them. Sometimes all it needs is a simple text file in the repo, whilst bigger projects seem to demand an enterprise-grade solution like JIRA.

As programmers we need a safe environment in which to write and test our code, free from the distractions going on around us. When running the tests it should not be at the mercy of other developers running tests at the same time as us; first and foremost we start in isolation, if we can.

It’s a simple question: how do you find a piece of text? And yet there is a dizzying array of choices available that depend heavily on what’s accessible at the time and where and how that elusive text is stored.

In the move to go digital the humble whiteboard has been pushed aside, which is disappointing as it’s still probably the best design tool available. It also has many other uses than drawing pictures of boxes, drums and cylinders.

Monday, 14 November 2016

In the past few years I’ve worked on a few projects where TIBCO has been the message queuing product of choice within the company. Naturally being a test-oriented kind of guy I’ve used unit and component tests for much of the donkey work, but initially had to shy away from writing any automated integration tests due to the inherent difficulties of getting the system into a known state in isolation.

Organisational Barriers

For any automated integration tests to run reliably we need to control the whole environment, which ideally is our development workstations but also our CI build environment (see “The Developer’s Sandbox”). The main barriers to this with a commercial product like TIBCO are often technological, but also more often than not, organisational too.

In my experience middleware like this tends to be proprietary, very expensive, and owned within the organisation by a dedicated team. They will configure the staging and production queues and manage the fault-tolerant servers, which is probably what you’d expect as you near production. A more modern DevOps friendly company would recognise the need to allow teams to test internally first and would help them get access to the product and tools so they can build their test scaffolding that provides the initial feedback loop.

Hence just being given the client access libraries to the product is not enough, we need a way to bring up and tear down the service endpoint, in isolation, so that we can test connectivity and failover scenarios and message interoperability. We also need to be able develop and test our logic around poisoned messages and dead-letter queues. And all this needs to be automatable so that as we develop and refactor we can be sure that we’ve not broken anything; manually testing this stuff is not just not scalable in a shared test environment at the pace modern software is developed.

That said, the TIBCO EMS SDK I’ve been working with (v6.3.0) has all the parts I needed to do this stuff, albeit with some workarounds to avoid needing to run the tests with administrator rights which we’ll look into later.

The only other thorny issue is licensing. You would hope that software product companies would do their utmost to get developers on their side and make it easy for them to build and test their wares, but it is often hard to get clarity around how the product can be used outside of the final production environment. For example trying to find out if the TIBCO service can be run on a developer’s workstation or in a cloud hosted VM solely for the purposes of running some automated tests has been a somewhat arduous task.

This may not be solely the fault of the underlying product company, although the old fashioned licensing agreements often do little to distinguish production and modern development use [1]. No, the real difficulty is finding the right person within the client’s company to talk to about such matters. Unless they are au fait with the role modern automated integrated testing takes place in the development process you will struggle to convince them your intended use is in the interests of the 3rd party product, not stealing revenue from them.

Okay, time to step down from the soap box and focus on the problems we can solve…

Hosting TIBEMSD as a Windows Service

From an automated testing perspective what we need access to is the TIBEMSD.EXE console application. This provides us with one or more TIBCO message queues that we can host on our local machine. Owning thing process means we can therefore create, publish to and delete queues on demand and therefore tightly control the environment.

If you only want to do basic integration testing around the sending and receiving of messages you can configure it as a Windows service and just leave it running in the background. Then your tests can just rely on it always being there like a local database or the file-system. The build machine can be configured this way too.

Unfortunately because it’s a console application and not written to be hosted as a service (at least v6.3 isn’t), you need to use a shim like SRVANY.EXE from the Windows 2003 Resource Kit or something more modern like NSSM. These tools act as an adaptor to the console application so that the Windows SCM can control them.

One thing to be careful of when running TIBEMSD in this way is that it will stick its data files in the CWD (Current Working Directory), which for a service is %SystemRoot%\System32, unless you configure the shim to change it. Putting them in a separate folder makes them a little more obvious and easier to delete when having a clear out [2].

Running TIBEMSD On Demand

Running the TIBCO server as a service makes certain kinds of tests easier to write as you don’t have to worry about starting and stopping it, unless that’s exactly the kinds of test you want to write.

I’ve found it’s all too easy when adding new code or during a refactoring to accidentally break the service so that it doesn’t behave as intended when the network goes up and down, especially when you’re trying to handle poisoned messages.

Hence I prefer to have the TIBEMSD.EXE binary included in the source code repository, in a known place so that it can be started and stopped on demand to verify the connectivity side is working properly. For those classes of integration tests where you just need it to be running you can add it to your fixture-level setup and even keep it running across fixtures to ensure the tests running at an adequate pace.

If, like me, you don’t run as an Administrator all the time (or use elevated command prompts by default) you will find that TIBEMSD doesn’t run out-of-the-box in this way. Fortunately it’s easy to overcome these two issues and run in a LUA (Limited User Account).

Only Bind to the Localhost

One of the problems is that by default the server will try and listen for remote connections from anywhere which means it wants a hole in the firewall for its default port. This of course means you’ll get that firewall popup dialog which is annoying when trying to automate stuff. Whilst you could grant it permission with a one-off NETSH ADVFIREWALL command I prefer components in test mode to not need any special configuration if at all possible.

Windows will allow sockets that only listen for connections from the local host to avoid generating the annoying firewall popup dialog (and this was finally extended to include HTTP too). However we need to tell the TIBCO server to do just that, which we can achieve by creating a trivial configuration file (e.g. localhost.conf) with the following entry:

listen=tcp://127.0.0.1:7222

Now we just need to start it with the –conf switch:

> tibemsd.exe -config localhost.conf

Suppressing the Need For Elevation

So far so good but our other problem is that when you start TIBEMSD it wants you to elevate its permissions. I presume this is a legacy thing and there may be some feature that really needs it but so far in my automated tests I haven’t hit it.

There are a number of ways to control elevation for legacy software that doesn’t have a manifest, like using an external one, but TIBEMSD does and that takes priority. Luckily for us there is a solution in the form of the __COMPAT_LAYER environment variable [3]. Setting this, either through a batch file or within our test code, supresses the need to elevate the server and it runs happily in the background as a normal user, e.g.

Once we know how to run TIBEMSD without it causing any popups we are in a position to do that from within an automated test running as any user (LUA), e.g. a developer or the build machine.

In C#, the language where I have been doing this most recently, we can either hard-code a relative path [4] to where TIBEMSD.EXE resides within the repo, or read it from the test assembly’s app.config file to give us a little more flexibility.

We can also add our special .conf file to the same folder and therefore find it in the same way. Whilst we could generate it on-the-fly it never changes so I see little point in doing this extra work.

Something to be wary of if you’re using, say, NUnit to write your integration tests is that it (and ReSharper) can copy the test assemblies to a random location to aid in insuring your tests have no accidental dependencies. In this instance we do, and a rather large one at that, so we need the relative distance between where the test assemblies are built and run (XxxIntTests\bin\Debug) and the TIBEMSD.EXE binary to remain fixed. Hence we need to disable this copying behaviour with the /noshadow switch (or “Tools | Unit Testing | Shadow-copy assemblies being tested” in ReSharper).

Given that we know where our test assembly resides we can use Assembly.GetExecutingAssembly() to create a fully qualified path from the relative one like so:

Stopping the daemon involves calling Kill(). There are more graceful ways of remotely stopping a console application which you can try first, but Kill() is always the fall-back approach and of course the TIBCO server has been designed to survive such abuse.

Naturally you can wrap this up with the Dispose pattern so that your test code can be self-contained:

// Arrange using (RunTibcoServer()) { // Act }

// Assert

Or if you want to amortise the cost of starting it across your tests you can use the fixture-level set-up and tear down:

One final issue to be aware of, and it’s a common one with integration tests like this which start a process on demand, is that the server might still be running unintentionally across test runs. This can happen when you’re debugging a test and you kill the debugger whilst still inside the test body. The solution is to ensure that the server definitely isn’t already running before you spawn it, and that can be done by killing any existing instances of it:

Process.GetProcessesByName(“tibemsd”) .ForEach(p => p.Kill());

Naturally this is a sledgehammer approach and assumes you aren’t using separate ports to run multiple disparate instances, or anything like that.

Other Gottchas

This gets us over the biggest hurdle, control of the server process, but there are a few other little things worth noting.

Due to the asynchronous nature and potential for residual state I’ve found it’s better to drop and re-create any queues at the start of each test to flush them. I also use the Assume.That construct in the arrangement to make it doubly clear I expect the test to start with empty queues.

Also if you’re writing tests that cover background connect and failover be aware that the TIBCO reconnection logic doesn’t trigger unless you have multiple servers configured. Luckily you can specify the same server twice, e.g.

var connection= “tcp://localhost,tcp://localhost”;

If you expect your server to shutdown gracefully, even in the face of having no connection to the queue, you might find that calling Close() on the session and/or connection blocks whilst it’s trying to reconnect (at least in EMS v6.3 it does). This might not be an expected production scenario, but it can hang your tests if something goes awry, hence I’ve used a slightly distasteful workaround where the call to Close() happens on a separate thread with a timeout:

Task.Run(() => _connection.Close()).Wait(1000);

Conclusion

Writing automated integration tests against a middleware product like TIBCO is often an uphill battle that I suspect many don’t have the appetite or patience for. Whilst this post tackles the technical challenges, as they are at least surmountable, the somewhat harder problem of tackling the organisation is sadly still left as an exercise for the reader.

[1] The modern NoSQL database vendors appear to have a much simpler model – use it as much as you like outside production.

[2] If the data files get really large because you leave test messages in them by accident they can cause your machine to really grind after a restart as the service goes through recovery.

I still doubt anyone would actually write this and a simple function like abs() is not what’s usually under test when this crops up. A more realistic scenario would need much more distance between the test and production code, say, a component-level test:

If we mentally inline the derivation of the expected value what the test is saying is “When a message is processed the output contains a request time which is formatted by the processor”. This is essentially a tautology because the test is describing its behaviour in terms of the thing under test, it’s self-reinforcing [1].

Applying the advice from Antoine de Saint-Exupéry [2] about perfection being achieved when there is nothing left take away, lets implement FormatTime() like this:

public string FormatTime(DateTime value) { return null; }

The test will still pass. I know this change is perverse and nobody would ever make that ridiculous a mistake, but the point is that the test is not really doing its job. Also as we rely more heavily on refactoring tools we have to work harder to verify that we have not silently broken a different test that was also inadvertently relying on some aspect of the original behaviour.

Good Duplication

We duplicate work in the test for a reason, as a cross-check that we’ve got it right. This “duplication” is often just performed mentally, e.g. formatting a string, but for a more complex behaviour could be done in code using an alternate algorithm [3]. In fact one of the advantages of a practice like TDD is that you have to work it out beforehand and therefore are not tempted to paste the output from running the test on the basis that you’re sure it’s already correct.

If we had duplicated the work of deriving the output in the example above my little simplification would not have worked as the test would then have failed. Once again, adopting the TDD practice of starting with a failing test and transitioning to green by putting the right implementation in proves that the test will fail if the implementation changes unexpectedly.

This is a sign to watch out for – if you’re not changing the key part of the implementation to make the test pass you might have overly-coupled the test and production code.

What is the Test Really Saying?

The problem with not being the person that wrote the test in the first place is that it may not be telling you what you think it is. For example the tautology may be there because what I just described is not what the author intended the reader to deduce.

The test name only says that the output will contain the time value, the formatting of that value may well be the responsibility of another unit test somewhere else. This is a component level test after all and so I would need to drill into the tests further to see if that were true. A better approach might be to make the breaking change above and see what actually fails. Essentially I would be doing a manual form of Mutation Testing to verify the test coverage.

Alternatively the author may be trying to avoid creating a brittle test which would fail if the formatting was tweaked and so decided the best way to do that would be to reuse the internal code. The question is whether the format matters or not (is it a published API?), and with no other test to specifically answer that question one has to work on an assumption.

This is a noble cause (not writing brittle tests) but there is a balance between the test telling you about a fault in the code and it just being overly specific and annoying by failing on unimportant changes. Sometimes we just need to work a little harder to express the true specification in looser terms. For example maybe we only need to assert that a constituent part of the date is included, say, the year as that is usually the full 4 digits these days:

Assert.That(output.RequestTime, Is.StringContaining(“2010”));

If we are careful about the values we choose we can ensure that multiple formats can still conform to a looser contract. For example 10:33:44 on 22/11/2016 contains no individual fields that could naturally be formatted in a way where a simple substring search could give a false positive (e.g. the hour being mistaken for the day of the month).

A Balancing Act

Like everything in software engineering there is a trade-off. Whilst we’d probably prefer to be working with a watertight specification that leaves as little room for ambiguity as possible, we often have details that are pretty loose. When that happens we have to decide how we want to trigger a review of this lack of clarity in the future. If we make the test overly restrictive it runs the risk of becoming brittle, whilst making it overly vague could allow breaking changes to go unnoticed until too late.

Borrowing (apocryphally) from Einstein we should strive to make our tests as precise as possible, but not overly precise. In the process we need to ensure we do not accidentally reuse production code in the test such that we find ourselves defining the behaviour of it, with itself.

Thursday, 27 October 2016

There is always a tension in programming between creating something that is hard to misuse but at the same time adheres to standards to try and leverage the Principle of Least Surprise. One area I personally struggle with this conflict is how to communicate to a client (of the software kind) that they have made a request for something which doesn’t currently exist, and almost certainly will never exist.

As a general rule when someone requests a resource that doesn’t exist then you should return a 404 (Not Found). And this makes perfect sense when we’re in production and all the bugs have been ironed but during development when we’re still exploring the API it’s all too easy to make a silly mistake and not realise that it’s due to a bug in our code.

An Easy Mistake

Imagine you’re looking up all orders for a customer, you might design your API something like this:

GET /orders/customer/12345

For a starter you have the whole singular noun vs plural debate which means you’ll almost definitely try this by accident:

GET /order/customer/12345

or make the inverse mistake

GET /orders/customers/12345

By the standard HTTP rules you should return a 404 as the resource does not exist at that address. But does it actually help your fellow developers to stick to the letter of the law?

Frameworks

What makes this whole issue much thornier is that if you decide you want to do the right thing by your fellow programmers you will likely have to fight any web framework you’re using because they usually take the moral high ground and do what the standard says.

What then ensues is a fight between the developer and framework as they try their hardest to coerce the framework to send all unmatched routes through to a handler that can return their preferred non-404 choice.

A colleague who is also up for the good fight recently tried to convince the Nancy .Net framework to match the equivalent of “/.*” (the lowest weighted expression) only to find they had to define one route for each possible list of segments, i.e. “/.*”, “/.*/.*”, “/.*/.*/.*”, etc. [1].

Even then he still got some inconsistent behaviour. Frameworks also make it really easy to route based on value types which gives you a form of validation. For example if I know my customer ID is always an integer I could express my route like this:

“/orders/customer/{integer}”

That’s great for me but when someone using my API accidentally formats a URL wrong and puts the wrong type of value for the ID, say the customer’s name, they get a 404 because no route matches a non-integer ID. I think this is a validation error and should probably be a 400 (Bad Request) as it’s a client programmer bug, but the framework has caused it to surface in a way that’s no different to a completely invalid route.

Choice of Status Code

So, assuming we want to return something other than Not Found for what is clearly a mistake on the client’s part, what are our choices?

In the debates I’ve seen on this 400 (Bad Request) seems like a popular choice as the request, while perhaps not technically malformed, is often synonymous with “client screwed up”. I also like Phil Parker’s suggestion of using 405 (Method Not Allowed) because it feels like less abuse of the 4XX status codes and is also perhaps not as common as a 400 so shows up a bit more.

In RESTful APIs there is often a question mark around the use of PUT versus POST when creating and updating resources. The canonical example is the blogging engine where we wish to add new posts and comments. The default choice appears to be POST which I guess is because we tend to shy away from that contentious discussion about whether PUT or POST is more suitable.

As I’ve understood it, when the client can determine the address of the resource you can use PUT whereas when only server knows (i.e. generates it) then a POST is more appropriate. Hence you could say that the question boils down to whether or not the client can create a unique ID for the resource or has to leave that to the server.

POST

If we look at the blogging engine it’s probably easier on the client if the server just takes care of it:

POST /blog/create { title: “My New Blog Post” }

Here the server returns the URL for the freshly minted blog post. An obvious choice would probably be to generate a GUID for it and use that as the permanent ID:

GET /blog/1234-1234-1234-1234 { title: “My New Blog Post” }

Idempotency

The problem with this approach is that we now have a problem if the request fails. We don’t know if the blog post was created because we don’t know its ID yet. If we retry the request, unless the server has a cache of the request, we’ll end up with a duplicate if the server completed on its side. A common solution is to include a client generated ID for the request that the server can use to detect when a request is being replayed:

But wait, haven’t we just solved the client generated ID problem? If we need to come up with a unique ID for the request for the purposes of idempotency, why don’t we just PUT the resource with that ID in the first place?

PUT /blog/9876-9876-9876-9876 { title: “My New Blog Post” }

Natural Keys

When the client is just a single machine and it’s creating the resource itself from scratch it has far more latitude in choosing the resource’s ID, but if you’re transforming data in a distributed system it can get a little more tricky.

If your event comes in the form of an upstream message you cannot just use a GUID because when a message gets replayed (which it will, eventually) you’ll end up generating a duplicate as the request IDs will be different. Hence you need to look for something in the upstream message that can be used as a more natural key for the resource.

Going back to our blogging engine example we already had one in the guise of the blog post’s title:

PUT /blog/my-new-blog-post { title: “My New Blog Post” }

Yes, if the title changes then the ID will no longer match, but that was only a convenience anyway. Hopefully there is enough data in the request itself to make an ID clash extremely unlikely, e.g.

If the resource is immutable then you only need to guard against duplicate attempts to create it, but if it’s mutable then you may already be considering using PUT to mutate it [1]. In this case you’ll probably already have come up with a versioning scheme that will allow you to detect concurrency conflicts from multiple sources, e.g. ETag. In this scenario the request will not contain an “original version” tag to compare against so it’s a conflict just the same as when one occurs when updating later.

Degrees of Idempotency

It might sound like an oxymoron to say that there are varying levels of idempotency, but there can be. Essentially we’re making the usual time/space trade-off as we balance the need to verify the details of a potential duplicate or erroneous request against the need to persist more state to allow the comparison to be made at an indeterminate point in the future. Also depending on whether you are doing a POST or a PUT means there are assumptions about what a duplicate can be.

For example at the weakest end of the spectrum if you’re doing a POST with a simple unique request ID you might just say that if you’ve seen it before, irrespective of the content, then it’s a duplicate request. Also if you are happy to miss a replay, say, one month later, then you can even put a time-to-live (TTL) on the unique request ID to save space. Either way you’re treating idempotency very much as a temporal anomaly, which essentially is what it usually is, a fast or slow retry [2].

If you’re looking for a little extra piece of mind you might decide to generate a hash of the salient headers and content too which would allow you to detect if your request ID might have been reused with different content, either through programmer naivety or an unexpected change in the system. This is not a duplicate but a bad request. Sadly textual formats like XML and JSON have many equally valid representations [3] so a straight text hash is imprecise, but that may well be the pragmatic choice.

At the other end of the spectrum you might not be happy to silently discard false positives and so you need to persist enough about the request to be sure it really is a duplicate. This likely involves doing a proper comparison of the attributes in the request, which implies that you must still have them. If you save each change as an event rather than just mutating the state to provide an updated snapshot, then you’ll have what you need, but the extra storage is going to cost you. Of course there are other benefits to storing event streams but that’s a different story.

Just Pedantry?

Hence one advantage with using a PUT over a POST is that you push the responsibility onto the client to define what it means to be talking about the same resource. This is done by elevating the traditional unique request ID (or some other data) to become the permanent resource ID. Consequently idempotency starts to become implicit in the design rather than appearing more like an afterthought.

In scenarios where requests come from a single source this is largely an academic exercise but when you have the potential for the same request to come via multiple sources (or be replayed at different times), then I think it’s useful to try and formalise the way we address resources to push the issue to the forefront.

[1] For partial updates you might choose to use the less well known PATCH verb.

Tuesday, 25 October 2016

Contrary to what some business folk, project managers and programmers may believe, every software system we build is essentially bespoke. They may well think this LOB application, REST API or web site is very similar to many others that have already been built but that is almost certainly because the feature set and user interface have been dreamt up by copying what has already gone before rather than thinking about what they actually need this time around.

It’s an easy trap to fall into, after all, isn’t the point of hiring experienced people because you want them to leverage all that knowledge they already have to quickly build your system? But what knowledge do they actually bring with them to each new product or project? Unless each one is only a couple of weeks long, which is almost non-existent in my experience [1], then there are way too many different variables to consider this new venture the same as any predecessor.

Okay, so they may be “similar”, but not the same. At least, not unless you and your organisation have absolutely zero desire to learn anything new about the process and tools used to create software. And that’s essentially my point – the industry moves so incredibly fast that something, probably many things, change between each project. But even then that assumes that no learning occurs during the project itself and, once again, unless it’s only a few weeks long that is also highly unlikely to happen.

In the last few years I’ve been mostly working in the enterprise arena on web services using the classic enterprise-grade offerings. The enterprise is generally renowned for its glacial pace and yet in that time the technology stack alone has moved along in leaps and bounds. For example that first API was in C# 4 / .Net 3.5 and now we’re looking at C# 6 / .Net Core with one eye looking at running on Linux machines too. Service hosting has changed from IIS / MVC to self-hosting / OWIN, with Nancy in there somewhere. The database too has switched from the relative safety of Oracle & SQL Server to MongoDB & Couchbase, and in one instance [2]has been a hybrid of the relational and document paradigms. Even the ancillary tooling like the VCS, CI product and testing & mocking frameworks have all changed as well, either to a different product or have received a non-trivial upgrade.

At the same time as the technology stack is evolving so too is the development process. The various organisations I’ve been working in recently have all undergone, or should I say are still undergoing, a major change to a more agile way of working. This in itself is not a one-time switch but a change in mind-set to one of continual learning and therefore by definition is subject to relentless change. Admittedly those changes become more gradual as the bigger problems are addressed but even so the way the teams around you change can still have an affect on your own ways of working – the DevOps message may have yet to reach the parts of the organisation you interact with.

Even if the toolchain and process largely stay the same the way we apply the technology changes too as what was once “best practice” gets replaced by a new “best practice”, thereby making somewhat of a mockery of the whole notion. Maybe once before we were happy to return null references but now we wish to use an Optional type instead, or we realise the inappropriate nature of the Singleton pattern in highly testable codebase.

With all the change going on around us you might rightly question what being “experienced” actually means in this industry, if we apparently can’t carry over much from one project to the next. Clearly this is a little extreme though as there is plenty we do carry over. In reality, although everything eventually does change, it does not all change at the same time. Hence at the beginning I said that no system is ever the same, it can be very similar, but will be far from identical.

As experienced programmers what we carry over are the battle scars. What this should lead us to do is ask those questions that the less experienced know not what to ask in the first place, and often only discover themselves the hard way. We should never assume that just because we did anything one particular way before that that was the only way, or will even still be the best way in this new setting.

It might be a good way to start out but we should always be looking for ways to improve it, or detect when the problem has diverged such that our first-order approximation is no longer fit for purpose. It’s all too easy to try and solve problems we’ve had in the past, the second time around, and then completely fail to notice we’re really solving a different problem, or at least one with enough differences that we should let go of the past instead. In the end you may be right and you actually converge on a similar design to before, congratulations, you were right this time; as long as you didn’t sacrifice delivering more important stuff just to satisfy your hunch.

By all means bring and share your experiences on your next venture, but be careful you do not get blindsided by them. Only solve today’s problems with yesterday’s solutions if that really is the best thing to do. You might be surprised just how much has changed in the world of programming since then.

[1] I’ve spent longer than that just trying to fix a single bug before!

Friday, 21 October 2016

We were a brand new team of 5 (PM + devs) tasked with building a calculation engine. The team was just one part of a larger programme that encompassed over a dozen projects in total. The intention was for those other teams to build some of the services that ours would depend on.

Our development process was somewhat DSM-like in nature, i.e. iterative. We built a skeleton based around a command-line calculator and fleshed it out from there [1]. This skeleton naturally included vague interfaces for some of the services that we knew we’d need and that we believed would be fulfilled by some of the other teams.

Fleshing Out the Skeleton

Time marched on. Our calculator was now being parallelised and we were trying to build out the distributed nature of the system. Ideally we would like to have been integrating with the other teams long ago but the programme RAG status wasn’t good. Every other team apart from us was at “red” and therefore well behind schedule.

To compensate for the lack of collaboration and integration with the other services we needed we resorted to building our own naïve mocks. We found other sources of the same data and built some noddy services that used the file-system in a dumb way to store and serve it up. We also added some simple steps to the overnight batch process to create a snapshot of the day’s data using these sources.

Programme Cuts

In the meantime we discovered that one of the services we were to depend on had now been cancelled and some initial testing with another gave serious doubts about its ability to deliver what we needed. Of course time was marching on and our release date was approaching fast. It was fast dawning on us that these simple test mocks we’d built may well have to become our production services.

One blessing that came out of building the simple mocks so early on was that we now had quite a bit of experience on how they would behave in production. Hence we managed to shore things up a bit by adding some simple caches and removing some unnecessary memory copying and serialization. The one service left we still needed to invoke had found a more performant way for us to at least bulk extract a copy of the day’s data and so we retrofitted that into our batch preparation phase. (Ideally they’d serve it on demand but it just wasn’t there for the queries we needed.)

Release Day

The delivery date arrived. We were originally due to go live a week earlier but got pushed back by a week because an important data migration got bumped and so we were bumped too. Hence we would have delivered on time and, somewhat unusually, we were well under budget our PM said [2].

So the mocks we had initially built just to keep the project moving along were now part of the production codebase. The naïve underlying persistence mechanism was now a production data store that needed high-availability and backing up.

The Price

Whilst the benefits of what we did (not that there was any other real choice in the end) were great, because we delivered a working system on time, there were a few problems due to the simplicity of the design.

The first one was down to the fact that we stored each data object in its own file on the file-system and each day added over a hundred-thousand new files. Although we had partitioned the data to avoid the obvious 400K files-per-folder limit in NTFS we didn’t anticipate running out of inodes on the volume when it quickly migrated from a simple Windows server file share to a Unix style DFS. The calculation engine was also using the same share to persist checkpoint data and that added to the mess of small files. We limped along for some time through monitoring and zipping up old data [3].

The other problem we hit was that using the file-system directly meant that the implementation details became exposed. Naturally we had carefully set ACLs on the folders to ensure that only the environment had write access and our special support group had read access. However one day I noticed by accident that someone had granted read access to another group and it then transpired that they were building something on top of our naïve store.

Clearly we never intended this to happen and I’ve said more about this incident previously in “The File-System Is An Implementation Detail”. Suffice to say that an arms race then developed as we fought to remove access to everyone outside our team whilst others got wind of it [4]. I can’t remember whether it happened in the end or not but I had put a scheduled task together than would use CALCS to list the permissions and fail if there were any we didn’t expect.

I guess we were a victim of our success. If you were happy with data from the previous COB, which many of the batch systems were, you could easily get it from us because the layout was obvious.

Epilogue

I have no idea whether the original versions of these services are still running to this day but I wouldn’t be surprised if they are. There was a spike around looking into a NoSQL database to alleviate the inode problem, but I suspect the ease with which the data store could be directly queried and manipulated would have created too much inertia.

Am I glad we put what were essentially our mock services into production? Definitely. Given the choice between not delivering, delivering much later, and delivering on time with a less than perfect system that does what’s important – I’ll take the last one every time. In retrospect I wish we had delivered sooner and not waited for a load of other stuff we built as the MVP was probably far smaller.

The main thing I learned out of the experience was a reminder not to be afraid of doing the simplest thing that could work. If you get the architecture right each of the pieces can evolve to meet the ever changing requirements and data volumes [5].

What we did here fell under the traditional banner of Technical Debt – making a conscious decision to deliver a sub-optimal solution now so it can start delivering value sooner. It was the right call.

[1] Nowadays you’d probably look to include a slice through the build pipeline and deployment process up front too but we didn’t get any hardware until a couple of months in.

[2] We didn’t build half of what we set out to, e.g. the “dashboard” was a PowerShell generated HTML page and the work queue involved doing non-blocking polling on a database table.

[3] For regulatory reasons we needed to keep the exact inputs we had used and couldn’t guarantee on being able to retrieve them later from the various upstream sources.

[4] Why was permission granted without questioning anyone in the team that owned and supported it? I never did find out, but apparently it wasn’t the first time it had happened.

[5] Within reason of course. This system was unlikely to grow by more than an order of magnitude in the next few years.

Thursday, 20 October 2016

When looking at the performance of our software we often have to consider both first-order and second-order effects. For example when profiling a native application where memory management is handled explicitly we can directly see the cost of allocations and deallocations because this all happens at the moment we make them. In contrast the world of garbage collected languages like C# exhibit different behaviour. The cost of memory allocations here are minimal because the algorithm is simple. However the deallocation story is far more complex, and it happens at a non-deterministic time later.

A consequence of this different behaviour is that it is much harder to see the effects that localised memory churn is having on your application. For example I once worked on a C# data transformation tool where the performance was appalling. Profiling didn’t immediately reveal the problem but closer inspection showed that the garbage collector was running full tilt. Looking much closer at the hottest part of the code I realised it was spending all it’s time splitting strings and throwing them away. The memory allocations were cheap so there were no first-order effects, but the clean-up was really expensive and happened later and therefore appeared as a second-order effect which was harder to trace back.

Short Term Gains

We see the same kind of effects occurring during the development process too. They are often masked though by the mistaken belief that time is being saved, it is, but only in the short term. The problem is the second-order effects of such time saving is actually lost later, and when it’s more precious.

This occurs because the near term activity is being seen as wasteful of a certain person’s time, on the premise that the activity is of low value (to them). But what is being missed is the second-order effects of doing that, such as the learning about the context, people and product involved. When crunch time comes that missed learning suddenly has to happen at the later time when potentially under time pressure or after money has already been spent; then you’re heading into sunk costs territory.

In essence what is being perceived as waste is the time spent in the short term, when the real waste is time lost in the future due to rework caused by the missed opportunity to learn sooner.

All Hail “Agile”

Putting this into more concrete terms consider a software development team where the developer’s time is assumed to be best spent designing and writing code. The project manager assumes that having conversations, perhaps with ops or parts of the business is of low value, from the developer’s perspective, and therefore decides it’s better if someone “less expensive” has it instead.

Of course we’re all “agile” now and we don’t do that anymore. Or do we? I’ve worked in supposedly agile teams and this problem still manifests itself, maybe not quite to the same extent as before, but nonetheless it still happens and I believe it happens because we are confused about what the real waste is that we’re trying to avoid.

Even in teams I’ve been in where we’ve tried to ensure this kind of problem is addressed, it’s only addressed locally, it’s still happening further up the food chain. For example a separate architecture team might be given the role of doing a spike around a piece of technology that a development team will be using. This work needs to happen inside the team so that those who will be developing and, more importantly, supporting the product will get the most exposure to it. Yes, there needs to be some governance around it, but the best people to know if it even solves their problem in the first place is the development team.

Another manifestation of this is when two programme managers are fed highlights about potential changes on their side of the fence. If there is any conflict there could be a temptation to resolve it without going any lower. What this does is cut out the people that not only know most about the conflict, but are also the best placed to negotiate a way out. For example instead of trying to compensate for a potential breaking change with a temporary workaround, which pushes the product away from its eventual goal, see if the original change can be de-prioritised instead. If a system is built in very small increments it’s much easier to shuffle around the high priority items to accommodate what’s happening around the boundaries of the team.

Time for Reflection

How many times have you said, or heard someone else say, “if only you’d come to us earlier”. This happens because we try and cut people out of the loop in the hope that we’ll save time by resolving issues ourselves, but what we rarely do is reflect on whether we really did save time in the long run when the thread eventually started to unravel and the second-order effects kicked in.

Hence, don’t just assume you can cut people out of the loop because you think you’re helping them out, you might not be. They might want to be included because they have something to learn or contribute over-and-above the task at hand. Autonomy is about choice, they might not always want it, but if you don’t provide it in the first place it can never be leveraged.

Monday, 22 August 2016

The codebase I currently work on is split into a number of repositories. For example the infrastructure and deployment scripts are in separate repos as are each service-style “component”.

Manual Syncing

To keep things moving along the team decided that the handful of bits of code that were shared between the two services could easily be managed by a spot of manual copying. By keeping the shared code in a separate namespace it was also partitioned off to help make it apparent that this code was at some point going to be elevated to a more formal “shared” status.

This approach was clearly not sustainable but sufficed whilst the team was still working out what to build. Eventually we reached a point where we needed to bring the logging and monitoring stuff in-sync and I also wanted to share some other useful code like an Optional<T> type. It also became apparent that the shared code was missing quite a few unit tests as well.

Share Source or Binaries?

The gut reaction to such a problem in a language like C# would probably be to hive off the shared code into a separate repo and create another build pipeline for it that would result in publishing a package via a NuGet feed. And that is certainly what we expected to do. However the problem was where to publish the packages to as this was closed source. The organisation had its own license for an Enterprise-scale product but it wasn’t initially reachable from outside the premises where our codebase lay. Also there were some problems with getting NuGet to publish to it with an API key that seemed to lay with the way the product’s permissions were configured.

Hence to keep the ball rolling we decided to share the code at the source level by pulling the shared repo into each component’s solution. There are two common ways of doing this with Git – subtrees and submodules.

Git Submodules

It seemed logical that we should adopt the more modern submodule approach as it felt easier to attach, update and detach later. It also appeared to have support in the Jenkins 1.x plugin for doing a recursive clone so we wouldn’t have to frig it with some manual Git voodoo.

As always there is a difference between theory and practice. Whilst I suspect the submodule feature in the Jenkins plugin works great with publicly accessible open-source repos it’s not quite up to scratch when it comes to private repos that require credentials. After much gnashing of teeth trying to convince the Jenkins plugin to recursively clone the submodules, we conceded defeat assuming that we’re another victim of JENKINS-20941.

Git Subtree

Given that our long term goal was to move to publishing a NuGet feed we decided to try using a Git subtree instead so that we could at least move forward and share code. This turned out (initially) to be much simpler because for tooling like Jenkins it appears no different to a single repo.

Our source tree looked (unsurprisingly) like this:

<solution> +- src +- app +- shared-lib +- .csproj +- *.cs

All we needed to do was replace the shared-lib folder with the contents of the new Shared repository.

First we needed to set up a Git remote. Just as the remote main branch of a cloned repo goes by the name origin/master, so we set up a remote for the Shared repository’s main branch:

> git remote add shared https://github/org/Shared.git

Next we removed the old shared library folder:

> git rm src\shared-lib

…and grafted the new one in from the remote branch:

> git subtree add --prefix src/shared shared master --squash

This effectively takes the shared/master branch and links it further down the repo source tree to src/shared which is where we had it before.

However the organisation of the new Shared repo is not exactly the same as the old shared-lib project folder. A single child project usually sits in it’s own folder, but a full-on repo has it’s ownsrc folder and build scripts and so the source tree now looked like this:

The latter command is almost the same as the one we used earlier but we pull rather than add. Once again we’re squashing the entire history as we’re not interested in it.

Pushing Changes Back

Naturally you might want to make a change in the subtree in the context of the entire solution and then push it back up to the parent repo. This is doable but involves using git subtree push to normalise the change back into the folder structure of the parent repo.

Personally we decided just to make the changes test-first in the parent and always flow down to the child. In the few cases the child solution helped in debugging we decided to work on the fix in the child solution workspace and then simply manually copy the change over to the shared workspace and push it out through the normal route. It’s by no means optimal but a NuGet feed was always our end game so we tolerated the little bit of friction in the short term.

The End of the Road

If we were only sucking in libraries that had no external dependencies themselves (up to that point our small shared code only relied on the .Net BCL) we might have got away with this technique for longer. But in the end the need to pull in 3rd party dependencies via NuGet in the shared project pushed it over the edge.

The problem is that NuGet packages are on a per-solution basis and the <HintPath> element in the project file assumes a relative path (essentially) from the solution file. When working in the real repo as part of the shared solution it was “..\..\packages\Xxx”, but when it’s part of the subtree based solution it needed to be two levels further up as “..\..\..\..\packages\Xxx”.

Although I didn’t spend long looking I couldn’t find a simple way to easily overcome this problem and so we decided it was time to bite-the-bullet and fix the real issue which was publishing the shared library as a NuGet feed.

Partial Success

This clearly is not anything like what you’d call an extensive use of git subtree to share code, but it certainly gave me a feel for it can do and I think it was relatively painless. What caused us to abandon it was tooling specific (the relationship between the enclosing solution’s NuGet packages folder and the shared assembly project itself) and so a different toolchain may well fair much better if build configuration is only passed down from parent to subtree.

I suspect the main force that might deter you from this technique is how much you know, or feel you need to know, about how git works. When you’re inside a tool like Visual Studio it’s very easy to make a change in the subtree folder and check it in and not necessarily realise you’re modifying what is essentially read-only code. When you next update the subtree things get sticky. Hence you really need to be diligent about your changes and pay extra attention when you commit to ensure you don’t accidentally include edits within the subtree (if you’re not planning on pushing back that way). Depending on how experienced your team are this kind of tip-toeing around the codebase might be just one more thing you’re not willing to take on.