When you use the Polly circuit-breaker, make sure you share your Policy instances!

This post is somewhat of PSA about using the excellent open source Polly library for handling resiliency to your application. Recently, I was tasked with adding a circuit-breaker implementation to some code calling an external API, and I figured Polly would be perfect, especially as we already used it in our solution!

I hadn't used Polly directly in a little while, but the excellent design makes it easy to add retry handling, timeouts, or circuit-breaking to your application. Unfortunately, my initial implementation had one particular flaw, which meant that my circuit-breaker never actually worked!

In this post I'll outline the scenario I was working with, my initial implementation, the subsequent issues, and what I should have done!

tl;dr;Policy is thread safe, and for the circuit-breaker to work correctly, it must be shared so that you call Execute on the same Policy instance every time!

The scenario - dealing with a flakey external API

A common requirement when working with currencies is dealing with exchange rates. We have happily been using the Open Exchange Rates API to fetch a JSON list of exchange rates for a while now.

The existing implementation consists of three classes:

OpenExchangeRatesClient - Responsible for fetching the exchange rates from the API, and parsing the JSON into a strongly typed .NET object.

OpenExchangeRatesCache - We don't want to fetch exchange rates every time we need them, so this class caches the latest exchange rates for a day before calling the OpenExchangeRatesClient to get up-to-date rates.

FallbackExchangeRateProvider - If the call to fetch the latest rates using the OpenExchangeRatesClient fails, we fallback to a somewhat recent copy of the data, loaded from an embedded resource in the assembly.

All of these classes are registered as Singletons with the IoC container, so there isonly a single instance of each. This setup has been working fine for a while, but there was an issue where the Open Exchange Rates API went down, just as the local cache of exchange rates expired. The series of events was:

A request was made to our internal API, which called a service that required exchange rates.

The service called the OpenExchangeRatesCache which realised the current rates were out of date.

The cache called the OpenExchangeRatesClient to fetch the latest rates.

Unfortunately the service was down, and eventually caused a timeout (after 100 seconds!)

At this point the cache used the FallbackExchangeRateProvider to use the stale rates for this single request.

A separate request was made to our internal API - repeat steps 2-6!

An issue in the external dependency, the exchange rate API going down, was causing our internal services to take 100s to respond to requests, which in turn was causing other requests to timeout. Effectively we had a cascading failure, even though we thought we had accounted for this by providing a fallback.

Note I realise updating cached exchange rates should probably be a background task. This would stop requests failing if there are issues updating, but the general problem is common to many scenarios, especially if you're using micro-services.

Luckily, this outage didn't happen at a peak time, so by the time we came to investigate the issue, the problem had passed, and relatively few people were affected. However, it obviously flagged up a problem, so I set about trying to ensure this wouldn't happen again if the API had issues at a later date!

Fix 1 - Reduce the timeouts

The first fix was a relatively simple one. The OpenExchangeRatesClient was using an HttpClient to call the API and fetch the exchange rate data. This was instantiated in the constructor, and reused for the lifetime of the class. As the client was used as a singleton, the HttpClient was also a singleton (so we didn't have any of these issues).

The first fix I made was to set the Timeout property on the HttpClient. In the failure scenario, it was taking 100s to get back an error response. Why 100s? Because that's the default timeout for HttpClient!

Checking our metrics of previous calls to the service, I could see that prior to the failure, virtually all calls were taking approximately 0.25s. Based on that, a 100s timeout was clearly overkill! Setting the timeout to something more modest, but still conservative, say 5s, should help prevent the scenario happening again.

Fix 2 - Add a circuit breaker

Circuit-breakers in brief

Circuit-breakers make sense when calling a somewhat unreliable API. They use a fail-fast approach when a method has failed several times in a row. As an example, in my scenario, there was no point repeatedly calling the API when it hadn't worked several times in a row, and was very likely to fail. All we were doing was adding additional delays to the method calls, when it's pretty likely you're going to have to use the fallback anyway.

The circuit-breaker tracks the number of times an API call has failed. Once it crosses a threshold number of failures in a row, it doesn't even try to call the API for subsequent requests. Instead, it fails immediately, as though the API had failed.

After some timeout, the circuit-breaker will let one method call through to "test" the API and see if it succeeds. If it fails, it goes back to just failing immediately. If it succeeds then the circuit is closed again, and it will go back to calling the API for every request.

The circuit-breaker was a perfect fit for the failure scenario in our app, so I set about adding it to the OpenExchangeRatesClient.

Creating a circuit breaker policy

You can create a circuit-breaker Policy in Polly using the CircuitBreakerSyntax. As we're going to be making requests with the HttpClient, I used the async methods for setting up the policy and for calling the API:

This configuration creates a new circuit breaker policy, defines the number of consecutive exceptions to allow before marking the API as broken and opening the breaker, and the amount of time the breaker should stay open for before moving to the half-closed state.

Once you have a policy in place, circuitBreaker, you can call ExecuteAsync and pass in the method to execute. At runtime, if an exception occurs executing CallRatesApi() the circuit breaker will catch it, and keep track of how many exceptions it has raised to control the breaker's state.

Adding a fallback

When an exception occurs in the CallRatesApi() method, the breaker will catch it, but it will re-throw the exception. In my case, I wanted to catch those exceptions and use the FallbackExchangeRateProvider. I could have used a try-catch block, but I decided to stay in the Polly-spirit and use a Fallback policy.

A fallback policy is effectively a try catch block - it simply executes an alternative method if CallRatesApi() throws. You can then wrap the fallback policy around the breaker policy to combine the two. If the circuit breaker fails, the fallback will run instead:

It was as though the circuit breaker wasn't there at all! No matter how many times the CallRatesApi() method threw, the circuit was never breaking.

Can you see what I did wrong?

Using circuit breakers properly

Every time you call GetLatestRates(), I'm creating a new circuit breaker (and fallback) policy, and then calling ExecuteAsync on that!

The circuit breaker, by it's nature, has state that must be persisted between calls (the number of exceptions that have previously happened, the open/closed state of the breaker etc). By creating new Policy objects inside the GetLatestRates() method, I was effectively resetting the policy back to its initial state, hence why nothing was working!

The answer is simple - make sure the Policy persists between calls to GetLatestRates() so that its state persists. The Policy is thread safe, so there's no issues to worry about there either. As the client is implemented in our app as a singleton, I simply moved the policy configuration to the class constructor, and everything proceeded to work as expected!

And that's all it takes! It works brilliantly when you actually use it properly 😉

Summary

This post ended up a lot longer than I intended, as it was a bit of a post-incident brain-dump, so apologies for that! It serves as somewhat of a cautionary tale about having blinkers on when coding something. When I implemented the fix initially I was so caught up in how I was solving the problem I completely overlooked this simple, but crucial, difference in how policies can be implemented.

I'm not entirely sure if in general it's best to use shared policies, or if it's better to create and discard policies as I did originally. Obviously, the latter doesn't work for circuit breaker but what about Retry or WaitAndRetry? Also, creating a new policy each time is probably more "allocate-y", but is it faster due to not having to be thread safe?

I don't know the answer, but personally, and based on this episode, I'm inclined to go with shared policies everywhere. If you know otherwise, do let me know in the comments, thanks!