Retry Logic

Published 2016-07-01

One of the age-old fallacies of distributed computing is that the network is reliable. The network is not reliable. In fact, the network fails all the time -- DNS failures, dropped packets, routing problems, bad configurations, misbehaving applications and hardware, etc.

When this happens to you, you don't want your application to fail. You just want it to shrug off that one bad request and try again.

The same approach also works well when you don't know how long something will take. Instead of waiting for the longest possible time, you can periodically check whether the thing is ready.

In the first case you expect the operation to succeed but want to handle occasional failures. In the second case you expect it to fail most of the time but succeed eventually. Either way, it's the same logic!

The Retry Loop

Here's a simple retry loop (in Go, but the logic will port to any language).

Each time the loop runs, the delay increases because of time.Second * i * 5.

The sleep happens before we do anything else.

Rather than just spamming requests at the service when it fails, we introduce an incremental backoff. This prevents the retry loop from causing a denial of service, and also avoids blowing through rate-limit quotas.

Wait... Why do we sleep first? Doesn't it make sense to try doing the thing first and then sleep afterward if it doesn't work?

This seems more intuitive, but that version has a bug! On the final pass through the loop we wait after we have already reached our final failure condition. In other words, we finish our last attempt, never succeed, and then wait for the maximum amount of time for no reason.

In the correct version, during our first pass through the loop i is zero. In Go, time.Sleep(0) just returns immediately, so we can put this at the top and avoid the bug.

How Long Do We Retry For?

One useful property of this type of retry loop is that it eventually stops: at some point your program will give up calling the remote service and fail.

Depending on how complicated your // do something task is, the total may be longer than the sleep duration alone. But to get an idea of how long this will take before giving up, we can at least calculate the maximum sleep duration using this handy formula:

(tries * (tries - 1)) / 2 * delay

So for 5 tries and a 5-second delay we get 50 seconds of sleep, or about 1 minute in total (assuming // do something takes 1 second or so).

Also, while I don't find it particularly useful, you can use a similar formula to calculate the time if you start your loop at 1 instead of 0:

(tries * (tries + 1)) / 2 * delay
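Both formulas as a quick sanity check in Go:

```go
package main

import "fmt"

// maxSleepSeconds sums the delays i * delay for i = 0..tries-1,
// which works out to (tries * (tries - 1)) / 2 * delay.
func maxSleepSeconds(tries, delay int) int {
	return tries * (tries - 1) / 2 * delay
}

// maxSleepSecondsFrom1 is the variant for loops starting at 1:
// (tries * (tries + 1)) / 2 * delay.
func maxSleepSecondsFrom1(tries, delay int) int {
	return tries * (tries + 1) / 2 * delay
}

func main() {
	fmt.Println(maxSleepSeconds(5, 5))      // 50
	fmt.Println(maxSleepSecondsFrom1(5, 5)) // 75
}
```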

Extra Stuff

A retry mechanism is pretty useful. Depending on your needs there are a few extra things you can do. I'll give some examples here:

Cap the Wait Time

For longer-running operations the retry time can grow quite large. You may, for example, want to limit the upper bound to retry every 60 seconds.

Vary the Delay

If you have multiple processes polling the same resource, you may want to introduce some variability to prevent them all from hitting the service at the same time.

You can apply the randomness to a single interval or to the entire range. E.g.:

5 + 5 + 5 + rand(0, 5) // random range 15-20

Or

rand(0, 20) // random range 0-20

Keep in mind that with rand(0, 20) you should expect an average delay of 10 seconds, so this is significantly less time than the first example, which will average around 18 seconds.

Break vs. Return

If you are writing your retry loop in the context of a longer procedure, you may notice that it is a bit harder to do clean error handling. You can remedy this by wrapping your retry in a function instead.
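A sketch of that wrapping (fetch and doSomething are placeholders, and the delays are shortened so the example runs quickly):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// doSomething stands in for the flaky remote call.
func doSomething() (string, error) {
	return "", errors.New("service unavailable")
}

// fetch wraps the retry loop so the caller sees a single error
// value instead of loop-and-break control flow inline.
func fetch() (string, error) {
	var err error
	for i := 0; i < 5; i++ {
		time.Sleep(time.Millisecond * time.Duration(i))

		var result string
		if result, err = doSomething(); err == nil {
			return result, nil
		}
	}
	return "", fmt.Errorf("giving up after 5 attempts: %w", err)
}

func main() {
	if _, err := fetch(); err != nil {
		fmt.Println(err) // giving up after 5 attempts: service unavailable
	}
}
```

Because the loop lives in its own function, success is a simple early return and the surrounding procedure only has to check one error.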