Failures are inevitable in any system. How they should be handled varies
from one system to another. In job processing systems, a common
approach is to retry failed jobs a fixed number of times
before treating them as permanent failures. A backoff
function determines the wait time between successive
retries.

Let’s look at a simple backoff function, which retries after a fixed
wait time – 5 minutes in this case.

function constant(retryCount) { return 5 * 60; }

Let’s assume there are 100 job failures. The chart below shows when
each of the jobs would be retried. Each dot represents a time at
which a job is retried. The color varies with the retry count.
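To make the plotted values concrete, here is a minimal sketch of how the retry times for a single job could be computed from a backoff function. The `retrySchedule` helper is hypothetical, and it assumes the retry count starts at 0 and that each retry fails immediately, so only the wait times add up.

```javascript
// Hypothetical helper (not from the original post): returns the times, in
// seconds after the initial failure, at which a job would be retried.
// Assumes each retry fails instantly, so only the wait times accumulate.
function retrySchedule(backoff, maxRetries) {
  const times = [];
  let elapsed = 0;
  for (let retryCount = 0; retryCount < maxRetries; retryCount++) {
    elapsed += backoff(retryCount); // wait before this retry
    times.push(elapsed);
  }
  return times;
}

// The constant backoff from above: always wait 5 minutes.
function constant(retryCount) {
  return 5 * 60;
}

console.log(retrySchedule(constant, 4)); // [300, 600, 900, 1200]
```

With a constant backoff, every job that failed at the same moment lands on the same four points in time.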

An exponential backoff function increases the wait time exponentially
for each retry.
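A minimal sketch of such a function might look like the following. The 60-second base delay and the factor of 2 are illustrative assumptions, not values taken from any particular job processing system, and the retry count is assumed to start at 0.

```javascript
// Sketch of an exponential backoff; the constants are assumptions chosen
// for illustration. retryCount is assumed to be 0 for the first retry.
function exponential(retryCount) {
  const baseDelaySeconds = 60; // assumed wait before the first retry
  const factor = 2;            // assumed growth factor per retry
  return baseDelaySeconds * Math.pow(factor, retryCount);
}

// Wait times for the first five retries, in seconds.
console.log([0, 1, 2, 3, 4].map(exponential)); // [60, 120, 240, 480, 960]
```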

The above two functions are pure: given the same input, they always
return the same output. If n jobs fail at the same time, then all
n jobs will be retried at the same time, which could cause a
thundering herd problem. A random component called jitter is normally
added to fix this problem. The last component in the function below
is a random jitter that is scaled based on the retry count.
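As a rough sketch of the idea, the function could look something like this. The constants (60-second base, factor of 2, up to 30 seconds of jitter per retry) are assumptions for illustration, and the `random` parameter is a hypothetical hook that defaults to `Math.random` so the function can be exercised deterministically.

```javascript
// Sketch of exponential backoff with jitter scaled by the retry count.
// All constants are illustrative assumptions. `random` is a hypothetical
// injection point; it must return a number in [0, 1).
function exponentialWithJitter(retryCount, random = Math.random) {
  const baseDelaySeconds = 60;  // assumed wait before the first retry
  const factor = 2;             // assumed growth factor per retry
  const maxJitterSeconds = 30;  // assumed jitter ceiling for the first retry

  const backoff = baseDelaySeconds * Math.pow(factor, retryCount);
  // The jitter window widens as the retry count grows.
  const jitter = random() * maxJitterSeconds * (retryCount + 1);
  return backoff + jitter;
}

// Two jobs that failed together will usually get different wait times.
console.log(exponentialWithJitter(2), exponentialWithJitter(2));
```

Because the jitter window grows with the retry count, jobs that failed together drift further apart with every retry.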

The above function is good enough for most use cases. There are
still some gaps where your job processing system would be idle. The
function below tries to distribute the load evenly by increasing the
randomness.
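One way to increase the randomness, shown here only as an assumption about what such a function might look like, is to make the jitter window as wide as the backoff itself, so the wait is drawn uniformly between the backoff and twice the backoff. The constants and the injectable `random` parameter are hypothetical.

```javascript
// Sketch of a backoff with a much wider jitter window. This is one possible
// approach, not necessarily the one plotted in the chart. `random` is a
// hypothetical injection point returning a number in [0, 1).
function exponentialWithWideJitter(retryCount, random = Math.random) {
  const baseDelaySeconds = 60; // assumed wait before the first retry
  const factor = 2;            // assumed growth factor per retry

  const backoff = baseDelaySeconds * Math.pow(factor, retryCount);
  // Jitter spans the whole backoff, so the wait falls in [backoff, 2 * backoff).
  return backoff + random() * backoff;
}

console.log(exponentialWithWideJitter(3)); // somewhere between 480 and 960
```

The trade-off is visible in the range: by the fourth retry the wait can be anywhere between 8 and 16 minutes.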

Although the load distribution is better than in the previous version,
the wait time between two retries starts to deviate a lot. I wonder if
there are any stateless functions that could provide a better distribution
without much deviation in wait time.