Perseverance: flexible retries for Clojure

Few of us have the luxury to work exclusively with reliable systems. Sure,
nothing is 100% reliable, but as long as you use only the resources local to a
single machine, you may disregard the possibility of failure. However, as soon
the network is involved, a whole array of problems emerges. You can no longer be
sure that an HTTP request succeeds, or a file gets downloaded, or a TCP stream
doesn't close midway. Thus, you have to protect your program from such
calamities and devise a recovery plan in advance.

Generic fault tolerance in distributed systems is an incredibly difficult topic
to contemplate on, so we won't trouble ourselves with it today. Instead, I want
to discuss simpler matters, when an unsuccessful operation can just be tried
again. It is usually true for actions that are immutable or idempotent, such as
HTTP GET requests. We run into these scenarios quite often, so we created
Perseverance.

Most languages (Clojure included) already have libraries to retry failed
operations. They are trivial to implement as their core logic is just a loop
that hammers an action until it succeeds. Extra sophistication is provided by
various delay strategies, but the main premise remains the same across all
implementations. What I always found suboptimal about this approach is that it
requires you to specify the retry strategy upfront — which does not contribute
to reusability of such code. When you write a function that downloads a file
from S3, do you know how many retry attempts are enough? Say, one caller wants
to keep trying till the end of time if necessary. Another caller may intend to
make a quick attempt and then fallback to something else. Without knowing all of
the usage patterns in advance, the low-level code cannot make proper decisions.
However, if retry handlers are placed in the higher levels of the system, the
handling becomes much coarser (think redownloading a single file vs. starting
the whole batch anew).

At this point, an insightful reader might notice: "This is a textbook problem
for Lisp's condition system!" Indeed, we at Grammarly write a lot of Common
Lisp, so we were naturally inspired by its approach to error handling. Now, if
you don't know what a condition system is, I recommend watching this talk by
Chris Houser at Clojure/conj 2015. A key takeaway from this talk — yes,
Clojure doesn't have conditions like Common Lisp, but you can emulate them with
dynamic variables. And that's what we did.

If you are still unsure what this is all about, check out the following example:

(require'[perseverance.core:asp]);; Fake function that returns a list of files but fails the first three times.(let [cnt(atom0)](defn list-s3-files[](when (< @cnt3)(swap!cntinc)(throw(RuntimeException."Failed to connect to S3.")))(range 10)));; Fake function that imitates downloading a file with 50/50 probability.(defn download-one-file[x](if (> (rand)0.5)(println (format"File #%d downloaded."x))(throw(java.io.IOException."Failed to download a file."))));; Let's wrap the previous function in retriable. Notice that we make;; no assumptions about how it will be retried.(defn download-one-file-safe[x](p/retriable{}(download-one-filex)));; Now to a function that downloads all files.(defn download-all-files[](let [files(p/retriable{:catch[RuntimeException]}(list-s3-files))](mapvdownload-one-file-safefiles)));; Let's call it and see what happens.(download-all-files);; Unhandled java.lang.RuntimeException: Failed to connect to S3.;; The exception is not handled because we haven't established the retry context.(p/retry{:strategy(p/constant-retry-strategy2000)}(download-all-files));; java.lang.RuntimeException: Failed to connect to S3., retrying in 2.0 seconds...;; java.lang.RuntimeException: Failed to connect to S3., retrying in 2.0 seconds...;; java.io.IOException: Failed to download a file., retrying in 2.0 seconds...;; File #0 downloaded.;; File #1 downloaded.;; java.io.IOException: Failed to download a file., retrying in 2.0 seconds...;; File #2 downloaded.;; File #3 downloaded.;; java.io.IOException: Failed to download a file., retrying in 2.0 seconds...;; java.io.IOException: Failed to download a file., retrying in 2.0 seconds...;; File #4 downloaded.;; ...;; And the call eventually succeeds.

This example features almost everything you have to know about Perseverance. The
library allows you to insulate unreliable pieces of code without thinking too
much about how they are going to be used. Then, you wrap your top-level calls in
one (or a few) retry macros, specifying the exact policy of retries that fits
the task at hand. This separation helps us write functions and even libraries
that are very generic and at the same time robust to failures.

If you became interested, check the Github repository. There you will find the
installation instructions and usage specifics. I’ll be happy to receive any
comments and suggestions. Please, use this library and be [fault] tolerant!