It depends on what you mean by resiliency. I tend to work on things with strong consistency requirements, and to be honest I think the way most people build and talk about distributed systems engineering is pretty gross and unprincipled.

Why is Jepsen so successful against most systems, despite being incredibly slow at actually exercising communication interleaving patterns? People are building systems from a fundamentally broken perspective: they are not actually considering the realistic conditions their systems will be running in.

In my opinion, the proper response to this should be to ask how we can simulate realistic networks (and filesystems for that matter) on our laptops as quickly as possible, without requiring engineers to work with new tools.

My approach is to use quickcheck to generate partitions + client requests, and to implement participants against an interface that usually looks like:

receive(at, from, msg) -> [(to, msg)]

tick(at) -> [(to, msg)]

This way a distributed algorithm can be single-stepped in accelerated time, and for each outbound message we use the current state of the network weather to assign an arrival time or drop it. Stick the messages in a priority queue keyed on arrival time and iterate until no messages are in flight.
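As a rough illustration of that loop, here is a minimal sketch in Python. The names (Node, Simulator, Pinger, Ponger) and the specific weather parameters (drop_rate, max_delay) are my own illustrative choices, not from any particular library:

```python
import heapq
import random

class Node:
    """A participant. Both methods return a list of outbound (to, msg) pairs."""
    def receive(self, at, frm, msg):
        return []

    def tick(self, at):
        return []

class Simulator:
    def __init__(self, nodes, seed=0, drop_rate=0.05, max_delay=100):
        self.nodes = nodes              # {address: Node}
        self.rng = random.Random(seed)  # seeded, so every run is reproducible
        self.drop_rate = drop_rate
        self.max_delay = max_delay
        self.in_flight = []             # heap of (arrival_time, seq, to, frm, msg)
        self.seq = 0                    # tie-breaker so the heap never compares msgs

    def send(self, now, frm, to, msg):
        # "Network weather": maybe drop the message, otherwise pick an
        # arrival time somewhere in the future.
        if self.rng.random() < self.drop_rate:
            return
        arrival = now + self.rng.randint(1, self.max_delay)
        heapq.heappush(self.in_flight, (arrival, self.seq, to, frm, msg))
        self.seq += 1

    def run(self):
        # Single-step the whole cluster in accelerated virtual time until
        # no messages are in flight.
        while self.in_flight:
            at, _, to, frm, msg = heapq.heappop(self.in_flight)
            for dest, out in self.nodes[to].receive(at, frm, msg):
                self.send(at, to, dest, out)

# Tiny demo: a ping is answered with a pong.
class Pinger(Node):
    def __init__(self):
        self.acked = False

    def receive(self, at, frm, msg):
        if msg == "pong":
            self.acked = True
        return []

class Ponger(Node):
    def receive(self, at, frm, msg):
        return [(frm, "pong")] if msg == "ping" else []

nodes = {"a": Pinger(), "b": Ponger()}
sim = Simulator(nodes, seed=42, drop_rate=0.0)  # no drops, so the demo is deterministic
sim.send(0, "a", "b", "ping")
sim.run()
```

In a fuller version, tick() would be driven by scheduling periodic timer events into the same queue, and the quickcheck-generated partitions would modulate the drop rate per link over time; both are omitted here for brevity.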

With something like this, every engineer can get a few thousand Jepsen-like runs in a couple of seconds before even opening a pull request. They don’t have to use any tools beyond their language’s standard test support. You write the simulator once and it has very high reuse value, since everything just implements the interface you chose. Way higher bug:CPU-cycle ratio than Jepsen.

This does not replace Jepsen, which is still important for catching end-to-end issues in an “as-deployed” configuration. But I really do think Jepsen is being totally misused.

We should build things in a way that allows us to quickly, cheaply measure whether they will be successful in the actual environments that we expect them to perform in.

Maximize introspectability. Everything is broken to some extent, so be sympathetic to your future self, who will have to debug the system in production while it fails spectacularly and everyone freaks out.

One kind of concurrency that few seem to consider until it’s time to upgrade: multiple versions of your code may be live in some situations. Did you build your system in a way that ensures this is safe? Did you build your system in a way that allows you to un-deploy a new version if it fails unexpectedly? Or did you build in points of no return?
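One concrete defensive pattern for this (a hypothetical sketch, not the only way to do it) is making every handler tolerate message kinds it doesn't recognize, so old and new code can coexist and a bad version can be rolled back without wedging the cluster:

```python
# Hypothetical sketch: version-tolerant message handling during a rolling
# upgrade. The message kinds ("write", "read") and reply shapes are
# invented for illustration.

def handle(msg):
    kind = msg.get("kind")
    if kind == "write":
        return {"kind": "write_ack", "key": msg["key"]}
    if kind == "read":
        return {"kind": "read_ack", "key": msg["key"]}
    # A kind introduced by a newer (or already rolled-back) version gets an
    # explicit reply instead of crashing the process, so a mixed-version
    # cluster stays live and the sender can fall back.
    return {"kind": "unsupported", "got": kind}
```

The point is that "unknown" is a normal input, not an assertion failure; that property is what lets you un-deploy a new version after it has already started talking to the rest of the cluster.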

One reason I don’t do distributed systems fault injection consulting anymore is the egos of people who can’t accept that their babies have problems. That got tiring really quickly. The #1 most important thing in building a reliable system is being humble. Everything we do is broken. That’s OK. So many engineers who learn how to open sockets begin to think of themselves as infallible rockstars. It’s really hard to build systems that work with these people.

I’m not a fan. I remember it as feeling like “common sense” combined with some ideas that I think are bad. Of course, “common sense” means “lessons that took me years to learn”. No book is all good advice. Part of the learning process is implementing ideas that aren’t yours and learning when they are applicable and when they aren’t.

My biggest problem with “Release It!” was that it felt more like “all good engineers do X”. Best practices never are. They are good ideas in a certain context. Sadly, most tech thought leaders teach them as absolutes.