The error handling chapter contains a lot of really good information, but I'm missing some background on how to think about error handling. Error handling, especially at distributed-systems scale, requires a different mindset. You describe the let-it-fail thinking, but fault tolerance goes beyond that, and I don't feel the book shows me what fault tolerance actually looks like.

For example, there's no discussion of graceful degradation, but that's a key aspect of fault tolerance (together with self-repair). I'm also missing the mindset that "errors are not exceptional": errors are not merely something that can occur, they are something that will occur often. If you have a distributed system of 100 nodes, there will be plenty of broken hard drives.

Let-it-fail / intentional programming does not mean having to worry less about errors; it means thinking about errors in a different way. The focus shifts from error handling to recovery.

I guess what I'm trying to say is that I miss some background on the kind of thinking that leads to Erlang's error handling mechanisms and that helps in using them optimally. Bits and pieces of this thinking are spread around the chapter, but I'd like to see them put together into a "big picture".