After all, a temporary halt to withdrawals and deposits is just a temporary
inconvenience; in Bitcoin’s experience consensus failures can usually be fixed
within a few hours. Equally, from the point of view of Kraken, it’s hard to
argue that multiple implementations made Ethereum more reliable: they still had
to stop using Ethereum for a few hours while the dust settled.

So as promised, I thought I’d talk in a bit more detail about why multiple
implementations don’t necessarily make a system more reliable. We’ll also ask
an important question: How is the Ethereum protocol specified?

What is Reliability Anyway?

If we’re going to try to make our consensus system better, we have to start
with figuring out what we’re trying to achieve in the first place. I used to work
in analog electronics, so let’s make an analogy: Suppose I’m an electrical
engineer, and I’ve been given a budget to work with and asked to design a
reliable mains power system for a building. Does that mean I’m just trying to
design a system that maximizes the percentage of time the lights stay on,
within the budget allowed?

Of course not! “Reliability” isn’t as simple as keeping the lights on; there
are at least four things I want my design to do reliably:

Reliably don’t kill people.

Reliably don’t burn the building down.

Reliably don’t allow the power distribution system to be permanently damaged
by faults.

Reliably keep the lights on.

Out of those four, just keeping the lights on - the availability of power -
is actually the lowest priority! I’d much rather the power go out
temporarily than have anything get permanently damaged, I definitely don’t want
to burn the building down, and if my design gets anyone killed I probably
should find a good lawyer ASAP.

Optimising for Reliability

Circuit breakers are a great example of these priorities in action. Circuit
breakers protect against shorts and overloads by cutting the flow of
electricity immediately; without them faults would result in permanent damage
and even fires.

They’re also single-points-of-failure: in most buildings between your lights
and the power grid there are two or three circuit breakers (or fuses) in series.
If any one of them fails the lights go off - no backups.

But when we consider our priorities that’s OK: keeping the lights on is the
lowest on our list, and for most purposes we can tolerate some downtime in
exchange for a safer power system. In short, we’ve sacrificed the
availability of power, in exchange for a higher overall reliability for the
same total budget.
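To put rough numbers on the series arrangement: if each breaker can fail independently, the lights are only on when every breaker in the chain is working, so the availabilities multiply. The sketch below is purely illustrative; the 99.9% figure and the name `series_availability` are assumptions made for the example, not measurements from any real installation.

```python
import math

def series_availability(availabilities):
    """Availability of components in series: all of them must be working."""
    return math.prod(availabilities)

# Hypothetical figure: three breakers in series, each available 99.9% of the time.
breakers = [0.999, 0.999, 0.999]
chain = series_availability(breakers)

# The chain is slightly less available than any single breaker (roughly 0.997).
print(f"one breaker: {breakers[0]:.4f}, three in series: {chain:.4f}")
```

Every extra breaker in series lowers availability a little, which is exactly the trade being made: some downtime in exchange for safety.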

Optimising for Availability

That’s not always a good trade-off: sometimes the power failing is itself a
safety problem. An interesting example is the “fire pumps” used in high-rise
fire sprinkler systems to supply water to the sprinkler heads:

Here our priorities are very different: if a fire pump is in use, there’s a
good chance the building is already on fire. To make a long story short,
building codes prohibit the installation of circuit breakers on circuits
that supply fire pumps in many circumstances, because it’s better that the
pump keep running so the sprinklers can put the fire out, even at the risk of
potentially destroying the pump and wires connected to it due to a fault.
Fire pumps sacrifice overall reliability in exchange for higher availability.

Redundancy: High Availability and High Reliability

What if I want the best of both worlds? I could install two fire pumps, both
protected by circuit breakers. I’d then have a system where faults are handled
safely without damaging equipment, and (hopefully!) if one pump fails in a fire
I’ll still be protected by the other.

Why don’t we do this? In the case of fire pumps, money. Building twice as many
pumps costs twice as much money, and for various reasons it’s more effective to
spend the money on other things like thicker wires that can handle fault
currents and higher quality pumps that are less likely to fail in the first
place. As often happens with trade-offs, your choices are reliability,
availability, and affordability: pick two.
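As a back-of-the-envelope sketch of why the second pump is tempting: if the two pumps fail independently, the system is only unavailable when both are down at once. The 99% figure and the name `parallel_availability` are assumptions for illustration, and the independence assumption is doing a lot of work here.

```python
import math

def parallel_availability(availabilities):
    """Availability of redundant components: at least one must be working.

    Assumes failures are independent, which real installations may violate.
    """
    return 1 - math.prod(1 - a for a in availabilities)

# Hypothetical figure: each pump is available 99% of the time.
one_pump = parallel_availability([0.99])
two_pumps = parallel_availability([0.99, 0.99])

# Two pumps are unavailable only when both fail together (about 1 time in 10,000).
print(f"one pump: {one_pump:.4f}, two pumps: {two_pumps:.4f}")
```

The availability gain is real, but it is bought by doubling the cost of the pumps, which is the "pick two" trade-off in numbers.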

Consensus Systems and Redundancy

But at least you can easily make those fire pumps redundant. For a pump,
redundancy is additive: if the left pump turns on and the right pump doesn’t,
water is still going to flow. Additional pumps only add to the availability of
the system, right?

Actually, even for something as simple as a pump that’s not necessarily true:
if two pumps supply the same pipe and one of the pumps fails, the result is often
that the other pump wastes most of its output pushing water backwards through
the pump that failed; if the failure mode was a leak, the whole system could be
totally useless. So you need to add check-valves to the design, which means
the cost of two pumps is now a little higher than 2x. And those valves can
themselves fail, reducing reliability…

Consensus systems take this problem to the extreme; it’s really difficult to
use redundancy to make a consensus system more reliable. If we have two
different implementations of the same system, and one implementation thinks
Alice paid Bob and the other implementation thinks Alice paid Charlie, we have
a massive problem that must be fixed. Until that problem is fixed the system
simply isn’t safe to use: neither Charlie nor Bob can be sure that they’re
actually going to get paid. And if Bob and Charlie have both lost large sums of
money in contradictory ways… Have fun trying to come to consensus on who
should eat the loss.

In a consensus system naively adding redundancy subtracts from reliability in
a particularly bad way: not only do you have twice as much code that can have
bugs in it, but previously harmless, subtle implementation differences are now
serious problems.
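A toy sketch of how a "harmless" difference becomes a consensus failure. Everything here is hypothetical: the `Tx` type, the two validation functions, and the zero-value rule are invented for illustration and are not taken from Bitcoin, Geth, or Parity. The two implementations differ only in whether a zero-value transfer is valid.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tx:
    sender: str
    recipient: str
    amount: int

def valid_impl_a(tx, balances):
    # Implementation A: any non-negative amount the sender can afford is valid.
    return 0 <= tx.amount <= balances.get(tx.sender, 0)

def valid_impl_b(tx, balances):
    # Implementation B: the same rule, except zero-value transfers are rejected.
    return 0 < tx.amount <= balances.get(tx.sender, 0)

def apply_block(txs, balances, is_valid):
    """Return the balances after the block, or None if the block is rejected."""
    new = dict(balances)
    for tx in txs:
        if not is_valid(tx, new):
            return None  # one invalid transaction invalidates the whole block
        new[tx.sender] -= tx.amount
        new[tx.recipient] = new.get(tx.recipient, 0) + tx.amount
    return new

genesis = {"alice": 10}
block = [Tx("alice", "bob", 0), Tx("alice", "charlie", 10)]

state_a = apply_block(block, genesis, valid_impl_a)
state_b = apply_block(block, genesis, valid_impl_b)

# A accepts the block and thinks Charlie was paid; B rejects the block entirely.
print("implementation A:", state_a)
print("implementation B:", state_b)
```

Run on its own, either implementation is perfectly self-consistent. It is only when both are on the same network that the zero-value edge case turns into a disagreement about who got paid.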

Voting

So why not make three implementations, and have them vote two-out-of-three?

Real-world systems do work this way - the Space Shuttle went as far as to have
five different computers, running two independently written versions of the
flight software. Obviously this comes at a cost: three implementations are
roughly three times as much work, and the Space Shuttle wasn’t exactly a
low-cost project.

But using voting in decentralized systems also fails in another subtle way:
part of the desire for independent implementations is the perception that
they’ll “decentralize development”. In that respect redundancy still fails: the
two-of-three solution is itself an implementation, with the choice of which
three implementations to use being the implementation!
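A minimal sketch of a two-out-of-three voter makes the point concrete. The function name `two_of_three` and the string verdicts are assumptions for illustration; a real system would compare something like block hashes or state roots.

```python
from collections import Counter

def two_of_three(verdicts):
    """Return the verdict that at least two of three implementations agree on."""
    if len(verdicts) != 3:
        raise ValueError("expected exactly three verdicts")
    verdict, count = Counter(verdicts).most_common(1)[0]
    if count < 2:
        # All three disagree: the voter has no answer and must halt.
        raise RuntimeError("no majority among the three implementations")
    return verdict

# Hypothetical verdicts from three independent implementations on one block.
result = two_of_three(["accept", "accept", "reject"])
print(result)
```

Note that every node has to run this same wrapper, with the same three implementations and the same rule for what happens when there is no majority, or the nodes will disagree with each other. The voter is consensus-critical code in its own right.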

How is the “Ethereum Protocol” Defined?

If we’re going to use redundancy, we need a specification. For something as
simple as a pump, that specification doesn’t need to be all that detailed to
work:

As terrible as that specification is, if I tried to use it to buy a fire pump
I’d get something back that still put water on fires. It wouldn’t be the best
pump for the job - and I’d be the laughingstock of the job site - but when you
get down to it pumping water just isn’t that complex.

In comparison, here’s an extract from the Ethereum Homestead “Yellow Paper”:

And that’s just one of dozens of pages of densely written notation.

Yet is that sufficient to be a protocol specification? Apparently not! It’s
very telling that the DAO bailout hard-fork isn’t a part of that Yellow Paper,
the Ethereum wiki hasn’t been updated with the DAO bailout rules, nor was an
Ethereum Improvement Proposal written for the DAO bailout. I looked at the
codebases for Geth and Parity: both implement the hard-fork, but neither
codebase[2] points to a human-readable specification describing what that
hard-fork actually was.

I don’t believe a second, compatible implementation of Bitcoin will ever be a
good idea. So much of the design depends on all nodes getting exactly
identical results in lockstep that a second implementation would be a menace
to the network.
-Satoshi Nakamoto

As far as I can tell, just as with Bitcoin, in practice the Ethereum protocol is
documented by human-readable text but defined by executable code. Yet it’s
often claimed otherwise:

It's pretty nice that ETH has a spec and multiple implementations, in cases when there's a bug in a particular implementation.

Footnotes

Protocols with timeouts such as Lightning do change this, although the timeouts involved should(!) be on the order of a week or two; every Bitcoin fork to date has been resolved in no more than a few hours. ↩

The pull-req for Parity’s bailout implementation refers to this spec, but Geth doesn’t appear to mention that document at all. In any case, referring to a random document on GitHub in a pull-req is pretty dodgy - there are multiple levels of trusted pointers that can fail there, leaving it unclear what the specification actually is. ↩