Monday, 29 May 2017

Byzantine clients

Although I personally don't work on Byzantine block chains, we happen to be in the midst of a frenzy of research and hiring and startups centered on this model. As you probably know, a block chain is just a sequence (totally ordered) of records shared among some (perhaps very large) population of participating institutions. There is a rule for adding the next block to the end of the chain, and the chain itself is fully replicated. Cryptographic protection is used to avoid risk of a record being corrupted or modified after it is first recorded. A Byzantine block chain uses Byzantine agreement to put the blocks into order, and to force the servers to vote on the contents of every block. This is believed to yield ultra robust services, and in this particular use-case, ultra robust block chains.

Financial institutions are very excited about this model, and many have already started to use a digital contracts language called hyper-ledger that seems to express a wide range of forms of contracts (including ones that might have elements that can only be filled in later, or that reference information defined in prior blocks), and one can definitely use block chains to support forms of cyber currency, like BitCoin (but there are many others by now). In fact block chains can even represent actual transfers of money: this is like what the SWIFT international banking standard does, using text messages and paper ledgers to represent the transaction chain.

Modern block chain protocols that employ a Byzantine model work like this: we have a set of N participants, and within that set, we assume that at most T might be arbitrarily faulty: they could collude, might violate the protocol, certainly could lie or cheat in other ways. Their nefarious objective might be to bring the system to a halt, to confuse non-faulty participants about the contents of the blockchain or the order in which the blocks were appended, to roll-back a block (as a way to undo a transaction, which might let them double-spend a digital coin or renege on a contract obligation), etc. But T is always assumed to be less than N/3.

Of course, this leads to a rich class of protocols, very well-studied, although (ironically), more often in a fully synchronous network than in an asynchronous one. But the so-called PRACTI protocols created by Miguel Castro with Barbara Liskov work in real-world networks, and most block chain systems use some variation of them. They basically build on consensus (the same problem solved by Lamport's Paxos protocol). The non-compromised participants simply outvote any Byzantine ones.

What strikes me as being of interest here is that most studies have shown that in the real world, Byzantine faults almost never arise! And when they do, they almost never involve faulty servers. Conversely when attackers do gain control, they usually compromise all the instances of any given thing that they were able to attack successfully. So T=0, or T=N.

There have been a great many practical studies on this question. I would trace this back to Jim Gray, who once wrote a lovely paper on "Why Do Computers Stop and What Can Be Done About It?". Jim was working at Tandem Computers at that time, on systems designed to tolerate hardware and software problems. Yet they crashed even so. His approach was to collect a lot of data and then sift through it.

Basically, he found that human errors, software bugs and design errors were a much bigger problem then hardware failures. Jim never saw any signs of Byzantine faults (well, any fault is Byzantine. But I mean malicious behaviors, crafted to compromise a system).

More recent studies confirm this. At Yahoo, for example, Ben Reed examined data from a great many Zookeeper failures, and reported his findings ina WIPS talk at SOSP in 2007. None were Byzantine (in the malicious sense).

At QNX Chris Hobbs has customers who worry that very small chips might experience higher rates of oddities that would best be modeled as Byzantine. To find out, he irradiated some chips in a nuclear reactor, and looked at the resulting crashes to see what all the bit flips did to the code running on them (and I know of NASA studies of this kind, too). In fact, things do fail. But mostly, by crashing. The main issue turns out to be undetected data corruption, because most of the surface area on the chips in our computers is used for SRAM, caching, and DRAM storage. Data protected by checksums and the like remains safe, but these other forms are somewhat prone to undetected bit flips. But most data isn't in active use, in any case: most computer memory just has random stuff in it, or zeros, and the longest lived active form of memory is to hold the code of the OS and the application programs. Bit-flips will eventually corrupt instructions that do matter, but corrupted instructions mostly trigger faults. So, it turns out that the primary effect of radiation is to raise the rate of sudden crashes.

Back when Jim did his first study, Bruce Nelson built on it by suggesting that the software bugs he was seeing fall into two cases: Bohrbugs (deterministic and easily reproduced, like Bohr's model of the atom: an easy target to fix) and Heisenbugs (wherever you look, the bug skitters off to somewhere else). Bruce showed that in a given version of a program, Bohrbugs are quickly eliminated, but patches and upgrades often introduce new ones, creating an endless cycle. Meanwhile, the long-lived bugs fell into the Heisenbug category, often originating from data structure damage "early" in a run that didn't cause a crash until much later, or from concurrency issues sensitive to thread schedule ordering. I guess that the QNX study just adds new elements to that stubborn class of Heisenbugs.

So, we don't have Byzantine faults in servers, but we do have a definite issue with crashes caused by bugs, concurrency mistakes, and environmental conditions that trigger hardware malfunctions. There isn't anything wrong with using Byzantine agreement in the server set, if you like. But it probably won't actually make the service more robust or more secure.

Bugs can be fought in many ways. My own preferred approach is simpler than running Byzantine agreement. With Derecho or other similar systems, you just run N state machine replicas doing whatever the original program happened to do, but start them at different times and use state transfer to initialize them from the running system (the basis of virtual synchrony with dynamic group membership).

Could a poison pill kill them all at once? Theoretically, of course (this is also a risk for a Byzantine version). In practice, no. By now we have thirty years of experience showing that in process group systems, replicas won't exhibit simultaneous crashes, leaving ample time to restart any that do crash, which will manifest as occasional one-time events.

Nancy Leveson invented a methodology called N-Version programming; her hypothesis was that if most failures were simply due to software bugs, then that by creating multiple versions of the most important services, you could mask the bugs because coding errors would probably not be shared among the set. This works, too, although she was surprised at how often all N versions were buggy in the same way. Apparently, coders often make the same mistakes, especially when they are given specifications that are confusing in some specific respect.

Fred Schneider and others looked at synthetic ways of "diversifying" programs, so that a single piece of code could be used to create the versions: this method automatically generates a bunch of different versions from one source file, with the same input/output behavior. You get N versions too, often find that concurrency problems won't manifest in the identical way across the replicas, and they also are less prone to security compromises!

Dawson Engler pioneered automated tools for finding bugs and even for inferring specifications. His debugging tools are amazing, and with them, bug-free code is an actual possibility.

Servers just shouldn't fail in weird, arbitrary ways anymore. There is lots of stuff to worry about, but Byzantine compromise of a set of servers shouldn't be high on the radar.

Moreover, with strong cloud firewalls, Intel SGX, and layers and layers of monitoring, I would bet that the percentage of compromises that manage to take control of between and N/2-1 replicas (but not more), or. even of attacks that look Byzantine is even lower.

But what we do see (including Reed's 2007 study), are correct services that come under attack from some form of malicious client. For an attacker, it is far easier to hijack a client application than to penetrate a server, so clients that try to use their legitimate connectivity to the server for evil ends are a genuine threat. In fact with open source, many attackers just get the client source code, hack it, then run the compromised code. Clients that simply send bad data without meaning to are a problem too.

The client is usually the evil-doer. Yet the Byzantine model blames the server, and ignore the clients.

It seems to me that much more could be done to characterize the class of client attacks that a robust service could potentially repel. Options include monitoring client behavior and blacklisting any clients that are clearly malfunctioning, capturing data redundantly and somehow comparing values, so that a deliberately incorrect input can be flagged as suspicious or even suppressed entirely (obviously, this can only be done rarely, and for extreme situations), or filtering client input to protect against really odd data. Gun Sirer and Fred Schneider once showed that a client could even include a cryptographic proof that input strings really came from what the client typed, without tampering.

Manuel Costa and Miguel Castro came up with a great system they called Vigilante, a few years ago. If a server was compromised by bad client input, it detected the attack, isolated the cause and spread the word instantly. Other servers could then adjust their firewalls to protect themselves, dynamically. This is the sort of thing we need.

So here's the real puzzle. If your bank plans to move my accounts to a block chain, I'm ok with that, but don't assume that BFT on the block chain secures the solutions. You need to also come up without a way to protect the server against buggy clients, compromised clients, and clients that just upload bad data. The "bad dudes" are out there, not inside the data center. Develop a plan to to keep them there!