To any software engineer, the two consecutive goto fail lines will be suspicious. They are, and more than that. To quote Adam Langley:

The code will always jump to the end from that second goto, err will contain a successful value because the SHA1 update operation was successful and so the signature verification will never fail.

Not verifying a signature is exploitable. In the upcoming months, there will remain plenty of devices running older versions of iOS and MacOS. These will remain vulnerable, and epxloitable.

Brittle Software Engineering

When first seeing this code, I was once again caught by how incredibly brittle programming is. Just adding a single line of code can bring a system to its knees.

For seasoned software engineers this will not be a real surprise. But students and aspiring software engineers will have a hard time believing it. Therefore, sharing problems like this is essential, in order to create sufficient awareness among students that code quality matters.

Code Formatting is a Security Feature

When reviewing code, I try to be picky on details, including white spaces, tabs, and new lines. Not everyone likes me for that. I often wondered whether I was just annoyingly pedantic, or whether it was the right thing to do.

The case at hand shows that white space is a security concern. The correct indentation immediately shows something fishy is going on, as the final check now has become unreachable:

Indentation Must be Automatic

Because code formatting is a security feature, we must not do it by hand, but use tools to get the indentation right automatically.

A quick inspection of the sslKeyExchange.c source code reveals that it is not routinely formatted automatically: There are plenty of inconsistent spaces, tabs, and code in comments. With modern tools, such as Eclipse-format-on-save, one would not be able to save code like this.

Yet just forcing developers to use formating tools may not be enough. We must also invest in improving the quality of such tools. In some cases, hand-made layout can make a substantial difference in understandability of code. Perhaps current tools do not sufficiently acknowledge such needs, leading to under-use of today’s formatting tools.

Code Review to the Rescue?

Code review can be effective against these sorts of bug. Not just auditing, but review of each change as it goes in. I’ve no idea what the code review culture is like at Apple but I strongly believe that my colleagues, Wan-Teh or Ryan Sleevi, would have caught it had I slipped up like this. Although not everyone can be blessed with folks like them.

There is a mismatch between the expectations and the actual outcomes of code reviews. From our study, review does not result in identifying defects as often as project members would like and even more rarely detects deep, subtle, or “macro” level issues. Relying on code review in this way for quality assurance may be fraught.

Automated Checkers to the Rescue?

If manual reviews won’t find the problem, perhaps tools can find it? Indeed, the present mistake is a simple example of a problem caused by unreachable code. Any computer science student will be able to write a (basic) “unreachable code detector” that would warn about the unguarded goto fail followed by more (unreachable) code (assuming parsing C is a ‘solved problem’).

Therefore, it is no surprise that plenty of commercial and open source tools exist to check for such problems automatically: Even using the compiler with the right options (presumably -Weverything for Clang) would warn about this problem.

Here, again, the key question is why such tools are not applied. The big problem with tools like these is their lack of precision, leading to too many false alarms. Forcing developers to wade through long lists of irrelevant warnings will do little to prevent bugs like these.

Unfortunately, this lack of precision is a direct consequence of the fact that unreachable code detection (like many other program properties of interest) is a fundamentally undecidable problem. As such, an analysis always needs to make a tradeoff between completeness (covering all suspicious cases) and precision (covering only cases that are certainly incorrect).

To understand the sweet spot in this trade off, more research is needed, both concerning the types of errors that are likely to occur, and concerning techniques to discover them automatically.

Testing is a Security Concern

As an advocate of unit testing, I wonder how the code could have passed a unit test suite.

Unfortunately, the testability of the problematic source code is very poor. In the current code, functions are long, and they cover many cases in different conditional branches. This makes it hard to invoke specific behavior, and bring the functions in a state in which the given behavior can be tested. Furthermore, observability is low, especially since parts of the code deliberately obfuscate results to protect against certain times of attacks.

Thus, given the current code structure, unit testing will be difficult. Nevertheless, the full outcome can be tested, albeit it at the system (not unit) level. Quoting Adam Langley again:

In other words, while the code may be hard to unit test, the system luckily has well defined behavior that can be tested.

Coverage Analysis

Once there is a set of (system or unit level) tests, the coverage of these tests can be used an indicator for (the lack of) completeness of the test suite.

For the bug at hand, even the simplest form of coverage analysis, namely line coverage, would have helped to spot the problem: Since the problematic code results from unreachable code, there is no way to achieve 100% line coverage.

Therefore, any serious attempt to achieve full statement coverage should have revealed this bug.

Note that trying to achieve full statement coverage, especially for security or safety critical code, is not a strange thing to require. For aviation software, statement coverage is required for criticality “level C”:

Level C:

Software whose anomalous behavior, as shown by the system safety assessment process, would cause or contribute to a failure of system function resulting in a major failure condition for the aircraft.

This is one of the weaker categories: For the catastrophic ‘level A’, even stronger test coverage criteria are required. Thus, achieving substantial line coverage is neither impossible nor uncommon.

The Return-Error-Code Idiom

Finally, looking at the full sources of the affected file, one of the key things to notice is the use of the return code idiom to mimic exception handling.

Since C has no built-in exception handling support, a common idiom is to insist that every function returns an error code. Subsequently, every caller must check this returned code, and include an explicit jump to the error handling functionality if the result is not OK.

This is exactly what is done in the present code: The global err variable is set, checked, and returned for every function call, and if not OK followed by (hopefully exactly one) goto fail.

Almost 10 years ago, together with Magiel Bruntink and Tom Tourwe, we conducted a detailed empirical analysis of this idiom in a large code base of an embedded system.

One of the key findings of our paper is that in the code we analyzed we found a defect density of 2.1 deviations from the return code idiom per 1000 lines of code. Thus, in spite of very strict guidelines at the organization involved, we found many examples of not just unchecked calls, but also incorrectly propagated return codes, or incorrectly handled error conditions.

Based on that, we concluded:

The idiom is particularly error prone, due to the fact that it is omnipresent as well as highly tangled, and requires focused and well-thought programming.

EDIT February 24, 2014: Fixed incorrect use of ‘dead code detection’ terminology into correct ‘unreachable code detection’, and weakened the claim that any computer science student can write such a detector based on the reddit discussion.

33 thoughts on “Learning from Apple’s #gotofail Security Bug”

I couldn’t agree more about how important code formatting is, and that not enough developers care about it. But, I couldn’t disagree more about the use of auto formatters. Yes, they’re fantastic for getting the indentation right – but they are god awful for anything else. They’re so bad, I turn off every trigger to automatically format code (Except of course getting the right indentation on new line)

It is of course true that in the process I lose a way to know that the indentation is correct and not going to mislead. But to counter that, I think if you care enough about how your code looks you’ll pick up on those mistakes immediately, and that the IDE should be able to warn you of unreachable code.

Great points about code inspection and coding standards. I know code walk throughs are boring and difficult to manage but if done right and with proper motivation, they can be very effective. See my brief remarks about coding practices and a walk through we did of the world’s first implantable defibrillator code: http://lockstep.com.au/blog/2011/02/23/programming-is-like-playwriti

But why isn’t anyone talking about the goto statements themselves? Arie says two gotos in a row are suspicious, but I say just of the buggers is suspicious! Don’t they teach programmers in college anymore that no program EVER needs a goto?? The goto encourages lazy design and sloppy coding and, well, look where it landed Apple? This problem would probably have never occurred if the programmer took care with an elegant design. And of course code inspection and testing is more reliable when code is well behaved.

I applaud Arie’s point about brittleness.

It’s a deep worry that so much of the modern world now depends on writing. Really bad writing.

Of course, goto’s are considered harmful, and naturally we teach this to our students (especially in The Netherlands where we are proud of Dijkstra). Actually, today’s students are trained in OO, functional programming, and exception handling, and would not be able to write code like this.

The reason I did not say that much about the goto’s (but I guess I should have) is that I considered them a direct consequence of the bigger decision to use a return code idiom to mimic exception handling. Because of this, there are so many gotos in the code. I would expect that the designers of this code were aware of Dijkstra’s letter, but that they needed some form of exception handling in C, and decided for this.

Idiomatic error handling can be done in a cleaner way (with macros, or with the longjmp goto), but, as we found in 2005, remains brittle and error prone.

One thing though: you stated: “In other words, while the code may be hard to unit test, the system luckily has well defined behavior that can be tested.”

The word ‘luckily’ is, imho, a bit odd here. It implies that ‘they are lucky that it has well defined behaviour’. Defining that behaviour is part of the job as well, that shouldn’t be a quality attribute that is in this system ‘by luck’.

Still, it shows that designing software with testability in mind is very important. It also shows that software testing in general is a highly interesting field.

For the unit tests we were ‘unlucky’, since the code will be very hard to unit test.

Usually, it will be harder to invoke specific functionality at the system level, as it will be hard to bring the system in a state in which it should exactly execute one particular line of code.
Here this turned out to be possible, which is not so self-evident. So I think that is why I used the word lucky …

Unit testing does not mean you only execute one line of code, it’s more about testing one very small piece of functionality or scenario. Obsessing over one unit == one function or one method misses the point about writing small, focused tests that help triangulate a point of failure.

It’s unfortunate the the Go authors have stuck with return codes–although at least they allow a bit more structure around it. Another thing I miss from the modular languages was that one had to actively ignore a returned result with a CALL statement.

Exceptions are a much better design pattern than return codes (since the default action is to propagate them rather than to ignore them). But they are still surprisingly fragile…. Over the years, I have seen a steady stream of code that swallows exceptions in one way or another.

I am suspicious of any code that catches exceptions, since this is an opportunity for a serious bug. Java’s introduction of “checked” exceptions to handle non-error situations is unfortunate, since it teaches programmers that they should catch exceptions everywhere, which undermines the whole design pattern.

I teach developers that they should throw exceptions a lot, clean up their resources with ‘finally’ clauses, and almost never write the keyword ‘catch’. Code reviews should examine every ‘catch’ clause very carefully!

C++ exceptions are not great in system code, as they end up being a call to longjmp without calling setjmp – one uncaught system error and the app exits, which is not ideal. Also, they leak things not allocated on the stack, whereas this cleanup pattern fixes that (finally clauses can mitigate this – do you use them religiously?)

Test-driven development (TDD) done consistently throughout a project results in near-100% logic coverage. That’s better than just statement coverage. It also results in testable code, better modularity, better design (if you pay attention to code smells), and faster software development.

The heart of TDD is: 1. write a test. running that test should fail. 2. make the test pass by writing just enough code (all other tests should be passing as well.) 3. refactor. 4. Repeat until desired feature is implement. Checking into source-code-control can be done whenever all tests are passing.

If code is only written to make a test pass (as in TDD), you can’t write an if statement until you have a test that requires it. You’ll end up with a test for the if statement being true and a test for the if statement being false.

TDD in C is not terribly hard. I’ve taught people how to TDD in C and other languages. And the suites of tests created by TDD makes it possible to refactor with more confidence.

If one or more TDD tests (and other tests, like system tests) fail, someone broke something. The fine-grain testing that is part of TDD will usually pin-point where the new bug is. Undo the changes that broke the test, or fix things before doing anything else.

(Tests that fail unpredictably indicate something is non-deterministic either in the code-under-test or the test itself. That also needs to be fixed.)

landonf demonstrated that SSLVerifySignedServerKeyExchange() is unit-testable in isolation. See his code on github:

I’m surprise that nobody stated so far that had Apple moved beyond C to, say C++, this specific bug, at least, would no have occurred. Using the Resource Acquisition Is Initialization (RAII) idiom would have eliminated the need for ‘goto fail’ in order to perform resource cleanup. I’m totally on board with TDD, but if we can eliminate whole categories of bugs just by using slightly more modern tools (compilers), why not?

The makers of “BeOS” wrote their system in C++ and then had problems when the compiler vendor changed the internals of the C++ library/data-structures. I recall this as an example of “The fragile base-class”. (Also, don’t write your OS with someone else’s compiler.)

Objective-C was/is less fragile, and NextStep even had device drivers written in Objective-C. If I recall correctly, Apple even unified C++ and Objective-C exception throwing/catching. As you saw in this case, error-handling was “by convention” with limited language support.

Almost 10 years ago… wow! Did you know the codebase you are talking about is now 24 MLOC instead of the 12 MLOC back then?

Thanks for the great post, by the way! Isn’t the core problem here that programming is still a free-format painting-the-cave-wall exercise? What if the entry (and change) of “code” would become several levels more strict, such that typing in that goto here would have done nothing short of disabling the keyboard to prevent a commit? I’m thinking of approaches like proposed by Markus Voetler (mbeddr, I think) where you appear to be typing in code, but actually the IDE is interpreting and potentially strictly checking the input. Sure, you’ll need to implement rules to make such an approach work, but you’ll write down coding rules anyway (in a big manual that nobody will read).

Isn’t it time to stop scratching our programs on the rocky walls of a cave?

Starting with an ‘unknown error’ seems intuitive. In this case, however, the ‘err’ variable is overwritten upon every function call. Therefore, as soon as a function returns successfully err gets the success status.

Labelling the last three statements with ‘fail’ seems a bit misleading too, indeed, as these lines are also executed upon success. They are more like ‘finally’ in try-catch syntax.

1) Dead code analysis would warn things like this, if used
2) You should always initialize statuses to error, certainly so when certificates are tested
3) You should always use block brackets, regardless number of statements
4) You should if -else if instead of sequence of ifs