Saturday, October 27, 2018

Systemd is bad parsing and should feel bad

Systemd has a remotely exploitable bug in its DHCPv6 client. That means anybody on the local network can send you a packet and take control of your computer. The flaw is a typical buffer-overflow. Several news stories have pointed out that this client was rewritten from scratch, as if that were the moral failing, instead of reusing existing code. That's not the problem.

The problem is that it was rewritten from scratch without taking advantage of the lessons of the past. It makes the same mistakes all over again.

In the late 1990s and early 2000s, we learned that parsing input is a problem. The traditional ad hoc approach you were taught in school is wrong. It's wrong from an abstract theoretical point of view. It's wrong from the practical point of view, error prone and leading to spaghetti code.

The first thing you need to unlearn is byte-swapping. I know that this was some sort epiphany you had when you learned network programming but byte-swapping is wrong. If you find yourself using a macro to swap bytes, like the be16toh() macro used in this code, then you are doing it wrong.

But, you say, the network byte-order is big-endian, while today's Intel and ARM processors are little-endian. So you have to swap bytes, don't you?

No. As proof of the matter I point to every other language other than C/C++. They don't don't swap bytes. Their internal integer format is undefined. Indeed, something like JavaScript may be storing numbers as a floating points. You can't muck around with the internal format of their integers even if you wanted to.

An example of byte swapping in the code is something like this:

In this code, it's taking a buffer of raw bytes from the DHCPv6 packet and "casting" it as a C internal structure. The packet contains a two-byte big-endian length field, "option->len", which the code must byte-swap in order to use.

Among the errors here is casting an internal structure over external data. From an abstract theory point of view, this is wrong. Internal structures are undefined. Just because you can sort of know the definition in C/C++ doesn't change the fact that they are still undefined.

From a practical point of view, this leads to confusion, as the programmer is never quite clear as to the boundary between external and internal data. You are supposed to rigorously verify external data, because the hacker controls it. You don't keep double-checking and second-guessing internal data, because that would be stupid. When you blur the lines between internal and external data, then your checks get muddled up.

Yes you can, in C/C++, cast an internal structure over external data. But just because you can doesn't mean you should. What you should do instead is parse data the same way as if you were writing code in JavaScript. For example, to grab the DHCP6 option length field, you should write something like:

The thing about this code is that you don't know whether it's JavaScript or C, because it's both, and it does the same thing for both.

Byte "swapping" isn't happening. We aren't extracting an integer from a packet, then changing it's internal format. Instead, we are extracting two bytes and combining them. This description may seem like needless pedantry, but it's really really important that you grok this. For example, there is no conditional macro here that does one operation for a little-endian CPU, and another operation for a big-endian CPU -- it does the same thing for both CPUs. Whatever words you want to use to describe the difference, it's still profound and important.

The other thing that you shouldn't do, even though C/C++ allows it, is pointer arithmetic. Again, it's one of those epiphany things C programmers remember from their early days. It's something they just couldn't grasp until one day they did, and then they fell in love with it. Except it's bad. The reason you struggled to grok it is because it's stupid and you shouldn't be using it. No other language has it, because it's bad.

I mean, back in the day, it was a useful performance optimization. Iterating through an array can be faster adding to pointer than incrementing an index. But that's virtually never the case today, and it just leads to needless spaghetti code. It leads to convoluted constructions like the following at the heart of this bug where you have to both do arithmetic on the pointer as well as on the length which you are checking against. This nonsense leads to confusion and ultimately, buffer overflows.

In a heckofalot of buffer overflows over the years, there's a construct just like this lurking near the bug. If you are going to do a rewrite of code, then this is a construct you need to avoid. Just say no to pointer arithmetic.

In my code, you see a lot of constructs where it's buf, offset, and length. The buf variable points to the start of the buffer and is never incremented. The length variable is the max length of the buffer and likewise never changes. It's the offset variable that is incremented throughout.

Because of simplicity, buffer overflow checks become obvious, as it's always "offset + x < length", and easy to verify. In contrast, here is the fix for the DHCPv6 buffer overflow. That this is checking for an external buffer overflow is less obvious:

Now let's look at that error code. That's not what ENOBUFS really means. That's an operating system error code that has specific meaning about kernel buffers. Overloading it for userland code is inappropriate.

That argument is a bit pedantic I grant you, but that's only the start. The bigger issue is that it's identifying the symptom not the problem. The ultimate problem is that the code failed to sanitize the original input, allowing the hacker to drive computation this deep in the system. The error isn't that the buffer is too small to hold the output, the original error is that the input was too big. Imagine if this gets logged and the sysadmin reviewing dmesg asks themselves how they can allocate bigger buffers to the DHCP6 daemon. That is entirely the wrong idea.

Again, we go back to lessons of 20 years that this code ignored, the necessity of sanitizing input.

Now let's look at assert(). This is a feature in C that we use to double-check things in order to catch programming mistakes early. An example is the code below, which is checking for programming mistakes where the caller of the function may have used NULL-pointers:

This is pretty normal, but now consider this other use of assert().

This isn't checking errors by the programmer here, but is instead validating input. That's not what you are supposed to use asserts for. This are very different things. It's a coding horror that makes you shriek and run away when you see it. In my fact, that's my Halloween costume this year, using asserts to validate network input.

This reflects a naive misunderstanding by programmers who don't understand the difference between out-of-band checks validating the code, and what the code is supposed to be doing validating input. Like the buffer overflow check above, EINVAL because a programmer made a mistake is a vastly different error than EINVAL because a hacker tried to inject bad input. These aren't the same things, they aren't even in the same realm.

Conclusion

Rewriting old code is a good thing -- as long as you are fixing the problems of the past and not repeating them. We have 20 years of experience with what goes wrong in network code. We have academic disciplines like langsec that study the problem. We have lots of good examples of network parsing done well. There is really no excuse for code that is of this low quality.

This code has no redeeming features. It must be thrown away and rewritten yet again. This time by an experienced programmer who know what error codes mean, how to use asserts properly, and most of all, who has experience at network programming.

11 comments:

You don't really go into enough detail. Are byte swapping and casting to internal structures inherently wrong or is it just when these are used on external data? I'm not aware that these operations in themselves can cause problems, just what you do with the resulting data.

I think you've overlooked the fact that assert_return isn't related to the C assert macro. assert_return is a systemd macro which *returns* when the given condition is false. It's just a shorthand for if (!cond) return.

You also seem to enjoy the word "undefined" without really quantifying it. Undefined has a specific meaning in C. You're holding it wrong. What I think you mean to suggest is that overlaying application defined data structures on top of data received from outside the application is unwise as it isn't parsing the data, just mapping c data structures to binary data. That's fine. That act alone doesn't cause trouble, nor does it prevent you from doing future validation of that data.

Your article raises some good points about systemd reinventing things which it need not, but you're quite off the mark in regards to your antiu-sggestions in how to fix the underlying problems.

"offset + x < length" is not a safe form for a bounds check, if x can be a value decoded from the network data. If x can have a value close to the maximum positive value for its type, then "offset + x" may overflow and compare less than length, even though x is not within bounds. Instead, compare x to length - offset, knowing that offset <= length is an invariant.

No, his points are sound and address the importance of principles and high-level clarity over low-level implementation.

1) Clear semantics is crucial, for example in validating & reporting input errors. The quality of engineering & engineers can be seen in their error reporting..

2) He's absolutely right that implementation-level byte manipulation is not a safe or robust approach to parse potentially unsafe external data. Never has been, never will be.

I think the suggestion to throw the code away and start again is excessive -- but we all know that's just hyperbole. The part that does need to be thrown away is the input parsing/ external interface, and that needs to be reimplemented with a more robust parsing approach.

Reliable checked parsing can be easily encapsulated in either C or C++.

Except the link you gave doesn’t say anything at all about htons or ntohs... In fact it implies that they are great. I completely agree with the arguments in that blog, but neither you nor Rob actually understands them. They key point from that blog is that the code should be about the relationship between the data stream’s byte order and the computer’s byte order, it should be about the data stream’s byte order only. And while ntohs is badly named (should be something more like big_endian_to_short), ntohs is exactly what we want. No more ifdefs or conditional code. I’ll put it this way: (data[0] << 8) + data[1] is the correct way extract a big endian number from a data stream (on any computer). But it’s complicated to type and understand, so it’s easy for the programmer to screw up and hard for a reviewer to immediately tell what it is. So the correct thing to do here is to create a helper function that always does this correctly. And hey! You’ve discovered ntohs.

In both cases (manually typing it out vs. a function called ntohs), you have to remember to do it before using the underlying data, and only the latter is especially clear about what the intention is. Ntohs is *better* code.

To be clear, the best way to do it is something like Chrome’s BigEndianReader: takes a buffer and it’s size and then has ReadU16, ReadU32, ReadBytes, etc. that return false if you overstep the bounds of the buffer. So error checking is simple, just if (reader.ReadU16(&dest)) { log error; }. Code doesn’t assume anything about the host system; no crazy ifdefs or conditional code.

Btw, Rob’s buffer bounds check has an integer overflow, so it’s wrong (BigEndianReader would not), as mentioned elsewhere in the comments. And if assert_return is as another comment says, a macro that returns if the condition is false... then the only decent point this article makes is a weak one about error codes.

The post is a bigger joke than the original code. The post is bad and you should feel bad.

Regarding Unknown's claim that the author overlooked what assert_return means:

I thought this too but actually I think the author's point is that the intention of the code doesn't match how it is written.First off if this function is called on input passed in from an external source then is EINVAL (Invalid Argument) really the right choice here? There's no way of distinguishing this error from "the programmer made a mistake and passed an invalid parameter" and "the data passed to the application is malformed". I think that the correct fix for this problem won't involve changing THIS piece of code, as it can stay as it is. The correct fix would be to add validation to the data BEFORE it is passed to this function. The point being that having a function containing the word "assert" being used to make run-time checks of external data is a misuse of the function. A differently named function (validate_return or something) which does the same thing could be used here and the intention would then at least match the code, but I think delaying the validation until a deeper function call is still unclear.

You are correct when you say that having (data[0] << 8) | data[1]; everywhere in your code would be bad code. You are correct when you say that this could be turned into a function. You are wrong when you say that this function is ntohs.When you have a structure and you read/fread/recv data straight over it and make a ntohs pass over it, or alternatively (an even worse option) if you have a buffer of bytes and you cast it to a structure and make a ntohs pass over that, you are making numerous assumptions which C doesn't really let you make and which can only be guaranteed by specific implementations.You are assuming that the data is correctly aligned, that the padding matches up, that you haven't got a bunch of trap representations and that your type sizes match up.These are all assumptions which you don't need to make in your code, making these assumptions doesn't even make your code any faster if you have a modern compiler, making these assumptions just makes your code less clear.And the worst thing is, when you use ntohs you don't even get to hide these assumptions, they're all still being made right there right in front of you in your very code.A wonderful book written by Brian Kernighan and Rob Pike shows a way of abstracting the process of packing and unpacking binary data in C with a very simple and straightforward interface (although if you don't like this one, there are still other better options than ntohs):unpack(buf, "csll", &c, &count, &dw1, &dw2);This is a line taken from page 220 of the first (and I think only) edition of the book. That page also covers basically all I've said above in response to the other comment too. It validates the input BEFORE it is extracted, then uses an assert inside the function which wraps the extraction to verify that the programmer didn't make a mistake when calling the function.

Aside from the author's blunder regarding the offset + x < length check, I don't see anything else particularly wrong with this article other than the aforementioned slightly sloppy use of the term "undefined" but at least calling everything which may be implementation defined, unspecified or undefined just "undefined" is not as big a mistake as for example incorrectly assuming that some undefined behaviour was actually implementation defined.In the end, everyone makes mistakes, I may have made some in this very comment.

I agree with pretty much all of your points, except I don’t understand your point about ntohs being “not” that function. It is *exactly* that function (and should be used inside the unpack function you mentioned). Also, if you use sized integer types and packed structs, you get a lot of guarantees... not saying that’s a good idea for parsing of course.

I still think the article is bad. It’s not in detail, it’s disorganized and badly written, it gives essentially no evidence for its assumptions and very little actionable feedback that’s actually worth taking into account. The comments have pointed to much better texts.

As a C programmer of some decades who has fixed and written up numerous CVEs, my reaction to the article is:

* I use pointer arithmetic, and I am not aware of any consensus among security-conscious C programmers to avoid it. For instance, BoringSSL's CBS abstraction (a fairly recent invention used to safely read from buffers of known length) uses pointer arithmetic. So I disagree with the article on this point.

* I do not like byte-swapping functions that convert integers to integers because I don't like the idea that an integer variable could hold a byte-swapped value, with no assurance from the type system that the value won't be used without swapping it to native byte order. Abstractions that convert byte strings to integers and vice versa are clearer, whether they are platform-neutral or (for performance) platform-specific. So I agree with the article on this point as a matter of personal taste, but I haven't seen evidence of a strong consensus that all code should avoid integer byte swapping.

* Manipulating a pointer and length in lock-step, by hand, isn't a great practice. My preferred solution is a lightweight, safe abstraction like the aforementioned CBS. If that isn't desirable for some reason, base/offset/length as suggested by the article is reasonable, carrying the inconvenience of having to add base and offset frequently. Pointer/end is also reasonable, instead carrying the inconvenience of subtracting pointer from end frequently.

* Reusing system error codes isn't a great practice, but it's an easy one. It is more important to have good, matching error codes and strings for errors you expect people to see in practice than for errors you do not. Malformed network input data isn't common (for binary formats), so seeing ENOBUFS or EINVAL being abused for that purpose doesn't really throw a red flag for me.

* The line "the code failed to sanitize the original input, allowing the hacker to drive computation this deep in the system" concerns me, because it seems to suggest that input buffer overruns should be prevented by a well-formedness check separate from parsing. For complicated inputs I don't think that is a good practice; drift between the sanitization code and the parsing code won't be obvious to a reader and will almost always result in a CVE.