Saturday, January 16, 2016

Some notes on C in 2016

On r/programming there was a post called "How to C (as of 2016)". It has some useful advice, but also some bad advice. I thought I'd write up comments on the topic. As somebody mentioned while I was writing this, only responsible programmers should be writing in C. Irresponsible programmers should write in other languages that have more training wheels. These are the sorts of things responsible programmers do.

Use a debugger
The #1 thing you aren't doing, that you should be doing, is stepping through each line of code in a source level debugger as soon as you write it. If you only pull out the debugger to solve particularly difficult problems, then you are doing it wrong.

That means using an IDE like Visual Studio, XCode, or Eclipse. If you are only using an editor (without debugging capabilities), you are doing it wrong. I mention this because so many people are coding in editors that don't have debuggers. I don't even.

It's a concern for all languages, but especially with C. When memory gets corrupted, you need to be able to dump structures and memory in order to see that. Why is x some weird value like 37653? Using printf() style debugging won't tell you, but looking at a hexdump of the stack will clearly show you how the entire chunk of memory was overwritten.

And debug your own code

Because C has no memory protection, a bug in one place can show up elsewhere, in an unrelated part of code. This makes debugging some problems really hard. In such cases, many programmers throw up their hands, say "I can't fix this", and lean on other programmers to debug their problem for them.

Don't be that person. Once you've gone through the pain of such bugs, you quickly learn to write better code. This includes better self-checking code that makes such bugs show up quicker, or better unit tests that cover boundary cases.

Code offensively
I once worked on a project where the leaders had decided to put "catch(...)" (in C++) everywhere, so that the program wouldn't crash. Exceptions, even memory corruption, would be silently masked and the program would continue. They thought they were making the code more robust. They thought it was defensive programming, a good principle.

No, that isn't defensive programming, just stupid programming. It masks bugs, making them harder to find in the long run.

You want to do the reverse. You want offensive code such that bugs cannot survive long undetected.

One way is assert(), double checking assumptions that you know must always be true. This catches bugs before they have a chance to mysteriously corrupt memory. Indeed, when debugging mystery C bugs, I'll often begin by adding assert() everywhere I suspect there might be a problem. (Although, don't go overboard on asserts.)
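For instance, here's a minimal sketch of what assert()-heavy code looks like (the ring-buffer function and its invariants are invented for illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical ring-buffer accounting: the assert() calls document
 * the invariants that must always hold, so corruption is caught at
 * the source instead of surfacing mysteriously somewhere else. */
size_t ring_used(size_t head, size_t tail, size_t capacity)
{
    assert(capacity > 0);          /* buffer must exist */
    assert(head < capacity);       /* indices must stay in range */
    assert(tail < capacity);
    return (head + capacity - tail) % capacity;
}
```

If memory corruption ever drives head or tail out of range, the program dies right there, at the first function that touches the bad value.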

The best offensive coding is unit tests. Any time something is in doubt, write a unit test that stresses it. The C language has a reputation for going off in the weeds when things scale past what programmers anticipated, so write a test for such cases.

Code for quality

On a related note, things like unit tests, regression tests, and even fuzz testing are increasingly becoming the norm. If you have an open-source project, you should expect to have "make test" that adequately tests it. This should unit test the code with high code coverage. It's become the standard for major open-source projects. Seriously, "unit test with high code coverage" should be the starting point for any new project. You'll see that in all my big open-source projects, where I start writing unit tests early and often (albeit, because I'm lazy, I have inadequate code coverage).

AFL fuzzer is relatively new, but it's proving itself useful at exposing bugs in all sorts of open-source projects. C is the world's most dangerous language for parsing external input. Crashing because of a badly formatted file or bad network packet was common in the past. But in 2016, such nonsense is no longer tolerated. If you aren't nearly certain no input will crash your code, you are doing it wrong.

And, if you think "quality" is somebody else's problem, then you are doing it wrong.

Stop with the globals
When I work with open-source C/C++ projects, I tear my hair out from all the globals. The reason your project is tough to debug and impossible to make multithreaded is that you've littered it with global variables. The reason refactoring your code is a pain is because you overuse globals.

There's occasionally a place for globals, such as the debug/status logging system, but otherwise it's a bad, bad thing.

A bit OOP, a bit functional, a bit Java
As they say, "you can program in X in any language", referring to the fact that programmers often fail to use the language as it was intended, but instead try to coerce it into some other language they are familiar with. But that's just saying that dumb programmers can be dumb in any language. Sometimes counter-language paradigms are actually good.

The thing you want from object-oriented programming is how a structure conceptually has both data and methods that act on the data. For struct Foobar, you create a series of functions that look like foo_xxxx(). You have a constructor foo_create(), a destructor foo_destroy(), and a bunch of functions that act on the structure.

Most importantly, when reasonable, define struct Foobar in the C file, not the header file. Make the functions public, but keep the precise format of the structure hidden. Everywhere else refer to the structure using forward references. This is especially important for libraries, where exporting their headers destroys ABI compatibility (because structure size changes). If you must export the structure, then put a version or size as its first member.

No, the goal here isn't to emulate the full OOP paradigm of inheritance and polymorphism. Instead, it's just a good way of modularizing code that's similar to OOP.
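A minimal sketch of this pattern, using the hypothetical struct Foobar from above (both "files" shown together for brevity):

```c
/* foobar.h -- the public interface: the struct is only forward-declared,
 * so callers can hold a pointer but never see (or depend on) the layout. */
struct Foobar;                     /* opaque: size and fields hidden */
struct Foobar *foo_create(int initial);
int  foo_get(const struct Foobar *f);
void foo_destroy(struct Foobar *f);

/* foobar.c -- the only place the layout is visible; it can change
 * freely without breaking ABI compatibility for callers. */
#include <stdlib.h>
struct Foobar {
    int value;
};
struct Foobar *foo_create(int initial)
{
    struct Foobar *f = malloc(sizeof(*f));
    if (f)
        f->value = initial;
    return f;
}
int foo_get(const struct Foobar *f)
{
    return f->value;
}
void foo_destroy(struct Foobar *f)
{
    free(f);
}
```

Callers work entirely through pointers and the foo_xxxx() functions, so adding a field to struct Foobar later requires no recompile of their code.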

Similarly, there are some good ideas to pull from functional programming, namely that functions don't have "side effects". They consume their inputs, and return an output, changing nothing else. The majority of your functions should look like this. A function that looks like "void foobar(void);" is the opposite of this principle, being a side-effect only function.

One area of side-effects to avoid is global variables. Another is system calls that affect the state of the system. Globals are similar to deep variables within structures, where you call something like "int foobar(struct Xyz *p);" that hunts deeply in p to find the parameters it acts on. It's better to bring them up to the top, such as calling "foobar(p->length, p->socket->status, p->bbb)". Yes, it makes the parameter lists long and annoying, but now the function "foobar()" depends on simple types, not a complex structure.

Part of this functional attitude to programming is being aggressively const correct, where pointers are passed in as const, so that the function can't change them. It communicates clearly which parts are output (the return value and non-const pointers), and which are the inputs.
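A short sketch of what such a function looks like (count_byte() is a made-up example): the inputs are const, and the return value is the only output.

```c
#include <stddef.h>

/* A side-effect-free, const-correct function: the buffer is passed as
 * const so the compiler enforces that it cannot be modified. The only
 * output is the return value; nothing else in the program changes. */
size_t count_byte(const unsigned char *buf, size_t len, unsigned char b)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        if (buf[i] == b)
            n++;
    return n;
}
```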

C is a low-level systems language, but except in dire circumstances, you should avoid those C-isms. Instead of C-specific code, you should write your code in terms of the larger C-like language ecosystem, where code could be pasted into JavaScript, Java, C#, and so on.

That means no pointer arithmetic. Yes, in the 1980s, this made code slightly faster, but since the 1990s, it provides no benefit, especially with modern optimizing compilers. It makes code hard to read. Whenever there is a cybersecurity vulnerability in open-source (Heartbleed, Shellshock, etc.), it's almost always in pointer-arithmetic code. Instead, define an integer index variable, and iterate through arrays that way -- as if you were writing this in Java.
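A sketch of the index-based style (sum_array() is a made-up example):

```c
#include <stddef.h>

/* Index-based loop, as you'd write it in Java: the bounds check is
 * explicit and readable, with no pointer arithmetic to audit. */
int sum_array(const int *arr, size_t count)
{
    int total = 0;
    for (size_t i = 0; i < count; i++)   /* not: while (p < end) *p++ */
        total += arr[i];
    return total;
}
```

Modern compilers generate the same machine code for this as for the pointer-incrementing version, so there's nothing to lose.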

This ideal also means stop casting structs/integers for network-protocol/file-format parsing. Yes, that networking book you love so much taught you to do this, using things like "ntohs(*(short*)p)", but it was wrong when the book was written, and is even more wrong now. Parse the integers like you would have to in Java, such as "p[0] * 256 + p[1]". You think casting a packed structure on top of data to parse it is the most "elegant" way of doing it, but it's not.
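A sketch of the byte-by-byte style (parse_be16() is a name invented here):

```c
#include <stdint.h>

/* Parse a big-endian 16-bit integer byte-by-byte, as you would have
 * to in Java. This works regardless of host endianness or alignment,
 * unlike casting a packed struct (or short*) on top of the data. */
uint16_t parse_be16(const unsigned char *p)
{
    return (uint16_t)(p[0] * 256 + p[1]);
}
```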

Ban unsafe functions
Stop using deprecated functions like strcpy() and sprintf(). When I find a security vulnerability, it'll be in these functions. Moreover, it makes your code horribly expensive to audit, because I'm going to have to look at every one of these and make sure you don't have a buffer-overflow. You may know it's safe, but it'll take me a horrendously long time to figure out for myself. Instead, use strlcpy()/strcpy_s() and snprintf()/sprintf_s().

More generally, you need to be really comfortable knowing what both a buffer-overflow and an integer-overflow are. Go read OpenBSD's reallocarray(), understand why it solves the integer-overflow problem, then use it instead of malloc() in all your code. If you have to, copy the reallocarray() source from OpenBSD and stick it in your code.
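To see why reallocarray() helps, here is a sketch of the overflow check it performs, modeled loosely on the OpenBSD implementation (my_reallocarray() is the local copy the paragraph above suggests making):

```c
#include <stdlib.h>
#include <stdint.h>
#include <errno.h>

/* Sketch of an OpenBSD-style reallocarray(): it refuses to allocate
 * when nmemb * size would overflow, instead of silently wrapping
 * around and returning a buffer far smaller than the caller expects
 * (which is exactly the bug calloc()-free code trips over). */
#define MUL_NO_OVERFLOW ((size_t)1 << (sizeof(size_t) * 4))

void *my_reallocarray(void *ptr, size_t nmemb, size_t size)
{
    if ((nmemb >= MUL_NO_OVERFLOW || size >= MUL_NO_OVERFLOW) &&
        nmemb > 0 && SIZE_MAX / nmemb < size) {
        errno = ENOMEM;            /* multiplication would overflow: fail loudly */
        return NULL;
    }
    return realloc(ptr, nmemb * size);
}
```

With ptr == NULL this behaves like an overflow-checked malloc(), and since it's built on realloc() it also handles growing variable-sized arrays.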

You know how your code mysteriously crashes on some input? Unsafe code is probably the reason. Also, it's why hackers break into your code. Do the right things, and these problems disappear.

The "How to C" post above tells you to use calloc() everywhere. That's wrong, it still leaves open the integer overflow bug on many platforms. Also, get used to variable-sized thingies, which means using realloc() a lot -- hence reallocarray().

There's much more to writing secure code, but if you do these, you'll solve most problems. In general, always distrust input, even when it's from a local file or USB port you control.

Stop with the weird code

What every organization should do is organize an after-work meeting, where anybody can volunteer to come and hash out an agreement for a common style-guide for the code written in the organization. Then fire every employee who shows up. It's a stupid exercise.

The only correct "style" is to make code look unsurprisingly like the rest of the code on the Internet. This applies to private code, as well as any open-source you do. The only decision you have to make is to pick an existing, well-known style guide to follow, like Linux, BSD, WebKit, or GNU.

The nice thing about other languages, especially Python, is that there isn't the plethora of common styles like there is in C. That was one of the shocking things about the Heartbleed vulnerability, that OpenSSL uses the "Whitesmiths" style of braces, which was at one time common but is now rare and weird. LibreSSL restyled it to the BSD format. That's probably a good decision: if your C style is fringe/old, it may be worth restyling it to something common/average.

You know that really cool thing you've thought of, and think everyone will adopt once they see the beauty of it in your code? Yea, remove that thing, it just pisses everyone off. Or, if you must use that technique (it happens sometimes), document it.

The future is multicore
CPUs aren't going to get any faster. Instead, we'll increasingly get more CPU cores on the chip. That doesn't mean you need to worry about making your code multithreaded today, but it means you should probably consider what that might look like in the future.

No, mutexes and critical sections don't make your code more multicore. Sure, they solve the safety problem, but at an enormous cost to performance. It means your code might get faster for 2 or 3 cores, but after that, adding cores will instead make your software go slower. Fixing this, achieving multicore scalability, is second only to security in importance to the C programmer today.

One of these days I'll write a huge document on multicore scalability, but in the meanwhile, just follow the advice above: get rid of global variables and the invisible sharing of deep data structures. When we have to go in later and refactor the code to make it scale, our jobs will be significantly easier.

Stop using true/false for success/failure
The "How to C" document tells you that success always means true. This is garbage. True means true, success means success. They don't mean each other. You can't avoid the mass of code out there that returns zero on success and some other integer for failure.

Yea, it sucks that there is no standard, but there is never going to be one. Instead, the wrong advice of the "How to C" doc is a good example of the "Stop with the weird code" principle. The author thinks if only we could get everyone to do it his way, if we just do it hard enough in our own code to enlighten others, then the problem is going to be solved. That's bogus; programmers aren't ever going to agree on the same way. Your code has to exist in a world where ambiguity exists, where both true and 0 are common indicators of success, despite being opposite values. The way to do that is to unambiguously define SUCCESS and FAILURE values.

When code does this:

if (foobar(x,y)) { ...; } else { ...; }

There's no way that I, reading your code, can easily know which case is "success" and which is failure. There are just too many standards. Instead, do something like this:

if (foobar(x,y) == Success) { ...; } else { ...; }
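A minimal sketch of one way to set that up (the enum values and the foobar() body are invented for illustration):

```c
/* Unambiguous status values: the call site reads as English, no
 * matter which true/false or zero/nonzero convention the rest of
 * the codebase happens to follow. */
enum Status { Success = 0, Failure = 1 };

enum Status foobar(int x, int y)   /* illustrative stand-in function */
{
    return (x <= y) ? Success : Failure;
}
```

Now "if (foobar(x,y) == Success)" is self-documenting, and a reader never has to go look up which convention the function uses.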

The thing about integers

The "How to C" guide claims there's no good reason to use naked "int" or "unsigned", and that you should use better defined types like int32_t or uint32_t. This is nonsense. The input to many library functions is 'int' or 'long' and compilers are getting to be increasingly type-savvy, warning you about the difference even when both are the same size.

Frankly, getting integers 'wrong' isn't a big source of problems. That even applies to 64-bit and 32-bit issues. Yes, using 'int' to hold a pointer will break 64-bit code (use intptr_t or ptrdiff_t or size_t instead), but I'm astonished how little this happens in practice. Just mmap() the first 4 gigabytes as invalid pages on startup and run your unit/regression test suite, and you'll quickly find any problems; I don't really need to tell you how best to fix them.

But the biggest annoyance in code is that programmers want to redefine integer types. Stop doing that. I know having "u32" in your code makes it pretty for you, but it just makes it more annoying for me, who has to read your code. Please use something standard, such as "uint32_t" or "unsigned int". Worse still is arbitrarily creating integer types like "filesize". I know you want to decorate the integer with additional meaning, but the point of C programming is "low level", and this just annoys the heck out of programmers.

Use static and dynamic analysis

The old hotness in C was "warning levels" and "lint", but in modern C we have "static analysis". Clang has dramatically increased the state-of-the-art of the sorts of things compilers can warn about, and gcc is busy catching up. Unknown to many, Microsoft has also had some Clang-quality static analysis in its compilers. XCode's ability to visualize Clang's analysis is incredible, though you get the same sort of thing with Clang's web tools.

But that's just basic static analysis. There are many security-focused tools that take static analysis to the next level, like Coverity, Veracode, and HP Fortify.

All these things produce copious "false positives", but that's a misnomer. Fixing code to accommodate the false positive cleans up the code and makes it dramatically more robust. In other words, such things are often "this code is confusing", and the solution is to clean it up. Coding under the constraint of a static analyzer makes you a better programmer.

Dependency hell

In corporations, after a few years, projects become unbuildable, except on the current build system. There are just too many poorly documented dependencies. One company I worked for joked they should just give their source to their competitors, because they'd never be able to figure out how to get the thing to build.

And companies perpetuate this for very good-sounding reasons. Another company proposed standardizing on compiler versions, to avoid the frequent integration problems from different teams using different compilers. But that's just solving a relatively minor problem by introducing a major problem down the road. Solving integration problems keeps the code healthy.

Open-source has related problems. Dependencies are rarely fully documented, and indeed are often broken. Many is the time you end up having to install two incompatible versions of the same dependency in order to get the resulting code finally compiled.

The fewer the dependencies, the more popular the code. There are ways to achieve this.

Remove the feature. As a whole, most dependencies exist for some feature that only 1% of the users want, but which burden the other 99% of the user base.

Include just the source file you need. Instead of depending on the entire OpenSSL library (and its dependencies), just include the sha2.c file if that's the only functionality from OpenSSL that you need.

Include their entire source in your tree. For example, Lua is a great scripting language in 25kloc that really needs no updates. Instead of forcing users to hunt down the right lua-dev dependency, just include the Lua source with your source.

Load libraries at runtime, through the use of dlopen(), and include their interface .h files as part of your project source. That means users aren't burdened by the dependency unless they use that feature. Or, if it's a necessary feature, you can spit out error messages with more help on fixing the dependency.
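A minimal sketch of the dlopen() approach, using libm's cos() as a stand-in for whatever optional library you'd wrap (the "libm.so.6" name assumes Linux/glibc):

```c
#include <dlfcn.h>
#include <stdio.h>

/* Load an optional library at runtime instead of linking against it.
 * If the user never touches this feature, the dependency never bites;
 * if the library is missing, we can print a helpful message instead
 * of failing to even start. */
double (*load_cos(void))(double)
{
    void *lib = dlopen("libm.so.6", RTLD_LAZY);
    if (!lib) {
        fprintf(stderr, "optional feature unavailable: %s\n", dlerror());
        return NULL;               /* degrade gracefully */
    }
    return (double (*)(double))dlsym(lib, "cos");
}
```

On POSIX systems this needs the dl library at link time (historically -ldl), which is a build-time dependency only on something every system already has.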

Understand undefined C

You are likely wrong about how C works. Consider the expression (x + 1 < x). A valid result of this is '5'. It's because C doesn't define what happens when 'x' is the maximum value for an integer and you add 1 to it, causing it to overflow. One thing many compilers do is treat this as two's complement overflow, similar to how other languages (such as Java) do. But some compilers have been known to treat this as an impossibility, and remove the code depending on it completely.

Thus, instead of relying upon how the current version of your C compiler happens to work, you really need to code according to the larger spec of how C compilers may behave.
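As a sketch, the defined way to test for the overflow in the (x + 1 < x) example is to compare against the limit before doing the arithmetic (would_overflow_inc() is a name invented here):

```c
#include <limits.h>

/* Don't write (x + 1 < x): signed overflow is undefined behavior,
 * and an optimizing compiler may delete the test entirely. Instead,
 * compare against the limit BEFORE doing the arithmetic, using only
 * operations the spec defines for every value of x. */
int would_overflow_inc(int x)
{
    return x == INT_MAX;
}
```

Every compiler, at every optimization level, must evaluate this the same way, which is the whole point.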

Conclusion

Don't program in C unless you are responsible. Responsible means understanding buffer-overflows, integer overflows, thread synchronization, undefined behaviors, and so on. Responsible means coding for quality that aggressively tries to expose bugs early. This is the nature of C in 2016.

this is wrong. strlcpy is no better than strncpy -- the result may be \0 terminated but it's still undefined. silent truncation yields undefined results which can be as bad as the stack or heap smash that strcpy might have allowed. allowing a program to reach an undefined state is bad software engineering. there are no shades of grey in "undefined"-ness.

if you're sure that the result will fit, then code a strncpy-like function that abort()'s if you're wrong. if you're not sure the result will fit, then check it yourself at runtime and do something sensible (that isn't silent truncation.)
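One possible reading of that suggestion, sketched as code (strcpy_or_die() is a name made up here, not a standard function):

```c
#include <stdlib.h>
#include <string.h>

/* A copy that abort()s rather than silently truncating: if the
 * caller's size calculation was wrong, the program dies loudly at
 * the point of the mistake instead of continuing in an undefined
 * state with a truncated string. */
void strcpy_or_die(char *dst, size_t dstsize, const char *src)
{
    size_t len = strlen(src);
    if (len >= dstsize)
        abort();                   /* result would not fit: fail loudly */
    memcpy(dst, src, len + 1);     /* + 1 copies the trailing '\0' too */
}
```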

I too disagree with many (most?) of Matt's points in his original post. Overall he trains the programmer into taking care of many more exceptions than needed, resulting in poorer code with many more traps to fall into. The best example is the lazy struct initialization which doesn't initialize everything. Another example is the recommendation to use calloc() instead of malloc(), resulting in people completely forgetting about memset() and forgetting it after a realloc() as well.

I also disagree with the recommendation on integers. You should use "int" when you want an int, you have to use uint8_t when you want an 8-bit unsigned int, etc. Forcing an 8-bit AVR to open-code a 32-bit counter for a for() loop scanning 10 items is a total waste of CPU cycles and program code. And making a 32- or 64-bit platform implement a loop on uint8_t is inefficient as well. Not to mention the extra complexity required in printf formats... The proper solution is to teach people all available types in modern compilers and to train them to use the type that matches what they *need*.

I don't like inlined variable declarations; they often shadow other ones, and require the reader to visually scan the whole function to find where that damn declaration was hidden. By putting your variables at the top of a block, the reader only has to visually scan each outer block.

I don't like to use bools because results are rarely white or black; often you need several degrees. At least "OK", "recoverable error", "fatal error". You mentioned the possible use of strlcpy(); for me it's a perfect example where the caller would like to know if the string was truncated, because for a number of use cases it will have to be considered a failure, and only the caller knows.

Code formatters make your code unversionable and unmaintainable.

It is very important to teach people to write code that *makes sense* and not *valid code*. Developers spend maybe 10x more time reading their code than writing it. It is critically important that code is easy to understand and non-ambiguous. Writing bugs is fine. Bugs are everywhere. What is important is not to avoid bugs, it's to ensure they're quickly spotted because what you read suddenly doesn't make sense anymore. And comments are there for this reason : explain why you are doing what you do. If someone disagrees, possibly he found a bug in your code. And in general you'll find that code written in this spirit fails far less often than code written according to Matt's way.

However I appreciate that someone started such a work of collecting best practices; at least it has ignited a flow of responses and may lead to people picking what they like best and taking care of the documented traps.