The Legacy of the C Programming Language!

I started learning C in 1996, and have had fun programming in it for many years now. It was also the first programming language for most of my classmates. Most students today learn languages like Java (no doubt a safer language to program in, and hence a good option to start with), but I think they miss the fun of programming for the machine.

For example, I remember writing a C program that switched on the keyboard Caps Lock (without pressing the actual key). More fun was graphics programming by writing directly to video memory, and the fancy things I could do by creating icons and windows (in the old days of DOS) by switching a matrix of pixels on and off, as needed.

C is not a perfect language, and writing programs in C is often like walking (or running) on a slippery slope. As Dennis Ritchie himself commented, “C is quirky, flawed, and an enormous success.”

C is quirky; take, for instance, the way arrays, strings and pointers are related, and how this relationship can be exploited. As an example:

while(*t++ = *s++);

Given that s is the source string to be copied and t is the destination, this while loop copies the string from s to t. This terse code is possible because of the following: strings are implemented as arrays of characters, and a string is referred to by the address of (a pointer to) its first character. We can traverse an array by starting from its base and performing pointer arithmetic to access the elements.

In this code, as long as the character copied from the source is not the null character, the value of the assignment expression in the while condition is non-zero (which is considered true), and hence the loop keeps copying characters to the destination. When the source character is the terminating null character (\0), it is still copied, but the value of the condition is then zero, and hence the loop terminates. (Note that this is the null character, not the NULL pointer constant.) The result is that the whole string, including its terminator, is copied from source to destination.

Of course, lots of things can go wrong in code like this.

Here, in the expression *s++, it is difficult to find out which operator has higher precedence — is it dereference (*) or is it postfix increment (++)?

If you look at the large operator precedence table, you’ll find that postfix increment (++) has higher precedence than dereference (*), and hence the expression is grouped as *(s++): the ++ applies to s, and the * applies to the result of s++.

However, because ++ is postfix here, the value of the expression s++ is the value of s before the increment; the increment itself is only guaranteed to have taken effect by the next sequence point (here, the end of the loop condition). Hence *s++ yields the current character of the string to which s points.

Also, from *s++, it is not clear whether the ++ applies to the pointer into the string, or to the character it points to. Since ++ binds more tightly than *, it applies to the pointer s, which has the effect of advancing s to the next character.

Further, in the while loop, we purposefully use = instead of == (to assign the character). As you know, this behaviour is prone to bugs; in fact, mistyping = instead of == is one of the most common sources of bugs in C.

Similarly, there are many other quirks. Consider break and continue statements, for example. The break statement can be used within switch statements or the body of loops (while, for, and do-while). However, the continue statement can be used only within the body of loops, and not within switch statements. That’s a quirk.

By default, if we forget to use a break statement at the end of a case in a switch statement, control will fall through to the statements of the next case. If you think about it, it would have made sense to allow continue there as well: it could explicitly direct the control flow to continue to the next case, instead of fall-through being the default behaviour. This could also have prevented countless bugs caused by forgetting break statements within switch statements.

Because of quirks like this, C is perhaps one of the very few programming languages to have an entire book written about its “traps and pitfalls” (‘C Traps and Pitfalls’, Andrew Koenig, Addison-Wesley, 1989).

C is also flawed in many ways. For example, consider the following statement:

if(variable & BIT_FLAG != 0)

What we are perhaps trying to do here is to check if the variable has the BIT_FLAG set on or not. However, the expression would be treated as if( variable & (BIT_FLAG != 0) ) and not as if( (variable & BIT_FLAG) != 0 ). Why is this?

This is because the equality operators (== and !=) have higher precedence than the bitwise operators (&, ^ and |). However, the shift operators (>> and <<) have higher precedence than the equality operators, as one would expect. Then why this mistake?

An old mail from Dennis Ritchie explains how this happened:

---

From decvax!harpo!npoiv!alice!research!dmr Fri Oct 22 01:04:10 1982

Subject: Operator precedence

Newsgroups: net.lang.c

The priorities of && || vs. == etc. came about in the following way. Early C had no separate operators for & and && or | and ||. (Got that?) Instead it used the notion (inherited from B and BCPL) of “truth-value context”: where a Boolean value was expected, after “if” and “while” and so forth, the & and | operators were interpreted as && and || are now; in ordinary expressions, the bitwise interpretations were used. It worked out pretty well, but was hard to explain. (There was the notion of “top-level operators” in a truth-value context.)

The precedence of & and | were as they are now. Primarily at the urging of Alan Snyder, the && and || operators were added. This successfully separated the concepts of bitwise operations and short-circuit Boolean evaluation. However, I had cold feet about the precedence problems. For example, there were lots of programs with things like if (a==b & c==d)…

In retrospect, it would have been better to go ahead and change the precedence of & to higher than ==, but it seemed safer just to split & and && without moving & past an existing operator. (After all, we had several hundred kilobytes of source code, and maybe 3 installations….)

Though we should not assign too much importance to language popularity and ratings, it is still noteworthy that C continues to be one of the popular languages in the world today. What is also remarkable is that other popular languages, such as Java, C++, and C#, are heavily influenced by C, and are its direct or indirect descendants (though other languages like Simula and Smalltalk have had more influence on them when it comes to OO). This influence is obvious in the basic data types, operators, keywords, syntax (such as using curly braces for blocks), etc., of these languages.

What is not obvious is the influence of C on various other aspects of these languages, such as semantics and pragmatics. For example, C was one of the first few languages to separate I/O functions from the core language and move them into a supporting library; the languages mentioned above follow this tradition.

C is clearly not the cleanest language ever designed, nor the easiest to use, so why do many people use it? This is the question that you might find yourself asking.

Here is Stroustrup’s answer to it: “It is flexible [to apply to any programming area]… It is efficient [due to low-level semantics of the language]… It is available [due to availability of C compilers in essentially every platform]… It is portable [can be executed on multiple platforms, even though the language has many non-portable features]…”

Let’s discuss each of these points now.

C is a powerful and flexible systems programming language. Though it was originally designed for writing the UNIX OS, it is today used for a wide variety of system programming, such as database management systems, compilers and virtual machines, Web servers, text-processing systems, telephone-switching systems, etc.

All this is because of its flexibility in various ways. For example, C is not strongly typed, whereas many of the newer languages are. Strong typing allows one to catch mistakes (related to data-type usage) early, and hence helps develop more robust code. For example, K&R C (i.e., C before standardisation) allowed implicit conversions between pointers and integers, which is bug-prone. Standard C no longer allows such conversions without an explicit cast, but the conversions themselves still have their uses.

Today, most C compilers do stronger type-checking, and warn of potential mistakes. Still, C allows us to override these checks. The ability to write type-unsafe code is useful and important because it is often required in low-level programming tasks such as writing device drivers.

C is so flexible in its syntax that it is possible to misuse its flexibility, for example, to write obfuscated code in it; in fact, since 1984, there has been a yearly contest for writing obfuscated C code: The International Obfuscated C Code Contest!

C is also efficient. For example, unlike most other languages today, C has almost no runtime. When a C program runs, what exist are the memory-management routines, etc., but there is no sophisticated runtime support. Comparing the C runtime with the Java runtime may seem unfair, since Java is meant for application programming, but the comparison helps in understanding what we mean by “almost no runtime” support.

The JVM is a sophisticated runtime that performs various tasks: it checks the validity of the bytecode to be executed (whether the code is well-formed, whether the program is safe to execute, etc.), loads and unloads bytecode from Java class files as needed, supports garbage collection (GC), and checks instructions before executing them, throwing an exception if needed (for divide-by-zero, out-of-bounds access, etc.).

C’s syntax and semantics are so close to the machine that it is often considered a low-level programming language, almost a thin abstraction over assembly code. Today’s enterprise-quality optimising C compilers can often generate code that is as fast as hand-written assembly.

Further, the code generated by C compilers also tends to be (arguably) small. It can be debated whether some of C’s features help in that. Consider the example of the ++ operator. Ken Thompson, while implementing the B compiler, noticed that increment operations generate more compact code than adding 1 to a variable and storing it back. In other words, an increment was one instruction, while adding 1 and storing it back was more than one instruction. So he implemented the prefix ++ (and --) operators, and generalised them by adding postfix versions. This tradition of prefix and postfix operators continues today in languages like Java.

“C is also portable.” This statement might seem surprising in the context of VM languages such as Java, which are popular today because they work without change (well, experience indicates, it is mostly without change) on different platforms. C is also available on almost all platforms today. The original C compiler written by Dennis Ritchie had numerous dependencies on the features of the PDP-11 (an early machine for which C was first implemented). Around 1978, Steve Johnson wrote a portable compiler for C (the Portable C Compiler, or pcc for short, now available under a BSD licence). It made the task of porting the compiler to various other machines and platforms easy.

This helped C spread fast, together with the fast growth of UNIX installations (C was the primary language for UNIX machines). Today it is used everywhere from embedded devices (such as microwave ovens) to supercomputers. However, given the fact that C is a low-level language, its portability is surprising. Consider pointer arithmetic as an example of portability.

Given that a pointer is an abstraction of a (machine) address, one would logically assume that pointer arithmetic would require knowing the size of the data type to which the pointer points. However, since the data type is encoded in the pointer type, the compiler automatically calculates the sizes required for pointer arithmetic. For this reason, pointer arithmetic, though low-level, is portable, since the size of the data types is abstracted away.

Of course, the size of the data type itself differs from machine to machine. For example, the size of the int data type is implementation-dependent. However, this implementation dependency aids efficiency: the compiler can use the native size of the data type on that machine, and hence produce faster code. If the size of int were fixed in the language, say at 4 bytes, it would be difficult (or even impossible) to port C to tiny machines; further, even where it was possible, such hard-coding would make the compiled programs comparatively inefficient.

Now, consider another example for portability: using floating-point types in the switch statement. In C, we can use only integral types in switch-case statements, and not floating point types. This is because, if it were allowed, it would require direct comparison of floating-point numbers, which is not portable.

The reason is that the implementation of floating-point numbers can differ across machines (though these days the IEEE 754 standard for floating-point arithmetic is almost universally followed).

Many real numbers cannot be represented exactly in the floating-point format; for example, the common real number 0.1. Hence, comparing floating-point numbers directly in equality checks can give wrong results, and, worse, different results on platforms with different implementations of floating-point numbers. If switch statements were to allow floating-point numbers, then comparing the value of the switch expression with the value of each case label would require direct comparison, the results of which would vary across platforms. Hence, floating-point values are not allowed in switch statements in C! (Though some modern languages allow it, they also mandate that implementations follow the IEEE 754 standard, and hence it is not much of a problem in those languages.)

To summarise, C is an interesting language to learn, and is fun to work with. It is also a small language, and behind its veil of simplicity lies power — it just requires many years of experience to understand and appreciate this fact.