Monday, September 13, 2010

In response to a previous article a poster in some forum called me an idiot and said "everybody knows that C is a portable assembly language with some syntax sugar." The idiot insult hurt deeply and I cried, but once the tears were dry I resolved to write a bit more on why C isn't assembly, or is at best an assembly for a strange, lobotomized machine. Also, I may have misspelled "more on" in this article's title.

To compare the two things I kinda need to define what I'm comparing. By C I mean an ANSI C standard, whichever version floats your boat. Most real C implementations will go some distance beyond the standard(s), of course, but I have to draw the line somewhere. Assembly is more problematic to define because there have been so many and some have been quite different. But I'll wave my hands a bit and mean standard assembly languages for mainstream processors. To level set: my assembly experience is professionally all x86 and I had a bit of RISC stuff in college plus I did some 6502 back in the day.

Assembly-Like Bits

There are some very assembly-like things about C and it's worth mentioning them. The main one is that C gives you very, very good control over the layout of data structures in memory. In fact, on modern architectures, I'd bet that this fact is the primary ingredient in the good showing that C (and C++) have in the language shootout. Other languages produce some damn fine executable code, but their semantics make it hard to avoid pointer indirection, poor locality of reference, or other performance losses.
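To make the layout point concrete, here's a minimal sketch. The struct and function names are made up for illustration. An array of these records is one contiguous block of memory, so walking it needs no per-element pointer chasing. The exact size and padding are up to the implementation, though the members are guaranteed to appear in declaration order.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record type, for illustration only. */
struct particle {
    float x, y, z;   /* position */
    uint32_t id;     /* identifier */
};

/* Sum the x field over a contiguous array of records.
   Element i lives at exactly base + i * sizeof(struct particle). */
float sum_x(const struct particle *p, size_t n)
{
    float total = 0.0f;
    for (size_t i = 0; i < n; i++)
        total += p[i].x;
    return total;
}
```

In many higher-level languages the equivalent array would hold references to separately allocated objects, which is where the extra indirection and poorer locality come from.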

There are a few other assembly-ish things in C. But not many. Now to why C is just not assembly. I'll start with...

The Stack

As far as I'm concerned the number one thing that makes C not assembly is "The Stack." On entry into a function C allocates space needed for bookkeeping and local variables and on function exit the memory gets freed. It's very convenient, but it's also very nearly totally opaque to the programmer. That's completely unlike any mainstream assembly where "the stack" (lower case) is just a pointer, probably held in a register. A typical assembly has no problem with you swapping the stack (e.g. to implement continuations, coroutines, whatever) while C defines no way to do much of anything with the stack. The most power that C gives you on the stack is setjmp/longjmp - setjmp captures a point on the stack and longjmp lets you unwind back to that point. setjmp/longjmp certainly have an assembly-like feel to them when compared to, say, exceptions, but they are incredibly underpowered when compared to the stack swizzling abilities of a typical assembly language.
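For anyone who hasn't used them, here's a rough sketch of what setjmp/longjmp buy you. The function names are hypothetical. Note that all you can do is unwind to a frame that is still live; there is no way to jump onto a different stack.

```c
#include <setjmp.h>

static jmp_buf recovery_point;

/* Recurse a few frames down, then bail out in one jump. */
static void deep_work(int depth)
{
    if (depth == 0)
        longjmp(recovery_point, 1);  /* unwind straight back to the setjmp call */
    deep_work(depth - 1);
}

/* Returns 1 if control came back via longjmp, 0 otherwise. */
int run_with_escape(int depth)
{
    if (setjmp(recovery_point) == 0) {
        deep_work(depth);            /* never returns normally */
        return 0;
    }
    return 1;                        /* reached only through longjmp */
}
```

Calling run_with_escape(5) goes five frames deep and comes back in a single jump. That's the whole trick, and it only works in the "up" direction.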

Speaking of continuations, let's talk about...

Parallel and Concurrent Code

The C standard has almost nothing to say about parallel and concurrent code except that the "volatile" keyword means that a variable might be changed concurrently by something else so don't optimize. That's about it. Various libraries and compilers have something to say on the subject, but the language is pretty quiet. A real assembly, or at least a real assembly on any modern architecture, is going to have all kinds of stuff to say about parallelism and concurrency: compare and swap, memory fences, Single Instruction Multiple Data (SIMD) operations, etc. And that leads a bit into...

Optimization

Under gcc -std=c99 -S at -O1 I get pretty standard sequential code. Under -O3 I get highly parallel code using x86 SIMD operations. An assembler that changed my code that way would be thrown out as garbage but a C compiler that DIDN'T parallelize that code on hardware that could support it would seem underpowered. And given the highly sequential nature of the original code the transformation to parallel code goes way beyond mere de-sugaring.
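As an illustrative stand-in (hypothetical names, not necessarily the code I compiled), a plain sequential loop like this is the kind of thing that gets the treatment:

```c
#include <stddef.h>

/* Element-wise add, written as an ordinary one-at-a-time loop.
   restrict promises the compiler that the arrays do not overlap,
   which makes it easier for an optimizer to vectorize. */
void add_arrays(float *restrict dst,
                const float *restrict a,
                const float *restrict b,
                size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

Compiling something like this with gcc -std=c99 -S at -O1 and again at -O3 and diffing the output is an easy way to see the difference; on x86 the -O3 version will typically use packed SSE instructions that process several floats per iteration.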

Optimization leads me to ...

Undefined Behavior

The C standard says a bunch of stuff leads to undefined behavior. Real compilers take advantage of that to optimize away code, sometimes with surprising results.
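A classic illustration, sketched here with made-up function names: signed integer overflow is undefined, so a compiler is allowed to assume it never happens and simplify accordingly.

```c
#include <limits.h>

/* Looks like an overflow check, but x + 1 overflowing is undefined
   behavior, so an optimizer may assume it cannot happen and reduce
   the whole function to "return 1". */
int naive_can_increment(int x)
{
    return x + 1 > x;
}

/* Well-defined alternative: compare against the limit first,
   without ever performing the overflowing addition. */
int safe_can_increment(int x)
{
    return x < INT_MAX;
}
```

For ordinary inputs the two agree. For INT_MAX only the second one has an answer the standard will vouch for; what the first one does depends on the compiler and the optimization level.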

What assembly has undefined behavior on that scale? mov al, [3483h] ;(Intel syntax) is going to try to copy the byte at 3483h into the al register, period. It might trap due to memory protection but the attempt will be made.

Conclusion

Alan Perlis famously said "a programming language is low level when its programs require attention to the irrelevant." What's relevant depends on the problem and the domain, so a corollary would be that a language that's at too high a level prevents paying attention to some of what is relevant for your needs.

C is a low level language as compared with most languages used for most day to day programming. If you spend all day hacking Ruby or Haskell then C might look like assembly. But C is actually at a substantially higher level than assembly. Assembly both permits and requires you to do things that C hides.

Please stop perpetuating the myth that C is sugar for assembly because real compilers make complicated, non-obvious transformations to get from C to machine code that assemblers don't. And please stop saying that C is a portable assembly because if it is then it's the assembly for a very peculiar machine that is both underspecified and a bit stupid when compared to real hardware.

52 comments:

FredZarguna
said...

Actually, the oldest FORTRAN compilers were, in many ways, lower-level than 360 Assembler. We used to code all the character routines in Assembler because FORTRAN was so awful at string manipulation and formatting. For small relocatable code, we even wrote some inline stuff in machine language and punched it as INTEGERs into COMMON blocks, which we then called as if they were subroutines.

Am I dating myself here? (I did that in High School and didn't care for it...)

Anyway, what about:

void foo(){ asm mov ax,[3483H]}

/* just kidding... anybody who says C is assembler hasn't coded any assembler, and probably hasn't coded any C. */

Generally speaking, each IBM (mainframe) COBOL statement compiles to one or two IBM 360 assembly language instructions. After all, the main COBOL directives are almost the same as the assembly mnemonics. The COBOL language came first (1960 vs 1963), so I suspect this is not an accident.

As a 6502 and x86 assembly programmer, I'm surprised at your "undefined behavior" comment. Both chips had variants produced by multiple companies which ended up with different opcodes in certain places.

You could most definitely write 6502 assembly code whose behavior was undefined, in the sense that it ran differently on different 6502's. The only thing that makes C special is that they admitted it. An honest technical specification for the 6502 would indeed list undocumented opcodes as undefined behavior.

No, that is absolutely not true. A "PERFORM" statement, for example, has no direct equivalent in 360 assembly language. I've written both COBOL and 360 assembler and I can honestly say that no similarity between the two ever crossed my mind.

The main definition of an "assembly" language is a one-to-one correspondence with the CPU machine language it is designed for.

FORTRAN is a high level language (FORmula TRANslation) and is nowhere near the level of the machine. C, C++, C#, BASIC, COBOL are all high level languages. They technically don't have that "one-to-one" correspondence to the machine code of the hardware they run on.

There are of course exceptions. COBOL on the Burroughs B3200-B4900 medium scale mainframes of the 60's through the 80's DID have a one-to-one correspondence. There were machine language instructions like "MOVE A TO B EDITED WITH C" where A, B and C were references to memory locations containing the input record (A), the output record (B) and the PICTURE formatting (C). The CPU practically "thought" in COBOL. There was generally one machine language instruction generated by the compiler for each COBOL Procedure Division statement. It went so far as to even address memory in BASE 10! COBOL on those machines ran like a bat out of hell. Very fast.

Actually C is only resembling assembly if it is translating to a p-machine. That's probably where the phrase came from. In the early days (1970s, 1980s), p-machine hardware was more common.

If you translated C to assembly without optimization (e.g. on an HPUX machine running on a 9000 series 300, which was originally designed to run Pascal p-system and HP BASIC (which compiles source to run p-code - those "PROG" files)), you'd see the assembly had a 1:1 correspondence to the C statements for the most part. Leaving the p-system world has changed that.

C is certainly not "assembly language with syntactic sugar". Any programmer who knows the history of programming languages knows that FORTRAN I is assembly language (and not particularly portable) with syntactic sugar.

The assembler it most resembles is that of the PDP Dec machines obviously before your time. The morons are those who think that the original statement was literal. It does map quite well though hence the suggestion...

C is relatively easy to compile to straightforward and predictable assembly, but I agree, that doesn't give it all the characteristics of an assembly language.

Another thing I'd add is that C doesn't model the von Neumann architecture very well. It lets you create data structures just fine, but the capability to create code, dynamically, at runtime, surely the defining characteristic of a shared instruction-data architecture, is all but completely absent.

C feels like PDP-11 assembler (e.g., ++i and i++ come from PDP-11 assembler). But "C is portable assembler" isn't a literally true statement -- it means that C is low-level enough to write an operating system in, when combined with a bit of real assembler for the non-portable parts.

Hey, great post. But your layout does something shitty. If the text is too big in my browser, then your code example gets cut off on the right margin. I spent 2 or 3 minutes looking for the closing paren on your example function.

I sense an insecurity about low level languages as though you need defend yourself. People are suckers for ever higher level languages and pride them self on ever higher, et infinitum, becoming buffoons. C is fine and those who say otherwise are gay. In programming, you can divide between liberals and conservitives, pretty readily, can't you -- liberals are the gay ones who make buffoons of themselves being suckers for the latest fad that is sheek.

I'd argue that undefined behaviour in C is analogous to the illegal opcodes offered by the 6502. The 6502 has 151 legal opcodes. It also has 105 illegal opcodes - 40% of the available commands - which could be used by a skilled hacker to perform all sorts of optimisation tricks.

Good article, but it is a bit of a straw man that there is more hardware control in assembler -- anyone ever argue different? :-)

An assembler with good macro facilities (or a simple precompiler) could do much of C's local variable stuff.

A macro for subroutine entry would reserve stack space and define offset constants. A macro for returning would release the extra space. It could be neat to use if the processor architecture has sane addressing modes.

Disclaimer: I haven't touched even C for quite a few years. :-) And the only time I read about the x86, I was too disgusted to continue after a couple of hours.

I'm not sure. If you write a program that takes advantage of undefined behaviour of C and compile it a dozen times with the *same* compiler, will you get the same behaviour every time?

Now imagine you run your favourite C64 game on an emulator that doesn't support the illegal opcodes. Do you get the same behaviour?

Running a program on a 6502-compatible CPU that doesn't implement the illegal opcodes (which were only ever a "feature" of the manufacturing process, not designed into the chip) gives you the same effect as compiling a C program with a compatible compiler that doesn't implement undefined behaviours in the same way.

Maybe FORTRAN V was. When did you last check the language definition of FORTRAN? It has everything a high level programming language needs (classes, namespaces, pointers and concurrent programming). C looks like the stone age compared to FORTRAN 2008.

I cut my C teeth on VAX/VMS, where the (unoptimized) machine instruction stream corresponded exactly to the C statements.

On the other hand, C was extremely inefficient at certain operations that were vital to the software I was developing at the time. We wanted to move millions of single bytes from one buffer to another, and would ordinarily code a loop such as

In assembly this was about five instructions, specifically using the MOVB (move byte) instruction to do the data copy. The C version, compiled, even when optimized, produced many thousands of instructions, because C decrees that atomic datatypes must be promoted to the largest possible integer datatype before being manipulated (in this case assigned) and then truncated back... and THAT incorporated all kinds of special cases for handling signed versus unsigned quantities, etc. etc. etc. I always preferred assembly for precisely the reasons cited in this article: you had control over what was going to happen.

The reason why people equate C to assembly is not because they think C is an inadequate substitute for assembly. It's because they're trying to insult both.

Programming in assembly isn't something anyone should do, except at the lowest level. It takes a huge amount of effort to produce a small amount of code, which is locked to the specific hardware of a single machine. Trying to pretend that C is assembly is a way of saying that anyone who programs in C is similarly non-productive.

This isn't true, obviously. But I guess it's a way for people who aren't bright enough to understand pointers to pretend to themselves that they're actually smarter for not even trying.

Your logic would seem to disconnect C from C++ because of their differences, while ignoring their similarities.

All those things you list are twists (like undefined behavior), limitations, and additions, but they do not change the fact that C is in reality a convenient interface to assembly. And it is most evident from the keyword "asm" which gives you the ability to embed assembly directly in the program. And C can do all those things you mention it can't, *precisely* because of the fact that it can embed assembly code; like mess with the stack.

For the most part, most C constructs have a 1-to-1 relationship with assembly code and structures, as seen through a compiler. Yes, C hides stuff from the programmer; just as assembly hides stuff from the programmer (like different machine representations of the same instruction, endianness, bit representation, etc), but it is still analogous to machine code.

Your take on who's the "idiot" is misplaced. Write a simple compiler and see how "different" they really are; I assure you it will be fairly easy.

What do you mean by "++i and i++ come from PDP-11 assembler"? It certainly did not use that syntax. The PDP-11, like many other CPUs (Z80?), has array/buffer operations that can load/store registers using an array index register that can be pre-decremented or post-incremented.

I am always amused when someone calls C "assembly language". I work with C in embedded x86 systems (not a Linux or Windows derived code base) and often have the debugger show me the C source along with the generated assembly. I can say that C != ASM from first hand experience. One of the more difficult things our systems do is an operation called a thunk, or a change of processor operating modes. Because of the nature of this operation, the code to handle the transition, including setting back up things like The Stack, must all be written in assembly. C simply does not have the capability to do this natively.

I came across this through Reddit, and I'm almost certain this is a joke. C absolutely is assembly bundled up in syntactic sugar. The stack, optimization, etc. is the nasty medicine hidden by the sugar. That is its entire mission in life: to make assembly a bit tastier.

I wrote a 68k to C converter to port Sonic 3D from the Sega Genesis to the Saturn (and PC as well). There are a lot of tricks you can do in assembly language that are hard to do in C. The main thing is the condition flags, since these are set by the processor after almost every instruction (not true on ARM, where it is optional, and not on operations on address registers in 68K). Checking for 0 is easy enough in C, but there is no easy way of checking for integer overflow (V flag in 68K). Also assembly language allows you to do things like addq #4,sp (which pops the return address off the stack), so you can return to the subroutine that called the caller: a() { .. b(); codeWeDontWantToCall(); } b() { addq #4,sp; } would return to the method that called a and would skip codeWeDontWantToCall().

In addition you can mix data and methods much more easily. I used to use data structures to control animation of sprites, with code like dc.w MOVE_XY + CALL,-1,-1 followed by dc.l methodToCall, with conditional flags, and I could put part of the logic into the animation, which led to easier code and smaller code as well.

Also, using assembly language meant I could write sound playback using an interrupt at 8 kHz, which used the user stack pointer as a temporary variable, which saved 24 cycles compared to a memory variable, which was about a 2% saving of the processor speed.

Assembly language was also more portable than C code (well, for porting Atari ST to Commodore Amiga it was, since they were the same processor, but totally different operating systems).

I'm pretty sure that Z80 didn't have {pre,post}-{increment,decrement}. Anyway, 8008/8080/Z80 were later than PDP-11. C of course uses different syntax from PDP-11 assembler, but "*x++ = *y++" can be done in 1 instruction on PDP-11.

I would have thought that the definition of assembly is not how closely it resembles the machine instruction set, but that you are defining the instruction set on a word to byte basis. Anything that requires a compiler seems to be a higher level language (even if it is still low). That's a non programmer perspective - from an older electronics engineer.

C code may lead to unexpected results only if you use implicit types and write unclear code. Use "signed" and "unsigned" types, do not use "int" and do not write horrors like "i = ++i - i++;" and everything will be OK. And that "array" copy example must be replaced with a single memmove/memcpy stdlib call, which is definitely based on the assembly "rep movs[b,w,d]" instruction. C code could be VERY predictable, stable and clear; otherwise why do people write OS kernels in C? It's just a human factor.

When you get right down to it, an assembler and a compiler simply work differently, with different functions. A compiler builds a parse tree and tries to figure out what the programmer is trying to get done. It then produces assembly code which can then be presented to an assembler for conversion to machine code. Assembler to machine code is a much simpler and more direct mechanism because it's basically a choice of which machine code instruction is going to work best for the current assembler instruction. The journey from C to assembler is way more complex and can take many different twists and turns, especially when optimization is involved. C works via a compiler. It is an irrelevant comparison from the start.

How could it, on any processor? A C program is compiled to run in a given environment, usually an operating system. Changing processor modes may change this environment to a point where a program cannot function. The OS itself may be able to do that, but it would probably require resetting some parts of the OS as well.

Changing modes from a running program can be like sawing off the branch on which you sit, so support for this functionality appears a bit pointless.

Well, duh. If you need to be literal about it, then C is not assembly language, by definition.

That being said, if you want to open up your brain a little bit, C can be viewed as a portable assembly language for the simple reason that it has fairly simple data and execution models which are at a pretty low-level on most CPUs, and because it's nearly ubiquitous. Most modern C compilers do a good job of optimizing for their targets, so C is reasonably efficient (and if you feel a need to have an argument about hand-rolled assembly vs. compiler-generated, head on over to the I Missed the Point Cafe and yak away. Don't save me a seat, though -- I won't be joining you).

The metaphorical point is that the 'C-machine' is a perfectly reasonable target for a higher level language's compiler. This is explicitly the reason that the first C++ implementations were actually C-language generators, which were in turn compiled by local C compilers. Stroustrup himself uses the term 'portable assembler' to describe C (you can read more about this in his "Design and Evolution of C++").

Sorry, but most of this article and a good deal of the responses missed this point.

Just because both C and the PDP-11 both have autoincrement, that doesn't mean that one was derived from the other (both sharks and dolphins have fins).

To make a logical argument, you actually have to show that K&R modeled C's autoincrement operator after the PDP-11's autoincrement addressing mode.

In fact, Dennis Ritchie wrote exactly the opposite:

"People often guess that they were created to use the auto-increment and auto-decrement address modes provided by the DEC PDP-11 on which C and Unix first became popular. This is historically impossible, since there was no PDP-11 when B was developed." [1]

One year later, just a quote that might be useful for anyone reading this, from Linus Torvalds [http://torvalds-family.blogspot.com/2009/08/programming.html]: "Some people seem to think that C is a real programming language, but they are sadly mistaken. It really is about writing almost-portable assembly language"

I successfully managed to translate C into assembly by flattening the HELL out of a working C function, explicitly dedicating registers (possibly redundant in C), and replacing all while and for loops, and so on, with if and jump statements... Managed to translate a square root algorithm and bubblesort :)
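A rough sketch of what that flattening step can look like, using a bubble sort written with nothing but assignments, if and goto. The function and label names are made up for illustration, and this is not the commenter's code. Each line is simple enough to map onto an instruction or two by hand.

```c
#include <stddef.h>

/* Bubble sort with every loop flattened into labels, if and goto. */
void bubble_sort_flat(int *a, size_t n)
{
    size_t i, j;
    int tmp;

    i = 0;
outer:
    if (i + 1 >= n) goto done;            /* was: for (i = 0; i + 1 < n; i++) */
    j = 0;
inner:
    if (j + 1 >= n - i) goto next_outer;  /* was: for (j = 0; j + 1 < n - i; j++) */
    if (a[j] <= a[j + 1]) goto no_swap;   /* already in order, skip the swap */
    tmp = a[j];
    a[j] = a[j + 1];
    a[j + 1] = tmp;
no_swap:
    j = j + 1;
    goto inner;
next_outer:
    i = i + 1;
    goto outer;
done:
    return;
}
```

From here the remaining work is assigning i, j and tmp to registers and turning each if/goto pair into a compare and a conditional branch.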

Portable assembler is an oxymoron. A portable assembler is not an assembler, since an assembler is specific to a processor. Assembler is a representation of the processor's instructions in human readable form, with a one-to-one correspondence between machine instructions and assembler mnemonics. It is impossible to have a portable assembler. Porting means a translation (aka compilation), which is an alternative to assembling (1-to-1 mapping). Yet you can consider an assembler a special case of a compiler: the compiler that does the simplest form of translation, 1-to-1 mapping. So a compiler ("C language") is a generalization of an assembler. This permits saying "asm IS-A compiler" but not "compiler IS asm". A cat is an animal, but an animal is not necessarily a cat.