Posted
by
Soulskillon Wednesday August 03, 2011 @12:18AM
from the catchy-but-totally-misleading-internet-headline dept.

An anonymous reader writes "Poul-Henning Kamp looks back at some of the bad decisions made in language design, specifically the C/Unix/Posix use of NUL-terminated text strings. 'The choice was really simple: Should the C language represent strings as an address + length tuple or just as the address with a magic character (NUL) marking the end? ... Using an address + length format would cost one more byte of overhead than an address + magic_marker format, and their PDP computer had limited core memory. In other words, this could have been a perfectly typical and rational IT or CS decision, like the many similar decisions we all make every day; but this one had quite atypical economic consequences.'"

Robert Frost had a fairly depressing outlook on life, and the point of the poem is that it doesn't matter what road you take.

I mean, just pay attention to the narrative tense in the last stanza, the one people take to be so life-affirming and "do something different!". The narrator isn't saying "I did this, and I know it was important"; he's saying "I did this, and I think that in the future I'm going to tell people it was important".

Perhaps it means that regardless of which path you take, the one you do take(the decision you make), should be analyzed and reflected upon to verify it is inline with what you wish to accomplish.
Of course, I could continue and counter your you vain and shallow remarks and how they reflect upon a person taking literal interpretation of a poem and scrutinizing it with the inability of the author to respond.
But I digress.

Not sure where you took your "poetical exegesis" class, but you should ask for a refund.

The narrator as "vain, shallow individual" is entirely a character pulled out of your hindquarters, as there is nothing in the text of the poem to lead to that conclusion.

The poem is simply a reflection on how we, as individuals, make choices in life. Some of us choose to take the direction taken by most of those around us. That might be university, family, job, retirement in FL. Some of us choose to turn aside from that direction and try another path. Programming a PDP to play "Space Travel," for example. Or writing an operating system "just for fun."

Frost's suggestion is that these choices of path may seem insignificant at the time -- both paths being nearly the same; but that, as "way leads on to way," there's no going back and thus we may find ourselves down a path that leads to unexpected places. When Linus Torvalds wrote linux, he could not know that "the path less traveled" would lead to fame and fortune, literally. The college kids who created Slashdot could not know it would make them rich.

In fact, the point of the poem is exactly that it does matter which path you take. But that you don't always know how your choice is going to turn out. Frost himself might have continued his career as a teacher, a stable and certain means of supporting his family. Instead, he chose to focus on his poetry. He took a chance. And it worked well for him.

I'm tempted to bookmark this response as a great example of why engineers should not fear breadth requirements. (I'm assuming anyone with such a low Slashdot ID works in engineering...)

The ironic interpretation is widely held because it's supported not only by the text, but also Frost's own statements, and the broader context of his work--in which seemingly simple descriptive verse hides darker, more complex themes. (A major reason why he is

Frost's suggestion is that these choices of path may seem insignificant at the time -- both paths being nearly the same; but that, as "way leads on to way," there's no going back and thus we may find ourselves down a path that leads to unexpected places. When Linus Torvalds wrote linux, he could not know that "the path less traveled" would lead to fame and fortune, literally. The college kids who created Slashdot could not know it would make them rich.

http://poetrypages.lemon8.nl/life/roadnottaken/roadnottaken.htm [lemon8.nl]
Robert Frost on his own poetry:
"One stanza of 'The Road Not Taken' was written while I was sitting on a sofa in the middle of England: Was found three or four years later, and I couldn't bear not to finish it. I wasn't thinking about myself there, but about a friend who had gone off to war, a person who, whichever road he went, would be sorry he didn't go the other. He was hard on himself that way."

Since it was published before 1923, it's already in the public domain. See the footnote at the bottom of the wikisource page for the poem [wikisource.org], and then you can follow the links from there if you care to read more.

Firstly, it makes it seem like the address+length format is a no-brainer, but there are quite a lot of problems with that. It would have had the undesirable consequence of making a string larger than a pointer. Alternatively, it could be a pointer to a length+data block, but then it wouldn't be possible to take a suffix of a string by moving the pointer forward. Furthermore, if they chose a one-byte length, as the article so casually suggests as the correct solution (like Pascal), it would have had the insane limit of 255-byte strings, with no compatible way to have a string any longer. (Though a size_t length would make more sense.) Furthermore, it would be more complex for interoperating between languages -- right now, a char* is a char*. If we used a length field, how many bytes would it be? What endianness? Would the length be first or last? How many implementations would trip up on strings > 128 bytes (treating it as a signed quantity)? In some ways, it is nice that getaddrinfo takes a NUL-terminated char* and not a more complicated monster. I'm not saying this makes NUL-termination the right decision, but it certainly has a number of advantages over addr+length.

Secondly, this article puts the blame on the C language. It misses the historical step of B, which had the same design decision (by the same people), except it used ASCII 4 (EOT) to terminate strings. I think switching to NUL was a good decision;)

Hardware development, performance, and compiler development costs are all valid. But on the security costs section, it focuses on the buffer overflow issue, which is irrelevant. gets is a very bad idea, and it would be whether C had used NUL-terminated strings or addr+len strings. The decision which led to all these buffer overflow problems is that the C library tends to use a "you allocate, I fill" model, rather than an "I allocate and fill" model (strdup being one of the few exceptions). That's got nothing to do with the NUL terminator.

What the article missed was the real security problems caused by the NUL terminator. The obvious fact that if you forget to NUL-terminate a string, anything which traverses it will read on past the end of the buffer for who knows how long. The author blames gets, but this isn't why gets is bad -- gets correctly NUL-terminates the string. There are other, sneaky subtle NUL-termination problems that aren't buffer overflows. A couple of years back, a vulnerability was found in Microsoft's crypto libraries (I don't have a link unfortunately) affecting all web browsers except Firefox (which has its own). The problem was that it allowed NUL bytes in domain names, and used strcmp to compare domain names when checking certificates. This meant that "google.com" and "google.com\0.malicioushacker.com" compared equal, so if I got a certificate for "*.com\0.malicioushacker.com" I could use it to impersonate any legitimate.com domain. That would have been an interesting case to mention rather than merely equating "NUL pointer problem" with "buffer overflow".

"...it would have had the insane limit of 255-byte strings, with no compatible way to have a string any longer."

Compatible with what? Seems to me they could have just used continuation bit for the size field, much the way UTF-8 works to store non-ASCII characters.

This would still make the strings incompatible, because you would only have a 127-byte string length before the "continuation bit" comes into play and you need to switch to a 15-bit string length. All the previous code written with longer-than-127-byte strings would be incompatible.

If we were to switch now, is that the compatibility you're referring to? Well sure.

But nobody's talking about switching now, the point of the topic is that C should have been designed differently. In those days there was very little backwards compatibility to worry about.

And if it had been decided to be 1-byte length + data, and everyone used it like that, and assumed that the full 8-bits are available for the length, then when we switch to variable-byte length encoding, it would create an incompatibility. The incompatibility I speak of is the hypothetical one switching from 1-byte fixed-length length encoding to variable-byte length encoding.

"They could have just used variable length encoding from the beginning." Sure, and they could have programmed everything in Java from the start... the idea of a variable length encoding would have been over-engineering the problem that they were facing.

Oh, Lordy, if you had ever programmed in a language with a 255 character limit for strings you would praise $DIETY every time you use a C string. Dealing with length limited strings is the largest PITA of any senseless and time-wasting programming task.

Suppose C had originally had a length for strings? The only thing that makes sense is for the string length count to be the same size as a pointer, so that it could effectively be all of memory. A long is, by C language definition, large enough to hold a pointer that has been cast into it. So string length computations all become longs. Not such a big deal for most of life... until.... 64 bit addressing. Then all sorts of string breakage occurs.

The bottom line is that in an application programming language strings need to be atomic, as they are in Python. You just should not care how strings are implemented, and you should never worry about a length limit. The trouble is, C is a systems programming language, so it is imperative that the language allow direct access to bit-level implementation. If you chose to use a systems programming language for application programming, well, then it sucks to be you. So why did we do that for so long? Because all the other alternatives were worse.

Hell, I've used languages where the statement separator was a 12-11-0-7-8-9 punch. (Bonus points if you can tell me what that is and how to make one.) So a NUL terminated string looks positively modern compared to that.

On any sane platform, "long" has that property, and can also hold the machine word or more. Win64 is not sane.

Because of it, the standard had to add intptr_t which is the only type portably known to be of same width as void* and char* (but _not_ necessarily the same as a pointer to any other type!). Of course, MSVC doesn't follow standards and doesn't have this type nor stdint.h/inttypes.h at all.

Thus, your code will not work there, and will cut pointers. To make things worse, it will actually work if yo

How big is an abstract character? When C was created, the choices were basically ASCII or EBCDIC, so 7 bits. Then you got 8-bit encodings, but they were all incompatible. When OpenStep was published, Objective-C strings were defined as ordered collections of unicode characters, which were 16-bit values. Modern versions of the unicode specification require more than 16 bits for the entire range, so you end up needing 32 bits for each character. In a high-level language, you can just have a character typ

Not to mention the argument for "because space was at a premium" is specious, because either you had a 8-bit length prepended to the string, or you had an 8-bit special value appended to the end of the string. Both ways result in the same space usages.

From what I read in the summary, (didn't read TFA) this whole thing sounds like a propaganda piece supporting the idea that we should use length+string, by presenting it as "this should have been a no-brainer but the idiots making C screwed up."

As a nitpicky pedantic note though, if C had gone with length+string format, then other languages would have been written around the C standard, since most of them were written around the C standards to begin with to increase interoperability in the first place.

The advantages that I see for counted length are:- it makes copying easier - you know beforehand how much space to allocate, and how much to copy.- it makes certain cases of strcmp() faster - if the length doesn't match, you can assume the strings are different.- It makes reverse searches faster.- You can put binary in a string.But that must be weighed against the disadvantages, like not being able to take advantage of CPUs zero test conditions, but instead having to maintain a counter which eats up a valuable register. Or having to convert text blocks to print them. Or not being well suited for piped text or multiple threads; you can't just spew the text into an already nulled area, and it will be valid as it comes in; you have to update a text length counter for every byte you make available. And... and...

Getting a free strlen() is NOT an advantage, by the way. In fact, that became a liability when UTF-8 arrived. With a library strlen() function, all you had to do was update the library, but when the compiler was hardcoded to just return the byte count, that wasn't an option. Sure, one could go to UTF-16 instead, but then there's a lot of wasted space.

All in all, having worked with both systems, I find more advantages with null-termination.

There's also a third system for text - linked lists. It doesn't have the disadvantage of an artificial string length limit, and allows for easy cuts and pastes, and even COW speedups, but requires far more advanced (and thus slower) routines and housekeeping, and has many of the same disadvantages as byte-counted text.. Some text processors have used this as a native string format, due to the specific advantages.

An arbitrary limit equal to the virtual machine size of the computer that was originally targeted.

The advantages that I see for counted length are:- it makes copying easier - you know beforehand how much space to allocate, and how much to copy.- it makes certain cases of strcmp() faster - if the length doesn't match, you can assume the strings are different.- It makes reverse searches faster.- You can put binary in a string.

- It all but eliminates the possibility of buffer overruns for strings.

But that must be weighed against the disadvantages, like not being able to take advantage of CPUs zero test conditions, but instead having to maintain a counter which eats up a valuable register.

But lots of CPUs have an instruction a bit like "decrement register and jump if not zero" which can be used for length+data strings.

Or not being well suited for piped text or multiple threads; you can't just spew the text into an already nulled area, and it will be valid as it comes in;

With modern character encodings, you can't guarantee that whatever string format you use. Couple that with the fact that streamed data tends to be read and written in blocks with a length parameter anyway, and the whole advantage is gone. This is why almost all modern languages have some variation on length + data for their strings and utilities for manipulating raw byte buffers.

Getting a free strlen() is NOT an advantage, by the way. In fact, that became a liability when UTF-8 arrived. With a library strlen() function, all you had to do was update the library, but when the compiler was hardcoded to just return the byte count, that wasn't an option.

Except that strlen() has always and still does count the number of C chars before the null byte. This is enshrined in the C99 standard. UTF-8 has not changed the implementation of strlen(). Also, gcc and probably many other compilers will normally optimise things like strlen() to a few lines of assembler rather than a call to libc, so you'd have to recompile anyway if it does change.

Sure, one could go to UTF-16 instead, but then there's a lot of wasted space.

All in all, having worked with both systems, I find more advantages with null-termination.

There's also a third system for text - linked lists. It doesn't have the disadvantage of an artificial string length limit, and allows for easy cuts and pastes, and even COW speedups, but requires far more advanced (and thus slower) routines and housekeeping, and has many of the same disadvantages as byte-counted text.. Some text processors have used this as a native string format, due to the specific advantages.

I'd still take NULL-terminated for most purposes.

Most modern languages have a proper string type and I would always take that over null terminated char sequences. You can bet that Java's internal implementation of String uses length+data.

Are you suggesting strlen() should return the number of UTF-8 characters, not the number of bytes? That's insane... the entire point of UTF-8 is that stuff like strlen() can treat it as a narrow string. If you want to have a function for returning the number of printable characters in a UTF-8 string, that's going to be a separate function, and isn't any easier or harder with sized strings v.s null-terminated strings.

"But who would waste all that space just to make calculations faster?" Actually, Perl represents all Unicode in UTF-32 (native byte-alignment) and renders out to UTF-8 when printed. Precisely because space is relatively cheap, and processing time is still often the current limiting factor.

Which is not always true either. Often the limiting factor is the space in the cpu cache, since ram access is relatively expensive.

As a nitpicky pedantic note though, if C had gone with length+string format, then other languages would have been written around the C standard, since most of them were written around the C standards to begin with to increase interoperability in the first place.

Yes, but perhaps the simplicity was partly why it caught on. The reason I raised all of the "what about..." questions was to illustrate just how many small variations in an address+length standard there could have been. Even if C had made

The C string has its place, but what I never understood is why the C standard library hasn't also included a string type. Something like the following with all of the accompanying bound checking functions to go along with it.

struct string{
size_t length;
char *buffer;};

There are several third party libraries that do just that, but it seems like something worthy of being there out of the box.

Well C++ includes a class that is pretty much exactly what you ask for. It wouldn't make sense for C to include that, as the whole point is that C gives you the ability to manipulate data however you want. If C included that, it would be criticised for having two incompatible string types. If it only included that, it would be criticised for not being low-level enough (the programmer is forced to call all these inefficient string manipulation functions that do bounds checking).

Then that would be incompatible, yes (and as I said in the original post, there was historically such an issue, as B chose 4 as the string terminator!) But that is just one potentially-incompatible design decision, versus the four I listed in my post for length-delimited strings. Other issues for length-delimitation: do you put the length in a struct with the pointer, or in the buffer with the data? Do you make it variable-width or fixed-width? If it's variable-width, do you use 0 or 1 as an extension bit?

My personal fave is strncpy(), which will silently not terminate the string if the buffer is too small, but if you give it a huge buffer it punishes you by NUL padding the string all the way to the end of the buffer.

Also, have a look at how mysql deals with varchars. There is no 255 byte limit - when length exceeds that value, you just go to 2 bytes of length, etc. Your arguments about what type of integer to use conveniently ignore conventions like network order. In short, it is not too hard to solve. Do you really think the state of programming was so bad back then that people wouldn't test 129 byte strings?

Also, have a look at how mysql deals with varchars. There is no 255 byte limit

Before Mysql 5.0.3 the limit was 255 and 65535 afterward.

when length exceeds that value, you just go to 2 bytes of length, etc.

It does this because each column defines the maximum length for the varchar and the number of bytes used for length is fixed for each column. This however is also overhead, this information for the size of the length field needs to be stored for each variable. In C this means that each variable now has even more overhead (the actual amount depending on how you encode such information).

Note that my post was not necessarily saying that NUL was the right decision. Just that it isn't a no-brainer -- going the other route has a lot of complications.

What is so undesirable about making a string larger than a pointer?

It would mean that the C library would need to declare a "string" struct instead of using char*. Now rather than passing a char* as an argument, you would have to decide whether it's worth passing the two word "string" struct, or a string* pointer (allowing it to fit into a register

Dealing with NUL is significantly simpler than dealing with length fields, and there are significantly fewer sources for confusion.

Is it, or are you just used to dealing with NUL-terminated strings?

If you wanted the standard to use a variable-length length as you suggest, you would need to make sure that all the programmers correctly store and parse variable-length strings. Of course they could get it right, but there are lots of ways they could get it wrong.

Nope, they are simpler. Re-read all of the questions I asked regarding design decisions that could be made around address+length formatted strings and tell me that they are just as simple. Now I think higher-level languages should be using lengths, because their libraries abstract the details (e.g., C++ or Java). But in a language where programmers fabricate their own strings, simplicity is best.

They are simpler IFF you never use the null value. Do you have any files on your system which have NUL bytes in them? Hint: yes.

Yes -- this is a good reason not to use NUL-terminated strings (which, once again, TFA missed). Remember: I never said NUL terminated strings were good, just that the article missed the point by blaming NUL strings for a different, unrelated problem, and not actually picking up on any of the problems with NUL strings.

And no, the article didn't miss the "real security problems" caused by null termination. Where did you stop reading?

The point was: the security problems the article mentions (buffer overflows/underflows) aren't actually caused by NULL terminated strings, they are caused by buffers that are allocated too small. If the buffer is too small, it won't matter if the string is measured at the beginning or terminated at the end. (It can be fixed by measuring the size of the buffer, but that is a different topic).

However there is a real security problem, as the GP described, although it really was a problem of mixing two standa

There is no 255 byte limit - when length exceeds that value, you just go to 2 bytes of length, etc.

So you have a 255 byte string. You append one byte to it. What do you do now?

Are you really suggesting that people should have to move all the bytes of the string one further along so they can increase the length field to two bytes, and then append the new character, and that programmers who can't remember to put a 0 at the end of a string can do that without screwing up?

Sure, you can force everyone to use library calls for all their string operations, but C was intended to be cheap, dirty and fast, which i

So you have a 255 byte string. You append one byte to it. What do you do now?

Are you really suggesting that people should have to move all the bytes of the string one further along so they can increase the length field to two bytes, and then append the new character, and that programmers who can't remember to put a 0 at the end of a string can do that without screwing up?

I would think that the more logical way to handle this is to handle the string in chunks. The first byte (8 bits) of the string is the length, if all of the length bits are set then at offset 256 from the length byte you have another length byte and another section of the string. Rinse, repeat.

Yes, for very large strings this isn't as efficient as simply using 16 bits at the front of the string (65536 positions in 2 bytes vs 512 positions in 2 bytes) but for small strings - probably the most common case - i

C's "everything is a pointer" approach gives you the power to easily do lots of cool stuff, and adding length to a string would break that elegance. But using NUL-terminated strings creates a lot of security problems, not merely limited to buffer overflows, which are really caused by C's backward memory allocation.

Null termination sounds lovely when you've a teenager writing assembly and doing register allocation by hand, but it's obviously bad once you've seriously thought about runtimes, like after taking an algorithms class.

I spent my formative programming years primarily writing code in assembly and I resent that statement.:-) Runtime is always in one's mind and optimizing for speed is the desired goal. Optimizing for size is something that is merely forced upon us by circumstances beyond our control. No true assembly programmer, nor any true Scotsman, would prioritize size over speed if avoidable.

I think the bigger point that's missed is that if a size field were used, you'd still have the same buffer overflow problem if someone simply specified a size that didn't match the allocated memory, same as strncpy will happily try to keep writing to a buffer if you give it bad size information.

Poor guy. I guess sooner or later he's going to have to learn how to manage his memory and understand how the underlying physical hardware works. That must be a real toughie for anyone who learned to "program" in the Java/C# world.

Yeah, clearly PHK [wikipedia.org] doesn't knows anything about memory allocation. (Except for the malloc library he wrote for FreeBSD...)

Maybe he should RTFM.

I don't have a FreeBSD system at hand, but I wouldn't be surprised if the malloc page was written by PHK.

Seriously, though, it's hard to know what language you as a system administrator should use for something like a data logger that has to run continuously (or cron every minute or so) other than C, but then there's the security problem that some user will come up with some weird filename hack to subvert the system.

I call it my billion-dollar mistake. It was the invention of the null reference in 1965. At that time, I was designing the first comprehensive type system for references in an object oriented language (ALGOL W). My goal was to ensure that all use of references should be absolutely safe, with checking performed automatically by the compiler. But I couldn't resist the temptation to put in a null reference, simply because it was so easy to implement. This has led to innumerable errors, vulnerabilities, and system crashes, which have probably caused a billion dollars of pain and damage in the last forty years.

As much as I hate to say it, C# has an interesting feature where most variable-types can not be set to NULL, unless you specifically declare them to be 'nullable' (when you declare the variable, not the type). This avoids a large number of situations where NULL is more of a hinderance than a help, and in the cases where you actually need it, it's available.

Objective C has a standard object called 'nil,' which you can insert into lists and such, but can be sent messages. This prevents several types of NULL

this could have been a perfectly typical and rational IT or CS decision, like the many similar decisions we all make every day

Actually the tradeoff may not have been rational. The storage bytes saved may have been offset by the extra code bytes necessary for handling unknown length strings. Perhaps this is actually an example of premature optimization, optimizing things before proper profiling and analysis has shown the problem exists and the proposed solution is beneficial.

this could have been a perfectly typical and rational IT or CS decision, like the many similar decisions we all make every day

Actually the tradeoff may not have been rational.

Actually, the choice was rational [bell-labs.com] (at least, on purpose) - you see, it's not about a single byte, it's about a new data type.

C treats strings as arrays of characters conventionally terminated by a marker. Aside from one special rule about initialization by string literals, the semantics of strings are fully subsumed by more general rules governing all arrays, and as a result the language is simpler to describe and to translate than one incorporating the string as a unique data type. Some costs accrue from its approach: certain string operations are more expensive than in other designs because application code or a library routine must occasionally search for the end of a string, because few built-in operations are available, and because the burden of storage management for strings falls more heavily on the user.

Actually the tradeoff may not have been rational. The storage bytes saved may have been offset by the extra code bytes necessary for handling unknown length strings.

Not really, no. Having written basic library code for both, it usually requires more code to handle Pascal-style (length+data) strings than C-style (data+null) strings. You save quite a bit of code ("quite a bit" being relative, but I've had to squeeze code into 208 bytes of RAM before) by using the C-style strings most of the time.

hmm. marker character, or a length.Marker: same type as string, so no need to worry about bit size, start/stop bits or other extraneous. String can be any size and only restricted by available memory. (given the ability to swap darn near unlimited pages in current hardware.... and the ability to virtualize across computers... this means strings have a potentially <i>infinite</i> limit)

Length: What's the size? What byte order? What bit size? How will this affect communications between pla

Length: What's the size? What byte order? What bit size? How will this affect communications between platforms?

These aren't hard questions, IMHO. Just say the length is an int (or an unsigned int), and then assuming you didn't freak out when someone asked all those very same questions about ints, then you should be fairly happy with the result.

The real problem with the addr+len approach is that now every string becomes a struct, or a structptr.

This means that when passing a string to a function, either the string takes up two register/stack slots, or you're passing around a const-ptr (but the contents of the struct are not const), which means one more memory access due to pointer indirection.

x86 and the PDP-11 are register-starved. the x86 has 8 registers, with 4 or 5 available as general-purpose registers.The PDP-11 was similar with 8 registers

My assembly class in college was on a PDP-11. I've done quite a bit of x86 assembly over the years. I'm confused as to why you think a pascal style string structure pointer requires any more registers or stack than a C character pointer. In assembly if I want the the length I would reference a size_t at the pointer address, and if I want text I would reference a char at pointer+offset where offset is sizeof(size_t).

We learn from our mistakes, so let me say for the record, before somebody comes up with a catchy but totally misleading Internet headline for this article, that there is absolutely no way Ken, Dennis, and Brian could have foreseen the full consequences of their choice some 30 years ago, and they disclaimed all warranties back then. For all I know, it took at least 15 years before anybody realized why this subtle decision was a bad idea, and few, if any, of my own IT decisions have stood up that long.

1. Complexity. Without exception, every variable in C is an integer, a pointer or a struct. A null terminated string is a pointer to a series of integers -- barely one step more complex than a single integer. To keep the string length, you'd have to employ a struct. That or you'd have to create a magic type for strings that's on the same level as integers, pointers and structs. And you don't want to use a magic type because then you can't edit it as an array. Simplicity was important in C -- keep it close to the metal.

2. Computational efficiency. Many if not most operations on strings don't need to know how long they are. So why suffer the overhead of keeping track? That makes string operations on null terminated strings on average faster than string operations on a string bounded by an integer.

3. Bytes. It's only one extra byte with a magic type or an advanced topic struct. In both cases with an assumption that the maximum possible length on which the standard string functions will work is 64kb. If you're talking about a more mundane struct then you're talking about an int and a pointer to a block of memory which has an extra set of malloc overhead. That's a lot of extra bytes, not just one.

For the kind of language C aimed to be -- a replacement for assembly language -- the choice of null terminated strings was both obvious and correct.

It is interesting to compare C's approach with that of two nearly contemporaneous languages, Algol 68 and Pascal [Jensen 74]. Arrays in Algol 68 either have fixed bounds, or are `flexible:' considerable mechanism is required both in the language definition, and in compilers, to accommodate flexible arrays (and not all compilers fully implement them.) Original Pascal had only fixed-sized arrays and strings, and this proved confining [Kernighan 81]. Later, this was partially fixed, though the resulting language is not yet universally available.

C treats strings as arrays of characters conventionally terminated by a marker. Aside from one special rule about initialization by string literals, the semantics of strings are fully subsumed by more general rules governing all arrays, and as a result the language is simpler to describe and to translate than one incorporating the string as a unique data type. Some costs accrue from its approach: certain string operations are more expensive than in other designs because application code or a library routine must occasionally search for the end of a string, because few built-in operations are available, and because the burden of storage management for strings falls more heavily on the user. Nevertheless, C's approach to strings works well.

No, strings with a listed length would also be pointers to a series of integers - it's just that, instead of giving a value special semantics (0 as end of string), you give a position in the series special semantics (store the length in the first two bytes). In both cases, you need your string-handling functions to be aware of whatever the convention is.

Computational efficiency. Many if not most operations on strings don't need to know how long they are. So why suffer the overhead of keeping track? That makes string operations on null terminated strings on average faster than string operations on a string bounded by an integer.

I don't know that that's true. Operations that do need to know the length of the string could be quicker, and I'm not sure that these cases are less frequen

Operations that do need to know the length of the string could be quicker, and I'm not sure that these cases are less frequent. What are the common cases you are thinking of where C-style strings are faster?

C-style strings are simpler. That's the biggest advantage. For the few cases where performance matters, you can always define your own string type.

What are the common cases you are thinking of where C-style strings are faster?

strcpy(char *d, char *s)
{
while ( *d++ = *s++ );
}

Challenge: come up with the equivalent for pascal-style strings in a way that doesn't compile into at least twice as much code.

In fact, aside from strlen, are there any string functions that aren't made at least twice complex by using P-style instead of C-style strings? Most of strlib.c can be implemented as one-liners, assuming C-style strings.

Normally I tend to agree with what I've read from PHK, but this one seems wide of the mark. If you involve a *real* C guru in the discussion, I don't think there would be much sentiment toward nixing the sentinel.

Perfectly valid, as God intended. A string with a length prefix is a hybrid data structure. What is the size of the length structure up front? It can be interesting in C to sort all suffixes of a string, having only one copy of the string itself. Try that with length prefix strings. (The trivial algorithm is far from ideal for large or degenerate character sequences, but it does provide insight into position trees and the Burrows-Wheeler transform.)

Nor would I blame all the stupid coding errors on the '\0' terminator convention. In C, a determined idiot can mess up just about anything, unless the compiler takes over and does things for you, a la Pascal by another name. If that had been the bias, would be all be using C now, or some other language? Repeat after me: Generativity Rocks. Nanny languages usually manage to bork generativity over. Correct Programming Made Easy never strays far from the subtitle Composition Made Difficult.

No one who ever read Dijkstra and took him serious ever made a tiny fraction of the stupid mistakes blamed on hapless zero.

If you want to point to a real steaming pile, strcpy() was designed by a moron with a bad hang-over and no copy of Dijkstra within a 100 mile radius. It was tantamount to declaring "you don't really need to test your preconditions... what kind of sissy would do that?"

C is a nice design, as evidenced by how seamlessly the STL was grafted onto C++ at the abstraction layer (at the syntax layer, not so much). The problem with C was always a communication problem. To use C well one must test preconditions on operation validity. To use algebra well one must test preconditions on operation validity.

Where does PHK lay the blame for the algebraist who made it possible to divide both side of an equation by zero, or multiply an inequality by -1? Preferably with the complete moron who doesn't check preconditions on the validity of the operation. Two thousand years later, now we have a better solution?

PHK is right about cache hierarchies. By the time cache hierarchies arrived, we had C++ with entirely different string representations.

For some reason I've never been keen on having a programmer who can't manage to correctly test the precondition for buffer overflow making deep design decisions about little blocks of lead in the radiation path.

And it's not even much of a burden. As Dijkstra observed, for many algorithms, once you have all your preconditions right and you've got a provable variant, there's often very little left to decide. It actually makes the design of many algorithms simpler in the mode of divide and conquer: first get your preconditions and variant right (you're now half done and you've barely begun to think hard), *then* worry about additional logic constraints (or performance felicitous sequencing of legal alternatives).

The coders who first try to get their logical requirements correct and then puzzle out the preconditions do indeed make the original task more difficult than not bothering with preconditions at all, supposing there's some kind of accurate measure over crap solutions, which I refuse to concede.

If you want to point to a real steaming pile, strcpy() was designed by a moron with a bad hang-over and no copy of Dijkstra within a 100 mile radius. It was tantamount to declaring "you don't really need to test your preconditions... what kind of sissy would do that?"

To play Devil's advocate, strcpycannot check it's precondition. You can't tell whether a pointer you're given is valid, or how much space is left in the buffer.

(Well, I guess you could go make malloc record far more information than it otherw

The problem with C isn't strings. It's arrays. Strings are just a special case of arrays.

Understand that when C came out, it barely had types. "structs" were not typed; field names were just offsets. All fields in all structs, program-wide, had to have unique names. There was no "typedef". There was no parameter type checking on function calls. There were no function pointers. All parameters were passed as "int" or "float", including pointers and chars. Strong typing and function prototypes came years later, with ANSI C.

This was rather lame, even for the late 1970s. Pascal was much more advanced at the time. Pascal powered much of the personal computer revolution, including the Macintosh. But you couldn't write an OS in Pascal at the time; it made too many assumptions about object formats. In particular, arrays had descriptors which contained length information, and this was incompatible with assembly-language code with other conventions. By design, C has no data layout conventions built into the language.

Why was C so lame? Because it had to run on PDP-11 machines, which were weaker than PCs. On a PC, at least you had 640Kb. On a PDP-11, you had 64Kb of data space and (on the later PDP-11 models) 64Kb of code space, for each program. The C compiler had to be crammed into that. That's why the original C is so dumb.

The price of this was a language with a built in lie - arrays are described as pointers. The language has no idea how big an array is, and there's not even a way to usefully talk about array size in C. This is the fundamental cause of buffer overflows. Millions of programs crash every day because of that problem.

That's how we got into this mess.

As I point out occasionally, the right answer would have been array syntax like

int read(int fd, char[n]& buf, size_t n);

That says buf is an array of length n, passed by reference. There's no array descriptor and no extra overhead, but the language now says what's actually going on. The classic syntax,

int read(int fd, char* buf, size_t n);

is a lie - you're not passing a pointer by value, you're passing an array by reference.

C++ tries to wallpaper over the problem by hiding it under a layer of templates, but the mold always seeps through the wallpaper when a C pointer is needed to call some API.

Why was C so lame? Because it had to run on PDP-11 machines, which were weaker than PCs. On a PC, at least you had 640Kb. On a PDP-11, you had 64Kb of data space and (on the later PDP-11 models) 64Kb of code space, for each program.

Your relative comparisons are a bit off. The Altair from 1975 (the first versions of C were finished around 1973) had a whopping 1KB of memory. The mini computers of the day ran rings around what PCs there were, both in raw power and in memory.

It's forced in practice by the fact that the entire standard library, and all third-party libraries, all produce and consume null-terminated strings.

What's far worse is that, since C FFI is the lowest common denominator that we have across various languages, null-terminated strings become the standard way to marshal strings between libraries written in different languages. This means many things: for one, no embedded nulls, which is bad for many scenarios where handling them is desired.

It would have been more urgent to find out where an allocated part of RAM ends.

Or just like Integers and floats, strings could have been their very own basic type. Essentially leave the implementation of it to the compiler, so it can do range checks. Most C-programmers seem to believe that this is done already.

TFA suggests the decision was to save a byte, but I don't believe that's the main reason it happened.

If you're traversing a string anyway, what happens is that when you load the data into your register (which you'll be doing anyway, for whatever reason you're traversing the string), you get a status flag set "for free" if it's zero, so that's your loop test right there. Branch if zero. If you have to compare an offset to a length on every iteration, then now you're having to store this offset in another register (great, like I have lots of registers to spare on 1970s CPUs?) and compare (i.e. subtract) to the length which is stored in memory (great, a memory access) or another register (oh great, I need to use another register in the 1970s?!) and the code is bigger and slower.

It's easy to laugh these days about anyone caring about how many clock cycles a loop takes and whether it uses 2 registers or 4 registers, but this stuff used to be pretty important (and more recently than the 1970s). Kids these days: if you weren't there, you just don't know what it was like.

BTW, I have a hunch K & R didn't know they were building such an eternal legacy. It's reasonable to speculate that this is still going to be part of systems a hundred years from now, but in 1970 you would have been a mad man to suggest such a thing. (Not that this invalidates TFA's point at all; I'm just making excuses for K&R I guess.)

If they had gone with the embedded length option we'd be sitting around bitching about how short-sighted it was to use just two bytes for the length. Including how Dennis Ritchie supposedly said "64K strings should be enough for anybody".

Which is worse? Having it be O(N) to get a string length and having inexperienced programmers get confused and make mistakes? Or capping your maximum string length at 0xffff?

I'll take the former, please. I do a lot of string manipulation in C and when you're used to it, it's actually not that bad to get right and still be efficient. And it provides a useful shibboleth to detect people who are no good at C.:-) Just think of how much harder it would be to interview a C programmer if you couldn't give them a crazy string manipulation problem.

And of course, in reality, where people aren't actually tested before providing bad code that performs these tasks poorly and has terrible effects on society at large, your attitude is approximately as useful as a purse to a fish.

Hmmm... let's not stop mid-way.

in reality, where people aren't actually tested before providing bad code that performs these tasks poorly and has terrible effects on society at large, you are approximately as useful as a purse to a fish.

FTFY: in a world who doesn't give a dam' about professionalism, being one is useless.

The downside to this approach however is it requires some extra steps when retrieving a string from a C-based API. And of course if the external C-based library has a string handling bug, you're back to square one.

1. with NULL-terminated strings, there's no distinction (other than in the string.h and related library) between a char * and a other_type *. Inventing a "string" type in C (not C++) would have made the compiler more complex (see footnote **)
2. because char * is no different than other_type* , I can pass the address in the middle of the string char * for processing. Not so much for a std::string. How does it matter? Well, take parsing for exa

They don't look the same to me, these days the "IT" decisions are taken by the MBA type guys, with the sole purpose of maximizing their chances to get more visibility, "exceed objectives" and get a larger bonus/promotion/whatever. Sure they're rational too but what do they have in common with CS?

Programmer for 20+ years here, BS and MS in CS. I used to share such opinions. Then I went to business school. I really enjoyed business school in part because I was constantly amused by how ignorant and wrong I had been regarding such opinions. May I be bold enough to suggest that the portrayal of MBAs in popular and nerd cultures are about as accurate as the portrayal of programmers in popular and non-nerd cultures.

None of the above should be interpreted to mean that business school makes one appreciat

I mean, I think what the parent's saying about management is largely true. They're out to save their own asses, but you're right that there's something else going on behind the scenes. They really do need to look after our own interests too. It wouldn't add up if it turned out somehow you had a rockstar manager but completely incompetent boobs for subordinates.

My old boss was friggin' awesome on this point. She made it a point to highlight the accomplishments of her