Saturday, March 27, 2010

Does anyone understand types and magnitudes?

Regularly I have to work with many popular libraries out there, as well as many libraries written by coworkers and the like.

One thing which I see constantly misused over and over again is the type system used in C and C++.

I wonder if most people understand it, have bothered to contemplate it, or simply don't care if their library is misprogrammed trash.

First of all, C/C++ for integers supports both signed and unsigned types. Many programmers seem to ignore the unsigned types altogether. Maybe some education is in order.

Each base type in C consists of one or more bytes. On modern mainstream systems, a byte is 8 bits. Meaning a single byte can store 2^8 values, or in other words 256 possibilities. Two bytes can store 2^16 values, or in other words 65,536 possibilities.

A type designed for integers (whole numbers) which is unsigned interprets its values as ranging from 0 to the number of possibilities minus 1. In the case of a one byte type: 0 to 255; for a two byte type: 0 to 65,535.

When a value is signed (the usual default if no signed or unsigned description is applied), a single bit is used to describe whether the rest of the bits represent a positive or negative value. This effectively cuts the amount of positive values that can be represented in half. In the case of a single byte, 2^7 values will be negative, -128 to -1, 2^7-1 values will be distinctly positive, 1 to 127, and one value will be 0 (for a total of 256 possibilities). This same math extends to however many bytes a given type uses.

When dealing with something which requires working with both positive and negative values such as debt (monetary balance), direction, difference, relative position, and things of a similar nature, signed values are quite natural and should of course be used.

But if you're working with something which has no meaning for negative values such as "Number of students in this class", "How old are you", "Amount of chickens in the coop", "Amount of rows in database table", or "Amount of elements in this array", or amounts of any nature really, using a signed type is one of the worst things you could possibly do. You are reserving half of your possible values for a situation which will never occur (barring mistakes in programming).

Let's take a look at some real world examples. The most popular type used for at least the past decade would have to be the 32 bit integer. 2^32 gives us 4,294,967,296 different value possibilities. If that was a storage size, it would be 4GB. Effectively a 32 bit unsigned integer can store the values 0 to 4,294,967,295, or one byte less than 4GB. If it was signed, it would be able to store the values -2,147,483,648 to 2,147,483,647, topping out at one less than 2GB.

If you have an Operating System or File System which limits you to 2GB as a max size for files, or your Digital Camera only lets you use up to 2GB flash cards, it's because the idiots that programmed it used signed values. They did this because either they were expecting files and flash cards of negative size, or because they were a bunch of idiots. Which of those two possibilities actually happened should be obvious.

The same goes for any programming library you see with a limit of 2GB. As soon as you hear anyone mention a limitation of 2GB, what should immediately register in your head is that idiots worked on this application/library/standard/device.

Now sometimes programmers try to justify their ignorance or their idiocy with remarks such as using negative values to indicate error conditions. Meaning, they write functions which return a positive value in case of success, and -1 in case of error. This is a very bad practice for two reasons. First of all, one shouldn't use a single variable to indicate two different things. It's okay to use a single variable where each bit (or sets of bits) indicate a flag for a set of flags. But it's very wrong to have a single value indicate success/failure and amount, or day of week and favorite color, or any two completely unrelated things like that. Doing so only leads to sloppy code, and should always be avoided. Second of all, you're cutting your amount of usable possibilities in half, and reserving a ton of values just for a single possibility. You can't be more wasteful than that.

If you need a way for your function to return success/failure and amount, there are cleaner and more effective ways to do so. Many times an amount of 0 is invalid. If a function is supposed to return the amount of something, 0 isn't really an amount. If I asked how many students are in my class, and I get 0, that in itself should be indicative of an error. If I asked how many rows did my SQL statement just change, and I get 0, that should be indicative of an error, or at the very least, there were no SQL rows to change. If I really need to know if the 0 means a real error, or there was no data to work with, a separate function can tell me if the last call was a success or failure. In C and in POSIX, it is common to use the global variable errno to indicate errors. If you're making your own library, you can make a function to tell you why the last value was 0. There was no data, there was an error accessing the data, and so on.

Another method would be to pass one parameter by pointer/reference to store the amount or success state, and have the other returned from the function. Another technique would be to return an enum which tells you which error condition happened or everything is okay, and use a different function to retrieve the value of that operation upon success. Lastly, C++ programmers can use std::pair<> for all their multiple return needs, or simply return the amount normally, and throw an exception on any kind of error.

Now that hopefully all my loyal readers are more educated about the sign of their types, and various function return techniques that better libraries use, let's talk a bit about magnitude.

I see again and again libraries that don't have a clue about magnitude. The various C/C++ standards state the following about the normal built in types.

1) The amount of bytes used for the following types should be in this proportion: short <= int <= long <= long long.
2) short should be at least 2 bytes.
3) long should be at least 4 bytes.
4) long long should be at least 8 bytes.

In order to get types of a particular size, C99 added the header <stdint.h> and C++0x adds <cstdint>, with types such as int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, and several others. Each of these types is the size in bits that its name indicates, with those without a u at the beginning being signed, and those with one being unsigned. Using these types you can get the exact size you need.

But remember to use things appropriately. I see several popular SQL engines for example that support as many rows in a table as a uint64_t is able to contain, but if you ask them how many rows were altered by the last query, they'll report that information in a plain and simple int! You by now should realize there are two problems with such a course of action.

Now lastly, if you're writing some kind of function which takes an array, it is normal convention to pass a pointer to the array, as well as a size variable to specify how many elements or bytes are within the array.

The standard C library uses a type called size_t to deal with such values. A size_t can hold the size of the largest amount of bytes the program is able to address, and is of course unsigned.

If you need to pass an array, your prototype should always be along the lines of:

void func(void *p, size_t size);
void func(uint8_t *p, size_t size);

If you need to return a size, for something which is addressable memory, again always use a size_t. Functions like strlen() return a size_t (as does the sizeof operator), and functions like memcpy(), memset(), or malloc() take a size_t as one of their arguments.

Now notice that on 32 bit, the void * type is 4 bytes, and on 64 bit, it is 8 bytes. This means that a void * can address any position in memory that the system is able to. In order to specify the amount, you'll notice in each case that sizeof(size_t) matches sizeof(void *). If I were to get a 128 bit system which had a void * of 16 bytes, the size_t would also be at least 16 bytes.

There are other types which also happened to match the size of void * each time around, but if you look at my earlier explanation of the relationships between the various C types, you'll notice those other types vary too much to know if they'll be good for an amount on every system you try to use your application or library on, so always stick with size_t.

All too often I see programmers use an int for such an amount, which is clearly wrong, or an unsigned int, which is better, but also wrong, especially on my 64 bit system. I also see some programmers use just unsigned by itself. I don't know who thought that up, but it's identical to an unsigned int, which is also clearly wrong. I included unsigned and signed specifically in my test above because some people are under the mistaken notion that those types become as large as possible. Every time I see code which contains one of them, I want to vomit.

If you ever have any doubt as to what size a type may be on your system, or get into an argument with someone, test things yourself with a sizeof() call, it's not hard to do, or point them to this article.

Now go out there and write some good libraries. I have too much vomit and bile on the floor here from all the horrible code I have to work with or review.

22 comments:

I use unsigned because it's much easier than typing unsigned long long. I don't use size_t because it's not a good name for a type.

For instance, operator[](size_t thisIsAnIndex_notASize);

I use uintmax_t when I need the biggest value possible, and I use that for many things, like string<>hex conversion.

I don't use it everywhere for a couple of reasons:
* there is a performance hit in critical code on some systems; especially for 64-bit division
* it introduces casting errors on strongly-typed template arguments and loose-typed constant shift operations unless you use the unsavory ULL suffix
* it just flat out isn't necessary in many cases: if you are working with a string that is over 4GB in length, you probably shouldn't be using strlen() on it

For the SNES bus, unsigned is good enough, as it won't exceed 24-bits. To be quite honest, I would desperately prefer uint24_t to represent its exact size, but sadly that requires a good performance hit.

I understand there are systems where int is 16-bits, but I also understand that my program is too big to fit in their memory spaces anyway, so I use a static_assert and don't worry about compatibility with PDP-11s or 286 real-mode DOS. If I were paranoid about portability, I would not use C++. It's a rabbit hole of UB. Shift right, value of NULL, size of types, endian, size of member function pointers on the same system, alignment of variables in memory, and on and on. At some point you have to draw the line.

I imagine the auto keyword is going to make this problem a lot worse. Since auto i = 0 will default to int.

Lastly, strpos() is the de-facto "return -1 on error" example. Passing a valid flag by pointer or reference is painful to use on the fly. But really, the best option here is as you said, std::pair. But even that sucks.

As far as I can tell, there are only two references to strpos; PHP and some random 'C portability' library that nobody actually uses as far as I can tell. PHP doesn't have the function interface you describe, and the 'C portability' library's 'strpos' doesn't do anything you can't do just as easily using a standard function.

For your information, C99 has a header, <stdint.h>, that contains portable definitions for a bunch of useful integer types. I'm not sure if it is in the C++ standards, but GCC has it anyway. That (and size_t, ptrdiff_t and friends) allowed writing efficient code for two different architectures with the following sizes:

sizeof(short): 1
sizeof(int): 1
sizeof(long): 2

and

sizeof(short): 2
sizeof(int): 4
sizeof(long): 4

Yes, the first machine had 16 bit bytes. There was also a thing that had following values:

sizeof(short): 1
sizeof(int): 1
sizeof(long): 1

A char was 32 bits.

Regarding the comment about performance: On 64-bit Intel machines, 64-bit arithmetic is faster than 32-bit. You can find it out by studying, or get it free by using int_fast16_t et al.

Hey Nach, I don't see where you said that std::pair is the "best option" for error notification, as byuu intimated, but if that is indeed what you think, what is the benefit of that over just throwing exceptions? I'm no programmer, but if I understand correctly, exception throwing seems to be more the fashion these days in higher level languages - or, one might even say, intended coding practice.

Also how exactly would you use pair, to send a pointer and an array length? Or just play it straight and send a result and an error indication?

>I don't see where you said that std::pair is the "best option" for error notification

I didn't say it was the best option. I said it was an option.

I believe the best option differs from scenario to scenario. One technique will feel more natural in some situations than others.

>what is the benifit of that over just throwing exceptions?

I wrote the option of throwing exceptions too.

Exceptions versus returning have several pros and cons.

Exceptions Pros:
Single location for error management.
Clean interface for working with code when no errors will occur.

Exception Cons:
Overhead.
Possible memory leaks if not using techniques like std::auto_ptr<>.
Less natural usage in cases where you want to deal with issues immediately, and use slightly different initialization logic.

It's for reasons like this that the standard memory allocation feature "new" can return an error either by throwing, or by returning a null pointer, at the programmer's behest.

>but if I understand correctly, exception throwing seems to be more the fashion these days in higher level languages

Indeed. Some of the newer "higher" level languages today seem to enforce using certain features over and over again. That however doesn't mean they are the best way of doing something. A crucial point many newer languages miss out on is allowing the programmer to do whatever feels the most natural for a particular situation. However, they balance this rigidness with more consistent language constructs, making programs easier to maintain when written by bad or inexperienced programmers.

>Also how exactly would you use pair, to send an pointer and an array length? Or just play it straight and send a result and an error indication?

Er, yeah, I understand HOW to do it, but what I was asking was what you meant by "using std::pair to return a value as well as an error check". So I see from your code that indeed you meant to return a canonical pair of (error_status, computation_result), instead of the more general case of (array, length_of_array), in which, say, returning a length of 0 or returning a null pointer would constitute an error notification.

And yeah, I didn't think you were advocating std::pair particularly over other methods, but byuu seemed to be saying you were, so I just asked.

What I was most curious about was the pros and cons of using exception throwing. Thanks for the explanation! I see how the proper choice could vary depending on the situation.

In some cases, I already know the text contains a valid integer, I just need it converted to an actual integer type. In that situation, there is no reason for me to bother checking if it failed, since it never will.

In other cases, I'll handle an integer with value 0 exactly the same as if a conversion failed, so again I have no need to check the boolean parameter.

But the function does offer me the option of checking via the boolean parameter if I ever need to know the difference between a value of 0 and failure.

In this scenario, using std::pair<> would force more annoying usage, and require conveying information that is usually not needed. Using exceptions would also add a lot of overhead to what would otherwise be a simple and straight forward conversion process, for which I usually would like to ignore dealing with throws.

> No one suggested you use uintmax_t, that's generally overkill. For the cases you're describing, you're probably best off using a template.

What? Your whole point of using size_t was to get the biggest variable so you're not stuck with 2GB/4GB when you have 64-bits available. Is there even a platform where sizeof(uintmax_t) > sizeof(size_t)?

> If you're using C99, uintptr_t is also an option.

uintptr_t on x86 is 32-bits, no better than unsigned.

> Those suffixes should not be used, as they differ between platforms in meaning. If you need constants, use the INT32_C(), UINT32_C(), INT64_C(), and UINT64_C() macros.

My point about terseness is being lost :/

> Neither the C, nor the C++ standards contain a function by that name. I can't really comment on a made up function, or a function from some random library written by idiots.

Yes, C and C++ have piss-poor support for parsing strings compared to other languages.

I'm shocked at the general lack of interest here in such a basic string function.

Fine, ignore the exact function and think about the concept: return a position from 0-N, and still be able to return a not-found flag. std::pair has some ugly syntax for that. But yes, it will get the job done.

I like this better for two reasons:
1) You don't need the syntactic ugliness of using a "pair".
2) If you forget to check the bool part of the pair there is no compiler warning. With an optional variable, at a minimum you're using a "*", so the compiler yells at you if you attempt to use the value directly (which usually means you forgot to check for an error).

@flamingspinach
Using exceptions is the approach that Bjarne Stroustrup recommends in his book "Programming -- Principles and Practice Using C++". The approach I've generally seen taken with exceptions is that you throw them for, well, exceptions. If, when you call a function, you're *expecting* an error (you're asking "tell me the number of kids in the class, if this class exists"), a different approach should be taken.

>> No one suggested you use uintmax_t, that's generally overkill. For the cases you're describing, you're probably best off using a template.

>What? Your whole point of using size_t was to get the biggest variable so you're not stuck with 2GB/4GB when you have 64-bits available.

You clearly need to read the article again. And perhaps look at the output of the sample code provided.

sizeof(size_t) matches that of sizeof(void *) so it is 32 bit on 32 bit, and 64 bit on 64 bit. One doesn't get stuck using it, because it expands in size appropriately, so one can index any position in an array.

>Is there even a platform where sizeof(uintmax_t) > sizeof(size_t)?

Yes, my 32 bit system has a 64 bit uintmax_t and a 32 bit size_t.

>uintptr_t on x86 is 32-bits, no better than unsigned.

You're missing a crucial point. It grows with the rest of the system. It will be whatever size it needs to be. Hence better than unsigned which can be smaller than it needs to be.

>My point about terseness is being lost :/

Terseness is not the issue, what is the issue is correct programming for compatibility and portability.

>I'm shocked at the general lack of interest here in such a basic string function.

If you mean something like strpos() from PHP, it's needed in PHP because of a lack of raw access to pointers. C/C++ has no need for such a function, and offers strchr(), strstr(), strtok(), and similar for all your string needs.

Index via position instead of location is a bad way of doing things. However if you need to work with things via position, C++ does in fact provide that option with std::string::find().

>Fine, ignore the exact function and think about the concept: return a position from 0-N

The function you suggest using can't return a position from 0-N, it can only return at most a position to N/2.

>But yes, it will get the job done.

Not really. However returning ~0 could be an option for not found, and that way you can address up to N-1.

>Yes, I see you don't care as much about all the red tape.

In other words, you don't care about programming correctly.

There are proper ways of doing things, and improper ways of doing things. You're free to do things improperly, but that just adds to the vomit I have on the floor.

You're quite correct. Using a C++ class for some kind of guarded pointer to wrap the value would be best. Especially instead of recreating the same concept over and over again for each function you have.

However, if you need a technique which works in vanilla C, you would then do something along the lines of what I described.

All this nonsense is a needless distraction from REAL programming. The move from assembly to high level languages was good because it allowed the programmer to focus more on getting things done - not on how they are done. The same goes for the move to object oriented programming. The days when you needed to hand optimize your code to fit in 32 kilobytes of RAM are behind us; I don't care how large the program is - certainly not a few extra kilobytes - I just care that the program runs. Don't make me think about bits and bytes, I just want a number. JavaScript making every number a 64-bit floating point value has some flaws (0.1 + 0.2 != 0.3), but overall gives a lot of value to the programmer.

Unsigned types don't exist. An integer is neither signed nor unsigned, it's just a collection of bits that are interpreted a certain way. 0xFF is either -1 or 255, depending on how you interpret it. Addition, subtraction, multiplication, left shift, zero filling right shift, and bitwise operations all produce the same resulting bits whether the operands are signed or unsigned.

Using an unsigned type to get "double the range" is a dodgy practice, because if half the range was too little the other half will eventually be too little too. Also unsigned numbers work badly in loops that count down to 0 (x >= 0 will never be false for unsigned).

The semantics of maths that mixes signed and unsigned are, what? Nobody knows, so it's bad to do it.

If Java's gotten away without unsigned types for 15 years, why were they necessary in the first place?

You are quite correct that things like this are a distraction from the actual programming.

Every programming language unfortunately has much to distract from the actual programming.

When programming with C/C++, dealing with types is a given, that's how the language is. In order to program properly in those languages this is a crucial topic.

One who instead programs in JavaScript will instead have to know the quirks of JavaScript to program properly: === vs. ==, and everything else.

Regardless of the programming language used for a program, we want our programmers to do their best with the language they're using, and not do a bad job because they don't know the language well, or don't care about the language's quirks.

Hi Damjan.

>Unsigned types don't exist.

You're correct that it's simply a matter of how you interpret it, but it is up to the programmer to interpret it properly. Such as not throwing out half the possibilities on cases that don't exist, when they have more cases that do exist.

>Addition, subtraction, multiplication, left shift, zero filling right shift, and bitwise operations, all produce the same resulting bits whether the operands are signed or unsigned.

Right shift is not always zero filled. On most systems, right shifting a signed value with its top bit set (a negative number) shifts in a one; otherwise a zero. Not all systems are the same here either, so you have to be careful about using >> in portable programs.

>Also unsigned numbers work badly in loops that count down to 0 (x >= 0 will never be false for unsigned).

Piece of cake, use while (--x), surrounded by another if, if need be, or put an if (!x) { break; } in the loop.

Needing to count down for whatever reason is no excuse to use a signed type.

>The semantics of maths that mixes signed and unsigned are, what? Nobody knows, so it's bad to do it.

The semantics do differ from system to system. However on a given system, they are defined.

However, what you're saying is a good point, and is one reason why we have casting operators. But it also means that all libraries you make should work with the correct types, don't return a signed type when anyone needing to work with it sanely would be using unsigned math.

>If Java's gotten away without unsigned types for 15 years, why were they necessary in the first place?

Java removed features which they deemed too hard for most idiot programmers to deal with. That however doesn't mean you should limit yourself to what Java provides in non Java languages.

This article explains exactly why unsigned types should be used in the first place. There are multiple ways to program things that you can get a program to work exactly as needed with various limitations in place, such as only signed, but doing so only adds overhead to every operation.

Because of this, I doubt you'd see an operating system and drivers written purely in Java any time soon.

>Also unsigned numbers work badly in loops that count down to 0 (x >= 0 will never be false for unsigned).

"ok" Let's take my most recent proggie...

while (R) { result *= ((long double)N--) / R--; }
while (i--) { for (j=wnum, val=pstat[i].val; j--;) ... }
while (ret) { ... for (wk=rnum-1, wk_p=rdat_p->l[wk]; wk; wk--) ... if (--ret) ... }
for (wnum_wk=11; wnum_wk--;) ...
for (tnum_wk=tnum; tnum_wk--;) ...
do { blk_wp += cca(high--, low); } while (low--);

...and then some. You get the drift. You'll notice most of these actually work with the index having a value of 0, too (except where I don't need it, by design). All those indexes are of course unsigned and the loops do work quite well, thanks for asking. Because a negative amount of data, of result tables, of elements in a group and of threads ran in parallel totally makes sense, except when it doesn't. Which is rather often. ¬_¬

"If you have an Operating System or File System which limits you to 2GB as a max size for files, or your Digital Camera only lets you use up to 2GB flash cards, it's because the idiots that programmed it used signed values. They did this because either they were expecting files and flash cards of negative size, or because they were a bunch of idiots. Which of those two possibilities actually happened should be obvious."

Where did you get this idea and what filesystems are you referring to? This definitely isn't true of SD, or of any version of FAT except a draft version of FAT16 that never actually appeared on any OS after like 1985.