Unionize Your Variables – An Introduction to Advanced Data Types in C

Programming C without variables is like, well, programming C without variables. They are so essential to the language that it doesn’t even require an analogy here. We can declare and use them as wildly as we please, but it often makes sense to have a little bit more structure, and combine data that belongs together in a common collection. Arrays are a good start to bundle data of the same type, especially when there is no specific meaning of the array’s index other than the value’s position, but as soon as you want a more meaningful association of each value, arrays will become limiting. And they’re useless if you want to combine different data types together. Luckily, C provides us with proper alternatives out of the box.

This write-up will introduce structures and unions in C, how to declare and use them, and how unions can be (ab)used as an alternative approach for pointer and bitwise operations.

Structs

Before we dive into unions, though, we will start this off with a more common joint variable type — the struct. A struct is a collection of an arbitrary amount of variables of any data type, including other structs, wrapped together as a data type of its own. Let’s say we want to store three 16-bit integers representing the values of a temperature, humidity, and light sensor.

Yes, we could use an array, but then we always have to remember which index represents what value, while with a struct, we can give each value its own identifier. To ensure we end up with an unsigned 16-bit integer variable regardless of the underlying system, we’ll be using the C standard library’s type definitions from stdint.h.

Alternatively, the struct can be initialized directly while declaring it. C offers two different ways to do so: pretending it was an array or using designated initializers. Treating it like an array assigns each value to the sub-variable in the same order as the struct was defined. Designated initializers can be arbitrarily assigned by name. Once initialized, we can access each individual field the same way we just assigned values to it.

Notice how the fields in the designated initializers are not in their original order. We could even omit individual fields, in which case they are implicitly initialized to zero. This allows us to modify the struct itself later on without worrying much about adjusting every place it was used before, unless of course we rename or remove a field.

Bitfields

The bitfield is a special-case struct that lets us split up a portion of an integer into its own variable of arbitrary bit length. To stick with the sensor data example, let’s assume each sensor value is read by an analog-to-digital converter (ADC) with 10-bit resolution.

Storing the results in 16-bit integers will therefore waste 6 bits for each value, which is more than one third. Using bitfields will let us use a single 32-bit integer and split it up in three 10-bit variables instead, leaving only 2 bits unused altogether.

We could also add a 2-bit wide fourth field to use the remaining space at no extra cost. And this is pretty much all there is to know about bitfields. Other than adding the bit length, bitfields are still just structs, and are therefore handled as if they were just any other regular struct. Bitfields can be somewhat architecture and compiler dependent, so some caution is required.

Unions

Which brings us to today’s often overlooked topic, the union. From the outside, they look and behave just like a struct, and are in fact declared, initialized and accessed the exact same way. So to turn our struct sensor_data into a union, we simply have to change the keyword and we are done.

However, unlike a struct, the fields inside a union are not arranged in sequential order in the memory, but are all located at the same address. So if a struct sensor_data variable starts at memory address 0x1000, the temperature field will be located at 0x1000, the humidity field at 0x1002, and the brightness field at address 0x1004, since each 16-bit field occupies two bytes. With a union, all three fields will be located at address 0x1000.

What this means in practice is easily shown once we assign values to all the fields like we did in the struct example earlier.

Unlike the struct example, the value printed here won’t be the assigned value 123, but 789 instead. Since every field in the union shares the exact same memory location, any time one of the fields gets assigned a value, all other fields’ previously assigned values are overwritten. For this reason, it rarely makes sense to have fields with the same data type inside a union; the point is to mix different types together. Note that the data type sizes don’t need to match, so it’s no problem to have a union with, for example, a 32-bit and a single 8-bit integer: the 8-bit field simply overlaps one byte of the 32-bit one. The size of the union itself will be at least the size of its biggest field, so with a 32-bit and an 8-bit integer, the union will be 4 bytes in size.

Using Unions

A union essentially gives one memory location different names and correspondingly different sizes. That might seem like a strange concept, but let’s see how that can be used to easily access different single bytes within a longer data type.

union data_bytes {
    uint32_t data;
    uint8_t bytes[4];
};

Here we have a 32-bit integer overlapping with an array of four 8-bit integers. If we assign a value to the 32-bit data field and read a single location from the bytes array, we can effectively extract each individual byte from the data field.

The actual output depends on whether your processor architecture is little-endian or big-endian. Little-endian architectures will interpret the array index 1 as the integer’s second least significant byte 0x56, while big-endian architectures will interpret it as the integer’s second most significant byte 0x34.

The same principle used to extract a byte works also the other way around, and we can use unions to concatenate integers. Let’s consider a real world example involving the ATmega328’s analog-to-digital converter. The ADC has a 10-bit resolution, and looking at its registers, the converted value is stored in two separate 8-bit registers — ADCL and ADCH for the lower and higher byte respectively. A struct with two fields named after those two registers seems like a good choice for this, and since we also want the whole 10-bit value of the conversion, we’ll use the struct together with a 16-bit integer inside a union.

Note that accessing the struct fields anonymously will only work as long as there are no name conflicts. If there are duplicate field names, the struct itself will require a field name. Once the struct has its own identifier, we can also add a type name to the struct itself, which lets us use it also outside the union.

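Once the register values are stored in the struct fields, we can read the full value from the 16-bit `value` field. Of course, it doesn’t require a union to combine those two register values; we could also just use bitwise shifting and an OR operation. On the ATmega328, ADCL has to be read before ADCH, so the low byte is fetched first:

uint8_t low = ADCL; /* reading ADCL first locks the result until ADCH is read */
printf("0x%04x\n", (ADCH << 8) | low);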

Truth be told, there is actually nothing unique about unions. In whichever way you are using them, you could achieve the same with either bitwise operations or pointer casts. But that equivalence is exactly what makes them interesting.

Shortcuts with Unions

Let’s have another look at the previous byte-extraction example and see what other options we have to get a single byte out of an integer. As we remember, we had a union with a 32-bit integer and an array of four 8-bit integers:

union data_bytes {
    uint32_t data;
    uint8_t bytes[4];
};

The most common way to extract parts of any value is combining bitwise shifts with an AND operation. In this particular case, however, we can also cast a pointer to the 32-bit value into a pointer to a series of 8-bit values. Let’s implement all of these options and see what that looks like.

Taking a closer look at the pointer casts, we are telling the compiler that whatever is located at the memory address of the 32-bit value is in fact a collection of 8-bit values. The union declaration says much the same thing: whatever is located at the union’s memory address is either one 32-bit value or four 8-bit values. The difference is that with a union, every access names explicitly which of the two views is meant. In a sense, unions provide a shortcut to data type conversions while keeping the set of permitted interpretations declared in one place, where the compiler can check field names and types. You could say that unions are to pointer casts what enums are to a bunch of preprocessor constants.

Looking into floating point numbers

Let’s have another example and explore floating-point numbers, IEEE 754 single-precision floating-point numbers to be precise — also known as a float. If you ever wondered what a float looks like to a CPU, just make it think it’s an integer. Obviously not in a “cast an int to float to remove the fraction part” way, but in a “raw IEEE 754 binary32 format” way.

Both will output 0x42835000, which won’t tell us much without thoroughly studying the binary32 format, a combination of a sign, exponent, and fraction value with standardized bit widths. Recalling the concept of a bitfield, we can extend the union with a struct that helps us take the binary32 format apart. For completeness, the same data is also extracted with bitwise operations as a non-union alternative.

I’ll leave it for you to decide which option is clearer to read and easier to maintain. Either way, the output will give us a sign value 0, exponent 133, and the fraction 0x35000. Following the format’s definition, we can construct the initial floating point number 65.65625 back from it. So if you ever end up analyzing some raw data dump or binary blob and come across a floating point value, now you know how to use a union to find out what number it represents.

That’s All Folks

There are two more things to worry about when using unions to peer inside other data types: endianness and alignment. Most computers and microcontrollers are little-endian, but watch out for the Motorola 68k and AVR32 architectures, which are big-endian. For performance reasons, many processors also prefer multi-byte values to be aligned on 2-byte or 4-byte boundaries, so the compiler may insert padding between struct fields; a uint8_t followed by a uint32_t, for instance, can end up four bytes apart in memory. In GCC, you can use the packed and aligned attributes to control this behavior, but you may be subject to a speed penalty, and the details are beyond the scope of this article.

This concludes our expedition into structs and unions. Hopefully we could give you some new insights and ideas of how to arrange your variables, and some convenient alternatives to handle them. Let us know if you can think of other ways to make use of all this, and in what peculiar ways you have used or come across unions before.

Unions were designed to save space, using the same memory to store two or more different types of data. They were not meant to be used to extract bytes, nibbles, or bits, nor to implicitly cast data. Using unions to do so is unsafe, as the compiler has wide latitude to implement the actual storage however it wants to.

Mis-using unions in these ways is non-portable, and will likely result in entertaining hours of bug hunting.

As follow-on advice: use the native register size [unsigned] int whenever you’re not dealing with a value range which is precisely the 2^sizeof(short) or 2^sizeof(char). Range-checking at the start of your API will spend many fewer cycles than the generated machine code and bus cycles to manipulate less-than-register-sized values. You might think you’re saving space, but it just isn’t worth it at today’s memory sizes.

If you understand why padding from struct alignment borks portability inside a union, then it can be useful.
Most C/C++ people will also avoid using direct in-line assembly, and when it is necessary, wrap it in an abstraction with a meaningful comment explaining why it was done. Almost every modern MCU I have used has a specific subset of special macros to handle platform-specific eccentricities.

The gcc tool suite may sometimes generate unoptimized binaries on some platforms (sometimes this feature is useful too), but it does support most modern processors rather consistently. I really am thankful industry decided to embrace an unofficial standard compiler after 35 years of pain-in-class compilers. =)

It’s important to remember that unions in C++ don’t follow the C convention. After writing to one field of a union, accessing the other fields is undefined behavior (although most compilers will implement them the same way as C).

For those using C# you can achieve a union using explicit struct layout. Mark the struct with a StructLayout attribute with the parameter LayoutKind.Explicit and then apply FieldOffset attributes to each field. The FieldOffset attribute accepts an argument for how many bytes from the start of the structure to offset that field. There is no bit-level alignment like in C/C++, but even byte-level alignment is very powerful. Just take care with your order of initialization, since the C# rule that every field of a struct must be initialized is still enforced for an explicit struct.

Not just another platform: I’ve had terrible problems with structure alignment on Solaris when trying to use gcc to build some code that simply refused to compile correctly with Sun CC. Small test programs can save you hours of head scratching when you’re trying to debug this stuff.

I do this with a lot of serialization code, but you have to be careful transporting them between different architectures. For example, transporting a struct from an MCU to a computer did require bit shifting at one end to decode the data correctly.

I use bit fields in a struct with XC8 in MPLAB X. It’s a great way to hold status flags that are less than 8 bits wide. This is also the way compiler makes it possible to use the names of individual bit fields of registers…

Nice article.
One thing (already mentioned here) that bites me from time to time is how structs are packed. This one from Eric Raymond covers many odd cases in more detail: The Lost Art of C Structure Packing

A few comments:
A lot of folks don’t like bitfields because there is no specification on how the compiler will pack them, so your code might not work correctly on other systems. As stated “some caution is required”.
They are also disliked because folks expect the compiler to make less efficient code to deal with them than they can code by hand. Maybe that concern has dissipated in this era of super-high speed processors, but for microcontrollers I imagine it is still important.
Finally, as specified they only work with int-sized items. Some compilers might go beyond the spec to allow packing into unsigned chars, short ints, etc., but you can’t count on it.
Many programmers avoid them and stick with doing explicit bitwise operations so that they know their code will always work.
I like bitfields myself, as they make code much more readable, but I have to be very aware of how I am using them if I ever intend to port the code to any other platform.

Also, thanks for letting me know about “designated initializers” in C. I learned well before C99 and did not know that they were a thing!

This reminds me of the REDEFINES clause in Cobol, but in C it seems to be there only to confuse coders and generate bugs.
Maybe because C is just assembly language badly written. Just joking…
Example in Cobol:
05 A PICTURE 9999.
05 B REDEFINES A PICTURE 9V999.
05 C REDEFINES A PICTURE 99V99.

I’ve only been programming in C for a bit over 40 years (I still have an original copy of K&R on the shelf) :-), and I’d generally say that if you are using the ‘union’ feature, either a) you don’t know what you are doing, b) your program is crap, or c) both.

If you want to treat a block of memory as a block of memory, use a block of memory. If you want to use a type variable, use one. If you want to convert it from one to the other do so deliberately so you get it right in the context that you are using.

The ‘union’ approach will just lead to bugs, side effects, and hard to maintain code.

Please, Wise One, enlighten me, how I should correctly solve a following problem:

I’m writing for an 8-bit micro, so memory is accessed in single bytes. I have three short ints that hold three 16-bit wide calibration values set by the user. I want to save them to internal EEPROM, but I can only write 8 bits at a time, and I have to read them in the same order upon reboot. So I packed them in a struct and unionized it with an array of 6 chars. I can then write the first char from the array to the first EEPROM address, the second to the second, etc. I read it back in the same order.

O, Wise One, show me the correct way to do it. Enlighten me. Show me the way…

First problem is that you can only write one byte at a time. So you write a routine (you could do eeprom bounds checking if required etc) – and assuming you aren’t doing more than 254 byte writes, which you probably aren’t on a constrained system.
So a simple one would be (and I agree this could be much more optimized if speed was a problem)

So you replaced a union with explicit type conversion, which does exactly the same thing but without the word “union”. And with pointers, which should be avoided at all costs. Besides, this is a rather application-specific thing and won’t be ported to other platforms without considerable rewriting…

Indeed… using a union is no safer in this case. Plus, I really don’t see how you can write a non-trivial application without using pointers. Your entry point typically has the prototype int main(int argc, char** argv); whoops, there’s a pointer right there!

Unless you live by declaring everything statically in one place where it’s all globally accessible (ugh!), you’re going to have pointers.

Yes, pointers (properly used) are much better than a union.
A union has side effects. It is that simple. You update what looks like a variable, and another variable changes.
You can do that with pointers too, but reasonable programmers don’t, i.e. they don’t have two different pointers to the same object unless there is some type of management code.
And even then, dereferencing a pointer is clearly changing something in memory.

Unless you have a highly specific reason to use unions, I don’t think you should.

That’d allow for a number of common C data types, with a struct member that defines which of those union members is relevant and how to interpret it. i.e. if type == TYPE_CHAR; then ptr should be considered a char*. If something is horrendously big; you might use a bit in flags to indicate that size is measured in 16-bit words or 32-bit long words. If you’ve got a tiny string; you’d stuff it in byte and the type would be set accordingly.

Yes, the struct is 16-bytes long, but it’d let you handle just about anything.

“generally” may be true, as the vast majority of programmers do not “generally” deal with hardware registers or have memory constraints that limit how much storage can be utilized. I, too, have 40 yrs experience, with most of that dealing with hardware manipulation (CPU, PCI, etc.). Your comment about code being crap for using unions and structs is just plain wrong. Yes, there are appropriate times and places to use unions and structs, but to diss them wholesale is bogus; I use these constructs as necessary, not on a whim.

Here is a snippet from the Second Edition of The C Programming Language by Brian W. Kernighan and Dennis M. Ritchie:

When storage space is at a premium, it may be necessary to pack several objects into a single machine word; one common use is a set of single-bit flags in applications like compiler symbol tables. Externally-imposed data formats, such as interfaces to hardware devices, also often require the ability to get at pieces of a word.

Using the above named fields is far easier to read and maintain, and removes worry about which bits are being manipulated. A former coworker was not aware of bit fields in C, and thus used masks to access bits in registers, sometimes incorrectly.

I think you missed a bit of what I said. I in no way said that structs were bad! They are good! Use them everywhere!

In your example, I see why you have used a union. The function is expecting a uint64_t, but it really isn’t one as it is a packed bit field. And you (I assume) are never going to manually change Uint64, you are just going to pass it to a function that you don’t control. If memory is super tight etc etc I can see why you would do that.

Still, it would be better to encapsulate it and pass an address of the structure instead, it would be much clearer and less prone to errors..

Though my viewpoint is significantly biased as I
1) write code that some other idiot is going to have to maintain or change in the future.
2) sometimes that idiot is me, as I am still supporting code I wrote 30 years ago…

Every register access in the world of PIC programming using XC8 uses a combination of register bitfields in a struct unionized with the name of the register. So for example I can access a bit in a port using PORTAbits.RA0 or write to the entire port with PORTA. So if the compiler and IDE use unions all the time, why is it evil to use them in my code?

Also to improve readability of code I use meaningful names and lots of comments. Anyone who says that code should be its own documentation is a moron…

I use unions for embedded code on microcontrollers with small amounts of RAM. I’ll declare a global array of unions, each element of which can be a single uint32_t, two uint16_t variables, or four uint8_t variables (I generally avoid floating point in my embedded code). Make convenient scratch-pad variables without having to declare them within each function.

You won’t be having any floating point code if you’re worried about 8 bytes of RAM for non-overlapping scratch-pad variables! I must admit that it’s only been very recent embedded chips (last year or so) that I have ever been tempted to use floating point – and to be honest I still haven’t found anywhere I want to. The floating point library is just too big!
So I get what you are doing, but the way I do it is to have just a blob of memory that every function knows it can use, but can’t rely on if it calls anything else or exits. Then you don’t have to worry about accidentally writing to one variable while whacking another (the union way).

Everyone’s needs are a little different. To me, the primary advantage of floating point numbers is their ability to hold very very large numbers or very very small numbers, but I don’t need that for most of my code. I’d rather use a (u)int32_t and fixed point precision to make the math less computationally intensive.

As for the union, I’m using it to cast the “blob” of memory you refer to as the kind of variable type that I need. It makes the code very clean. It is always clear from inspection how I am using the memory, and I don’t have typecasts sprinkled all over the place.

“Make convenient scratch-pad variables without having to declare them within each function”

So you’re doing by hand what the compiler does? On a proper CPU, a function’s local variable allocation is always ideal, because exactly the right amount of space is allocated as its needed, per function and only the required space that’s needed for live variables in the entire program or thread is currently occupied.

On an 8-bit PIC, the compiler attempts to do this by creating a statically allocated stack, but variables from different functions at the same call level will occupy the same RAM addresses: they are part of a ‘union’. The only difference with doing it by hand is that you’re likely to make more mistakes.

I’m a C n00b and just came across unions and this name.variable stuff in some code I was looking at, which I didn’t understand and didn’t yet bother to look into, and then BAM!: HaD has a post on this very topic just in time! The comments have also been helpful for things I should look out for if I intend to write code that uses them.

I guess the tracker on my PC isn’t working correctly; I could have used some help on structs about a month ago.

Anyway, this was a fairly meaty and interesting article. It has sort of scared me off of unions, though I am now aware of them, and in future I will likely fall across a problem where they will be useful.

One great thing about structs is using them for passing data between functions. Sometimes, especially when you need to update some sort of state variable, you don’t know at the outset everything that you need to include in that state variable. Structs (and to a lesser extent unions) to the rescue! Simply pass a struct or union (or better yet a pointer to one, since structs are otherwise passed by being copied onto the stack) to your function. If you find that you forgot something, just add it to your struct definition and it is automatically passed along for the ride without having to modify your function definition (i.e. the parameter list stays the same); you just access the new members of the struct.

The only usable case I (would) use unions is when one function is a dispatcher for others, depending on the input data, without the need of using void pointers and guessing the data type/format. For example signalling between different layers of software.
Create a union of structs which all have the same format for the first few variables (e.g. signalId, senderId, timestamp) and the rest completely different. Now the dispatcher can check those first values (due to the same beginning of the structures, they are always at the same memory addresses), cast the rest to the correct structure, and call the needed function. Minimum use of memory, and the type cast is always correct…

Most of you are writing as if the one, the only, and the primary concern of ALL C-language programming is to write portable code.

But sometimes, it doesn’t matter. I know, that sounds like sacrilege these days. But let’s get real. Sometimes you need throwaway interface code that does something very specific on a very tight platform.

Even when C was first written, it was obvious that there would have to be platform-specific things that would need to be rewritten. The goal of writing tight, memory efficient code was still very much a concern. That’s why the union keyword exists. That’s why C has all those wonderful ways you can shoot yourself in the foot with pointers.

Yes, like pointer math, unions are dangerous tools that can be used and abused very badly. But sometimes, you need them.

If speed and memory efficiency are secondary concerns and portability is the primary concern, then go use whatever high level language you want. The C programming language is probably not the language you should be using.

And if you REALLY don’t care about portability but want ultimate speed and memory efficiency, macro assembly language is rapidly becoming a lost art.

I write code for very memory-constrained microcontrollers, and sometimes I have to write assembler just to fit it all in. But when I don’t, I use C, and portability is a long way down my list of priorities. Tight (and readable) code is at the top of my list. And as you say, pointers & unions are two ways of assisting with that goal.