A New Appreciation for Data Types

I've spent a good part of my professional life teaching C++. Nearly all of my students have had prior experience programming in C. After all these years, I still find it fascinating how often learning C++ forces C programmers to refine their understanding of, or completely relearn, parts of C that they thought they already knew.

Last fall, I explained that every literal in C and C++ has a type ("Numeric Literals," September 2000). For example, many C programmers don't realize that the literal 0 has a type. They know they can use it as an integer, as in:

int n = 0;

or as a pointer, as in:

char *p = 0;

but they don't really know what type it is. It makes little difference to C programmers.
It does make a difference to C++ programmers. C++ supports function name overloading, and the exact type of a literal just might determine which function is the best match for a particular call. For example, given:

int f(int);
char *f(char *);

which of these functions does f(0) call? The answer is the first one, because 0 is an int.
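One way to check this claim is to have each overload report which one was called. The following is only a sketch: the return types differ from the declarations above (both overloads return a string tag here so the result can be tested), and demo_overloads is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstring>

// Illustrative variants of the overloads in the text. The return type is
// changed to a string tag so the caller can tell which overload ran.
char const *f(int)    { return "f(int)"; }
char const *f(char *) { return "f(char *)"; }

// Hypothetical helper that exercises both overloads.
void demo_overloads()
{
    // The literal 0 has type int, so f(int) is an exact match.
    assert(std::strcmp(f(0), "f(int)") == 0);

    // Converting 0 to char * first selects the pointer overload instead.
    assert(std::strcmp(f(static_cast<char *>(0)), "f(char *)") == 0);
}
```

Calling demo_overloads() completes without tripping an assertion, which is consistent with 0 being an int rather than a pointer.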
Why bring this up again? My most recent column on reference types in C++ used operator overloading as an example of a legitimate use for references ("References vs. Pointers," April 2000). I thought I'd write a bit more about operator overloading, and I was about to do just that when I remembered that operator overloading is another one of those C++ features that forces C programmers to revisit what they think they already know.

Operator overloading lets you define new meanings for existing operators. Operator overloading can make new types easier to use by making them look and act like other types in the language with which you are already familiar. For example, C++ does not provide rational numbers (exact fractions) as a predefined type. However, you can define a rational number type (as a class) and define new meanings for operators such as +, -, *, and / so that they have the "expected" behavior when applied to rational numbers.
In nearly all cases, the "expected" behavior of an overloaded operator is behavior that closely resembles that of its built-in counterpart. For example, for any object x of a scalar type, ++x using the built-in prefix ++ operator yields exactly the same result as x += 1 using the built-in += operator. When you define ++ and += for any rational number x, ++x should still yield the same result as x += 1.
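A simple way to keep ++x and x += 1 in agreement is to implement one in terms of the other. The class below is a deliberately minimal sketch, not a complete rational type: the names rational, num, and den are assumptions, and it does no reduction to lowest terms, no overflow checking, and no zero-denominator checking.

```cpp
#include <cassert>

// A minimal, hypothetical rational number class for illustration only.
class rational
{
public:
    rational(long n = 0, long d = 1) : num(n), den(d) {}

    // x += r adds r to x and yields x itself, as the built-in += does.
    rational &operator+=(rational const &r)
    {
        num = num * r.den + r.num * den;
        den = den * r.den;
        return *this;
    }

    // Prefix ++ is written in terms of +=, so the two cannot disagree.
    rational &operator++() { return *this += 1; }

    // Equality by cross-multiplication (assumes positive denominators).
    friend bool operator==(rational const &a, rational const &b)
    {
        return a.num * b.den == b.num * a.den;
    }

private:
    long num, den;
};

// Hypothetical helper: ++x and x += 1 leave equal values behind.
void demo_increment()
{
    rational a(1, 2), b(1, 2);
    ++a;
    b += 1;
    assert(a == b);
    assert(a == rational(3, 2));
}
```

Returning a reference from both operators is a design choice that mirrors the built-in operators, which yield the modified left operand.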

Operator overloading makes life easier by letting you keep doing what you already know how to do, but only if you also get a familiar result. Thus, if you're going to do operator overloading in C++, and do it well, you really need to understand the behavior of the built-in operators so you can mimic that behavior in overloaded operators. For many programmers, operator overloading is a real eye-opener. When they try to get an overloaded operator to behave as much as possible like its corresponding built-in operator, they realize they don't really know how the built-in operator works.

C programmers have a less compelling need to know the precise behavior of the built-in types and operators. Nonetheless, most good C programmers make an effort to understand the type system anyway, because that's the sort of thing that good programmers do.

Defining your own types and operators raises all sorts of interesting questions about design philosophy and programming style. I believe the answers to many of these questions can be found among the answers to more fundamental questions about the nature of data types in general. Before we delve into defining new types and operators, let's look carefully at the ones we already have.
C and C++ provide a variety of data types. These types include:

Predefined types such as character, integer, and floating-point types

Scalar types such as enumerations and pointers

Aggregate types such as arrays, structures and unions

C and C++ have some minor differences here. For example, both languages provide signed and unsigned varieties of the character and integer types. Both languages also provide short and long varieties of the integer types, but only C has long long types. C also includes complex arithmetic types among the predefined types. On the other hand, C++ includes a predefined boolean type. C++ also provides references in a category all by themselves. (C++ does not consider references to be scalar types.) And, of course, C++ provides classes.

Why do C and C++ offer so many types? One fairly obvious answer is storage economy. Different types represent different amounts of storage. A character typically occupies only a single byte. A plain integer often occupies a 4-byte word. C and C++ give you a variety of types so that you can pick sizes for your objects that are big enough for your needs, without being unnecessarily big (and wasteful).
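The storage argument can be made concrete with sizeof. This is a rough sketch: the array names and the element count of 1000 are made up for illustration, and the actual size of unsigned int depends on the implementation.

```cpp
#include <cassert>
#include <cstddef>

// sizeof(char) is 1 by definition. The other sizes are
// implementation-defined, but the language guarantees this ordering.
static_assert(sizeof(char) == 1, "char is the unit of sizeof");
static_assert(sizeof(short) <= sizeof(int), "short is no wider than int");
static_assert(sizeof(int) <= sizeof(long), "int is no wider than long");

// Hypothetical data: the same 1000 small counters stored two ways.
unsigned char small_counters[1000];
unsigned int  wide_counters[1000];

// Bytes saved by choosing the narrower element type.
std::size_t bytes_saved()
{
    return sizeof(wide_counters) - sizeof(small_counters);
}

// Hypothetical helper that checks the arithmetic above.
void demo_sizes()
{
    assert(sizeof(small_counters) == 1000);
    assert(bytes_saved() == 1000 * (sizeof(unsigned int) - 1));
}
```

On an implementation where unsigned int occupies 4 bytes, bytes_saved() would report 3000.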

Storage economy is a reason for supporting a small variety of types, but it's not a reason for supporting a larger variety. Assembly languages have very few rudimentary types, such as byte, word, long word, and arrays thereof. These types are enough to let you specify any size object you want. Why don't C and C++ have just a few types such as byte, short word, word, and long word? Why do programmers need any more variety?

I can think of two reasons, the first of which cuts to the very heart of what a data type is.

Reason #1: data types improve compile-time error detection

When you declare an object with a particular type, such as int or char *, you're telling the compiler how you intend to use that object later in the program. The compiler can then check that your program uses the object only as intended. This is a good thing.
Compile-time type checking turns potential run-time errors into compile-time errors, which are much easier to spot and fix.

For example, when you declare n as an int, you're telling the compiler that you intend to store integer values into n. You can do arithmetic with n by using it as an operand for operators such as +, -, binary *, and /. However, you can't use n as a pointer by applying the unary * operator, as in:

int n;
...
*n = 3; // can't dereference an int

When you declare p as a char *, you're telling the compiler that you intend to store pointer values into p. You can use p as a pointer by applying the unary * operator. You can do limited arithmetic on p by using it as an operand of the + and - operators. However, you can't use p as an operand of the binary * and / operators.
In general, a data type describes a set of behaviors for a data object. It tells the compiler what the program can and cannot do with that object during program execution. The compiler can verify that the program uses the object only in ways permitted by its type, and reject any program that oversteps that permission.
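The sketch below shows both sides of that permission. The function length is a hypothetical example that uses only operations a character pointer allows. The commented-out lines are ones I would expect a conforming compiler to reject.

```cpp
#include <cassert>
#include <cstddef>

// Returns the length of a null-terminated string using only the
// operations that the type char const * permits.
std::size_t length(char const *s)
{
    char const *p = s;
    while (*p != '\0')  // unary * is allowed on a pointer
        ++p;            // so is stepping to the next element
    return static_cast<std::size_t>(p - s);  // pointer - pointer is an integer
}

// Each commented-out statement below should fail to compile:
//   int n = 0;
//   *n = 3;            // can't dereference an int
//   char buf[4];
//   char *q = buf;
//   q = q * 2;         // can't multiply a pointer
//   q = q / 2;         // can't divide a pointer

// Hypothetical helper that exercises length.
void demo_length()
{
    assert(length("abc") == 3);
    assert(length("") == 0);
}
```

Uncommenting any one of the rejected statements inside a function turns a would-be run-time mistake into a compile-time diagnostic.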

Reason #2: data types simplify programming via overloading

Conceptually, adding two integers is the same operation as adding two floating-point numbers, or adding an integer to a floating-point number. However, on most architectures, these operations are mechanically different. Adding two integers uses an integer add instruction. Adding two floating-point numbers uses a floating-point add instruction, or, in the absence of floating-point hardware, it calls a floating-point add subroutine. Adding an integer to a floating-point number involves converting the integer to a floating-point number before performing a floating-point add.

When you program in assembly, you have to fuss with these different ways of adding numbers. When you program in a higher-level language like C or C++, the compiler does the fussing for you. In your source program, you just use the + operator to specify that you want the program to add two numbers. The compiler uses the types of the operands to determine what it must generate to do it.
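As an illustration, the three functions below all use the same + operator, yet the operand types call for three different operations. The function names are hypothetical, and the static assertions assume a C++11 or later compiler.

```cpp
#include <cassert>
#include <type_traits>

// The operand types, not the spelling of the operator, determine the
// type of the result and hence the kind of addition performed.
static_assert(std::is_same<decltype(1 + 2), int>::value,
              "int + int yields int");
static_assert(std::is_same<decltype(1.5 + 2.5), double>::value,
              "double + double yields double");
static_assert(std::is_same<decltype(1 + 2.5), double>::value,
              "int + double yields double");

int    add_ints(int a, int b)          { return a + b; }  // integer add
double add_doubles(double a, double b) { return a + b; }  // floating-point add
double add_mixed(int a, double b)      { return a + b; }  // convert a, then floating-point add

// Hypothetical helper that exercises all three.
void demo_add()
{
    assert(add_ints(1, 2) == 3);
    assert(add_doubles(1.5, 2.5) == 4.0);
    assert(add_mixed(1, 2.5) == 3.5);
}
```

The exact instructions generated depend on the target, but in each case the compiler picks them from the declared types alone.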

In C, a data type is strictly a compile-time entity. The compiler keeps track of type information as it compiles each translation unit. When the compilation is done, the compiler leaves no overt trace of the type information anywhere in the object code.
For example, when the compiler encounters an enumerated type definition, such as:

enum day { Sunday, Monday, ... };

it stores each of the names (day, Sunday, Monday, and so on) into a symbol dictionary along with information describing what each name means. However, the compiler generates no data storage or machine instructions to represent the names in the object code.
Later, when the program declares:

enum day d;

the compiler generates data storage to hold d. The attributes of type day determine the size of d, but day itself generates no storage. When a statement such as:

d = Monday;

appears in the program, the compiler generates code to store 1 (the value of Monday) into the storage allocated for d. Depending on the target machine's instruction set and addressing modes, the value 1 representing Monday may appear as storage in the resultant object program, possibly as an immediate operand in some instruction or as data in the constant data segment. In any event, the type day itself still doesn't materialize directly in the object program.
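The enumeration example can be sketched in compilable form as follows. I've kept only the two enumerators that the text names rather than guess at the elided ones, and set_to_monday is a hypothetical helper. In C++, the tag day names the type by itself, so the keyword enum is optional in the declaration of d.

```cpp
#include <cassert>

// Only the enumerators named in the text; the rest are omitted here.
// This definition by itself reserves no storage.
enum day { Sunday, Monday };

// Enumerators count up from 0, so Monday is a compile-time constant 1.
static_assert(Monday == 1, "Monday has the value 1");

// This definition, not the enum definition, reserves storage.
day d;

// Typically compiles to little more than "store 1 into d".
void set_to_monday() { d = Monday; }

// Hypothetical helper that checks the stored value.
void demo_day()
{
    set_to_monday();
    assert(d == Monday);
    assert(static_cast<int>(d) == 1);
    assert(sizeof(d) == sizeof(day));
}
```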

When you compile the program with symbolic debugging enabled, the compiler does plant type information into the object file. But the type information still isn't part of the object program; it's auxiliary data for the debugger.

Whereas data types in C are strictly compile-time entities, data types in C++ can have run-time properties. Class types in C++ can have virtual member functions, whose behavior is determined by type information that must be available at run-time. I'll have more to say about virtual functions in due time.
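As a brief preview, here is a minimal sketch of a virtual function call. The classes shape and circle and the function describe are hypothetical names chosen only for illustration.

```cpp
#include <cassert>
#include <cstring>

// A hypothetical base class with one virtual member function.
class shape
{
public:
    virtual ~shape() {}
    virtual char const *name() const { return "shape"; }
};

// A derived class that overrides name().
class circle : public shape
{
public:
    virtual char const *name() const { return "circle"; }
};

// The static type of s is shape. Which name() runs depends on the
// type of the object s actually refers to, known only at run time.
char const *describe(shape const &s) { return s.name(); }

// Hypothetical helper that calls describe with both kinds of object.
void demo_virtual()
{
    shape s;
    circle c;
    assert(std::strcmp(describe(s), "shape") == 0);
    assert(std::strcmp(describe(c), "circle") == 0);
}
```

For describe to pick the right function, each object must carry some information about its type at run time, which is something a C object never needs.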

Dan Saks is a high school track coach and the president of Saks & Associates, a C/C++ training and consulting company. He is also a consulting editor for the C/C++ Users Journal. He served for many years as secretary of the C++ standards committee. You can write to him at dsaks@wittenberg.edu.