I define a data representation as a finite set of tokens that stand for units of elemental information in a given field of interest. The tokens can be symbols, numbers, or other signals. A data representation is invented to meet some need, such as recording or transmitting information. In a general sense, the set of twenty-six Latin letters that make up the Western alphabet is an example of a data representation; its symbols were invented for recording the words of natural language in written form. The letter 'M', for instance, is the symbol that represents the spoken sound "Mm." One must learn to read (that is, learn how to interpret the data representation) before one can make the connection between the symbol 'M' and the sound "Mm." These letters are combined in series to form words by more-or-less regular rules of phonetic spelling.

In situations where spoken or written communication is not feasible, people may want to spell out words and transmit the letters across a distance. Secondary representations have been invented for transmitting alphabetical letters in a non-written form. We call them 'secondary representations' because their tokens stand for the tokens of another data representation. Examples of secondary representations include Morse code (a system of dots and dashes used in telegraphy), the maritime flag code (flags used in ship-to-ship communication), and semaphore code (signalling of messages by the positioning of two flags).

[Figure: Secondary Data Representations for the Letter 'M', showing its Morse code, its maritime flag, and its semaphore signal.]
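As a rough illustration of a secondary representation, Morse code can be sketched in Python as a lookup table whose tokens stand for Latin letters. The table below uses the standard International Morse codes for the twenty-six letters; the names MORSE, LETTERS, encode, and decode are hypothetical and chosen only for this sketch.

```python
# A secondary representation: each Morse token stands for a Latin letter,
# which in turn stands for a spoken sound.
MORSE = {
    "A": ".-", "B": "-...", "C": "-.-.", "D": "-..", "E": ".",
    "F": "..-.", "G": "--.", "H": "....", "I": "..", "J": ".---",
    "K": "-.-", "L": ".-..", "M": "--", "N": "-.", "O": "---",
    "P": ".--.", "Q": "--.-", "R": ".-.", "S": "...", "T": "-",
    "U": "..-", "V": "...-", "W": ".--", "X": "-..-", "Y": "-.--",
    "Z": "--..",
}

# Inverse table, used by the receiver to interpret incoming tokens.
LETTERS = {code: letter for letter, code in MORSE.items()}


def encode(word):
    """Spell a word as space-separated Morse tokens."""
    return " ".join(MORSE[ch] for ch in word.upper())


def decode(signal):
    """Interpret space-separated Morse tokens as Latin letters."""
    return "".join(LETTERS[token] for token in signal.split())


print(encode("M"))  # prints: --
```

Because the inverse table has exactly one letter per token, a receiver who knows the representation can recover the letters without ambiguity.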

Another common data representation is the set of Arabic numerals (0 through 9), which was invented for recording quantities of things. The symbol 3, for instance, is used to represent a quantity of three objects of a given type.

A helpful feature for a good data representation is that the correspondence between tokens and units of information be unambiguous. The meaning of a token can sometimes depend on the context, but, given the proper interpretive context, the intended meaning should be clear. In other words, one token in the data representation should correspond to one unit of information in the field of interest.

For the discussion that follows, I restrict the definition of a data representation to sets of tokens for use in digital computers. In modern computers, all information is represented fundamentally as series of binary digits. A binary digit (or bit) can represent only two different values, which are usually called one and zero. [Binary representation is not a necessary property of computers. Computers have existed where the fundamental representation had ten possible values, three possible values, etc. Analog computers represent data in a continuous range between two boundaries, similar to the speedometer of a car or the mercury in a thermometer. It is somewhat of a historical accident that binary computers prevailed, but by now computers are almost universally binary.] Having only two possible values, a bit can represent only two different units of information. These units can be interpreted as one/zero, true/false, on/off, yes/no, hot/cold, up/down, etc. Correctly reading the meaning of a bit depends on knowing which of the many possible binary data representations was used when the bit's value was recorded.

To allow computers to represent more than two different things, bits are combined in series. For the sake of convenience, the value of a bit is usually talked about as '1' or '0', regardless of its actual meaning. By simple combinatorics, a series of two bits can represent four different units of information, as follows:

00 01 10 11

A three-bit series can represent eight different units of information; four bits can represent sixteen units; and so on. In general, a series of n bits can represent 2^n units. To represent the set of twenty-six Latin letters, we need at least five bits in series, because four bits give only sixteen combinations while five bits give thirty-two.
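The counting argument above can be checked with a short Python sketch. The function names bits_needed and combinations are hypothetical, introduced here only for illustration.

```python
from itertools import product


def bits_needed(n_units):
    """Smallest number of bits whose combinations cover n_units distinct units."""
    if n_units < 1:
        raise ValueError("need at least one unit of information")
    # (n_units - 1).bit_length() is the smallest k with 2**k >= n_units;
    # even a single unit still occupies one bit.
    return max(1, (n_units - 1).bit_length())


def combinations(n_bits):
    """List every distinct series of n_bits bits, in counting order."""
    return ["".join(series) for series in product("01", repeat=n_bits)]


print(combinations(2))   # prints: ['00', '01', '10', '11']
print(bits_needed(26))   # prints: 5
```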

As in the Morse code example above, we now have two levels of data representation:

In the primary data representation, the letter 'M' represents the spoken sound "Mm."

In the secondary data representation, a series of bits represents the letter 'M'.

The most common binary data representation for text is ASCII (American Standard Code for Information Interchange). ASCII uses a seven-bit series, with combinations for the upper- and lower-case alphabetical characters, numerical digits, punctuation symbols, and various codes that are frequently needed for computer communications via modem. [There are several 8-bit extended ASCII representations that add special characters.] The ASCII code for the letter 'M', for instance, is 1001101. In most modern computers, textual data is recorded in the ASCII data representation. The binary pattern 1001101, however, is also used in computers to represent the numerical quantity 77 decimal [viz., 1x64 + 1x8 + 1x4 + 1x1], and so this bit pattern has multiple semantics. This same series also serves to represent many other types of data (for instance, as a bitmap for a graphical image). Therefore, an interpretation must be provided to a program if it is to understand the intended meaning of a given seven-bit pattern.
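As a small illustration of the point that one bit pattern carries more than one meaning, the following Python sketch reads the series 1001101 first as an unsigned binary number and then as an ASCII character code. The variable names are hypothetical.

```python
PATTERN = "1001101"  # the seven-bit series discussed above

# Interpretation 1: an unsigned binary number.
as_number = int(PATTERN, 2)

# Interpretation 2: an ASCII character code.
as_character = chr(as_number)

# The place values that are switched on in the pattern.
place_values = [
    2 ** (len(PATTERN) - 1 - position)
    for position, bit in enumerate(PATTERN)
    if bit == "1"
]

print(as_number, as_character, place_values)  # prints: 77 M [64, 8, 4, 1]
```

Nothing in the bits themselves says which reading is intended; the program chooses by calling int or chr.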

Let's say we need a binary data representation for colors, so that we can record which color we want displayed on the computer screen. We will need an unambiguous data representation for colors as well as for text. We have three choices of strategy, as follows.

(a) We can increase the number of bits in the series we invented for Latin letters, so that there will be enough combinations for representing both letters and colors unambiguously.

(b) We can let the binary codes for colors overlap the codes for letters, so that a given binary combination can be either a letter or a color. This entails that one must know which data representation is being used in order to understand the meaning of a given binary code.

(c) We can express the screen colors in English as a series of Latin letters, such as "green," "yellow," "aquamarine," etc. This entails that the hardware for the computer screen must translate these English words into whatever form it uses internally to control the screen color.

The first strategy (called increasing the code space) is reasonable, except that it is 'brittle'. If we later find that we need to encode more types of information (for example, a set of numbers to represent the locations of files on a hard drive), then we'll need a longer bit series. Adding bits to the series entails that we must redefine the data representation we made for letters and colors in order to accommodate the added bits. All existing data would need to be converted to this new representation. That idea is impractical, given that there are trillions of bits of computer data dispersed around the world. To avoid a future data conversion, we might try to take an inventory of all types of information we might ever need to represent in a computer, and then specify a bit series long enough to unambiguously represent all units of information in all fields of interest. General-purpose computers, however, are intended to serve the needs of an unbounded number of fields of interest, and so it is not possible to specify an exact number of bits that we would need to unambiguously represent all types of information.

The second strategy (called overlapping code spaces) is the technique most commonly used for data representations in computers. This strategy entails that one always must know which data representation is being used if one is to understand the meaning of a bit series. If one examines a bit series at a random location in computer memory, it is impossible to know what the bit series stands for; one must know which data representation the bit series was coded in. One of the main responsibilities of computer programs is to keep track of where the data are in memory and what data representation they are coded in. Using overlapping code spaces, there is an unbounded number of possible data representations, and existing data do not have to be reformatted when new data representations are invented.
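Overlapping code spaces can be sketched in Python with the standard struct module: the same four bytes are read as ASCII text, as a 32-bit unsigned integer, and as a 32-bit floating-point number. The byte values, the representation labels, and the function name interpret are assumptions made for this example; big-endian byte order is likewise an arbitrary choice.

```python
import struct

RAW = bytes([0x4D, 0x61, 0x70, 0x21])  # four arbitrary bytes from "memory"


def interpret(raw, representation):
    """Read the same bytes under a named data representation."""
    if representation == "ascii":
        return raw.decode("ascii")
    if representation == "uint32":
        return struct.unpack(">I", raw)[0]  # big-endian unsigned 32-bit integer
    if representation == "float32":
        return struct.unpack(">f", raw)[0]  # big-endian 32-bit float
    raise ValueError(f"unknown representation: {representation}")


for name in ("ascii", "uint32", "float32"):
    print(name, interpret(RAW, name))
```

The bytes never change; only the label passed by the program does, which is the bookkeeping responsibility described above.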

The third strategy (called text interpretation) is used in some data representations. An important example of an interpreted-text data representation is SGML (Standard Generalized Markup Language). One of the derivatives of SGML is HTML (Hypertext Markup Language), which is the data representation for documents on the World Wide Web. Interpreted-text data representations typically use only ASCII characters; no other binary representations are needed. When new types of information are required, they are encoded using a system of text modes. The meaning of a character depends on which mode the text is in. One of the modes is plaintext mode, in which characters stand for themselves and no interpretation is needed. In other modes, characters stand for something other than themselves. An escape sequence changes the mode of interpretation. One or more characters in the data representation are designated as escape characters; an escape character followed by certain other characters makes up an escape sequence. SGML-derived data representations use the angle bracket characters < and > as escape characters to mark the beginning and end of an escape sequence; these sequences are called tags. In HTML, for example, the <MAP> opening tag and the </MAP> closing tag are used to change the mode into and out of image map mode; everything between these tags is interpreted not as text but as an image map. The text within a pair of opening and closing tags is modal to that particular tag. These tags can be nested (that is, put one inside another), so that a piece of text can be in several modes simultaneously.
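A minimal Python sketch of how an interpreter might separate escape sequences from plain text is shown below. It is a deliberate simplification: it treats anything between < and > as a tag and ignores attributes, comments, and other details of real SGML or HTML. The function name tokenize is hypothetical.

```python
import re


def tokenize(text):
    """Split text into ("tag", name) and ("text", characters) tokens."""
    tokens = []
    # The capturing group keeps the tags themselves in the split result.
    for piece in re.split(r"(<[^>]*>)", text):
        if not piece:
            continue  # skip empty strings produced by adjacent tags
        if piece.startswith("<") and piece.endswith(">"):
            tokens.append(("tag", piece[1:-1]))
        else:
            tokens.append(("text", piece))
    return tokens


print(tokenize("See <B>this</B> map"))
# prints: [('text', 'See '), ('tag', 'B'), ('text', 'this'), ('tag', '/B'), ('text', ' map')]
```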

The important difference between interpreted-text data representations and pure binary representations is that the former are modal and the latter are non-modal. With modal representations we now have three levels of representation:

In the primary representation, the letter 'M' represents the spoken sound "Mm."

In the secondary representation, the ASCII code 1001101 represents the letter 'M'.

In the tertiary representation, the character 'M' can have many meanings, depending on the modal context.

Note that, to interpret an HTML document, the computer must first know that the binary data are in the ASCII mode. It must then interpret the ASCII data, watching for escape sequences, and maintain a nested set of modes in order to interpret the text correctly.

The meaning of a token in a modal data representation is highly dependent on context. If one examines a character at a random position in a modal data file, it is impossible to know what the character stands for; one must know its modal context. In data representations where modes can be nested, determining the immediate context is not sufficient; typically one must read and interpret the file from its beginning to the point of examination in order to know the meaning of the token.
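The need to read from the beginning of the file can be sketched in Python as follows. The function scans every tag that precedes a given character offset and maintains a stack of open modes. It is a simplified illustration that assumes well-nested tags with matching closing tags; the names TAG and modes_at, and the sample document, are hypothetical.

```python
import re

# An opening or closing tag: optional slash, a name, then anything up to ">".
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")


def modes_at(text, position):
    """Return the stack of modes (open tags) in effect at a character offset."""
    stack = []
    for match in TAG.finditer(text):
        if match.start() >= position:
            break  # tags after the point of examination do not affect it
        closing, name = match.group(1), match.group(2).upper()
        if closing:
            if stack and stack[-1] == name:
                stack.pop()  # leave the innermost mode
        else:
            stack.append(name)  # enter a new, nested mode
    return stack


doc = "<P>An <B>M</B> and an <I><B>M</B></I></P>"
print(modes_at(doc, doc.index("M")))   # prints: ['P', 'B']
print(modes_at(doc, doc.rindex("M")))  # prints: ['P', 'I', 'B']
```

The two occurrences of 'M' are the same character with the same ASCII code, yet they sit in different modal contexts, and the only way to find those contexts is to interpret everything that comes before them.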