Bit Rot

The Higher Formatting

How computers represent numbers and characters is not something that most of us spend time brooding over. In any event, we don't have much choice about these low-level formats. Once you choose a computer, those decisions are made for you. But further layers in the hierarchy of file formats are not rooted so deeply in the silicon and the operating system.

When you write with a word-processing program, the information stored in the document includes more than just a sequence of alphabetic characters and punctuation marks. There are also formatting codes that indicate how the words are to look on the page. The codes specify typefaces (Times, Caslon, Palatino), type sizes (10 point, 12 point), stylistic variations (italic, boldface, superscript), the alignment of text (centered, flush-left, justified) and dozens of other properties. Sometimes the codes are explicitly entered into the text; in The Final Word, which was inspired by a famous text-editing program called Emacs, I would italicize a word by typing "@i[Titanic]." More recent software generally hides the formatting codes and shows only the results of applying them. When you select the command to italicize a word, the word appears in italics on the display screen, as if the text were simply stored inside the computer in italic type. This is an illusion. A computer's memory has neither italic bits nor roman ones. Hidden somewhere in the document file are explicit markers indicating the change in type style.
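To make the point about hidden markers concrete, here is a minimal sketch in Python. It assumes a toy format in which "@i[...]" marks italic text, as in the example above, and "@b[...]" marks boldface; the "@b" code, the function name parse_runs and the style labels are hypothetical and are not taken from The Final Word or any real word processor.

```python
import re

# Hypothetical toy format: "@i[...]" is italic, "@b[...]" is bold (assumed).
CODE = re.compile(r"@([ib])\[([^\]]*)\]")

def parse_runs(stored):
    """Split the stored characters into (style, text) runs for display."""
    runs = []
    pos = 0
    for m in CODE.finditer(stored):
        if m.start() > pos:
            # Plain text between two formatting codes.
            runs.append(("roman", stored[pos:m.start()]))
        style = "italic" if m.group(1) == "i" else "bold"
        runs.append((style, m.group(2)))
        pos = m.end()
    if pos < len(stored):
        runs.append(("roman", stored[pos:]))
    return runs

runs = parse_runs("The @i[Titanic] sank in 1912.")
# runs == [("roman", "The "), ("italic", "Titanic"), ("roman", " sank in 1912.")]
```

What sits in the file is the ordinary character string with its codes; the italic word exists only after a program has interpreted those codes.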

Some of the formatting information can be complicated. Particularly troublesome are nonlocal constructs, such as footnotes, cross-references and markers for index and table-of-contents entries. Consider a manuscript with consecutively numbered notes printed at the end. When a note is entered in the middle of the text, the number that appears there depends on how many notes precede it, while the output of the note itself has to wait until the rest of the manuscript has been processed. Thus the document cannot be viewed as a strictly linear text, with characters arranged in sequence; there are data structures that span the entire file. Features of this kind tend to be the hardest to translate when you convert a file to a new computer system or to a new purpose such as presentation on the Web.
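The endnote case can be sketched in a few lines of Python. The inline syntax "{note: ...}" and the function name number_endnotes are assumptions made up for this illustration; real word processors store notes in their own ways.

```python
import re

# Hypothetical inline syntax for a note entered in the middle of the text.
NOTE = re.compile(r"\{note:([^}]*)\}")

def number_endnotes(text):
    """Replace each inline note with its number and collect the notes."""
    notes = []

    def replace(match):
        notes.append(match.group(1).strip())
        # The number depends on how many notes precede this one.
        return "[%d]" % len(notes)

    body = NOTE.sub(replace, text)
    # The notes themselves are emitted only after the whole text is processed.
    endnotes = ["%d. %s" % (i, n) for i, n in enumerate(notes, 1)]
    return body, endnotes

body, endnotes = number_endnotes(
    "Ink fades.{note: See Smith.} Bits rot.{note: See Jones.}"
)
# body == "Ink fades.[1] Bits rot.[2]"
# endnotes == ["1. See Smith.", "2. See Jones."]
```

Adding a note near the start of the text changes the number of every note after it, which is why the document cannot be treated as a simple sequence of characters.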

Some files include formatting at an even higher level of abstraction, with labels that indicate the function of various parts of the document, rather than instructions about how they are to appear. The most familiar examples come from HTML, the Hypertext Markup Language of the Web. Headings and subheadings in an HTML document can be labeled with tags such as <h1> and <h2>, which indicate the relative importance of the headings but don't say directly how they should look; decisions about visual formatting are deferred until the document is displayed. Similarly, a phrase can be marked with the tag <em> to indicate it bears emphasis, or with the <strong> tag for strong emphasis. The emphatic text is usually displayed in italic or boldface type, but those are not the only possibilities; an old-fashioned printer might underline the phrase, and a text-to-speech system might change its intonation.
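A rough Python sketch of this deferral follows. The document is stored as a list of (label, text) pairs using the labels "em" and "strong"; the three output devices and the way each one presents emphasis are invented for the example and do not describe any real browser, printer or speech system.

```python
# Hypothetical presentation rules: each device decides for itself how to
# show the same labels. None of these conventions comes from a standard.
RENDERERS = {
    "screen": {"em": "<i>{}</i>", "strong": "<b>{}</b>"},
    "printer": {"em": "_{}_", "strong": "_{}_"},
    "speech": {"em": "(stressed) {}", "strong": "(strongly stressed) {}"},
}

def render(runs, device):
    """Apply one device's rules to a list of (label, text) runs."""
    rules = RENDERERS[device]
    # Unlabeled text (label None) passes through unchanged.
    return "".join(rules.get(label, "{}").format(text) for label, text in runs)

doc = [(None, "The ship was "), ("em", "unsinkable"), (None, ".")]

on_screen = render(doc, "screen")    # "The ship was <i>unsinkable</i>."
on_paper = render(doc, "printer")    # "The ship was _unsinkable_."
spoken = render(doc, "speech")       # "The ship was (stressed) unsinkable."
```

The stored document never changes; only the table of presentation rules differs from one device to the next.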

This more abstract style of formatting is known variously as generic or descriptive markup, in contrast to visual or presentational markup. In this context "markup" refers to anything included in the file that's not part of the text or data. Descriptive markup has important advantages for the forward-looking content provider. As a document goes through its various transformations from inked paper to Web to CD-ROM to synthetic speech to whatever's next, a heading might be displayed in many different ways, but it always remains a heading.

Some word processors and other programs offer a rudimentary form of descriptive markup by means of style sheets. You define a style called "Heading" and assign it a set of visual formats; then, if you change your mind or adapt the document to some other purpose, revising the definition alters all text that has the Heading style.
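As an illustration only, a style sheet can be modeled in Python as a lookup table from style names to visual formats. The style names, the attribute names "size" and "bold" and the function resolve are assumptions for this sketch, not the design of any particular program.

```python
# Hypothetical style sheet: style name -> visual formats.
styles = {
    "Heading": {"size": 12, "bold": True},
    "Body": {"size": 10, "bold": False},
}

# The document records only the style name of each paragraph.
document = [
    ("Heading", "Bit Rot"),
    ("Body", "How computers represent numbers..."),
    ("Heading", "The Higher Formatting"),
]

def resolve(document, styles):
    """Look up the current visual formats for every paragraph."""
    return [(text, dict(styles[name])) for name, text in document]

before = resolve(document, styles)

# Revising the one definition changes every paragraph that uses the style.
styles["Heading"]["size"] = 14
after = resolve(document, styles)

heading_sizes = [fmt["size"] for text, fmt in after if fmt["bold"]]
# heading_sizes == [14, 14]
```

Because paragraphs refer to the style by name, one change to the definition reaches all of them, and the paragraphs themselves are never edited.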

No standards for higher-level visual or abstract markup have the universality of ASCII. Every program goes its own way. And, unfortunately, abstract markup seldom survives translation between file formats. When you convert a document, the heading style is replaced by the corresponding visual attributes, such as 12-point bold type. This transformation is irreversible and entails a loss of information: You cannot subsequently convert all instances of 12-point bold type back into headings, because nonheading text may have the same attributes.
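The loss of information can be shown with a small Python sketch. It assumes a hypothetical style sheet in which both headings and warning paragraphs happen to be set in 12-point bold; the style names and the functions flatten and candidates are invented for the example.

```python
# Hypothetical style sheet: style name -> (point size, bold).
# "Heading" and "Warning" are assumed to share the same visual attributes.
styles = {
    "Heading": (12, True),
    "Warning": (12, True),
    "Body": (10, False),
}

def flatten(document):
    """Convert named styles to bare visual attributes, as a converter might."""
    return [(styles[name], text) for name, text in document]

def candidates(attrs):
    """List every style name that could have produced these attributes."""
    return sorted(name for name, a in styles.items() if a == attrs)

document = [("Heading", "Bit Rot"), ("Warning", "Back up your files.")]
converted = flatten(document)
# Both paragraphs now carry only (12, True); their labels are gone.

possible = candidates(converted[0][0])
# possible == ["Heading", "Warning"]
```

After conversion the two paragraphs are indistinguishable, so no reverse mapping can reliably tell which one was the heading.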