
Text Editors: Algorithms and Architectures

Source code accompanies this article.

From Stallman's GnuEmacs to Microsoft's Word, text editors are among the most heavily used, and most taken-for-granted, applications around. Yet the choice of core algorithms, and how they are implemented within the overall architecture, can make the difference between a good editor and a great one.

To maintain the view of text as an uninterrupted stream of characters, you can use a number of data structures, such as linked-list structures, the buffer-gap structure, and virtual-memory blocks.

Plain ASCII text is fine for editing program source code, but other uses require additional attributes to be associated with the text stream. These attributes control the way text is formatted or presented on the screen: typeface, point size, justification, and so on.

One implementation choice is to embed these presentation attributes into the text stream, thereby mixing formatting commands with text content. This approach is used in many older-generation typesetting machines, back before WYSIWYG. In general, this is also the approach that the future version of GnuEmacs will use. Version 19 of GnuEmacs, to be released later this year, will view a file as a stream of first-class Lisp objects that can represent text characters, formatting commands, events such as mouse-clicks, or any arbitrary Lisp function.

A different way to deal with text attributes is to maintain parallel streams--one of content, the other of presentation attributes related to content. This is the approach used in Microsoft Word. Likewise, the Macintosh environment provides TextEdit, an equivalent to the Windows Edit control, which allows for "runs," or sequences, of text styles; a "text style" is defined as a particular combination of font and point-size attributes. Text styles are implemented as an array of style records that point to locations in the text stream where a particular "run" begins. There's an elaborate figure on page 15-38 of Inside Macintosh Volume VI (Addison-Wesley, 1992) that shows how style records are maintained in relation to the text. In System 7.1, the WorldScript facility generalizes the notion of runs even further.
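The idea of a parallel stream of style runs can be sketched in a few lines of C. This is only an illustration of the general technique, not the actual TextEdit record layout: the StyleRun type, its fields, and the style_at() function are hypothetical names invented here. The sketch assumes the runs are sorted by starting offset and that the first run begins at offset 0.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical style record: one entry per "run" of identically styled text. */
typedef struct {
    size_t start;       /* offset in the text stream where this run begins */
    int    font_id;     /* illustrative font identifier */
    int    point_size;  /* illustrative point size */
} StyleRun;

/* Return the run that covers the character at `offset`.
   Assumes runs[] is sorted by start and runs[0].start == 0. */
const StyleRun *style_at(const StyleRun *runs, size_t nruns, size_t offset)
{
    size_t lo = 0, hi = nruns;

    assert(nruns > 0 && runs[0].start == 0);

    /* Binary search; invariant: runs[lo].start <= offset, answer lies in [lo, hi). */
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (runs[mid].start <= offset)
            lo = mid;
        else
            hi = mid;
    }
    return &runs[lo];
}
```

Because the text itself is untouched, a change of font only edits the run array. The cost is that every insertion or deletion in the text must also adjust the start offsets of all later runs.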

As you step up from ASCII text editors and simple word-processors to more sophisticated document-processing and desktop-publishing programs, text can no longer be regarded as a one-dimensional array of characters (even when associated with a parallel stream of text attributes). In document processors such as FrameMaker, Interleaf, or Ventura Publisher, the text content has its own elaborate structure -- words, sentences, paragraphs, subsections, chapters, appendices, and volumes. This is known as a "semantic" or "logical" structure, in contrast to the geometrical or visual structure of the presentation. Document-processing programs have the most complex task of all graphical editors, to maintain a consistent mapping between two tree-like structures: the semantic hierarchy of text content and the geometrical hierarchy of pages, columns, frames, and lines. Dealing with this complexity in an interactive, optimized manner is what leads to million-line-plus programs.

The Machine Representation of Text

In the EditLine() example, text is represented in the machine as a single array of ASCII-encoded bytes. Inserting or deleting a character merely requires calling movmem() to shift every byte after the edit point by one memory location. Depending on the CPU, this brute-force method can work even for medium-to-large amounts of text, but its cost grows with the amount of text that follows the edit point. At some point, of course, this profligate expense of machine cycles becomes unworkable. Then the "buffer-gap" approach comes into play.
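The brute-force scheme can be sketched as follows. This is a minimal illustration, not the article's EditLine() code: the FlatBuf type and the flat_insert() and flat_delete() functions are hypothetical, the capacity is fixed at 256 bytes for brevity, and the standard memmove() stands in for the compiler-specific movmem().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flat buffer: len bytes of text in a fixed-capacity array. */
typedef struct {
    char   text[256];
    size_t len;
} FlatBuf;

/* Insert ch at position pos (0 <= pos <= len).
   Returns 0 on success, -1 if the buffer is full. */
int flat_insert(FlatBuf *b, size_t pos, char ch)
{
    assert(pos <= b->len);
    if (b->len == sizeof b->text)
        return -1;
    /* Shift the tail [pos, len) up by one byte; memmove handles the overlap. */
    memmove(b->text + pos + 1, b->text + pos, b->len - pos);
    b->text[pos] = ch;
    b->len++;
    return 0;
}

/* Delete the character at position pos (0 <= pos < len). */
void flat_delete(FlatBuf *b, size_t pos)
{
    assert(pos < b->len);
    /* Shift the tail (pos, len) down by one byte over the deleted character. */
    memmove(b->text + pos, b->text + pos + 1, b->len - pos - 1);
    b->len--;
}
```

Every keystroke near the start of the buffer moves almost the entire text, which is why this approach stops scaling as files grow.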

The buffer-gap approach divides the single array of characters into two parts, separated by a movable gap. The gap is an internal construct, not visible to the user. From the user's point of view, text remains in an unbroken stream. As the user navigates over the text, moving the cursor from one character to the next, the system updates the corresponding pointer in the text data structure, skipping over the buffer gap as necessary. When the user types a run of text, the system shifts the gap over to the point of insertion, then shrinks the gap by one character for each keystroke. Because the gap moves only when the insertion point changes, this method avoids most of the shuffling and reshuffling of text required by the earlier approach.
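A minimal buffer-gap sketch along these lines might look like the code below. It is not the module that accompanies the article: the GapBuf type and the gb_ functions are hypothetical names, and the storage is a fixed 64-byte array, so the gap can shrink but never grow.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define GB_CAP 64  /* fixed capacity keeps the sketch short; real code would grow */

typedef struct {
    char   buf[GB_CAP];
    size_t gap_start;  /* storage index of the first byte of the gap */
    size_t gap_end;    /* storage index one past the last byte of the gap */
} GapBuf;

void gb_init(GapBuf *g) { g->gap_start = 0; g->gap_end = GB_CAP; }

/* Number of text characters, that is, storage size minus gap size. */
size_t gb_length(const GapBuf *g)
{
    return GB_CAP - (g->gap_end - g->gap_start);
}

/* Move the gap so that it begins at user position pos. */
void gb_move_gap(GapBuf *g, size_t pos)
{
    assert(pos <= gb_length(g));
    if (pos < g->gap_start) {
        /* Slide the text between pos and the gap to the far side of the gap. */
        size_t n = g->gap_start - pos;
        memmove(g->buf + g->gap_end - n, g->buf + pos, n);
        g->gap_start -= n;
        g->gap_end   -= n;
    } else if (pos > g->gap_start) {
        /* Slide the text just after the gap down to the near side of the gap. */
        size_t n = pos - g->gap_start;
        memmove(g->buf + g->gap_start, g->buf + g->gap_end, n);
        g->gap_start += n;
        g->gap_end   += n;
    }
}

/* Insert ch at user position pos. Returns 0 on success, -1 if the gap is used up. */
int gb_insert(GapBuf *g, size_t pos, char ch)
{
    if (g->gap_start == g->gap_end)
        return -1;
    gb_move_gap(g, pos);
    g->buf[g->gap_start++] = ch;  /* the gap shrinks by one character */
    return 0;
}

/* Copy the text, without the gap, into out as a NUL-terminated string.
   out must have room for gb_length(g) + 1 bytes. */
void gb_text(const GapBuf *g, char *out)
{
    memcpy(out, g->buf, g->gap_start);
    memcpy(out + g->gap_start, g->buf + g->gap_end, GB_CAP - g->gap_end);
    out[gb_length(g)] = '\0';
}
```

Consecutive insertions at the same spot find the gap already in place, so gb_move_gap() does no copying at all. Text is moved only when the insertion point jumps, and then only the characters between the old and new positions.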

Implementing a buffer-gap manager is not difficult, but requires attention to detail to avoid fencepost errors. As Finseth points out, three coordinate systems are in play at the same time. In the user coordinate system, location 0 corresponds to the position before the first character of the text. Note that coordinates label the positions between the characters, rather than the characters themselves. This is similar to a 2-D graphical coordinate system, such as that used by QuickDraw or Windows GDI, which labels the positions between pixels rather than the pixels themselves.

Second, there's the buffer-gap coordinate system, which is the same as the user coordinate system, except that the continuum is broken up by the variable-length gap. The third system is the storage coordinate system, which labels the memory locations where characters are stored (rather than the positions between them) and is the one used by pointers to memory. If you don't scrupulously maintain the distinction between these three coordinate systems, you'll be plagued by an ongoing cascade of fencepost errors. Fortunately, the code available electronically contains bufgap.c, a module that implements all the basic functions for managing a buffer gap--inserting and deleting text, moving the gap, expanding the gap, searching the buffer for a particular string, and so on. Excerpts from buf_gap.c are shown in Listing Two. This code is heavily based on an example posted by Joseph Allen to the Internet on September 10, 1989. The module is not a stand-alone program, but assumes other modules for input, command dispatching, redisplay, memory allocation, and screen output.
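The fencepost hazard in mapping between these coordinate systems can be made concrete with a short sketch. It is not taken from the accompanying module: the GapView type and the functions below are hypothetical, and they assume a gap described by two storage indices, gap_start and gap_end. The point to notice is that the user position sitting exactly at the gap maps to two different storage locations, depending on whether you want the character before it or the character after it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical read-only view of a gap buffer. */
typedef struct {
    const char *buf;    /* storage, including the gap */
    size_t size;        /* total bytes of storage */
    size_t gap_start;   /* storage index of the first byte of the gap */
    size_t gap_end;     /* storage index one past the last byte of the gap */
} GapView;

/* Text length in user coordinates; valid user positions are 0..length. */
size_t text_length(const GapView *v)
{
    return v->size - (v->gap_end - v->gap_start);
}

/* Storage index of the character just AFTER user position upos.
   Requires upos < text_length(v). Note the strict comparison. */
size_t char_after(const GapView *v, size_t upos)
{
    assert(upos < text_length(v));
    return upos < v->gap_start ? upos
                               : upos + (v->gap_end - v->gap_start);
}

/* Storage index of the character just BEFORE user position upos.
   Requires upos > 0. Note the non-strict comparison. */
size_t char_before(const GapView *v, size_t upos)
{
    assert(upos > 0 && upos <= text_length(v));
    return upos <= v->gap_start ? upos - 1
                                : upos - 1 + (v->gap_end - v->gap_start);
}
```

With the storage "he____llo" (where the underscores stand for a four-byte gap), user position 2 lies at the gap: char_before() yields storage index 1, the "e", while char_after() yields storage index 6, the first "l". Using the wrong comparison in either function reads a byte out of the gap.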
