
Text Editors: Algorithms and Architectures


From Stallman's GnuEmacs to Microsoft's Word, text editors are among the most heavily used, and most taken-for-granted, applications around. Yet the choice of core algorithms, and how they're implemented in the overall architecture, can make the difference between a good editor and a great one.

In contrast to the buffer-gap approach, some text editors use a doubly linked list of lines to represent a text file. As with the buffer-gap, the lower levels of the program shield the higher levels from dealing with the particulars of the machine representation. As regions of text are deleted or inserted, the lines in the linked list are split or joined as necessary. This representation is used by many text editors written in Lisp or Lisp-like languages. Although written in C, Jove also uses this approach, as does Bill Joy's vi. A variant was used by the Xerox Bravo editor, which kept the text file in unbroken form at the memory location where it was initially loaded; as changes occurred during an editing session, pointers were inserted into the original text to link in inserted text or to skip over deleted text. Upon finishing the session, the complete file was written out by traversing the resulting tangle of pointers.
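The split-and-join behavior of the linked-list representation can be sketched in a few lines. This is a minimal illustration only, not the code of Jove, vi, or any other editor named above; the class and method names (Line, LineList, split, join) are hypothetical.

```python
class Line:
    """One node in a doubly linked list of lines (newline not stored)."""
    def __init__(self, text=""):
        self.text = text
        self.prev = None
        self.next = None


class LineList:
    """Hypothetical line-list buffer; names are illustrative only."""
    def __init__(self, text=""):
        self.head = None
        tail = None
        for part in text.split("\n"):
            node = Line(part)
            node.prev = tail
            if tail is None:
                self.head = node
            else:
                tail.next = node
            tail = node

    def split(self, line, col):
        """Insert a newline at column col: one node becomes two."""
        new = Line(line.text[col:])
        line.text = line.text[:col]
        new.prev, new.next = line, line.next
        if line.next is not None:
            line.next.prev = new
        line.next = new
        return new

    def join(self, line):
        """Delete the newline after line: absorb the following node."""
        nxt = line.next
        if nxt is None:
            return
        line.text += nxt.text
        line.next = nxt.next
        if nxt.next is not None:
            nxt.next.prev = line

    def to_text(self):
        """Flatten the list back into a single string."""
        parts = []
        node = self.head
        while node is not None:
            parts.append(node.text)
            node = node.next
        return "\n".join(parts)
```

Higher-level routines would call split() when a newline is typed and join() when one is deleted; an edit inside a line touches only that node's text.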

A limitation of all the above methods is that they don't work well with files too large to fit into physical memory. Even on systems with hardware-supported virtual memory, performance can still be poor. To deal with the size limitation and the lack of hardware virtual-memory support, many PC-based text editors, such as Mince and Epsilon, implement their own virtual-memory scheme in software. In this approach, a file is broken up into fixed-size pages, whose size is a multiple of the physical disk-block size. A working set of pages--as many as will fit--is loaded into RAM, with the remainder kept in a swap file on disk. Pages on disk are swapped in as needed; pages in RAM are swapped out on a least-recently-used basis. This software-only virtual-memory mechanism exists at a lower level than the rest of the editor routines. As with the buffer-gap and linked-list approaches, the low-level coordinate system is rendered invisible to the higher-level routines that do inserts, deletes, and searches. In this case, the low-level coordinate system requires a tuple (a page number and an offset within the page) to identify a location in the virtual text stream.
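A rough sketch of such a pager follows. It is not how Mince or Epsilon are implemented; the class PagedText, its fields, and the 4096-byte page size are assumptions made for illustration, and a dictionary stands in for the swap file. It shows the two ideas from the paragraph above: the (page, offset) tuple and least-recently-used eviction.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed value: some multiple of the disk-block size


class PagedText:
    """Hypothetical pager; the `swap` dict stands in for the on-disk swap file."""
    def __init__(self, data, max_resident=2, page_size=PAGE_SIZE):
        self.page_size = page_size
        self.max_resident = max_resident
        self.swap = {
            n: bytes(data[i:i + page_size])
            for n, i in enumerate(range(0, len(data), page_size))
        }
        self.resident = OrderedDict()  # page number -> bytearray, oldest first
        self.faults = 0

    def locate(self, pos):
        """Map a flat text position to the low-level (page, offset) tuple."""
        return divmod(pos, self.page_size)

    def _page(self, number):
        if number in self.resident:
            self.resident.move_to_end(number)  # mark as most recently used
            return self.resident[number]
        self.faults += 1
        if len(self.resident) >= self.max_resident:
            # Evict the least recently used page and write it back.
            victim, contents = self.resident.popitem(last=False)
            self.swap[victim] = bytes(contents)
        page = bytearray(self.swap[number])  # "read" the page from swap
        self.resident[number] = page
        return page

    def char_at(self, pos):
        """What higher-level routines see: a flat position, nothing more."""
        number, offset = self.locate(pos)
        return self._page(number)[offset]
```

Callers of char_at() never see page numbers, which is the point: the tuple coordinate system stays below the routines that insert, delete, and search. Insertion, which would require splitting or spilling pages, is left out of this sketch.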

All these schemes need to be modified in situations where text isn't represented by 8-bit ASCII encoding. For example, Windows NT uses 16-bit Unicode internally for all text. Adapting the schemes to a wider fixed-size character such as this would be straightforward. The more complex case is where the document consists of a stream of variable-length objects or has a tree-like semantic structure. Programs such as Aldus PageMaker use their own object-oriented database manager to store and retrieve these structures.

Incremental Redisplay Algorithms

Given a bunch of stored text, how is the visual representation of it derived? Again, there's a spread from simple to complex. In the simplest case, where there's no word-wrap and no typographic formatting, the algorithm for generating a screen display is trivial: Clear the screen, find the text position that corresponds to the top of the screen, then step through lines of text in the buffer one at a time, outputting each one until you reach the bottom. This algorithm is the 2-D analogue of the one in EditLine(). Unfortunately, it is basically useless for an interactive program, due to intolerable flashing that results from the constant regeneration of the entire screen display. However, a version of the algorithm can be used for generating the initial screen display at program startup.
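The trivial algorithm might look like the following sketch, which assumes the buffer has already been divided into a list of lines and returns the screen rows rather than writing to a device. The function name and parameters are hypothetical.

```python
def full_redisplay(lines, top, rows, cols):
    """Regenerate every screen row from scratch, starting at buffer line `top`."""
    screen = []
    for row in range(rows):
        index = top + row
        text = lines[index] if index < len(lines) else ""
        # No word wrap: clip long lines, pad short ones to the screen width.
        screen.append(text[:cols].ljust(cols))
    return screen
```

Every call rebuilds all rows regardless of what changed, which is why this form is suited to the initial display only.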

The next step taken by many authors is to transform this trivial display-generation routine into one for incremental redisplay. The changes are analogous to going from EditLine() to EditLine2(). Instead of doing the entire task each and every time, only the necessary work is done. A number of data structures come into play to avoid constant recalculations and unnecessary output to the screen.

Before deciding which data structures can be used in the time vs. space trade-off, there is a basic decision to make: How extensible should the editor be? Because the choice of editor is so deeply felt, many implementors try to make their editor as customizable as possible. The decision has profound architectural implications, because intermixing command processing with redisplay routines (as shown in EditLine2()) often renders the editor customizable only by the original author.

To allow for greater extensibility, many authors choose a bipartite architecture in which the low-level routines, including the redisplay, are implemented in a language such as C or assembler, and the high-level commands are implemented in a simple interpreted extension language that can only access the "inner editor" via an API. The extension language of the "outer editor" is often Lisp-like (in the case of GnuEmacs, Brief, Sine, ZMacs, and others), but can also resemble C (in the case of Epsilon, CBrief, ME), Forth (in the case of Final Word), Basic (Microsoft Word), or even Awk (Sage). The user-visible commands (such as "delete a word" or "move forward one sentence") are implemented in the extension language, and cannot directly access the redisplay structures except where allowed by the API.
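The division of labor can be illustrated with a toy model. Here both halves are in one language for brevity, whereas the editors named above implement the inner editor in C or assembler; the API methods (point, char_at, delete, contents) and the command table are invented for this sketch and do not correspond to any real editor's interface.

```python
class InnerEditor:
    """Stand-in for the low-level core; the method names are hypothetical."""
    def __init__(self, text=""):
        self._text = text
        self._point = 0
        self._dirty = False  # redisplay hint, hidden from extensions

    # --- the API exposed to the extension layer ---
    def point(self):
        return self._point

    def char_at(self, pos):
        return self._text[pos] if 0 <= pos < len(self._text) else ""

    def delete(self, start, end):
        self._text = self._text[:start] + self._text[end:]
        self._point = start
        self._dirty = True  # the core, not the command, records the hint

    def contents(self):
        return self._text


COMMANDS = {}  # the "outer editor": command name -> extension function


def command(name):
    """Register an extension function under a user-visible command name."""
    def register(func):
        COMMANDS[name] = func
        return func
    return register


@command("delete-word")
def delete_word(ed):
    """User-visible command written purely against the API."""
    end = ed.point()
    while ed.char_at(end).isspace():
        end += 1
    while ed.char_at(end) and not ed.char_at(end).isspace():
        end += 1
    ed.delete(ed.point(), end)
```

The command never touches the redisplay state; it only calls API routines, and the core notes for itself that the display is stale.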

In this case, the redisplay algorithm needs to be more intelligent--able both to generate a new image from first principles as well as optimally update the existing image using hints left by low-level buffer-management routines.

In mapping text data to screen display, a common technique is to keep a small array of records that maintain the correspondence between text-stream locations and screen coordinates. This text-to-screen map is updated as changes are made to the text buffer. The buffer-management routines can invalidate part or all of the mapping structure, depending on the extent of changes to the text content. The map gets rebuilt as necessary during the editing session and is discarded upon termination. While this approach works well for plain ASCII text editors, it becomes inadequate for more sophisticated word-processing and electronic-publishing programs.
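One possible shape for such a map, assuming plain text with no word wrap, is one record per screen row holding the text offsets where that row starts and ends. The ScreenMap class and its methods are hypothetical names chosen for this sketch.

```python
class ScreenMap:
    """Hypothetical text-to-screen map: one (start, end) record per screen row."""
    def __init__(self, rows):
        self.rows = rows
        self.records = [None] * rows  # None means "invalid, recompute"
        self.rebuilt = 0              # counts rows actually recomputed

    def invalidate_from(self, pos):
        """Called by buffer management after an edit at text offset pos."""
        for row, rec in enumerate(self.records):
            if rec is None or rec[1] >= pos:
                # This row and every later one may have shifted.
                for later in range(row, self.rows):
                    self.records[later] = None
                return

    def rebuild(self, text, top):
        """Recompute only the invalid rows; top is the offset of screen row 0."""
        start = top
        for row in range(self.rows):
            rec = self.records[row]
            if rec is None:
                end = text.find("\n", start)
                if end == -1:
                    end = len(text)
                rec = (start, end)
                self.records[row] = rec
                self.rebuilt += 1
            start = min(rec[1] + 1, len(text))

    def row_of(self, pos):
        """Translate a text offset to (row, column), or None if off screen."""
        for row, rec in enumerate(self.records):
            if rec is not None and rec[0] <= pos <= rec[1]:
                return row, pos - rec[0]
        return None
```

An edit on the second row leaves the first row's record intact, so rebuild() recomputes only the rows at or below the change.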

In addition to the text-to-screen map, another useful data structure is one that optimizes output to the physical screen. In the early days of text editors, character-mode terminals were connected via low-bandwidth lines to a host computer (as was the case for both time-shared systems and CP/M microcomputers) and much attention was paid to minimizing data transfer to the screen. For time-shared systems, the situation was complicated by the wide variety of terminals that could be connected to the host computer. Some programs dealt with this by keeping track of two screen images: a virtual image that represented the desired display and a second image that represented the currently visible terminal screen. One program went to heroic lengths to minimize the bytes transferred over the communication line by using a sophisticated dynamic-programming algorithm that calculated the optimal sequence of commands to update the screen, tuned to the particular brand of terminal device. However, users found it somewhat disconcerting, because portions of the screen would jump up and down seemingly haphazardly as the system moved snippets of text around to piece together a complete display, exploiting the terminal's built-in commands for scrolling, insertion, and deletion. The source code to this module was prefaced with the following comment: "This routine is rather complicated. If you read this code and think you understand it, you are very wrong."
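The two-image idea in its simplest form can be sketched as follows. This is deliberately not the dynamic-programming optimizer described above: it compares the images row by row and sends one changed span per row, with no use of scroll, insert, or delete commands. It assumes both images are lists of equal-width strings, and the emit callback stands in for whatever writes to the terminal; all names are hypothetical.

```python
def update_terminal(desired, visible, emit):
    """Bring `visible` in line with `desired`, emitting only changed spans.

    desired, visible: lists of equal-width strings, one per screen row.
    emit(row, col, text): assumed callback that positions the cursor and writes.
    """
    for row, (want, have) in enumerate(zip(desired, visible)):
        if want == have:
            continue  # nothing to send for this row
        changed = [i for i in range(len(want)) if want[i] != have[i]]
        first, last = changed[0], changed[-1]
        emit(row, first, want[first:last + 1])
        visible[row] = want  # the terminal now shows the desired row
```

An unchanged row costs nothing on the line, and a one-character edit costs one cursor move plus a short span.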

Since then, both PCs and workstations have memory-mapped displays, so communications bandwidth is no longer the problem it was. At the same time, the formatting process has become much more complicated. Instead of monospaced fonts and simple word wrap, formatting has become more like typographic composition. A simple linebreak routine has now become a research project. For example, Donald Knuth, in TeX: The Program (Addison-Wesley, 1986) considers his line-breaking algorithm the "most interesting algorithm of TeX," and devotes 40 pages of his book, as well as a journal article, to describing it. Knuth writes: "The line-breaking problem can be regarded as a special case of the problem of computing the shortest path in an acyclic network."
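To make the shortest-path view concrete, here is a much-simplified line breaker. It is not TeX's algorithm: there is no glue, no penalties, and no hyphenation. Each possible breakpoint is a node, each feasible line is an edge whose cost is the square of its trailing space, and the cheapest path from the first word to the end is found by dynamic programming. The function name and the cost function are assumptions of this sketch, as is the requirement that every word fits within the width.

```python
def break_lines(words, width):
    """Choose breaks minimizing total squared trailing space (last line free).

    Assumes every word is no longer than `width`.
    """
    n = len(words)
    INF = float("inf")
    best = [INF] * (n + 1)   # best[i]: cheapest cost to set words[i:]
    nxt = [n] * (n + 1)      # nxt[i]: where the line starting at word i ends
    best[n] = 0
    for i in range(n - 1, -1, -1):
        length = -1
        for j in range(i + 1, n + 1):       # candidate line is words[i:j]
            length += len(words[j - 1]) + 1
            if length > width:
                break
            cost = 0 if j == n else (width - length) ** 2
            if cost + best[j] < best[i]:
                best[i] = cost + best[j]
                nxt[i] = j
    lines, i = [], 0
    while i < n:                            # follow the cheapest path
        j = nxt[i]
        lines.append(" ".join(words[i:j]))
        i = j
    return lines
```

With a width of 6, the words aaa, bb, cc, ddddd come out as "aaa", "bb cc", "ddddd", whereas a greedy first-fit breaker would produce "aaa bb", "cc", "ddddd" and leave a much shorter second line.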

TeX's typographic capabilities are indeed impressive, and Knuth's boxes-and-glue metaphor for page-level formatting is elegant. Nevertheless, the task of formatting in TeX is eased by the fact that it's a noninteractive batch program. This explains why an interactive electronic-publishing program with similar capabilities is an order of magnitude larger than the 60,000 lines of TeX.

Delving into such complexity is beyond the scope of this article, but some general observations can be made. As the formatting requirements become more complex, deriving the text-to-screen map is no longer cheap. The mapping structures are therefore no longer blithely discarded once they have been computed, but are instead kept around on a permanent basis, stored on disk along with the text content (vastly increasing resource requirements). To reduce the lag between the time a keystroke is entered and the time the screen gets updated, some systems take advantage of multithreading, if it is available. For example, in the OS/2 version of PageMaker, the formatting process is a separate thread from the input process. This dynamic thread architecture is the logical extension of the static module decomposition used by singly threaded implementations.

Conclusion

There are so many degrees of freedom in implementing text editors that it's no wonder there are so many instances, each one unique. As a concrete illustration of this discussion, the electronic version of the listings includes a multifont, mouse-aware text editor I wrote some time ago for Windows. I hope the context presented here will make the 4000 lines of C code in my implementation more understandable.

Ray is DDJ's senior technical editor. He can be contacted at rayval@well.sf.ca.usa.
