Thoughts on Perl and Emacs, technology and writing

What is the Datatype Behind an Emacs Buffer

As Emacs is open source, we don’t need to speculate. We can just read the code. I’m lazy though, so I prefer to speculate ;) If you were making a text editor, what would you use for your buffer type?

Array of Chars

The obvious, naive choice is an array of chars. But that means that, while append may be fast (until you need to resize your array at least), insertion is very slow. To insert a character in the middle of an array of chars, you need to move all of the characters following the insertion point up by one character. If we assume that insertions are evenly distributed, inserting a character will take, on average, time proportional to N/2.
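To make the N/2 figure concrete, here is a minimal Python sketch that inserts into a plain list of characters by shifting the tail by hand and counts how many elements move. The names `insert_char` and `average_moves` are made up for illustration; this is not how any particular editor does it.

```python
def insert_char(buf, pos, ch):
    """Insert ch at index pos by shifting the tail up one slot.

    Returns the number of characters that had to be moved.
    """
    buf.append(None)  # grow the array by one slot
    moved = 0
    # Walk backwards from the end, copying each character one slot up.
    for i in range(len(buf) - 1, pos, -1):
        buf[i] = buf[i - 1]
        moved += 1
    buf[pos] = ch
    return moved


def average_moves(n):
    """Average number of moves over every possible insertion point 0..n."""
    total = 0
    for pos in range(n + 1):
        total += insert_char(["x"] * n, pos, "y")
    return total / (n + 1)


buf = list("hello world")
moved = insert_char(buf, 5, ",")
print("".join(buf), moved)
print(average_moves(11))
```

Inserting at position 5 of an 11-character buffer moves the 6 characters after the insertion point, and averaging over all insertion points gives N/2.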

Queue up Inserts

We can be a bit smarter than that. With normal text editor usage, for any given insert, we will probably insert a bunch of characters in the same place. We could append those characters into a preallocated array, and after a certain amount of time or number of characters, insert those characters into the array of chars. This brings the amortized speed cost down to N/M/2 where M is the average number of characters inserted in one go.
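A rough sketch of the queuing idea, again in Python. The class name `BatchedBuffer` and its `moves` counter are hypothetical, and for simplicity it flushes when the cursor moves somewhere else rather than on a timer or a character count.

```python
class BatchedBuffer:
    """Array of chars that queues consecutive inserts before merging them."""

    def __init__(self, text=""):
        self.chars = list(text)
        self.pending = []     # characters typed but not yet merged
        self.pending_pos = 0  # index in self.chars where the pending run starts
        self.moves = 0        # how many existing characters have been shifted

    def flush(self):
        """Merge the pending run into the main array in one go."""
        if self.pending:
            # Everything after the insertion point shifts once, not once per char.
            self.moves += len(self.chars) - self.pending_pos
            self.chars[self.pending_pos:self.pending_pos] = self.pending
            self.pending = []

    def insert(self, pos, ch):
        """Insert ch at logical position pos (counting pending characters)."""
        if pos != self.pending_pos + len(self.pending):
            # The cursor jumped, so merge what we have and start a new run.
            self.flush()
            self.pending_pos = pos
        self.pending.append(ch)

    def text(self):
        self.flush()
        return "".join(self.chars)


b = BatchedBuffer("hello world")
for i, ch in enumerate(", dear"):
    b.insert(5 + i, ch)
print(b.text(), b.moves)
```

Typing six characters in a row at position 5 shifts the six-character tail once (6 moves) instead of once per keystroke (36 moves), which is the N/M/2 saving described above.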

A Tree of Characters

So much for arrays. How about a tree? Assuming the tree is balanced, insertion will take time proportional to log2 N. The downside is that traversal is a lot slower (still order N, but a decent constant factor slower). And we are also likely to lose cache locality.

Side note: I am often surprised that the main ‘scripting’ languages don’t provide much more than the equivalent of Perl scalars, arrays and hashes. Sure, Python has tuples and Ruby has symbols (okay, I do miss symbols in Perl), but where are the built-in trees? Even Lisp has them. I had a quick look for trees on the CPAN and found Tree::Binary and Tree::RedBlack. I couldn’t tell at a quick glance whether Tree::Binary self-balances or not. Hmmm…

A Rope

The final simple option is a rope (PDF). This is a binary tree where the leaves, instead of being characters, are arrays of characters. This improves cache locality (depending on the average size of the array), and traversal speed, although it isn’t as fast as a simple array of characters.
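Here is a minimal rope sketch in Python, assuming the simplest possible design: leaves hold strings, internal nodes cache the total length of their subtree, and insertion is a split followed by two concatenations. All names (`Leaf`, `Node`, `split`, `insert`, `char_at`, `to_string`) are made up for illustration, and there is no rebalancing, which a real rope would need to keep the log N bound.

```python
class Leaf:
    def __init__(self, text):
        self.text = text
        self.length = len(text)


class Node:
    def __init__(self, left, right):
        self.left = left
        self.right = right
        self.length = left.length + right.length  # cached subtree length


def split(rope, pos):
    """Return two ropes covering [0, pos) and [pos, end)."""
    if isinstance(rope, Leaf):
        return Leaf(rope.text[:pos]), Leaf(rope.text[pos:])
    if pos <= rope.left.length:
        a, b = split(rope.left, pos)
        return a, Node(b, rope.right)
    a, b = split(rope.right, pos - rope.left.length)
    return Node(rope.left, a), b


def insert(rope, pos, text):
    """Return a new rope with text inserted at pos; the old rope is untouched."""
    left, right = split(rope, pos)
    return Node(Node(left, Leaf(text)), right)


def char_at(rope, i):
    """Walk down from the root, using cached lengths to pick a side."""
    while isinstance(rope, Node):
        if i < rope.left.length:
            rope = rope.left
        else:
            i -= rope.left.length
            rope = rope.right
    return rope.text[i]


def to_string(rope):
    if isinstance(rope, Leaf):
        return rope.text
    return to_string(rope.left) + to_string(rope.right)


r = Node(Leaf("hello"), Leaf(" world"))
r2 = insert(r, 5, ",")
print(to_string(r2), char_at(r2, 5), r2.length)
```

Only the leaf that contains the insertion point gets copied; every other leaf is shared between the old and new rope, so the cost depends on the leaf size and tree depth rather than on the size of the whole buffer.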

4 Responses

I think most languages don’t include trees because hash tables are close enough for most purposes. You don’t usually need to traverse key/value pairs in the map in sorted order, which is about the only thing a tree buys you over a hash. And if you know that you care about sortedness, you probably know enough to go implement your own.

One other obvious datatype is to use a linked list of lines; that makes copying characters around for inserts significantly less painful, as your N is only the number of characters in a line, rather than the size of your buffer.

Sure I could implement almost all of the basic datatypes if I wanted to (in fact I’ve partially implemented extensible vectors in elisp). I could also implement my own compiler, or database, or socket library or whatever. But why should I? A huge reason I use an imperfect, yet good enough language like Perl is the excellent libraries it provides. If I want a libraryless language, I know where to find Scheme.

I don’t agree that trees are much like hash tables, but that may be worth a whole post.

And yes, a linked list of lines would, in some ways, be better than an array. But is there any way it would be better than a rope?

Ropes are significantly more complicated than a simple linked list of lines, particularly if you’re working in a language without GC. I can see memory fragmentation becoming a problem for ropes, too.

It’d be interesting to look at text editors released over the last decade or so to see what their internal model is. I doubt any of them use anything more sophisticated than linked lists, perhaps Emacs’s gap buffer. I’d be surprised if ropes were used.

Many versions of Emacs, including GNU, use a single contiguous character array virtually split in two sections separated by a gap. To insert, the gap is first moved to the insertion point. Inserted characters fill into the gap, reducing its size. If there’s insufficient space to hold the characters, the entire buffer is reallocated to a new larger size and the gaps coalesced at the previous insertion point.
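The description above can be sketched in a few lines of Python. This is a toy illustration only, not Emacs's actual implementation: the class name `GapBuffer`, the default gap size, and the growth policy are all assumptions, and it moves the gap one character at a time where real code would use a block copy.

```python
class GapBuffer:
    """One contiguous array: text before the gap, the gap, text after the gap."""

    def __init__(self, text="", gap=16):
        self.buf = list(text) + [None] * gap
        self.gap_start = len(text)    # first free slot
        self.gap_end = len(self.buf)  # one past the last free slot

    def __len__(self):
        return len(self.buf) - (self.gap_end - self.gap_start)

    def _move_gap(self, pos):
        """Slide the gap so that it starts at logical position pos."""
        while self.gap_start > pos:   # gap moves left, text hops to the right side
            self.gap_start -= 1
            self.gap_end -= 1
            self.buf[self.gap_end] = self.buf[self.gap_start]
        while self.gap_start < pos:   # gap moves right, text hops to the left side
            self.buf[self.gap_start] = self.buf[self.gap_end]
            self.gap_start += 1
            self.gap_end += 1

    def _grow(self, needed):
        """Enlarge the gap in place (assumed policy: at least double the array)."""
        extra = max(needed, len(self.buf))
        self.buf[self.gap_end:self.gap_end] = [None] * extra
        self.gap_end += extra

    def insert(self, pos, text):
        self._move_gap(pos)
        if self.gap_end - self.gap_start < len(text):
            self._grow(len(text))
        for ch in text:               # inserted characters fill the gap
            self.buf[self.gap_start] = ch
            self.gap_start += 1

    def delete(self, pos, count):
        """Delete count characters after pos by widening the gap over them."""
        self._move_gap(pos)
        self.gap_end = min(self.gap_end + count, len(self.buf))

    def text(self):
        return "".join(self.buf[:self.gap_start] + self.buf[self.gap_end:])


g = GapBuffer("hello world", gap=2)
g.insert(5, ", dear")
print(g.text(), len(g))
```

Once the gap sits at the cursor, each further keystroke at that spot is a single array write, and deleting is just a pointer adjustment. The cost is paid only when the insertion point jumps, and it is proportional to the distance jumped.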

The naive look at this and say the performance must be poor because of all the copying involved. Wrong. The copy operation is incredibly quick and can be optimized in a variety of ways. Gap buffers also take advantage of usage patterns. You might jump all over the window before focusing and inserting text. The gap doesn’t move for display – only for insert (or delete).

On the other hand, inserting a character block at the head of a 500MB file then inserting another at the end is the worst case for the gap approach, especially if the gap’s size is exceeded. How often does that happen?

Contiguous memory blocks are prized in virtual memory environments because less paging is involved. Moreover, reads and writes are simplified because the file doesn’t have to be parsed and broken up into some other data structure. Rather, the file’s internal representation in the gap buffer is identical to disk and can be read into and written out optimally. Writes themselves can be done with a single system call (on *nix).

The gap buffer is the best algorithm for editing text in a general way. It uses the least memory and has the highest aggregate performance over a variety of use cases. Translating the gap buffer to a visual window is a bit trickier as line context must be constantly maintained.