25.1 The Text in a Buffer

The text in a buffer consists of a sequence of zero or more
characters. A character is an integer that logically represents
a letter, number, space, or other unit of text. Most of the characters
that you will typically encounter belong to the ASCII set of characters,
but there are also characters for various sorts of accented letters,
special symbols, Chinese and Japanese characters (Kanji, Katakana,
etc.), Cyrillic and Greek letters, and so on. The actual number of
possible characters is quite large.

For now, we can view a character as some non-negative integer that
has some shape that defines how it typically appears (e.g. as an
uppercase A). (The exact way in which a character appears depends on the
font used to display the character.) The internal type of characters in
the C code is an Ichar; this is just an int, but using a
symbolic type makes the code clearer.

Between every character in a buffer is a buffer position or
character position. We can speak of the character before or after
a particular buffer position, and when you insert a character at a
particular position, all characters after that position end up at new
positions. When we speak of the character at a position, we
really mean the character after the position. (This dual view of
a buffer position as being “between” two characters and “on” a
character is rampant in Emacs.)

Buffer positions are numbered starting at 1. This means that
position 1 is before the first character, and position 0 is not
valid. If there are N characters in a buffer, then buffer
position N+1 is after the last one, and position N+2 is not valid.

The internal makeup of the Ichar integer varies depending on whether
we have compiled with MULE support. If not, the Ichar integer is an
8-bit integer with possible values from 0 - 255. 0 - 127 are the
standard ASCII characters, while 128 - 255 are the characters from the
ISO-8859-1 character set. If we have compiled with MULE support, an
Ichar is a 21-bit integer, with the various bits having meanings
according to a complex scheme that will be detailed later. The
characters numbered 0 - 255 still have the same meanings as for the
non-MULE case, though.
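The two ranges just described can be sketched as a simple check. This is an illustrative sketch only: ‘valid_ichar_p’ is a made-up name, not the actual XEmacs macro, and the ‘MULE’ test stands in for the real configuration machinery.

```c
#include <stdbool.h>

typedef int Ichar;  /* the internal character type, as described above */

/* Illustrative only: the range an Ichar may take, depending on
   whether we compiled with MULE support. */
#ifdef MULE
static bool
valid_ichar_p (Ichar c)
{
  return c >= 0 && c < (1 << 21);   /* 21-bit integer */
}
#else
static bool
valid_ichar_p (Ichar c)
{
  return c >= 0 && c <= 255;        /* 8-bit integer, 0 - 255 */
}
#endif
```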

Internally, the text in a buffer is represented in a fairly simple
fashion: as a contiguous array of bytes, with a gap of some size
in the middle. Although the gap is of some substantial size in bytes,
there is no text contained within it: From the perspective of the text
in the buffer, it does not exist. The gap logically sits at some buffer
position, between two characters (or possibly at the beginning or end of
the buffer). Insertion of text in a buffer at a particular position is
always accomplished by first moving the gap to that position
(i.e. through some block moving of text), then writing the text into the
beginning of the gap, thereby shrinking the gap. If the gap shrinks
down to nothing, a new gap is created. (What actually happens is that a
new gap is “created” at the end of the buffer’s text, which requires
nothing more than changing a couple of indices; then the gap is
“moved” to the position where the insertion needs to take place by
moving up in memory all the text after that position.) Similarly,
deletion occurs by moving the gap to the place where the text is to be
deleted, and then simply expanding the gap to include the deleted text.
(Expanding and shrinking the gap as just described means
just that the internal indices that keep track of where the gap is
located are changed.)
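The move-gap-then-write scheme just described can be sketched with a toy gap buffer. This is a simplified illustration (0-based indices, a fixed capacity, and no re-creation of the gap when it runs out); none of the names are the real XEmacs ones.

```c
#include <string.h>

#define CAPACITY 64

struct gap_buffer
{
  char data[CAPACITY];
  int gap_start;   /* index of the first byte of the gap */
  int gap_size;    /* number of bytes in the gap         */
};

/* Move the gap so that it starts at position pos, by block-moving
   the text between the old and new gap locations. */
static void
move_gap (struct gap_buffer *gb, int pos)
{
  if (pos < gb->gap_start)          /* shift intervening text up */
    memmove (gb->data + pos + gb->gap_size, gb->data + pos,
             gb->gap_start - pos);
  else if (pos > gb->gap_start)     /* shift intervening text down */
    memmove (gb->data + gb->gap_start,
             gb->data + gb->gap_start + gb->gap_size,
             pos - gb->gap_start);
  gb->gap_start = pos;
}

/* Insert by moving the gap to pos, writing into the beginning of
   the gap, and shrinking it. */
static void
insert_text (struct gap_buffer *gb, int pos, const char *s, int len)
{
  move_gap (gb, pos);
  memcpy (gb->data + gb->gap_start, s, len);
  gb->gap_start += len;
  gb->gap_size -= len;
}

/* Delete by moving the gap to pos and expanding it to swallow the
   deleted bytes -- only indices change, no bytes move afterward. */
static void
delete_text (struct gap_buffer *gb, int pos, int len)
{
  move_gap (gb, pos);
  gb->gap_size += len;
}

/* Reconstruct the logical text (for checking): the parts before
   and after the gap, concatenated. */
static void
get_text (const struct gap_buffer *gb, char *out)
{
  int total = CAPACITY - gb->gap_size;
  memcpy (out, gb->data, gb->gap_start);
  memcpy (out + gb->gap_start,
          gb->data + gb->gap_start + gb->gap_size,
          total - gb->gap_start);
  out[total] = '\0';
}
```

Note that ‘delete_text’ moves no text at all once the gap is in place, which matches the description above: expanding the gap only changes indices.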

Note that the total amount of memory allocated for a buffer's text
never decreases while the buffer is live. Therefore, if you load up a
20-megabyte file and then delete all but one character, there will be a
20-megabyte gap, which won’t get any smaller (except by inserting
characters back again). Once the buffer is killed, the memory allocated
for the buffer text will be freed, but it will still be sitting on the
heap, taking up virtual memory, and will not be released back to the
operating system. (However, if you have compiled XEmacs with rel-alloc,
the situation is different. In this case, the space will be
released back to the operating system. However, this tends to result in a
noticeable speed penalty.)

Astute readers may notice that the text in a buffer is represented as
an array of bytes, while (at least in the MULE case) an Ichar is
a 21-bit integer, which clearly cannot fit in a byte. This means (of
course) that the text in a buffer uses a different representation from
an Ichar: specifically, the 21-bit Ichar becomes a series of one to
four bytes. The conversion between these two representations is complex
and will be described later.

In the non-MULE case, everything is very simple: An Ichar
is an 8-bit value, which fits neatly into one byte.
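The one-to-four-byte idea can be illustrated with UTF-8, an analogous variable-width scheme; the actual MULE internal representation is different and is described later, so this encoder is purely an illustration of the shape of such a conversion.

```c
/* Encode a 21-bit code point as 1 - 4 bytes, UTF-8 style.
   Illustrative only: the real internal MULE format differs. */
static int
encode_utf8 (int c, unsigned char *out)
{
  if (c < 0x80)                               /* 1 byte: ASCII */
    {
      out[0] = (unsigned char) c;
      return 1;
    }
  else if (c < 0x800)                         /* 2 bytes */
    {
      out[0] = 0xC0 | (c >> 6);
      out[1] = 0x80 | (c & 0x3F);
      return 2;
    }
  else if (c < 0x10000)                       /* 3 bytes */
    {
      out[0] = 0xE0 | (c >> 12);
      out[1] = 0x80 | ((c >> 6) & 0x3F);
      out[2] = 0x80 | (c & 0x3F);
      return 3;
    }
  else                                        /* 4 bytes */
    {
      out[0] = 0xF0 | (c >> 18);
      out[1] = 0x80 | ((c >> 12) & 0x3F);
      out[2] = 0x80 | ((c >> 6) & 0x3F);
      out[3] = 0x80 | (c & 0x3F);
      return 4;
    }
}
```

As in the internal format, characters 0 - 127 encode as themselves in one byte, so purely ASCII text has identical character and byte counts.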

If we are given a buffer position and want to retrieve the
character at that position, we need to follow these steps:

1. Pretend there’s no gap, and convert the buffer position into a
byte index that indexes to the appropriate byte in the buffer’s
stream of textual bytes. By convention, byte indices begin at 1, just
like buffer positions. In the non-MULE case, byte indices and buffer
positions are identical, since one character equals one byte.

2. Convert the byte index into a memory index, which takes the gap
into account. The memory index is a direct index into the block of
memory that stores the text of a buffer. This basically just involves
checking to see if the byte index is past the gap, and if so, adding
the size of the gap to it. By convention, memory indices begin at 1,
just like buffer positions and byte indices, and when referring to
the position that is at the gap, we always use the memory position at
the beginning, not at the end, of the gap.

3. Fetch the appropriate bytes at the determined memory position.

4. Convert these bytes into an Ichar.

In the non-MULE case, steps (3) and (4) boil down to a simple
one-byte memory access.
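The steps above can be sketched as follows for the non-MULE case. All names here are illustrative stand-ins; the real conversions live behind macros and inline functions with different names.

```c
typedef int Charbpos;   /* buffer (character) position */
typedef int Bytebpos;   /* byte index                  */
typedef int Membpos;    /* memory index                */
typedef int Ichar;

struct buffer_text
{
  unsigned char *beg;   /* start of the allocated text block  */
  Bytebpos gpt;         /* number of text bytes before the gap */
  int gap_size;         /* size of the gap, in bytes           */
};

/* Step 1: buffer position -> byte index.  Without MULE, one
   character is one byte, so these are identical. */
static Bytebpos
charbpos_to_bytebpos (const struct buffer_text *t, Charbpos pos)
{
  (void) t;
  return pos;
}

/* Step 2: byte index -> memory index.  If the byte index is past
   the gap, add the size of the gap; an index at the gap uses the
   memory position at the gap's beginning. */
static Membpos
bytebpos_to_membpos (const struct buffer_text *t, Bytebpos idx)
{
  return idx > t->gpt ? idx + t->gap_size : idx;
}

/* Steps 3 and 4: fetch the byte (indices are 1-based, as described
   above) and widen it to an Ichar. */
static Ichar
fetch_char (const struct buffer_text *t, Charbpos pos)
{
  Membpos m = bytebpos_to_membpos (t, charbpos_to_bytebpos (t, pos));
  return t->beg[m - 1];
}
```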

Note that we have defined three types of positions in a buffer:

buffer positions or character positions, typedef Charbpos

byte indices, typedef Bytebpos

memory indices, typedef Membpos

All three typedefs are just ints, but defining them this way makes
things a lot clearer.

Most code works with buffer positions. In particular, all Lisp code
that refers to text in a buffer uses buffer positions. Lisp code does
not know that byte indices or memory indices exist.

Finally, we have a typedef for the bytes in a buffer. This is a
Ibyte, which is an unsigned char. Referring to them as
Ibytes underscores the fact that we are working with a string of bytes
in the internal Emacs buffer representation rather than in one of a
number of possible alternative representations (e.g. EUC-encoded text,
etc.).

25.2 Ibytes and Ichars

25.3 Byte-Char Position Conversion

Oct 2004:

This is what I wrote when describing the previous algorithm:

The basic algorithm we use is to keep track of a known region of
characters in each buffer, all of which are of the same width. We keep
track of the boundaries of the region in both Charbpos and Bytebpos
coordinates and also keep track of the char width, which is 1 - 4 bytes.
If the position we’re translating is not in the known region, then we
invoke a function to update the known region to surround the position in
question. This assumes locality of reference, which is usually the
case.

Note that the function to update the known region can be simple or
complicated depending on how much information we cache. In addition to
the known region, we always cache the correct conversions for point,
BEGV, and ZV, and in addition to this we cache 16 positions where the
conversion is known. We only look in the cache or update it when we
need to move the known region more than a certain amount (currently 50
chars), and then we throw away a “random” value and replace it with the
newly calculated value.

Finally, we maintain an extra flag that tracks whether the buffer is
entirely ASCII, to speed up the conversions even more. This flag is
actually of dubious value because in an entirely-ASCII buffer the known
region will always span the entire buffer (in fact, we update the flag
based on this fact), and so all we’re saving is a few machine cycles.

A potentially smarter method than what we do with known regions and
cached positions would be to keep some sort of pseudo-extent layer over
the buffer; maybe keep track of the charbpos/bytebpos correspondence at
the beginning of each line, which would allow us to do a binary search
over the pseudo-extents to narrow things down to the correct line, at
which point you could use a linear movement method. This would also
mesh well with efficiently implementing a line-numbering scheme.
However, you have to weigh the amount of time spent updating the cache
vs. the savings that result from it. In reality, we modify the buffer
far less often than we access it, so a cache of this sort that provides
guaranteed O(log N) performance (or perhaps O(N log N), if we set a
maximum on the cache size) would indeed be a win, particularly in very
large buffers. If we ever implement this, we should probably set a
reasonably high minimum below which we use the old method, because the
time spent updating the fancy cache would likely become dominant when
making buffer modifications in smaller buffers.

Note also that we have to multiply or divide by the char width in order
to convert the positions. We do some tricks to avoid ever actually
having to do a multiply or divide, because that is typically an
expensive operation (esp. divide). Multiplying or dividing by 1, 2, or
4 can be implemented simply as a shift left or shift right, and we keep
track of a shifter value (0, 1, or 2) indicating how much to shift.
Multiplying by 3 can be implemented by doubling and then adding the
original value. Dividing by 3, alas, cannot be implemented in any
simple shift/subtract method, as far as I know; so we just do a table
lookup. For simplicity, we use a table of size 128K, which indexes the
“divide-by-3” values for the first 64K non-negative numbers. (Note that
we can increase the size up to 384K, i.e. indexing the first 192K
non-negative numbers, while still using shorts in the array.) This also
means that the size of the known region can be at most 64K for
width-three characters.
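These tricks can be sketched as follows, with made-up names; the table is filled in once at startup, and the shifter value corresponds to widths 1, 2, and 4 as described above.

```c
/* Illustrative sketch of the multiply/divide avoidance: widths 1,
   2, and 4 use a shift count (0, 1, or 2), width 3 uses double-and-
   add for multiplication and a precomputed table for division. */
#define DIV3_TABLE_SIZE 65536            /* first 64K non-negative numbers */
static unsigned short div3_table[DIV3_TABLE_SIZE];

static void
init_div3_table (void)
{
  int i;
  for (i = 0; i < DIV3_TABLE_SIZE; i++)
    div3_table[i] = (unsigned short) (i / 3);   /* done once */
}

/* Multiply a character count by the char width. */
static int
mul_by_width (int n, int width, int shifter)
{
  if (width == 3)
    return (n << 1) + n;                 /* 3n = 2n + n */
  return n << shifter;                   /* widths 1, 2, 4 */
}

/* Divide a byte count by the char width.  For width 3, n must be
   below 64K -- hence the 64K limit on the known region noted above. */
static int
div_by_width (int n, int width, int shifter)
{
  if (width == 3)
    return div3_table[n];                /* table lookup */
  return n >> shifter;
}
```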

Unfortunately, it turned out that the implementation had serious problems
which had never been corrected. In particular, the known region had a
large tendency to become zero-length and stay that way.

So I decided to port the algorithm from FSF 21.3, in markers.c.

This algorithm is fairly simple. Instead of using markers I kept the cache
array of known positions from the previous implementation.

Basically, we keep a number of positions cached:

the actual end of the buffer

the beginning and end of the accessible region

the value of point

the position of the gap

the last value we computed

a set of positions that are “far away” from previously computed positions
(5000 chars currently; #### perhaps should be smaller)

For each position, we CONSIDER() it. This means:

If the position is what we’re looking for, return it directly.

Starting with the beginning and end of the buffer, we successively
compute the smallest enclosing range of known positions. If at any
point we discover that this range has the same byte and char length
(i.e. is entirely single-byte), then our computation is trivial.

If at any point we get a small enough range (50 chars currently),
stop considering further positions.

Otherwise, once we have an enclosing range, see which side is closer, and
iterate until we find the desired value. As an optimization, I replaced
the simple loop in FSF with the use of bytecount_to_charcount(),
charcount_to_bytecount(), bytecount_to_charcount_down(), or
charcount_to_bytecount_down(). (The latter two I added for this purpose.)
These scan 4 or 8 bytes at a time through purely single-byte characters.

If the amount we had to scan was more than our “far away” distance (5000
characters, see above), then cache the new position.
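The kind of scanning ‘bytecount_to_charcount()’ and friends do can be sketched as follows: skip 8 bytes at a time while they are purely single-byte (high bit clear), falling back to byte-at-a-time counting otherwise. The continuation-byte test here is UTF-8 style (top two bits equal to 10) purely for illustration; the real internal format differs in detail.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Count the characters in nbytes bytes of variable-width text. */
static ptrdiff_t
count_chars (const unsigned char *p, ptrdiff_t nbytes)
{
  ptrdiff_t chars = 0, i = 0;

  /* Fast path: 8 purely single-byte (ASCII) bytes at a time. */
  while (i + 8 <= nbytes)
    {
      uint64_t word;
      memcpy (&word, p + i, 8);
      if (word & UINT64_C (0x8080808080808080))
        break;                     /* hit a non-ASCII byte */
      i += 8;
      chars += 8;
    }

  /* Slow path: count every byte that starts a character, i.e. is
     not a continuation byte. */
  for (; i < nbytes; i++)
    if ((p[i] & 0xC0) != 0x80)
      chars++;

  return chars;
}
```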

#### Things to do:

Look at the most recent GNU Emacs to see whether anything has changed.

Think about whether it makes sense to try to implement some sort of
known region or list of “known regions”, like we had before. This would
be a region of entirely single-byte characters that we can check very
quickly. (Previously I used a range of same-width characters of any
size; but this adds extra complexity and slows down the scanning, and is
probably not worth it.) As part of the scanning process in
bytecount_to_charcount() et al, we skip over chunks of entirely
single-byte chars, so it should be easy to remember the last one.
Presumably what we should do is keep track of the largest known surrounding
entirely-single-byte region for each of the cache positions as well as
perhaps the last-cached position. We want to be careful not to get bitten
by the previous problem of having the known region getting reset too
often. If we implement this, we might well want to continue scanning
some distance past the desired position (maybe 300-1000 bytes) if we are
in a single-byte range so that we won’t end up expanding the known range
one position at a time and entering the function each time.

Think about whether it makes sense to keep the position cache sorted.
This would allow it to be larger and finer-grained in its positions.
Note that with FSF’s use of markers, they were sorted, but this
was not really made good use of. With an array, we can do binary searching
to quickly find the smallest range. We would probably want to make use of
the gap-array code in extents.c.

Note that FSF’s algorithm checked ALL markers, not just the ones cached
by this algorithm. This includes markers created by the user as well as
both ends of any overlays. We could do similarly, and our extents could
keep both byte and character positions rather than just the former. (But
this would probably be overkill. We should just use our cache instead.
Any place an extent was set was surely already visited by the char<->byte
conversion routines.)

25.4 Searching and Matching

Very incomplete, limited to a brief introduction.

People find the searching and matching code difficult to understand.
And indeed, the details are hard. However, the basic structures are not
so complex. First, there’s a hard question with a simple answer. What
about Mule? The answer here is that it turns out that Mule characters
can be matched byte by byte, so neither the search code nor the regular
expression code need take much notice of it at all! Of course, we add
some special features (such as regular expressions that match only
certain charsets), but these do not require new concepts. The main
exception is that wild-card matches in Mule have to be careful to
swallow whole characters. This is handled using the same basic macros
that are used for buffer and string movements.

This will also be true if a UTF-8 representation is used for the
internal encoding.

Perhaps surprisingly, the complex search algorithms are those for
simple string searches. In particular, the algorithm used for fast
string searching is Boyer-Moore.
This algorithm is based on the idea that if you have a mismatch at a
given position, you can precompute where to restart the search. This
typically means that you can often make many fewer than N character
comparisons, where N is the position at which the match is found, or the
size of the text if it contains no match. That’s fast! But it’s not
easy. You must “compile” the search string into a jump table. See
the source, ‘search.c’, for more information.
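As an illustration of the jump-table idea (not the actual ‘search.c’ code, which also handles case tables, Mule, and searching across the gap), here is a minimal Boyer-Moore-Horspool search using only the bad-character table:

```c
/* Return the offset of the first occurrence of pat (length m) in
   text (length n), or -1.  Simplified Boyer-Moore (Horspool
   variant): only the bad-character jump table is used. */
static long
boyer_moore_search (const unsigned char *text, long n,
                    const unsigned char *pat, long m)
{
  long skip[256];
  long i, j;

  if (m == 0)
    return 0;
  if (m > n)
    return -1;

  /* "Compile" the pattern into a jump table: on seeing byte c at
     the end of the window, how far may we safely shift? */
  for (i = 0; i < 256; i++)
    skip[i] = m;
  for (i = 0; i < m - 1; i++)
    skip[pat[i]] = m - 1 - i;

  /* Scan: compare right to left; on a mismatch, jump by the
     precomputed amount rather than shifting by one. */
  for (i = 0; i + m <= n; i += skip[text[i + m - 1]])
    {
      for (j = m - 1; j >= 0 && text[i + j] == pat[j]; j--)
        ;
      if (j < 0)
        return i;
    }
  return -1;
}
```

Because the shift is often close to the full pattern length, most positions in the text are never examined at all, which is where the “many fewer than N comparisons” claim comes from.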

Emacs changes the basic algorithms somewhat in order to handle
case-insensitive searches without a full-blown regular expression.

Regular expressions, on the other hand, have a trivial search
implementation: try a match at each position. (Under POSIX rules, it’s
a bit more complex, because POSIX requires that you find the
longest match in the text. This means you keep a record of the
best match so far, and find all the matches.)
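That trivial strategy can be sketched as follows, with a literal-string matcher standing in for the real regex matcher; all names here are invented.

```c
#include <string.h>

/* Stand-in for the real regex matcher: return the length of a
   match of pat anchored at pos in text, or -1. */
static long
match_literal (const char *text, long pos, const char *pat)
{
  long m = (long) strlen (pat);
  if (strncmp (text + pos, pat, (size_t) m) == 0)
    return m;
  return -1;
}

/* The trivial search loop: try a match at each position. */
static long
re_search_simple (const char *text, const char *pat)
{
  long len = (long) strlen (text);
  long pos;
  for (pos = 0; pos <= len; pos++)
    if (match_literal (text, pos, pat) >= 0)
      return pos;
  return -1;
}
```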

The matching code for regular expressions is quite complex. First, the
regular expression itself is compiled. There are two basic approaches
that could be taken. The first is to compile the expression into tables
to drive a generic finite automaton emulator. This is the approach
given in many textbooks (Sedgewick’s Algorithms and Aho, Sethi,
and Ullman’s Compilers: Principles, Techniques, and Tools, aka
“The Dragon Book”) as well as being used by the ‘lex’ family of
lexical analysis engines.

Emacs uses a somewhat different technique. The expression is compiled
into a form of bytecode, which is interpreted by a special interpreter.
The interpreter itself basically amounts to an inline implementation of
the finite automaton emulator. The advantage of this technique is that
it’s easier to add special features, such as control of case-sensitivity
via a global variable.
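A toy version of this approach, purely for illustration: the “compiled pattern” is a byte sequence of invented opcodes, and the interpreter is a small backtracking matcher. The real instruction set in ‘regex.c’ is much richer, but the shape is the same: ‘b’ (the buffer text) only ever advances, while ‘p’ (the bytecode) can loop and branch.

```c
/* Invented opcodes, not the real regex.c instruction set. */
enum regex_op
{
  OP_END  = 0,   /* pattern exhausted: success                */
  OP_CHAR = 1,   /* followed by a literal byte to match       */
  OP_ANY  = 2,   /* match any single byte                     */
  OP_STAR = 3    /* followed by a byte: match it 0 or more times */
};

/* Match the bytecode p against the NUL-terminated text b,
   anchored at the start of b. */
static int
re_match (const unsigned char *p, const unsigned char *b)
{
  for (;;)
    switch (*p)
      {
      case OP_END:
        return 1;
      case OP_CHAR:                   /* direct byte comparison */
        if (*b != p[1])
          return 0;
        p += 2, b++;
        break;
      case OP_ANY:
        if (*b == '\0')
          return 0;
        p++, b++;
        break;
      case OP_STAR:                   /* greedy, then backtrack */
        {
          const unsigned char *q = b;
          while (*q == p[1])
            q++;
          for (;;)
            {
              if (re_match (p + 2, q))
                return 1;
              if (q == b)
                return 0;
              q--;                    /* give back one repetition */
            }
        }
      default:
        return 0;                     /* malformed bytecode */
      }
}
```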

The compiler is not treated here. See the source, ‘regex.c’. The
interpreter, although it is divided into several functions and looks
fearsomely complex, is actually quite simple in concept. Basically,
what you’re doing there is a strcmp on steroids, right?

(1) Depending on the pattern, *b may have a general relationship to
*p. I.e., direct comparison against *p is
generalized to include checks for set membership and context-dependent
properties. This depends on &*b. Of course that’s meaningless
in C, so we use b directly, instead.

(2) Although to ensure the algorithm terminates, b must advance step
by step, p can branch and jump.

(3) The information returned is much greater, including information
about subexpressions.

We’ll ignore (3). (2) is mostly interesting when compiling the regular
expression. Now we have

All of them have the following structure, except that the ‘next_*’
functions decide where to jump (for ‘p’) and whether or not to
increment (for ‘b’), rather than checking for satisfaction of a
matching condition.