Suppose, for the sake of argument, you've got an append-only data structure that tracks history at a particular scale, and you'd like to use it for very large data sets. Like, larger than will fit in memory. Before you even begin approaching that threshold on a server, you'll need to start pruning the things you keep around. My first reflex was to change up how history is stored and projected: rather than keeping all of history around in memory, keep the current projection, some ~400 history states, and the starting projection|1|. You then get all the benefits of a total history without the drawback of filling up all your memory with it. The trouble only starts when you want to back out further than the ~400 states you have in memory, and then it gets painful fast: you need to read more and more chunks off of disk to work with, and in the degenerate case, you need to read all of history anyhow.

When I explained this to a friend of mine, he asked why I needed to keep history in memory at all, and suggested that, since my :insert/:delete/:change tokens are perfectly reversible, when I need to slice up history I could read entries from the end of the on-disk representation|2| and reverse them off of the current projection.

It so happens that Common Lisp provides file-position for random-access to on-disk data, so this sounds like a plan. Also, my on-disk data is being kept in trivially serialized lists, so it should be fairly simple to read them.
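A first cut might look something like this; the function name and shape here are my own sketch, not fact-base's actual code:

```lisp
;; Naive sketch: back up from EOF to the open paren matching the
;; file's final close paren, then READ from there.
(defun read-last-expression (file)
  (with-open-file (s file)
    (let ((cur (file-length s))
          (depth 0))
      (loop
         ;; step the file pointer back one character and peek at it
         (file-position s (decf cur))
         (case (peek-char nil s)
           (#\) (incf depth))
           (#\( (decf depth)
                ;; depth 0 means we found the matching open paren;
                ;; peek-char didn't consume it, so read starts here
                (when (zerop depth)
                  (return (read s)))))))))
```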

That starts at the end of a file, gets a list by skipping backwards to the matching paren and then calling read. Except this won't handle some edge cases. Or rather, non-edge cases, given what I frequently write about. Here's a sample entry that would break the above:
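For instance, something along these lines; the entry shape here is my own invention, not fact-base's actual format:

```lisp
;; a made-up history entry whose string payload contains an
;; unbalanced paren, which throws off naive paren-counting
(347 :change (:user :bio) "worked at Initech (2003")
```

Counting backwards, the ( inside the string balances the final ), so the naive scan stops in the middle of the string and read returns garbage.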

Except, no, the obvious tweak of skipping back over quoted strings will get you odd behavior too, in the rarer, but still possible, case of an escaped quote followed by a real one. Maybe I'm worrying too much about this, but just so you know, the Emacs HTML mode I've got running right now is having a hell of a hard time highlighting this article, for something not dissimilar to the edge case I'm discussing. Something like this:

At a candidate ending quote, keep reading until you run out of backslashes, and count them. If you counted an odd number, one of them escapes the quote, so you need to keep looking for another quote. Otherwise, resume looking for a matching paren.
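Here's a sketch of how that might hang together. The helper names follow the walkthrough below, but the bodies are my own reconstruction; in particular, I've made slashes count backslashes without moving the pointer, to simplify the bookkeeping:

```lisp
(defun read-entry-from-end (file)
  ;; Sketch: scan backwards from EOF to the start of the last entry,
  ;; then READ it. cur tracks the backwards-moving file position.
  (with-open-file (s file)
    (let ((cur (file-length s)))
      (labels ((peek () (peek-char nil s))
               (peek-at (pos) (file-position s pos) (peek))
               (slashes ()
                 ;; count the backslashes immediately before cur
                 (loop for p downfrom (1- cur)
                       while (and (>= p 0) (char= (peek-at p) #\\))
                       count t))
               (to-open-quote ()
                 ;; back up to a candidate opening quote; an odd
                 ;; slash count means it's escaped, so keep looking
                 (loop do (decf cur) until (char= (peek-at cur) #\"))
                 (when (oddp (slashes))
                   (to-open-quote)))
               (to-entry-start ()
                 ;; match parens backwards, swallowing strings whole
                 (let ((depth 0))
                   (loop
                      (decf cur)
                      (case (peek-at cur)
                        (#\" (to-open-quote))
                        (#\) (incf depth))
                        (#\( (decf depth)
                             (when (zerop depth) (return))))))))
        (file-position s cur)
        (to-entry-start)
        (file-position s cur)
        (read s)))))
```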

And that's basically what the above does. Let's go step by step so you understand what's going on.

Finally, we set the file-position of s to cur and call to-entry-start|3|. The result of that operation should be to get s's file-pointer to the place where the last history entry starts. At that point, we can read it|4|.

slashes jumps until it gets to something other than a backslash, and keeps count of the slashes it jumps past. If there was an odd number of slashes, then we just saw an escaped quote rather than a real one, and we need to call to-open-quote again|6|. Otherwise, we return, which will continue from the call point in to-entry-start. Oh, lastly, peek

...
(peek () (peek-char nil s))
...

is shorthand for peeking from the current file stream without skipping whitespace.

The principal use-case for this thing is going to be getting some intermediate history entry, not necessarily the last one. We'll need to make a couple of changes to make that possible. First, we could accept a number of entries to skip, and second, we could specify a starting position. The skip argument should default to 0, and it shouldn't let the user pass negative values, 'cause skipping by a negative count wouldn't make any sense. The start argument can't be larger than one less than the length of the file, but we can't really do much if we're passed a pointer to a character in the middle of an s-expression somewhere. Really, I'll only want to use this to skip previous entries that I've read, so a good idea would be to have read-entry-from-end return the starting point of the expression it returned|7|. Also, I'll put up a sign to be careful with that second one.

Most of the difference is at the bottom there. We're calling to-entry-start a number of times equal to one more than the number we're skipping|8| instead of once, and we're returning the file-position at which the returned entry starts as a second value. The only other differences are the added docstring, the changed arglist, the assertion on skip and the assignment of cur. That last one now checks that the given start point is 0 or greater, and less than the total length of the specified file, otherwise it defaults to the end of the file|9|. Here it is in action:
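The generalized version might look something like this; it's a sketch with the backwards-scanning helper elided, and the file name in the example is made up:

```lisp
(defun read-entry-from-end (file &key (skip 0) start)
  "Read the entry SKIP entries back from the end of FILE (or from
START, if it's a valid file position). Returns the entry, and the
file position it starts at as a second value."
  (assert (>= skip 0))
  (with-open-file (s file)
    (let ((cur (if (and start (<= 0 start (1- (file-length s))))
                   start
                   (file-length s))))
      (labels ((to-entry-start ()
                 ;; the backwards paren/quote scan described above,
                 ;; moving cur to the start of the previous entry
                 ))
        ;; skipping n entries means finding n+1 entry starts
        (loop repeat (1+ skip) do (to-entry-start))
        (file-position s cur)
        (values (read s) cur)))))

;; e.g., the third-from-last entry and its starting offset:
;;   (read-entry-from-end "history.fb" :skip 2)
```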

That should do for what I need. At this point, I need to tear out the in-memory history on fact-base and make sure to reconcile that properly with the still in-memory delta. Of course, there's a couple of other approaches to consider; one is just caching history entry offsets as I read/write them and keeping that list around. Not sure how I feel about it, since it would still eventually saturate memory, though at a greatly reduced rate. I'll be exploring that for the next little while.

3 - |back| - Incidentally, on Linux/Unix, both the file-length and the file-position calls are pretty friggin efficient. I have no idea what the situation is on Windows, so if you're looking to implement something there, you might want to see whether it'll perform well enough for you before getting in too deep.

4 - |back| - I thought about trying to do this in a single traversal, but decided that you basically can't, and that reading from the stream is the most efficient way to get this s-expression out. Why? Because there are basically two ways to do it. You could accumulate the characters you're peeking into a list using push, then coerce it into a string at the end. The trouble is, you still need to call read-from-string on the result, and the coerce call probably has to allocate a fresh array, then copy the chars into it. Which is worse. The other approach is to try to pre-allocate a string and copy the peeked chars directly into it. The problem with that is that you have no idea how big a string you'll need beforehand. The premise here is that I want this to work on files that are too large to fit in memory, so you can't just allocate a string to hold the entire file, but it's anyone's guess where the next s-expression ends, and you still need to call read-from-string on the result. Even if I decided that I really, really want this, and snuck length footers into the fact-base file-format, I'd still need to do that read-from-string at the end. The length footers might still help marginally, but my best bet still seems like it'd be reading directly from the stream once I have the file pointer where I want it.

5 - |back| - Also, in case you're wondering, I'm constantly clarifying that because of that Emacs HTML mode bugbear I mentioned. If you have an unmatched quote in a buffer, even if it's escaped, the rest of the buffer ends up being highlighted as a string. It's surprisingly annoying. So I try to close it up as soon as I can.

6 - |back| - Which on reflection I should probably call to-quote for accuracy. This was a note to self.