Month: April 2005

In yesterday’s episode of Diff Quest, I outlined a function called Expendable, which takes two CSes and tells you whether you can throw one of them away. Today I’ll describe that function’s implementation.

Expendable takes two parameters, CS1 and CS2. Each of those is a common subsequence, and therefore consists of a bunch of matches. And each match is a pair of indexes, like [0,1].

The Expendable function is all about the idea of growth potential. Growth potential depends on two things, and two things only: the length of the CS, and the last match in the CS. [0,0] [2,2] has the exact same growth potential as [1,1] [2,2] — the earlier matches don’t matter; what matters is what we can tack onto the end.

This means there are three numbers that we care about, when we’re thinking about a CS’s growth potential: its length; the AIndex of its last match; and the BIndex of its last match. So the inputs to Expendable can be described by six individual numbers, and the different return values of Expendable are determined by the relationship between corresponding pairs:

Is CS1Length less than, equal to, or greater than CS2Length?

Is CS1AIndex less than, equal to, or greater than CS2AIndex?

Is CS1BIndex less than, equal to, or greater than CS2BIndex?

The following table lists all the possible combinations of these. The “Len” column shows “<” if CS1Length < CS2Length, “=” if CS1Length = CS2Length, and “>” if CS1Length > CS2Length. The “AI” column is the same thing for AIndex, and “BI” is BIndex. After that, there’s a sample CS1 and a sample CS2 that exhibit these relationships (right-justified so the last matches always line up, since it’s the last match that’s important).

The next two columns are a kind of game, where I try to come up with a set of inputs that would make CS1 “win” (i.e., inputs that would build CS1 into a longer subsequence without also growing CS2), and then a set of inputs that would make CS2 win. It doesn’t matter how contrived those inputs are; what matters is whether those inputs are possible (and sometimes they’re not, which is when expendability kicks in). “EOF” means that no further input is required to make that CS win — if we reached end of file, and there were no more matches, then it’s already won.

Finally, the last column shows the result of Expendable(CS1, CS2).

| # | Len | AI | BI | CS1 | CS2 | Make CS1 win | Make CS2 win | Exp |
|---|-----|----|----|-----|-----|--------------|--------------|-----|
| 1 | < | < | < | [3,3] | [0,0] [6,6] | [4,4] [5,5] | EOF | nil |
| 2 | < | < | = | [3,3] | [0,0] [6,3] | [4,4] [5,5] | EOF | nil |
| 3 | < | < | > | [3,6] | [0,0] [6,3] | [4,7] [5,8] | EOF | nil |
| 4 | < | = | < | [3,3] | [0,0] [3,6] | [4,4] [5,5] | EOF | nil |
| 5 | < | = | = | [3,3] | [0,0] [3,3] | None possible | EOF | 1 |
| 6 | < | = | > | [3,6] | [0,0] [3,3] | None possible | EOF | 1 |
| 7 | < | > | < | [6,3] | [0,0] [3,6] | [7,4] [8,5] | EOF | nil |
| 8 | < | > | = | [6,3] | [0,0] [3,3] | None possible | EOF | 1 |
| 9 | < | > | > | [6,6] | [0,0] [3,3] | None possible | EOF | 1 |
| 10 | = | < | < | [3,3] | [6,6] | [4,4] | None possible | 2 |
| 11 | = | < | = | [3,3] | [6,3] | [4,4] | None possible | 2 |
| 12 | = | < | > | [3,6] | [6,3] | [4,7] | [7,4] | nil |
| 13 | = | = | < | [3,3] | [3,6] | [4,4] | None possible | 2 |
| 14 | = | = | = | [3,3] | [3,3] | None possible | None possible | Either |
| 15 | = | = | > | [3,6] | [3,3] | None possible | [4,4] | 1 |
| 16 | = | > | < | [6,3] | [3,6] | [7,4] | [4,7] | nil |
| 17 | = | > | = | [6,3] | [3,3] | None possible | [4,4] | 1 |
| 18 | = | > | > | [6,6] | [3,3] | None possible | [4,4] | 1 |
| 19 | > | < | < | [0,0] [3,3] | [6,6] | EOF | None possible | 2 |
| 20 | > | < | = | [0,0] [3,3] | [6,3] | EOF | None possible | 2 |
| 21 | > | < | > | [0,0] [3,6] | [6,3] | EOF | [7,4] [8,5] | nil |
| 22 | > | = | < | [0,0] [3,3] | [3,6] | EOF | None possible | 2 |
| 23 | > | = | = | [0,0] [3,3] | [3,3] | EOF | None possible | 2 |
| 24 | > | = | > | [0,0] [3,6] | [3,3] | EOF | [4,4] [5,5] | nil |
| 25 | > | > | < | [0,0] [6,3] | [3,6] | EOF | [4,7] [5,8] | nil |
| 26 | > | > | = | [0,0] [6,3] | [3,3] | EOF | [4,4] [5,5] | nil |
| 27 | > | > | > | [0,0] [6,6] | [3,3] | EOF | [4,4] [5,5] | nil |

Row #14 is interesting, in that the Expendable function could meaningfully return either CS1 or CS2. That’s because there’s nothing to differentiate the two inputs — they are, as far as we care, the same. One might be [0,1] [2,2] and the other might be [1,0] [2,2], but they both have the same growth potential, so there’s no need to keep both. We can throw one of them away, and it really doesn’t matter which one; we can decide arbitrarily.

I spent a while staring at this table, trying to figure out the pattern, because of row #14. I originally had it marked as “nil”, but I finally realized that that’s not right; we do want to throw away one or the other. Once I figured out that both CS1 and CS2 were expendable for that row, the pattern became clear:

If CS1Length <= CS2Length, and CS1AIndex >= CS2AIndex, and CS1BIndex >= CS2BIndex, then CS1 is expendable.

If CS1Length >= CS2Length, and CS1AIndex <= CS2AIndex, and CS1BIndex <= CS2BIndex, then CS2 is expendable.

(When all three relationships are equalities — row #14 — both rules fire, and we can throw away whichever one we like.)

It’s not the most graceful of algorithms, but it’s not a hard one to implement.
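A minimal sketch of that implementation, in Python (the function and parameter names are my own choosing), matching the table above:

```python
def expendable(cs1, cs2):
    """Return whichever CS we can throw away, or None if we must keep both.

    A CS is a list of [AIndex, BIndex] matches; only its length and its
    last match affect its growth potential.
    """
    (a1, b1), (a2, b2) = cs1[-1], cs2[-1]
    # CS1 is no longer than CS2, and its last match is at or past CS2's:
    # any match that could grow CS1 would also grow CS2, so CS1 can
    # never catch up.
    if len(cs1) <= len(cs2) and a1 >= a2 and b1 >= b2:
        return cs1
    # The mirror image: CS2 is expendable.
    if len(cs1) >= len(cs2) and a1 <= a2 and b1 <= b2:
        return cs2
    return None  # neither dominates the other; keep both
```

When all three comparisons are equalities (row #14), both conditions hold and the sketch arbitrarily reports CS1 — which is fine, since either answer is acceptable there.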

At this point, I’ve described all the major ingredients for a practical (better-than-O(2^n)) LCS algorithm; I could (or you could) go ahead and start writing code, and end up with a working LCS library. But I’m not going to start writing code for it, not just yet. Why? Because even though we’re now well into the realm of polynomial time, we’re still doing more work than we have to. There are a few more sneaky shortcuts in store, and I’ll start delving into them next week.

We’ve had some beautiful weather lately, so last night I pumped up our bike tires, intending to go for a ride. And then I didn’t do it. Ah well.

I got out this morning, after getting the dust cleaned out of my helmet. I made it all the way around the block. It’s a largish block, with a hill. I’ve ridden up worse hills before, but I was in shape then.

Coasting down the hill was easy. I made it about 2/3 of the way back up before I had to get off and walk. At the time, I debated pedaling a little farther, but I’m glad I didn’t; I was puffing something fierce by the time I got back home. I’m now happily collapsed into my chair and debating how long I can lie here before I have to get back up and go to work. I’ve got some vacation days saved up, don’t I?

Lesson learned #1: Next time, I will take the garage door opener with me, so I don’t have to park my bike, climb up the outside stairs, unlock the front door, go inside, and then go back down the stairs to open the garage. Save myself a little extra exertion.

Lesson learned #2: Next time, I will pour the glass of Gatorade before I leave, so it’s ready when I get back.

Last time, on the Diff Chronicles, I brought up a useful optimization called MaxLength. But useful as it is, it’s totally impractical, because finding the exact MaxLength takes too long.

But what if we could look at two CSes, and know that one’s MaxLength is better than the other’s, without actually calculating their MaxLengths?

Let’s define a new function, called Expendable(CS1, CS2). We give it two CSes, and it tells us which of them we can throw away. Sometimes it returns nil, meaning we can’t throw either of them away. (Thanks to Sam for suggesting that this be a function — it simplifies the explanation quite a bit.)

Assuming that we can implement the Expendable function, it’s simple to plug it into yesterday’s algorithm: Every time we come up with a new CS that we want to add to our list, we first loop through our existing CSes, and compare each one to our CSnew. If Expendable(CSi, CSnew) ever returns CSi, we delete CSi from our list, and then move on to CSi+1. If Expendable(CSi, CSnew) ever returns CSnew, we stop there, and don’t bother adding CSnew to our list. And whenever Expendable(CSi, CSnew) returns nil, we just keep going.

This is essentially what we did yesterday, but we were using an Expendable function that was implemented in terms of MaxLength. Now we just have to figure out how to implement the Expendable function without using that expensive MaxLength calculation. How do we look at two CSes and figure out that we can eliminate one without hurting our chances of finding the LCS?
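The pruning loop described above could be sketched like this in Python (names are mine; expendable is assumed to return whichever of its two arguments we can discard, or None to keep both):

```python
def add_cs(cs_list, cs_new, expendable):
    """Try to add cs_new to the list, pruning whatever Expendable allows."""
    kept = []
    for i, cs in enumerate(cs_list):
        verdict = expendable(cs, cs_new)
        if verdict is cs:
            continue                   # cs is expendable: drop it, move on
        if verdict is cs_new:
            return kept + cs_list[i:]  # cs_new loses: don't bother adding it
        kept.append(cs)                # nil: keep both, keep scanning
    kept.append(cs_new)                # survived every comparison: add it
    return kept
```

Note that when cs_new turns out to be expendable, we stop scanning and return the rest of the list untouched, exactly as described.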

Let’s look at (you knew this was coming, didn’t you?) some examples:

Expendable([0,0] [1,1], [1,1]) => [1,1].
Anything we can add onto one of these, we can add onto the other. But one of them is already longer. It’s a sure thing that MaxLength([0,0] [1,1]) = MaxLength([1,1]) + 1; we can tell that without knowing exactly what their MaxLengths are. Therefore, [0,0] [1,1] wins, and [1,1] gets flushed.

Expendable([0,0], [1,1]) => [1,1].
We know that there’s a match [1,1] out there somewhere, because it’s in the second CS. So we know that the [0,0] will end up growing to [0,0] [1,1] sooner or later. That means that this example is exactly the same thing as the previous example, and we already know that once it grows to that point, [1,1] is going to lose. Why wait that long? We can just cut our losses now, and toss the second CS right away.

Expendable([3,6], [5,2]) => nil.
We can’t eliminate [3,6] because it might well be part of the LCS. For example, there might be only one more match after this, and that match might be [4,7]. We can’t eliminate [5,2] because there might be only one more match, and it might be [6,3]. We can’t guarantee that either of these CSes is out of the running, so we can’t eliminate either one.

Expendable([0,5] [1,6] [2,7], [3,0]) => nil.
It’s tempting to say, “We’re working towards the longest possible CS, so when we need to throw one away, go for the shorter one.” But we’ve tried that before, and it didn’t work. If the next three matches are [4,1], [5,2], and [6,3], then [3,0] will grow into the next big powerhouse, because we can add those matches onto [3,0] to get a length-4 CS, but we can’t add them onto [0,5] [1,6] [2,7]. Bottom line, we can’t throw away either CS without waiting to see what other matches are yet to come.

Expendable([4,0], [4,4]) => [4,4].
We can guarantee that throwing away [4,4] won’t hurt our chances of finding the right LCS. There is no possible way that MaxLength([4,4]) could be larger than MaxLength([4,0]), but there are possible ways that MaxLength([4,0]) could be better; for example, the next match might be [5,2], which would grow [4,0] but would not grow [4,4]. It’s not guaranteed that [4,0] will result in a better MaxLength, but it is guaranteed that it won’t ever be worse.

The basic concept is something I think of as “growth potential”. [0,0] has more room to grow than [1,1] does; [1,1] is already closer to the end of the lists, and you can’t build your Lego tower much higher if there’s a ceiling in the way. [4,0] has more room to grow than [4,4] does. But whether [3,6] has more room to grow than [5,2] depends on a lot of things, like the size of the input lists, and where the matches fall; so neither one is guaranteed expendable.

Tomorrow, I’ll provide an actual algorithm for the Expendable function, and put it to work.

So we’ll cancel the append even if the re-raise fails, and execution falls through to the line after the raise. I can honestly say that this bit of resource-protection code never would have occurred to me.

A co-worker found another gem yesterday. This one also dates back to 1998 (though written by a different programmer):

Query.Locate(Query.FieldByName('ID').FieldName, ID, []);

So we call FieldByName, which searches for, and returns, the field whose FieldName is ‘ID’… and then we read that field’s FieldName. Brilliant, no?

I’m sad to say, though, that these coding gems are no longer in our code base. Somehow, they got lost during some code cleanup. I can’t tell you how much that distresses me.

Remember that we build longer CSes by starting with shorter ones, and adding stuff onto the end. Suppose we write a new function MaxLength(CSA), which returns the length of the longest CS that starts with CSA. Let’s go back to our last example:

MaxLength([0,0]) = 3, because [0,0] will grow into the length-3 CS [0,0] [2,1] [3,2].

MaxLength([0,0] [2,1]) is also 3, because [0,0] [2,1] will grow into that same length-3 CS, [0,0] [2,1] [3,2].

MaxLength([0,0] [2,1] [3,2]) is — you guessed it — 3.

MaxLength([0,0] [3,2]) = 2, because [0,0] [3,2] has already maxed out; there are no matches we can add at the end to make a longer CS.

MaxLength([2,1]) = 2, because [2,1] will grow into [2,1] [3,2].
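For concreteness, here’s what a brute-force MaxLength might look like — a hypothetical Python sketch, and exactly the expensive computation that’s about to become a problem:

```python
def max_length(cs, matches):
    """Length of the longest CS that starts with cs.

    matches is the complete list of [AIndex, BIndex] pairs. We recursively
    try every match that can legally be tacked onto the end.
    """
    a_last, b_last = cs[-1]
    best = len(cs)
    for a, b in matches:
        if a > a_last and b > b_last:  # valid extension: both indexes grow
            best = max(best, max_length(cs + [[a, b]], matches))
    return best
```

With the matches [0,0], [1,3], [2,1], and [3,2] from the example, this returns 3 for [0,0] [2,1] and 2 for [0,0] [3,2], just as above.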

Now, here’s the plan. We follow our stateful LCS algorithm, but for every CS we’ve accepted so far, we also keep track of its MaxLength. Then, when we come up with a new CS, we calculate its MaxLength, and compare that to the MaxLengths that are already in our list. If the new MaxLength is better than anything in the list, then we throw away the CSes with worse MaxLengths, and add the new one. But if the new MaxLength is worse than anything in the list, then we throw the new one away.

The highest MaxLength will always belong to CSes that are leading subsets of the LCS. That is, if the LCS is [0,0] [2,1] [3,2], then [0,0] will have the highest possible MaxLength — 3, in this case — and so will [0,0] [2,1] and [0,0] [2,1] [3,2]. So if we immediately throw away any CS with a lower MaxLength — [0,0] [1,3], for example, with a MaxLength of a mere 2 — then we’ll home right in on our LCS. We won’t waste time getting sidetracked, and we won’t linger on CSes with obvious holes ([0,0] [3,2] has a MaxLength of 2, so out it goes).

Well, okay, the obvious holes don’t play too big a part in an example this short. But they’re what makes our previous algorithm take exponential time. Fixing them is what makes the algorithm practical. And with MaxLength, we can do just that. Goody!

Until reality sets in, at any rate.

See, you can’t actually know what any of the MaxLengths are until after you’ve generated all the longer CSes. Which pretty much only happens after you’re done running the entire O(2^n) LCS algorithm. So you have to go through the algorithm once, without optimizations, so you can run it again, with optimizations. Yeah, that makes a lot of sense.

Driver: “Will we get there faster if I turn left or right at this intersection?”

Navigator: “Well, let’s try turning left, and drive until we get there, and see how long it takes. Then we’ll come back here, and the second time, we’ll turn right at this intersection, and then drive until we get there. Then we’ll know for sure which way will get us there fastest, and we can come back here again and take that route.”

Okay. So, the MaxLength scheme probably isn’t going to work. At all. However, it is a useful concept to build upon, and next time I’ll show a variation that’s nearly as useful, and not nearly as stupid.

Jennie just left for a half-week stay with her friend in Kansas City, so I’m batchin’ it here for a few days.

(I debated with myself over spelling. I mean, the phrase “batchin’ it” comes from the word “bachelor”, so maybe it should be “bachin'” instead. But that doesn’t really suggest a meaningful pronunciation — I mean, does “bachin'” mean grooving to Johann Sebastian, or what? And you do have to consider, this rite traditionally does involve queuing up household chores — laundry, dishes, etc. — to all be performed as a single batch operation when the wife gets home. Usually by the husband, at gunpoint. So I think my spelling is probably fairly appropriate.)

That said, the first thing I did after she left was to turn off the TV, unload the dishwasher, reload it, and start it. No kidding. We’ll see if it lasts.

Of course, the next thing I did was come down to the computer, so I may already be drifting into batch mode…

Next match: [1,3]. Can we tack that onto [0,0] and still have a valid CS? Yes. We now have one longer CS: [0,0] [1,3].

Next match: [2,1]. Can we tack that onto [0,0] and still have a valid CS? No, but we don’t want to lose the [2,1], so we’ll start a new CS. Now we have two: [0,0] [1,3] and [2,1].

Last match: [3,2]. Can we tack that onto [0,0] [1,3]? No. [2,1]? Yes. We end up with two CSes: [0,0] [1,3] and [2,1] [3,2].

Whoops! Given those inputs, we should have had a length-3 LCS: (1 3 4), or [0,0] [2,1] [3,2]. What happened?

What happened was that the [0,0] got sucked into another CS — [0,0] [1,3] — so it wasn’t available as a starting point for us to build [0,0] [2,1] [3,2] from. How can we fix this, and make sure the [0,0] is still available even when it’s been used in another match string?

Well, pretty easily, with a tweak to our list handling. Instead of dropping shorter CSes from the list and replacing them with longer ones, we can just add. So instead of removing [0,0] from our list when we add [0,0] [1,3], we just add them both.

Let’s give that a shot:

First match: [0,0]. Can we append that to any existing CSes? No, so we’d better add it as a new CS. Our CS list now contains [0,0].

Next match: [1,3]. Append this to everything we can — that would be [0,0]. Our CS list now contains both [0,0] and [0,0] [1,3].

Next match: [2,1]. Append this to everything we can — [0,0] once again. Our CS list now has [0,0], [0,0] [1,3], and [0,0] [2,1].

Last match: [3,2]. Append this to everything we can — [0,0] and [0,0] [2,1]. Our CS list now has [0,0], [0,0] [1,3], [0,0] [2,1], [0,0] [3,2], and [0,0] [2,1] [3,2].

But wait a minute. Why is it trying both [0,0] [3,2] and [0,0] [2,1] [3,2]? They’re the same thing, except that the first one has an obvious hole in the middle.

Actually, I took it easy in this example. If I’d been all proper about it, I would’ve added each match as a length-1 match string, too; so the CS list would also contain [1,3], [2,1], and [3,2] as length-1 match strings, plus anything else we built off of those. Why? We have to put in length-1 match strings that aren’t built off of anything else; after all, that’s exactly what we did in step 1! So after step 2, we would’ve had three CSes ([0,0], [0,0] [1,3], and [1,3]), not two. After step 3, we would’ve had five. After step 4, we would’ve had nine. The farther we go, the more obvious holes we accumulate.

This whole thing is smelling a lot like exponential time, even without adding the extra length-1 match strings. True, this algorithm is able to cut its losses — it doesn’t keep trying longer and longer match strings that all start with, say, [1,3] [2,1], because it’s smart enough to know that none of them will ever work. But it’s not smart enough to avoid obvious holes, and if it can’t cut very many losses (which will be the case a lot of the time), you’ll be waiting a long, long while for your diff to finish.

But it’s a good algorithm, once we cure its predilection for Swiss cheese. What it needs, at this point, are a few good shortcuts.
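In code, the keep-the-shorter-CSes-too algorithm — in its “all proper” form, where every match also starts a length-1 CS of its own — might be sketched like this (Python, my own naming):

```python
def all_cs(matches):
    """Build every valid CS by keeping shorter CSes alongside longer ones.

    Correct, but exponential: the list accumulates every valid common
    subsequence, obvious holes and all.
    """
    cs_list = []
    for a, b in matches:
        # Extend every CS whose last match is strictly before [a, b]...
        extended = [cs + [[a, b]] for cs in cs_list
                    if cs[-1][0] < a and cs[-1][1] < b]
        cs_list.extend(extended)
        # ...and also start a fresh length-1 CS from this match.
        cs_list.append([[a, b]])
    return cs_list
```

The longest entry in the result is the LCS; with the matches above ([0,0], [1,3], [2,1], [3,2]), that’s [0,0] [2,1] [3,2].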

Trying every possible match string will take a ridiculous amount of time, because it wastes its time a couple of different ways:

Obvious gaps: It tries match strings like [2,2] [4,4] and [3,3]. That’s bad — we want something that can recognize patterns a little better, and just try [2,2] [3,3] [4,4] without trying variants with gaps (since there are a lot of variants with gaps).

Invalid: It tries match strings like [0,4] [2,2] that aren’t going to pass our rule. That’s not so bad, but it would also try [0,4] [2,2] [3,3], [0,4] [2,2] [3,3] [4,4], etc., instead of cutting its losses as soon as it notices that no valid common subsequence will ever contain [0,4] [2,2].

There are really only three match strings — out of a possible 325 (31 if we can impose ordering on our matches, though we haven’t actually worked out a way to do that yet) — that make any sense: [0,4] [1,5]; [2,2] [3,3] [4,4]; and [5,0] [6,1]. We want to ignore the other 322 (28) possibilities, and go straight to the good stuff. How do we do that?

Actually, we’ve gotten pretty close to that already, in our “string ’em all together” algorithm. I made a big deal out of how it didn’t work, but most of that came down to the fact that we only gave it one register; it could only keep track of one “best-yet common subsequence” at a time. If it had a list instead, it would be a pretty decent fit. Let’s run through the way it would work.

We start with the first match, [0,5]. That’s the only match we’ve looked at so far, so we have one common subsequence (CS) at this point: [0,5].

The next match is [1,6]. Can we tack that onto [0,5] and still have a valid CS? We would end up with [0,5] [1,6], which passes our rule, so yes, we can append it. Our CS is now [0,5] [1,6].

Next: [2,2]. Can we tack that onto [0,5] [1,6] and still have a valid CS? We would end up with [0,5] [1,6] [2,2], which fails our rule, so no, we can’t append it to our existing CS. But we don’t want to throw the [2,2] away, so let’s add it to our list. Now we have two CSes going: [0,5] [1,6] and [2,2].

Next: [3,3]. Can we tack that onto [0,5] [1,6]? No. What about [2,2]? Yes. Our CSes are now [0,5] [1,6] and [2,2] [3,3].

Next: [4,4]. Can we tack that onto [0,5] [1,6]? No. What about [2,2] [3,3]? Yes. Our CSes are now [0,5] [1,6] and [2,2] [3,3] [4,4].

Next: [5,0]. Can we tack that onto [0,5] [1,6]? No. What about [2,2] [3,3] [4,4]? No, so we’ll have to start another CS. Our CSes are now [0,5] [1,6], [2,2] [3,3] [4,4], and [5,0].
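The whole single-pass, list-of-CSes approach above can be sketched in Python (my own naming): each match is appended, in place, to the first CS it can extend, and otherwise starts a new CS.

```python
def greedy_cs_list(matches):
    """One pass, one list: append each match to the first CS it fits,
    otherwise start a new CS."""
    cs_list = []
    for a, b in matches:
        for cs in cs_list:
            if cs[-1][0] < a and cs[-1][1] < b:  # valid to tack onto the end
                cs.append([a, b])
                break
        else:
            cs_list.append([[a, b]])             # no CS can take it: start fresh
    return cs_list
```

Running it on the matches in this walkthrough reproduces the three CSes above: [0,5] [1,6], then [2,2] [3,3] [4,4], then [5,0].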

So in the last few articles, I proposed and refined a simpleminded algorithm for finding the Longest Common Subsequence, only to find that the algorithm couldn’t run deterministically in a single pass. It gets to points where it has to make a decision, and it doesn’t have enough information to make that decision on the spot. It would have to try both options, and see which one works better.

But wait. Last time, I was trying every possible match string, in every possible order: I was building both ab and ba as match strings. But that’s silly, because they can’t both pass our rule. Could we start by putting the matches in some sort of order — [1,1] before [2,2], and like that — and then only generate match strings that have the matches in that order? After all, we already know that [2,2] [1,1] isn’t even worth considering; why generate it in the first place?

We could indeed do this, and it would bring the problem from the super-exponential into the merely exponential. It basically turns the list of N matches into an N-digit binary number, where each bit says whether the corresponding match is included. We ignore the number with all bits set to 0 (because if we have any matches at all, then our LCS will always be at least one match long), so we’re now dealing with 2^n - 1 possible match strings.
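As a sketch (Python, hypothetical naming), the N-digit-binary-number idea looks like this: each bit decides whether the corresponding match is in the match string, and we keep the longest candidate that passes the rule.

```python
def lcs_by_subsets(matches):
    """Brute-force LCS over all 2**n - 1 ordered subsets of the matches."""
    best = []
    n = len(matches)
    for bits in range(1, 2 ** n):  # skip 0: the empty match string
        cand = [matches[i] for i in range(n) if bits >> i & 1]
        # The rule: both indexes must strictly increase along the string.
        if all(a1 < a2 and b1 < b2
               for (a1, b1), (a2, b2) in zip(cand, cand[1:])):
            if len(cand) > len(best):
                best = cand
    return best
```

Still exponential, of course, but it never even generates an out-of-order match string like [2,2] [1,1].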

That’s a little better, but still not good enough for 1,000-line files:

3 matches: 7 match strings

4 matches: 15 match strings

5 matches: 31 match strings

6 matches: 63 match strings

10 matches: 1,023 match strings

20 matches: 1,048,575 match strings

1,000 matches: 1.071*10^301 match strings

So instead of taking much longer than cracking a 1,000-bit symmetric encryption key, our 1,000-line diff now merely takes as long as cracking a 1,000-bit key. My previous comments about the heat death of the universe still apply.

That answers our “What if?” question pretty conclusively.

The thing is, trying all the possible match strings is a bad idea for a much simpler reason: it tries way too many stupid options. It will try match strings with obvious holes in them — every possible combination of obvious holes. If you have matches [1,1], [2,2], [3,3], and [4,4], the “try ’em all” solution will be trying things like [1,1] [2,2] [4,4] and [2,2] [3,3] to see how they work. And it’ll take an obscene amount of time doing it.

So we have to have something less exhaustive than “try ’em all”, but it has to be more comprehensive than “one pass and keep the best so far”. Over the next few articles, I’ll start getting into what that algorithm should look like.

The problem with that algorithm was the linear approach it took. It looked at one match at a time, and after each match, it expected to know the one-and-only longest-yet common subsequence, and be able to ignore everything else.

And it doesn’t work that way. Let’s go back to the example at the end of the last article:

With the above inputs, we want our algorithm to come up with an LCS of [2,2] [3,3] [4,4]. So after it got to the [2,2], we wanted it somehow to know that it should keep that [2,2] instead of the [0,5] [1,6] it had come up with so far. But what if it had this input instead?

A is (1 2 3)
B is (4 5 3 6 7 1 2)
Matches are [0,5], [1,6], and [2,2].

The first three matches are the same, but this time, after it processes the [2,2], we expect it to somehow know that it should throw the [2,2] away and keep the [0,5] [1,6], because the latter is our LCS. So with the same inputs, we expect different results. The algorithm, as described so far, doesn’t have enough information to make the right decision.

What it comes down to is, when the algorithm has [0,5] [1,6] in its register, and it sees the [2,2], there is no reasonable way to decide which one is better. We need an algorithm that will try them both, and see which one yields the longest common subsequence.

If we try all the possible combinations (permutations? someday I’m going to have to figure out the difference) of 3 matches — let’s call them match a, match b, and match c — we’ll end up with 15 match strings:

3 match strings of length 1: a, b, c.

6 match strings of length 2: ab, ac, ba, bc, ca, cb.

6 match strings of length 3: abc, acb, bac, bca, cab, cba.

I won’t spend a lot of time on the math, but this is 3 + 3*2 + 3*2*1 total match strings. With 4 matches, we would have 4 + 4*3 + 4*3*2 + 4*3*2*1 total match strings. With N matches, we would have N + (N)(N-1) + (N)(N-1)(N-2) + … + (N)(N-1)(N-2)…(2)(1). Here are the actual numbers:
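Those sums are easy to check in Python — math.perm(n, k) gives the number of length-k orderings of n items:

```python
from math import perm  # perm(n, k) = n! / (n - k)!

def total_match_strings(n):
    """Every ordered selection of 1..n matches: P(n,1) + P(n,2) + ... + P(n,n)."""
    return sum(perm(n, k) for k in range(1, n + 1))
```

total_match_strings(3) is 15 and total_match_strings(5) is 325, matching the figures below.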

3 matches: 15 match strings

4 matches: 64 match strings

5 matches: 325 match strings

6 matches: 1,956 match strings

10 matches: 9,864,100 match strings

20 matches: way too damn many match strings

We’re looking at a super-exponential progression. That’s right: worse than 2^n. So if we had to start by building all the possible match strings, then a diff of two 1,000-line files would take longer — much longer — than cracking a 1,000-bit symmetric encryption key by brute force. (As far as I know, 128-bit keys are still considered uncrackable, and each additional bit doubles the cracking time.) And when it finished, sometime after the heat death of the universe, you really wouldn’t still care about the diff results anyway.

I’d say there’s a strong possibility that there’s a more efficient way to do it.