I wanted to colorize unified diff output on the commandline; red for deletions and green for additions. As I ​learned on StackOverflow, there is ​colordiff which is the right solution for the problem. But why use an existing solution written by someone else in 400 lines of Perl, when you can use a partial solution of your own in one line of sed?

So here's a one-liner that highlights deletions with a red background and additions with a green background, and hunk markers in blue text.

sed 's/^-/\x1b[41m-/;s/^+/\x1b[42m+/;s/^@/\x1b[34m@/;s/$/\x1b[0m/'

I chose to highlight changes using a background color so that whitespace changes would be more readily apparent. Interestingly, xterm does not display a background color for tab characters. This means that you are able to clearly see tab <-> space indentation changes in a diff. However, it also means that you can't see changes of trailing tabs. Sadly, colordiff does not support background colors.

Filenames are highlighted in the same way as content... for a good reason. You see, to differentiate between a filename line and a content line, you have to fully parse the diff output. Otherwise, if you add a line of text to a file that looks like:

++ vim73/src/ui.c 2013-06-01 14:48:45.012467754 -0500

you will get a line in your unified diff that looks like:

+++ vim73/src/ui.c 2013-06-01 14:48:45.012467754 -0500

which any regex-based approach is going to incorrectly see as a diff filename header. Clearly the same problem arises when deleting lines that start with --. Since colordiff is also a line-by-line regex-based implementation, it also highlights filenames the same as content.
This is one of those cases where you can change your problem specification to make your solution trivial.

Example:

evil.orig

blah blah blah
humbug
one two three
four five six
-- vim73/src/ui.c 2013-06-01 14:48:45.012467754 -0500
@@ -1,6 +1,6 @@
blah blah blah
one two three
four five six
eight nine ten
zero
humbug

evil.new

blah blah blah
bah humbug
one two three
four five six
++ vim73/src/ui.c 2013-06-01 14:48:45.012467754 -0500
@@ -1,6 +1,6 @@
blah blah blah
one two three
four five six
seven eight nine ten
zero
humbug

That one space before the false hunk header is probably the most visually apparent clue that something isn't right. Unless you're paying attention to the actual numbers in the hunk header, that is; but if the hunk is a couple hundred lines long and the false diff portion is only a couple of lines, even that would be hard to notice.

Suppose you want to represent data in a serialized form with the length prepended to the data. You can do something like what Pascal does with strings, and prefix it with an 8-bit length. But that only gives you 0 through 255 bytes of data. So you need something larger, such as a 64-bit value. But then a single-byte data value takes up 9 bytes including the length indicator. We'd really want small data values to use a small amount of overhead to encode the length, and we'd want the large data values to be representable, too. And thus, we want a variable-length encoding of the length. And the length is an integer, so we want a variable-length encoding of an integer. We'll start with representing a non-negative value, but a variable-length encoding of a signed integer may be worth a look too.

You can find some interesting articles on wikipedia about ​universal codes which are variable-length encodings of integers, but they focus on representations of integers in bit-streams. Given our usecase, we're really dealing with byte-streams.

So let's start with a simple idea:
Count the number of leading 1 bits and call that N. The total size of the numeric representation is 2N bytes. Take those bytes, mask off the N leading 1 bits, and interpret the number as a binary integer.

That gives us a way to represent any non-negative integer value. But there is one undesirable characteristic of this approach: there are multiple correct ways to represent any given number. For instance, the number 0 can be represented in a single byte as 0b00000000 or in two bytes as 0b10000000 00000000. This introduces ambiguity when encoding the value 0. There may be situations where this is a desirable property, but in this case, I want there to be one and only one representation of each integer.

A simple solution is to make the representations not overlap by adding the number of valid shorter representations to the integer representation. That is, interpret the 2-byte value as an integer, then add the number of valid 1-byte values to it, And for the 4-byte value, add the number of valid 2-byte and 1-byte values to it. An alternative way to state this is to add the largest number you can represent in the 2(N-1)-byte representation (plus one) to the integer.

Here is a simplistic Python implementation​. One of the nice things about using Python is that it can natively handle huge integers, so only the serialization aspect is needed.

This approach can be generalized in a couple of ways.

The first is that this could be done using leading 0 bits instead of leading 1 bits. I prefer the leading 1 bits because the 1-byte values 0-127 are the same as your normal unsigned char. But whether it is defined as the number of leading 1-bits or 0-bits, it still gives us a way to determine the value of N.

The second is in the translation of N into a representation size in bytes. I chose 2N, but it could just as easily be any function of N. If you wanted to have the size of the representation grow more slowly, you could use f(N) = N + 1. I like f(N) = 2N in part because it gives 1-byte, 2-byte, 4-byte, 8-byte representations that fit well into the natural integer sizes on modern computers.

This can also be generalized to signed integers as long as you define a mapping from the set of non-negative integers to the set of integers. A trivial solution would be to take the least significant bit to be a sign bit, though this gives you a way to represent negative zero. I suppose you could use that as a representation of Not-a-Number (NaN) or something along those lines. Alternatively, use a two's complement representation, though care would have to be taken with sign-extending the value and adding to that the largest magnitude negative or positive value that would overflow the next-smaller representation. This is left as an exercise to the reader.

Returning to our original problem statement, we now have a way to prepend a length to a data value while having the overhead cost low for small values while still supporting very large values. One byte of overhead to represent the length for data of 0 through 127 bytes is acceptable. Two bytes for 128 through 16511 bytes is also fine. By the time the overhead reaches 8 bytes, you're dealing with half a gigabyte of data.

But such a representation has additional possible uses. One that I have toyed with is using such a representation for a binary network communication protocol. Each message you define gets assigned an integer value, and you don't have to commit to a specific maximum number of message types when you define your protocol. Were I to use this for a protocol, I would want to make a 'version check' message have a numeric value < 128 so it fits in a single byte. And most messages would get a number that would map to a 2-byte value. That way, as messages are determined to be bandwidth "hot spots", they can be moved to a <128 value to cut a byte off their representation. The other thing I would probably do with protocol numbers would be to define a different f(N) that would grow the size of the integer representation more slowly. For that matter, it would be possible to map f(0) -> 1, f(1)->2, f(2)->2, f(3)->2, f(4)->3, etc; this would complicate some of the math, but would allow packing more values into 2 bytes. (The number of values represented by the second and third 2-byte representations would be half or a quarter of what the first 2-byte representation supported.) In a case like this, I would probably only define f(N) for the values of N I actually expect to use, and extend the definition as need arose.

Network protocols is another case where the unique nature of the representation is important. When you are dealing with systems that you want to secure (such as a network protocol), you do not want the ambiguity in the encoding process that a non-unique encoding implies. You want one and only one representation of each possible value so any attacker has no flexibility in doing something strange like using 256 bytes to represent the number 0.