Wednesday, February 15, 2012

With this post I want to ask a question: what does it mean for code to be "too dense"? This question has implications for everything from languages to APIs to coding style.

I've seen debaters defend Java's verbosity precisely because it isn't "too dense." They say the sparsity of the code makes it easy to understand what's going on. Similarly, it's common to bash programmers for playing "golf" when their code is dense. But if we're allergic to density, why do programmers seem to prefer tools that create dense code when there are fairly straightforward ways to write less dense code?

For an example I'm going to use regular expressions, since just about every programmer knows what they are, they're very dense, they exist in direct or library form for every general-purpose programming language, and they are easy to replace with "normal" code.

Regexes are tight little strings that have very little in the way of redundancy. They're frequently accused of being "write only" - impossible to read and maintain once written. They are the poster children for "too dense" if anything is.

With the modern-ish focus on refactoring, and the understanding that code is read far more often than it is written, you'd think that if regexes really are too dense, programmers would be eager to replace those dense strings with more standard code just to improve readability. After all, a regex encodes a simple state machine (or perhaps something a bit stronger if the common Perl-ish extensions are used), so replacing one is easy.
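To make that concrete, here's a minimal sketch in Python (the function names and the identifier pattern are mine, purely illustrative) of one small regex and the explicit code that could replace it:

```python
import re

# A dense pattern: an identifier is an ASCII letter followed by letters/digits.
IDENT = re.compile(r"[A-Za-z][A-Za-z0-9]*\Z")

def is_ident_regex(s):
    return IDENT.match(s) is not None

# The same state machine expanded into explicit, less dense code.
def is_ident_loop(s):
    if not s:
        return False
    if not ("a" <= s[0] <= "z" or "A" <= s[0] <= "Z"):
        return False
    for c in s[1:]:
        if not ("a" <= c <= "z" or "A" <= c <= "Z" or "0" <= c <= "9"):
            return False
    return True
```

The expanded version says the same thing in several times the characters; whether that makes it more readable is exactly the question at hand.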

Yet it doesn't happen, at least not much. Regexes remain a mainstay. New regexes are continually written and old ones aren't ripped out and rewritten as loops and if statements just to gain some more readability. They're expanded for performance reasons or when the logic needed exceeds the power of regexes, but they almost never get replaced with an explicit state machine just to improve maintainability.

Why is that? We can't blame a few bad programmers. Regexes are far too widely used for that simple cop out.

What regexes and our use of them suggest is that we're not allergic to density in information per character but to something else. One culprit is simply unfamiliarity. Regexes are okay because we're familiar with them; other forms of density are bad because we're not familiar with them.

But maybe it's even stronger than that. Perhaps the familiarity with regexes makes us aware of a different kind of density/sparsity trade-off. A regex's information density may make it slower to read in terms of characters per minute, but we know that the expanded code would be slower to read in terms of concepts per minute.

In this post, I picked on regular expressions because they're so widely known and used but the bigger question is in the design of languages, APIs, and coding conventions. This article started with a question and will end with more. Are regular expressions outliers, unusual in creating value out of density? Is there some optimum relationship between frequency of use and density where something becomes too dense if we don't use it often enough? If we create dense languages, APIs, or coding conventions are we creating impenetrable barriers to entry for newbies? If we don't create dense notations are we providing a disservice to those who will use the notation often? Is there any hope that a designer of a language, API, or coding convention can find a near optimum density for his or her target audience that remains near optimal for a long time over patterns of changing usage?

19 comments:

I like the idea of compactification/expansion of code as an automated refactoring. When you don't want to change something much, you optimise down to a compact form that expresses the idea more succinctly. (Point-free style, say.) If you're less sure about the permanency of the design, you can keep some degrees of freedom hanging about. Most important, when you want a degree of freedom you've optimised away, you should be able to automatically inflate the code back to something more redundant.

Of course that takes some pretty neat refactoring tooling. But I think in principle we can have it both ways.
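A tiny illustration of such a compact/expanded pair that a refactoring tool could flip between (Python; the example is mine):

```python
lines = ["hello", "world", "!"]

# Compact, point-free-ish form:
total_compact = sum(map(len, lines))

# The same computation inflated back to a redundant form
# with the degrees of freedom spelled out:
total_expanded = 0
for line in lines:
    total_expanded += len(line)
```

Both compute the same total; the loop form leaves room to insert logging, filtering, or early exit that the compact form has optimised away.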

I don't know if regular expressions are a good example or not. I hate 'em, but I use 'em. Creating an alternative would cause even more confusion than the [insert number here] variants of regex already cause. Besides, my alternative wouldn't replace regexes; I'd still need to use them everywhere except in my own bespoke code. Since I already need to know all of the regex variants, creating an alternative just increases my workload.

[The chances of my creating an elegant alternative to regex that would take the world by storm and swiftly eliminate the scourge of regex variants from the Solar System are somewhere around 10^-42.]

On the opposite extreme of density is XML, but despite a number of nice alternatives that aren't nearly so verbose, we're still stuck with XML. So it's not just density.

I don't know that it's "density" so much as "obscurity" that's a problem. I've been trying to convert some old BASIC games to Scala, and I'm thoroughly frustrated with trying to remember what the variables are because they're named A through Z. Most of us practicing programmers learned a long time ago that except for loop indexes of i, j, etc., we should use variable names that are half-way meaningful. A lot of functional programmers haven't gotten the word and their lists end up being called "l" and "xs".

On the other hand, we've also learned that naming a variable "lpszSrc" increases obscurity by being too verbose with stuff that we don't care about.

For the same reason (obscurity), user-defined operators are a challenge. They are guaran-freakin-teed to steepen the learning curve of an API. I'll grant that there are sometimes reasons to use an operator —I'm sure not interested in writing "x.assign(x.plus(5))" instead of "x+=5" — but I ask API designers to really think about the names and operators they expose to their users.

There is, of course, no "bright line" between too terse and too verbose. In particular, idioms serve a good purpose in communicating a larger concept in a smaller bit of code. But idioms that aren't used often enough to be remembered are just confusing. As Liz Keogh wrote in Ten Tips for the Agile Martian Coach: "On no account should you tell your Martian that there’s more than one way to skin a cat."

I like the topic of this post a lot. It's been on my mind quite a bit because the "regex" argument has come up more than once when showing people some very compact, point-free code in the functional style.

But I think regexes are not the best representative of dense code, and they become a straw man.

Firstly, most people get distressed by the "leaning toothpick" problem of delimiting, but this is not really the fault of regexes so much as of the implementation in some languages/platforms (like Java).

The next problem with regexes is one of type-safety and correctness, which isn't necessarily a failing of all dense code. Dense code can be type safe. Otherwise, we can lean on tests to assert correctness.

Finally, density doesn't mean we have to throw away comprehensibility. We always have names for binding to help describe semantics. Rather than worry about using low-level structural constructs like conditions and loops, perhaps all we need to do is break the regex up into named pieces (which we might be able to reuse) and compose them functionally.
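For instance, here's a sketch of that idea in Python; the sub-pattern names and the date example are mine, not anything standard:

```python
import re

# Named, reusable building blocks for an ISO-style date.
YEAR  = r"(?P<year>\d{4})"
MONTH = r"(?P<month>0[1-9]|1[0-2])"
DAY   = r"(?P<day>0[1-9]|[12]\d|3[01])"

# Compose the pieces into one pattern.
ISO_DATE = re.compile(rf"{YEAR}-{MONTH}-{DAY}\Z")

m = ISO_DATE.match("2012-02-15")
```

Each piece can be read, tested, and reused on its own, and the named groups carry the semantics into the match result.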

Regexes are usually a foreign element in a sea of other code but it's immediately obvious that "this is a regex" so you can put your mind in "understanding regex" mode. It's like a free rider that benefits from the code around it being less dense. Make everything as dense as a regex and any value in its density would be lost.

I spend most of my days switching between Ruby (fairly dense), Objective-C (very verbose) and C#/Java (somewhere in the middle). Ruby manages to be terse and expressive. Written well, it's extremely readable. The designers of Objective-C (and its libraries) went out of their way to make it verbose and descriptive, but it ends up being much harder to read than Ruby.

My conclusions: regexes are a special case, indicative of little. Ruby is a better case study in "programmers like dense".

I think the difference is that regexes draw from a small, mostly well-understood vocabulary that requires little understanding of the surrounding context. The problematic kind of dense code is usually both dense and obscure in that it draws from a larger and not as well understood vocabulary (unfamiliar functions, libraries and so on). I'd guess that the combination of dense and obscure makes the code harder to read for people unfamiliar with the context because a dense syntax has fewer cues that tell people "OK, this is a function, this is an operand, this is an object's apply(), this uses an implicit," and so on.

In fact, if you really measured the entropy / information content within a regex versus a piece of dense and obscure Scala code, I'm not sure that they'd be comparable. Dense code requires an understanding of the libraries being used.

I am a huge user of regexes. I use them at least 20 times a day. Beyond regexes, we can also think about mathematical formulæ. And I believe they are a better example. Why? Because, once learned, you can read a mathematical formula aloud. I don't think this is as simple with regexes.

I believe density isn't the real problem. It is more about readability: how easy it is to grasp the idea behind a representation.

Some representations are very helpful. For example, if you take a look at Objective-C syntax, it has the great advantage of being very easy to read aloud. It is almost like reading English. On the other hand, Objective-C is very verbose. Thus in the end it is harder to read than Python and Ruby.

Python and Ruby are quite terse and very easy to read.

On the other side there are languages such as (the worst of all) XSLT, Perl and Haskell.

XSLT has both disadvantages of being ridiculously verbose and hard to read.

Perl is not terse but can be very hard to read. In fact, readable Perl is verbose; terse Perl is mostly unreadable.

Haskell is terse and can be very hard to read. Haskell is more like math. As with Perl, it is very easy to make Haskell both more verbose and more readable. As with math, when writing a scientific article it is preferred to use as much English as possible to make the paper more readable.

Why are Ruby and Python the winners at this game (go to codegolf.stackexchange.com for the proof)? I don't have an answer. But for me they reach the best balance between terseness and readability. In one word: syntax efficiency.

"A language should be designed in terms of an abstract syntax and it should have, perhaps, several forms of concrete syntax: One which is easy to write and maybe quite abbreviated; another which is good to look at and maybe quite fancy, but after all, the computer is going to produce it; and another, which is easy to make computers manipulate."--John McCarthy, http://www.infoq.com/interviews/Steele-Interviews-John-McCarthy

Short regexes like .*\.scala are easy to read; it's hard to see how to improve much on them. However, longer regexes with 5-10 tokens inside I find very difficult to read. Trying to debug such a regex involves a lot of finger pointing, mouse highlighting, and spacing out of code.

Math formulas are similar. A short formula like (x**2 + y**2)/2 reads quite well. However, a formula 3-4 times larger than this starts to get very difficult to read. It needs to be separated out and some local variables inserted.

Overall, I kinda like your idea of density in information per character. To refine it a tad, perhaps the best measure is tokens per square centimeter. Too few, and the reader has to scroll up and down a lot to do anything useful. Too many, and the reader really has to squint.

Nice post. Personally, I never understood what all the fuss about regexes is about. They have a very limited vocabulary, are easy to get somewhat comfortable with, and can even be commented in implementations that support the 'x' modifier.

Also — and this also comes from personal experience and teaching programming courses for just over 3 years now — a lot of the accusations about functional code being "too dense" I heard from people who are not willing to leave familiar territory and expose themselves to new concepts. Sure, code golf has no place in production code, but I'll take the declarative nature of nice functional code over imperative-style loops any day.

Regexes are themselves a kind of information compaction that the programmer has to disassemble mentally, even as they attempt to solve another compaction problem (often created by a non-programmer): that of compacting multiple pieces of information into one string.

But it really comes down to the fixed cost of learning the method of compaction and unraveling. I spent the time attempting to understand regexes early on and that helped a lot. And the same is true about many of the dense functional expressions as well. Learn them well, and they are easy to work with.

The issue really is not with dense code. It is with "learn it well". There are 100 things that compete for a programmer's learning attention. For some, it is important to learn a thing well and then deal with it competently. Others find it not so important given competing priorities and choose to learn just enough to move on.

Most implementations do support some form of comments and allow whitespace to be ignored, but I think the reason this isn't used very often is that we try to keep the expressions short and single-use. People don't usually write complete parsers in regex (even though they could), but combine regexes for floating-point numbers etc., where a comment like "this recognizes something with an exponent" is sufficient…
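As a sketch of that style, Python's re.VERBOSE flag plays the role of the 'x' modifier; the number pattern here is illustrative, not a complete float grammar:

```python
import re

# With re.VERBOSE, whitespace and comments inside the pattern are
# ignored, so the regex can be annotated in place.
NUMBER = re.compile(r"""
    [+-]?                  # optional sign
    \d+                    # integer part
    (?: \. \d+ )?          # optional fractional part
    (?: [eE] [+-]? \d+ )?  # this recognizes something with an exponent
""", re.VERBOSE)
```

The commented form is longer on the page but documents each chunk where it sits.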

I don't think the problem is pure density. I think the problem is something more like accessibility or perhaps a too-large difference in scale between what's in a person's head and what's expressed.

A well-written regular expression expresses a particular pattern that we as humans notice. It may or may not be dense; a =~ /^hello$/ is about as dense as a == 'hello'. But you can definitely have regular expressions that are too dense; any serious regex user has had WTF moments when dealing with very complicated regexes. The common solutions there are to break parsing up into stages or to move some of the magic out of the regex. Both operations reduce density but increase clarity.
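A sketch of the "break parsing up into stages" idea, in Python with made-up names: split the input first, then apply a tiny regex to each piece rather than one large regex to the whole string.

```python
import re

# Stage 2's tiny pattern: one key=value pair.
PAIR = re.compile(r"(\w+)=(\w+)\Z")

def parse_pairs(text):
    result = {}
    # Stage 1: plain split on the separator.
    for piece in text.split(";"):
        # Stage 2: a small, readable regex per piece.
        m = PAIR.match(piece)
        if m:
            result[m.group(1)] = m.group(2)
    return result
```

Each stage is simple enough to read at a glance, where a single regex covering the whole grammar would not be.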

For me, "too dense" means that the work needed to mentally unpack the dense code is larger than the work saved through the density. It's an edge I continually seek by pushing for density and then pulling back a bit when I've gone too far. I think the right level of density for a particular code base depends a lot on the people and the situation; I don't think there is a universal optimum.

Because of that I think the answers to your questions are much more about cultural choices than technical ones.

I don't have a problem with density, as long as it results from a clear train of thought aimed at solving a specific problem, rather than a process of trial and error in which one ends up addressing orthogonal concerns with a densely convoluted piece of code.

For example, let's say that I want to accomplish two things:

1. Validate that an input is a non-negative integer formatted using commas as thousands separators

2. Extract the last 4 digits of the input

I could either use two regexes, namely:

[0-9]{1,3}(,[0-9]{3})*

.*(.{4})

or I could use a single regex like the following (though someone else might be able to do it more elegantly):

(([0-9]{1,3}(,[0-9]{3})*,[0-9]{2})|([0-9]{0,2}))([0-9],[0-9]{3})

In an actual program, the second approach may appear slightly more "dense" because I only perform one matching operation, but to my eye the first approach is much clearer and more comprehensible.

The first approach feels "right" to me, while the second approach does not. The single expression was difficult to develop, it required lots of trial and error, and even now that I've written it, I would have a difficult time giving a coherent explanation of what it actually means. If I had to defend it in a code review, I would have to rely on the old standby "well, it works ...". The first approach not only works, but is reusable because each regex can be composed with other regexes for validating new kinds of values. The first approach also responds well to likely requirements changes like "we want only the last 3 digits of the number" or "we want to make commas optional".
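As a sketch, the two-step approach might look like this in Python (the validation regex is the commenter's; for extraction I simply strip the commas and take the last four characters, which is one way of realizing the second step):

```python
import re

# Step 1: validate a non-negative integer with comma thousands separators.
VALID = re.compile(r"[0-9]{1,3}(,[0-9]{3})*\Z")

def last_four_digits(s):
    if not VALID.match(s):
        return None
    # Step 2: extract the last 4 digits, ignoring separators.
    return s.replace(",", "")[-4:]
```

Each step can be tested and reused on its own, which is exactly the composability argument above.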

So, while I think that density is interesting, I do not believe that it is a key determinant of clarity. Just like any other code, dense code may be either clear or convoluted depending on what it actually does.

Perhaps "too dense" means, for the writer, "I'd make this clearer, but didn't/couldn't because of X". Common X include "the tool wouldn't let me", "the possible approaches cost more clarity than they contribute", "it would have been too hard", and so on.

For regexps, the programmer might think "this is getting too dense.. let me add /x, break it up with clarifying whitespace, name the chunks, and maybe add comments". But if your regexp implementation doesn't permit these, then you're stuck with "the regexp is too dense".

For readers, there's "I don't feel like I was your target audience", and "why didn't someone force you to target me?!". Java, for instance, was up-front about excluding language features which, while permitting denser, clearer, and more maintainable programs, were thought too audience-excluding.

This particular paper studies the use of contractions, but I've seen corpus and experimental studies on information density in speech rate, pro-forms and ellipsis, reduced relatives and complements, and so forth. And the conclusions are all the same: in high information contexts, the more verbose forms are preferred (full NPs, no ellipsis, complementizers and full relative clauses, slower speech rates, more highly articulated allophones/allomorphs, etc). In low information contexts, the more terse forms are preferred (pro-forms, ellipsis, reduced relatives, complementizer drop, contractions, etc).

The conclusion is that there seems to be an optimal information rate range, and we exercise the options in our language to stay within that range.

I suspect the same is true of programming languages. In this light, I think Ruby and Python are both popular because they are closer to many programmers' optimal information density. Java is becoming less popular because it is too verbose. Perl lost some popularity because it was often too terse, and the Modern Perl movement in Perl 5 seems to move toward Ruby-like information density. CoffeeScript is popular because many JavaScripters find JavaScript too verbose. Q will never be popular even if kdb were free, just because it's too information-dense for most programmers. Etc.

(NB: I am not claiming this variable explains all or most of the patterns in language popularity. There are clearly other factors.)

Given this, here's how I'd answer your questions:

> Is there some optimum relationship between frequency of use and density where something becomes too dense if we don't use it often enough?

Yes, certainly. The less you use a pattern, the more complex it is (that is, the higher its information content), and so more redundancy should be inserted to help manage that complexity. In the best cases, we do this by breaking it down into more ubiquitously useful (and thus less complex) components. Or we give things longer, more descriptive names.

> Are regular expressions outliers, unusual in creating value out of density?

No. What regexes tell us is that the task of finding and substituting patterns in text is ubiquitous in many, many programming communities. Which you already knew.

> If we create dense languages, APIs, or coding conventions are we creating impenetrable barriers to entry for newbies?

Not if those dense APIs or conventions are ubiquitous. Their repeated use compensates for the added complexity of their terseness.

> If we don't create dense notations are we providing a disservice to those who will use the notation often?

Yes. But often you can count on those disserved to create a more terse API for their own consumption.

> Is there any hope that a designer of a language, API, or coding convention can find a near optimum density for his or her target audience that remains near optimal for a long time over patterns of changing usage?

Good question. I'd look to the past for this. How much does notation in various math and programming communities change over the decades and centuries? I don't know the answer.

Functional programmers have gotten the word. They just tend to write a lot of abstract code, so you see a lot of code that works over lists of any kind of thing at all.

I find "xs" to be an incredibly descriptive name for "more than one x", and "x" to be a very descriptive name for "any kind of thing at all". What more descriptive name would you suggest?

When you see more specific code, code that works over certain kinds of lists, you tend to see functional programmers use more descriptive names. A list of lines of text is way more likely to be called something like "lines" than "xs".

Or to put it another way, getting an x from xs is the functional equivalent of an index in a loop. So if you're OK with "i" and "j" for loop indices, why aren't you OK with "x" and "xs"? Why the double standard, man?