Eric Lippert's blog

Lowering in language design, part two

Last time on FAIC I described how language designers and compiler writers use “lowering” to transform a high-level program written in a particular language into a lower-level program written using simpler features of the same language. Did you notice what this technique presupposes? To illustrate, let’s take a look at a line of the C# specification:

the operation e as T produces the same result as e is T ? (T)(e) : (T)null except that e is only evaluated once.

This seems bizarre; why does the specification say that a higher-level operation is exactly the same as a lower-level operation, except that it isn’t? Because there is no way to express the exact semantics of as in legal C# in any lowered form! The feature “evaluate a subexpression to produce its side effects and value once and then use that value several times throughout its containing expression” does not exist in C#, so the specification is forced to describe the lowering in this imprecise and roundabout way.

A nice principle of programming language design that C# does not adhere to is that all the higher-level features of the program can be expressed as lower-level features. Why is this nice to have? Well, not only is it a big convenience for writers of the specification, it also is a big convenience for writers of language analyzers. If you can programmatically transform a high-level language into an exactly equivalent program in a language with far fewer complicated concepts then your analyzer gets simpler.

For example, suppose you are writing code that wishes to determine if a particular line of code is reachable. There are two basic approaches. You could write an analyzer that understands the control flows associated with try, catch, finally, throw, goto, if, else, return, using, lock, for, foreach, do, while, switch, yield break, break and continue keywords. Or, you could write a lowering pass that turned using, lock, for, foreach, do, while, switch, yield break, break and continue into their lowered forms; all of these can be expressed by some combination of the others. Now your analyzer has fewer things to analyze.

This is why I am very pleased by the proposed addition of two kinda weird little features to C# 6. The first is introduction of new variables inside expressions, and the second is sequential composition of expressions.

Today in C# the only two ways to introduce a new local variable within the body of a method are to make a declaration statement:

int x = 123;

or to introduce a parameter to an anonymous function:

M( (int x) => x + 1 );

The former requires a statement; the latter brings the variable into scope only inside the anonymous function’s body. (A third technique is to use a let clause in a query expression, but this does not actually produce a variable as it cannot change.) By itself this feature is not particularly useful, but when combined with the sequential composition operator, it becomes very powerful. The semantics of the sequential composition operator are very straightforward: take two expressions, separated by a semicolon. The operator evaluates the left, discards the value if there was one, evaluates the right, and uses the value as its result.

The sequential composition operator is known as the “comma operator” in C, C++ and JavaScript, but I’ve never liked that syntax as it is too easily confused with the comma used to separate arguments in a function invocation. (C# and Java can use a comma to separate loop variable declarations in a for loop, but this feature is not in widespread use.) Since the semicolon is already used to mean sequential composition of statements, it makes sense to also use it as the sequential composition operator for expressions.

This means that the specification can now be reworded as

the operation e as T where e is of compile-time type V produces the same result as (V temp = e ; temp is T ? (T)(temp) : (T)null).

There is no longer any need to qualify the statement; this is the lowering of the operator, end of story. Well, almost: of course temp is understood to be a unique compiler-generated name unused by the method.

I think it is a really nice property for a language to have, that its high-level features can be entirely specified in terms of its low-level features. The addition of these two new features to C# 6 doesn’t get it quite all the way there; we’d also need to add the ability to make a ref local variable in a variable declaration expression for some features, as well as a few other low-level abilities that are not exposed by the C# language. But this is definitely a step in a good direction.

Can’t say that I’m thrilled by those new C# features. I don’t really want to see what creative mess my co-workers will be able to do with “clever” 1-liners now.

You know, C# compiler and tools could always have had a more permissive C# language internally (such as those new C#6 features) and work on that, while still exposing a “clean” language. Kinda get the best of both worlds that way.

In many cases, “clever” one-liner hacks get used not because a language is expressive enough to allow them, but rather because it’s insufficiently expressive to let the same thing be said better. The more accurately a language lets a programmer express what is actually desired, the less need there will be for such hacks.

I certainly abide by the principle that languages should not needlessly drive programmers to use high-level features when low-level features are in fact a better fit for their needs, and that low-level features should, when practical, be designed to make high-level use straightforward. Still, high-level features should, when practical, focus on what programmers are trying to do. For example, I would consider

if ((myFlags & OperationModes.PlaybackOrRecord) != 0)
    DoSomething();

to be really ugly. The literal zero isn’t really a “number”: changing it to any value other than zero would make the program brittle in a way it wasn’t before. Even if the language lacks the low-level features that would let user code write an efficient helper method working nicely with all “flags” enum values, that’s hardly an argument against adding compiler-intrinsic `HasFlag`, `HasAny`, and `HasAll` pseudo-members (the first would require that the argument be a compile-time constant power of two; the latter two would avoid ambiguity when it isn’t).

Limiting a language in an effort to make it hard to write bad code doesn’t work. Such limitations often impair the writing of good code, and end up creating a need for bad code which would otherwise not have existed.

“Limiting a language in an effort to make it hard to write bad code doesn’t work”

But both Java and C# definitely try to make it hard to write bad code – the often mentioned “pit of success” comes to mind.

If you want a language that doesn’t consciously try to do so, write in C++ – it certainly allows you to write good and clever code that you can’t (or can only barely) write in Java/C#, but there’s a big downside to that.

That said, the example you’ve picked is definitely apt – it’s very sad that C# didn’t go with an enum implementation à la Java, which would allow us to write such flags code very nicely and cleanly. Maybe in C# 7.

That C++ doesn’t try to prevent bad code doesn’t mean it provides all the features necessary to write good code cleanly.

With regard to how enum is implemented, bitmapped integers are a better storage format for flags than are object references. The problem is that there’s no way for enum types to define semantics beyond saying whether things might be combined, and “Here’s a set of values that might mean something”, and there’s no way to define compile-time constants of more sophisticated types.

As I see it, your argument is that not providing features that can easily be abused not only fails to reduce the amount of bad code, but also reduces the amount of good code.

Since C++ tries hard to provide every feature one can think of (sure, it’s still missing some – transitive const would be nice), especially with template metaprogramming, that should mean that the quantity of bad code in C++ is similar to Java/C# and the quantity of good code is higher.

That’s not my experience, is all I can say to that – there are so many features that can sometimes be used to good effect but are 99% of the time abused by people who don’t know better.

About flags: a large use case for flags is passing a single “instance” around to other functions, where the overhead is completely uninteresting. Sure, if you want to store ten million flags in an array you may want to go for the memory-efficient low-level solution, but otherwise that’s hardly important.

And the largest use case for flags that I can see (especially in the Win32 APIs) is having a set of enums that themselves have specific meaning. So enums should have one specific meaning and you should have the possibility to define a set of enums and do the usual queries/updates on that.

The fact that a language provides a means to do something does not mean that it provides a means of *cleanly expressing what one wants to do*. From what I can tell, the design of C++ seems to have historically been more focused on the former than the latter, and the need for compatibility has required many kludges which were added to address the former to remain in place. Further, once certain bad ideas take hold (e.g. overloading `<<` for stream output) they can be hard to banish. All the historical baggage C++ carries around would encourage the writing of bad code in that language no matter how many ways it might offer to write better code.

With regard to "flags" enumerations, an `enum` which only encapsulates a one-of-N selection could be about as efficient as an `int`, but when it's necessary to pass a combination of flags, passing an `int` and having a method test its bits is going to be *much* cheaper than passing any sort of list of flag values.

I’m not sure whether your comment about me is meant to be sincere or sarcastic. Perhaps you think I’m trying to imply that “my” clever code is all justifiable based on the lack of certain language features. That’s not my intention.

Fundamentally, I think the best way to discourage “clever” code is to allow a programmer sufficient means of expression that **the compiler’s understanding of the program’s internal workings can exceed that of the programmer**. If a value needs to be able to hold values 0-65535 and needs to fit in two bytes, but calculations exceeding the range 0-65535 should be considered erroneous, let the programmer say that, thus letting the compiler use two bytes of memory or a 32-bit register as convenient, and letting it either trap or ignore overflow, rather than requiring it to waste code on unwanted wrapping behavior.

I spotted both of these new features in the Roslyn documentation recently. I’d have to say that I both agree and disagree with commenter bystander – I see these as useful-when-you-need-them, but offering the potential for a lot of abuse. In fact the Roslyn documentation itself contains a similar comment regarding these features.

“Well, almost: of course temp is understood to be an unique compiler-generated name unused by the method.”

Is it actually necessary? I don’t quite remember the shadowing rules of C#, but if they are what I think they are, temp may as well be an already existing variable mentioned in e, and it still wouldn’t cause any ambiguity.

The original feature could have been expressed in a lowered form: e as T is equivalent to AsT(e), where AsT is a function defined as T AsT(E e) { return e is T ? (T)e : (T)null; }, with E the compile-time type of e. (Lambdas didn’t exist at the time, so we couldn’t have used those to define the “as” operator.) Similarly, (a; b; c) is lowerable as ((System.Func<T>)(() => { a; b; return c; }))(), where T is the type of c.

Hi Eric, In Roslyn I came across some code – for( ; ; ) – what’s with that? Is that sequential composition operator? It doesn’t look to be a comma used to separate loop variable declarations in a for loop as you mentioned in your article as not widely used.

All three parts of the “for” statement (initializer, condition, iterator) are optional. An omitted loop condition is taken to be “true”, so “for (; ; ) { … }” has the same meaning as “for (; true; ) { … }”, which has the same meaning as “while (true) { … }”.

And even though “while (true) { … }” is clearer, many people coming from C/C++ still write “for (; ; )” – the reasons they were taught to do so are mostly historical. You see, C/C++ compilers used to produce (or maybe still produce, I don’t know) more efficient code for “for (; ;)” than for “while (true)”.

@Joker_vD. Thanks for your explanation. I had no idea that all three parts were optional. In the code I was looking at, this “for (; ;)” loop was immediately followed by a “do { … } while (condition)” loop. Indeed I wonder if there is some compiler optimization for this.