I've been taking a look to Roslyn CTP and, while it solves a similar problem to the Expression tree API, both are immutable but Roslyn does so in a quite different way:

Expression nodes have no reference to the parent node, are modified using a ExpressionVisitor and that's why big parts can be reused.

Roslyn's SyntaxNode, on the other side, has a reference to its parent, so all the nodes effectively become a block that's impossible to re-use. Methods like Update, ReplaceNode, etc, are provided to make modifications.

Where does this end? Document? Project? ISolution? The API promotes a step-by-step change of the tree (instead of a button up), but does each step makes a full copy?

Why they did they make such a choice? Is there some interesting trick I'm missing?

1 Answer
1

Great question. We debated the issues you raise for a long, long time.

We would like to have a data structure that has the following characteristics:

Immutable.

The form of a tree.

Cheap access to parent nodes from child nodes.

Possible to map from a node in the tree to a character offset in the text.

Persistent.

By persistence I mean the ability to reuse most of the existing nodes in the tree when an edit is made to the text buffer. Since the nodes are immutable, there's no barrier to reusing them. We need this for performance; we cannot be re-parsing huge wodges of the file every time you hit a key. We need to re-lex and re-parse only the portions of the tree that were affected by the edit.

Now when you try to put all five of those things into one data structure you immediately run into problems:

How do you build a node in the first place? The parent and the child both refer to each other, and are immutable, so which one gets built first?

Supposing you manage to solve that problem: how do you make it persistent? You cannot re-use a child node in a different parent because that would involve telling the child that it has a new parent. But the child is immutable.

Supposing you manage to solve that problem: when you insert a new character into the edit buffer, the absolute position of every node that is mapped to a position after that point changes. This makes it very difficult to make a persistent data structure, because any edit can change the spans of most of the nodes!

But on the Roslyn team we routinely do impossible things. We actually do the impossible by keeping two parse trees. The "green" tree is immutable, persistent, has no parent references, is built "bottom-up", and every node tracks its width but not its absolute position. When an edit happens we rebuild only the portions of the green tree that were affected by the edit, which is typically about O(log n) of the total parse nodes in the tree.

The "red" tree is an immutable facade that is built around the green tree; it is built "top-down" on demand and thrown away on every edit. It computes parent references by manufacturing them on demand as you descend through the tree from the top. It manufactures absolute positions by computing them from the widths, again, as you descend.

You, the user, only ever see the red tree; the green tree is an implementation detail. If you peer into the internal state of a parse node you'll in fact see that there is a reference to another parse node in there of a different type; that's the green tree node.

Incidentally, these are called "red/green trees" because those were the whiteboard marker colours we used to draw the data structure in the design meeting. There's no other meaning to the colours.

The benefit of this strategy is that we get all those great things: immutability, persistence, parent references, and so on. The cost is that this system is complex and can consume a lot of memory if the "red" facades get large. We are at present doing experiments to see if we can reduce some of the costs without losing the benefits.

And to address the part of your question about IProjects and IDocuments: we use a similar model in the services layer. Internally there are "DocumentState" and "ProjectState" types that are morally equivalent to the syntax tree's green nodes. The IProject/IDocument objects you get are the red node facades for these. If you look at the implementation of Roslyn.Services.Project in a decompiler, you'll see that almost all of the calls forward to the internal state objects.
– Jason MalinowskiMay 2 '12 at 20:03

@Eric sorry for the remark, but you're contradicting yourself. The expense and difficulty of building a complex persistent data structure doesn't pay for itself. ref: stackoverflow.com/questions/6742923/… If you had high performance goals why did you make it immutable in the first place? Is there only other reason apart from the obvious ones? e.g. easier to make threadsafe, to reason about etc.
– Lukasz MadonMay 28 '12 at 18:56

2

@lukas You're taking that quote out of context. The previous sentence was "Because when you look at operations that are typically done on strings in .NET programs, it is in every relevant way hardly worse at all to simply make an entirely new string." OTOH, when you look at operations that are typically done on an expression tree -- e.g. typing a few characters into the source file -- it is significantly worse to build an entirely new expression tree. So they only build half of it.
– TimboJun 13 '12 at 18:20

1

@lukas My guess: Given that Roslyn is supposed to operate on background threads, immutability allows multiple threads to analyze the same source code at the same time without worrying that it will be changed when the user presses a key. In response to user input, immutable trees can be updated without stopping analysis tasks that are running. So I imagine that immutability's main goal is to make Roslyn easier to write (and perhaps easier for clients to use).
– QwertieJun 22 '12 at 20:41

3

@lukas Persistent data structures are more efficient than copying, when the data structure is typically much larger than the changes to it. Your point, if you have one, is lost on me.
– QwertieJun 29 '12 at 21:31