Learn Roslyn Now: Part 8 Data Flow Analysis

Writing this blog post has been really painful. It’s been three months since I last published my introduction to the semantic model and I’ve been putting off this post for as long as I could. I started a new series called Learn Roslyn Now Quick Tips, I helped build Source Browser, and I even submitted a small pull request to clean up the analysis APIs. Basically, I’ve done everything but learn and write about these APIs.

I’ve struggled to imagine how one would use them in an analyzer or extension.

They’re weird, unintuitive and they frighten me.

I put out a tweet asking how others were using them, and it appears they’re only really used within Microsoft to implement the “Extract Method” functionality. A handful of questions on Stack Overflow have mentioned these APIs, so I’m sure someone out there is putting them to good use.

Data Flow Analysis

This API can be used to inspect how variables are read and written within a given block of code. Perhaps you’d like to make a Visual Studio extension that captures and logs all assignments to a certain variable. You could use the data flow analysis API to find the statements, and a rewriter to log them.

To demonstrate the capabilities of this API, we’ll be looking at a modified piece of code posted on Stack Overflow. I’ve cleaned it up slightly, but it shows a number of interesting behaviors consumers of this API should be aware of.

Perhaps the most important property on the DataFlowAnalysis object returned by AnalyzeDataFlow is Succeeded. This tells you if the data flow analysis completed successfully. In my experience the API has been pretty good at dealing with semantically invalid code. Neither invocations of missing methods nor use of undeclared variables seemed to trip it up. The documentation notes that if the analyzed region does not span a single expression or statement then analysis is likely to fail.

The DataFlowAnalysis object exposes a pretty rich API for users to consume. It exposes information about unsafe addresses, local variables captured by anonymous methods and much more.

DataFlowAnalysis.VariablesDeclared – The set of local variables that are declared within a region. Note the region must be bounded by a method’s body or a field’s initializer, so parameter symbols are never included in the result.

To refresh, the code we’ve analyzed is displayed below. The region we’ve declared interest in is the for-loop.
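A minimal sketch of that setup, assuming the Microsoft.CodeAnalysis.CSharp NuGet package is referenced. The sample source is a reconstruction shaped to match the results discussed in this post (a for-loop that declares index, declares innerArray in its body, and mutates an element of outerArray); it is not necessarily the exact Stack Overflow listing, and the names DataFlowDemo, Sample and Foo are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

public static class DataFlowDemo
{
    // Hypothetical sample, reconstructed from the description in this post.
    private const string Source = @"
public class Sample
{
    public void Foo()
    {
        int[] outerArray = new int[10];
        for (int index = 0; index < 10; index++)
        {
            int[] innerArray = new int[10];
            index = index + 2;
            outerArray[index - 1] = 5;
        }
    }
}";

    public static void Main()
    {
        SyntaxTree tree = CSharpSyntaxTree.ParseText(Source);

        // A compilation is needed to obtain a semantic model for the tree.
        var compilation = CSharpCompilation.Create(
            "DataFlowDemo",
            syntaxTrees: new[] { tree },
            references: new[] { MetadataReference.CreateFromFile(typeof(object).Assembly.Location) });
        SemanticModel model = compilation.GetSemanticModel(tree);

        // The region of interest is the single for-loop in the sample.
        ForStatementSyntax forStatement = tree.GetRoot()
            .DescendantNodes()
            .OfType<ForStatementSyntax>()
            .Single();

        DataFlowAnalysis result = model.AnalyzeDataFlow(forStatement);
        if (!result.Succeeded)
        {
            Console.WriteLine("Data flow analysis did not succeed.");
            return;
        }

        Print("AlwaysAssigned", result.AlwaysAssigned);
        Print("WrittenInside", result.WrittenInside);
        Print("WrittenOutside", result.WrittenOutside);
        Print("ReadInside", result.ReadInside);
        Print("VariablesDeclared", result.VariablesDeclared);
    }

    // Writes one line per result set, listing the symbol names.
    private static void Print(string label, IEnumerable<ISymbol> symbols)
    {
        Console.WriteLine(label + ": " + string.Join(", ", symbols.Select(s => s.Name)));
    }
}
```

Checking Succeeded before reading any of the result sets keeps the sketch honest about regions the analysis could not handle.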

The results from analysis are as follows:

AlwaysAssigned: index

index is always assigned to as it is contained within the initializer of the for-loop, which runs unconditionally.

WrittenInside: index, innerArray

Both index and innerArray are clearly written within the loop.

One important point is that outerArray is absent from this list. While we’re mutating the contents of the array, we’re not changing the reference stored in the outerArray variable, so it does not count as a write to outerArray.

WrittenOutside: outerArray, this

outerArray is clearly written to outside of the for-loop.

However, it surprised me that this showed up as a parameter symbol within the WrittenOutside list. It appears that this is treated as an implicit parameter of instance members, which means that it shows up here as well. This appears to be by design, although I suspect most consumers of this API will be surprised, and likely ignore this value.
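For consumers who do want to ignore it, a small sketch of a filter follows. The helper name WithoutThis is hypothetical; the only Roslyn API it relies on is IParameterSymbol.IsThis, which marks the implicit this parameter.

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.CodeAnalysis;

public static class DataFlowFilters
{
    // Hypothetical helper: drops the implicit 'this' parameter from a result set
    // such as DataFlowAnalysis.WrittenOutside, keeping every other symbol.
    public static IEnumerable<ISymbol> WithoutThis(IEnumerable<ISymbol> symbols)
    {
        return symbols.Where(s => !(s is IParameterSymbol p && p.IsThis));
    }
}
```

It could be called as DataFlowFilters.WithoutThis(result.WrittenOutside), where result is the DataFlowAnalysis for the region.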

ReadInside: index, outerArray

It is clear that the value of index is read within the loop.

It was surprising to me that outerArray is considered to be “read” inside the loop as we’re not reading its value directly. I suppose that technically we must first read the value of outerArray in order to calculate the offset and retrieve the correct address for the given element of the array. So we’re performing a sort of “implicit read” inside the loop here.

VariablesDeclared: index, innerArray

This is fairly straightforward. index is declared within the loop initializer and innerArray within the body of the for-loop.

Final Thoughts

The general weirdness of the data flow analysis API has long kept me from writing about it. The issues with this and with what’s considered a read vs. a write are pretty off-putting to me. I suspect these kinds of issues will prevent a lot of people from taking advantage of this API, but I could be wrong. It’s difficult to say this early in the game and I have not seen very much discussion about this API and the above problems.

Thoughts on this being written inside. This information could be used to determine if the current code is instance code or static code, thus determining if you can suggest making a method static if this is WrittenOutside but not ReadInside or WrittenInside.
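A rough sketch of that idea, under the assumption (not verified exhaustively here) that this appears in ReadInside or WrittenInside whenever the analyzed body touches instance state. The class and method names are hypothetical, and the check is deliberately conservative when analysis is not possible.

```csharp
using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp.Syntax;

public static class StaticCandidateCheck
{
    // Hypothetical helper: returns true when the method body appears to use 'this',
    // in which case suggesting 'static' would be wrong.
    public static bool UsesThis(SemanticModel model, MethodDeclarationSyntax method)
    {
        // No block body (abstract or expression-bodied): assume 'this' is used.
        if (method.Body == null)
        {
            return true;
        }

        DataFlowAnalysis flow = model.AnalyzeDataFlow(method.Body);
        if (!flow.Succeeded)
        {
            // Analysis failed, so make no claim that the method could be static.
            return true;
        }

        // Assumption: instance member access surfaces as the implicit 'this' parameter.
        return flow.ReadInside.Any(IsThisParameter) || flow.WrittenInside.Any(IsThisParameter);
    }

    private static bool IsThisParameter(ISymbol symbol)
    {
        return symbol is IParameterSymbol parameter && parameter.IsThis;
    }
}
```

A real analyzer would also need to rule out overrides, interface implementations and similar cases before offering the suggestion.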

From that post, which is really great, how can I make the difference between a variable not being used and a variable which has been assigned but is not being used? How can I check that a variable has been assigned? With readInside?

var model = syntaxNode.SemanticModel;
var result = model.AnalyzeDataFlow(methodBody);
The result breaks the code for some reason: System.ArgumentOutOfRangeException: Index was out of range. Must be non-negative and less than the size of the collection.
The thing I don’t get is that the test code works fine:

However I don’t see anything weird about outerArray being read in the loop – as you mentioned, it has to be implicitly read to access it. If, for example, you were to lazy-load a field, it certainly would be nice to be informed that your field is read and thus initialized.

“It was surprising to me that outerArray is considered to be “read” inside the loop as we’re not reading its value directly”
In that case it is only the compiler that is doing the out-of-bounds check whenever you try to access by index, but it is easy to imagine a situation where someone would modify some state in the indexed getter of some other variable.