Erik Doernenburg on Software Visualization

Bio: The first piece of software Erik Doernenburg created was a little game that he wrote in BASIC on the Commodore VIC-20 in 1983. He has worked in the field of enterprise applications for the past 10 years and is now employed by ThoughtWorks. His latest passion is software visualization strategies at a medium level of abstraction.

I work as a consultant for ThoughtWorks, and I generally specialize in the complicated technical problems that we are facing. With my fairly broad knowledge of different technologies, I can help make choices to pick the right technology for a given project to make it successful. But one thing I would stress is that I am not running away after a choice has been made; the key is really to stay around, actually implement it, and create a successful project based on those technologies.

That is something I have become interested in over the past two years, when I started realizing that a lot of the problems we are dealing with are becoming very large. Large organizations especially have huge portfolios of software, and it is very difficult to get an understanding of what the systems are and how they work together. You generally only get the 30,000-foot view, where every system is a small little box, or you get the ground-level view, where you are looking at actual source code; but that doesn't help the developers, it doesn't help the management, and it doesn't help the enterprise architects. So we are looking for a way to find some middle ground, to get a 100-foot view, to get a better understanding of software systems.

I think you can visualize almost anything. It really depends on the question you are asking. You have to have a hunch about what you are looking for; you might say: "It takes us a very long time to refactor things, why is that the case?", or you might say: "We are running some metrics over our software, and we generally accept that our classes are too big, but where are the big classes? How can I find them?" You then create diagrams, but the interesting thing is that what you want to visualize is generally a metric, so the process often starts with finding metrics in your software and then creating diagrams from them.

That is true, and sometimes I don't even like metrics. That is the whole thing about people saying you have to have a certain amount of test coverage; people come out and say 100% or 80% test coverage, but that doesn't mean very much in its own right, and it can actually be misleading. If, for example, you give somebody just the blind metric and say: "Your method can't be longer than 30 lines", and the method is 50 lines, people can split it in the middle, call a second method, and they fulfill the metric. So it is more about understanding the values and having the metrics guide you; and again, if you are trying to use metrics to beat somebody up, that will probably not work very well. The idea really is that metrics are good at spotting two things: trends and outliers. In a code base, take test coverage, for example.

On average, across packages, you have 85% test coverage, but in that one package it's 50%; that outlier is an important message to me. Another thing I find very interesting is visualization over time: in the simplest form, you could plot test coverage over time, and people don't care whether the value is above 80%; what they care about is that it stays level or goes up, that it doesn't go down significantly. So metrics, if you take them as numbers and try to make too much of a single number, won't work; but as a starting point for making graphs or visual presentations, they are essential, because you don't want to draw a pretty picture of a loop; that doesn't work. You need something else, and if you want that 100-foot abstraction you need to aggregate away detail. If you visualize the source code directly, nothing will happen; you need to find an aggregation, and that is what you visualize.

One thing that we were noticing on one project was that we felt we were spending too much time writing infrastructure code as opposed to business code. People were saying: "Yeah, but we have all this business code", and you pointed at it and there were hundreds of lines of business code; but then you saw other packages (it was a Java project) with a lot of other code, and we couldn't really figure out whether that was the right ratio and what it looked like. What we did, on a Unix machine, was write a very simple shell script that gave us the length of the files, grouped by the package directories those files were in, together with a freely available open source tool called InfoVis that allows you to draw nested structures as rectangles. So if you imagine a Java project, you would have your com or org, depending on what your top-level structure is, as the all-enclosing rectangle; inside, in a nested way, you have little rectangles that represent the nested package structure, and the size of the rectangles represents the amount of code that is in them. It quickly gave you an understanding: "OK, look at this: half of the business code actually is in data holder classes." Maybe that is not such a good approach: if all these classes are doing is holding data, and they take up half the code, maybe we should have a different approach; maybe in that case hash maps would be a better choice. So that was an interesting visualization.
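The data-gathering half of this approach is easy to reproduce. A minimal sketch in Python (the original was a Unix shell script; the directory layout and package names here are invented for illustration) that sums lines of code per package directory, producing the numbers a treemap tool like InfoVis could then draw as nested rectangles:

```python
import os
from collections import defaultdict

def package_sizes(src_root):
    """Sum lines of code per package directory, like the shell script
    described above: file lengths grouped by package structure."""
    sizes = defaultdict(int)
    for dirpath, _dirs, files in os.walk(src_root):
        for name in files:
            if not name.endswith(".java"):
                continue
            # Turn the relative directory path into a package name.
            package = os.path.relpath(dirpath, src_root).replace(os.sep, ".")
            with open(os.path.join(dirpath, name), encoding="utf-8",
                      errors="replace") as f:
                sizes[package] += sum(1 for _ in f)
    return dict(sizes)
```

The resulting package-to-size mapping is exactly the kind of aggregated data that can be fed into a treemap: one rectangle per package, area proportional to the line count.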

Common metrics depend on the audience. For developers, good metrics are the simple metrics that Checkstyle offers out of the box; and I am not talking about where your curly braces should be, I couldn't care less, and you can automatically reformat things anyway. Good metrics, for example, are length of classes, very basic, and length of methods, and you can come up with your own compromises; but again, don't fail the build over this. Just have them running in Eclipse, for example, as a constant reminder that something is too long. More advanced metrics that are really interesting are NPath complexity, which gives you a rough estimate of how many unit tests you would theoretically have to write to fully test a method. Think of test coverage, for example: there is test coverage that covers the lines of code, but if you have two "if" statements and you cover every line, you have 100% line coverage on that method, yet you haven't covered 100% of the paths, because both "if" statements, if they are independent, can each be either true or false, which means there are four paths through your method; and if you look at that on a larger scale, you quickly get huge numbers. So a good metric to look at is NPath complexity, to see how testable your code is. Two more that are interesting are class fan-out complexity, how many other classes you actually depend on, and class data abstraction coupling, which means how many other classes you instantiate. That is a big difference. It is OK to deal with other classes, and that first metric could be higher, maybe somewhere around 25 or 30 classes, especially if you are doing UI work; but if you start instantiating 20 classes, something is wrong: that class has way too much responsibility, and the concerns are not properly separated. This is actually a metric you can use even without visualizing it.
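The point about line coverage versus path coverage can be made concrete. In this illustrative sketch (not from the interview), a function with two independent "if" statements has 2 × 2 = 4 execution paths, even though a single test call would already touch every line:

```python
from itertools import product

def classify(a, b):
    # Two independent "if" statements: one call with a=b=True hits
    # every line (100% line coverage), but there are 2 * 2 = 4 paths.
    result = []
    if a:
        result.append("a")
    if b:
        result.append("b")
    return result

# Enumerate every combination of branch outcomes to count the paths.
paths = {tuple(classify(a, b)) for a, b in product([True, False], repeat=2)}
assert len(paths) == 4  # matches the NPath complexity of two independent ifs
```

With ten independent branches the same arithmetic gives 2^10 = 1024 paths, which is why NPath numbers grow so quickly on larger methods.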

It can be very important, and there are a couple of really good tools that do that. Often what is more important is actually dependencies between larger artifacts, something like jars. The problem is, of course, that in Java there is nothing in the source code that tells you ultimately in which jar a certain class will end up. You would hope that there is a correlation between the package structure and the actual jar file, but that is not always the case. You can obviously have classes from two different packages in the same jar file, but you can also have classes from the same package in multiple jar files. And the trick here is: if you have questions like this, don't give up. There is no big vendor at the moment, that I am aware of, that sells you the tool that you actually need, because the question can be so specific to your enterprise. What is really important is to have a little tool chest of small tools that can help you draw these diagrams; then you identify the missing pieces, and those are the things you have to write yourself.

If we go back to the last example, with the jar files, there is a very good tool called dot; it's free software, and it is so old that it predates XML, luckily in this case, because the input file format is so simple. It specializes in drawing trees and graphs and has a layout engine built in. You can just say "A -> B", and because that looks like a little arrow, it means A depends on B; you can create a whole file of these, specify node attributes if you want to, and then dot not only draws a pretty picture but has a really good layout algorithm built in that understands tiers and layering and will try to lay out your graph accordingly. Coming back to the jar example, that would of course be the tool of choice to draw the dependency graph once you have it. How do you get it? That is the part that you would probably have to write yourself, and that is where you have to be clever. At one client, what we did is write a tool that decorated the Java compiler task in Ant, because that is the place where you know what your dependencies are: the Java compiler knows which classes it is compiling, and it knows what's on the classpath, the jar files that you depend on. So we wrote a little decorator that, whenever the Java compiler was invoked, wrote "A depends on B" in the syntax I described into a text file and then called javac. At the end, one other tool swept through all the directories, collected those text files, fed them into dot, and created the visualization. Because it is all command-line tools, it was very easy to make part of the CruiseControl build, and at the end of every build we had a picture showing exactly what the dependencies were.
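The dot input format really is that simple. A hedged sketch of the final step: turning collected dependency pairs into a dot file (the jar names are invented; in the project described above, the pairs came from the decorated javac Ant task):

```python
def to_dot(dependencies, graph_name="jars"):
    """Render (from, to) dependency pairs in dot's input syntax,
    the "A -> B" notation described above."""
    lines = ["digraph %s {" % graph_name]
    for src, dst in sorted(set(dependencies)):  # de-duplicate repeated edges
        lines.append('    "%s" -> "%s";' % (src, dst))
    lines.append("}")
    return "\n".join(lines)

# Illustrative pairs; a real build would sweep them up from text files.
deps = [("web.jar", "service.jar"), ("service.jar", "domain.jar")]
print(to_dot(deps))
```

Piping the output through `dot -Tpng` would then produce the dependency picture at the end of every build.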

The general way of doing that is, surprisingly, either to use a tool called ... or, more commonly, Excel. There are a couple of libraries that allow you to create Excel files very easily; in Java especially, a library called POI allows you to create Excel spreadsheets, or conversely you can create a spreadsheet and load data into it. What we did in one place, where we wanted to visualize those metrics, was to have a little script that would run Checkstyle. Checkstyle normally is there to warn you about violations, but you can misuse it, and again that is a trick you can employ: rather than saying "Warn me when my method gets longer than 10 lines", you say "Warn me when it gets longer than 1 line". It warns you about everything, which means it collects all the metrics for you. Checkstyle creates an XML file, and what we did in that case was a bit of VBA that runs inside Excel, that can parse that XML file and dump it into an Excel table; we used a little pivot table to do the aggregation and then drew charts. The result, and that was another interesting case actually, was a bar chart with the classes across the x-axis and what we called a toxicity score, how toxic each class was, on the y-axis. We ordered it so the most toxic class would be on the left-hand side and the least toxic class on the right-hand side, and you get a distribution that slopes down like this. When we showed that to people, they said: "That is interesting, but it is totally meaningless to me. What does that mean?" There are so many classes on it, they have an average score, and we agreed: "That doesn't tell you anything."
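The aggregation step done here with VBA and a pivot table takes only a few lines in script form. A simplified sketch, assuming a Checkstyle-style XML report and the crudest possible scoring (every warning counts as one point; the real toxicity score weighted different metrics, and the file names below are made up):

```python
import xml.etree.ElementTree as ET
from collections import Counter

def toxicity_scores(checkstyle_xml):
    """Aggregate Checkstyle warnings per file into a single score,
    most toxic first, as on the bar chart's left-hand side."""
    root = ET.fromstring(checkstyle_xml)
    scores = Counter()
    for file_el in root.iter("file"):
        scores[file_el.get("name")] += len(file_el.findall("error"))
    return scores.most_common()

# A tiny hand-written report in Checkstyle's output shape.
report = """<checkstyle>
  <file name="Order.java">
    <error line="1" severity="warning" message="Method too long"/>
    <error line="9" severity="warning" message="NPath complexity"/>
  </file>
  <file name="Money.java">
    <error line="3" severity="warning" message="Method too long"/>
  </file>
</checkstyle>"""

print(toxicity_scores(report))  # [('Order.java', 2), ('Money.java', 1)]
```

From there, the ordered list drops straight into a bar chart, whether in Excel or any plotting tool.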

So another important aspect here is to get comparative data. What we did is collect exactly the same metrics for a number of widely used open source frameworks, like Spring and Hibernate, and compare them; that gave us some insight into the quality of our code base, and into whether something we were concerned about was actually really dangerous, or whether something with a high score was in fact OK. And of course, taking the agile principles to mind, that gives you a feedback loop, so you could say: "I think we can tune that metric; we were probably too strict on that one", or "the aggregated score over-emphasizes certain design qualities that were subjective opinions and weren't really related to actually delivering software faster."

The general idea is a 3-step process. First, you need to find a way to get data in some raw format that is easy to get at. It could be static analysis, something like Checkstyle does, which dumps an XML file; you just accept whatever format is natural for the tool. You could write your own instrumentation; if you are looking at dynamic analysis, you could have something that listens on a message bus and dumps message logs. You use tools that come from the vendor, as in the case of messaging, or some generic tool that you find: Checkstyle, PMD or, depending on what you are doing, some other tool that will collect metrics for you. The middle step is the hard one, because you now need to convert the input file format you have into a format that is suitable to be visualized; that is generally what you have to write yourself, and you can use all sorts of technologies for it.

As I said before, it can be done in VBA in the case of Excel; we have written Ruby scripts for that. Ruby is very good at this, because it is good at analyzing XML files, maybe even using regexes over them, and really good at getting something out fast and making it work. It could be XSLT; I have written a little XSLT style sheet that takes a Spring context file and creates an input file for the dot tool I described before. The XSLT sheet is about 100 lines, so again it is a very simple thing that converts the input format describing your system into the output format suitable for the visualization tool. Visualizations themselves are harder to write, and you generally fall back on existing tools. There are three that I particularly like. One is dot. Another is called InfoVis, which gives you a couple of interesting visualizations, like the tree structures I mentioned before, histograms over time and a couple of other things. The third one is somewhat commercial, with an interesting license: you can use the standalone tool for free, but if you want to embed it as a library for scripted use you have to pay. That tool is called yEd, and it has a large range of visualizations, some very interesting ones, such as radial plots: a tree doesn't have to be the typical tree you know, with the root node at the top and branches going down; a tree can have the root in the center and sprawl outwards.
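The Spring-context-to-dot transformation can be sketched outside XSLT as well. A minimal version, assuming a simplified, namespace-free context file in which bean dependencies appear as property ref attributes (bean names and classes are invented for illustration):

```python
import xml.etree.ElementTree as ET

def spring_to_dot(context_xml):
    """Turn bean wiring in a simplified Spring context into dot edges,
    analogous to the ~100-line XSLT style sheet mentioned above."""
    edges = []
    root = ET.fromstring(context_xml)
    for bean in root.iter("bean"):
        for prop in bean.iter("property"):
            ref = prop.get("ref")
            if ref:  # one edge per bean-to-bean reference
                edges.append('    "%s" -> "%s";' % (bean.get("id"), ref))
    return "digraph beans {\n" + "\n".join(edges) + "\n}"

context = """<beans>
  <bean id="orderService" class="com.example.OrderService">
    <property name="repository" ref="orderRepository"/>
  </bean>
  <bean id="orderRepository" class="com.example.JdbcOrderRepository"/>
</beans>"""

print(spring_to_dot(context))
```

A real Spring file uses XML namespaces and more wiring styles (constructor args, autowiring), so a production version would need to handle those; the shape of the transformation stays the same.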

Again, if you have good layout algorithms you can often see more, and that tool encourages you, once you have an input file, to actually explore your data set. We have used it for completely different purposes than code visualization; it can even visualize concepts in a business domain. But the key is: pick a tool that easily gets you the data, and you can often find something; if you don't, it is often quite easy, like the javac decorator I described, to write a tool that creates that file format. Pick a graphics tool, and then write a little bit of transformation in between that does the aggregation. So those are the general tools.

One thing that those tools lack is interactivity. Sometimes you have such a large amount of data that people need to drill down, and you can't do that with what I have described so far, because all it does is generate one picture. Think about the dependency example: if you have 100 jar files that live in 23 EAR files - that is not a number that I made up, that is something I saw on a client project - and you visualize all that on one screen, you just get spaghetti. So what you want is the ability to drill down, and in general you would need specialized software tools that allow you to do that. There are tricks to get around it: you could create Scalable Vector Graphics, which allow you to click on things and drill in; SVG also has hover, which is quite good, so rather than writing the name into every box you just draw the boxes, maybe with a color indicating certain metrics about each box, but don't put the title in. That allows somebody to scan the diagram, find the outliers, and, if they want to know what something is, hover over it. So you can get a small degree of interactivity even from the basic tools I have been describing; but if you have a really recurring problem, then it might be the point where you have to write a piece of software that lets you really drill down, go into detail, flick back and forth, maybe select things and so on.
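The hover trick is easy to do with hand-written SVG: give each rectangle a title child element, which browsers show as a tooltip, and leave the visible label out. A small illustrative generator (box names, positions and colors are made up):

```python
def boxes_to_svg(boxes):
    """Emit a minimal SVG where each box is an unlabeled rectangle
    whose name appears only as a hover tooltip (the <title> element).
    `boxes` maps a name to an (x, y, size, color) tuple, where color
    could encode a metric such as the toxicity score."""
    parts = ['<svg xmlns="http://www.w3.org/2000/svg" width="400" height="200">']
    for name, (x, y, size, color) in boxes.items():
        parts.append(
            '  <rect x="%d" y="%d" width="%d" height="%d" fill="%s">'
            '<title>%s</title></rect>' % (x, y, size, size, color, name)
        )
    parts.append('</svg>')
    return "\n".join(parts)

svg = boxes_to_svg({"BigService": (10, 10, 80, "darkred"),
                    "SmallHelper": (110, 10, 20, "lightgrey")})
```

Opened in any browser, the dark oversized box stands out immediately, and hovering over it reveals which class it is.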

There are a couple of metrics at run time, and again an interesting idea is dependencies. If you have a message-based system - and Gregor Hohpe, somebody I am doing a talk with on this topic, has spent a lot of time investigating and looking at message-based systems - you can try to visualize the dependencies there. Inside such systems there is such a number of messages flowing back and forth that it is often quite hard to understand what is going on. So here, rather than looking at the configuration files that would potentially describe where the messages are going, you just listen in. If you have a publish-subscribe system, you can simply subscribe to all the messages, dump them into a file format, and then plot dynamically what is going on. If you have a point-to-point system, you will probably need a different way; you probably need to get into the software at some point and put a little tee in there, so you can splice the message and send it to its normal destination as well as to a control bus that you then monitor. That is a very interesting topic; those visualizations get more into diagnosis. Of course, for running systems everybody uses things like profilers: you look at the memory distribution, you see how your heap changes, and for web-based applications you gauge how many threads you should run concurrently. These have visualizations, but I wouldn't really count them into the metrics and visualization field.
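The "little tee" for a point-to-point channel is essentially the wire-tap pattern. A toy sketch (all names are illustrative; a real system would splice into the actual messaging API): every message still reaches its normal destination, and a copy lands on a control log for later plotting:

```python
class WireTap:
    """A tee spliced into a point-to-point channel: messages still
    reach their normal destination, and a copy goes to a control
    channel that a monitor can read to plot the message flow."""

    def __init__(self, destination, control_log):
        self.destination = destination  # the original send function
        self.control_log = control_log  # where copies accumulate

    def send(self, message):
        self.control_log.append(message)  # copy for the monitor
        self.destination(message)         # normal delivery, unchanged

# Stand-ins for a real channel and a real control bus.
received, log = [], []
tap = WireTap(received.append, log)
tap.send({"from": "billing", "to": "shipping", "type": "OrderPlaced"})
```

The collected log then becomes the raw input for the same 3-step pipeline: aggregate the sender/receiver pairs and feed them to a graph-drawing tool.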

I think they are very important for project managers, but before we touch on that: it is interesting for developers too. Developers love this stuff; it gives them a better understanding of the software they are writing, it helps them write better software, and it draws their attention to the points where they might want to make improvements, things they weren't aware of. Since you asked about project managers: they generally like it a lot, because it gives them a better understanding, and that comes back to what I said early on about the 30,000-foot view.

A lot of project managers in the fields I work in actually have a technical background, but they are so far removed now that if you drop them at ground zero, they just see code and programming languages full of concepts they have never learned. They probably learned programming, or used programming, when they were writing C; if you show them aspect-oriented code or closures in Ruby, that is very confusing, and it is not the level of information they really want to see anyway. So what the visualization does is help aggregate information. Like in the initial example I gave, it shows them there is so much code in the infrastructure packages and so much code in the business packages, which gives them something to think about; or it shows them the layering, or the way we have structured access to the database, and that is important information they can use to inform decisions on where to spend time with the team.

It often happens that technical people say: "We want to do some refactoring", and of course a lot of people respond: "No, it just works. Why would you want to do a large-scale refactoring?" This again is something that helps facilitate a dialogue between the developers and their project managers. Developers can say: "Look, we have developed the system, it has grown over time, and we are telling you we want to refactor this area. Look at this: everything is nice and grey and about this size, but that thing is this big black box in the corner." That really helps to convince a project manager: "OK, I see where you are coming from now, and I can see that investing time in splitting that up will give me a return on investment by making my software more modifiable in the future." So it's a communication tool.

I guess my two favorite computer books at the moment are, first, "Domain-Driven Design", because I see visualization as something that helps communication, and Domain-Driven Design is something that really helps communication between the developers and the business people; it is a really useful technique that I have used on every single project since I read that book. The other book I really like at the moment is a book on object-oriented metrics and software visualization, written by somebody in academia who is plowing this field; I think there is a lot we can learn from those people, and that is something I am really interested in at the moment.

....these have visualizations but I wouldn't really count that into metrics and visualizations field.

I would argue that the visualizations in some performance management tools, such as JXInsight, exceed those of most other project- or development-centric tools on the market, where the norm is a dashboard with various metrics presented in green and red circles. Some tools related to structural analysis do a pretty good job at displaying a large number of inter-dependencies, but timeline analysis is a much harder area to crack, especially when trying to visualize millions of events across a cluster of servers, across processes and threads.

You are talking extensively about dependencies and visualisation. I think you have some catching up to do. Many of the problems you describe were already solved years ago when the DSM methodology was introduced, first in the mechanical engineering domain. Since Lattix introduced it in 2004, an excellent visualisation has been available in the software industry.

There are more tools starting to use DSM, but Lattix is the most advanced. It even allows you to change the code structure before you change the code. And it does not only work for Java (it also covers Spring, Hibernate and Oracle).

For more information see www.lattix.com. Just look at the demo, and I am sure you will find what you are looking for.

I thought the important field of software visualization should be visual modeling, not reporting on code. But this idea is fun: how about storing our code in a database (instead of in plain text files) and using SQL to get some reports?

I am aware of tools such as Lattix LDM and Structure101 by Headway Software. In my opinion DSM tools are useful and solve a certain class of problems well, but that doesn't mean we can now sit back and declare dependency management solved.