Finding Bad Build Culprits…Who Broke the Build!

I found an interesting Google Talk on finding culprits automatically in failing builds – https://www.youtube.com/watch?v=SZLuBYlq3OM. This is actually a lightening talk at GTAC 2013 given by grad students Celal Ziftci and Vivek Ramavajjala. First they gave an overview of how culprit analysis is done on build failures triggered by small test and medium sized tests.

CL or change list is a term I first heard in “How Google Tests Software” and refers to a logical grouping o changes committed to the source tree. This would be like a git feature branch.

Build and Small Tests Failures

When the build fails because of a build issue we build the CLs separately until a CL fails the build. When the failure is a small test (unit test) we do the same thing. Build CLs separately and run the tests against them to find the culprit. In both cases, we can do the analysis in parallel to speed it up. This is what I covered in my post on Bisecting Our Code Quality Pipeline where git bisect is used to recurse the CLs.

Medium Tests

Ziftci and Ramavajjala define these tests as taking less than 8 minutes to run and suggest using a binary search to find the culprit. Target the middle CL, build it and if it fails, the culprit is most likely to the left, so we recurse to the left until we find the culprit. If it passes, we recurse to the right.

CL 1 – CL 2 – CL 3 – CL 4 – CL 5 – CL6

CL 1 is the last known passing CL. CL 6 was the last CL in the failing build. We start by analyzing CL 4 and if fails, then we move left and check CL 3. If CL 3 passes, we mark CL 4 as the culprit. If CL 3 fails we mark CL 2 as the culprit because we know that CL 1 was good and don’t need to continue analyzing.

If CL 4 passed, we would move right and test CL 5 and if it fails, mark CL 5 as the culprit. If it passes, then we mark CL 6 as the culprit because it is the last suspect and we don’t have to waste resources analyzing it.

Large Tests

They defined these tests as taking longer than 8 minutes to run. This was the primary focus of Ramavajjala and Ziftci’s research. They are focusing on developing heuristics that will let a tool identify culprits by pattern matching. They explained how they have a heuristic that will analyze a CL for number of files changed and give a higher ranking to CLs with more files changed.

They also have a heuristic that calculates the distance of code in the CL from base libraries, like the distance from the core Python library for example. The closer it is to the core the more likely that it is a core piece of code that has had more rigorous evaluation because there may be many projects depending on it.

They seemed to be investing a lot of time into insuring that they can do this fast. They stress caching and optimizing how they do this. It sounds interesting and once they have had a chance to run their tool and heuristics against the massive amount of tests at Google (they both became employees of Google) hopefully they can share the heuristics that prove to be most adept at finding culprits at Google and maybe anywhere.

Thoughts

They did mention possibly using a heuristic that looks at the logs generated by build failures to identify keywords that may provide more detail on who the culprit maybe. I had a similar thought after I wrote the git bisect post.

Many times when a test fails in larger tests there are clues left behind that we would normally manually inspect to find the culprit. If the test has good messaging on their assertions, that is the first place to look. In a large end to end test there may be many places for the test to fail, but if the failure message gives a clue of what fails it helps to find the culprit. Although, they spoke of 2 hr tests and I have never seen one test that takes 2 hours so what I was thinking about and what they are dealing with may be another animal.

There is also the test itself. If the test covers a feature and I know that only one file in one CL is included in the dependencies involved in the feature test, then I have a candidate. There is also application logs and system logs. The goal as I saw it was to find a trail that led me back to a class, method, or file that matches a CL.

The problem with me trying to seriously try and solve this is I don’t have a PhD in Computer Science, actually I don’t have a degree except from the school of hard knocks. When they talked about the binary search for medium sized tests it sounded great. I kind of know what a binary search is. I have read about it and remember writing a simple one years ago, but if you ask me to articulate the benefits of using quad tree instead of binary search or to write a particular one on the spot, I will fumble. So, trying to find an automated way to analyze logs in a thorough, fast and resource friendly manner is a lot for my brain to handle. Yet, I haven’t let my shortcomings stop me yet, so I will continue to ponder the question.

We are talking about parsing and matching strings, not rocket science. This maybe a chance for me to use or learn a new language more adept at working with strings than C#.

Conclusion

At any rate I find this stuff fascinating and useful in my new role. Hopefully, I can find more on this subject.