How to improve build stability

Original Author: Andre-Dittrich

Build stability is always an important topic for us, but once a game has entered the production phase in earnest, the stability of the game and the tools becomes one of the more important concerns for the tech team. The reason is simple: the number of people relying on the build is highest at that point, and any time these people spend waiting for a bug fix or a missing tool potentially means a lot of money wasted. So keeping your build as stable as possible is important.

And now for the bad news: I do not have the “This Solves All Our Problems” recipe. I want to share some of the measures we have applied in our projects. If you have taken other measures to ensure build stability, please tell me. I am always interested in doing more.

Iteration time rules

Having a stable build is very important, yes, but you cannot ruin the iteration time for your team. There will always be that level designer who requests a small feature or a small change, or simply needs a critical bug fix really fast (usually yesterday) to finish the mission for the next milestone. You do not want him to wait a week for that change. With 10, 20 or even more engineers working on your code base at the same time, the chance is high that at any given moment at least one of them has added a bug that makes it impossible to release the next engine version to the team, at least if you do not take measures that help keep the build stable. The problem, of course, is that the measures you take cannot add so much overhead that they themselves become a reason for slow iteration times. So everything you do needs to strike a balance between overhead and improved build stability.

Automated build systems

CIS – continuous integration server: you need this! It is bad enough if “real” bugs trouble your build – it is far worse if simple bugs destroy it. Ever come into the office in the morning to find out that you cannot compile the game? A typo, a file that was not checked in, a bad merge? How many people lost how much time during this one morning? This is totally avoidable. The main function of our CIS is to continuously build the engine whenever somebody checks in a change. This makes sure that the engine and tools at least compile. Of course we also run a few easy and fast smoke tests that make sure you can at least start the engine.
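As a rough illustration of the "build, then smoke test" idea, here is a minimal Python sketch. It is not the CIS setup described above: the function names and the idea of passing each step as a command list are assumptions made for this example, and the actual build and engine start commands are left to the caller.

```python
import subprocess


def run_smoke_test(command, timeout_seconds=120):
    """Run one command and return (passed, detail).

    A step passes only if the process starts, finishes within the
    timeout and exits with code 0.
    """
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout_seconds}s"
    except OSError as error:
        return False, f"could not start: {error}"
    if result.returncode != 0:
        return False, f"exit code {result.returncode}"
    return True, "ok"


def run_all(steps, timeout_seconds=120):
    """Run (name, command) steps in order and stop at the first failure."""
    for name, command in steps:
        passed, detail = run_smoke_test(command, timeout_seconds)
        print(f"{name}: {detail}")
        if not passed:
            return False
    return True
```

A build server job would call run_all with its own list of steps (compile, start the engine, load a test level, quit) and mark the checkin as broken when it returns False. Stopping at the first failure keeps the feedback fast, which matters during the day.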

But you can do even more. During the day the focus is on getting the engine built as fast as possible and running the smoke tests. During the night we can do a lot more. We run automated tests to gather memory usage and performance statistics in test levels and game levels. These statistics are made available as graphs on an internal website. The graphs are an enormous help in recognizing and tracking down sudden jumps in either performance or memory, as well as gradual trends. Together with good checkin comments (see below), they let you catch a regression before it actually breaks the game, or at least recognize the problem quickly and efficiently (without TAs or programmers spending time finding out why MissionXY is not running any more).
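Spotting a sudden jump does not have to rely on somebody looking at the graphs. The Python sketch below flags nightly samples that deviate strongly from the recent baseline. The data layout (a list of label and value pairs), the window size and the 10% threshold are assumptions chosen for illustration, not values from our setup.

```python
from statistics import median


def find_jumps(samples, window=5, threshold=0.10):
    """Return (label, value, baseline) for every sample that deviates from
    the median of the preceding `window` samples by more than `threshold`
    (a fraction, so 0.10 means 10%).

    `samples` is a list of (label, value) pairs in chronological order,
    for example (changelist, peak memory in MB).
    """
    jumps = []
    for index in range(window, len(samples)):
        label, value = samples[index]
        baseline = median(v for _, v in samples[index - window:index])
        if baseline > 0 and abs(value - baseline) / baseline > threshold:
            jumps.append((label, value, baseline))
    return jumps
```

Using the median of the previous nights keeps a single noisy run from moving the baseline. Note that after a real jump the following nights are flagged too, until the window has caught up, and that slow gradual growth stays below the threshold, so this complements the graphs rather than replacing them.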

When I talk about automated tests I guess I have to talk about unit tests as well. I have some experience with them, though I have to admit that most of it is about how not to do it. We integrated a unit testing framework into the Unreal Engine at the UnrealScript and Kismet level pretty early in the production process. We started using it mostly for the AI code, as this was mainly written by us and did not rely too much on middleware code (except pathfinding). The main mistake we made was that we ended up actually writing integration tests, and maintaining those takes a lot of time. For some time we even made it part of the process to have “unit tests” for every feature we did. At some point we were spending more and more time fixing tests that failed because of changes in other systems and not because of bugs in the tested code, so we stopped doing it. For the next projects I want to write actual unit tests for critical parts of our code. Integration tests should be used for finished features that are not very likely to change a lot, and I guess that means you have to keep them for a later time in the production. If you have successfully applied either, I would like to hear about your experiences.
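To make the difference concrete, here is what I mean by an actual unit test, sketched in Python rather than UnrealScript. The function nearest_cover and its behaviour are invented for this example. The point is that the test feeds plain data into one small function and needs no level, no pathfinding and no other game system, so a change elsewhere cannot break it.

```python
import math
import unittest


def nearest_cover(position, cover_points, max_distance):
    """Return the closest cover point within max_distance, or None."""
    best, best_distance = None, max_distance
    for point in cover_points:
        distance = math.dist(position, point)
        if distance <= best_distance:
            best, best_distance = point, distance
    return best


class NearestCoverTest(unittest.TestCase):
    def test_picks_closest_point(self):
        self.assertEqual(nearest_cover((0, 0), [(5, 0), (2, 0)], 10), (2, 0))

    def test_ignores_points_out_of_range(self):
        self.assertIsNone(nearest_cover((0, 0), [(50, 0)], 10))

    def test_handles_level_without_cover(self):
        self.assertIsNone(nearest_cover((0, 0), [], 10))
```

An integration test for the same feature would load a level, spawn an AI and check that it ends up behind cover. That test depends on level loading, navigation and animation, which is exactly why ours kept failing for reasons unrelated to the code under test.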

Peer reviews

This is one of the best tools in our belt to improve build stability. Not only does it give you a substantial improvement in build stability, it also fosters communication within the team and distributes knowledge (win – win – win).

The idea is pretty simple: whenever someone wants to check in a change, he needs to get this change reviewed by one of his colleagues. Of course this will only work if it is taken seriously. The goal of a review should be that the reviewer has a good understanding of what the change is and how and why it was done. There are no dumb questions during a review. If you do not understand something while you do a review, ask. This goes especially for seniors or leads, who sometimes might feel they should not ask dumb questions. If you think you need another person's opinion, get it. You may and should criticize style and details. Ask for additional or improved comments if you think they might help. This is not only about making sure the change works; it is about sharing ideas and knowledge as well.

So what do you get in the end? Reviews will easily spot obvious issues or problems with the idea of how to solve the issue at hand. They will rarely spot really intricate bugs or side effects. Even so, they remove quite a number of bugs that would otherwise have been found later by the automated systems, by QA or, even worse, by somebody trying to use a broken tool. You also get people learning from each other and looking into parts of the system they would not usually see. At least two people know each change in detail, so people getting sick or leaving the company becomes less of an issue. You get a culture of talking about your work and of making sure work is actually done before the checkin (it is pretty embarrassing if your reviewer discovers obvious flaws in a piece of code that you considered worth checking in). People in your team talk, they develop a common language, and they understand the weaknesses and strengths of the team members.

A few things to keep in mind to make peer reviews work:

– it costs time – make sure everybody knows that this is time well spent and factor it into your estimates

– every checkin is reviewed – a lot of mistakes are made with “easy” or “small” checkins

– people should be available for a review – nothing is as annoying as not being able to check in just because nobody has time, so you should have a damn good reason to refuse a review

– add the information about who did the review to the checkin comment – reviews will be taken a lot more seriously that way, and if you hunt a bug caused by a checkin you know the two people you should talk to

Checkin comments

It might not be very clear initially how checkin comments can improve build stability, because once the bug is checked in, it is in. But good checkin comments make it a lot easier to track down an issue, and applying a structured format makes it even easier. Just imagine yourself sitting in front of the screen, scanning through a list of 100 checkin comments to find out which change could be causing your AI to get stuck while trying to vault over a cover. The easier the information is to read and the better it is, the faster you will be. We fixed quite a lot of our “hard” bugs that way.
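To show what a structured format buys you, here is a small Python sketch that parses checkin comments and filters them by area. The format itself (an area tag in square brackets followed by a summary, then a "Review:" line naming the reviewer) is a made-up example and not the exact format we use. Adapt the pattern to whatever convention your team agrees on.

```python
import re

# Hypothetical first line: "[AI] Fix vault getting stuck on low cover"
HEADER = re.compile(r"^\[(?P<area>[A-Za-z]+)\]\s+(?P<summary>.+)$")


def parse_comment(text):
    """Parse one checkin comment into a dict, or return None if the
    first line does not follow the "[AREA] summary" convention."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines:
        return None
    match = HEADER.match(lines[0])
    if not match:
        return None
    fields = {
        "area": match["area"].upper(),
        "summary": match["summary"],
        "reviewer": None,
    }
    for line in lines[1:]:
        key, separator, value = line.partition(":")
        if separator and key.strip().lower() == "review":
            fields["reviewer"] = value.strip()
    return fields


def filter_by_area(comments, area):
    """Return the parsed comments that belong to the given area."""
    parsed = (parse_comment(text) for text in comments)
    return [entry for entry in parsed if entry and entry["area"] == area.upper()]
```

With something like this, the list of 100 comments shrinks to the handful tagged for the AI, each with the name of the reviewer attached. The same parser could also run as a check at submit time and reject comments that return None or lack a reviewer.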

But checkin comments (if they are well done) have even more uses. Your source control system (we use Perforce) can send you an automated mail for every checkin in the areas you are interested in, so you stay up to date with what is checked in by whom. This is not only a useful tool for a lead; it is also interesting for other programmers, QA or producers to know what is actually checked in.

Testbuilds

This is something that is not easy to do, and it requires substantial in-house QA resources and some additional tool support. The basic idea is again pretty simple: before you check in a change that you are not so sure about, test it. I guess everybody knows the bad feeling of changing something in a very old part of the code that also touches a lot of other code (maybe the guy who originally wrote it is not even there any more, or you have to change code in your middleware). You are just not sure about the side effects, and yes, there is no automated testing around that part of the code. Basically the only way to find out what your change does besides what you intend it to do is to test it.

The best people you have for testing are QA people (some of our QA guys find the strangest bugs and, more importantly, reliable repros for really hard ones, which is amazing). So the idea is to create a local build of the game (or a representative part of it) and send that to the QA team to test your change. While you are waiting you can shelve your change and continue with something else.

To make this a viable option you need really great tools that make the whole process as easy as possible. We are using the Unreal Engine with its build tools. It is easy to create a local build of the game for any platform using the Unreal Frontend. This tool is used to cook the game for the platform you need it for. From this tool we can push a build on to a central server (the prop server). QA can get this build through a simple web frontend and have it copied to their PC or Xbox. Yes, you could cook a build, copy it into some network folder and write a mail to QA saying where to find it. But the easier the whole process is, the more likely it is that people actually use it and do not find excuses to skip it or get frustrated because they have to do it. We also established a fairly strict workflow around it to make the whole process even smoother.

Even applying all of the things I explain above perfectly will not give you zero bugs, but it will allow you to spend your time on the important and interesting bugs and, what is even more important, on adding cool shit to your game.

To not lengthen this lengthy blog entry further, I kept the individual parts pretty short. If you are interested in the details of exactly how we do certain things, let me know. I could make one of my next blog entries cover this in more detail.

Things to look at next

Static code analysis – I saw a pretty interesting talk about this at GameFest 2011 in London, and after that I hoped to try it on our project pretty soon. After John Carmack's keynote at QuakeCon this has just become a lot more interesting. If you are interested, watch the second half of his 90-minute talk; it should be easy to find on YouTube.