Rapid-fire debugging thoughts

Just a collection of assorted things that have been running through my mind during the past week and a half of marathon debugging...

Unless you have a really darn good reason, don't use different coordinate systems for different parts of your game, and especially not if you're using floating-point representations. Conversion back and forth between coordinate spaces will inevitably introduce floating-point drift, and if your data flow involves conversions, you will start to see discrepancies between systems that use the different coordinate spaces. This can be a nightmare to figure out.
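To make the drift concrete, here is a minimal sketch of one round trip between two spaces. The names `worldToLocal`, `localToWorld`, and `roundTrip` are hypothetical, as is the chunk origin used below, and the sketch assumes IEEE-754 single-precision floats evaluated at single precision.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical conversion pair: "local" space is world space shifted by a
// chunk origin. On paper the two functions are exact inverses.
float worldToLocal(float world, float chunkOrigin) { return world - chunkOrigin; }
float localToWorld(float local, float chunkOrigin) { return local + chunkOrigin; }

// Converts a world coordinate to local space and back again, returning the
// value another system would see after the round trip.
float roundTrip(float world, float chunkOrigin) {
    assert(std::isfinite(chunkOrigin));
    float local = worldToLocal(world, chunkOrigin);  // rounded to the nearest float
    return localToWorld(local, chunkOrigin);         // rounded again
}
```

With a chunk origin around one million, a float can only resolve steps of 1/16, so `roundTrip(0.1f, 1000000.0f)` does not hand back the `0.1f` that went in. Two systems comparing the "same" point on either side of that conversion will disagree.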

If you're dumping a lot of stuff to a log using printf()-style format commands, it pays to be sure you're dumping an actual float and not a SIMD float wrapper struct. What appears to produce sane numbers on one machine/architecture can start spewing random stack/heap gibberish on another machine just because of alignment coincidences. In general, when using any printf()-style formatter, be sure you're actually passing what you think you're passing.
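A minimal sketch of the safe pattern, assuming a made-up `SimdFloat` wrapper (not any particular library's type) and a hypothetical `formatCoord` helper: unpack to a plain float before the value ever reaches the variadic call. Passing the struct itself to `%f` would be undefined behavior, which is why the output can look fine on one machine and be garbage on another.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical stand-in for a SIMD float wrapper: one logical value stored
// in a 16-byte aligned block of four lanes.
struct alignas(16) SimdFloat {
    float lanes[4];
    explicit SimdFloat(float v) : lanes{v, v, v, v} {}
    float toFloat() const { return lanes[0]; }
};

// Formats a coordinate for the log. The wrapper is unpacked explicitly, so
// "%f" receives the double it expects (a float argument is promoted to
// double when passed through varargs).
std::string formatCoord(const char* label, const SimdFloat& value) {
    assert(label != nullptr);
    char buffer[64];
    std::snprintf(buffer, sizeof(buffer), "%s=%.3f", label, value.toFloat());
    return std::string(buffer);
}
```

Building with format-string warnings enabled (for example `-Wformat` on GCC or Clang) should catch many of these mismatches at compile time, at least when the format string is a literal.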

Resist the urge to treat any system as a properly debugged black box. If you watch the event sequence A, B, C and something goes wrong between A and C, suspect B no matter how thoroughly debugged you think B actually is. Nothing sucks like wasting hours scouring A and C only to find out that B was actually the problem all along.

The corollary to this, of course, is that if B really is trustworthy, the bug might just be in the way you're using it. The OS and compiler are (probably) not broken, but you can make them look that way by violating their usage contracts.

Adding debug logging is great unless you're chasing something that might be timing-related. If you start adding logging and the repro conditions change, that's a good sign you have a Heisenbug lurking in the code. These are scary because the closer you look the less chance you have of landing on the bug; I personally am not real good at figuring these out yet. I think it's a combination of intuition and study of the code flow without actually running it - the kind of stuff you normally need for concurrency issues and so on. Seeing a Heisenbug appear in what should be fully serial deterministic code is a bit terrifying.

Above all else, try not to be superstitious. There are logical reasons for everything going on in your program, if you look far enough. They may not be easy to explain, but they're there. It's easy to start suspecting really bizarre crap when you're at the limits of your understanding of a system; the solution is to understand things better, not resort to gross speculation.

Don't underestimate the importance of stepping away for a while. I often find that walking away for 20-30 minutes and coming back can be deeply refreshing and serves to help break out of the ruts of assumptions that are built up when staring too closely at something. Escaping for a bit forces you to flush out the things you think you know and rebuild your contextual picture of the problem; for a lot of hard bugs, finding the problem is more a matter of what you ignore (which you shouldn't be ignoring) than a matter of solving some kind of mystery.

You're probably going to have to learn other people's code. This is not such a big deal if the author in question is still accessible for questioning and enlightenment; if they're not reachable, however, you're in for a rough ride. Resist the urge to just poke at things until stuff changes. Expend the effort to understand the actual code and why it is the way it is.

It's tempting to suspect The Other Guy's code, especially if said Other Guy is no longer available to defend himself. Fight this urge for as long as you can, because it leads to assumptions about where the problem lies. Good debugging is all about eliminating your assumptions and replacing them with verified factual knowledge.

One of the hardest tricks in debugging a complex system is to know when to broaden your search and when to narrow it down. If you have a problem in a narrow area of code, the bug might be inside that area itself, or it might be well outside it and just happen to surface in that particular place. Knowing how to identify when a bug is in a piece of code and when it's just manifesting there is a black art, but well worth mastering.

Build good debugging tools into your program sooner rather than later. It's much easier to find and fix bugs if you have good unit tests, visualization tools, and automated regression tracking. Once you get far enough into a project, it might be too late to go back and do those things, which means you're stuck having to resort to staring at tens of thousands of lines of floating point coordinates looking for discrepancies in the 8th decimal position. You don't want to be stuck there, believe me.
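As one small example of the kind of tool meant here, the sketch below compares two coordinate dumps and reports where they first diverge, so nobody has to eyeball the 8th decimal place. The function name `firstMismatch` and the flat `std::vector<double>` trace format are assumptions for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Returns the index of the first entry where the two traces differ by more
// than `tolerance`, or -1 if they agree everywhere. If one trace is shorter,
// the mismatch is reported at the index where the shorter one ends.
std::ptrdiff_t firstMismatch(const std::vector<double>& expected,
                             const std::vector<double>& actual,
                             double tolerance) {
    assert(tolerance >= 0.0);
    const std::size_t common = std::min(expected.size(), actual.size());
    for (std::size_t i = 0; i < common; ++i) {
        // Written as !(diff <= tolerance) so that a NaN in either trace is
        // also flagged as a mismatch.
        if (!(std::fabs(expected[i] - actual[i]) <= tolerance)) {
            return static_cast<std::ptrdiff_t>(i);
        }
    }
    if (expected.size() != actual.size()) {
        return static_cast<std::ptrdiff_t>(common);
    }
    return -1;
}
```

Run against a known-good recording after every change, a check like this turns a silent drift into a failure that names the index where things went wrong.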

You say, "Unless you have a really darn good reason, don't use different coordinate systems for different parts of your game" - This is the problem I'm running into right now, though I'm using integers for the coordinate system (2D game) and so I'm not getting floating point drift. I've bumped into several mistakes where I think a function is expecting a Point(x,y) from one coordinate system (the world, for example), when it's really expecting a Point(x,y) from another coordinate system (like the _loaded_ world, origin at the center of the current map chunk).

My current solution, which I'm in the middle of implementing across my entire project, is to make sure the various coordinate systems each have their own version of Point() with a unique name, no implicit conversions, and explicit conversion through well-named non-member functions.

For example, I have "Point mousePos" and "Point virtualMousePos" (for the mouse's position in the actual window resolution vs game virtual resolution, again: 2D game). I sometimes mistakenly pass in mousePos to functions expecting virtualMousePos or vice-versa, so I'm creating identical "WindowPos" and "VirtualWindowPos" versions of "Point" that can't implicitly convert between each other.
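A rough sketch of that idea follows. `WindowPos` and `VirtualWindowPos` are the names from the comment above; `Resolution`, `windowToVirtual`, and `isInsideVirtualScreen` are hypothetical helpers invented for the illustration, and the scaling formula is only one plausible mapping.

```cpp
#include <cassert>
#include <type_traits>

// Two structurally identical point types. Because they are distinct
// structs, neither converts implicitly to the other.
struct WindowPos        { int x; int y; };
struct VirtualWindowPos { int x; int y; };

static_assert(!std::is_convertible<WindowPos, VirtualWindowPos>::value,
              "window and virtual positions must not mix implicitly");

// Hypothetical resolution type, used only for this sketch.
struct Resolution { int width; int height; };

// Explicit, named conversion from window pixels to virtual pixels.
VirtualWindowPos windowToVirtual(WindowPos p, Resolution window, Resolution virt) {
    assert(window.width > 0 && window.height > 0);
    return VirtualWindowPos{p.x * virt.width / window.width,
                            p.y * virt.height / window.height};
}

// A function that only makes sense in virtual coordinates says so in its
// signature.
bool isInsideVirtualScreen(VirtualWindowPos p, Resolution virt) {
    return p.x >= 0 && p.y >= 0 && p.x < virt.width && p.y < virt.height;
}
```

Passing a `WindowPos` straight to `isInsideVirtualScreen` is now a compile error, so the mix-up is caught before the game ever runs.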

What do you think of this method - do you think I'm going about it wrong? (Or am I completely misunderstanding what you mean by 'coordinate system'?)

That's about as good as you can get in a language like C or C++ that doesn't support strong typedefs. Thankfully C/C++ are also not structurally typed so you can get away with having a PointFoo and a PointBar that are identical save for their coordinate system.

One trick to smooth that out a bit is to use a Point base class and derive your specific PointFoos and PointBars from that. Saves a lot of implementation overhead and still gets you a bit of type safety.
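One way the base-class trick might look, as a sketch only: `PointBase`, `WorldPoint`, `ChunkPoint`, and `worldToChunk` are made-up names, and the single shared helper stands in for whatever common functionality the real Point has.

```cpp
#include <cassert>
#include <type_traits>

// Shared storage and helpers live in the base class, written once.
struct PointBase {
    int x;
    int y;
    PointBase(int px, int py) : x(px), y(py) {}
    int manhattanLength() const { return (x < 0 ? -x : x) + (y < 0 ? -y : y); }
};

// Hypothetical per-coordinate-system types. Each only forwards its
// constructor; everything else is inherited.
struct WorldPoint : PointBase {
    WorldPoint(int px, int py) : PointBase(px, py) {}
};
struct ChunkPoint : PointBase {
    ChunkPoint(int px, int py) : PointBase(px, py) {}
};

static_assert(!std::is_convertible<WorldPoint, ChunkPoint>::value,
              "sibling point types must not convert implicitly");

// Explicit conversion: chunk space has its origin at the chunk center.
ChunkPoint worldToChunk(WorldPoint p, WorldPoint chunkCenter) {
    return ChunkPoint(p.x - chunkCenter.x, p.y - chunkCenter.y);
}
```

The trade-off is that both derived types still convert implicitly to `PointBase`, so any function that takes the base type will accept either one. That is the "bit" in "a bit of type safety": keep signatures on the derived types wherever the coordinate system matters.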

I'm guilty myself of several of these - especially when a particularly odd bug occurs that "shouldn't be possible". It's easy to assume it has to be some dangling pointer somewhere corrupting memory - which could be called a "programmer superstition" (blame the pointer gods). Time after time there turns out to be a perfectly logical reason for the new odd behavior, and nearly every time it was in code I recently added or changed.

I love how you talk about stepping away. Many of my problems, bugs, or inefficiencies have been solved by simply stepping back, grabbing a Popsicle, and reading some manga (or any other literature, for that matter).

Don't get too attached to a piece of code or a function. Sometimes (for me) completely deleting a buggy function, backups included, and rewriting it from scratch (after the "stepping away" method) does the trick. It's extremely hard (for me) to let go, but on one occasion I completely erased a big geometrical monster function I had been debugging for 2 or 3 days and rewrote it from zero (after stepping away) in 20 minutes, without the bug.
It's hard to debug and abstract an algorithm when you are staring at the implementation...