The Value of a Sanity Check

Some random bits scribbled by Jeremy Zawodny

A funny thing happened today. Something we can all learn from.

In the last week, I've been helping some folks at work do some
performance testing and tuning with MySQL. One group's problem seems
to be solved. The other, however, was running into pretty poor
performance. Today one of them IM'd me (is that a verb now?) with
some concern. He was seeing swapping on the machine. And it was
really slow.

After being interrupted by a few phone calls, I asked how much
memory was in the box. 5GB he tells me. Okay, that should be more
than sufficient. At that point we talked about his memory settings in
MySQL. He had a reasonably sized innodb_buffer_pool. I
think it was 1.5GB or so.

After a bit of thinking, I realized that there was something really
wacky going on. There had to be. He sent me the output of
top and it showed that mysqld was indeed using about 1.4GB of
RAM. Not much else. Hmm.

That blew my only theory. I figured that there were some other
random memory intensive processes running on the box. But no,
nothing.

It was at this point that I was completely out of ideas. The data
made no sense, so he was clearly not telling me something. Not
because he was hiding information, but because he simply wasn't
seeing it and I was mostly relying on his descriptions.

So I got a login on the machine... and found the problem in about
45 seconds.

The machine had 512MB (or 0.5GB) of RAM, not 5GB. It was
swapping because, really, that's what he had told it to do.

I started by verifying the basic assumptions. I looked at what
processes were running, how much disk space was on the box, how much
physical RAM it had, and... that was it. I was done.
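
For what it's worth, the RAM check is easy to script. Here's a minimal sketch in Python, assuming a Linux box (it relies on os.sysconf). The function names and the 80% threshold are my own illustrative choices, not anything MySQL prescribes, and the 1.5GB figure is just the buffer pool size from the story.

```python
import os

GIB = 1024 ** 3
MIB = 1024 ** 2


def physical_ram_bytes():
    """Return physical RAM in bytes as reported by the OS (Linux sysconf)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def buffer_pool_fits(total_ram_bytes, buffer_pool_bytes, max_fraction=0.8):
    """True if the buffer pool is at most max_fraction of physical RAM.

    max_fraction is a hypothetical rule of thumb, not a MySQL limit.
    """
    return buffer_pool_bytes <= total_ram_bytes * max_fraction


if __name__ == "__main__":
    configured_buffer_pool = int(1.5 * GIB)  # the setting from the story
    actual_ram = physical_ram_bytes()        # measured, not what someone said
    print(f"physical RAM: {actual_ram / MIB:.0f} MiB")
    if not buffer_pool_fits(actual_ram, configured_buffer_pool):
        print("buffer pool is too big for this box; expect swapping")
```

The point is to measure the RAM on the machine itself rather than trust the number someone quoted over IM.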

(If you think I'm picking on this guy or making fun of him, you're
going to completely miss the point, so stop now and leave no
comments please.)

The Moral of the Story

We've all been there before. You know, things simply don't make a
damned bit of sense when you're debugging some weird ass problem or
piece of code. That's when you really need a second set of eyes,
ears, or both.

A tactic I've used before (when facing many strange
problems in my code) is to bug someone else to come over so that I can
explain to them how it works. Four times out of five, as I'm
explaining it I figure out the bug. The other one time? The guy (or
gal) I'm explaining it to finds some really stupid, basic thing I'm
doing wrong. (Like misreading memory info.)

We all do this.

These sanity checks (or something like them) are vital to figuring
out computer-related problems. And I'm sure they're just as critical
in so many other detail-oriented pursuits: science, engineering,
medicine, detective work, and so on.

The biggest problem that I seem to have with them is
not doing them soon enough.

Are there other sanity check strategies you've found useful? I'd
love to hear about 'em...

Somewhere I had heard this being described as teddy bear debugging. Yeah, that's right. The idea of course is that you sit a teddy bear (substitute with animal or person of choice) down and explain the problem to it/him/her; given that you have to explain in externally understandable terms, the problem that you are missing will hit you right away. Or so it is said.

Remind me to tell you the story of sprintf and performance sometime, that's a nasty tale....

I found that it also works if there's no-one locally to bounce ideas off: the other day, I IM'd (let's assume that's a verb) a friend to take a look at some code. As I was sending it over, I realised that there was another source file involved that I hadn't looked at. Sure enough, opened it up, there was the problem, and I had it fixed before my friend had finished receiving all my source files!

There is a wonderful book entitled "Debugging: The 9 Indispensable Rules for Finding Even the Most Elusive Software and Hardware Problems" by David Agans which actually has an entire chapter on this topic: Rule #8: 'Get a Fresh View'. Not only is asking someone else, even just to listen, a good idea - so is reporting only the symptoms, not theories (you don't want to drag the other person into the same rut you are in, nor possibly hide some key details). Asking for help is also not a sign of incompetence (which may be why we delay doing so) but rather a sign of true eagerness to solve the problem.

Anyhow, highly recommend the book which is short, very readable, entertaining, and actually applicable to a wide range of activities.

A great book that also talks about this (though they call it rubber ducking) is The Pragmatic Programmer. I cannot recommend this book highly enough as a 30k ft programming book. No language in particular, more about how to go about programming.

"...he simply wasn't seeing it and I was mostly relying on his descriptions."

For me, this says it all. While the reliance on assumptions is the key to moving forward with all software projects, in my former life as a programmer, I found that the ability to question our own assumptions in a VERY egoless way (and/or getting someone else to question them) is key to solving system and software debugging problems.

Whether getting a good night's sleep in order to get a fresh perspective or getting that other person involved, sanity checks ultimately involve intellectual honesty and the questioning of assumptions.

Same thing goes for the memory. Usually I end up asking the same question as many ways as I can figure out without the person understanding that it's the same question.

Or you bring a system up, you change one config file, and you have to have the remote person turn off the machine again for the 3rd time in 10 minutes. But they don't do it the 3rd time, because you obviously have no clue what you're talking about.

Then again, these are mostly phone support issues. But such is my life.

I can't agree more. Explaining what the problematic system is supposed to be doing, or what is going wrong with it, almost always helps find the culprit. If you don't have a human to explain it to, try jotting it down on paper.

I'm a software engineer who specializes in automating tests. In the course of my job I've written a complete automation framework for the products under development at my company. I support about 50 people programming under the framework every day, and run into issues like this all the time.

I've found that one of my gifts is to be able to look at someone else's code and figure out the problem very quickly, but generally I try to talk the person through the problem in the hope that they will find it for themselves and learn where they tend to make mistakes.

All of our stuff is written in Perl and I serve as the company's expert. That means that when I have a problem with my code most of the folks here can't really help me with syntax due to a lack of experience with the language.

This has really helped me, because it forces me to describe the problem I'm seeing in non-Perl ways. All of the people I work with are incredibly smart and very good programmers (usually C, C++, Java) so I have to describe my problem in computer science terms instead of language-specific terms.

Working through this translation layer has really forced me to 100% understand the concepts I'm talking about, and usually as I'm explaining the problem I realize what went wrong. As a result of this experience, the number of times I've gotten into really tough problems and the number of times I've needed to drag someone else in have both decreased.

A long time ago, when I was just a kid and I liked tearing apart and fixing TVs with tubes and taking the parts down to Woolworth's to use the tube tester, I read a fascinating article on how to repair TVs. This was back in the days when most TVs were too big to take to a shop so TV repairmen made house calls. Anyway, the repairman who wrote the article said that over half his repair calls turned out to be false alarms caused by unplugged power cords.
I never forgot that article, and it came in useful when I did phone support for computers. Whenever someone called with a dead computer or printer, I always started by asking if they would check their power cords. Some people were recalcitrant but I held my ground and refused to proceed until that fundamental test was performed. And of course it was the people who were most certain that everything was plugged in that were usually the ones with a pulled plug.

It's usually some little minor stupid mistake like that that costs loads of fucking time. It's almost always the case that you can spend days going back and forth on the phone, email, IM, etc. or identify and solve the problem in minutes flat if the client just sends you the login so you can log on. Never trust a fucking thing clients say - they're usually too fucking clueless.

I can't even begin to say how much another pair of eyes has helped in fixing a problem.

However, before I disturb programmers who are in their own hell zone (and ready to snap back at anything), the first thing I do is to make it (the problem) seem as uncomplicated as possible. Kinda like: OK, what does this program do, and what did I just add to break it? I rename stuff, test simple concepts, and then eventually I may find my error. It usually works. I usually find something like a misspelling or a missed variable name. Stuff like that.

For computer errors, I start with the basics as always. It should be a problem solving rule. Basics first.. :)

How many times have I found myself saying that within 5 minutes... within the hour... or within whatever time, I'll have a certain problem solved? And then I find myself back after having invested 4 times as much time as planned. Better to have someone around to ask for a second look at your code (this looks like pair programming in Extreme Programming, but I don't have any eXPerience with the methodology).

Anyway, asking a colleague can help, but simply stopping work on the specific problem, getting some fresh air or a big cup of Java, and going back to the problem the next day / hour can also do the trick!

Sigh - I should have taken your advice. I've been setting up my MySQL servers to use sockets recently to reduce their presence on the network. I just spent a good half-hour trying to debug a config file which, devoid of all reason, didn't seem to be taking the changes I was making. Lo and behold, I was editing the wrong one. As soon as I changed the one in the cvs directory, it worked without a hitch. Grrr...development servers.

"Fighting a bug is like playing chess with Murphy. When you correct the bug, it's a checkmate to Murphy."

enough for today :-@

on January 15, 2004 04:12 PM
