Search This Blog

Saturday, 24 May 2014

When debug logs aren't good enough

Normally, when you want to get more information about what is going on inside a HP Data Protector process, you simply turn on debugging: edit /opt/omni/.omnirc on Linux/Unix or C:\ProgramData\Omniback\omnirc on Windows and add a line:

OB2DBG=1-200 myproblemname.txt

All processes read this file when they start. You might want to be aware of OB2DBG_DIR so that you can direct the (massive) files somewhere with a lot of room. (For example, if you put your debug logs onto the same drive as your internal database, you might start getting internal database errors because you have run out of disk space.)

And generally, there's more than enough information in there to figure out what's going on.

But this week, I hit something a bit more obscure. I had a VEAgent backup (which was supposed to be backing up VMware) hang very early in the session. Even more bizarrely, it was hanging before it even initiated a connection to the VSphere server.

The debug logs were uninformative. I really need to know what it was hanging on.

Fortunately, almost all processes running on cell clients (i.e. almost everything that isn't part of the cell manager's infrastructure) are launched from either the Windows omniinet process or inetd or xinetd on Linux / Unix. The main exception is the StoreOnce processes, which start at boot time.

I was able to reproduce the hang on both the initial discovery (when you try to browse for a backup) and starting an actual backup, so I decided to focus on the initial discovery first, which is done by the vepa_util.exe process. (It's even got the .exe ending on Linux!)

So I move the vepa_util.exe process out of the way...

mv /opt/omni/lbin/vepa_util.exe /opt/omni/lbin/vepa_util.exe.bin

And then created the following replacement:

#!/bin/bash

exec strace -f -ff -o /tmp/tracelog \

/opt/omni/lbin/vepa_util.exe.bin $*

This script instead launches the original binary (with all command-line arguments) under strace, where each trace file gets the PID of the traced process appended to its filename.

Then I tried to start creating another backup. The GUI froze, as before, but this time I could look at the last line of the trace file to know which system call was blocking.

read(10,

Reading from file descriptor 10. But what was file descriptor 10? A quick lsof -p command on the vepa_util.exe process ID turned up a socket connection back to the cell manager:

vepa_util.exe .... TCP vmclient:52395->cellmanager:2315

And even more curiously, when I logged into the cell manager and looked at the socket there, there were bytes waiting to be sent: the send-queue (send-Q in netstat) wasn't empty.

It turned out to be a firewalling issue (which I'll try to blog about later), but there would have been no way of seeing that from the Data Protector debug logs.