Hunting down the Ghosts in our machines. Chris Beal's Weblog

Comparing dtrace output using meld

Comparing dtrace and other debug logs using meld

meld is a powerful OpenSource graphical "diff" viewer. It is available from the OpenSolaris IPS repositories so can bee installed from the packagemanager in OpenSolaris or simply by typing

$ pfexec pkg install SUNWmeld

It is very clever at identifying the real changes within files and highlighting where the difference start and end.

This example is using the output of two dtrace runs, tracing receive network activity in the a development version of the nge driver, once when it works, and once when it doesn't, trying to identify a bug in a development version.

First off the dtrace script is

$ cat rx.d

#!/usr/sbin/dtrace -Fs

fbt::nge_receive:entry
{
self->trace=1;

}

fbt::nge_receive:return

/self->trace==1/

{

self->trace=0;

}

fbt:::entry

/self->trace==1/

{

printf("%x",arg0);

}

fbt:::return

/self->trace == 1/

{

printf("%x",arg0);

}

This very simply traces all function calls from the nge_receive() function.

So I ran it twice, once when the network was working, and once when it wasn't and simply ran meld over the two files.

$ meld rx.out rx.works.out

This throws up a gui as seen here

It's worth loading full size. What you see is on the right a large area of probes that have fired that do not exist within the one on the left. That implies a lot of code run in the working case that is missing from the failing case.

Looking at the meld screenshot you can see the big green area starts at mac_rx. So in the failing case nge_receive() doesn't call mac_rx() (that'd explain why it fails to receive a packet).

Why doesn't it? Well it implies that nge_recv_ring() returns NULL. nge_recv_ring() is supposed to return an mblk, and it hasn't. Why is that? well looking in to the blue and red highlighted area in the meld window, we see another area in the working case that is missing in the failing case. Hey presto, this bit is the call to allocb(). allocb() is used to allocate an mblk.

So we know in the failing case the nge_recv_ring() function fails to allocate an mblk. Now just need to work out why.

I found this a powerful way of viewing complex data and quickly homing in on differences.