Oracle Blog

Blog for hema

Monday Nov 19, 2007

Here is another example of how libumem can help us detect memory corruption.

The problem reported by the customer was that Sun Web Server 6.1, protected by Sun's Access Manager Policy Agent and running in production, was crashing. The problem did not show up in their staging environment; only the servers in production were crashing.

The customer sent in several core files for us to analyze. The stack trace in the first core file I looked at showed that we were aborting in libmtmalloc.so.1.

I vaguely remembered that there was a bug in mtmalloc that returned an already freed pointer. I searched our bug database and found the bug. As I read through its synopsis, it became clear that we were NOT running into it: that bug was about libmtmalloc's realloc() returning an already freed pointer, and we were not doing a realloc here. I ruled the bug out, and just to be sure, I checked the mtmalloc patch level on the system and found that they were running the latest mtmalloc patch.

I looked at the other core files sent by the customer, and this time the crash was somewhere else. As I analyzed the core files and the pstack output, I noticed that the crashes were random in nature. That randomness pointed to memory corruption, and I initially suspected a double-free type of error.

I requested that the customer use libumem; the customer obliged (BIG thank you) and sent us three core files generated with libumem enabled.

I opened the core file in mdb and ran ::umem_verify, and to my surprise, the integrity of all the caches came up as clean. Where do I go from here?

Okay, I ran the ::umem_status command, and it printed the exact nature of the corruption, including the stack trace of the thread that last freed the buffer and the offset at which someone wrote to the buffer after it was freed.

This information was sufficient for the Sun Access Manager engineering team to come up with a fix and release hot patch 2.2-01 for the Policy Agent.

How good is that! I told you, libumem is so powerful yet so simple and easy to use.

Sunday Nov 18, 2007

I am a big fan of libumem. I've been using it for years to debug application crashes reported by our customers, and I've found it very useful in isolating the source of corruption. I thought I'd share some of my experiences here.

I will use some examples from real cases although I might obscure the names of some of the libraries.

One of our elite customers reported that their application was crashing, and they suspected that Java was the cause. This was a pretty complex Java application that involved a few native libraries as well.

One of the challenging parts of a support job is isolating the problem; even more challenging is convincing the customer that the problem is elsewhere, not where they think it is.

But, you know what, with libumem you'll see how easy this can be. Okay, I asked the customer to run their application with libumem enabled and send us another crash, and so they did.

The first thing to do, of course, is open the core file in mdb and run the ::umem_verify command. This prints the name, address, and integrity of each cache. Take the address of the cache containing the corrupt buffer and run ::umem_verify against it; this gives you the address of the corrupt buffer.

Let's dump the buffer:

The contents of the buffer (highlighted in green) indicate that the application has written 14 bytes. The content is actually the hexadecimal ASCII equivalent of the IP address 172.23.170.77 followed by a NULL character.

From the redzone data, let's find out the actual number of bytes that this application allocated.

The redzone is 8 bytes of data that follow the buffer. When a buffer is allocated with libumem, the first 4 bytes of the redzone contain the pattern 0xfeedface and the last 4 bytes contain an encoded value of the actual size of memory allocated by the application. Do the following math to find the actual size allocated by the application:

0x1498 == 5272t
5272 / 251 = 21
21 - 8 = 13 bytes

Aha, someone allocated 13 bytes and wrote 14 bytes into it -- that explains it all.

Now, let's see who allocated this buffer. To do that, take the value following the redzone data and run the ::bufctl_audit command against it:

As you can see above, this buffer was allocated in the native library libobscure.so via the Java Native Interface. When they allocated the memory to store the IP address, they did not take the NULL character into account and were therefore writing beyond what they had actually allocated. This information was enough to convince the customer that the corruption was not in Java but in a native library they were using.