Tuesday, July 29, 2008

10.2.0.4 Post Mortem

In the end, my 10.2.0.4 SGA problem ended up being the _db_block_cache_protect parameter. Seems as though setting that parameter in 10.2.0.4 maps the SGA to /tmp/shm instead of real memory. The immediate cause of my:

ORA-27102: out of memory
Linux-x86_64 Error: 28: No space left on device

was that I didn't have enough space mounted at /tmp/shm. When I did allocate enough space, I got an ORA-07445 [skgmfixup_scaffolding()+129].
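For anyone who wants to rule out the space problem before starting the instance, here is a minimal sketch of the check. The function name and the idea of passing the SGA size in bytes are my own illustration, not anything Oracle provides; it only compares free space on a mount point against a target size.

```python
import shutil


def has_room_for_sga(mount_point: str, sga_bytes: int) -> bool:
    """Return True if the filesystem at mount_point has at least sga_bytes free.

    mount_point and sga_bytes are supplied by the caller; nothing here
    queries Oracle itself.
    """
    return shutil.disk_usage(mount_point).free >= sga_bytes


if __name__ == "__main__":
    # Hypothetical example: a 12 GB SGA checked against the mount from the post.
    print(has_room_for_sga("/tmp/shm", 12 * 2**30))
```

This would only have caught the ORA-27102; it says nothing about the ORA-07445 that followed.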

Anybody who deals with Oracle Support knows that when you get a different error message, that's essentially the kiss of death for your TAR. True to form, a new TAR was created for my new error message and the attempts to close the original TAR started. In the end, they would rather provide a workaround (don't use _db_block_cache_protect, 10.2 is much more stable) than a solution.

Unfortunately, the only help I can provide you is to not use _db_block_cache_protect in 10.2.0.4 with big SGAs.

Ah yes, the initial block corruptions. Way back when we first moved to X86_64 we were on 9.2.0.X. We would intermittently get ORA-00600 or ORA-07445 error messages that Oracle Support tracked down to corruptions in the db block cache. When the dbwr wrote the block out to disk, dbwr recognized the block as corrupt, threw the error, and turfed the instance. We applied patch upon patch both to the database and the kernel until we finally had about 6 patches on top of 9.2.0.8 and another 20 or so patches to the OS. During the diagnostics process, Oracle suggested we turn on block checksum checking using the three aforementioned parameters and that almost eliminated the problem. The parameters stayed in the init.ora as we upgraded to 10.2 because we had no confidence that these bugs were fixed.

I read the article on the 10.2.0.4 NUMA and db_block_checking saga, and I have a similar but peculiar problem. Maybe I can offer a twist as well. My issue has to do with pre_page_sga=true. I can in fact boot an SGA of any size so long as the huge pages are proportionately larger than the SGA. The ratio is a mystery to me at the moment, but it hovers around 65% of the SGA. That is, if the huge pages are about 35% larger than the SGA.

Take a look at my test results below

Test Results

With pre_page_sga=true, SGA=12g, 13g, 14g, Huge Pages=16GB

I get ORA-00443: background process "PMON" did not start

With pre_page_sga=true, SGA=11g and below, Huge Pages=16GB

Works okay

With pre_page_sga=true, SGA=10g, 9g, 8g, Huge Pages=11GB

I get ORA-00443: background process "PMON" did not start

Problem resolved when the SGA dropped to 7g

Final series of tests

With pre_page_sga=true, SGA=21g, Huge Pages=24GB

I get ORA-00443: background process "PMON" did not start

With pre_page_sga=false, SGA=21g, Huge Pages=24GB

No issues

With pre_page_sga=true, SGA=21g, 23g, 24g, Huge Pages=30GB

No issues.
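To make the ratio easier to eyeball, here is a small sketch that tabulates the results above as absolute and percentage headroom of huge pages over the SGA. The tuple layout and the headroom function are just my own illustration. "11g and below" is entered as the single 11g case, and the 7g success is assumed to be against the same 11GB of huge pages. It does not assume any threshold; it only prints what was reported.

```python
# (pre_page_sga, sga_gb, huge_pages_gb, started) as reported in the tests above.
RESULTS = [
    (True, 12, 16, False), (True, 13, 16, False), (True, 14, 16, False),
    (True, 11, 16, True),
    (True, 10, 11, False), (True, 9, 11, False), (True, 8, 11, False),
    (True, 7, 11, True),   # assumes huge pages were still 11GB for this run
    (True, 21, 24, False),
    (False, 21, 24, True),
    (True, 21, 30, True), (True, 23, 30, True), (True, 24, 30, True),
]


def headroom(sga_gb, huge_pages_gb):
    """Return (extra GB, extra as a percentage of the SGA)."""
    extra = huge_pages_gb - sga_gb
    return extra, 100.0 * extra / sga_gb


if __name__ == "__main__":
    print("pre_page  SGA   HP  extra     pct  outcome")
    for pre, sga, hp, ok in RESULTS:
        extra, pct = headroom(sga, hp)
        outcome = "started" if ok else "ORA-00443"
        print(f"{pre!s:>8} {sga:>4} {hp:>4} {extra:>6} {pct:>6.1f}%  {outcome}")
```

Laid out this way, the percentage alone does not separate the failures from the successes cleanly, which fits with the ratio being a mystery.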

I have a SUSE 10 box with 98g of memory and am able to bring up a 70G SGA with huge pages in a single shared segment… and the best thing is that it is 10.2.0.4! Wow, the saga continues. Why should an SGA with less memory start with multiple shared segments?

In all cases the NUMA optimization was set off. With NUMA optimization set to true, an SGA of any size can be booted and use huge pages regardless of the size of the huge pages, provided they are bigger than the SGA, even by a few MB.
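If you want to compare the configured huge pages against an SGA size on Linux, a rough sketch is to read HugePages_Total and Hugepagesize from /proc/meminfo and multiply them. The function name and the sample text below are made up for illustration; on a real box you would pass in the contents of /proc/meminfo.

```python
import re

# Made-up sample in the /proc/meminfo format: 8192 pages of 2048 kB = 16 GiB.
SAMPLE_MEMINFO = """\
MemTotal:       101187584 kB
HugePages_Total:    8192
HugePages_Free:     8192
Hugepagesize:       2048 kB
"""


def huge_page_pool_bytes(meminfo_text: str) -> int:
    """Return the configured huge page pool size in bytes."""
    total = re.search(r"^HugePages_Total:\s+(\d+)", meminfo_text, re.M)
    size_kb = re.search(r"^Hugepagesize:\s+(\d+)\s+kB", meminfo_text, re.M)
    if total is None or size_kb is None:
        raise ValueError("huge page fields not found in meminfo text")
    return int(total.group(1)) * int(size_kb.group(1)) * 1024


if __name__ == "__main__":
    pool = huge_page_pool_bytes(SAMPLE_MEMINFO)
    sga = 12 * 2**30  # hypothetical 12g SGA
    print(f"pool={pool / 2**30:.1f} GiB, spare={(pool - sga) / 2**30:.1f} GiB")
```

This only tells you how much bigger the pool is than the SGA; whether that margin is enough with pre_page_sga=true is exactly the open question above.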