I was asked these questions recently and had to go and look the subject up ... again! I seem to have forgotten some of the details, and then I thought I would use some new features of AIX for the second part. In the distant past there were various ways to stop core files being dumped into the current working directory of the program that failed. In AIX 5.3, AIX 6 and 7, the "chcore" command does all the hard work for us by letting us:

Choose a specific directory for core files - which is best in its own filesystem so it can't affect important files if it gets full (options -p on and -l directory)

Get the AIX kernel to rename the core file to include the process ID and time stamp (option -n on)

Compress the core file - core files can be very big so it makes sense (option -c on)

One final point - you need to log in again for subsequent core files to be affected by these new settings.
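As a rough sketch, the three options above can be combined in one chcore call. The /corefiles directory is the one used later in this post; the run and setup_core_dir helper names and the DRY_RUN switch are only illustrative, and the -d flag (make the settings the system-wide default rather than per-user) is an assumption about what you want. By default the script only prints the commands, so you can check them before running for real as root.

```shell
#!/bin/sh
# Sketch: send all core files to one directory, with unique names and compression.
# With DRY_RUN=1 (the default here) the commands are only printed, not executed.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

setup_core_dir() {
    coredir=${1:-/corefiles}
    run mkdir -p "$coredir"
    # -p on and -l: use this directory; -n on: rename with PID and time stamp;
    # -c on: compress; -d: apply as the system-wide default (an assumption)
    run chcore -p on -l "$coredir" -n on -c on -d
    # Show the resulting default settings
    run lscore -d
}

setup_core_dir /corefiles
```

To apply the settings rather than just print them, run the script with DRY_RUN=0 in the environment.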

For the second part of the question, we want to be notified quickly when a core file is created. Normally, a core file means a catastrophic failure of an application, which can cause very strange and annoying user experiences or unexplained batch errors in log files. Rather than ignoring these symptoms, we should attempt to find out why the application failed and where in the program it failed - this is what core files are all about - and then go fix it.

In AIX 6 (from TL6 - I think) and AIX 7, we have this new monitoring sub-system called the AHA filesystem. This does all sorts of monitoring and alerting and we can use it to alert us about core files pretty nearly instantaneously. If you updated to an AIX level that supports AHA you may need to install it from the AIX media. Fresh installs will get it installed by default. Fortunately, there are examples of how to use the /aha files. Check out /usr/samples/ahafs; the ones used below are in /usr/samples/ahafs/bin. Here we have a file called aha.pl which is a Perl script, which can take command line options or options from a file (which we use here). I created a file called /etc/corefile with the following contents (the first three lines are comments which help get the layout right):

The first large filename string means monitor directory content for created or removed files and then specifically the directory /corefiles. I have no idea what the .mon is about :-) The CHANGED column = YES means we will monitor for directory changes. The INF_LVL = 2 is the information level of the output. Level 1 does not include the filename involved and level 3 has a stack trace - which is very cool as it means you don't have to run the debugger to list the stack trace to find the function we failed in and how it got there. The other parameters are defaults that work. While trying to get this working, I found one set that generated 500 emails a second, so be careful.

Next prepare the /aha file which tells the kernel about the new event to be monitored:

# touch /aha/fs/modDir.monFactory/corefiles.mon

You get an error about not being able to set the file update time which is normal as it is not a regular file but a device driver like you find in the /proc file system. Now you start the Perl script to report core files arriving in the /corefiles directory with:

# /usr/samples/ahafs/bin/aha.pl -i /etc/corefile -e nag@blue.ibm.com

At start-up I get the following output:

Attempting to open the AHAFS configuration file "corefile".
Monitoring the AHAFS event "/aha/fs/modDir.monFactory/corefiles.mon".

Now to test this alerting system, I just copied a file to /corefiles with: cp myfile /corefiles/testing

Note: the "testing" in the output and email tells us about the new file including the name.

Next, I used a special program that core dumps itself after a second or two. Yes, I wrote it and it was hard work too - none of my programs normally core dump. No, honest :-) I can run it from any directory and the kernel redirects the core dump to /corefiles. I switched to Information Level (INF_LVL) = 3, so we get a stack trace in the output like the one below:

The program is called "coredumper". The core file is renamed to "core.16056558.31153508.Z" - which is PID=16056558 and date time=31153508 (May 31st, then Greenwich Mean Time = 15:35 and 8 seconds, although the machine was running British Summer Time = 16:35) and compressed (.Z). The rest (the purple part) is the stack trace, which shows the memory fault signal arrived in the "main" function.
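If you want to pick those fields out in a script, here is a minimal sketch that splits the renamed core file name back into its parts. It assumes the layout described above (core.PID.DDHHMMSS with an optional .Z suffix, time in GMT); the decode_core_name function name is just made up for illustration.

```shell
#!/bin/sh
# Sketch: split a renamed core file name into PID and time stamp fields.
# Assumes the layout core.PID.DDHHMMSS with an optional .Z suffix.

decode_core_name() {
    name=${1##*/}      # drop any leading directory
    name=${name%.Z}    # drop the compression suffix, if present
    stamp=${name##*.}  # last field: DDHHMMSS
    rest=${name%.*}    # what is left: core.PID
    pid=${rest#core.}
    dd=$(printf '%s' "$stamp" | cut -c1-2)
    hh=$(printf '%s' "$stamp" | cut -c3-4)
    mi=$(printf '%s' "$stamp" | cut -c5-6)
    ss=$(printf '%s' "$stamp" | cut -c7-8)
    echo "pid=$pid day=$dd time=$hh:$mi:$ss GMT"
}

decode_core_name /corefiles/core.16056558.31153508.Z
# prints: pid=16056558 day=31 time=15:35:08 GMT
```

Note the name only carries the day of the month, not the month or year, so the file's modification time is still needed for the full date.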

The only thing left is to run the aha.pl Perl script from the /etc/rc* files or from inittab.
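For the inittab route, a possible sketch is below. It only builds and prints the record in the usual id:runlevel:action:command layout. The identifier "ahacore", run level 2, the "once" action and the redirect to /dev/console are all assumptions to adjust for your system, and inittab_record is a hypothetical helper name. It does not cover whether the touch step shown earlier has to be repeated after a reboot.

```shell
#!/bin/sh
# Sketch: build an inittab record (id:runlevel:action:command) for aha.pl.
# Identifier, run level and action below are assumptions, not tested values.

inittab_record() {
    printf '%s:%s:%s:%s\n' "$1" "$2" "$3" "$4"
}

record=$(inittab_record ahacore 2 once \
    "/usr/samples/ahafs/bin/aha.pl -i /etc/corefile -e nag@blue.ibm.com >/dev/console 2>&1")
echo "$record"

# On AIX, as root, the record could then be added with:
#   mkitab "$record"
```

Printing the record first makes it easy to eyeball before anything is written to /etc/inittab.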

Note: this method does not require polling or crontab periodic checking of the /corefiles directory = zero CPU time.

Core dump notifications also get put into the AIX Error Report (errpt), like:

These can be redirected into the system log and transported remotely off the machine - of course, you would then have to be monitoring the system log for core dump creation events, and the alert would not be near instantaneous.