Today I'm going to write about a situation every sysadmin has already encountered. The sysadmin gets a new version of some software and is supposed to install it on a server. After some hours of trying, he calls the developer and tells him he can't get the application to start. The first answer all of us get: "But it is running on my PC." Let the discussion start. ;)

In my opinion it is mostly a problem of proper communication. I have also seen (more than once) development environments (mostly Windows) that differ from the production/test/... environments (be it Linux, AIX, HP-UX, ...). This can also be a reason for problems, but that is enough for another topic. So let's come back to communication.

As communication is always two-sided, so is this problem. There can be different sources:

The admin has changed a default value of a configuration.

The developer has changed some classes and now needs other permissions or files.

The admin has installed an update on the test system the application is installed on.

The developer uses a new library which is not installed on the server.

...

There can be many more reasons why software fails, but I think you get the idea. Most of these supposed problems can easily be solved by proper communication. When the admin updates a system (be it security patches, OS service packs, ...), he should just write a short e-mail to the users of the system and explain in a few words what was done and what might be affected by the update. When a developer changes something in the code, he should keep a changelog. But as a developer, do me a favour and do not mail the sysadmin the complete changelog. When it is too long or contains too many terms related to the business logic or to how you changed some algorithm to gain some performance, he won't read it. The changes might be interesting for the sysadmin too, but mostly he will not have the time to read it all and pick out the parts relevant to him. I know developers do not have unlimited time either, but for them it is much easier to find the parts affecting the sysadmin, because they (hopefully ;)) understand the complete changelog.

In a perfect world we would have change management that includes development and system administration, but as this will not always be present, just take the short track: write an e-mail, use the phone or hold a short(!!!) meeting whenever anyone knows about changes that could affect the release. Some people will now start rolling their eyes and ask themselves who is not doing so already. It's sad but true: there are a lot of such people out there.

This way your releases will run a lot smoother, and each side gains more understanding for the other, which will positively affect other parts of your daily work.

For some time I had the problem that taking Java heap dumps with jmap took too long. When one of my Tomcats crashed with an OutOfMemoryError, I had no time to take a heap dump, because it took some hours and the server had to be back online.

Now I have found a solution to my problem. The initial idea came from this post. It had a solution for Solaris, but with some googling and trial and error I found a solution for Linux too.

1. Create a core dump of your Java process with gdb:

gdb --pid=[java pid]
gcore [file name]
detach
quit

2. Restart the Tomcat or do whatever you like with the Java process.

3. Attach jmap to the core dump and create a Java heap dump:

jmap -heap:format=b [java binary] [core dump file]

4. Analyze your Java heap dump with your preferred tool.
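The gdb and jmap steps above can be wrapped in two small shell functions. This is only a sketch under a few assumptions: the names run, dump_core, heap_from_core and DRY_RUN are made up for illustration, your gdb has to support the --batch and -ex options, and file names must not contain spaces. The jmap option is copied as-is from step three; newer JDKs changed jmap's options, so check jmap -h on your system first.

```shell
#!/bin/sh
# Sketch only: run, dump_core, heap_from_core and DRY_RUN are made-up names.

# Run a command, or only print it when DRY_RUN=1 (handy for a review).
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 1: write a core dump of process $1 to file $2.
# Assumes a gdb that supports --batch and -ex; $2 must not contain spaces.
dump_core() {
    run gdb --batch --pid="$1" -ex "gcore $2" -ex detach -ex quit
}

# Step 3: let jmap read core dump $2, which was produced by binary $1.
# The jmap option is copied from the step above and may differ on newer JDKs.
heap_from_core() {
    run jmap -heap:format=b "$1" "$2"
}

# Usage (placeholders, commented out so nothing runs by accident):
# dump_core 12345 /tmp/tomcat.core
# ... restart the Tomcat here (step 2) ...
# heap_from_core /usr/bin/java /tmp/tomcat.core
```

Calling the functions with DRY_RUN=1 only prints the command lines, so you can check them before touching a production process.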

When you get the following error in step three:

Error attaching to core file: Can't attach to the core file

This might help:
In my case the error appeared because I used the wrong Java binary in the jmap call. When you are not sure about your Java binary, open the core dump with gdb:

gdb --core=[core dump file]

You will get an output similar to this one:

GNU gdb 6.6
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i586-suse-linux"...
(no debugging symbols found)
Using host libthread_db library "/lib/libthread_db.so.1".
warning: core file may not match specified executable file.
(no debugging symbols found)
Failed to read a valid object file image from memory.
Core was generated by `/opt/tomcat/bin/jsvc'.
#0 0xffffe410 in _start ()
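The line starting with "Core was generated by" names the executable that produced the core dump (jsvc in the output above), which is presumably the binary jmap wants to see. If you do not want to search the output by hand, a small filter can pick it out. This is a sketch; core_binary is a made-up name, and it assumes the path of the binary contains no spaces.

```shell
#!/bin/sh
# Sketch only: core_binary is a made-up helper name.

# Read gdb output on stdin and print the executable named in the
# "Core was generated by" line (first word only, arguments are dropped).
core_binary() {
    sed -n "s/^Core was generated by \`\([^ ']*\).*\$/\1/p"
}

# Usage (assumes a gdb that supports --batch):
# gdb --batch --core=[core dump file] 2>&1 | core_binary
```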

There are different check_jmx versions (ME, NE1, NE2 and CG) on NagiosExchange, MonitorExchange and Google Code, but it seems none of them is still maintained. I tried to reach one author but got no reply, so I decided to put my modifications on the net. I also merged some other changes into this new release of check_jmx.

To make sure other people can continue development should I not be reachable, I uploaded the source to Gitorious. There is a new repository for Nagios plugins which is maintained by some community members.

check_jmx is a Nagios plugin to monitor your JVM, e.g. your Tomcat or JBoss installation. It can retrieve data about your heap, GC, .... It can also query MBeans which are part of your application. check_jmx also returns performance data.

For this release I merged the original check_jmx release with additions that support longs instead of integers for the warning and critical values. I also added authentication for connections to the JMX server.
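To show why long thresholds matter: heap sizes in bytes quickly exceed 2147483647, the largest value a 32-bit integer can hold. The following sketch is not check_jmx code. It only illustrates the usual Nagios plugin contract (one status line with performance data, return code 0 for OK, 1 for WARNING, 2 for CRITICAL) with values above that limit. The function name check_threshold and the label heap_used are made up, and the shell is assumed to compare 64-bit integers.

```shell
#!/bin/sh
# Sketch only: check_threshold is a made-up name, not part of check_jmx.
# Assumes the shell compares 64-bit integers (bash and dash do on common systems).

# Compare value $2 against warning $3 and critical $4, print a Nagios
# status line with performance data and return the matching plugin code.
check_threshold() {
    label=$1 value=$2 warn=$3 crit=$4
    if [ "$value" -ge "$crit" ]; then
        status=CRITICAL code=2
    elif [ "$value" -ge "$warn" ]; then
        status=WARNING code=1
    else
        status=OK code=0
    fi
    # Performance data format: label=value;warn;crit
    echo "JMX $status - $label=$value | $label=$value;$warn;$crit"
    return "$code"
}

# Usage: 3 GiB used, warning at 2 GiB, critical at 4 GiB (all in bytes):
# check_threshold heap_used 3221225472 2147483648 4294967296
```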

And now the presentation matching the podcast at RedMonk I mentioned earlier. There are some very good statements in this presentation. The tools section is important not so much for the tools mentioned as for the statements that are combined with the tools:

single click build

single click deployment

monitoring for app and systems

understanding for all metrics independent of app or system

dark launches

But as the last slide says, it is not easy. Try to start with one point or one project and establish it. When it is working, take the next point or the next project.

The slides about culture are very good. One I especially want to mention is slide 57, "Don't just say 'No'". I don't know what was said during the presentation, but my understanding of this sentence is as follows:
You can say 'No', but then explain why and give alternatives. When you don't have alternatives, just say so and try to find alternatives together.
The slides about finger-pointing don't need any comment. Just take a look at them and you know everything.

But there are also slides I do not fully agree with. I don't think dev should have full access to all systems. They definitely must have access to a test environment which is almost the same as production, but they do not necessarily need access to production. They should have access to logs, but not the rights to restart services or change anything. In my opinion this would be the same as ops changing some code. This can work in small organisations where most people have more than one role, but not in bigger organisations.