Do you want to monitor any process which may have a memory leak (the top N memory hogs), or are you looking to monitor a defined set of processes (e.g. an Apache webserver and a Tomcat process)? The latter is doable with some simple Nagios or Cacti plugins. The former is more difficult. You should clarify this.
– Stefan Lasiewski, Jul 29 '10 at 4:01

I already clarified it in the post but to clarify again: I want to know the state of the system when it goes down due to swap death. I want to know who the worst offenders are. And btw, it doesn't have to be a memory leak - just an influx of traffic, or whatever causes high memory usage. So, again, no advance knowledge of binary names should be configured.
– Artem Russakovskii, Jul 30 '10 at 3:25

11 Answers

If you want just the top offenders, consider running top in batch mode with a relatively long interval (60 seconds or more). You may need more than one top running to capture the top offenders on multiple resources. I have configured systems to run top for a few cycles whenever a resource was being overused.
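As a rough illustration of the "log the top offenders periodically" idea, here is a minimal Python sketch. It swaps in `ps -eo pid,rss,comm` for top, purely because that output is easier to parse, and it assumes a procps-style ps. The function names `top_memory_hogs` and `snapshot` are made up for this example.

```python
import subprocess


def top_memory_hogs(ps_output, n=5):
    """Return the n largest (rss_kib, pid, command) tuples from ps-style text.

    Expects three whitespace-separated columns: PID, RSS, COMMAND.
    Rows whose first two columns are not numeric (such as the header)
    are ignored.
    """
    rows = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        pid, rss, command = parts
        if pid.isdigit() and rss.isdigit():
            rows.append((int(rss), int(pid), command.strip()))
    # Largest resident set size first.
    return sorted(rows, reverse=True)[:n]


def snapshot(n=5):
    """Run ps once and return the current top n memory consumers.

    Assumption: a ps that accepts -e and -o pid,rss,comm is installed.
    """
    result = subprocess.run(
        ["ps", "-eo", "pid,rss,comm"],
        capture_output=True, text=True, check=True,
    )
    return top_memory_hogs(result.stdout, n)
```

Calling `snapshot()` from cron once a minute and appending the result to a log file with a timestamp would leave a record of who was using the most memory shortly before the box went down.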

Consider running sar in batch mode to capture resource utilization. I realize this is server based, but it is useful for determining the times when problems are occurring.

Run munin and enable notifications. This may give you a chance to get in and watch the server going down. You may be able to correct the problem before it goes down.

For memory leaks, a steady increase in swap usage indicates a problem. I once watched a server slowly die over a period of days. The problem service was a program monitoring other processes for memory leaks. The system admin kept insisting the increasing swap usage was not a problem, right up until the server stopped responding.
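One way to turn "a steady increase in swap usage" into something a script can check is to fit a straight line through recent samples and look at the slope. This is only a sketch: it assumes equally spaced samples in KiB, and the 1024 KiB-per-sample threshold and both function names are arbitrary choices for illustration.

```python
def swap_growth_per_sample(samples):
    """Least-squares slope of swap usage over equally spaced samples."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    numerator = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def looks_like_leak(samples, min_slope_kib=1024):
    """True if swap usage grows by at least min_slope_kib per sample on average."""
    return swap_growth_per_sample(samples) >= min_slope_kib
```

A slope that stays positive over hours or days is the pattern described above. A short spike that comes back down will pull the slope toward zero.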

You may find that cfengine's anomaly detection can be used to trigger a script to capture the system state when things go wrong. You may want a lot of information besides just the processes using the most resources. For a sudden influx of usage you may want a list of network connections (by address not name). Memory usage is also useful.
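If you don't have cfengine handy, the trigger can be approximated with a small check on swap usage. The sketch below parses text in the format of Linux's `/proc/meminfo` and decides whether it is time to capture state. The 50% threshold and the function names are assumptions for this example, not anything cfengine provides.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: integer value}."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[name.strip()] = int(fields[0])
    return info


def swap_used_fraction(info):
    """Fraction of swap in use, or 0.0 if no swap is configured."""
    total = info.get("SwapTotal", 0)
    if total == 0:
        return 0.0
    return (total - info.get("SwapFree", 0)) / total


def should_capture(meminfo_text, threshold=0.5):
    """True once swap usage reaches the threshold (hypothetical default 50%)."""
    return swap_used_fraction(parse_meminfo(meminfo_text)) >= threshold
```

When `should_capture(open("/proc/meminfo").read())` returns true, the calling script could dump the process list and something like `netstat -n` (numeric addresses, as suggested above) to a timestamped file.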

This is where you should start. You can't know where to start an examination until you know where you have the best chances. Sysstat is what you are looking for (it also has pretty graphs). Once you know more, use systemtap.
– Allen, Aug 3 '10 at 16:20

"Atop is an ASCII full-screen performance monitor that is capable of reporting the activity of all processes (even if processes have finished during the interval), daily logging of system and process activity for long-term analysis, highlighting overloaded system resources by using colors, etc. At regular intervals, it shows system-level activity related to the CPU, memory, swap, disks, and network layers, and for every active process it shows the CPU utilization, the memory growth, priority, username, state, and exit code."

atop doesn't seem to have a report that would provide me with what I wanted. Please correct me if I'm wrong.
– Artem Russakovskii, Jul 27 '10 at 9:45

It takes care of your first two bullet points (memory/cpu by process). You can use the library to gather these stats and then do your history / graphing based on the data.
– NinjaCat, Jul 28 '10 at 14:25


@artem-russakovskii - By default atop logs data to a file every ten minutes. If your server crashed at 3:45 you could start atop with atop -r log_filename, press m to switch to the per-process memory usage view, and then press t to move forward in 10 minute increments until 3:40. You can read more about the basics of using atop at lwn.net/Articles/387202 and see an example of identifying a memory leak at atoptool.nl/download/case_leakage.pdf
– sciurus, Mar 1 '11 at 19:45

Collectd is very lightweight, not too difficult to set up, and will let you see memory/swap growth over time. It will not pinpoint the offending processes, though -- but maybe you'll be able to notice and catch the memory growth in time and inspect the situation manually with top.
– Marius Gedminas, Jul 30 '10 at 11:34


I have to say that I haven't tried that plugin, but the manual for collectd's processes plugin says: "If processes are selected the following information is gathered. All this information is aggregated by the process name. Its Resident Segment Size, Used user- and system-time, The number of processes by that name, The number of threads (summed up over all the processes), The number of major and minor page faults. Rough I/O-numbers (bytes written and read due to syscalls by the process)."
– PiL, Jul 30 '10 at 12:02

You can select the processes either by name or by regex.
– PiL, Jul 30 '10 at 12:03

Centreon on top of Nagios, with Nagios coupled with NRPE. You can then write custom scripts that report data to NRPE in ANY format you wish. Nagios polls the data from the remote servers through NRPE, and Centreon makes a pretty graph and adds a ton of user flexibility. We use it over at http://beyondhosting.net. I have a VZ Container template with Centreon + Nagios set up already if you want it.
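For what a custom check might look like, here is a minimal sketch that follows the usual Nagios plugin convention: the exit code is 0 for OK, 1 for WARNING, 2 for CRITICAL and 3 for UNKNOWN, and the output is one line of text with optional performance data after a `|`. The check name, the thresholds and the `check_swap` function are invented for this example.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_swap(used_pct, warn=50.0, crit=80.0):
    """Return (exit_code, output_line) for a swap usage percentage.

    The thresholds are hypothetical defaults; tune them for your server.
    """
    if used_pct >= crit:
        status, label = CRITICAL, "CRITICAL"
    elif used_pct >= warn:
        status, label = WARNING, "WARNING"
    else:
        status, label = OK, "OK"
    # Text after "|" is performance data: label=value[UOM];warn;crit
    message = (
        f"SWAP {label} - {used_pct:.1f}% used"
        f" | swap_used={used_pct:.1f}%;{warn:.0f};{crit:.0f}"
    )
    return status, message
```

A real plugin would print the message and call `sys.exit()` with the status. The performance data part is what a graphing front end can pick up and plot.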

nmon is a great tool that does what you're looking for. Developed for AIX and Linux. Produces a ton of detailed output and easy to put into reports. If you google it, there is an IBM wiki that has a bunch of documentation and additional utilities for parsing the data.

I use it on one of our production servers and am very happy with it. Its top feature is the ability to view charts, click on a peak and see the server's CPU/memory consumption at that time, including all running processes. They call these snapshots.

It's constantly improving. One of the latest features is anomaly detection, which flags unusual behaviour for you. You can also set up various thresholds.

Ah, I forgot to mention the little part where I'd prefer it to be free, and open source, if possible. Over $100 per server is not really what I'm looking to spend (and I only have 1 server, not 5). serverdensity.com/pricing
– Artem Russakovskii, Jul 30 '10 at 3:28

I use collectd to record system load amongst a number of other parameters. It stores the data in RRD stores that can be graphed and otherwise analysed using the many available tools and scripts. I use a modified version of this script for my graphing (sample output).

Collectd has plugins for monitoring lots of stuff (everything commonly asked for and a few things on top), and creating your own shouldn't be difficult if you need something specialised, so it makes for a very flexible tool. Configuring the graphs in rrd.cgi is a very manual process, though not a difficult one, and you might well find a more convenient tool for working with the RRD files maintained by collectd.
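As a hint at how small a custom data source can be, here is a sketch of a helper that formats a value in the plain-text `PUTVAL` form that collectd's exec plugin reads from a script's standard output. The host name, the `exec-topmem` plugin instance and the `putval_line` helper are hypothetical, and I'm assuming the generic `gauge` type is acceptable for the value being recorded.

```python
def putval_line(host, plugin, type_instance, value, interval=60, type_name="gauge"):
    """Format one PUTVAL line for collectd's plain-text protocol.

    "N" as the timestamp asks collectd to use the current time.
    """
    identifier = f"{host}/{plugin}/{type_name}-{type_instance}"
    return f'PUTVAL "{identifier}" interval={interval} N:{value}'
```

A script run by the exec plugin would print one such line per value every `interval` seconds, for example the RSS of whichever process currently uses the most memory.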