What sucks, who sucks and you suck

Performance Graphing on Solaris

2003-05-30

Problem: you need an overall view of the performance of your Solaris servers. Something with pretty graphs, that doesn’t require hours of configuration, the installation of complex, insecure client-side agents, or significant overhead in data collection and handling.

Solution: nothing. But you can get close…

My first thought was to use SNMP: it’s bundled with Solaris, it’s lightweight, it can be locked down (read-only access) and it’s designed for this sort of thing (pulling system data from remote nodes). My first discovery was that you should forget about SNMP. Sun’s SNMP daemon returns zero for most of the crucial MIB entries. NET-SNMP can return almost any data you wish (with the minor overhead of installing it, but then you use CfEngine, don’t you?), and it’s portable. However, the available freeware SNMP consoles are unwieldy, complex beasts with fairly obscure configuration methods; most of them are targeted at network administrators rather than system admins. The latest beta versions of Cricket (1.0.4+) contain a SystemPerf utility that builds configurations for hosts running NET-SNMP, but it doesn’t scale well without some scripting wrapped around it (NB. I have corroborative statements on this).

About a million people on the SAGE members mailing list said, “Try Orca”. Orca graphs arbitrary data from files using RRDtool. The orcallator plugin uses the SE Toolkit for Solaris to record OS performance data, which Orca can graph using a canned configuration. The setup advice assumes that each remote Orcallator client mounts, via NFS, a directory exported from the central Orca host. Those managing ecommerce or other secure sites will probably want to steer well clear of NFS. You could therefore either use a secure distributed filesystem like SFS or pull the data at regular intervals with rsync. For the latter, here are some additional instructions:
1. (Rsync and SSH are givens, right?)
2. Install RRDtool and RRD.pm on your monitoring host/web server. (Note: RRD.pm must be built with the version of Perl you will use to run Orca. If you have a locally-built Perl, ensure it appears first in your PATH and set the PERL environment variable to point to the binary before building, otherwise it will pick up the Solaris version. I had problems building the version of rrdtool included with Orca, so I downloaded and installed the latest release separately.)
3. Install Orca on the monitoring host and create the directory tree for its RRD & HTML files (plus an orcallator/ subdirectory). When configuring the Orca build, set var-dir to somewhere local to each host (e.g. /var/orca).
4. Install the SE Toolkit and orcallator.se on all your hosts. You also need S99orcallator and start_orcallator from your installed Orca kit; it’s easier just to install the whole thing on each client (you do have a software distribution mechanism?) Start Orcallator ASAP to begin gathering data immediately, even if Orca itself isn’t ready yet.
5. Set up transparent (key-based) SSH access from the monitoring host to each remote host, using a non-root user account. We need this for rsync (you can probably restrict it to rsync-only via the SSH config files if you’re truly paranoid).
6. Write a small script to rsync the local Orcallator data directory from each host to the main Orcallator data directory (i.e. the Orca subdirectory) on the monitoring host. Run this script via cron at the same interval as your Orca graphing and Orcallator data recording (default is five minutes). Insert “sleep 15;” in the crontab entry to avoid pulling the files while Orcallator is updating them. Example script (updated 2003-11-21, important fix!).
7. Copy orcallator.cfg somewhere sensible (e.g. your main orca directory) and edit as required. You will need to extend late_interval by about 60 seconds to avoid warnings about out of date files. Note that despite the name, this file configures Orca to graph orcallator data files; it doesn’t configure Orcallator itself (do that by editing orcallator.se if required).
8. Perform any required web server configuration for the HTML output files (e.g. a URL alias).
9. Start Orca (orca -d orcallator.cfg). In my experience, with a dozen hosts Orca consumes most of a single CPU on an E450 while updating. The web pages can be slow to render due to the number of graph images present. You may wish to run the Orca process with nice to prevent it undermining the general responsiveness of the system. [Using this method, I receive warning mails from Orca about the data files being out of date during the rollover at midnight, so the timing may still require some fine-tuning. In fact, eventually I set warn_email to the empty string to disable the warnings.]
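For illustration, here is a rough sketch of the pull script from step 6, written in Python. It is not the example script linked above, just one way the idea could look. The hostnames, the orcapull account, both directory paths and the per-host destination layout are all made-up placeholders; adjust them to match your own var-dir and Orca tree. The crontab helper spells out every minute explicitly because older Solaris cron does not understand the */5 shorthand.

```python
import subprocess

# Everything below is a placeholder for illustration, not part of Orca itself.
HOSTS = ["web1", "db1"]                  # hypothetical Orcallator clients
REMOTE_USER = "orcapull"                 # hypothetical non-root account (step 5)
REMOTE_DIR = "/var/orca/orcallator"      # assumed data directory on each client
LOCAL_DIR = "/export/orca/orcallator"    # assumed data directory on the monitoring host


def build_rsync_command(host, user=REMOTE_USER,
                        remote_dir=REMOTE_DIR, local_dir=LOCAL_DIR):
    """Return the rsync argument list that pulls one client's data files."""
    source = f"{user}@{host}:{remote_dir}/"
    # Assumption: one subdirectory per client on the monitoring host,
    # so files from different hosts cannot overwrite each other.
    dest = f"{local_dir}/{host}/"
    # BatchMode makes ssh fail instead of prompting if the key is not set up.
    return ["rsync", "-az", "-e", "ssh -o BatchMode=yes", source, dest]


def pull_all(hosts, runner=subprocess.call):
    """Run rsync for every host; return the hosts whose transfer failed."""
    failed = []
    for host in hosts:
        if runner(build_rsync_command(host)) != 0:
            failed.append(host)
    return failed


def crontab_line(script, interval_minutes=5, delay_seconds=15):
    """Build a crontab entry that waits briefly, then runs the pull script."""
    minutes = ",".join(str(m) for m in range(0, 60, interval_minutes))
    return f"{minutes} * * * * sleep {delay_seconds}; {script}"


# Typical use from a wrapper run by cron:
#   failed = pull_all(HOSTS)
#   if failed: print("rsync failed for:", ", ".join(failed))
```

Building the command as a list rather than a shell string avoids quoting problems, and returning the failed hosts lets the wrapper decide whether to mail someone or just try again five minutes later.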

As a companion or an alternative, you might also look at SARGE, the SAR Graphing Engine. This is a bit outdated (1998) and suffers from lack of maintenance, but it’s handy if you’ve already got SAR running. I made a few local fixes and improvements (IMO) to it; download the patches here.

There is a commercialised version of SARGE available. I haven’t tried it; it seems to run and generate graphs locally on each host which are then accessed via a shared network drive (grouch) by a central CGI program.

Orca vs. SARGE

SARGE doesn’t support RRD and only shows the last 30 days of data. Orca keeps short and long range periods.

SARGE uses remote shell calls to run SAR and gather SAR files on each host; this is a clunky, distributed overhead that might be better handled by rsync’ing all the SAR files to a single place and processing them centrally (like Orca).

SARGE makes it easy to compare performance across several hosts (in a SARGE “group”); the volume of information and the lack of grouping makes this more difficult with Orca.

Orca can graph arbitrary data, if you work out how to configure it. My conclusion is that SARGE would be the better tool (on the grounds of simplicity) if it supported RRD and was updated. However, Orca tells you everything you might ever need to know.

The Ultimate Solution:
* Requires no extra client-side software (unlikely unless vendors standardise and improve their performance utilities).
* Uses RRD.
* Gathers remote data via SNMP or similar - no remote shells.
* Can handle arbitrary data sources.
* Does not assume the use of shared network drives, which are a whole other can of worms.
* Scalable, perhaps via the ability to distribute graphing across several hosts or subnets.

If Cricket’s config system was overhauled or automated, it would come very close. ;-)