
While preparing for my RHCSA exams, I was in dire need of a Linux playground. At first I could make do with virtual machines running inside Parallels Workstation on my MacBook. But in order to use Michael Jang's practice exams, I really needed to run Linux as the main OS (the tests require KVM virtualization). I tried and I tried and I tried, but CentOS refused to boot, mostly ending up on the grey Tux / penguin screen of rEFIt.

On my final attempt I managed to get it running. I started off with this set of instructions, which got me most of the way. After resyncing the partition table using rEFIt's menu, using the rEFIt boot menu would still send me to the grey penguin screen. But then I found this page! It turns out that rEFIt is only needed in order to tell EFI about the Linux boot partition! Booting is then done using the normal Apple boot loader!

Just hold down the ALT key after powering up and then choose the disk labeled "Windows". And presto! It works: CentOS boots up just fine. You can simply set it as the default boot disk (using the Boot Disk Selector), provided that you left OS X on there as well.

Yesterday I received an email from the NLUUG (Dutch Unix users group) conference bureau:

Dear Thomas,

We have received your abstract for the NLUUG spring conference 2010. We would like to thank you for your submission and your patience.

It is our pleasure to inform you that your abstract has been chosen by the program committee to be presented on May 6th.

Holy carp! This means that I'll be getting on stage, in front of 50-200 Unix admins, my peers if you will. The last time I got in front of a big crowd it was thirty high school juniors, so this is going to be just a -little- bit different. =_=;

Why the frag does the IPv4 networking setup on Red Hat and Fedora Linii need to be so damn complicated? I've just spent half an hour Googling to find the right commands to ensure that my Fedora 12 VM in Parallels configures its eth0 at boot time. Seriously, compare the two:

Solaris 10:

1. Enter hostname and IP in /etc/hosts

2. Enter hostname in /etc/hostname.ni0

3. Enter network base IP and netmask in /etc/netmasks

Fedora 12:

1. Run system-config-network. Fill out all details.

2. Enter hostname and IP in /etc/hosts

3. Enter hostname in /etc/sysconfig/network

4. Set "ONBOOT" to yes in /etc/sysconfig/networking/devices/ifcfg-eth0

5. Run: chkconfig --level 35 network on

Seriously, who the fsck comes up with that last line?! I already have the network startup scripts in /etc/init.d and everything in /etc/sysconfig is set up and I -still- need to enable the network config to be loaded at boot time? WTF?! A few years back I had the same fights with setting up static routes that needed to be carried over reboots.

It took us over three weeks of mailing back and forth, but finally the Parallels team were able to both reproduce the issue and to provide a fix. Here's the summary of what tech support told me.

Cause of the problem

Parallels Desktop 5 now supports 64-bit operating systems. Furthermore, it will now by default boot any capable OS into 64-bit mode. This means that all of my Solaris 10 VMs that had been running in 32-bit mode all of a sudden switched over to 64-bit. This also means that any 32-bit only drivers are rendered unusable. This is what broke the usage of Parallels' virtual network interfaces.

Solution 1: forcing the OS back to 32-bit mode

1. Stop the VM

2. Go to VM configuration -> Hardware -> Boot order.

3. In the "boot parameters" field enter: devices.apic.disable=1

3a. Alternatively, add the following to /etc/system: set pcplusmp:apic_forceload = -1

4. Boot the VM.

5. As root run: /usr/sbin/eeprom boot-file="kernel/unix"

Solution 2: recompiling the RTL3829 drivers as 64-bit

1. Start the VM

2. Remove the old drivers. Run: rem_drv ni

3. Mount the Parallels tools CDROM ISO image on /cdrom.

4. Run: cd /cdrom/Drivers/Network/RTL3829

5. Run: cp -rp SOLARIS /tmp

6. Run: cd /tmp/SOLARIS

7. Edit the network.sh file and add the following lines right before the echo of "Compiling driver".

PATH=$PATH:/usr/sfw/bin

rm $driver/Makefile

ln -s $tmpdir/$driver/Makefile.amd64_gcc $tmpdir/$driver/Makefile

8. Save the file and run: ./network.sh

9. Answer the usual questions to configure the NIC. Then reboot.

I went with the second solution (might as well stay running in 64-bit mode now that I can). I can confirm that it works and that my NI interface is now back. You may find that network.sh configured the ni0 interface, while it's actually called ni1. Reconfigure if needed by moving /etc/hostname.ni0 to /etc/hostname.ni1.

As said my script only checks the basics of CNR to ensure that the required daemons are running. It does not actually check any of the functionality, though at a later point in time it may be expanded to include this.

Usage of check_cnr

Output

Depending on which mode you've selected the output of the script will differ slightly.

In Tivoli mode the output will be limited to a numerical value, as the script is to be used as a "numeric script". 0 = OK, 1 = WARNING/UNKNOWN, 2 = SEVERE. The exit code of the script will be identical to this value.

In Nagios mode the exit code of the script will be similar to Tivoli's, with the exception that the value 3 portrays an unknown state. The output on stdout includes the service name and state (CNR OK/NOK) and a helpful error message.
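To make the difference between the two modes a bit more tangible, here's a minimal sketch of the mapping in Python. It is not the actual check_cnr code: the function name, the state labels and the message separator are all made up for illustration. Only the numeric values come from the description above.

```python
# Hypothetical sketch of the two output modes; none of these names come
# from check_cnr itself. Only the numeric codes follow the text above.
TIVOLI_CODES = {"OK": 0, "WARNING": 1, "UNKNOWN": 1, "SEVERE": 2}
NAGIOS_CODES = {"OK": 0, "WARNING": 1, "SEVERE": 2, "UNKNOWN": 3}


def report(state, mode, message=""):
    """Return (stdout_text, exit_code) for a state in the given mode."""
    if mode == "tivoli":
        # Tivoli "numeric script": print only the number, exit with it too.
        code = TIVOLI_CODES[state]
        return str(code), code
    if mode == "nagios":
        # Nagios: service name plus state, followed by a readable message.
        code = NAGIOS_CODES[state]
        label = "CNR OK" if state == "OK" else "CNR NOK"
        text = f"{label} - {message}" if message else label
        return text, code
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    text, code = report("SEVERE", "nagios", "a required daemon is not running")
    print(text)
    raise SystemExit(code)
```

Keeping both tables side by side makes the one real difference obvious: UNKNOWN shares code 1 with WARNING in Tivoli mode, but gets its own code 3 in Nagios mode.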

Limitations

This script has currently only been tested on Solaris and Linux.

The plugin will only check for the required daemons. It currently does no functional checks, though the framework for these checks is already in place.

Since I've joined $CLIENT in October my life has been nothing but BoKS, BoKS, BoKS. It's great to be working with FoxT's security software again :) A lot of things have changed over the years, though the software is still very, very familiar.

One of the things that's made me happy is that Fox Tech have -finally- made an official logo for their BoKS products! I find it odd that they've been marketing this software for over ten years and that their last logo dates back to the nineties. Said decrepit logo hasn't been used in ages, and since then BoKS was known by just that: a plain text rendition of the name. By request of $CLIENT, Fox Tech have gotten off their hineys and created a new logo that matches their corporate identity.

As a side note: over the past few weeks I've seen a lot of in-depth troubleshooting and I've decided to share some of the stuff I've learnt. Hence you'll find that the BoKS part of the sysadmin section has been revamped :)

Yesterday I'd spent an hour or two writing a PHP+SQL script for one of my colleagues, so he could get his hands on the report he needed. We have this big database with statistics (gathered over the course of a year) and now it was a matter of getting the right info out of there. Let's say that what we wanted was the following:

For four quarters, per host, the total sum of the reported sizes of file systems.

Now, because my SQL skills aren't stellar what I did was create a FOR-loop on a "select distinct" of the hostnames from the table. Then, for each loop instance I'd "select sum(size)" to get the totals for one date. But because we wanted to know the totals for four quarters, said query was run four times with a different date. This means that to get my hands on said information I was running 168 * 4 = 672 queries in a row. All in all, it took our box fifteen minutes to come up with the final answer.

On my way to work this morning a thought struck me: I really ought to be able to do this with four queries, or even with -one-! What I want isn't that hard! And in a flash of insight it came to me!

SELECT hostname, date, SUM(size) AS total FROM vdisks WHERE (date="2007-10-03" OR date="2008-01-01" OR date="2008-04-01" OR date="2008-07-01") GROUP BY hostname, date;

The runtime of the total query has gone from 15 minutes to 1 second. o_O
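For the curious, here's a small self-contained sketch of both approaches, using Python and an in-memory SQLite database instead of the PHP script and the real statistics database. The sample rows and function names are invented; only the table name (vdisks), the column names and the four dates come from the query above.

```python
import sqlite3

QUARTERS = ["2007-10-03", "2008-01-01", "2008-04-01", "2008-07-01"]


def build_demo_db():
    """Create an in-memory stand-in for the statistics table (made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE vdisks (hostname TEXT, date TEXT, size INTEGER)")
    rows = [
        ("alpha", "2007-10-03", 10), ("alpha", "2007-10-03", 5),
        ("alpha", "2008-01-01", 20),
        ("beta", "2007-10-03", 7), ("beta", "2008-04-01", 3),
        ("beta", "2008-02-15", 99),  # not one of the four dates, so ignored
    ]
    conn.executemany("INSERT INTO vdisks VALUES (?, ?, ?)", rows)
    return conn


def totals_looped(conn):
    """The slow way: one SUM query per host per date."""
    hosts = [r[0] for r in conn.execute("SELECT DISTINCT hostname FROM vdisks")]
    totals = {}
    for host in hosts:
        for day in QUARTERS:
            (total,) = conn.execute(
                "SELECT SUM(size) FROM vdisks WHERE hostname = ? AND date = ?",
                (host, day),
            ).fetchone()
            if total is not None:
                totals[(host, day)] = total
    return totals


def totals_grouped(conn):
    """The fast way: let the database group by host and date in one query."""
    marks = ", ".join("?" for _ in QUARTERS)
    sql = (
        "SELECT hostname, date, SUM(size) AS total FROM vdisks "
        f"WHERE date IN ({marks}) GROUP BY hostname, date"
    )
    return {(host, day): total for host, day, total in conn.execute(sql, QUARTERS)}


if __name__ == "__main__":
    db = build_demo_db()
    print(totals_grouped(db) == totals_looped(db))
```

Both functions produce the same totals; the grouped version simply hands the looping to the database, which is where the speed-up comes from.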

It takes at least a couple of years for a fledgling sys admin to build up his or her experience to a level where people will say: "Yeah! He's a good sysadmin.. He knows his way around the OS."

Most of the time of your first two or three years (assuming that you start admining in college) will be spent either with your nose in the books (learning new stuff) or with your nose to the grind stone (practicing the new stuff). A lot of time will be spent on basic grunt work, combined with maybe a couple of nice projects and some programming. But at some point in time a dreaded new word will drop on you like a brick from up on high... Management level that is...

_certification_

At first, official vendor certification may seem like a humongous task! Especially if you take a look at the requirements that the vendor publishes on its website and at the sheer volume of the prep-books available. I had the same problem! One day my Field Manager mentioned that official certificates would look good on my resume and that I should go order a book or two... Which I did... And which I subsequently tried to read three times over... And just could not get through...

You see, I made the fatal mistake of wanting to cram everything in my head before even setting a date for the exam. This gave me way too much slack, causing me to lose interest at least two times over. So, after a bit of coaching from one of my friends/colleagues I came to the following conclusion on how to prepare for certification.

1. Get some experience :) Don't try to get certified immediately after being introduced to a new OS.

2. Take a look at the vendor's requirements for the certificate. These are usually published on their website.

3. Order one, maybe two good study books. I've created a small list of which books are good and which ones should be avoided.

4. Make a rough guesstimate of how long you think you'll take studying. Don't make this any longer than two months, else you'll simply lose interest.

5. Order an exam voucher from your vendor.

6. Schedule the exam.

7. Start studying.

There's also a couple of other things that can really help you get the knack of things, ensuring that you'll be absolutely ready for the exam:

* Ask your employer to provide a sandbox system: a simple, small server which you are free to tinker with, configure, play with and break. This is an invaluable study tool!

* Purchase an account for a practice exam website (or get your employer to pitch in). The guys at Unixporting.com provide damn good test exams for Solaris Sysadmin 1 and 2, at a low price!

Most important of all: don't sweat it! A little excitement or a couple of shivers are good, but honestly: the fate of the world does not rest on your shoulders. If you don't pass the exam, try, try, try and try again. :)

When getting certified, one of your most important tools is the cram session. With books.. You know: dead trees? treeware? those big leafy things which you read?... But you gotta know which ones are good and which ones to avoid like the plague.

Sun Solaris SCSA 1 and 2

Avoid this one. This is the book I bought at first, as it got some good reviews at Amazon.com. It was also the one that I tried getting through three times over *ugh* Honestly, the book is written in a very dull style, but worst of all: it really isn't much of a cram book, since the author misses almost all of the important stuff for the exams. Way too little detail, so I wouldn't recommend it to anyone except the starting Solaris sysadmin who needs a place to begin.

Now _this_ is what I'm talking about! My colleague Martijn recommended this book and it really _does_ cover everything you need to know to ace the exam, plus a little more. The authors don't brush over any subject and take on each and every topic in detail. Yes, it's a big book and it may take you a while to get through it, but it's worth it. The exams included on the CD are a bit dodgy and are only good for one, maybe two attempts. In any case I recommend that you go out and get an account for a trial exam site.

Sun SCNA (TCP/IP Network admin)

Martijn also tipped me off about this book; apparently he aced the test with it. I have to admit that the book _does_ take its time in explaining everything to you and that Rick doesn't leave out any details. I have to warn you though that the author also made a couple of mistakes, that he likes repetition (sometimes a little too much) and that at times he underestimates the exam (tells you that you don't need to know what he's about to explain, when you do). All in all a good book, but I'm not too crazy about it.

In 2007 I got my LPI-1 certification. This certificate requires one to take two exams: LPI-101 and LPI-102. I've studied hard for both exams and created summaries of all of the stuff I had to learn. I thought I'd share my summaries with all of the other LPI students. I hope they are useful to you!

Back in 2004, when I originally studied for my SCNA certification, I wrote a big summary based on the course books. I thought I'd share this summary with the rest of this world's students. Even though it was meant for the Solaris 8 SCNA exam, it should still be useful.

EDIT: 23/11/2004
DO NOT USE THE FOLLOWING PROCEDURE! IT HAS PROVEN TO BE FAULTY AND SHOULD ONLY BE USED AS A GUIDELINE FOR MAKING YOUR OWN PROCEDURE!
I will try and correct all of the mistakes as soon as possible.. Please be patient...

At some point in time it may happen that your NIS+ master server has become too old or overloaded to function properly. Maybe you used old, decrepit hardware to begin with, or maybe you have been using NIS+ in your organisation for ages :) Anywho, you've now reached the point where the new hardware has received its proper build and the server is ready to assume its role as NIS+ master.

Of course you want things to go smoothly and with as little downtime as possible. One of the methods to go about this is to use the other procedure in this menu: "Rebuild your master". That way you'll literally build a new master server, after which you reload all of the database contents from raw ASCII dumps.

The other method would be by using the procedure below :) This way you'll transfer mastership of all your NIS+ databases from the current master to the new one. I must admit that I haven't used this procedure in our production environment as of yet (15/11/04), but I will in about a week! But even after that time, after I've added alterations and after I've fixed any errors, don't come suing me because the procedure didn't work for you. NIS+ can be a fickle little bitch if she really wants to...

This procedure requires that your new NIS+ master server is already a replica server. There are numerous books and procedures on the web which describe how to promote a NIS+ client into a replica, but I'll include that procedure in the menu sometime soon.

Before you begin, disable replication of NIS+ on any other replica servers you may have running. This is easily done by killing the rpc.nisd process on each of these systems. Beware though that all of the replicas do need to remain functioning NIS+ clients! This ensures that their NIS_COLD_START gets updated.

Log in to both the current master and the replica server you wish to upgrade. Become root on both systems.

Kill all NIS processes on both the master and the replica in question. Then restart on the replica using:
# /usr/sbin/rpc.nisd -S 0
# /usr/sbin/nis_cachemgr -i

Verify that the replica server is now recognised as the current master server by using the following commands.
# nisshowcache -v
# niscat -o groups_dir.`domainname`.
# niscat -o org_dir.`domainname`.
# niscat -o `domainname`.

If the replica system is not recognised as the master, re-run the for-loop which was described above. This will re-run the nismkdir command for each table that isn't configured properly.

In some cases you're going to want to use Net-SNMP on your Solaris hosts, while still being able to monitor Sun-specific SNMP objects. It took me a while to get all of this to work and it's a bit of a puzzle, but here's how to make it work.

In our current environment at $CLIENT we want to standardise all of our UNIX hosts to the Net-SNMP agent software. This will allow us to use a configuration file which can be at least 60% identical on each host, making life just a little bit easier for all of us. Unfortunately Net-SNMP isn't equipped to deal with all of Sun's specific SNMP objects, so we're going to have to make a few big modifications to the software.

Of course packaging all these changes into one big .PKG is the nicest way of ensuring that all required changes are made in one blow, so that's what I've done. Unfortunately I cannot share this package with you, since it contains quite a large amount of $CLIENT internal information. I may be tempted at another time to recreate a non-$CLIENT version of the package that can be used elsewhere.

Re-compiling Net-SNMP

The latest versions of Net-SNMP come with experimental LM_Sensors support for Sun hardware. Oddly, I've found that you need to drop one version below the latest version to get it to work nicely with Solaris 8. So here are the steps to take...

Download the source code for Net-SNMP version 5.2.3 from their website.

Move the .TGZ to your build system and unpack it in your regular build location. Also, building Net-SNMP successfully requires OpenSSL 0.9.7g or higher, so make sure that it's installed on your build system.

Run "make", "make test" and "make install" to complete the creation of Net-SNMP. If "make test" fails on every check, it is likely that your system is unable to find the requisite OpenSSL libraries. This may be solved by running:

After "make install" has finished all the Net-SNMP files have been installed on your build system. Naturally it's important to know which files to include in your package. To help you, I've created a list of the files that are installed.

Installing SUNWmasf and its components

PLEASE NOTE: SUNWmasf will currently (July of 2006) only get useful results on the following models: V210, V240, V250, V440, V1280, E2900, N210, N240, N440, N1280. On other systems you may have more luck using the LM_Sensors pieces of Net-SNMP. They have been tested to work on E450, V880 and 280R.

As I mentioned earlier Net-SNMP with LM_Sensors can only gather limited amounts of Sun specific information. That's besides the fact that it is also still an experimental feature. So we're going to need an alternative SNMP agent to gather more information for us. Enter the SUNWmasf package.

SUNWmasf and its components may be downloaded from the Sun Microsystems website. Either use this direct link (which may be subject to change), or go to www.sun.com/download and search for "Sun SNMP Management Agent".

You can opt to install SUNWmasf manually on each of your clients, but it would be much nicer to include it into your custom made package. To have a full list of all the files and symlinks that you should include, you can take a peek at the prototype file I made for the package. It includes all the files required for Net-SNMP.

Installation of the software couldn't be easier. Just run the following command, after extracting the .TAR.Z file that contains SUNWmasf.

Configuring SUNWmasf

Configuring Net-SNMP

The configuration file for Net-SNMP is located in /usr/local/share/snmp. You will need to make a whole bunch of changes over here that I won't cover, like security ACLs, SNMP trap hosts and bunches of other stuff. However, you _will_ need to add the following lines to allow Net-SNMP to talk to SUNWmasf.

Starting the software

Since SUNWmasf relies upon Net-SNMP, it will need to be started after that piece of software. The prototype file I mentioned earlier already takes this into account, but if you're not going to use it just make sure that /etc/init.d/masfd gets called _after_ /etc/init.d/snmpd during the boot process.

Also, I've noticed that SUNWmasf will need about thirty seconds before it can be read using commands like snmpget and snmpwalk.

Reading values from the agents

As you may well know, SNMP is a tangly web of numerical identifiers. I will make a nice overview of the various useful OIDs that you can use for monitoring through both LM_Sensors and SUNWmasf. However, I will put these in a separate document, since it falls outside the scope of this mini-howto.

Unfortunately I can currently only list details for two of the supported models, since I do not have test boxen for the other models. The following lists are only a small selection of all possible objects: the ones that we found interesting. A full list of available options can be obtained by running:

snmpwalk -c public localhost .1.3.6.1.2.1.47.1.1.1.1.2

Details about the structure of the various MIBs can be found in other articles in the Sysadmin section of my website. Just browse through the menu on the left. Point is that the lists below only list the OID _within_ the specific sub-trees (for example: .1.3.6.1.2.1.47.1.1.1.1.2.46). As I said: details on actually _reading_ these values will be contained in another document.

Unfortunately I can currently only list details for two of the supported models, since I do not have test boxen for the other models. The following lists are only a small selection of all possible objects: the ones that we found interesting. A full list of available options can be obtained by running:

snmpwalk -c public -m ALL localhost .1.3.6.1.4.1.2021.13

Details about the structure of the various MIBs can be found in other articles in the Sysadmin section of my website. Just browse through the menu on the left. Point is that the lists below only list the OID _within_ the specific sub-trees (for example: .1.3.6.1.4.1.2021.13.16.5.1.2.9). As I said: details on actually _reading_ these values will be contained in another document.

Sun Fire V240

| Object | Description | Unit |
|---|---|---|
| 2.1.2.1 and .2 | CPU[0-1] Core temperature | Integer * |
| 2.1.2.3 | SYSTEM Enclosure temperature | Integer * |
| 5.1.2.2 | SYSTEM Service required indicator | Integer |
| 5.1.2.5 | PSU[0-1] Service required indicator | Degrees |
| 5.1.2.10 .12 .14 and .16 | HDD[0-3] Service required indicator | Integer |
| 5.1.2.18 | Keyswitch | Integer |
| 5.1.2.4 and .7 | PSU[0-1] Activity (power?) | Integer |

*: In order to get the real temperature, you will need to divide the integer contained within this variable by 65.526. For some odd reason Net-SNMP does not store the real temperature in degrees Centigrade.

Sun Fire V440

| Object | Description | Unit |
|---|---|---|
| 2.1.2.1 .2 .3 and .4 | CPU[0-3] Core temperature | Integer * |
| 2.1.2.5 .6 .7 and .8 | CPU[0-3] Ambient temperature | Integer * |
| 2.1.2.9 | SCSI temperature | Integer * |
| .10 | MOBO temperature | Integer * |
| .98 .100 .102 and .104 | CPU[0-3] Core temperature | Degrees |
| .106 | MOBO temperature | Degrees |
| .107 | SCSI temperature | Degrees |
| 5.1.2.2 | SYSTEM Service required indicator | Integer |
| 5.1.2.6 and .10 | PSU[0-1] Service required indicator | Integer |
| 5.1.2.12 .14 .16 and .18 | HDD[0-3] Service required indicator | Integer |
| 5.1.2.20 | Keyswitch | Integer |
| 5.1.2.4 and .8 | PSU[0-1] Power OK | Integer |

*: In order to get the real temperature, you will need to divide the integer contained within this variable by 65.526. For some odd reason Net-SNMP does not store the real temperature in degrees Centigrade.
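If you're feeding these raw integers into a script, the conversion from the footnote boils down to a one-liner. Here's a tiny Python sketch; the function and constant names are my own invention, and the divisor is simply the one quoted above (I haven't verified it beyond my own boxen).

```python
# Divisor quoted in the footnote above; treat it as an assumption.
RAW_DIVISOR = 65.526


def raw_to_celsius(raw_value):
    """Convert the raw integer from the starred objects to degrees Centigrade."""
    return raw_value / RAW_DIVISOR


if __name__ == "__main__":
    # 2687 is a made-up raw reading, purely for illustration.
    print(round(raw_to_celsius(2687), 1))
```

Objects whose unit is listed as "Degrees" do not need this step.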

I have to admit that figuring out how all the parts of SNMP on Sun stick together took me a little while. Just like when I was learning Nagios it took me about a week of mucking about to gain clarity. Now that I've figured it out, I thought I'd share it with you...

First off, everything I will describe over here depends on the availability of two pieces of software on your clients: Net-SNMP and SUNWmasf. See the article on combining the two for further details on installing and configuring this software.

We should begin by verifying that you can read from each of the important pieces of the SNMP tree. You can verify this by running the following three commands on your client system. Each should return a long list of names, numbers and values. Don't worry if it doesn't make sense yet.

Please keep in mind that you should replace the word "public" in all the examples with the community string that you've chosen for your SNMP agents. It could very well be something other than "public".

Which witch is witch?

Now that we've made sure that you can actually talk to your SNMP agent, it's time to figure out which components you want to find out about. The easy way to find out all components that are available to you is by running the following command.

snmpwalk -c public localhost .1.3.6.1.2.1.47.1.1.1.1.2

Let me explain what the output of this command really means... The SNMP sub-tree MIB-2.47.1.1.1.1 (that's the .1.3.6.1.2.1.47.1.1.1.1 from the command above, since MIB-2 is .1.3.6.1.2.1) contains descriptive information of system-specific SNMP objects. Each object has a sub-object in the following sub-trees (each number follows after MIB-2.47.1.1.1.1).

| Sub-OID | Description | Sub-OID | Description |
|---|---|---|---|
| .1 | entPhysicalIndex | .9 | entPhysicalFirmwareRev |
| .2 | entPhysicalDescr | .10 | entPhysicalSoftwareRev |
| .3 | entPhysicalVendorType | .11 | entPhysicalSerialNum |
| .4 | entPhysicalContainedIn | .12 | entPhysicalMfgName |
| .5 | entPhysicalClass | .13 | entPhysicalModelName |
| .6 | entPhysicalParentRelPos | .14 | entPhysicalAlias |
| .7 | entPhysicalName | .15 | entPhysicalAssetID |
| .8 | entPhysicalHardwareRev | .16 | entPhysicalIsFRU |

In this case all the sub-objects under .2 contain descriptions of the various components that are human readable. What you need to do now is go through the complete list of descriptions to pick those elements that you want to access remotely through SNMP. You will see that each entry has a number behind the .2. Each of these numbers is the unique component identifier within the system, meaning that we are lucky enough to have the same identifier within other parts of the SNMP tree.
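Going through that list by eye gets old quickly, so here's a rough Python sketch that turns the snmpwalk output into a lookup table of component identifier to description. Fair warning: the exact output format of snmpwalk depends on your Net-SNMP version and on which MIBs are loaded, so the line format I parse here ("OID.index = TYPE: value") is an assumption, and the sample lines are made up rather than copied from a real box.

```python
import re

# Assumed line shape: "<oid>.<index> = <TYPE>: <value>"; the type prefix is optional.
LINE_RE = re.compile(
    r"^(?P<oid>\S+)\.(?P<index>\d+)\s*=\s*(?:[A-Za-z0-9-]+:\s*)?(?P<value>.*)$"
)


def parse_descriptions(walk_output):
    """Map each component identifier (last OID number) to its description."""
    components = {}
    for line in walk_output.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            value = match.group("value").strip().strip('"')
            components[int(match.group("index"))] = value
    return components


# Made-up sample in the assumed format; real descriptions will differ.
SAMPLE = """\
.1.3.6.1.2.1.47.1.1.1.1.2.98 = STRING: "example CPU0 core temperature sensor"
.1.3.6.1.2.1.47.1.1.1.1.2.106 = STRING: "example motherboard temperature sensor"
"""

if __name__ == "__main__":
    for index, descr in sorted(parse_descriptions(SAMPLE).items()):
        print(index, descr)
```

The number you get back as the key is the same identifier you'll reuse in the other parts of the SNMP tree.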

Getting some useful data

Aside from the fact that the sub-OID we have found for our object is used in other parts of the tree, there's another parameter that makes its return. The character string in .7 is reused in the SUN MIB as well, as you will see in a moment.

Let's see what happens when we take our sub-OID .98 to the SUN MIB tree...

Take a look at 2.1.5.98... Looks familiar? At least now you're sure that you're reading the right sub-object :) The list in the example above looks quite complicated, but there's a little help in the shape of a .PDF I once made. This .PDF shows the basic structure of the objects inside enterprises.42.2.70.101.1.1.

You should immediately notice though that the returns of the command are divided into three groups: ...101.1.1.2, ...101.1.1.6 and ...101.1.1.8. Matching these groups up to the .PDF you'll see that these groups are respectively sunPlatEquipmentTable (which is an expansion on the information from MIB-2), sunPlatSensorTable (which contains a description of the sensor in question) and sunPlatNumericSensorTable (which contains all kinds of real-life values pertaining to the sensor).

In this case the most interesting sub-OID is enterprises.42.2.70.101.1.1.8.1.4.98, sunPlatNumericSensorCurrent, which obviously contains the current value of the sensor readings. Putting things into perspective this means that the core temperature of CPU0 at the time of the snmpwalk was 41 degrees centigrade.
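Since the component identifier is simply tacked onto the end of the sub-tree, building the OID for any sensor is plain string work. A small sketch in Python: the helper names are hypothetical, "enterprises" is the standard .1.3.6.1.4.1 prefix, and the command it builds just mirrors the shape of the snmpwalk examples in this article (your Net-SNMP setup may need extra options, such as an SNMP version flag).

```python
ENTERPRISES = ".1.3.6.1.4.1"  # standard prefix behind the name "enterprises"

# sunPlatNumericSensorCurrent, as named in the text above.
SUN_PLAT_NUMERIC_SENSOR_CURRENT = ENTERPRISES + ".42.2.70.101.1.1.8.1.4"


def sensor_current_oid(component_index):
    """Full numeric OID holding the current reading of one component."""
    return f"{SUN_PLAT_NUMERIC_SENSOR_CURRENT}.{int(component_index)}"


def snmpget_command(component_index, community="public", host="localhost"):
    """Argument list for an snmpget call; hypothetical helper, not run here."""
    return ["snmpget", "-c", community, host, sensor_current_oid(component_index)]


if __name__ == "__main__":
    # 98 is the sub-OID used for the CPU0 core temperature in this article.
    print(" ".join(snmpget_command(98)))
```

Swap the 98 for any other identifier you found in the list of descriptions to read that component instead.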

Going on from there

So... Now you know how to find out the following things:

What Sun-specific system components are at your disposal?

What unique identifier is used to refer to the component in question?

What is the current value of the component in question?

You can now do loads of things! For example, you can use your monitoring software to verify that certain values don't exceed a set limit. You wouldn't want your CPUs to get hotter than 65 degrees now, would you?
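As an example of such a check, here's a minimal threshold test in Python. It's only a sketch: the function name and the five-degree warning margin are my own assumptions, the 65 degree limit is the one from the paragraph above, and the return codes follow the usual 0/1/2 convention for OK, warning and critical.

```python
OK, WARNING, CRITICAL = 0, 1, 2


def check_cpu_temperature(celsius, limit=65.0, warn_margin=5.0):
    """Return (exit_code, message) for a temperature reading in degrees C.

    limit comes from the text; warn_margin is an arbitrary assumption.
    """
    if celsius > limit:
        return CRITICAL, f"CPU temperature {celsius:.0f}C exceeds {limit:.0f}C"
    if celsius > limit - warn_margin:
        return WARNING, f"CPU temperature {celsius:.0f}C is close to {limit:.0f}C"
    return OK, f"CPU temperature {celsius:.0f}C is fine"


if __name__ == "__main__":
    # 41 is the CPU0 reading mentioned earlier in this article.
    code, message = check_cpu_temperature(41)
    print(message)
    raise SystemExit(code)
```

Feed it the value you read from the current-value sub-OID and let your monitoring tool act on the exit code.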

For some reason unknown to me Sun has always kept their MIB file rather closed and hard to find. There's no place you can actually download the file. You will have to extract the file from the SUNWmasf package if you want to take a look at it.

To help us sysadmins out I've published the file over here. I do not claim ownership of the file in any way. Sun has the sole copyright of the file. I just put it here, so people can easily read through the file.

Monitoring Dell and HP systems through SNMP is as big a puzzle as using SNMP on Sun Microsystems' boxen. Luckily I've come a long way into figuring out how to use Net-SNMP together with HP's SIM and Dell's OpenManage.

Just like with our Solaris boxen, we want to use the Net-SNMP daemon as the main daemon on our Linux systems. At $CLIENT we use Red Hat ES3 on a great variety of Dell and HP hardware. And as was the case with SUNWmasf on Solaris, we're going to need both Dell's and HP's custom SNMP agents to monitor our hardware-specific SNMP objects. Enter SIM and OpenManage. In the next few paragraphs I'll tell you all about installing and configuring the whole deal.

Naturally it would be great if you could package all of these files into one nice .RPM, since that'll make the whole installation process a snap. Especially if you want to roll it out across hundreds of servers. I'll be making such a package for $CLIENT, but unfortunately I cannot distribute it (which is logical, what with all the proprietary info that goes into the package). Maybe, some day I'll make a generic .RPM which you guys can use.

Installing HP SIM and its components.

Just like everyone else HP also chooses to hide the installer for their SNMP agent quite deeply into their website. You will need to go to their download site and browse to the software section for your model of server. Once there you choose "Download drivers and software" and you pick your Linux flavour (in our case RHEL3). From there go to "Software - Systems management" where you can finally choose "A Collection of SNMP Protocol Tools from Net-SNMP for $YOUR_FLAVOUR". *phew* To help you get there, here's the direct link to the RHES3 version of the package.

As the file name (net-snmp-cmaX-5.1.2) suggests, this package is a modified version of the net-SNMP daemon which has added support for a whole bunch of Compaq and HP stuff. But as you can see the version of net-SNMP used is way behind today's standards, so it's wisest to use this daemon while proxied through a more current version of net-SNMP. The crappy thing though is that HP's package installs their net-SNMP in exactly the same location as our own net-SNMP. Don't worry, we'll get to that.

The download page doesn't make this immediately clear, but you'll need to download five (or six if you want the source) files. For your convenience, HP has decided to put all files into a pull-down menu, with one "Download" button. Yes, very handy indeed. =_= Another neat thing is that, for some reason, the combination Safari+Realplayer decides that -they- need to open the .RPM file that's loaded. Very odd and I've never encountered this before with other RPMs.

Because we're going to use two versions of net-SNMP that use the same locations on your hard drive, we're going to have to fiddle around a bit.

First copy these two RPMs to your system: net-snmp-cmaX and net-snmp-cmaX-libs. Install them using RPM, starting with libs and ending with the basic package. Now do the following.

You've now made sure that all parts that are required for the HP SNMP agent are safe from being overwritten by the "real" net-SNMP.

You can now install net-SNMP using the instructions laid out in the following paragraph.

Re-compiling Net-SNMP

PLEASE NOTE: If you're going to use HP SIM, please install that -first- before proceeding. See below for details.

Basically, recompiling Net-SNMP for your Linux install follows the same procedure as the recompilation on Solaris.

Download the source code for Net-SNMP version 5.2.3 (or a newer version, if you wish) from their website.

Move the .TGZ to your build system and unpack it in your regular build location. Also, building Net-SNMP successfully requires OpenSSL 0.9.7g or higher, so make sure that it's installed on your build system.

Run "make", "make test" and "make install" to complete the creation of Net-SNMP.

After "make install" has finished all the Net-SNMP files have been installed on your build system. Naturally it's important to know which files to include in your package. I will make a full listing of all files RSN(tm)..

Installing Dell OpenManage and its components.

I had a hard time finding the installer files for Dell OM on Dell's download site, until I finally figured out how their "logic" works. :D You can get Dell OM 4.5 for Linux through this direct link (which can be changed at any time by Dell), or you can search their downloads page using the term "openmanage server agent". Adding the key word "linux" seems to confuse it though, so you're going to have to manually search through the list.

Unfortunately I never did get around to using Dell OpenManage, so I cannot give you the installation instructions ;_;

Configuring HP-SIM

The configuration file for HP's version of net-SNMP is stored in /etc/snmp, unlike the version that'll be used by our own net-SNMP. Edit HP's config file and remove all the current content. Replace it with the following:

rocommunity public 0.0.0.0
agentaddress 1163

You will not have to make any further changes. The init-script and such can remain unchanged.

Configuring Dell OpenManage

Again, unfortunately I cannot give you instructions on working with OpenManage since I ran out of time.

Configuring Net-SNMP

The configuration file for Net-SNMP is located in /usr/local/share/snmp. You will need to make a whole bunch of changes over here that I won't cover, like security ACLs, SNMP trap hosts and bunches of other stuff. However, you _will_ need to add the following lines to allow Net-SNMP to talk to HP SIM and/or OpenManage.
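As a minimal sketch of what such lines could look like, assuming HP's agent listens on port 1163 (the agentaddress set in HP's config) and answers to the "public" community: Net-SNMP's proxy directive can hand the whole HP/Compaq subtree (.1.3.6.1.4.1.232) over to it. The helper name print_proxy_line and the choice of SNMP v1 are my own assumptions, so check them against your own setup before pasting the result into snmpd.conf.

```shell
# Hypothetical helper: prints one snmpd.conf "proxy" line that forwards
# an OID subtree to a sub-agent listening on another port.
print_proxy_line() {
    community="$1"   # community string of the sub-agent (assumed: public)
    target="$2"      # host:port of the sub-agent (assumed: localhost:1163)
    subtree="$3"     # OID subtree to hand over to the sub-agent
    printf 'proxy -v 1 -c %s %s %s\n' "$community" "$target" "$subtree"
}

# HP/Compaq enterprise subtree, served by HP's agent on port 1163.
print_proxy_line public localhost:1163 .1.3.6.1.4.1.232
```

The printed line can be appended to the snmpd.conf in /usr/local/share/snmp once you're happy with it.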

Starting the software

Make sure that you start Net-SNMP before OpenManage or SIM. These sub-agents rely on Net-SNMP to be running, so that one needs to go first. Take care of this order using the RC scripts of your particular Linux flavour.

Right now I've only got a very limited number of different models to test all this stuff on, so bear with me :) The following lists are only a small selection of all possible objects, the ones that we found interesting. A full list of available options can be obtained by running:

snmpwalk -c public localhost .1.3.6.1.4.1.232

I've tried my best at making the more interesting parts of the HP and Dell MIBs legible. The results can be found in the PDF, in the menu on the left. But once again, these lists are only a small subset of the complete MIB for both vendors. You won't know all that's available to you unless you start digging through the flat .TXT files yourself. Unlike Sun, HP and Dell -do- publish their MIB files freely, so you'll have no trouble finding them on the web.

On the monitoring of disks.

Unfortunately, HP and Compaq have made it impossible to monitor hard disk statuses without add-on software. The plain vanilla SNMP agent has no way of filling the relevant objects. Instead it requires the CPQarrayd add-on.

If you do choose to install this piece of software, you can find all the objects regarding -internal- drives under OID .1.3.6.1.4.1.232.3.2.5.2 (cpqDaPhyDrvErrTable). Refer to CPQIDA.MIB.txt for all relevant details and a full listing of the appropriate OIDs.

Currently I have no way of making sure, but I assume that the alert message for HDD[0-7] can be found in .1.3.6.1.4.1.232.3.2.5.2.1.15.[0-7]. Any value above 0 indicates a failure.

Basic Object Identifiers

All object IDs below fit under .1.3.6.1.4.1.232. These objects should be usable on every HP system in the DL/ML range, although I have only tested them on DL380, DL385, DL580 and ML570.

Object              | Description            | Values
.1.2.2.1.1.6.OID    | CPU[0-3] status        | 1/2 = ok, 3 = warn, 4 = crit
.3.2.2.1.1.6.OID    | HDD controller         | 1/2 = ok, 3 = warn, 4 = crit
.3.2.3.1.1.11.OID   | LDD[0-X] status        | 1/2 = ok, 3 = warn, 4 = crit
.3.2.4.1.1.6.OID    | Hot spare HDD status   | >2 = crit
.3.2.5.1.1.37.OID   | HDD[0-X] status        | 1/2 = ok, 3 = warn, 4 = crit
.5.2.2.1.1.12.OID   | SCSI controller status | 1/2 = ok, 3 = warn, 4 = crit
.5.2.3.1.1.8.OID    | SCSI LDD[0-X] status   | 1/2 = ok, 3 = warn, 4 = crit
.5.2.4.1.1.26.OID   | SCSI HDD[0-X] status   | 1/2 = ok, 3 = warn, 4 = crit
.6.2.6.7.1.9.OID    | Fan status             | 1/2 = ok, 3 = warn, 4 = crit
.6.2.6.8.1.4.1      | CPU0 temperature       | Contains current temperature
.6.2.6.8.1.4.4      | CPU1 temperature       | Contains current temperature
.6.2.6.8.1.4.5      | PSU temperature        | Contains current temperature
.6.2.9.3.1.4.0.OID  | PSU[0-X] status        | 1/2 = ok, 3 = warn, 4 = crit
.14.2.2.1.1.5.OID   | IDE HDD[0-X] status    | 1/2 = ok, 3 = warn, 4 = crit
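Most of the status objects above share the same value mapping, so a monitoring script can translate them in one place. A minimal sketch: the function name hp_status_text is made up for this example, and treating every other value as "unknown" is my own assumption. The hot spare object follows a different rule (>2 = crit), so it is not covered here.

```shell
# Translate a raw HP status value into the states used in the table above.
hp_status_text() {
    case "$1" in
        1|2) echo "ok" ;;
        3)   echo "warn" ;;
        4)   echo "crit" ;;
        *)   echo "unknown" ;;   # assumption: anything else is unexpected
    esac
}

# Example: a fan status object (.6.2.6.7.1.9.OID) that returned the value 3.
hp_status_text 3    # prints "warn"
```

Feed it the value that snmpget or snmpwalk returns for the object you're watching.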

Fan and sensor placement

As I already said, most of the OIDs from the tables above can be used to monitor vanilla HP systems (with the exceptions of the hard disks). The biggest difference lies in the placement of certain fans and sensors. The table below outlines the various locations, depending on the model.

Each system contains multiple fans and temperature sensors and will thus have multiple instances of these objects in its SNMP tree. The locations for each of these instances can be read from .6.2.6.7.1.3.OID (fans) and .6.2.6.8.1.3.OID (temperature sensors). The $OID part of these numeric sequences is always .1.1, .1.2, .1.3, .1.4 and so on.

It's been way too long since I used these scripts. I believe they stem from 2003.

I'll need to write some more about them later. For now, know that I used these scripts to prepare for data migrations between local systems and SAN boxen. We moved from local to EMC2, then we moved from EMC2 to HP XP-1024.

At $CLIENT I've built a centralised logging environment based on Syslog-ng, combined with MySQL. To make any use of all the data going into the database we use PHP-syslog-ng. However, I've found a bit of a flaw with that software: any account you create has the ability to add, remove or change other accounts... Which kinda makes things insecure.

So yesterday was spent teaching myself PHP and MySQL to such a degree that I'd be able to modify the guy's source code. In the end I managed to bolt on some sort of "admin-mode" which allows you to set an "admin" flag on certain user accounts (thus giving them the capabilities mentioned above).

The updated PHP files can be found in the TAR-ball in the menu of the Sysadmin section. The only thing you'll need to do to make things work is to either:

Re-create your databases using the dbsetup.sql script.

Add the "admin" column to the "users" table using the following command. ALTER TABLE users ADD COLUMN baka BOOLEAN;

Just today I ran into something shiny that piqued my interest. A shell script I'd written in Bash didn't work like I expected it to, with regards to the scope of a variable. I thought the incident was interesting enough to report, although I won't go into the whole scoping story too deeply.

What it basically boils down to is that there was a difference in the way two shells handle a certain situation. A difference that I didn't expect to be there. Not that exciting, but still very educational.

Scope?

Yeah. In most programming languages variables have a certain range within your program, within which they can be used. Some variables only exist within one subroutine, while others exist across the whole program or even across multiple parts of the whole.

In shell scripting things aren't that complicated, luckily. In most cases a variable that's set in one part of the script can be used in every other part of the script. There are some notable exceptions, one of which I ran into today without realising it.

The real code

My situation:
I have a command that outputs a number of lines, some of which I need. The lines that I'm interested in consist of various fields, two of which I need as variables. Depending on the value of one of these variables, a counter needs to be incremented.

I guess that sounds kinda complicated, so here's the real code snippet:

Where it goes wrong

While testing my script, I found out that $COUNT would never retain the value it gained in the while-loop. This of course led to the script always failing the check. After some fiddling about, I found out that the problem lay in the use of the while loop: it was being used at the end of a pipe.

To illustrate, the following -does- work.

let COUNT=0
while read i
do
let COUNT=$COUNT+$i
echo $COUNT
done

echo "Total is $COUNT."

This leads to the following output.

$ ./baka.sh
1
1
2
3
3
6
4
10
^D
Total is 10.

However, if I were to create a script called neko.sh that outputs the numbers one through four on separate lines, which is then used in baka.sh... well... it doesn't work :D Regardez!
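A rough reconstruction of that situation (a sketch, not the original script): neko is written here as a shell function instead of a separate neko.sh, purely to keep the example self-contained.

```shell
# Stand-in for neko.sh: prints the numbers one through four, one per line.
neko() {
    printf '%s\n' 1 2 3 4
}

COUNT=0
neko | while read i
do
    # In Bash this loop runs in a sub-shell, so it works on a copy of COUNT.
    COUNT=$((COUNT + i))
done

# Bash (with default settings) prints 0 here; AT&T ksh prints 10.
echo "Total is $COUNT."
```

Inside the loop the counter climbs to 10 just fine. It's only after "done" that the value turns out to be gone.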

Conclusions

After discussing the matter with two of my colleagues (one of them as puzzled as I was, and the other knowing what was going wrong) we came to the following conclusions.

Bash spawns a new sub-shell when piping commands together. Since Bash is picky about scoping variables to sub-shells, my script doesn't work like I expected it to.

Korn shell does -not- create a new sub-shell when piping stuff. It is therefore better suited for my current script. Which is no problem, since it's targeted at Solaris systems, which all come with ksh by default.

This conclusion is supported by an example in the "Advanced Bash-scripting guide" by Mendel Cooper. In the following example an additional comment is made about the scoping of variables with redirected while loops. The comment warns that older shells branch a redirected while into a sub-shell, but also tells that Bash and Ksh handle this properly.

I guess our version of Bash is too old :3

Work around

Either get a newer version of Bash, that does support the proper variable scoping, or

Use another shell, like Ksh, that supports the variable scoping that you need.
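A third option that I'd expect to work in both shells is to keep the while loop out of the pipeline altogether, by feeding it through a here-document. A minimal sketch, with a made-up neko function standing in for the real command:

```shell
# Stand-in for the real command: prints the numbers one through four.
neko() {
    printf '%s\n' 1 2 3 4
}

COUNT=0
while read i
do
    # No pipe involved, so the loop runs in the current shell.
    COUNT=$((COUNT + i))
done <<EOF
$(neko)
EOF

echo "Total is $COUNT."    # prints "Total is 10."
```

Bash 4.2 and later also have a "lastpipe" shell option (shopt -s lastpipe) that runs the last part of a pipeline in the current shell. As far as I know it only kicks in when job control is off, which is normally the case inside scripts.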

A word of thanks

I'd like to thank my colleagues Dennis Roos and Tom Scholten for spending a spare hour with me, hacking at this problem. And I'd like to thank Ondrej Jombik for pointing out the fact that this article didn't make my conclusions very clear in its original version.

A few days ago we had a "new" TruCluster installed, running Tru64 5.1b. All of the stuff on it was plain vanilla, which meant that we were bound to run into some trouble. Case in point: the EMC/Legato Networker installation.

Upon installation setld complained as follows:

==========

Your choice:

1 LGTOCLNT999 EMC NetWorker Client

cannot be installed as required subset IOSWWEURLOC??? is not available.

==========

As the name suggests (EURLOC) the missing files involve the additional European locales that are not part of the default installation.

After fighting and searching and swearing a lot I got things sorted out as follows:

1. Get the Tru64 CD-ROM that was used for the installation. You'll need the "Associated Products 1" CD.

2. Insert the CD into your system.

3. Mount the CD: mount -r /dev/disk/cdrom1c /mnt

4. cd /mnt/Worldwide_Language_Support/kit

5. setld -l `pwd` IOSWWEURLOC540

This will install the locale I needed. Of course you are free to substitute the names of other locales as well.

At $CLIENT we've been having some nasty problems with our development SAP box. The box is part of a Veritas cluster and actually runs a bunch of Solaris Zones. The problems originally started about two months ago when we ran into a rare and newly discovered bug in UFS. It took a while for us to get the proper patches, but we finally managed to get that sorted out.

Remco installed the patches on Thursday morning, though he ran into some trouble. As always, patches can give you crap when it comes to cross-dependencies and this time wasn't any different. Around lunch time we thought we had things sorted out and went for the final reboot. All the zones were transferred to the proper boxen and things looked okay.

Until we tried to make a network connection. D:

None of the zones had access to the network, even though their interfaces were up and running. We sought for hours, but couldn't find anything. And like us, Sun was in the dark as well. In the end Remco and Sun worked all night to get an answer. Unfortunately they didn't make it, so I took over in the morning. Lemme tell you, once I was in the middle of all the tech and the phone calls and the managers, I found some more respect for Remco. He did a great job all through Thursday!

Just before lunch both Sun and one of the other guys came up with the solution. That was an awesome coincidence :) Turns out that the problems we were having are caused by timing issues during the boot-up of the Solaris Zones. Because we let Veritas Cluster handle the network interfaces things turned sour. Things would've worked better if we'd let the Zone framework handle things.

The stopgap solution: freeze all cluster resources to prevent fail-over, then manually restart all virtual interfaces for the zones. And presto! It works again!

Happily we went to lunch, only to come back to more crap!

Turns out that the five SAP instances we were running wouldn't fit into the available swap space anymore. Weird! Before yesterday, things would barely fit in the 30GB of swap space. And now all of a sudden SAP would eat about 38GB! o_O WTF?!

A whole bunch of managers wanted us to work through the whole weekend to sort everything out. Naturally we didn't feel too enthused, let alone the fact that the box's SLA doesn't cover weekend work.

In the end we tacked on some temporary swap space, started SAP and left for the weekend. We'll have to take more downtime on Monday for granted. It also leaves us with two big things to fix:

This script is an evolution of my earlier check_ntp_config. This time it's meant for use with Tivoli, although modifying it for use with Nagios is trivial. The script was written to be usable on at least five different Unices, though I've been having trouble with Darwin/OS X.

The script was tested on Red Hat Linux, Tru64, HP-UX, AIX and Solaris. Only Darwin seems to have problems.

Just like my other recent Nagios scripts, check_ntpconfig.sh comes with a debugging option. Set $DEBUG at the top of the file to anything larger than zero and the script will dump information at various stages of its execution.

Last night's planned change was supposed to last about two hours: get in, install some patches, switch some cluster resources around the nodes, install some more patches and get out. The fact that the installation involved a HP-UX system didn't get me down, even though we only work with Sun Solaris and Tru64. The fact that it involved a ServiceGuard cluster did make me a little apprehensive, but I felt confident that the procedures $CLIENT had supplied me would suffice.

Everything went great, until the 80% mark... Failing the applications back over to their original node failed for some reason and the cluster went into a wonky state. The cluster software told me everything was down, even though some of the software was obviously still running. The cluster wouldn't budge, not up, nor down. And that's when I found out that I rather dislike HP ServiceGuard, all because of one stupid flaw.

You see, all the other cluster software I know provides me with a proper listing of all the defined resources and their current state. Sun Cluster, Veritas Cluster Server and TruCluster? All of them are able to give me a neat list of what's running where and why something went wrong. Well, not HP Damn ServiceGuard. Feh!

We ended up stopping the database manually and resetting all kinds of flags on the cluster software. Finally, after six hours (instead of the original two), I got off from work around 23:00. Yes... /me heartily dislikes HP ServiceGuard.

Oy vey! One of the folks on the Sun Fire V890 must've been mesjoge! Why else would you decide to make such a weird design decision?!

What's up? I'll tell you what's up!

For some reason the design team decided to throw out the RJ45 console port that's been a Sun standard for nigh on ten years. And what did they replace it with? A DB25 port commonly seen in the Mesozoic Era! Good lord! This left me stranded without the proper cable for this morning's installation (thankfully I could borrow one). However, it also requires us to get completely new and different cables for our Cyclades console server!

1. I'm not overly familiar with the OGB and the Open Solaris project's modus operandi. I'm going to bone up on those subjects tonight.

2. It seems that the Dutch branch of the OS project doesn't even notice much of the OGB's dealings. When I asked one of "our" leading guys about some recent dealings he hadn't actually even heard of them yet.

Now... On with the show.

When it comes to the Open Solaris project I'm having mixed feelings. On the one hand Solaris and its step-sister Open Solaris are my favourite "true" UNIX and I really want to see the OS be a successful one. I feel at home in the OS, I admire the great improvements Sun and the community make to the OS and Solaris has almost never let me down (maybe one or two occasions).

But then there's discussions such as these: a few members of the Open Solaris community propose to build an official binary distribution (dubbed Project Indiana) and they have executive backing from Sun. The first reply is a rather constructive one: it tells what's wrong with the proposal and why it won't be accepted (in its current form) by the OGB. But then the whole discussion derails with post upon post of bureaucracy, going back and forth about which rules should be applied to whom and what in which situations and at what times... Etc, etc...

While I'm all in favor of having strict project management and of handling your business in an organised and procedural manner, one can go too far. Linux has always felt a little bit too organic to me, although they do seem to get the job done in a rather good way. But the way the Open Solaris group works seems just way too convoluted to me. I hope that it's just a matter of streamlining things over the coming months/years and that things will loosen up a little by then.

Now that I've gotten my mitts on an Intel Macbook I've also started dabbling with Parallels Desktop, a piece of software that'll let you run a whole bunch of virtual machines inside Mac OS X. For my work it's rather handy to have a spare Solaris system lying around, so I went with the Solaris Express image that I mentioned a few weeks ago. And now that it's about time for me to get started on my LPIC-2 exam it's also handy to have at least one Linux at hand.

Enter a pre-installed and configured Fedora Core 6 image for Parallels. At only ~730MB in size that really isn't that bad. Saves me a lot of trouble as well.

Just be sure to set your RAM at 512 MB. Any higher is supposed to crash FC, according to this OS X hint.

EDIT:

Tried it with my last day of the Parallels demo. It works like a charm :)

Well, I have finally unsubscribed myself from the Nagios mailing lists. It was great being a member of those lists while I was working with the software on a daily basis, but these days I've put Nagios behind me. I haven't written one line of Nagios monitoring code for months now.

The past two weeks we've been having a rather mysterious problem with one of our TruClusters.

During hardware maintenance of the B-node we moved all cluster resources to the A-node to remain up and running. Afterwards we let TruCluster balance all the resources so performance would benefit again. Sounds good so far and everything kept on working like it should.

However, during some nights the A-node would slow to a crawl, not responding to any commands and inputs. We were stumped, because we simply couldn't find the cause of the problem. The system wasn't overloaded, with a low load average. The CPU load was a bit remarkable, with 10% user, 50% system and the rest in idle. The network wasn't overloaded and there was no traffic corruption. None of the disks were overloaded, with just two disks seeing moderate to heavy use. It was a mystery and we asked HP to help us out.

After some analysis they found the cause of the problem :) One of the applications that was failed over to the A-node included two file systems. After the balancing of resources these file systems stuck with the A-node, while the application moved back to the B-node. So now the A-node was serving I/O to the B-node through its cluster interconnect! This also explains the high System Land CPU load, since that was the kernel serving the I/O. :D

We'll be moving the file systems back to the B-node as well and we'll see whether that solves the issues. It probably will :)

Ever since Apple switched to Intel processors in their systems and Parallels came out with their Parallels Desktop software it's been possible to run Windows, Linux and other Unices inside virtual machines on your Mac. That's totally great, since it allows you to run various test systems without needing additional hardware!

A lot of people also got Solaris 10 to run in PD, although some ran into a little bit of trouble. Well, not anymore! Sun has created a pre-installed Solaris Express image for use with Parallels Desktop. This allows you to immediately get up and running with Solaris, without even having to go through any of the normal installation hoops.

One of the obvious down sides to using a scripting language like ksh as opposed to a "real" programming language like Perl or PHP (or C for that matter) is that, for each command that you string together, you're forking off a new process.

This isn't much of a problem when your script isn't too convoluted or when your dataset isn't too large. However, when you start processing 40-50MB log files with multiple FOR loops containing a few IF statements for each line, then you start running into performance issues.

And as I'm running into just that I'm trying to find ways to cut down on the forking, which means getting rid of as many IFs and pipes as possible. Here's a few examples of what has worked for me so far...

Instead of running:

[ expr1 ] && command1

[ expr2 ] && command1

Run:

[ expr1 -a expr2 ] && command1

Why? Because a single call to test now handles both expressions, instead of one call per expression. If you have multiple commands that complement each other, then you ought to be able to fit them into a set of curly braces after the test, cutting down on more forks.

Instead of running:

if [ `echo $STRING | grep $QUERY | wc -l` -gt 0 ]; then

Run:

if [ ! -z "`echo $STRING | grep $QUERY`" ]; then
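Going one step further, a plain substring check doesn't need grep at all: the shell's own case statement does pattern matching without forking anything. A minimal sketch, assuming $QUERY is a literal string rather than a regular expression (the values below are made up):

```shell
STRING="Jan 12 03:14:15 host sshd[42]: Failed password for root"
QUERY="Failed password"

# The quotes around $QUERY make the shell treat it as literal text.
case "$STRING" in
    *"$QUERY"*) MATCH=yes ;;
    *)          MATCH=no ;;
esac

echo "Match: $MATCH"    # prints "Match: yes"
```

This only replaces grep for simple "does it contain this text" checks. Real regular expressions still need grep or something similar.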

More ideas to follow soon. Maybe I ought to start learning a "real" programming language? :D

EDIT:

OMG! I can't believe that I've just learnt this now, after eight years in the field! When using the Korn shell use [[ expr ]] for your tests as opposed to [ expr ].

Why? Because the [ expr ] form is a throw-back to Bourne shell compatibility: it is treated like any other command (in old Bourne shells even an external test binary), whereas [[ expr ]] is part of the Korn shell's own syntax. This should speed things up!

When writing shell scripts for my customers I always try to be as clear as possible, allowing them to modify my code even long after I'm gone. In order to achieve this I usually provide a rather lengthy piece of opening comments, with comments added throughout the script for each subroutine and for every switch or command that may be unclear to the untrained eye.

In general I've found that it's best to have at least the following information in your opening blurb:

* Who made the program? When was it finalised? Who requested the script to be made? Where can the author be reached for questions?

* A "usage" line that shows the reader how to call the program and which parameters are at his disposal.

* A description of what the program actually does.

* Descriptions for each of the parameters and options that can be passed to the script.

* The limitations imposed upon the script. Which specific software is needed? What other requisites are there? What are the nasty little things that may pop up unexpectedly?

* What are the current bugs and faults? The so-called FIXMEs.

* A description of the input that the program takes.

* A description of the output that the program generates.

Equally important is the inclusion of debugging capabilities. Of course you can start adding "echo" lines at various, strategic points in the script when you run into problems, but it's oh-so-much nicer if they're already in there! Adding those new lines is usually a messy affair that can make your problems even worse :( I usually prepend the debugging commands with "[ $DEBUG -eq 1 ] &&", which allows me to turn the debugging on or off at the top of the script using one variable.
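A minimal sketch of that pattern, wrapped in a small function so the test only has to be written once. The function name debug and the messages are made up for this example, and sending the output to stderr is my own habit rather than a requirement.

```shell
DEBUG=1    # set to 0 to silence all debugging output

debug() {
    # Print only when debugging is switched on; write to stderr so the
    # script's normal output stays clean.
    if [ "$DEBUG" -eq 1 ]; then
        echo "DEBUG: $*" >&2
    fi
    return 0
}

debug "Starting to process input"
echo "Normal output"
```

The explicit "return 0" is there on purpose: a bare "[ $DEBUG -eq 1 ] && echo ..." returns a non-zero status when debugging is off, which can trip up callers that check exit codes.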

And finally, for the more involved scripts, it's a great idea to write a small test suite. Build a script that actually takes the real script through its loops by automatically generating input and by introducing errors.

Two examples of scripts where I did all of this are check_suncluster and check_log3, with the new TEC-analysis.sh on its way in a few days.

So far, TEC-analysis.sh checks in at:

* 497 lines in total.

* 306 lines of actual code.

* 136 lines of comments.

* 55 lines of debugging code.

Approximately 38% of this script (136 + 55 = 191 of 497 lines) exists solely for the benefit of the reader and user.

Ruddy heck, what a day! All in all it took me around thirteen hours, but I've finally finished my LPIC-102 summary. 41 pages of Linuxy goodness, bound to drag me through the second part of my LPIC-1 exams.

Argh, now I'm off to bed. =_= *cough* Let's hope I don't get called for any stand-by work.

Today I was working on a shell script that's supposed to process multiple text files in the exact same manner. Usually you can get through this by running a FOR-loop where the code inside the loop is repeated for each file in a sequential manner.

Since this would take a lot of time (going over 1e6 lines of text in multiple passes) I wondered whether it wouldn't be possible to run the contents of the FOR-loop in parallel. I rehashed my script into the following form:

subroutine()
{
    contents of old FOR-loop, using $FILE
}

for file in "list of files"
do
    FILE="$file"
    subroutine &
done

This will result in a new instance of your script for each file in the list. Got seven files to process? You'll end up with seven additional processes that are vying for the CPU's attention.

On average I've found that the performance of my shell script was improved by a factor of 2.5, going from ~40 lines per three seconds to ~100 lines. I was processing seven files in this case.

The only downside to this is that you're going to have to build in some additional code that prevents your shell script from running ahead, while the subroutines are running in the background. What this code needs to be fully depends on the stuff you're doing in the subroutine.
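In the simplest case the built-in wait command is all the additional code you need: without arguments it blocks until every background job of the current shell has finished. A minimal sketch, where the body of subroutine and the dummy files are placeholders for the real work:

```shell
WORKDIR=$(mktemp -d)

subroutine() {
    # Placeholder for the real work: count the lines of one file.
    wc -l < "$1" > "$1.count"
}

# Create three small dummy input files.
for n in 1 2 3
do
    printf 'line\nline\n' > "$WORKDIR/file$n"
done

for file in "$WORKDIR"/file1 "$WORKDIR"/file2 "$WORKDIR"/file3
do
    subroutine "$file" &
done

wait    # do not run ahead until all background subroutines are done

# Only now is it safe to use the results: one .count file per input file.
RESULTS=$(ls "$WORKDIR"/*.count | wc -l | tr -d ' ')
echo "Finished jobs: $RESULTS"
rm -rf "$WORKDIR"
```

If the subroutines need to report back more than a file on disk, you'll need something more involved, but wait covers the "don't run ahead" part.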

Today I faced the task of replacing a failing hard drive in one of our Tru64 boxen. The disk was part of a disk group being used to serve plain data (as opposed to being part of the boot mirror / rootdg), so the replacement should be rather simple.

After some poking about I came to the following procedure. Those in the know will recognize that it's very similar to how Veritas Volume Manager (VXVM) handles things. This is because Tru64 LSM is based on VXVM v2.

The remirroring process will now start for all broken mirrors. Unfortunately there is no way of tracking the actual process. You can check whether the mirroring's still running with "volprint -ht -g $diskgroup | grep RECOV", but that's about it.

I've never been overly fond of HP-UX, mostly sticking to Solaris and Mac OS X, with a few outings here and there. Given the nature of one of my current projects however, I am forced to delve into HP's own flavour of Unix.

You see, I'm building a script that will retrieve all manner of information regarding firmware levels, driver versions and such so we can start a networkwide upgrade of our SAN infrastructure. With most OSes I'm having a fairly easy time, but HP-UX takes the cake when it comes to being backwards :[

You see, if I want to find out the firmware level for a server running HP-UX I have two choices:

1. Reboot the system and check the firmware revision from the boot prompt.

2. Use the so-called Support Tools Manager utility, called [x,m,c]stm.

CSTM is the command line interface to STM and thank god that it's scriptable. In reality the binary is a CLI menu driven system, but it takes an input file for your commands.

For those who would like to retrieve their firmware version automatically, here's how:

...

Uhm... FSCK! *growl* *snarl* What the heck is this?! For some screwed up reason my shell keeps on adding a NewLine char after the output of each command. That way a variable which gets its value from a string of commands will always be "$VALUE
". WTF?! o_O

As I promised a few days ago I'd also give you guys the quick description of how to add a new LUN to a Tru64 box. Instead of what I told you earlier, I thought I'd put it in a separate blog post instead. No need to edit the original one, since it's right below this one.

Adding a new LUN to a Tru64 box with TruCluster

1. Assign new LUN in the SAN fabric.

Not something I usually do.

2. Let the system search for new hardware.

hwmgr scan scsi

3. Label the "disk".

disklabel -rw $DISK

4. Add the disk to a file domain (volume group).

mkfdmn $DISK $DOMAIN

5. Create a file set (logical volume).

mkfset $DOMAIN $FILESET

6. Create a file system.

Not required on Tru64. Done by the mkfset command.

7. Test mount.

Mount.

8. Add to fstab.

vi /etc/fstab

Also, if you want to make the new file system fail over with your clustered application, add the appropriate cfsmgr command to the stop/start script in /var/cluster/caa/bin.
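Strung together, steps 2 through 5 could look something like the sketch below. It only uses the commands listed above; the run wrapper, the DRY_RUN switch and the example names (dsk12, data_dmn, data_fs) are made up for illustration, and the exact form of the device name each command expects is something to check on your own system. With DRY_RUN=1 the commands are only printed, so the sequence can be reviewed before anything touches a live box.

```shell
DRY_RUN=1            # 1 = only print the commands, 0 = really execute them
DISK="dsk12"         # hypothetical disk name
DOMAIN="data_dmn"    # hypothetical file domain name
FILESET="data_fs"    # hypothetical file set name

run() {
    if [ "$DRY_RUN" -eq 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

run hwmgr scan scsi                 # step 2: search for new hardware
run disklabel -rw "$DISK"           # step 3: label the disk
run mkfdmn "$DISK" "$DOMAIN"        # step 4: add the disk to a file domain
run mkfset "$DOMAIN" "$FILESET"     # step 5: create the file set
```

The test mount and the fstab entry (steps 7 and 8) are left as manual steps, since they depend on where you want the file system to live.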

The past two weeks I've been learning new stuff at a very rapid pace, because my client uses only a few Solaris boxen and has no Linux whatsoever. So now I need to give myself a crash course in both AIX and Tru64 to do stuff that I used to do in a snap.

For example, there's adding a new SAN device to a box, so it can use it for a new file system. Luckily most of the steps that you need to take are the same on each platform. It's just that you need to use different commands and terms and that you can skip certain steps. The lists below show the instructions for creating a simple volume (no mirroring, striping, RAID tricks, whatever) on all three platforms.

I'll edit this post to add these instructions tomorrow, or on Friday. I still need to try them out on a live box ;)

Anywho. It's all pretty damn interesting and it's a blast having to almost instantly know stuff that's completely new to me. An absolute challenge! It's also given me a bunch of eye openers!

For example I've always thought it natural that, in order to make a file system switch between nodes in your cluster, you'd have to jump through a bunch of hoops to make it happen. Well, not so with TruCluster! Here, you add the LUN, go through the hoops described above and that's it! The OS automagically takes care of the rest. That took my brain a few minutes to process ^_^

This morning I went to my local Prometric testing center for my LPI 101 exam (part one of two, for the LPIC-1). Beforehand I knew I wasn't perfectly prepared, since I'd skipped trial exams and hadn't studied that hard, so I was a little anxious. Only a little though, since I usually test quite well.

Anywho: out of a maximum of 890 points I got 660, with 500 points being the minimum passing grade. Read item 2.15 on this page to learn more about the weird scoring method used by the LPI. It boils down to this: out of 70 questions I got 61 correct, with a minimum of 42 to pass. If we'd use the scoring method Sun uses, I'd have gotten an 87%. Not too bad, I'd say!

I did run into two things that I was completely unprepared for. I'd like to mention them here, so you won't run into the same problem.

1. All the time, while preparing, I was told that I'd have to choose a specialization for my exam: either RPM or DPKG. Since I know more about RPM I had decided to solely focus on that subject. But lo and behold! Apparently LPI has _very_ recently changed their requisites for the LPIC-1 exams and now they cover _both_ package managers! D:

2. In total I answered 98 questions, instead of the 70 that were advertised. LPI mentions on their website (item 2.13) that the extra ones are test questions, considered for inclusion in future exams. These questions are not marked as such and they do not count towards your score. It would've been nice if there had been some kind of screen or message warning me about this _at_the_test_site_.

Version 1.0 of my LPIC-101 study notes is available. I bashed it together using the two books mentioned below. A word of caution though: this summary was made with my previous knowledge of Solaris and Linux in mind. This means that I'm skipping over a shitload of stuff that might still be interesting to others. Please only use my summary as something extra when studying for your own exam.

Phew! That was a long night! I'm not used to staying up this late on weekdays =_=

I went to the first NLOSUG meeting tonight, like I said I would a few days ago. Aside from finally learning a little bit about Open Solaris (although most of it was basic community stuff) and some more in-depth stuff on ZFS, it was also very cool to meet some old acquaintances. There was a bunch of folks from Sun whom I hadn't seen in a long time, as well as Martijn and Job with whom I'd worked as colleagues a long time ago. Shiny :)

So the eve' was mostly for fun, with a little education thrown in. Well worth the hours I put in...

Many thanks to my colleague Guldan who pointed me towards a website giving a short description of using the BSD hardware-sensors daemon, together with Nagios in order to monitor your hardware. Using sensord should make things a lot easier for people running BSD, as they won't have to muck about with SNMP OIDs and so on.

Sun has made arrangements for the inaugural meeting of the Dutch Open Solaris Users Group. The meeting will be held on the evening of Thursday the 26th, at their office in Amersfoort.

Aside from the stuff you'd expect (like a few lectures on new Solaris features) you could also say it'll be a fun evening :) Meet some new people, have some food'n'drinks all mixed in with some interesting work-related stuff.

Damn! I'm really starting to hate Dependency Hell. Installing a few Nagios check scripts requires the Perl Net::SNMP module. This in turn requires three other modules. Each of these three modules requires three other modules, three of which require a C compiler on your system (which we naturally don't install on production systems). And neither can we use the port/emerge/apt-get alike Perl tools from CPAN, since (yet again) these are production systems. Augh!

While working on the $CLIENT-internal package for the Nagios client (net-SNMP + NRPE + Nagios scripts + Dell/HP SNMP agent), I've been learning about compound RPM packages. I.e., packages where you combine multiple source .TGZs into one big RPM package. This requires a little magic when it comes to running the various configure and make scripts. Luckily I've found two great examples.

Recently I've been trying to learn how to build my own packages, both on Solaris and on Linux. I mean, using real packages to install your custom software is a much better approach than simply working with .TGZ files. In the process I've found two great tutorials/books:

* Maximum RPM, originally written as a book by one of Red Hat's employees.

Boy lemme tell ya: making a nice SNMP configuration so you can actually monitor something useful takes a lot of work! :) The menu on the left has been gradually expanding with more and more details regarding the monitoring of Solaris (and Sun hardware) through SNMP. Check'em out!

After digging through Sun's MIB description (see SUN-PLATFORM-MIB.txt) it became clear to me that things are a lot more convoluted than I originally expected. For example, each sensor in the Sun Fire systems leads to at least five objects, each describing another aspect of the sensor (name, value, expected value, unit, and so on). Unfortunately Sun has no (public) description of all possible SNMP sensor objects, so I've come to the following two conclusions:

1. I'll figure it all out myself. For each model that we're using I'll weasel out every possible sensor and all information relevant to these sensors.

2. I'll have to write my own check script for Nagios which deals with all the various permutations of sensor arrays in an appropriate fashion. Joy...

EDIT:

For your reference, Sun has released the following documents that pertain to their SNMP implementation. Mostly they're a slight expansion on the info from the MIB. At least they're much easier on the eyes when reading :p

Right now I'm working on getting my Sun systems properly monitored through SNMP. Using the LM_sensors module for Net-SNMP has gotten me quite far, but there's one drawback. A lot of Sun's internal counters use some really odd values that don't speak for themselves. This makes it necessary to read through Sun's own MIB and correlate the data in there with the stuff from LM_sensors.

Point is, Sun isn't very forthcoming with their MIB even though it should probably be public knowledge. Nowhere on the web can I find a copy of the file. The only way to get it is by extracting it from Sun's free SUNWmasfr package, which I have done: here's SUN-PLATFORM-MIB.txt

In no way am I claiming this file to be a product of mine and it definitely has Sun's copyright on it. I just thought I'd make the file a -little- bit more accessible through the Internet. If Sun objects, I'm sure they'll tell me :3

Both check_log2 and check_log3 have been thoroughly debugged today. Finally. Thanks to both Kyle Tucker and Ali Khan for pointing out the mistakes I'd made. I also finally learned the importance of proper testing tools, so I wrote test_log2 and test_log3 which run the respective check scripts through all the possible states they can encounter.
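To give an idea of what such a test tool boils down to, here's a minimal sketch. It is not the actual test_log2 or test_log3; the helper name expect_state is made up, and the "checks" are stand-ins so the example runs anywhere. The only fixed facts are the Nagios plugin exit codes.

```shell
# Hypothetical helper: run a check and compare its exit code to what we expect.
# Nagios plugin exit codes: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
expect_state() {
    expected=$1
    shift
    actual=0
    "$@" >/dev/null 2>&1 || actual=$?
    if [ "$actual" -eq "$expected" ]; then
        echo "PASS ($actual): $*"
    else
        echo "FAIL (wanted $expected, got $actual): $*"
        return 1
    fi
}

# Stand-in "checks"; swap in the real check script and its arguments.
expect_state 0 true
expect_state 2 sh -c 'exit 2'
```

For a log checker you would prepare a log file for each situation (first run, no change, new matching line, rotated log) and call expect_state once per situation.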

Oh... check_ram was also -finally- modified to take the WARN and CRIT percentages through the command line. Shame on me for not doing that earlier.

Today I made an improved version of the Nagios monitor "check_log2", which is now aptly called "check_log3". Version 3 of this script gives you the option to add a second query to the monitor. The previous two incarnations of the script only allowed you to search for one query and would return a Critical if it was found. Now you can also add a query which will result in a Warning message as well. Goody!

I know that, the first time I started using Nagios, I got confused a little when it came to monitoring systems other than the one running Nagios. To shed a little light on the subject for the beginning Nagios user, here's a discussion of the various methods of talking to Nagios clients.

First off, let me make it absolutely clear that, in order to monitor systems other than the one running Nagios, you are indeed going to have to communicate with them in some fashion. Very few things in the Sysadmin trade are magical, and Nagios is unfortunately not one of them.

So first off, let's look at the -wrong- way of doing things. When I first started with Nagios (actually I made this mistake on my second day with the software) I wrote something like this:

The problem with this setup is that I was using a -local- check and said it belonged to remote-host. Now this may look alright on the status screen ("Hey! It's green!"), but naturally you're not monitoring the right thing ^_^

So how -do- you monitor remote resources? Here's a table comparing various methods. After that I'll give examples on how you can correct the mistake I made above with each method.

PLEASE NOTE: the following discussion will not cover the monitoring of systems other than the various UNIX flavours. Later on I'll write a similar article covering Windows and stuff like Cisco.

SSH

Just about everyone should already have SSH running on their servers (except for those few who are still running telnet or, horror of horrors!, rsh). So it's safe to assume that you can immediately start using this communications method to check your clients. You will need to:

create a nagios user on the client,

make sure that the nagios user from the server can log in to this account without a password (through keys),

install all check scripts on the system, and

for each command that you want to run through SSH, create a check command definition in checkcommands.cfg.

You can now set up your services.cfg in such a way that each remote service is checked like so:

Working this way will allow you to do most of your configuring centrally (on the Nagios server), thus saving you a lot of work on each client system. All you have to do over there is make sure that there's a working user account and that all the scripts are in place. Quite convenient... The only drawback being that you're making a relatively open account which has full access to the system (sometimes even with sudo access).
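In essence the SSH method just runs a plugin on the client as the nagios user and hands its output and exit code back to the server. A rough sketch of that idea, assuming the key-based login is already in place: the function name remote_check and the DRY_RUN switch are my own inventions for illustration, and the stock check_by_ssh plugin is the tool you'd normally use for this job.

```shell
# Hypothetical wrapper: run one plugin on a client as the nagios user.
# BatchMode=yes makes ssh fail instead of prompting when the key login is broken.
PLUGIN_DIR="/usr/local/nagios/libexec"   # path taken from the examples in this post

remote_check() {
    host=$1
    plugin=$2
    shift 2
    set -- ssh -o BatchMode=yes "nagios@$host" "$PLUGIN_DIR/$plugin" "$@"
    if [ "${DRY_RUN:-0}" -gt 0 ]; then
        echo "$*"        # only show what would be executed
    else
        "$@"             # really log in and run the plugin
    fi
}

# Dry run, so this sketch works without a reachable client.
DRY_RUN=1
remote_check remote-host check_disk 85 95 /
```

Keep in mind that ssh glues the remote arguments together with spaces, so plugin arguments that contain spaces need extra quoting.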

NRPE

As a replacement for the SSH access method, Ethan also wrote the NRPE daemon. Using NRPE requires that you:

create a nagios user on the client,

configure inetd, xinetd, tcp wrappers, services and stuff like that,

download and install NRPE on each client,

install all check scripts on the system, and

configure the NRPE daemon to have a check command for each check you would like to perform.

You can now set up your services.cfg in such a way that each remote service is checked like so:

And in /usr/local/nagios/etc/nrpe.cfg on the client you would need to include:

command[check_root]=/usr/local/nagios/libexec/check_disk 85 95 /

Good thing is that you won't have a semi-open account lying about. Bad things are that, if you want to change the configuration of your client, you're going to have to log in. And you're going to have yet another piece of software to keep up to date.
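Since the check names live on the client, it helps to be able to see quickly which ones a given nrpe.cfg actually defines. A small sketch, run against a throwaway demo file; the helper name nrpe_commands and the demo file location are assumptions of mine, and the demo entries merely mimic the command[...] line shown above.

```shell
# Hypothetical helper: print the check names defined in an nrpe.cfg-style file.
nrpe_commands() {
    sed -n 's/^command\[\([^]]*\)]=.*/\1/p' "$1"
}

# Demo against a throwaway file; on a real client you would point it at
# /usr/local/nagios/etc/nrpe.cfg instead.
cfg="${TMPDIR:-/tmp}/nrpe_demo.$$"
cat > "$cfg" <<'EOF'
# comment lines and other settings are ignored
server_port=5666
command[check_root]=/usr/local/nagios/libexec/check_disk 85 95 /
command[check_users]=/usr/local/nagios/libexec/check_users -w 5 -c 10
EOF
nrpe_commands "$cfg"
rm -f "$cfg"
```

From the Nagios server, something like check_nrpe -H remote-host -c check_root then asks the daemon to run the matching entry.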

SNMP

Whoo boy! This is something I'm working on right now at $CLIENT and let me tell you: it's hard! At least much harder than I was expecting.

SNMP is a network management protocol used by the more advanced system administrators. Using SNMP you can access just about -any- piece of equipment in your server room to read statistics, alarms and status messages. SNMP is universal, extensible, but it is also quite complicated. Not for the faint of heart.

configure the SNMP daemon in such a way that the results of the check scripts are placed in custom objects, and

configure basic security on the SNMP daemon.

The reason why point C tells you to register a private EID, is because the SNMP tree has a very rigid structure. Technically speaking you -could- just plonk down your results at a random place in the tree, but it's likely that this will screw up something else at a later time. IANA allows each company to have only one private EID, so first check if your company doesn't already have one on the IANA list.

Unfortunately the check_snmp script that comes with Nagios isn't flexible enough to let you monitor custom SNMP objects in a nice way. This is why I wrote the retrieve_custom_nagios script, which is available from the menu. Your service definition would look like this:

Up to now things are actually not that different from using NRPE, are they? Well, that's because we haven't even started using all the -real- features of SNMP. Point is that using SNMP you can dig very deeply into your system to retrieve all kinds of useful information. And -that's- where things get complicated because you're going to have to dig up all the object IDs (OIDs) that you're going to need. And in some cases you're going to have to install vendor specific sub-agents that know how to speak to your specific hardware.

One of the best features of SNMP though are the so-called traps. Using traps the SNMP daemon will actively undertake action when something goes wrong in your system. So if for instance your hard disk starts failing, it is possible to have the daemon send out an alert to your Nagios server! Awesome! But naturally this will require a boatload of additional configuration :(

So... SNMP is an awesomely powerful tool, but you're going to have to pay through the nose (in effort) to get it 100% perfect.

SNMP traps

SNMP doesn't involve polling alone. SNMP-enabled devices can also be configured to automatically send status updates to a so-called trap host. The downside to receiving SNMP traps with Nagios is that it takes quite a lot of work to get them into Nagios :D

To make proper use of monitoring through SNMP you'll need to:

install an SNMP daemon / agent on your system,

define all the SNMP traps you would like to send on the client,

install an SNMP trap daemon on your server,

configure the SNMP trap daemon to tell it what to do with the incoming traps,

install something that makes a translation between SNMP traps and Nagios service definitions.

There are -many- ways to get the SNMP traps translated for Nagios' purposes, 'cause there's many roads that lead to Rome. Unfortunately none of them are very easy to use.

My crappy-ass solution, that just cross-references a list of OIDs to a list of Nagios actions.
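To illustrate the cross-referencing idea (this is a fresh sketch, not my actual script): look up the trap OID in a table and turn it into a passive service check result. The OIDs below are the generic traps from SNMPv2-MIB; the service names, the function names and the use of date +%s (which not every old UNIX supports) are assumptions for the sake of the example.

```shell
# Hypothetical lookup: map a trap OID to "service;state;text".
# The service names are made up; use the ones from your own services.cfg.
trap_to_result() {
    case "$1" in
        1.3.6.1.6.3.1.1.5.3) echo "Network link;2;linkDown trap received" ;;
        1.3.6.1.6.3.1.1.5.4) echo "Network link;0;linkUp trap received" ;;
        1.3.6.1.6.3.1.1.5.1) echo "Reboot;1;coldStart trap received" ;;
        *)                   echo "Unknown trap;3;unmapped OID $1" ;;
    esac
}

# Build a passive check line in Nagios' external command format.
# The optional third argument overrides the timestamp.
passive_result() {
    host=$1
    oid=$2
    now=${3:-$(date +%s)}
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s\n' "$now" "$host" "$(trap_to_result "$oid")"
}

passive_result remote-host 1.3.6.1.6.3.1.1.5.3
```

In a real setup the trap daemon would call something like this for every incoming trap and append the resulting line to the Nagios command file.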

NSCA

And finally there's NSCA. This daemon is usually used by distributed Nagios servers to send their results to the central Nagios server, which gathers them as so-called "passive checks". It is however entirely possible to install NSCA on each of your Nagios clients, which will then get called to send in the results of local checks. In this case you'll need to:

make sure all your check scripts are on the client,

download and install the NSCA binaries on your client,

make a script which can be run from cron to run each script and then to forward the results through NSCA, and

For the configuration on the client side I recommend that you read up on NSCA. It's a little bit too much to show over here.

The upside to this is that you won't have to run any daemon on your client to accept incoming connections. This will allow you to lock down your system in a hard way.
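The cron wrapper mentioned above can be quite small. Here's a minimal sketch, assuming the usual tab-separated input that send_nsca reads for service checks (host, service description, return code, plugin output). The function name nsca_line, the host and service names and the stand-in check are all hypothetical.

```shell
# Hypothetical helper: run a check and print host<TAB>service<TAB>code<TAB>output.
nsca_line() {
    host=$1
    service=$2
    shift 2
    code=0
    output=$("$@" 2>&1) || code=$?
    printf '%s\t%s\t%s\t%s\n' "$host" "$service" "$code" "$output"
}

# Stand-in check so the sketch runs anywhere.
nsca_line client01 "Demo check" sh -c 'echo "DEMO WARNING - just testing"; exit 1'

# From cron you would pipe the line into send_nsca, roughly:
#   nsca_line ... | send_nsca -H nagios-server -c /path/to/send_nsca.cfg
```

The service description has to match the one in your services.cfg on the server, otherwise Nagios won't know what to do with the result.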

Naturally you are absolutely free to combine two or more of the methods described above. You could poll through NRPE and receive SNMP traps in one environment. This will have both ups and downs, but it's up to your own discretion. Use the tools that feel natural to you, or use those that are already standard in your environment.

I realise I've rushed through things a little bit, but I was in a slight hurry :) I will go over this article a second time RSN, to apply some polish.

This script was written at the time I was hired by UPC / Liberty Global.

Improved log checker for Solaris, with state retention.

I found that the version of check_log included in the default monitor package doesn't work perfectly on Solaris: it needs a bit of tweaking... Which is what I've done for the script.

Also, I've added state retention. It's a bit of a hack, but hey! I needed a quick solution.

The original script sends a Critical when it detects the string you've queried the log file for, but it clears that same Critical immediately if the same message is not repeated once the monitor runs again. Meaning that, if there are no updates to your log file, the Critical will only be around until the next time the monitor runs.

Not very handy if the Critical occurs during the night.

This new version of the script creates a file called $oldlog.STATE in /usr/local/nagios/var (which should be 755, nagios:nagios), which contains the exit status for the last detected _changed_ status... If there are no changes detected in your log file, this old exit state is repeated.
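Stripped down to its bare bones, the state retention trick looks something like the sketch below. The function names and the demo file location are made up for illustration, and falling back to OK when no state has been saved yet is my simplification, not necessarily what the full script does.

```shell
# Hypothetical helpers around a state file such as $oldlog.STATE.
save_state() {       # save_state FILE CODE
    echo "$2" > "$1"
}

last_state() {       # last_state FILE: prints the saved code, or 0 (OK) if none
    if [ -s "$1" ]; then
        cat "$1"
    else
        echo 0
    fi
}

statefile="${TMPDIR:-/tmp}/check_log_demo.$$.STATE"
save_state "$statefile" 2            # a Critical was detected on an earlier run
echo "Repeating state: $(last_state "$statefile")"
rm -f "$statefile"
```

So when the log file hasn't changed, the monitor simply reports whatever last_state returns instead of cheerfully going back to OK.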

The script has been tested on Solaris 8, Mac OS X 10.4 and Redhat ES3.

UPDATE 19/06/2006:
Cleaned up the script a bit and added some checks that are considered the Right Thing to do. Should have done this -way- earlier!

Also stomped out a few horrendous bugs! I'm very sorry for putting out such a buggy script earlier... If you've started using the script in your environment, please download the latest version. Thanks to Ali Khan for pointing out these mistakes.

#!/bin/bash
#
# Log file pattern detector plugin for Nagios
# Written by Ethan Galstad (nagios@nagios.org)
# Last Modified: 07-31-1999
# Updated by Thomas Sluyter (nagiosATkilalaDOTnl)
# Last Modified: 19-06-2006
#
# Usage: ./check_log2 -F log_file -O old_log_file -Q pattern
#
# Description:
#
# This plugin will scan a log file (specified by the log_file option)
# for a specific pattern (specified by the pattern option). Successive
# calls to the plugin script will only report *new* pattern matches in the
# log file, since a copy of the log file from the previous run is saved
# to old_log_file.
#
# Output:
#
# On the first run of the plugin, it will return an OK state with a message
# of "Log check data initialized". On successive runs, it will return an OK
# state if *no* pattern matches have been found in the *difference* between the
# log file and the older copy of the log file. If the plugin detects any
# pattern matches in the log diff, it will return a CRITICAL state and print
# out a message in the following format: "(x) last_match", where "x" is the
# total number of pattern matches found in the file and "last_match" is the
# last entry in the log file which matches the pattern.
#
# Notes:
#
# If you use this plugin make sure to keep the following in mind:
#
# 1. The "max_attempts" value for the service should be 1, as this
# will prevent Nagios from retrying the service check (the
# next time the check is run it will not produce the same results).
#
# 2. The "notify_recovery" value for the service should be 0, so that
# Nagios does not notify you of "recoveries" for the check. Since
# pattern matches in the log file will only be reported once and not
# the next time, there will always be "recoveries" for the service, even
# though recoveries really don't apply to this type of check.
#
# 3. You *must* supply a different old_file_log for each service that
# you define to use this plugin script - even if the different services
# check the same log_file for pattern matches. This is necessary
# because of the way the script operates.
#
# 4. Changes to the script were made by Thomas Sluyter (nagios@kilala.nl).
# The first set of changes will allow the script to run properly on Solaris, which
# it did not do by default. The second set of changes will allow the following:
# * State retention. If a NOK was generated at point A in time and it is not repeated
# at A+1, then an OK is sent to Nagios. Not something that you would like to happen.
# I've added the $oldlog.STATE trigger file which retains the last exitstatus. Should
# there be no new lines added to the log, check_log will simply repeat the last state
# instead of giving an OK.
#
# Examples:
#
# Check for login failures in the syslog...
#
# check_log -F /var/log/messages -O /usr/local/nagios/var/check_log.badlogins.old -Q "LOGIN FAILURE"
#
# Check for port scan alerts generated by Psionic's PortSentry software...
#
# check_log -F /var/log/messages -O /usr/local/nagios/var/check_log.portscan.old -Q "attackalert"
#
# Paths to commands used in this script. These
# may have to be modified to match your system setup.
PATH="/usr/bin:/usr/sbin:/bin:/sbin"
PROGNAME=`basename $0`
PROGPATH=`echo $0 | sed -e 's,[\\/][^\\/][^\\/]*$,,'`
#. $PROGPATH/utils.sh
. /usr/local/nagios/libexec/utils.sh
print_usage() {
echo "Usage: $PROGNAME -F logfile -O oldlog -Q query"
echo "Usage: $PROGNAME --help"
}
print_help() {
echo ""
print_usage
echo ""
echo "Log file pattern detector plugin for Nagios"
echo ""
support
}
# Make sure the correct number of command line
# arguments have been supplied
if [ $# -lt 6 ]; then
print_usage
exit $STATE_UNKNOWN
fi
# Grab the command line arguments
exitstatus=$STATE_WARNING #default
while test -n "$1"; do
case "$1" in
--help)
print_help
exit $STATE_OK
;;
-h)
print_help
exit $STATE_OK
;;
-F)
logfile=$2
shift
;;
-O)
oldlog=$2
shift
;;
-Q)
query=$2
shift
;;
*)
echo "Unknown argument: $1"
print_usage
exit $STATE_UNKNOWN
;;
esac
shift
done
# If the source log file doesn't exist, exit
if [ ! -e $logfile ]; then
echo "Log check error: Log file $logfile does not exist!"
# Save the state before exiting; after the exit nothing else would run
echo $STATE_UNKNOWN > $oldlog.STATE
exit $STATE_UNKNOWN
fi
# If the oldlog file doesn't exist, this must be the first time
# we're running this test, so copy the original log file over to
# the old diff file and exit
if [ ! -e $oldlog ]; then
cat $logfile > $oldlog
if [ `tail -1 $logfile | grep -i "$query" | wc -l` -gt 0 ]
then
echo "Log check data initialized... Last line contained error message."
echo $STATE_CRITICAL > $oldlog.STATE
exit $STATE_CRITICAL
else
echo "Log check data initialized..."
echo $STATE_OK > $oldlog.STATE
exit $STATE_OK
fi
fi
# A bug which was caught very late:
# If newlog is shorter than oldlog, the diff used below will return
# false positives for the query because they will still be in $oldlog. Why?
# Because $oldlog is not rolled over / rotated, like $newlog. I need
# to fix this in a kludgy way.
if [ `wc -l $logfile|awk '{print $1}'` -lt `wc -l $oldlog|awk '{print $1}'` ]
then
rm $oldlog
cat $logfile > $oldlog
if [ `tail -1 $logfile | grep -i "$query" | wc -l` -gt 0 ]
then
echo "Log check data re-initialized... Last line contained error message."
echo $STATE_CRITICAL > $oldlog.STATE
exit $STATE_CRITICAL
else
echo "Log check data re-initialized..."
echo $STATE_OK > $oldlog.STATE
exit $STATE_OK
fi
fi
# Everything seems fine, so compare it to the original log now
# The temporary file that the script should use while
# processing the log file.
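# Look mktemp up in the PATH; "-x mktemp" would only test the current directory
if command -v mktemp >/dev/null 2>&1; then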
tempdiff=`mktemp /tmp/check_log.XXXXXXXXXX`
else
tempdate=`/bin/date '+%H%M%S'`
tempdiff="/tmp/check_log.${tempdate}"
touch $tempdiff
fi
diff $logfile $oldlog > $tempdiff
if [ `wc -l $tempdiff|awk '{print $1}'` -eq 0 ]
then
rm $tempdiff
touch $oldlog.STATE
exitstatus=`cat $oldlog.STATE`
echo "LOG FILE - No status change detected. Status = $exitstatus"
exit $exitstatus
fi
# Count the number of matching log entries we have
count=`grep -c "$query" $tempdiff`
# Get the last matching entry in the diff file
lastentry=`grep "$query" $tempdiff | tail -1`
rm -f $tempdiff
cat $logfile > $oldlog
if [ "$count" = "0" ]; then # no matches, exit with no error
echo "Log check ok - 0 pattern matches found"
exitstatus=$STATE_OK
else # Print total match count and the last entry we found
# echo "($count) $lastentry"
echo "Log check NOK - $lastentry"
exitstatus=$STATE_CRITICAL
echo $STATE_CRITICAL > $oldlog.STATE
fi
exit $exitstatus

This script was written at the time I was hired by KPN i-Diensten. It is reproduced/shared here with their permission.

Today I made an improved version of the Nagios monitor "check_log2", which is now aptly called "check_log3". It includes all the improvements I originally added to "check_log2", so you can simply use this as a drop-in replacement.

Version 3 of this script gives you the option to add a second query to the monitor.

The previous two incarnations of the script only allowed you to search for one query and would return a Critical if it was found. Now you can also add a query which will result in a Warning message as well. Goody! :3
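The decision logic behind the two queries is simple: a Critical match always wins over a Warning match. A minimal sketch of just that logic, with a made-up function name and made-up sample log lines; the real script works on the diff between the log file and its old copy.

```shell
# Hypothetical helper: classify text on stdin against two grep patterns.
# Prints 2 (Critical), 1 (Warning) or 0 (OK); Critical wins over Warning.
classify_lines() {
    input=$(cat)
    if printf '%s\n' "$input" | grep "$1" >/dev/null; then
        echo 2
    elif printf '%s\n' "$input" | grep "$2" >/dev/null; then
        echo 1
    else
        echo 0
    fi
}

printf 'disk nearly full\nkernel panic\n' | classify_lines "panic" "nearly full"
```

The first argument plays the role of the -C pattern and the second that of the -W pattern.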

1st of Feb, 2006:
Kyle Tucker pointed out that he had problems running this script with bash on Solaris. The changes he suggested have been worked into the newer version. Thanks Kyle :)

5th of Mar, 2006:
I finally got round to fixing the script according to all the changes Kyle (and others) suggested. So here's another try! Right now I've tested the script on Red Hat, Mac OS X and Solaris, so it should be much better than before.

19th of June, 2006:
Cleaned up the script a bit and added some checks that are considered the Right Thing to do. Should have done this -way- earlier!

Also stomped out a few horrendous bugs! I'm very sorry for putting out such a buggy script earlier... If you've started using the script in your environment, please download the latest version. Thanks to Ali Khan for pointing out these mistakes.

#!/bin/bash
#
# Log file pattern detector plugin for Nagios
# Written by Ethan Galstad (nagios@nagios.org)
# Last Modified: 07-31-1999
# Heavily modified by Thomas Sluyter (nagiosATkilalaDOTnl)
# Last Modified: 19-06-2006
#
# Usage: ./check_log3 -F log_file -O old_log_file -C crit-pattern -W warn-pattern
#
# Description:
#
# This plugin will scan a log file (specified by the log_file option)
# for specific patterns (specified by the XXX-pattern options). Successive
# calls to the plugin script will only report *new* pattern matches in the
# log file, since a copy of the log file from the previous run is saved
# to old_log_file.
#
# Output:
#
# On the first run of the plugin, it will return an OK state with a message
# of "Log check data initialized". On successive runs, it will return an OK
# state if *no* pattern matches have been found in the *difference* between the
# log file and the older copy of the log file. If the plugin detects any
# pattern matches in the log diff, it will return a CRITICAL state and print
# out a message in the following format: "(x) last_match", where "x" is the
# total number of pattern matches found in the file and "last_match" is the
# last entry in the log file which matches the pattern.
#
# Notes:
#
# If you use this plugin make sure to keep the following in mind:
#
# 1. The "max_attempts" value for the service should be 1, as this
# will prevent Nagios from retrying the service check (the
# next time the check is run it will not produce the same results).
#
# 2. The "notify_recovery" value for the service should be 0, so that
# Nagios does not notify you of "recoveries" for the check. Since
# pattern matches in the log file will only be reported once and not
# the next time, there will always be "recoveries" for the service, even
# though recoveries really don't apply to this type of check.
#
# 3. You *must* supply a different old_file_log for each service that
# you define to use this plugin script - even if the different services
# check the same log_file for pattern matches. This is necessary
# because of the way the script operates.
#
# 4. Changes to the script were made by Thomas Sluyter (cailin@kilala.nl).
# * The first set of changes will allow the script to run properly on Solaris, which
# it did not do by default. The second set of changes will allow the following:
# * State retention. In the original script, if a NOK was put into the log file
# at point A in time and it is not repeated at A+1, then an OK is sent to Nagios.
# Not something that you would like to happen.
# I've added the $oldlog.STATE trigger file which retains the last exitstatus. Should
# there be no new lines added to the log, check_log will simply repeat the last state
# instead of giving an OK.
# In order for this state retention to work properly your client system MUST
# HAVE THE DIRECTORY /USR/LOCAL/NAGIOS/VAR.
# * Two queries. In the original script you could only enter one query which, when
# found, would result in a Critical message being sent to Nagios. I've added the
# possibility to add another query, which will result in a Warning message.
# * Bugfix: changed all instances of "crit-count" and "warn-count" to "critcount" and
# "warncount" after a tip from Kyle Tucker who ran into problems running this script
# with bash on Solaris.
#
# Paths to commands used in this script. These
# may have to be modified to match your system setup.
PATH="/usr/bin:/usr/sbin:/bin:/sbin"
PROGNAME=`basename $0`
PROGPATH=`echo $0 | sed -e 's,[\\/][^\\/][^\\/]*$,,'`
#. $PROGPATH/utils.sh
. /usr/local/nagios/libexec/utils.sh
print_usage() {
echo "Usage: $PROGNAME -F logfile -O oldlog -C CRITquery -W WARNquery"
echo "Usage: $PROGNAME --help"
echo "Usage: $PROGNAME --version"
}
print_help() {
echo ""
print_usage
echo ""
echo "Log file pattern detector plugin for Nagios"
echo ""
support
}
# Make sure the correct number of command line
# arguments have been supplied
if [ $# -lt 8 ]; then
print_usage
exit $STATE_UNKNOWN
fi
# Grab the command line arguments
exitstatus=$STATE_WARNING #default
while test -n "$1"; do
case "$1" in
--help)
print_help
exit $STATE_OK
;;
-h)
print_help
exit $STATE_OK
;;
-F)
logfile=$2
shift
;;
-O)
oldlog=$2
shift
;;
-C)
CRITquery=$2
shift
;;
-W)
WARNquery=$2
shift
;;
*)
echo "Unknown argument: $1"
print_usage
exit $STATE_UNKNOWN
;;
esac
shift
done
# If the source log file doesn't exist, exit
if [ ! -e $logfile ]; then
echo "Log check error: Log file $logfile does not exist!"
# Save the state before exiting; after the exit nothing else would run
echo $STATE_UNKNOWN > $oldlog.STATE
exit $STATE_UNKNOWN
fi
# If the dump/temp log file doesn't exist, this must be the first time
# we're running this test, so copy the original log file over to
# the old diff file and exit
if [ ! -e $oldlog ]; then
cat $logfile > $oldlog
TEMPcount=0
let TEMPcount=$TEMPcount+$(tail -1 $logfile | grep -i "$WARNquery" | wc -l | awk '{print $1}')
let TEMPcount=$TEMPcount+$(tail -1 $logfile | grep -i "$CRITquery" | wc -l | awk '{print $1}')
if [ $TEMPcount -gt 0 ]
then
echo "Log check data initialized... Last line contained error message."
echo $STATE_WARNING > $oldlog.STATE
exit $STATE_WARNING
else
echo "Log check data initialized..."
echo $STATE_OK > $oldlog.STATE
exit $STATE_OK
fi
fi
# A bug which was caught very late:
# If newlog is shorter than oldlog, the diff used below will return
# false positives for the query because they will still be in $oldlog. Why?
# Because $oldlog is not rolled over / rotated, like $newlog. I need
# to fix this in a kludgy way.
if [ `wc -l $logfile|awk '{print $1}'` -lt `wc -l $oldlog|awk '{print $1}'` ]
then
rm $oldlog
cat $logfile > $oldlog
TEMPcount=0
let TEMPcount=$TEMPcount+$(tail -1 $logfile | grep -i "$WARNquery" | wc -l | awk '{print $1}')
let TEMPcount=$TEMPcount+$(tail -1 $logfile | grep -i "$CRITquery" | wc -l | awk '{print $1}')
if [ $TEMPcount -gt 0 ]
then
echo "Log check data initialized... Last line contained error message."
echo $STATE_WARNING > $oldlog.STATE
exit $STATE_WARNING
else
echo "Log check data initialized..."
echo $STATE_OK > $oldlog.STATE
exit $STATE_OK
fi
fi
# The oldlog file exists, so compare it to the original log now
# The temporary file that the script should use while
# processing the log file.
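# Look mktemp up in the PATH; "-x mktemp" would only test the current directory
if command -v mktemp >/dev/null 2>&1; then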
tempdiff=`mktemp /tmp/check_log.XXXXXXXXXX`
else
tempdate=`/bin/date '+%H%M%S'`
tempdiff="/tmp/check_log.${tempdate}"
touch $tempdiff
fi
diff $logfile $oldlog > $tempdiff
if [ `wc -l $tempdiff | awk '{print $1}'` -eq 0 ]
then
rm $tempdiff
touch $oldlog.STATE
exitstatus=`cat $oldlog.STATE`
echo "LOG FILE - No status change detected. Status = $exitstatus"
exit $exitstatus
fi
# Count the number of matching log entries we have
CRITcount=`grep -c "$CRITquery" $tempdiff`
WARNcount=`grep -c "$WARNquery" $tempdiff`
# Get the last matching entry in the diff file
CRITlastentry=`grep "$CRITquery" $tempdiff | tail -1`
WARNlastentry=`grep "$WARNquery" $tempdiff | tail -1`
rm $tempdiff
cat $logfile > $oldlog
if [ "$CRITcount" -gt 0 ]; then
echo "($CRITcount) $CRITlastentry"
echo $STATE_CRITICAL > $oldlog.STATE
exit $STATE_CRITICAL
fi
if [ "$WARNcount" -gt 0 ]; then
echo "($WARNcount) $WARNlastentry"
echo $STATE_WARNING > $oldlog.STATE
exit $STATE_WARNING
fi
echo "Log check ok - 0 pattern matches found"
exit $STATE_OK

This script was written at the time I was hired by KPN i-Diensten. It is reproduced/shared here with their permission.

I couldn't find an easy way to check whether all interfaces of a host are up and running from the -inside-, so I wrote a Nagios plugin to do this.

Naturally you could also try to ping all of the IP addresses of all of these network cards, but this isn't always possible. Lord knows how many routing issues I had to fight through to get our current IP set monitored. I guess using this script is a bit easier :)

The script was tested on Redhat ES3, Mac OSX and Solaris. Its basic requirement is the Korn shell (due to some conversions happening inside the script). On Linux/RH you'll need mii-tool (and sudo) and on Solaris you'll need Perl (for one lousy piece of math :p ).
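
To give an idea of what such an inside-out link check could look like on Linux, here is a minimal sketch. It is not the actual check_networking code: the function name check_links is made up for this example, and the mii-tool line format shown in the comments is an assumption that you should verify on your own systems.

```shell
#!/bin/sh
# Sketch only: parses mii-tool style lines such as
#   eth0: negotiated 100baseTx-FD, link ok
#   eth1: no link
# The line format is an assumption; adjust the patterns for your system.

check_links() {
    down=""
    total=0
    while IFS= read -r line; do
        [ -n "$line" ] || continue          # skip empty lines
        total=$((total + 1))
        iface=${line%%:*}                   # text before the first colon
        case $line in
            *"link ok"*) ;;                 # interface reports a link
            *) down="$down $iface" ;;       # anything else counts as down
        esac
    done
    if [ -n "$down" ]; then
        echo "NETWORK CRITICAL - no link on:$down"
        return 2
    fi
    echo "NETWORK OK - $total interfaces have link"
    return 0
}

# Possible usage on Linux (mii-tool usually needs root, hence sudo):
#   sudo mii-tool 2>&1 | check_links
```

Reading the tool output from stdin keeps the parsing separate from the platform-specific command, which matters when the same check has to run on Linux, Solaris and Mac OS X.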

EDIT:
Oh! Just like my other recent Nagios scripts, check_networking comes with a debugging option. Set $DEBUG at the top of the file to anything larger than zero and the script will dump information at various stages of its execution.

This script was written at the time I was hired by KPN i-Diensten. It is reproduced/shared here with their permission.

At $CLIENT we've often run into problems with the NSCA daemon, where the daemon would not crash per se, but where it would also not process incoming service checks. The nsca process was still running, but it simply wasn't transferring the incoming results to the Nagios command file.

I was amazed to find that nobody else had written a script to check for this! So I quickly wrote one.

This script was written at the time I was hired by KPN i-Diensten. It is reproduced/shared here with their permission.

As far as I know there was no Nagios plugin that allowed you to really check your client configuration. I mean, it would be nice to know for sure that all your systems are syncing against the proper server... Wouldn't it?

The script was tested on Redhat ES3, Mac OS X and Solaris. Its basic requirement is the bash shell.
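
As a rough illustration of the idea (and not the real check_ntp_config), the sketch below looks for an expected server in the "server" lines of an NTP configuration file. The function name, the config path and the server name are all hypothetical.

```shell
#!/bin/sh
# Sketch only: verify that an NTP config file lists the expected server.
# Usage: check_ntp_servers CONFIG_FILE EXPECTED_SERVER

check_ntp_servers() {
    conf=$1
    expected=$2
    if [ ! -r "$conf" ]; then
        echo "NTP UNKNOWN - cannot read $conf"
        return 3
    fi
    # Take the second field of every line whose first field is exactly
    # "server"; commented-out lines start with "#" and are skipped.
    servers=$(awk '$1 == "server" { print $2 }' "$conf")
    for s in $servers; do
        if [ "$s" = "$expected" ]; then
            echo "NTP OK - $conf lists $expected"
            return 0
        fi
    done
    echo "NTP CRITICAL - $expected not listed in $conf"
    return 2
}

# Possible usage (path and host name are examples only):
#   check_ntp_servers /etc/ntp.conf ntp1.example.com
```

This only inspects the configuration. Whether the daemon is really syncing against that server is a separate question.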

EDIT:
Oh! Just like my other recent Nagios scripts, check_ntp_config comes with a debugging option. Set $DEBUG at the top of the file to anything larger than zero and the script will dump information at various stages of its execution.

This script was written at the time I was hired by KPN i-Diensten. It is reproduced/shared here with their permission.

A very simple script that takes a list of processes, instead of a single process name (as is the case with check_process). This should make monitoring a basic list of processes a lot easier. I really should change the script in such a way that it takes the process list from the command line, instead of from the $LIST variable that's defined internally. I'll do that when I have the time.

Until I've made that change, I use the script by copying check_processes to a new file which is used specifically for one purpose. For example check_linux_processes and check_solaris_processes check a list of processes that should be up and running on Linux and Solaris respectively.
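
For what it's worth, the command line variant could look something like the sketch below. This is not the current check_processes: the helper list_running and the output texts are made up, and "ps -e -o comm=" is assumed to print bare command names (on some systems it prints full paths or truncated names).

```shell
#!/bin/sh
# Sketch only: process names come from the command line, not from $LIST.
STATE_OK=0
STATE_CRITICAL=2

# Hypothetical helper: one command name per line for every running process.
list_running() {
    ps -e -o comm=
}

# Usage: check_processes NAME [NAME ...]
check_processes() {
    missing=""
    running=$(list_running)
    for proc in "$@"; do
        # -x matches whole lines only, -F treats the name as a fixed string.
        if ! printf '%s\n' "$running" | grep -qxF "$proc"; then
            missing="$missing $proc"
        fi
    done
    if [ -n "$missing" ]; then
        echo "PROCS CRITICAL - not running:$missing"
        return $STATE_CRITICAL
    fi
    echo "PROCS OK - all $# processes running"
    return $STATE_OK
}

# Possible usage:
#   check_processes sshd cron syslogd
```

With something like this, the Linux and Solaris variants would become two Nagios command definitions with different arguments instead of two copies of the script.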

This script was written at the time I was hired by KPN i-Diensten. It is reproduced/shared here with their permission.

A few of our projects and services are run on Solaris systems running Sun Cluster software. Since there were no Nagios scripts available to perform checks against Sun Cluster I made a basic script that checks the most important factors.

This script performs a different function, depending on the parameter with which it is called. This allows you to define multiple service checks in Nagios, without needing separate check scripts for each.

EDIT:
Oh! Just like my other recent Nagios scripts, check_suncluster comes with a debugging option. Set $DEBUG at the top of the file to anything larger than zero and the script will dump information at various stages of its execution. And like my other, recent scripts it also comes with its own test script.

This script was written while I was hired by KPN i-Diensten. It is reproduced/shared here with their permission.

One of the things we've been looking into recently, is running the standard Nagios plugins through SNMP instead of through NRPE. Putting aside the discussion of the various merits and flaws such a solution has, let's say that it works nicely.

What this does, is tell the SNMP daemon to run the check_load script when someone asks for object .1.3.6.1.4.1.6886.4.1.1 (or .2, or .3). The exit code for the script will be placed in OID.100.0 and the first line of output will be placed in OID.101.1. This script retrieves those two values through SNMP and returns them to Nagios.

The "-o" parameter takes the OID you have selected for your custom check.
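
The retrieval step described above can be sketched roughly as follows. This is a simplified illustration and not the script itself: the function names are made up, it takes positional arguments instead of "-o", and SNMP v2c with a community string is an assumption. The .100.0 and .101.1 suffixes are the ones mentioned above.

```shell
#!/bin/sh
# fetch_value is a hypothetical helper around net-snmp's snmpget.
# -Oqv asks snmpget to print only the value of the requested object.
fetch_value() {
    snmpget -v 2c -c "$2" -Oqv "$1" "$3"
}

# Usage: check_snmp_exec HOST COMMUNITY OID
check_snmp_exec() {
    host=$1
    community=$2
    oid=$3

    # Exit code of the remote check (OID.100.0).
    code=$(fetch_value "$host" "$community" "$oid.100.0") || code=""
    # First line of the remote check's output (OID.101.1).
    text=$(fetch_value "$host" "$community" "$oid.101.1") || text=""

    # Only pass on a valid Nagios state; anything else becomes UNKNOWN.
    case $code in
        0|1|2|3) ;;
        *) echo "UNKNOWN - could not read check result from $host"
           return 3 ;;
    esac

    # snmpget prints strings in double quotes; strip them.
    text=${text#\"}
    text=${text%\"}
    echo "$text"
    return "$code"
}

# Possible usage (host and community are examples only):
#   check_snmp_exec server01 public .1.3.6.1.4.1.6886.4.1.1
```

Treating every unexpected exit code as UNKNOWN means a broken SNMP path shows up in Nagios as its own problem, rather than as a false OK.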

Now... How do you select an OID? There are two ways:

1. The WRONG way = randomly selecting some OID. You might pick an OID which is needed for other monitoring purposes in your network.

2. The RIGHT way = requesting a private Enterprise ID for your company at IANA. You are free to build an SNMP tree beneath this EID. For example, the EID 6886 mentioned above is registered to KPN (my current client). The sub-tree .4.1 contains all OIDs referring to Nagios checks performed by my department.

Before sending out that request, please check the current EID list to see if your company already owns a private subtree. If that's the case, contact the "owner" to request your own part of the subtree.

UPDATE (2006-10-02):
Thanks to the kind folks on the Nagios Users ML I've found out that my original version of the script was totally bug-ridden. I've made a big bunch of adjustments and now the script should work properly. Thanks especially to Andreas Ericsson.

At $CLIENT I've built a centralised logging environment based on Syslog-ng, combined with MySQL. To make any use of all the data going into the database we use PHP-syslog-ng. However, I've found a bit of a flaw with that software: any account you create has the ability to add, remove or change other accounts... Which kinda makes things insecure.

So yesterday was spent teaching myself PHP and MySQL to such a degree that I'd be able to modify the guy's source code. In the end I managed to bolt on some sort of "admin-mode" which allows you to set an "admin" flag on certain user accounts (thus giving them the capabilities mentioned above).

The updated PHP files can be found in the TAR-ball in the menu of the Sysadmin section. The only thing you'll need to do to make things work is to either:

1. Re-create your databases using the dbsetup.sql script.

2. Add the "admin" column to the "users" table using the following command: ALTER TABLE users ADD COLUMN baka BOOLEAN;

Unfortunately I've been working longer days than I should this week. I mean, it's not a horrendous amount of hours, but still I'd rather be at home relaxing. This week has seen the people in charge at $CLIENT up the prio on a centralised Jumpstart/FLAR server, which I was supposed to deliver. I was already working on it part time, but now they have me working on it full time. It's quite a lot of fun, since I get to work together with other departments within $CLIENT, thus making more friends and allies ^_^

I also had to struggle with Perle IOLan+ terminal servers this week, since we need to be able to use the serial management port on our Sun servers. Yes, admittedly these boxen do work for this purpose, but I'd rather have a proper console server instead of a piece of kit which was originally meant as a dial-in box for dumb terminals or modems. Let's just say that I dream of Cyclades.

Oh! Last Wednesday was my birthday by the way... I've hit 26 now :3 We went out for a lovely dinner at Konichi wa in Utrecht, since we wanted to try out a different Japanese restaurant for a change. I must say: their price/quality ratio is really good! If you are ever in the neighbourhood of Utrecht and feel like Japanese, head over there! They're at Mariaplaats 9. BTW, they don't just do Teppan Yaki... They also serve excellent sushi and will make you _ramen_ or _udon_ noodles if you ask nicely!!! My new favourite restaurant :9

This script was written at the time I was hired by KPN i-Diensten. It is reproduced/shared here with their permission.

We are currently in the process of distributing a standard set of Nagios monitoring scripts to over 300 client systems. One of the metrics we would like to monitor is the three load averages (or as Dr. Gunther calls them: the LaLaLa triplets).

Since these 300 servers aren't all alike, we are bound to run into systems with one, two, four, eight or more processors. That means there is no nice way of making one standard configuration, since you'll have to define separate LA levels for WARN and CRIT. Why? Cause a quad system can take much more load than a single core system.

One way to get around this would be by defining separate host groups, based on the number of processors in a system. You could then define a unique check_load command for each CPU host group.

I've gone the other way around though...

My work-around is to replace check_load with check_load2. This script takes no command line parameters and works on the basis of standard multipliers. We are of the opinion that the number of processors multiplied by a certain factor (150%? 200%? and so on) is a good enough way to define these WARN and CRIT levels. These multipliers can easily be modified (at the top of the script) to fit what -you- think is a worrying level of activity.
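
To make the multiplier idea concrete, here is a minimal sketch of the threshold math. It is not check_load2 itself: the function name and the 1.5 / 2.0 factors are example values, it looks at a single load average instead of all three, and the CPU count is passed in as an argument because detecting it differs per OS.

```shell
#!/bin/sh
# Sketch only: scale WARN and CRIT thresholds by the number of CPUs.
WARN_FACTOR=1.5   # example value: warn at 150% of the CPU count
CRIT_FACTOR=2.0   # example value: critical at 200% of the CPU count

# Usage: check_load_scaled NUMBER_OF_CPUS LOAD_AVERAGE
check_load_scaled() {
    # awk does the floating point math and its exit code is the Nagios state.
    awk -v n="$1" -v l="$2" -v w="$WARN_FACTOR" -v c="$CRIT_FACTOR" 'BEGIN {
        load = l + 0
        warn = n * w
        crit = n * c
        if (load >= crit) {
            printf "LOAD CRITICAL - %.2f (crit at %.2f)\n", load, crit
            exit 2
        }
        if (load >= warn) {
            printf "LOAD WARNING - %.2f (warn at %.2f)\n", load, warn
            exit 1
        }
        printf "LOAD OK - %.2f (warn at %.2f)\n", load, warn
        exit 0
    }'
}

# Possible usage: a quad CPU box with a one minute load average of 5.0
#   check_load_scaled 4 5.0
```

With these example factors a load of 5.0 is fine on a four-CPU machine (WARN at 6.0), while the same load would already be critical on a single-CPU machine (CRIT at 2.0).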

This script was tested on Redhat ES3, Solaris 8 and Mac OS X 10.4. It should run on other versions of these OSes as well.

EDIT:
Oh! Just like my other recent Nagios scripts, check_load2 comes with a debugging option. Set $DEBUG at the top of the file to anything larger than zero and the script will dump information at various stages of its execution.

Currently listening to "Press Conference rag" from the musical Chicago.

What a relief! We finally managed to move NIS+ to a new Master server. We put in about twelve hours on Saturday, but we finally got that bitch tamed! :) Proper credit needs to be awarded, so I would like to say that our success was mostly due to the scripts which had been crafted by Jeroen and Roland.

Bad news for those sysadmins out there waiting for news regarding the NIS+. We tried our best yesterday, but moving NIS+ to a new Master server failed again :( This time around we used a tried and true (although much improved upon by Jeroen) procedure, which is usually reserved for worst case scenarios. Unfortunately we ran into some unforeseen problems. I'll tell you more about them when I deliver the _real_ procedure.

Holy moly, what a weekend! I can tell you guys right now that the procedure I wrote for switching NIS+ master servers is NOT foolproof! We had planned to only take about four hours at a max, for switching both NIS+ and BoKS over to a new master server. Unfortunately it turned out that we would only get to spend one hour on switching NIS+ until things went horribly sour.

In the end I spent a total of eighteen hours in the office on Saturday and Sunday. I'll spare you the gory details for now (I'll incorporate them in version 2.0 of the master switch procedure).

But God, what a weekend! And the way it looks now we'll be repeating it in a week or so...

Aniwho... I'm still trying to put as much time as possible into my work for the convention, but it's going slowly. I plan on spending every free minute of this coming Thursday on my Foundation work though. That should get me along the way nicely.

Finally got round to writing the "Switch to a new master" procedure for NIS+. This procedure is damn handy when you want to move your current NIS+ root master to new hardware. This is something that we'll be doing at my employer on the 20th of November, so I'll keep you guys posted. I'll also be sure to update the procedure should anything go wrong :]

In the menu of the Sysadmin section you will also find a link to a small erratum which I wrote after reading Rick Bushnell's book. As you can see I found quite a number of errors. I also e-mailed this list to Prentice Hall publishers and hope that they will make proper use of the list.

Well, it took me a couple of days, but finally it's done: my summary of the "SCNA study guide" by Rick Bushnell (see the book list). I'll be taking my first shot at the SCNA exam in about a week (the 22nd, keeping my fingers crossed), so I'm happy that I've finished the document. I thought I'd share it with the rest of you; maybe it'll be of some use.

All 29 pages are available for download as a PDF from the Sysadmin section.

All content, with exception of "borrowed" blogpost images, or unless otherwise indicated, is copyright of Thomas Sluyter. The character Kilala the cat-demon is copyright of Rumiko Takahashi and used here without permission.