This feature is mightier than you might expect. To some extent it works like check_generic does.

It can set child specific variables, such as process_perfdata, timeout or displayed.
But you can also set attributes like critical, warning, and unknown, which is pretty much the same as specifying state [ critical ] = <expression>, just on the child check level. It is evaluated directly after a child check has been executed and affects the state of that particular child check.
Let me give an example:

attribute [ <variable> ] = <value>
This is an option if you want to specify everything in your cmd files instead of using command line parameters like '-s <variable>=<value>'. Any -s option can be set with this attribute statement.
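As a sketch, such an attribute statement might sit in a command file like this (the child check and the values are invented for illustration, and the exact scoping of attribute statements should be checked against the check_multi documentation):

```
command [ disk ] = check_disk -w 10% -c 5% -p /

# instead of 'check_multi ... -s timeout=30 -s process_perfdata=0'
# on the command line:
attribute [ timeout ]          = 30
attribute [ process_perfdata ] = 0
```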

Everybody was able to understand these two elements and started to write plugins - in every language imaginable and for every OS that Nagios runs on. Nagios gained a famous reputation ('Yes we can plugin!'), and the only limitation was the skill of the plugin programmer.

As a side note: some plugins are not of the best quality, as the exchange repositories show. But they cover the whole range of monitoring.

Plugin output limited

Let's talk about a small aspect of the plugin interface which is annoying and often frustrating, especially for Nagios beginners - the limited length of plugin output. It sounds pretty simple, but the devil is in the details.

If you take the output of the standard plugin check_disk, the length of the output should not be a problem:

428 bytes instead of 83 bytes: if you were still running Nagios 2, this would have exceeded the maximum length of plugin output.

Plugin buffer overflow? Nagios does not care

The plugin interface is simple, but this also means that Nagios does not care about the length of plugin output. If it exceeds the internal buffer length, nobody is informed and often nobody notices. The content is simply cut off.

Bad news for the performance data, which is appended to the output: if the buffer is not long enough to hold the whole output, the performance data is missing or, even worse, corrupted. And no warning lamp alerts the monitoring admin.
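To make the silent truncation tangible, here is a small shell sketch. The check_disk-style output is fabricated; the point is simply that a fixed-size buffer of 332 bytes (the Nagios 2 limit discussed below) drops the tail of the perfdata without any error:

```shell
#!/bin/sh
# Simulate the silent cut of a fixed-size plugin output buffer.
MAX_LEN=332   # the Nagios 2.x limit

status="DISK OK - free space checked on 12 filesystems"
perfdata=""
for i in 1 2 3 4 5 6 7 8 9 10 11 12; do
    perfdata="$perfdata /fs$i=256MB;2000;2100;0;2200"
done
output="$status |$perfdata"

# Nagios copies at most MAX_LEN bytes into its buffer;
# everything beyond that vanishes without a warning.
truncated=$(printf '%s' "$output" | cut -c1-"$MAX_LEN")

echo "full output:  ${#output} bytes"
echo "after buffer: ${#truncated} bytes"
```

Note that the cut does not respect perfdata boundaries: the buffered string ends in the middle of an entry, which is exactly the corruption described above.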

Nagios 2 allowed 332 bytes of plugin output, but for Nagios 3 this was increased drastically. Have a look at how the maximum plugin output length evolved over the Nagios timeline.
On the Nagios side it's the constant named MAX_PLUGIN_OUTPUT_LENGTH:

Maximum plugin output (bytes)   Nagios version   Include file

352                             1.0              common/objects.h
348                             2.0              include/objects.h
332                             2.1              include/objects.h
4096                            3.0a             include/nagios.h
8192                            3.0              include/nagios.h

Don't think that 8K bytes is sufficient in all cases - check_multi's HTML mode can easily consume dozens of kilobytes.

Increasing MAX_PLUGIN_OUTPUT_LENGTH - and some more

In principle the idea of increasing the constant MAX_PLUGIN_OUTPUT_LENGTH is correct - increase it, recompile and restart Nagios, done.
Ethan himself gives a hint in nagios.h to also increase MAX_EXTERNAL_COMMAND_LENGTH for passive checks:

NOTE: Plugin length is artificially capped at 8k to prevent runaway plugins from returning MBs/GBs of data
back to Nagios. If you increase the 8k cap by modifying this value, make sure you also increase the value
of MAX_EXTERNAL_COMMAND_LENGTH in common.h to allow for passive checks results received through the external
command file. EG 10/19/07

One remark on the buffer size - generally it's a good idea to restrict it. But increasing it to 32K or 64K should not be a problem for modern servers in the gigabit world, even with runaway plugins.
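The edit itself can be sketched as follows, demonstrated here on a throwaway file rather than a real source tree. In a Nagios 3 tree the constant lives in include/nagios.h (and MAX_EXTERNAL_COMMAND_LENGTH in common.h, as the quote above says); run the same sed substitution against those files before recompiling:

```shell
#!/bin/sh
# Raise MAX_PLUGIN_OUTPUT_LENGTH from 8192 to a larger cap.
NEW_LEN=65536
demo=$(mktemp)
printf '#define MAX_PLUGIN_OUTPUT_LENGTH 8192\n' > "$demo"

# \1 keeps the constant name, only the numeric value is replaced
result=$(sed "s/\(MAX_PLUGIN_OUTPUT_LENGTH[[:space:]]*\)[0-9][0-9]*/\1$NEW_LEN/" "$demo")
echo "$result"
rm -f "$demo"
```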

Transports and oddities

Enlarging the Nagios buffers is not all - many of the plugins run on remote machines, and their output has to be transferred to Nagios. Here several transports enter the stage:

NRPE - Nagios remote plugin executor

NSCA - Nagios service check acceptor

SSH - check_by_ssh

There are more, but these are the most important ones in the Nagios world. Let's take a look at how they behave with large plugin output.

We will begin with the SSH transport, since it's not Nagios-specific and, in terms of transportation, the simplest. I know that some people will not agree, but here are my 2 cents: if you can manage public key authentication with SSH, it's a simple, safe and robust transport. And whether you transfer 10K or 100K - who cares…

NRPE is a bit more tricky, and this comes from its internal implementation. In the original version it is a one-buffer transport and will fail if you don't adjust the small buffer sizes in common.h:

Ton Voon has provided an improvement which breaks this limitation. The best thing about Ton's patch is that you don't have to upgrade all machines at once; you can do it step by step, which is especially helpful for large installations.

Note: if you are running NRPE on Linux machines with a kernel before release 2.6.11, you will only be able to transport one buffer. This is an effect of the old single-buffered pipe implementation. In 2.6.11 Linus Torvalds himself implemented a ring buffer which allows circular pipes. With the default kernel pipe size of 4K and 16 buffers, NRPE can now transport 64K. So if you still have problems with truncated NRPE data, watch out for kernels 2.6.10 and below.
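If you want to check whether a remote Linux host is affected, a small sketch like the following compares the running kernel against 2.6.11 (the version parsing is deliberately simplified; suffixes like "-smp" are ignored):

```shell
#!/bin/sh
# version_at_least RELEASE MAJ MIN PATCH
# returns 0 if RELEASE (e.g. "2.6.10-smp") is at least MAJ.MIN.PATCH
version_at_least() {
    rel=$1; wmaj=$2; wmin=$3; wpat=$4
    maj=$(echo "$rel" | cut -d. -f1)
    min=$(echo "$rel" | cut -d. -f2)
    pat=$(echo "$rel" | cut -d. -f3 | sed 's/[^0-9].*$//')
    [ -z "$pat" ] && pat=0
    [ "$maj" -ne "$wmaj" ] && { [ "$maj" -gt "$wmaj" ]; return; }
    [ "$min" -ne "$wmin" ] && { [ "$min" -gt "$wmin" ]; return; }
    [ "$pat" -ge "$wpat" ]
}

if version_at_least "$(uname -r)" 2 6 11; then
    echo "circular pipes available: NRPE can pass up to 16 x 4K = 64K"
else
    echo "single-buffer pipes: NRPE output is capped at one 4K buffer"
fi
```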

NSCA is the nasty end - and if you ask me, it needs a reimplementation. There are several implementation itches which no longer fit into the current Nagios world:

NSCA does not scale very well: it passes all messages to the Nagios CMD interface, which is well known for its traffic jams in large installations. And NSCA is often used precisely in such large installations to circumvent the Nagios scheduling bottleneck.
There are numerous enhancements on both NSCA's sender and receiver side, but IMHO the only well-performing approach to inserting check results into Nagios is via the checkresults interface.

NSCA does not allow multiline output: it reads the input up to the first newline and that's it.

Recommendations for check_multi?
After our small walk through the puzzling world of Nagios transports, the conclusion for the use of check_multi is pretty clear: NRPE and SSH work well, while NSCA is the black sheep of the family.
But this does not need to be a real disadvantage: in a check_multi driven Nagios infrastructure you don't need that many passive services with NSCA, because you can use active check_multi services instead.
In the end this means no more need for freshness checks and no more need for sophisticated distributed setups.

I confess - this release should have been launched much earlier. I know the OSS mantra: release early, release often. But there were so many enhancements, redesigns and fixes in a row that I simply failed to promote the trunk version into a new stable release.

statusdat [ TAG ] = host:service
gathers states and output from existing service checks and integrates them seamlessly into your existing checks. This is a good way to build Business Process Views using check_multi without re-executing existing service checks.

One more thing about this new statusdat function:
when you specify wildcards for hosts and services, check_multi automatically expands them into additional child checks.
And the data gathering from status.dat is done efficiently with a caching mechanism.
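A sketch of such statusdat lines might look like this (host and service names are placeholders, and the wildcard notation should be verified against your check_multi version):

```
# pull state and output of an existing check from status.dat
statusdat [ web ] = webserver:HTTP

# wildcards are expanded into one child check per matching service
statusdat [ disks ] = *:DISK*
```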

Support for passive feeding: there are now several ways to feed check_multi results directly into Nagios:

via check_result files (direct and very fast)

via send_nsca (needs nsca daemon on Nagios side)

as a chain of commands: one check_multi sends, the other check_multi receives and inserts all child checks into the Nagios queue.
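The check_result file route can be sketched like this. The field names follow the Nagios 3 checkresults format as far as I know it, but check your version's own spool files for the exact set; the spool directory below is a stand-in for the check_result_path configured in nagios.cfg:

```shell
#!/bin/sh
# Hand-craft a Nagios check-result file (sketch with placeholder values).
SPOOL=${SPOOL:-/tmp/checkresults-demo}
mkdir -p "$SPOOL"

file=$(mktemp "$SPOOL/cXXXXXX")
now=$(date +%s)
cat > "$file" <<EOF
### Passive Check Result ###
host_name=webserver
service_description=HTTP
check_type=1
early_timeout=0
exited_ok=1
return_code=0
start_time=$now.0
finish_time=$now.0
output=HTTP OK - 0.042 second response time
EOF

# Nagios ignores the file until the matching .ok marker shows up,
# so half-written results are never read.
touch "$file.ok"
echo "queued $file"
```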

eval is no longer counted in the number of plugins
eeval is visible and therefore counted; eval was counted but not visible, and this confused some people. Now it's not counted any more.

the report option -r can be specified as a chain of plus-separated numbers instead of a sum: -r 1+2+4+8 is more readable than -r 15, isn't it?

eval and eeval perl snippets no longer need to be written with a trailing \.
This allows comment lines within the code as well as direct copy and paste from perl scripts.
Nevertheless, the old trailing \ is still valid, so nobody needs to rewrite their command files.

At last: configure-based installation, added tests (make test) and a consolidated directory structure.
By the way: if you had a complicated configure line and deleted your config.log, no problem: call check_multi -V and it will print the complete configure line:

This version is working properly on 1,000 European servers in the data centers of the telecommunication company I work for. If you find oddities or bugs anyway, please report them in the German Nagios-Portal or send me a mail.

One day a customer complained that he could not access a local intranet server. But I was sure that the server was running - Nagios showed green lights everywhere.
But when I remotely connected to the customer's PC, I had to admit that he was right: there was no connection from his network to the intranet server due to routing problems. Whoops…

That afternoon I thought about ways to cover this situation in our Nagios monitoring. Nagios has no obvious or generic solution for this problem, except setting up multiple checks from different hosts.

But wait a little - there is a conceptual problem with this. All these checks are associated with hosts other than the ones they really belong to. This confuses the whole process (and the administrator as well): notifications and escalations are based on the wrong host, and the statistics / SLAs are affected too.

check_multi provides a simple but effective solution for this scenario: distributed monitoring which works as a service associated with the target host.

You need:

remote access to some hosts in the wanted subnets
(i.e. normally some of your monitored servers)

a check plugin on each of these hosts which can monitor the service
(e.g. check_tcp)

You will then:

schedule a check on each of these hosts towards your target host / target service

add a flexible state evaluation of your results:

either all results have to be OK for the overall state being OK

or only some results OK will set the overall state to OK

That's it!

And more: you can do this with one generic check_multi command file. A few parameters control which hosts are checked and which check_command is used to examine the service.

As you can see in the source below, you can also set other parameters from the command line.
But there are already some reasonable defaults available:

TIMEOUT (default: 2)
this default is shorter than the normal default of 10 seconds. That is feasible here because we have multiple checks where only some have to succeed; others may fail without influencing the overall result.
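Put together, such a generic command file might be sketched like this. The host names, macros and the COUNT() expression are illustrative; verify the exact macro and state expression syntax against your check_multi version:

```
# one child check per vantage point, all probing the same target service
command [ paris  ] = check_by_ssh -H paris  -C "check_tcp -H $TARGET$ -p $PORT$ -t $TIMEOUT$"
command [ berlin ] = check_by_ssh -H berlin -C "check_tcp -H $TARGET$ -p $PORT$ -t $TIMEOUT$"
command [ madrid ] = check_by_ssh -H madrid -C "check_tcp -H $TARGET$ -p $PORT$ -t $TIMEOUT$"

# 'some results OK' policy: only go critical if more than one probe fails
state [ critical ] = COUNT(CRITICAL) > 1
```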

What is happening here?
The first part, the sensor command, is just a plain snmpget, as you probably often use in self-written plugins.

But instead of a big if-else clause, check_multi uses state expressions to map SNMP values to the different result states OK, WARNING and CRITICAL.

UNKNOWN has a special role here: it is used when the SNMP value is not a member of a specific group of numbers.

So this example really does no more than a standard plugin could do. But it shows how fast and reliably you can develop such an SNMP plugin with check_multi. And if you want to change a value afterwards, you don't need to bother a developer - any administrator can do it as well.
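The described pattern might be sketched like this. The OID is UPS-MIB's upsBatteryStatus (1=unknown, 2=batteryNormal, 3=batteryLow, 4=batteryDepleted); treat the macro names and expression syntax as illustrative and check them against your check_multi version:

```
# plain snmpget as the sensor command
command [ battery ] = snmpget -v1 -c public -Oqv $HOST$ .1.3.6.1.2.1.33.1.2.1.0

# state expressions instead of a big if-else clause
state [ ok       ] = battery == 2
state [ warning  ] = battery == 3
state [ critical ] = battery == 4
state [ unknown  ] = battery < 2 || battery > 4
```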

There is an ongoing discussion about whether monitoring checks should be run remotely or locally on the server which is to be monitored.
The decision is sometimes easy, when the resource to be monitored is not available remotely, e.g. logfiles or disks.
But there are plenty of cases where you can do both, e.g. applications and services which are accessed via the network. Please don't think that network services have to be monitored remotely because otherwise there's no proof that they work over the network. You can give exactly one proof with your Nagios check, and that is for the Nagios server - where the server's users normally do not reside.
So why not execute all checks on the remote server?

There are indeed some reasons for this approach:

All checks are consistent.

The transport problem only has to be solved once.

The customer pays for his own monitoring in terms of performance. The Nagios server does not need to bear the load of multiple checks; it only receives the results.

But the disadvantage of this approach lies just here - when the customer comes and asks:

what the heck are all these 'nagios' processes doing on my server?

And why are they running so often?

Generally there is no generic solution for all cases. I have prepared a small table to help you find the criteria for your specific situation:

1. check_multi running on the Nagios server

2. check_multi running on the client server

Basic concept

1. check_multi runs on the Nagios server and accesses the remote server with each child check independently.
2. check_multi runs on the remote server (here called the client) and executes the child checks locally.

Resources / load

1. The Nagios server bears most of the execution load and needs more power; the footprint on the client server is not that big.
2. The load on the Nagios server is very small; the major work is done on the client, which therefore carries the burden of its own monitoring.

Network

1. The network load is higher; especially with the SSH transport the effort grows due to multiple authentication steps.
2. The network load is very small; there is only one connection to trigger the startup and transport the results back.

Configuration

1. The configuration is easily accessible; there is no need to distribute configuration files.
2. The remote client needs its configuration to be distributed and updated.

Plugins

1. As with the configuration, it is sufficient to provide plugins only on the central Nagios server.
2. Plugins have to be distributed and updated.

Local monitoring vs. remote monitoring

1. There are particular local checks, like disk monitoring, which take a lot of effort to run remotely.
2. Every remote check can also be run as a local check. Vice versa this means that all checks can be run from the remote server, even checks for network services - a plus in terms of homogeneity of monitoring.

The Nagios world consists of hosts and services. But what do you do if the hosts do not matter? This is the case with all devices which are not necessarily up all the time, but have services that need to be monitored when they are.

E.g.

Printers, which should be monitored for toner and paper

Windows clients, whose patch level should be supervised

Salesmen's notebooks, which are not connected most of the time.
But when they are, we want to check everything we can get from them.

The trick happens in the eval line: if the ping in the command line before does not succeed, we don't bother about this client any more, print a short message "Host offline" and exit with OK.

Note: the eval command is not shown in the normal visualization.
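The command file for such a client might be sketched like this. The macro names and the exact eval syntax are illustrative (eval bodies are perl snippets, as mentioned above), and the toner OID is only an example:

```
# is the client reachable at all?
command [ ping ] = check_icmp -H $HOST$ -w 1000.0,40% -c 3000.0,80%

# eval: invisible child check - if the ping failed, stop here with OK
eval [ offline ] = if ("$ping$" ne "OK") { print "Host offline"; exit 0; }

# only reached when the client is online
command [ toner ] = check_snmp -H $HOST$ -o .1.3.6.1.2.1.43.11.1.1.9.1.1
```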

That's all.

BTW - users reported that it's quite a miracle when, in the late afternoon, all client problems silently disappear host by host until everything is green…