Hardware Requirements

Supported Distributions

Nagios XI is currently supported on the following Linux distributions for both 32-bit and 64-bit installations:

CentOS 5/6

RHEL 5/6

Installation Prerequisites

Important: Nagios Enterprises highly recommends and will only support installing Nagios XI on a newly installed, “clean” system (a
bare minimal install with nothing else installed or configured).

Attempting to install Nagios XI on a pre-existing system with other applications already installed can cause the Nagios XI installation
process to fail, critical system components and settings (e.g. database servers) to be modified in a way that negatively affects other
applications, and previously installed applications to be automatically upgraded or removed. While installing XI on a system with other
applications is possible, it is not recommended due to the possible interactions and complexity of multiple components that are required
for Nagios XI to function. If you choose to ignore these warnings, you do so at your own risk.

Is it possible to use SMS alerts for a custom SMS gateway?

Yes! Nagios XI sends SMS alerts via email. Although we currently don't have a solution that allows users to define custom SMS gateways, the best way to get around this is to define a contact with an email address that will send the SMS message. Email address examples are as follows:

Problems Using Nagios XI With Proxies

We do not officially support Nagios XI when you install and use proxy software that restricts traffic to or from the Nagios XI server. There are several reasons for this. First, Nagios XI requires external access for package installation and updates. Package installation and updates may not work when proxies are used. Additionally, the Nagios XI code makes several internal HTTP calls to the local Nagios XI server to import configuration data, apply configuration changes, process AJAX requests, etc. These functions may not work properly when you deploy a proxy, which would result in a non-functional Nagios XI installation.

Two things need to be configured to make the XI installation work with a proxy: the yum and wget configurations. Configure both before starting the installation process.

In /etc/yum.conf :

# Shouldn't need to be quoted; remember the trailing slash
proxy=http://someproxyserver:port/
# The username you authenticate to your proxy with, if applicable
proxy_username=myname
# The password you provide to your proxy, if applicable
proxy_password=mypass

In /etc/wgetrc :

# All in one string this time
http_proxy=http://myname:mypass@someproxyserver:port/
# Hosts to exclude from proxying
no_proxy=localhost,127.0.0.0/8,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16

Quoting is not needed (or helpful) in any of these, but if you have special characters in passwords (especially : or @) and are having problems you probably need to escape them with backslashes.

Update Check Behind a Proxy
Update checks are known to fail for systems behind a proxy. We created a proxy component that should allow the update check to work behind most proxies. Install this component from the Admin->Manage Components page and then access the Admin->Proxy Configuration page to configure the proxy settings.
[Proxy Component]

Installation and Upgrade Problems

CentOS 6 Installation Problems

Between the releases of Nagios XI 2011R1.7 and 1.8, several changes were made to the CentOS 6 repo that created package conflicts, preventing the Nagios XI installation scripts from completing successfully. This usually becomes apparent by the "fullinstall" script failing with one of the following two messages:

Resolving "DB Connect Error [nagiosxi]: Database connection failed"

The problem we identified is that under GNOME the PATH used to locate the "service" command gets changed. The PATH needs to be set correctly so that the scripts starting with 3-dbservers will run.
You can test if the path is set correctly by trying the following commands:

service httpd restart
service postgresql restart

The important thing is that your PATH includes the "sbin" directories. Normally it would look like this, although this isn't the only "correct" answer possible:
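Whatever the exact value, a quick way to confirm that the sbin directories are present is a small check like the sketch below. The function name path_missing_dirs is hypothetical, and the choice of /sbin and /usr/sbin in the usage comment is an assumption about which directories matter on a typical CentOS/RHEL system.

```shell
# Print every directory from the argument list that is NOT present in the
# given PATH-style string. Returns 0 when nothing is missing, 1 otherwise.
path_missing_dirs() {
    path_value=$1
    shift
    missing=0
    for dir in "$@"; do
        case ":$path_value:" in
            *":$dir:"*) ;;                      # directory is in the path
            *) echo "$dir"; missing=1 ;;        # report the missing one
        esac
    done
    return $missing
}

# Example (assumed directories):
#   path_missing_dirs "$PATH" /sbin /usr/sbin
```

If the function prints anything, run the check from the same session (for example a GNOME terminal) in which you intend to run the install scripts.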

Resolving "NSP: Sorry Dave, I can't let you do that" Errors

Session protection was added in 2009R1.2C to prevent CSRF attacks. The code that does this caused some users to see this error. The problem was due to the user's browser caching older versions of the XI JavaScript code. To prevent this from happening, clear your browser's cache. This is typically done (in Firefox) by holding down the Shift key and clicking Reload. See other well-documented procedures for clearing the browser cache.

The other possible cause of this is that the XI server's time is out of sync with the web browser. Try the following:

yum install ntp
ntpdate time.nist.gov

If that still doesn't fix the error, then you may have to specify your timezone in your /etc/php.ini file. Newer releases of PHP require this setting for your server to reflect the correct system time and timezone. To change this setting, edit the /etc/php.ini file with the following line:

date.timezone = Etc/GMT-13

Change the timezone to match your location. These zones are listed at the following URL.
PHP Timezones
After changing the setting, restart your apache server:

service httpd restart
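To confirm which timezone value is actually active before and after the edit, a sketch like the following may help. The function name php_timezone is hypothetical; it only reads the file and ignores lines commented out with ";". It does not handle a trailing comment on the same line as the value.

```shell
# Print the active (uncommented) date.timezone value from a php.ini-style
# file. Prints nothing if the directive is missing or commented out.
php_timezone() {
    sed -n 's/^[[:space:]]*date\.timezone[[:space:]]*=[[:space:]]*//p' "$1" |
        tr -d "\"'" |                 # drop optional quotes around the value
        sed 's/[[:space:]]*$//' |     # drop trailing whitespace
        tail -n 1                     # the last definition wins
}

# Example:
#   php_timezone /etc/php.ini
```

An empty result suggests the directive is not set, which matches the situation described above.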

"HTTP 500 Error"/"PHP Parse error - Unexpected $end"

For those doing manual installations, some of the tools embedded in Nagios XI use the PHP short tags feature, which is not necessarily enabled on all web servers by default. To fix this issue, locate your php.ini file (located at /etc/php.ini for CentOS installations), and verify that "short_open_tag" is set to "on". We intend to use full tags in future versions, but some components and addons may still use short tags, so we recommend leaving this setting "on".
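The check can be scripted. The sketch below uses a hypothetical function name, short_open_tag_enabled, and treats the values on, 1, true and yes as enabled, which is how PHP interprets boolean ini values.

```shell
# Return 0 if the last uncommented short_open_tag line in a php.ini-style
# file enables the feature, 1 otherwise (including when it is not set).
short_open_tag_enabled() {
    value=$(sed -n 's/^[[:space:]]*short_open_tag[[:space:]]*=[[:space:]]*//p' "$1" |
        tail -n 1 | tr -d '[:space:]' | tr '[:upper:]' '[:lower:]')
    case "$value" in
        on|1|true|yes) return 0 ;;
        *) return 1 ;;
    esac
}

# Example:
#   short_open_tag_enabled /etc/php.ini || echo "short_open_tag is not enabled"
```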

"ERROR: PostgresQL not running - exiting."

This anomaly occurs rarely during a VM setup of Nagios XI. You may try restarting the server, but in some cases you will have to start the Nagios XI install from the beginning.

You need to add the Optional software channel so that Nagios XI can install the necessary prerequisites. To do so:

yum install yum-utils

yum-config-manager --enable rhel-6-server-optional-rpms

Red Hat Network Classic:

You need to add the Optional software channel so that Nagios XI can install the necessary prerequisites. To do so, first sign in to your Red Hat Network account at http://rhn.redhat.com/. Then click on the link corresponding to your system. Near the bottom-left corner of the page, click Alter Channel Subscriptions. Check the box labeled RHEL Server Optional and click Change Subscriptions. That's it! You should be able to run the installer again and complete your installation.

"Installation errors on customized corporate builds of CentOS or RHEL"

We have seen that when companies require the use of their "standard build" of either OS, Nagios XI cannot install successfully if the umask on the machine has been modified.

"Upgrade errors - root.crontab.orig: cannot overwrite existing file"

We have seen problems when upgrading if there are leftover files from previous upgrades.

This problem can be eliminated by running the following command:

cat /dev/null > /tmp/nagiosxi/uninstall-crontab-root

After this you can proceed to run the upgrade script again.

Ajaxterm Installation Aborted

Nagios XI 2012 and late versions of 2011 set up apache to be able to utilize the Ajaxterm subcomponent. This component requires a modification to the /etc/httpd/conf.d/ssl.conf file, and adds some proxy information needed specifically for ajaxterm. If apache fails to restart after these modifications are made, the previous configuration is rolled back in order to keep apache running and the system usable. The bad configuration that the fullinstall/upgrade script attempted to apply is saved to /etc/httpd/conf.d/ajaxterm.fail. This file can be debugged and, once fixed, can replace the existing ssl.conf file in order to utilize Ajaxterm. Note: Remove the /etc/httpd/conf.d/ajaxterm.fail file once the issue is resolved to avoid the error message in the UI. Please contact the Nagios support team with any questions.

Configuration Problems

Apply Configuration Fails: General Troubleshooting

If you receive an error while attempting to Apply Configuration stating that the configuration verification has failed, there is some sort of syntax error or configuration conflict in the configuration that's been defined. You can isolate this issue by accessing the Core Config Manager->Configuration Snapshots page. You should see the most recent snapshot highlighted in red. View the text file from the snapshot to see which config file contained the error. You can then find that file in the associated tar.gz file and search for the problem based on the error message. The snapshot represents the information that is CURRENTLY in the CCM database, which Nagios attempted to save. You'll need to correct the issue through the Core Config Manager, then attempt to Apply Configuration again.

The Write Config Tool in the CCM is a manual tool for writing the DB information to the configuration files (it manually Applies Configuration). It's important to know that Nagios cannot start or restart with a bad configuration. The config verification must pass in order for Nagios to be able to restart successfully with the new configuration.

Configuration Applies, but still get "Configuration File Is Out Of Date" Error

If your configuration is applying successfully and the changes are visible in the XI interface, but you're still seeing an error message in the CCM that says "Configuration File Is Out Of Date", then you may have to specify your timezone in your /etc/php.ini file. Newer releases of PHP require this setting for your server to reflect the correct system time and timezone. To change this setting, edit the /etc/php.ini file with the following line:

date.timezone = Etc/GMT-13

Change the timezone to match your location. These zones are listed at the following URL.
PHP Timezones
After changing the setting, restart your apache server:

service httpd restart

Apply Configuration Fails, No Configuration Problems

As of 2011 R1.7, extra sanity checks were added to the Apply Configuration functionality of Nagios XI to prevent false positives and also to prevent that page from stalling out endlessly. An example error that can show up is:
"Backend login to the Core Config Manager failed"

There are a few different reasons an error like this can show up. The most common one is the use of a proxy that prevents "wget" from being able to resolve to "localhost" correctly. However, if you receive an error message when attempting to Apply Configuration other than "Configuration Error...," run the following commands and send the output file to the Nagios XI support team.

And attempt to Apply Configuration from the web interface. After the browser has returned some output to the screen, press Ctrl+C to stop the log tail, and send XI support the cmd.txt file and the reconfig.txt that was generated by the above instructions.

Apply Configuration Page Stalls Out, Never Completes

If you attempt to Apply Configuration and you're seeing the following output:

and the configuration never applies, the page may be timing out. If you've recently updated XI, try restarting the server first. If you're currently running Nagios XI 2011R1.3 there is a known bug that can cause this issue. You'll need to upgrade to the latest version to resolve the issue. If that does not resolve the issue, try editing the configuration for your PHP settings. Open /etc/php.ini file in a text editor and increase the following values.

Note: If you're running a large installation with several thousand hosts/services, you may need to increase these numbers more to allow enough time and memory for large configuration changes to take effect.

If the issue persists after the above solutions, the issue could be caused by creating a local DNS entry for the Nagios XI server, but failing to add that name entry to the Nagios XI server itself. For example, if you're accessing the XI server from the following URL: http://nagiosserver/nagiosxi, you need to verify that the XI server can also resolve that DNS name correctly. The local DNS entry for the XI server needs to be added to the /etc/hosts file.
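One way to check whether a name is already present in the hosts file is sketched below. The function name hosts_file_has_name is hypothetical, and the comparison is case-sensitive, which is stricter than real hostname matching.

```shell
# Return 0 if NAME appears as a hostname or alias on a non-comment line of
# HOSTS_FILE (first field is the address, later fields are names).
hosts_file_has_name() {
    awk -v name="$2" '
        { sub(/#.*/, "") }                                   # drop comments
        { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
        END { exit found ? 0 : 1 }
    ' "$1"
}

# Example, using the name from the URL above:
#   hosts_file_has_name /etc/hosts nagiosserver || echo "nagiosserver is not in /etc/hosts"
```

Running "getent hosts nagiosserver" on the XI server is a complementary check, since it shows what the system resolver actually returns for the name.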

You can observe similar issues if you run out of disk space.

Configuration Applies, No Changes Take Place

This is generally due to permissions issues with the configuration file. Use the Write Config Tool in the Core Config Manager to see if you can manually write the DB information to the config files. If the Write Config Tool returns error messages related to permissions you can run the following script to correct the permission settings:

/usr/local/nagiosxi/scripts/reset_config_perms

There is a known bug in XI 1.3E and F where this script was not automatically running when configurations were applied. If you're running a Nagios XI version earlier than 1.3g, we recommend updating to correct this issue.

Modifying The Contents Of /usr/local/nagios/etc

You can keep custom configuration files in the /usr/local/nagios/etc/static directory

Don't modify config files directly in /usr/local/nagios/etc, as they will be overwritten by the Core Config Manager

Unable To Delete Hosts

Hosts can only be deleted after all of their dependent services and associated relationships have been deleted. Make sure to delete any associated services or other objects before deleting the host.

Host Still Visible After Deletion: (Ghost Hosts)

If you have successfully deleted a host and all of its services from the Core Config Manager, but you're still seeing it in the status tables, then you most likely have multiple instances of Nagios running on your machine. To make sure all instances are stopped, run the following at the command line.

killall nagios
service nagios start

Host Still Visible In XI After Deletion From the CCM

Go to the Core Config Manager->Write Config Tool, and use that tool to
manually write out the configuration data to file. Verify your
configuration. If it verifies, go ahead and restart Nagios.

If by chance the host and all of its services are completely deleted in
the Core Config Manager, and the actual host config file is still there
after using the Write Config Tool, then go ahead and delete the config
file. The files will be located in the following directories.

/usr/local/nagios/etc/hosts
/usr/local/nagios/etc/services

On rare occasions the CCM will somehow lose a file. We haven't
nailed down what causes it, but it is usually related to deleting the
host.

Network status map parent/child relationship not updating (v1.3)

Underneath the Parents box in the CCM, make sure the "standard" radio button is selected. If "null" is selected your parent host selection doesn't get written to disk. We're working on a method of fixing the CCM so this doesn't happen with several fields.

Warning: Duplicate definition found for contact 'xi_default_contact'

This usually happens if you import the "static" directory config files in Nagios XI. When you try to apply configuration, you see an error, similar to this one:

Core Config Manager Problems

GUI Issues

Most of these are related to IE's implementation of JavaScript. If possible, use a browser that more closely implements the ECMAScript Language Specification.

If the Core Config Manager is not visible or components are missing from the page, the cause is generally a proxy. The following thread covers how to address this issue:
Nagios Core Config Manager not showing up.

Configuration Changes

If you make changes to your configuration and they are not reflected in XI, it may be due to file permissions. Here are two options to try:

Reset File Permissions

Execute the following command to reset your configuration file permissions.

/usr/local/nagiosxi/scripts/reset_config_perms

You can also view if you have any permissions related issues by accessing the Admin->Check File Permissions page in the XI interface (v1.3g+).

Restoring Default Configuration

If you've somehow messed up your configurations irreparably, or simply want to reset a test system, you can restore the configuration to the defaults as shipped with XI. To do so, download these two files and transfer them (via SCP) to your XI server: restore_defaults.sh and nagiosql_defaults.sql
Then, log into the console of your XI server, and in whatever directory you put those two files run these commands:

chmod +x restore_defaults.sh
./restore_defaults.sh

This will delete all of your hosts and services and reload just the demo ones that were initially set up.

Making A Mass Change In The CCM

Changing The Field Entry For A Large Amount Of Objects

Occasionally admins need to change a specific setting for a large number of services or hosts, and this change can't be made from a template. Although we highly recommend the use of templating whenever possible, sometimes it's just not possible to make the change there. Our unofficial solution for this is to write a SQL query that manually updates the DB fields you need changed. NOTE: Test your queries on a single test host/service first, and try this solution at your own risk; we are not responsible if you break something with this! Here's an example a user posted of a change made to the check_interval for all 'Disk Monitor' services.

If the change you wanted was successful, Apply Configuration to write the changes to the config files.

Using Scripts To Make Changes in the CCM

Some admins make use of internal scripts to update and maintain their monitoring environment. Although we're only able to offer limited support on a situation like this, a useful script to know about is:

/usr/local/nagiosxi/scripts/reconfigure_nagios.sh

This is the command-line version of "Apply Configuration" in the XI interface. It will write the CCM DB info to the config files and restart Nagios.

To automate importing configs using scripts, you can simply place config files in the /usr/local/nagios/etc/import directory, and then run the reconfigure_nagios.sh script. This will handle the import to the DB, writing the configs, verification, and then restarting Nagios.
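The import workflow described above could be wrapped in a small helper. This is only a sketch: the function name stage_configs_for_import and the source directory /root/new-configs in the usage comment are hypothetical, while the import directory and the reconfigure script are the ones named in this section.

```shell
# Copy every *.cfg file from SRC_DIR into IMPORT_DIR and print how many
# files were copied. Returns 1 if a copy fails.
stage_configs_for_import() {
    src_dir=$1
    import_dir=$2
    count=0
    for cfg in "$src_dir"/*.cfg; do
        [ -f "$cfg" ] || continue          # skip when the glob matches nothing
        cp "$cfg" "$import_dir"/ || return 1
        count=$((count + 1))
    done
    echo "$count"
}

# On an XI server this might be followed by the reconfigure script:
#   stage_configs_for_import /root/new-configs /usr/local/nagios/etc/import
#   /usr/local/nagiosxi/scripts/reconfigure_nagios.sh
```

Keeping the copy step separate from the reconfigure step makes it easy to inspect what was staged before Nagios is restarted.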

Currently there is not a streamlined way to remove hosts and services from the Core Config Manager using scripts. We hope to have features like this implemented in 2012.

Performance Graph Problems

General Performance Graph Troubleshooting

Performance graphs can experience a number of issues. Below are solutions for the most common problems. The log verbosity should be increased before troubleshooting, and should be returned to default settings once resolved.

Increase Performance Data Logging Verbosity. Edit the file:

/usr/local/nagios/etc/pnp/process_perfdata.cfg

Change:

LOG_LEVEL = 0

To:

LOG_LEVEL = 2

Save out and restart NPCD:

service npcd restart

The process_perfdata.pl script should now log all errors and debug information to:

/usr/local/nagios/var/perfdata.log

Remember to return this value to its default setting when troubleshooting is completed.

Increase NPCD Logging Verbosity. Edit the file:

/usr/local/nagios/etc/pnp/npcd.cfg

Change:

log_level = 0

To:

log_level = -1

Save out and restart NPCD:

service npcd restart

NPCD should now log all errors and debug information to:

/usr/local/nagios/var/npcd.log

Remember to return this value to its default setting when troubleshooting is completed.

Perfdata Timeout

As many installations grow, the perfdata processing timeout value may need to be increased. Check the perfdata log for any recent timeout errors:

tail -50 /usr/local/nagios/var/perfdata.log | grep TIMEOUT

If the grep found any recent errors, change the TIMEOUT by editing the file:

/usr/local/nagios/etc/pnp/process_perfdata.cfg

Change:

TIMEOUT = 5

To:

TIMEOUT = 20

As your installation grows further, this value may need to be increased even more.
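If you check these logs regularly, the tail and grep step can be wrapped so that it returns a count. The sketch below uses a hypothetical function name, count_recent_matches, and defaults to the last 50 lines as in the command above.

```shell
# Print how many of the last LINES lines (default 50) of LOG_FILE contain
# PATTERN. Prints 0 if the file is missing or unreadable.
count_recent_matches() {
    log_file=$1
    pattern=$2
    lines=${3:-50}
    [ -r "$log_file" ] || { echo 0; return 0; }
    # grep -c exits non-zero when the count is 0, so mask that status
    tail -n "$lines" "$log_file" | grep -c -- "$pattern" || true
}

# Example:
#   count_recent_matches /usr/local/nagios/var/perfdata.log TIMEOUT
```

A non-zero count corresponds to the case above where grep found recent timeout errors.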

NPCD Load Threshold

Bulk NPCD processing has a load threshold setting that is intended to halt performance processing if the system is under heavy load. Large installations will need this value increased and NPCD restarted.

Check the NPCD log for load warnings (if the log file does not exist, increase the log level, restart npcd, and wait 5 minutes before proceeding):

tail -50 /usr/local/nagios/var/npcd.log | grep "MAX load reached"

If any recent errors are found, increase load threshold by editing the file:

/usr/local/nagios/etc/pnp/npcd.cfg

Change:

load_threshold = 10.0

To:

load_threshold = 20.0

Save out and restart NPCD:

service npcd restart

For really large installations, or servers with minimal resources, you may need to increase the npcd load_threshold and perfdata TIMEOUT even more than is suggested above.
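The Change/To edits in this section can also be applied from the command line. The following is a minimal sketch, assuming simple "key = value" lines and values without special characters; the function name set_cfg_value is hypothetical. It keeps a .bak copy of the file so the default can be restored after troubleshooting.

```shell
# Replace the value of an existing "KEY = VALUE" line in CFG_FILE, keeping a
# backup in CFG_FILE.bak. Returns 1 if the key is not present (commented-out
# lines do not count).
set_cfg_value() {
    cfg_file=$1
    key=$2
    value=$3
    grep -Eq "^[[:space:]]*$key[[:space:]]*=" "$cfg_file" || return 1
    cp "$cfg_file" "$cfg_file.bak" || return 1
    # Rewrite from the backup so the original file keeps its owner and mode
    sed "s|^\([[:space:]]*$key[[:space:]]*=[[:space:]]*\).*|\1$value|" \
        "$cfg_file.bak" > "$cfg_file"
}

# Example, using the values suggested above:
#   set_cfg_value /usr/local/nagios/etc/pnp/npcd.cfg load_threshold 20.0 &&
#       service npcd restart
```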

Performance Graphs Are Missing Or Not Displayed

This can happen for a variety of reasons, but there are several simple solutions that resolve this issue for most people:

Make sure you're using the latest version of Nagios XI. Old releases may have issues that will not necessarily be resolved by the solutions below. See Upgrading Nagios XI.

Verify that process_perfdata.pl has correct permissions. Make sure that the file /usr/local/nagios/libexec/process_perfdata.pl has execute permissions and is owned by nagios:nagios.

2011 R1.8 Fix. There is a known bug on some XI installs for this release that have incorrect permissions for the performance data directory. This can be resolved by running the following command as the root user.

chmod -R +x /usr/local/nagios/share/perfdata/

1.6 and 1.7 RHEL/CentOS 6 Users. There were some hiccups with the repos which caused a necessary component for MRTG graphing to not be installed. This is a very simple fix. Log into the CLI of your Nagios XI server as root, and type:

yum install bc

That should fix the graphing issues. Note that this does not apply to versions of Nagios XI later than 1.8.

Run the command manually. Try running the command that Nagios XI runs to check status of a device. For instance, when monitoring a router or switch, Nagios XI uses the check_rrdtraf plugin. Test running this plugin manually by navigating to your libexec directory and running a check, similar to the following:

If those folders are not writable and readable by Nagios, then that is the problem and you should set write and read access for Nagios. Please note that all files contained in these folders also need to be writable and readable by nagios.

Reset File Permissions

Execute the following command to reset your configuration file permissions.

/usr/local/nagiosxi/scripts/reset_config_perms

You can also view if you have any permissions related issues by accessing the Admin->Check File Permissions page in the XI interface (v1.3g+).

Make sure you have not removed or renamed the nagiosadmin user. This user is the nagios equivalent to 'root user' and should never be removed.

Make sure your password for Nagios XI only contains alpha-numeric characters. Some users have reported graphs disappearing from using special characters, creating a permissions issue.

Performance graphs are pulled via an internal proxy, so users with their Nagios server behind their own proxy or using strict SSL settings may experience problems viewing graphs. If you're using an environment with a proxy or SSL and having issues viewing graphs, post the problem to our support forums and specify your use of proxy or SSL right away.

Having an internal DNS hostname that is not defined on the XI server can also cause problems with the internal proxy call. If you've defined a custom DNS host entry for your XI server, make sure it's defined in your /etc/hosts file as well. For further information on this, contact our support team at support.nagios.com/forum.

Network Performance Graphs Are Displayed But Have No Data

2011R3.2 and 3.3 issues: graphs display but are empty. Try running the following commands to see if an excessive number of performance data files have built up.

cd /usr/local/nagios/var/spool/xidpe
ls -f | wc -l
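For repeated checks, the count can be compared against a limit. The sketch below is an illustration only: the function name spool_backlog_exceeds is hypothetical, and the limit is whatever you consider "very large" for your installation. Unlike the ls -f count, it counts regular files only, so the "." and ".." entries are not included.

```shell
# Print the number of regular files directly inside DIR and return 0 if
# that number is greater than LIMIT, 1 otherwise.
spool_backlog_exceeds() {
    dir=$1
    limit=$2
    count=$(find "$dir" -maxdepth 1 -type f | wc -l | tr -d '[:space:]')
    echo "$count"
    [ "$count" -gt "$limit" ]
}

# Example with an arbitrary, assumed limit:
#   spool_backlog_exceeds /usr/local/nagios/var/spool/xidpe 10000 && echo "backlog detected"
```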

If the file count is very large, run the following commands, which should restore regular performance graphing.

A fuller description of this problem is when you are monitoring a switch or router, but its bandwidth graphs are always zero when you know for sure they should have data. Keep in mind, be absolutely sure that the graphs should have data.

Make sure the /var/lock/mrtg directory exists. This directory has been seen to disappear occasionally. It is trivial to recreate it.

mkdir /var/lock/mrtg

Make sure none of the mrtg.cfg entries are using SNMP v2c. Older versions of the Switch Wizard called mrtg with arguments for SNMPv2c, which MRTG does not use. Open up /etc/mrtg/mrtg.cfg and look for

Notice that after the multitude of colons, there is a 1, this represents the SNMP version MRTG will use to poll the device. If this is instead 2c, change it to 2 and save the file. This will need to be done to every metric that is affected by being created with 2c.
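To locate affected entries without scanning the file by hand, a search like the sketch below may help. It assumes that each affected entry is a line beginning with "Target[" whose last colon-separated field is 2c; that layout is an assumption for illustration, so compare the matches against your own mrtg.cfg before changing anything. The function name find_snmp_2c_targets is hypothetical.

```shell
# Print (with line numbers) the Target lines in an MRTG config file whose
# final colon-separated field is "2c". Returns 1 if there are none.
find_snmp_2c_targets() {
    grep -nE '^[[:space:]]*Target\[[^]]*\]:.*:2c[[:space:]]*$' "$1"
}

# Example:
#   find_snmp_2c_targets /etc/mrtg/mrtg.cfg
```

The function only reports matches; the edit from 2c to 2 is still made by hand as described above.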

Can I Migrate Performance Data From A Different Install?

RRD performance data files are architecture-dependent binary files, so a simple file transfer only works when the architecture matches on both machines. If you want to migrate files from a 32-bit to a 64-bit machine, you'll have to convert the data to XML and import it into RRDs on the new machine. Forum user srrhd was kind enough to supply the commands used for a working migration:

Check if Notifications are enabled globally - click on the "Monitoring Process" menu on the left from the Home page, and make sure you see a green dot next to the Notifications in the "Monitoring Engine Process" window. You can enable/disable Notifications by clicking on the "Action" button on the right hand side.

Check if Notifications are enabled for the user currently logged into Nagios XI - click on the username in the upper right corner next to "Logged in as: ...", then click on "Notification Preferences" under "Notification Options" from the left panel menu. Make sure that the "Enable Notifications" check-box is checked.

Review the selected Notification Types - the user will be notified only for the host/service states that are selected.

From the same page, click on "Notification Methods" and make sure a Notification Method is selected.

3. Host/Service Notification Options

Check if Notifications are enabled for a particular host/service. If you are having issues with Notifications for a particular Host or Service, log into the Core Config Manager and click on "Hosts" or "Services" under "Monitoring" from the left panel menu. Find your Host or Service and click on the "Modify" Action button to the right. Click on "Alert Settings" tab and verify that the "on" radio button next to the "Notification Enabled" is selected.

Make sure that the Check Period under the "Check Settings" tab is equal to or larger than the Notification Period under the "Alert Settings" tab on the Host/Service Management page in the CCM. If Nagios is not checking a host or service during a specific time, then it will certainly not send notifications during that time.

Check the "Alert Settings" tab under the Host/Service Management page in the CCM for two things:

- Make sure "Notification enabled" is not set to "off".

- See which options are selected under "Notification options", because this will determine the states of hosts/services that you will be notified for.

Note: If you are having issues with many hosts and services, you should check the templates you are using - "xiwizard_generic_host" and "xiwizard_generic_service" should be the first ones to be checked. Any changes you make in these templates will affect all hosts and services that reference them. You can override this by modifying the host or service configuration itself. If you need to know more on the topic, please read the full explanation of Nagios object inheritance here:
http://nagios.sourceforge.net/docs/3_0/objectinheritance.html

4. Contacts

The contact must be either directly associated with the host or service or be part of a contactgroup that is connected to the host or service.

Make sure users and contacts that were added within Nagios XI are set up with the proper notification handlers:

If you are using Users, which are also Contacts (you've added a Contact to them):

If you are not receiving notifications, it is also possible that the nagiosadmin user was set to use the generic_template contact template, which resulted in notifications not being controlled through the XI interface.
This can be corrected by changing the user's contact template to xi_generic_template in the Core Config Manager. This bug was corrected in 2009R1.2 and only affects systems that had/have previous versions installed.

5. Contact Timeperiods

Each contact has a timeperiod management option that determines when they get notifications. Closely review whether any time exclusions are set within the contact's timeperiod. These are times when the user will not be sent notifications.

6. Acknowledgements and Scheduled Downtime

If the problem has been acknowledged or the host/service is in downtime, alerts won't be sent.

7. Testing From Host or Service (Sending Custom Notification)

If you proceed to the host or service in question on the Nagios server and then select the Advanced tab, you can send a test email (custom notification) from the specific host or service that you are testing.

8. Tracking Notifications

If you go to Home->Incident Management->Notifications you should see whether Nagios is sending notifications based on the settings you have chosen and to the appropriate contacts. This tool helps you determine whether Nagios intends to notify the appropriate contact.

Test Emails Fail, "Invalid address" Error

We identified a bug in 1.9 and some earlier versions where test emails to addresses like "root@localhost" or "user@xiserver" fail to send because they do not pass email address validation. The address needs to have some sort of domain at the end of it to pass validation and send. The browser may falsely display a success message for Users testing from their "Send Test Notification" page, but it will show an error message if a user runs the test from the Admin->Manage Email Settings->Send A Test Email page. This bug will be fixed in R1.10. As a workaround in the meantime, make sure the Nagios XI Sending Address on the Admin->Manage Email Settings page is set to an email address with an FQDN. The address listed below will also work:

Nagios XI <root@localhost.localdomain>

Make sure initial setup for the Admin->Manage Email Settings page has been done and that you've pressed Update on the email settings.

This bug can be identified by a debug message showing up at the top of the test email page that says "Invalid address:".

This bug is specific to installations using PHP 5.2 or later.

XI Display Problems

Tables Displaying A Count, But No Results

A recent issue has been identified where characters outside of the ASCII table are being generated by some of the check plugins, which causes an issue with XI's XML generation. The result is a table with a returned count of services, but no actual table data. This issue can be verified by checking the following url:

http://<serveraddress>/nagiosxi/backend/?cmd=getservicestatus
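
To narrow down which plugin output contains the offending characters, a saved copy of the XML can be scanned for bytes outside the ASCII range. The following is a minimal sketch, not part of Nagios XI: the function name and the /tmp file name are hypothetical, and GNU grep is assumed.

```shell
#!/bin/sh
# find_non_ascii FILE: print "line:content" for every line that holds a byte
# outside printable ASCII or whitespace. LC_ALL=C makes grep match on bytes;
# -a stops grep from treating the file as binary.
find_non_ascii() {
    LC_ALL=C grep -an '[^[:print:][:space:]]' "$1"
}

# Example (hypothetical file name; run on the XI server):
#   curl -s 'http://<serveraddress>/nagiosxi/backend/?cmd=getservicestatus' > /tmp/servicestatus.xml
#   find_non_ascii /tmp/servicestatus.xml
```

Any line reported here is a candidate for the check output that breaks the XML.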

If this XML page returns an error, it should identify the line number of the issue which can be found in the page source. Below is a code patch that will be included in the next update of XI. Paste this code as a replacement to the xmlentities() function on line 30 of the /usr/local/nagiosxi/html/includes/utilsx.inc.php

Problems with Check Commands

How To Test Check Commands From The Command-line

Okay, you'll need to go through a few steps to establish what exactly is being run. Grab some paper to note settings as you go. Start by going to the Core Config Manager (under "Configure"), under Services in the left sidebar, find the service in question, and click the crossed tools "Configure" icon. On the "Common Settings" tab, note what it says for "Command view", the values of the eight ARG variables, and anything listed under "Additional templates". Now, in the left sidebar again, click "Templates -> Service templates", and find any that were listed on the previous step. If any of the ARG variables that were blank on the first page are filled in here, write down the value on the template. Repeat this step if any of the templates in turn have templates listed on their definitions. Similarly, if the Check command and Command view were blank, fill them in from the template.

Now, starting with what you had for "Command view", replace $USER1$ with /usr/local/nagios/libexec , and replace $HOSTADDRESS$ with the IP address of the host this service is associated with.

As an example, I have a host called "Server Room", with an IP address of 192.168.5.254, and am running a simple ping check against it. For "Check command" and "Command view" they're blank, $ARG5$ = -p 5, and for templates it has "xiwizard_websensor_ping_service". The template for xiwizard_websensor_ping_service has a "Check command" of "check_xi_service_ping" and a "Command view" of '$USER1$/check_icmp -H $HOSTADDRESS$ -w $ARG1$,$ARG2$ -c $ARG3$,$ARG4$ $ARG5$', with $ARG1$ = 3000.0, $ARG2$ = 80%, $ARG3$ = 5000.0, $ARG4$ = 100%, $ARG5$ = -p 8, and a template of "xiwizard_generic_service". The "xiwizard_generic_service" template has a check command of "check_xi_service_none" and a command view of '$USER1$/check_dummy 0 "Nothing to monitor"', with blank args and no additional template. Nothing gets filled in from this template because all of the values it defines are already defined in a higher-priority setting.

Here, the first step is to look at '$USER1$/check_icmp -H $HOSTADDRESS$ -w $ARG1$,$ARG2$ -c $ARG3$,$ARG4$ $ARG5$'.
Step two fills in $ARG5$ from the service definition, and we get '$USER1$/check_icmp -H $HOSTADDRESS$ -w $ARG1$,$ARG2$ -c $ARG3$,$ARG4$ -p 5'.
Step three gets args 1-4 from the xiwizard_websensor_ping_service template, giving '$USER1$/check_icmp -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5'. The $ARG5$ is left alone because it was already set.
Step four does nothing - the last template doesn't have any new info.
Step five is to fill in the macros, so you get '/usr/local/nagios/libexec/check_icmp -H 192.168.5.254 -w 3000.0,80% -c 5000.0,100% -p 5'. That's your full check command.
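
The substitution steps above can also be sketched as a small shell helper. This is only an illustration of the walkthrough, not a Nagios tool: the function name expand_command is hypothetical, and it assumes the argument values contain no "|", "&" or backslash characters (which would confuse sed).

```shell
#!/bin/sh
# Macro values taken from the worked example above.
USER1='/usr/local/nagios/libexec'
HOSTADDRESS='192.168.5.254'

# expand_command VIEW ARG1 ARG2 ...: replace $USER1$, $HOSTADDRESS$ and
# $ARGn$ in a "Command view" string with their final values.
expand_command() {
    cmd=$1
    shift
    cmd=$(printf '%s\n' "$cmd" |
        sed -e 's|\$USER1\$|'"$USER1"'|g' -e 's|\$HOSTADDRESS\$|'"$HOSTADDRESS"'|g')
    n=1
    for arg in "$@"; do
        # $ARGn$ takes the n-th value passed after the command view.
        cmd=$(printf '%s\n' "$cmd" | sed 's|\$ARG'"$n"'\$|'"$arg"'|g')
        n=$((n + 1))
    done
    printf '%s\n' "$cmd"
}

# ARG1-4 come from the template; ARG5 comes from the service itself.
expand_command '$USER1$/check_icmp -H $HOSTADDRESS$ -w $ARG1$,$ARG2$ -c $ARG3$,$ARG4$ $ARG5$' \
    3000.0 80% 5000.0 100% '-p 5'
```

You still have to work out by hand which value wins for each ARG (service first, then templates in order); the helper only does the final text substitution.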

Now, log into your Nagios XI server as root, either on a direct terminal or through SSH. Enclose your command in single quotes like I've been doing here, put su -c before it and nagios after it, and hit enter. It should look something like this:

su -c '/usr/local/nagios/libexec/check_icmp -H 192.168.5.254 -w 3000.0,80% -c 5000.0,100% -p 5' nagios

Obviously that will be filled in with different details based on the check you're trying to run, but hopefully that demonstrates the progression of how to build the line.

Problems with $ Signs in the Check Command

(Solution posted by Dietmar Lang)

In your service definition file, you may need to pass a $ symbol as an argument to a service check. For example, MS SQL Server instances are named "MSSQL$INSTANCE1". Your service definition would look like this:

check_command check_nt!SERVICESTATE!-d SHOWALL -l MSSQL$INSTANCE

This will not work, because Nagios treats $ as the start of a macro.

For Nagios 3, add two backslashes and a second dollar symbol (\\$$), like this:

check_command check_nt!SERVICESTATE!-d SHOWALL -l MSSQL\\$$INSTANCE

Windows Memory Check Values Doubled

(contributed by Forum user GreatWolfResorts)

This is a result of how the check_nt plugin calculates memory values. The preferred solution for most users seems to be to use the check_nrpe plugin to distinguish the memory types.

Quoted from GreatWolfResorts:
I essentially created the following custom command:

Note: You will need to enable some NRPE commands in the nsc.ini file on the remote device. Specifically: allow_arguments=1

Alternatively, a full understanding of the check_nt MEMUSE command helps when reviewing the values returned. Windows refers to the sum of memory and swap files, that is, the entire available virtual memory. Windows regularly swaps program and data code out of main memory, even when it still has spare reserves. In this respect the load on the entire virtual memory in Windows is a more important parameter to observe than physical memory or swap alone.

So in the end, the values returned weren't necessarily a bug in NagiosXI or nsclient++, but rather a view of the virtual memory of the machine.

Linux Cached Memory Not Added to Free Memory

It is normal for Linux to "borrow" unused memory for disk caching. This may, however, create false "Warning" or "Critical" alerts even though you are NOT low on memory. To fix this, we have modified the "custom_check_mem" script, part of our Linux agent install script, by adding an optional flag [-n|--nocache]. When you use the "-n" flag, cached memory is added to the free memory.

If you are downloading a new copy of our Linux agent, the updated "custom_check_mem" will be included.
If you already installed the Linux agent, you can just download the updated "custom_check_mem" from here.

Copy the new script over the old "custom_check_mem".

Go to the Core Config Manager->Monitoring->Services->Memory Usage->Modify and under the "Common Settings" tab modify the $ARG2$ field by adding a "-n" flag.

For example, if you had:

-a '-w 20 -c 10'

change it to:

-a '-w 20 -c 10 -n'

Click on "Save" and "Apply Configuration".

Note: One gotcha - make sure the "custom_check_mem" script has Unix line endings (LF, not CRLF) before you copy it over.
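
One way to check for and remove Windows line endings is sketched below. The function names are hypothetical and GNU sed (for the -i option) is assumed; the dos2unix utility, where installed, does the same job.

```shell
#!/bin/sh
# has_crlf FILE: succeed if any line ends with a carriage return (Windows EOL).
has_crlf() {
    LC_ALL=C grep -q "$(printf '\r')\$" "$1"
}

# to_unix_eol FILE: strip trailing carriage returns in place (GNU sed -i assumed).
to_unix_eol() {
    sed -i "s/$(printf '\r')\$//" "$1"
}

# Example, run in the directory that holds the downloaded script:
#   has_crlf custom_check_mem && to_unix_eol custom_check_mem
```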

Other Issues

Nagios did not exit in a timely manner

Use this when Nagios doesn't appear to be exiting cleanly. If the run file, lock file, or temp check files are getting left behind, try making this modification around line 150 of /etc/init.d/nagios. (The modification increases the for loop from 10 seconds to 30 seconds.) This gives the Nagios daemon more time to cleanly shut down all of its processes and clean up after itself.

Upgrade to 2011R3.x Issues

If you're experiencing any of the following issues after an upgrade from Nagios XI 2011r2.x to 2011r3.x:

Missing hosts or services or status data

Takes a VERY long time to Apply Configuration or restart the Nagios process

Unusually high CPU load

A flood of messages in the /var/log/messages related to ndo2db

Then you may need to manually set a few kernel settings on your system. In Nagios XI 2011r3.x+ the Ndoutils subcomponent now uses asynchronous writes to log status information to the database, and these messages are sent to the Linux kernel's message queue. Our upgrade scripts will tune the kernel settings automatically as of 2011r3.2, but in the event that you see the above symptoms on your system, we recommend applying the following settings to your system.

Open /etc/sysctl.conf with a text editor. Edit the file to match the following values:

## The maximum number of messages allowed in any one message queue
kernel.msgmni = 256000

Note: If you don't have these entries in the "/etc/sysctl.conf" file, just add them to the end of the file.

After these settings are saved to the file, apply them by running:

sysctl -p

If the system still appears to be working improperly, reboot the machine.

Login Screen Keeps Redirecting To Itself

The web browser keeps redirecting to the login screen even after entering login credentials. This has been noticed in Internet Explorer.

Nagios XI uses cookies to save session state. These cookies are set to expire after 30 minutes. If the time on the Nagios XI server is incorrect, the cookies returned to the client's browser might appear to be expired due to the time difference between the client's computer and the Nagios XI server. Solution: Fix the time on the Nagios XI server to ensure it is correct.

Check Services Being Orphaned

Some users have encountered large numbers of warning messages that accumulate quickly that read as follows:

Warning: The check of service <Your Service> on host <Your Host> looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service..

This is most likely caused by multiple instances of Nagios running. To fix this, kill all instances of Nagios and then restart the process.

If the issue persists after reboots and restarts of the Nagios service, it is most likely caused by either a memory leak in embedded Perl or system ulimit restrictions. Symptoms can include the /tmp directory filling up quickly with check* files, and the following errors in the Nagios log:

[1331905537] Warning: The check of service 'SERVICE' on host 'NAMESERVER' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service...
[1331755699] Warning: The check of service 'SWAP' on host 'nameserver' could not be performed due to a fork() error: 'Resource temporarily unavailable'. The check will be rescheduled.
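
To get a rough idea of how often these warnings occur, the log can be counted with grep. This is a minimal sketch: the function name is hypothetical, and the log path in the usage comment is an assumption based on the default install prefix used elsewhere on this page.

```shell
#!/bin/sh
# count_warnings LOGFILE: report how many orphaned-check and fork() error
# warnings appear in a Nagios log file.
count_warnings() {
    orphaned=$(grep -ci 'looks like it was orphaned' "$1")
    fork_err=$(grep -ci 'fork() error' "$1")
    echo "orphaned=$orphaned fork_errors=$fork_err"
}

# Example (assumed default log location):
#   count_warnings /usr/local/nagios/var/nagios.log
```

A non-zero fork_errors count points toward the ulimit restrictions rather than duplicate Nagios instances.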

Try the following solutions:

Edit /etc/security/limits.conf and set the following limits:

* hard memlock 128 #locked memory
* soft memlock 128

* soft nofile 4096 #open files
* hard nofile 4096

* hard nproc 4096 #max user processes
* soft nproc 4096

* hard stack 20480 #stack size
* soft stack 20480

Then restart the server. Run

ulimit -a

to verify that the new settings are in place.
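
Because limits.conf is applied at login, a daemon started at boot may run with different limits than an interactive shell reports. On Linux, the limits of a specific process can be read from /proc. The sketch below assumes a Linux /proc filesystem; the function name is hypothetical.

```shell
#!/bin/sh
# show_limits [PID]: print the four limits discussed above for a process.
# With no PID, the limits of the calling process are shown.
show_limits() {
    pid=${1:-self}
    grep -E '^Max (locked memory|open files|processes|stack size)' "/proc/$pid/limits"
}

# Example: limits of the running Nagios daemon (pidof may return several PIDs,
# so only the first one is used here):
#   show_limits "$(pidof nagios | cut -d' ' -f1)"
```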

And also update the settings in your nagios.cfg file to match the following:

enable_embedded_perl=0
use_embedded_perl_implicitly=0

Postgresql: Postmaster CPU Is High or "Transaction wraparound limit" in log

Although Nagios XI performs routine database maintenance on the postgres data tables, if you notice either a high CPU usage for the postmaster process, or a repeated error message in the /var/lib/pgsql/data/pg_log file that says "transaction ID wrap limit is 2147484146", then you may need to perform a manual VACUUM of the postgres databases. Run the following commands from the command line:

psql nagiosxi nagiosxi
VACUUM;
VACUUM ANALYZE;
VACUUM FULL;
\q

You will see messages like the following when running the above commands:

WARNING: skipping "pg_authid" --- only table or database owner can vacuum it

This is normal. You may need to run the above commands more than once if the CPU usage from postmaster is extremely high.

Next, vacuum the tables as the postgres user.

psql postgres postgres
VACUUM;
VACUUM ANALYZE;
VACUUM FULL;
\q

XI Component/Addon Problems

Website Wizard Content Check Failure

Some users have reported website content checks being blocked by the "dotDefender" application. See the forum thread "Website Wizard Content Check Failure" for the solution.

Plugin/Component/Wizard Installation Problems

When plugins, components or wizards are not installed through the proper menus, problems arise in Nagios XI: all wizards may be "wiped out" so that they cannot be viewed in the Web interface, pages may come up blank in the Web browser, and other odd behavior may occur.

One common mistake is installing a component in place of the wizard and vice versa.

The proper way of doing it is: download the plugin, component or wizard you need to install, go to the "Admin" menu and then select the proper sub-menu from the left panel under the "System Extensions":

Note:
Don't unzip the installation file prior to selecting it through "Browse".
Also, don't rename the installation files. This will cause the installation to fail. The name of the file should be: "somename".zip. If you had a previous copy of the file and you download it again, your new file will be named "somename"(1).zip, which will not work.

If you already made a mistake and erroneously installed a component in place of the wizard or vice versa, here is what you should do:

Remove the problematic component/wizard by running in a terminal as root:

If you have blank pages in the web browser, this usually means there is a PHP error. Run:

# tail /var/log/httpd/error_log

right after loading that page to see what the errors are.

Sometimes, when you try to install a plugin you may receive an error message: "Plugin could not be installed - directory permissions may be incorrect". In order to check the permissions of your "libexec" directory, run in terminal:

# ls -l /usr/local/nagios

The owner of "libexec" directory should be nagios:nagios and the permissions should be set to 775 (drwxrwxr-x). If this is not what you have, run in terminal:
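
The exact commands are not reproduced on this page. A minimal sketch using the standard chown and chmod utilities, with a hypothetical helper name and GNU stat assumed, would be:

```shell
#!/bin/sh
# fix_plugin_dir DIR [OWNER]: show the current owner and mode of DIR, set the
# owner (default nagios:nagios) and mode 775, then show the result.
fix_plugin_dir() {
    dir=$1
    owner=${2:-nagios:nagios}
    echo "before: $(stat -c '%U:%G %a' "$dir")"
    chown "$owner" "$dir" && chmod 775 "$dir"
    echo "after:  $(stat -c '%U:%G %a' "$dir")"
}

# On the XI server, as root:
#   fix_plugin_dir /usr/local/nagios/libexec
```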

"Event Data Is Stale"

We've had a known bug relating to event data in versions 2009R1.4B-2011R1.1. This bug has been patched, and the fix is available in releases later than the versions listed above. If you're experiencing this error, and/or the nagios service is taking an excessively long time to start, you may have a corrupted mysql table that needs repair. We suggest taking the following steps.

Stop the following services

service nagios stop
service ndo2db stop
service mysqld stop

Run our repair script for mysql tables.

/usr/local/nagiosxi/scripts/repairmysql.sh nagios

Unzip and copy the following dbmaint file to /usr/local/nagiosxi/cron/.
This will overwrite the previous version.

If problems persist, contact our support team at our support forums.

Bandwidth Usage for Offloaded MySQL

We don't have official benchmark documentation on bandwidth usage for a Nagios server, but the following specs were recorded and submitted by a user for network traffic between a Nagios XI server and an offloaded MySQL server. Thanks Stephen Wallace for contributing this!

500 hosts, 10 services each at a 5-minute interval (5500 checks)

Breaks down to around 18 checks per second

Produces around 3MB of network traffic daily between Nagios and MySQL
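
The check-rate figure above can be reproduced with a little arithmetic, assuming each host check and each service check runs once per 5-minute (300-second) interval. The function name is hypothetical.

```shell
#!/bin/sh
# checks_per_second HOSTS SERVICES_PER_HOST INTERVAL_SECONDS
# Total checks = hosts + hosts * services per host; rate = total / interval.
checks_per_second() {
    total=$(( $1 + $1 * $2 ))
    awk -v c="$total" -v s="$3" \
        'BEGIN { printf "%d checks / %d s = %.1f checks per second\n", c, s, c / s }'
}

checks_per_second 500 10 300
```

The same helper can be used to estimate the check rate for a different number of hosts or a different interval.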

"Still have questions?"

If you haven't found an answer to your question, you can check the Nagios XI Manuals: