1 High Availability

1.1 Introduction

Pandora FMS is a very stable application, thanks to the tests and improvements included in each version and to the correction of hundreds of failures reported by users. Nevertheless, in critical environments and/or under heavy load, it may be necessary to distribute the load among several machines, ensuring that if any component of Pandora FMS fails, the system does not go down.
Pandora FMS has been designed to be highly modular: any of its modules can work independently, but they have also been designed to work together and to take over the load of components that have gone down.

The standard Pandora FMS design could be this one:

Obviously, the agents are not redundant. If an agent goes down, it makes no sense to run another one in its place: the only reasons for an agent to stop reporting are either that data cannot be obtained because the execution of some module is failing, which cannot be solved by another agent running in parallel, or that the system itself is isolated or has failed. The best solution is to add redundancy to the critical systems, regardless of whether they have Pandora FMS agents or not, and thus make the monitoring of those systems redundant as well.

It is possible to use HA in several scenarios:

Data Server balancing and HA.

Network, WMI, Plugin, Web and Prediction Servers balancing and HA.

Database load balancing.

Recon Servers balancing and HA.

Pandora FMS Console balancing and HA.

1.2 Dimensioning and HA architecture designs

The most important components of Pandora FMS are:

Database

Server

Console

Each of the components can be replicated to protect the monitoring system from any catastrophe.

To determine the number of nodes needed to balance the load, we will study the number of targets to be monitored and the quantity, type and capture frequency of the metrics to be collected.

Depending on the monitoring needs we will define the different architectures.

Note: The tests used to define the architectures were carried out using different equipment:

1.3 HA of Data Server

The easiest way is to use the HA implemented in the agents (which allows them to contact an alternative server if the main one does not reply). However, since the data server listens on port 41121, a standard TCP port, it is possible to use any commercial solution that allows balancing or clustering an ordinary TCP service.

For the Pandora FMS data server you will need to set up two machines, each with a configured Pandora FMS data server (with different hostnames and server names). You will have to configure a Tentacle server in each of them. Each machine will have a different IP address. If an external balancer is used, it will provide a single IP address to which the agents will connect to send their data.

If we are using an external balancer and one of the servers fails, the HA mechanism "promotes" one of the remaining active servers, and Pandora FMS agents keep connecting to the same address as before without noticing the change; the load balancer simply stops sending data to the failed server and routes it to an active one. There is no need to change anything in each Pandora FMS data server: each server can even keep its own name, which is useful in the server status view to see whether any of them has gone down. Pandora FMS data modules can be processed by any server without a prior assignment. Pandora FMS is designed precisely this way so that HA can be implemented more easily.

In the case of using the HA mechanism of the agents, there will be a small delay in the sending of data, since at each execution the agent will try to connect to the primary server and, if it does not answer, will try the secondary one (if it has been configured this way). This is described below as "Balancing in the Software Agents".
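The agent-side failover described above can be summarized as a small decision procedure. The sketch below is illustrative only (it is not Pandora FMS source code); the function and parameter names are assumptions made for this example, and the two modes correspond to the secondary_mode values (on_error and always) documented later in this chapter.

```python
# Illustrative sketch of the agent's data-delivery failover logic.
# send(server) is any callable that returns True when the XML data
# file was delivered successfully to that server.

def deliver_xml(send, primary, secondary, secondary_mode):
    """Return the list of servers that received the data.

    secondary_mode "on_error": use the secondary only if the primary fails.
    secondary_mode "always":   send to the secondary on every execution.
    """
    delivered = []
    if send(primary):
        delivered.append(primary)
    elif secondary_mode == "on_error" and send(secondary):
        # Primary unreachable: fall back to the secondary server.
        delivered.append(secondary)
    if secondary_mode == "always" and send(secondary):
        # In "always" mode the secondary gets a copy regardless.
        delivered.append(secondary)
    return delivered
```

Note that in "always" mode the data reach both servers on every run, which doubles the traffic but keeps both databases up to date.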

If you want to use two data servers that both manage policies, collections and remote configurations, you will need to share the following directories via NFS so that all instances of the data server can read and write to them. The consoles should also have access to these NFS-shared directories.

/var/spool/pandora/data_in/conf

/var/spool/pandora/data_in/collections

/var/spool/pandora/data_in/md5
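A minimal sketch of how these directories might be shared, assuming a dedicated NFS server and an example network range (both are assumptions for illustration; adjust to your environment):

```
# /etc/exports on the NFS server (example network range):
/var/spool/pandora/data_in/conf        192.168.1.0/24(rw,sync,no_root_squash)
/var/spool/pandora/data_in/collections 192.168.1.0/24(rw,sync,no_root_squash)
/var/spool/pandora/data_in/md5         192.168.1.0/24(rw,sync,no_root_squash)

# On each data server and console, mount each directory, for example:
#   mount -t nfs <nfs_server_ip>:/var/spool/pandora/data_in/conf /var/spool/pandora/data_in/conf
```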

1.4 Balancing in the Software Agents

From the software agents it is possible to balance between data servers, so a master Data server and a backup one can be configured.

In the agent configuration file pandora_agent.conf, uncomment and configure the following part:

The following options are available (for more information, see the Agents Configuration chapter):

secondary_mode: Mode of the secondary server. It can have two values:

on_error: Send data to the secondary server only if they could not be sent to the main server.

always: Always send data to the secondary server, regardless of whether the main server can be reached.

secondary_server_ip: IP of the secondary server.

secondary_server_path: Path where the XML files are copied on the secondary server, usually /var/spool/pandora/data_in.

secondary_server_port: Port through which the XML files will be copied to the secondary server: 41121 for Tentacle, 22 for SSH, 21 for FTP.

secondary_transfer_mode: Transfer mode used to copy the XML files to the secondary server: tentacle, ssh, ftp, etc.

secondary_server_pwd: Password option for FTP transfers.

secondary_server_ssl: Set to yes or no depending on whether you want to use SSL to transfer data through Tentacle.

secondary_server_opts: Other options required for the transfer.
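Putting the options above together, a pandora_agent.conf fragment could look like this (the IP address is a placeholder chosen for this example):

```
# Secondary server configuration in pandora_agent.conf
# (IP address below is an example value)
secondary_mode on_error
secondary_server_ip 192.168.1.20
secondary_server_path /var/spool/pandora/data_in
secondary_server_port 41121
secondary_transfer_mode tentacle
secondary_server_ssl no
```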

Remote configuration of the agent, if enabled, is only operative on the main server.

1.5 Balancing and HA of the Network Servers, WMI, Plugin, Web and Prediction

This is easier. You need to install several servers (Network, WMI, Plugin, Web or Prediction) on several machines of the network, all with the same visibility of the systems you want to monitor. All these machines should be in the same segment, so that the network latency data are coherent.

The servers can be selected as primaries. These servers will automatically collect the data from all the modules assigned to a server that is flagged as down. Pandora FMS servers implement a mechanism to detect that one of them has gone down through a verification of its last contact date (server threshold x 2). It is enough for a single Pandora FMS server to be active to detect the failure of the other ones. If all Pandora FMS servers are down, there is no way to detect the failure or to implement HA.

The obvious way to implement HA and load balancing in a two-node system is to assign 50% of the modules to each server and select both servers as masters. If there are more than two master servers and a third server goes down with modules pending execution, the first master server that executes the module will self-assign the modules of the down server. If one of the down servers recovers, the modules that had been assigned to it are automatically assigned back.
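The down-server detection and self-assignment described above can be sketched as follows. This is illustrative pseudologic, not Pandora FMS source code; the function names and data shapes are assumptions for the example. The only rule taken from the text is that a server counts as down when its last contact is older than twice its server threshold.

```python
import time

# Illustrative sketch of the master server's peer-failure detection:
# a peer is considered down when its last contact timestamp is older
# than server_threshold x 2.

def is_server_down(last_contact, server_threshold, now=None):
    now = time.time() if now is None else now
    return (now - last_contact) > 2 * server_threshold

def modules_to_self_assign(servers, modules, now):
    """servers: {name: (last_contact, server_threshold)};
    modules: {module_id: assigned_server_name}.
    Returns the module ids whose assigned server is down, which the
    first active master would self-assign."""
    down = {name for name, (lc, th) in servers.items()
            if is_server_down(lc, th, now)}
    return [m for m, s in modules.items() if s in down]
```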

The load balancing between the different servers is done in the Agent Administration section in the "setup" menu.

In the field "server" there is a combo where you can choose the server that will do the checking.

1.5.1 Server configuration

A Pandora FMS Server can be running in two different modes:

Master mode.

Non-master mode.

If a server goes down, its modules will be executed by the master server so that no data is lost.

At any given time there can only be one master server, which is chosen from all the servers with the master configuration token in /etc/pandora/pandora_server.conf set to a value greater than 0:

master [1..7]

If the current master server goes down, a new master server is chosen. If there is more than one candidate, the one with the highest master value is chosen.

Be careful about disabling servers. If a server with Network modules goes down and the Network Server is disabled in the master server, those modules will not be executed.

For example, if you have three Pandora FMS Servers with master set to 1, a master server will be randomly chosen and the other two will run in non-master mode. If the master server goes down, a new master will be randomly chosen.
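To avoid a random election, distinct master values can be used as priorities. A sketch of the relevant line in each server's /etc/pandora/pandora_server.conf (the values 7, 5 and 0 are example choices):

```
# Server A - preferred master (highest value wins the election)
master 7

# Server B - fallback master
master 5

# Server C - never eligible as master
master 0
```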

1.6 HA of Pandora FMS Console

Just install another console. Any of them can be used simultaneously from different locations by different users. Using a web balancer in front of the consoles makes it possible to access them without really knowing which one is being accessed, since the session system is managed by cookies stored in the browser.
In the case of using remote configuration, both data servers and consoles must share (via NFS) the incoming data directory (/var/spool/pandora/data_in) for remote configuration of agents, collections and other directories.

1.7 Database HA

This solution is provided to offer a fully-featured HA solution for Pandora FMS environments. It is the only officially supported HA model for Pandora FMS. This solution is provided preinstalled since OUM 724, and replaces DRBD and the other HA systems we recommended in the past.

This is the first Pandora DB HA implementation, and the installation process is almost entirely manual, using the Linux console as root. In future versions we will ease setup and configuration from the GUI.

Pandora FMS relies on a MySQL database for configuration and data storage. A database failure can temporarily bring your monitoring solution to a halt. The Pandora FMS high-availability database cluster allows you to easily deploy a fault-tolerant, robust architecture.

Cluster resources are managed by Pacemaker, an advanced, scalable High-Availability cluster resource manager. Corosync provides a closed process group communication model for creating replicated state machines. Percona was chosen as the default RDBMS for its scalability, availability, security and backup features.

Active/passive replication takes place from a single master node (writable) to any number of slaves (read only). A virtual IP address always points to the current master. If the master node fails, one of the slaves is promoted to master and the virtual IP address is updated accordingly.
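As a sketch of how the virtual IP is typically defined under Pacemaker, a resource of the standard IPaddr2 agent can be created with pcs (the resource name, address and netmask below are placeholders for this example, not values taken from the Pandora FMS installer):

```
# Example Pacemaker virtual IP resource (address/netmask are placeholders):
pcs resource create pandora_vip ocf:heartbeat:IPaddr2 \
    ip=192.168.1.100 cidr_netmask=24 op monitor interval=20s
```

Pacemaker will then move this address to whichever node is promoted to master when a failover occurs.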

The Pandora FMS Database HA Tool, pandora_ha, monitors the cluster and makes sure the Pandora FMS Server is always running, restarting it when needed. pandora_ha itself is monitored by systemd.

This is an advanced feature that requires knowledge of Linux systems.

1.7.1 Installation

We will configure a two node cluster, with hosts node1 and node2. Change hostnames, passwords, etc. as needed to match your environment.

Commands that should be run on one node will be preceded by that node's hostname. For example:

node1# <command>

Commands that should be run on all nodes will be preceded by the word all. For example:

all# <command>

There is an additional host, which will be referred to as pandorafms, where Pandora FMS is or will be installed.

1.7.1.1 Prerequisites

CentOS version 7 must be installed on all hosts, and they must be able to resolve each other's hostnames.

Next, click on Create slave and add an entry for the second node. You should see something similar to this:

Seconds behind master should be close to 0. If it keeps increasing, replication is not working.
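Replication health can also be checked from the command line on the slave node. A minimal sketch using the standard MySQL replication status command:

```
# On the slave node, check replication status:
mysql -e "SHOW SLAVE STATUS\G" | \
    grep -E 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master'
```

Both Slave_IO_Running and Slave_SQL_Running should report Yes, and Seconds_Behind_Master should stay close to 0.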

1.7.2 Adding a new node to the cluster

Install Percona (see Installing Percona). Back up the database of the master node (node1 in this example) and write down the master log file name and position (in the example, mysql-bin.000001 and 785):