[Cosmos] Monitoring and Alerting for Validators

Monitoring and alerting are of utmost importance in any system where availability of the service offered is paramount. In the case of Cosmos, validators must be online (i.e. connected to other nodes at all times) and sign off on each proposed block, otherwise they miss out on rewards and may be penalised for their lack of diligence in securing the network.

This article presents the reasons why one should monitor their nodes, tips on what to monitor, and the tools that are available. We will not be providing a full tutorial, but we will give a brief overview of how each option can be put to use. The monitoring and alerting options that will be discussed are:

Prometheus with Grafana dashboards

Hubble Alerts by Figment Networks

Botelicious by Cypher Core

Endpoints available for a custom solution (RPC, REST, Prometheus)

P.A.N.I.C. by Simply VC, our own tool, which we will be open-sourcing

We developed the P.A.N.I.C. alerter with the purpose of having a tailor-made solution for receiving alerts about our nodes. The tool became such a useful part of our setup that we decided to share it with the Cosmos community. P.A.N.I.C. features multiple notification channels whose use depends on alert severity, including phone call, Telegram, and email alerts. Telegram is also used to query the alerter and for extra control, such as snoozing calls. Redis is used in the background as a backup state store.

You may be asking why anyone would go through all this hassle to monitor their nodes. Continue reading to find out!

Why Bother?

As a validator operator, one has the responsibility of hosting a highly available node. This is not just so that their validator ticks all the boxes in the “last 100 blocks” section on every Cosmos explorer. Prolonged validator downtime in Cosmos is penalised by the burning of a percentage (currently 0.01%) of a validator’s stake (slashing), the removal of the ability to vote on future blocks for a period of time (jailing), and missing out on rewards. This is on top of the tainted track record in the eyes of your delegators and the network at large.

Downtime is not always the result of an operator’s negligence, but the least that a validator operator can do is to have an adequate setup for getting notified if the validator’s health is at stake, be it through manual check-ups using monitoring dashboards or by setting up automated alerts.

One can also set up alerts for less critical situations, or even for positive events, such as a new delegation. Apart from indicating that the validator is running properly, low-severity and positive alerts show that the monitoring and alerting software itself is still alive and crunching numbers.

Alerting does not stop at validator activity. It may be useful to set an alert for when new versions of the Cosmos SDK or Gaia are released, given that a node should be kept updated. At a lower level, we can also set alerts on the status of the underlying system, such as for when file system space is running out or for abnormal network, processor, and memory usage. We can also choose to monitor the state of the network as a whole, such as for Byzantine validators.

Monitoring

The first step to tracking a validator, or anything else for that matter, is monitoring. At the heart of any monitoring tool is a data collector that periodically gathers and stores relevant information from the system being monitored, in our case a Cosmos node. This data can then be organised and presented to the operator, typically in the form of a dashboard that is manually observed.

Despite the importance of monitoring, having to keep an eye on potentially multiple dashboards is a very time-consuming way of making sure that a validator is healthy. Additionally, monitoring by itself is passive in that it does not react to abnormal scenarios and leaves it up to the validator operator to observe and decide what constitutes an undesirable state.

A Step Further: Alerting

A more important tool in a validator operator’s arsenal is some form of automated alerting setup. Alerts use the data gathered and organised by the monitoring software, along with a set of criteria, to actively notify the operator whenever these criteria are violated. The alerting functionality may come included as part of the monitoring software itself, or it can be built around an existing monitoring setup that exposes the monitored metrics.

The main advantage of alerting over manual monitoring is that the only work that the validator operator has to do, save for reacting to the alerts, is to come up with a set of boundaries that the validator must follow. The alerter then automatically sends an alert when a rule is violated. The operator can then use this as a wake-up call to check the monitoring dashboards in an effort to troubleshoot or narrow down the root cause of the alert.

However, defining the alert criteria is no easy task. The challenge lies in minimising false alarms while still generating useful alerts: the aim is not to increase the quantity of alerts, but rather their quality, which can be measured by how helpful an alert is to the operator.

Current Monitoring and Alerting Options

You may be tempted to keep three explorer tabs open and refresh every now and then to make sure that your validator is not missing blocks. Don’t! Below are some better options to consider, each with its own pros and cons.

Prometheus and Grafana

The staple monitoring setup is Prometheus with a Grafana dashboard. Prometheus periodically scrapes metric values from configured endpoints into a time-series database. Grafana presents those values in various types of panels that make up a dashboard.

Cosmos nodes include a Prometheus port (26660), exposed by Tendermint, that provides metrics about the node itself, such as the number of peers currently connected to it, and about the network that it is running on, such as the number of missing validators and total online voting power. An example alert that we can set is for when the number of peers reaches a critical low.
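As a concrete illustration, the low-peer-count alert could be written as a Prometheus alerting rule along these lines (the metric name, threshold, and duration below are assumptions to be checked against your own node’s /metrics output):

```yaml
# Hypothetical Prometheus alerting rule for a critically low peer count.
# tendermint_p2p_peers is Tendermint's peer-count metric; confirm the exact
# name and labels against your node's /metrics output before relying on it.
groups:
  - name: cosmos-validator
    rules:
      - alert: LowPeerCount
        expr: tendermint_p2p_peers < 3
        for: 5m          # only fire if the condition persists for 5 minutes
        labels:
          severity: critical
        annotations:
          summary: "Node has had fewer than 3 peers for 5 minutes"
```

The `for` clause is what keeps a brief peer-count dip from triggering a false alarm.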

Another very useful Prometheus endpoint to add is that provided by Prometheus’ own node_exporter, which makes available hardware and system metrics, such as system resource usage percentages. A suggested Grafana dashboard for this exporter is Node Exporter Full. Example alerts in this case are prolonged high CPU/memory usage, or an almost full storage.

As with any tool, Grafana has its downsides, the most annoying of which is that it only allows setting alerts on graph panels, and not on simple status values (called singlestats), such as the number of peers. A workaround is to convert singlestat panels to graphs, at the expense of complicating the dashboard.

This is not a big issue if Grafana is used mainly as an alerting tool. Grafana is able to send out alerts using various notification channels, including email, chat services such as Telegram and Discord, and webhooks.

Hubble by Figment Networks

The Hubble explorer by Figment Networks has a trick up its sleeve: Hubble Alerts. With very little effort, one can subscribe to events related to any validator. Alerts are sent via email to the address used to register.

Validator events include changes in voting power, missing a number of precommits, going offline, and joining/leaving the active validator set. Alerts can be turned on or off on an event-by-event basis and there is also the option of receiving a daily summary of all alerts.

Hubble Alerts are limited in the rate of emails that can be sent out and in the variety of alerts, but they are a handy set of alerts to sign up to nevertheless.

Botelicious by Cypher Core

Another useful tool is Botelicious, by Jay | Cypher Core. This Discord and Telegram chatbot features a simple alerting system that provides the ability to subscribe to a validator and get alerted when it is missing blocks.

The main function of Botelicious is actually to enable the operator to make node-specific queries to get information such as the last block height that the node is synced at, a list of connected peers, voting power, and so on.

Endpoints for a Custom Solution

Nothing beats creating your own tailor-made alerting solution focused on the monitoring and alerts that are most relevant to you. If this is what you seek, there are three main data sources that you should consider:

RPC interface: somewhat limited compared to the other data sources, but it is enabled by default and has all the necessary details about a node and the network in general. Whether you want to check the node’s details and current state (/status), connected peers (/net_info), or query a block at a particular height to check whether the validator’s signature is missing (/block?height=[height]), these endpoints are definitely worth collecting data from. This interface is exposed on port 26657 by default.

RPC endpoints as seen from an internet browser
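As a sketch of how these endpoints can feed a custom monitor, the following assumes a node exposing RPC on the default localhost:26657; the JSON field names (last_commit, precommits, validator_address) reflect Tendermint at the time of writing and may differ between versions:

```python
import json
import urllib.request

RPC_URL = "http://localhost:26657"  # default Tendermint RPC port


def rpc(endpoint: str) -> dict:
    """Query the node's RPC interface and return the decoded JSON result."""
    with urllib.request.urlopen(RPC_URL + endpoint, timeout=5) as resp:
        return json.load(resp)["result"]


def signed_block(block_result: dict, validator_address: str) -> bool:
    """Check whether a validator's precommit is present in a block's commit.

    The exact field names depend on the Tendermint version in use;
    adjust them to match your node's actual responses.
    """
    precommits = block_result["block"]["last_commit"]["precommits"]
    # absent precommits appear as null entries in the list
    return any(p and p.get("validator_address") == validator_address
               for p in precommits)


# Usage (requires a running node):
#   catching_up = rpc("/status")["sync_info"]["catching_up"]
#   missed = not signed_block(rpc("/block?height=100"), "ABCD...")
```

Polling these two checks every few seconds is already enough to detect an unreachable node, a node that has fallen behind, and missed precommits.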

REST interface: a richer data source than RPC, with a lot more options. However, it requires the operator to start up a rest-server manually from gaiacli, given that this is disabled by default, or to rely on a public node but miss out on node-specific details. The official documentation on this API presents all of the endpoints with the option to try them out. The default port is 1317, once the rest-server is enabled.
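A minimal sketch of polling the rest-server, assuming the default localhost:1317 and the /blocks/latest endpoint with the response layout the gaiacli rest-server used at the time of writing (verify both against the official API docs for your SDK version):

```python
import json
import urllib.request

REST_URL = "http://localhost:1317"  # default rest-server port


def parse_height(block: dict) -> int:
    """Extract the block height (a string in the JSON) as an integer.

    The block_meta/header/height layout is an assumption based on the
    rest-server responses at the time of writing.
    """
    return int(block["block_meta"]["header"]["height"])


def latest_height() -> int:
    """Return the latest block height reported by the rest-server."""
    with urllib.request.urlopen(REST_URL + "/blocks/latest", timeout=5) as r:
        return parse_height(json.load(r))


# A node whose reported height does not advance between successive polls
# is stuck or has fallen out of sync, which is worth an alert in itself.
```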

Prometheus: we mention this once more, but this time round without the need to install Prometheus itself or Grafana. This data source is a bit limited and less beginner-friendly but more straight-to-the-point. The official documentation presents all of the available metrics. The default port is 26660 with the metrics available on the /metrics endpoint, once Prometheus is enabled from the config.toml file of the node to monitor.
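Since the metrics are served as plain text, a custom solution can read them with nothing more than an HTTP request; a minimal sketch, assuming the default port and the tendermint_p2p_peers metric name (check both against your own node):

```python
import urllib.request

METRICS_URL = "http://localhost:26660/metrics"  # default Tendermint metrics port


def parse_metric(metrics_text: str, name: str) -> float:
    """Pick one gauge/counter value out of Prometheus' text exposition format.

    Metric lines look like `tendermint_p2p_peers 7` (optionally with labels
    in braces); comment lines start with '#'.
    """
    for line in metrics_text.splitlines():
        if line.startswith(name):
            return float(line.split()[-1])
    raise KeyError(name)


# Usage (requires Prometheus metrics to be enabled on the node):
#   with urllib.request.urlopen(METRICS_URL, timeout=5) as resp:
#       peers = parse_metric(resp.read().decode(), "tendermint_p2p_peers")
```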

P.A.N.I.C. Alerter by Simply VC

At Simply VC, we followed our own advice and created our own custom alerting solution called P.A.N.I.C. (Python Alerter for Nodes In Cosmos), with the aim of defining exactly what alerts we want to receive, packaging them into one tool, and using whatever alert channel we think is most effective.

Greeting from the Telegram bot

The main feature of this tool is the ability to send alerts via a Telegram bot, email, and phone calls using Twilio. The different notification channels are used depending on the severity of the alert (we don’t get a phone call for every small issue). Alerts also vary in severity and channel used based on whether the node being monitored is a validator or just a full node, such as a sentry node. Behind the scenes, Redis is used to keep a backup of the current state, so that the alerter does not lose its progress if it restarts or is restarted.

The alerter makes heavy use of the RPC data source mentioned above. It collects data points every few seconds and sends alerts based on:

Current state, e.g. node not accessible, node is not keeping up

Change in state, e.g. change in voting power, change in number of peers

Consecutive events, e.g. N precommits missed in a row

Timed events, e.g. N precommits missed in the last M minutes
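The last two rule types boil down to a small amount of state kept between polls. A minimal, hypothetical sketch (the class, names, and thresholds are illustrative, not P.A.N.I.C.’s actual implementation):

```python
import time
from collections import deque


class MissedPrecommitRules:
    """Track missed precommits and flag two time-based rules:
    N misses in a row, and N misses within the last M minutes.
    """

    def __init__(self, max_consecutive: int = 5,
                 max_in_window: int = 10, window_minutes: float = 10.0):
        self.max_consecutive = max_consecutive
        self.max_in_window = max_in_window
        self.window_secs = window_minutes * 60
        self.consecutive = 0
        self.timestamps = deque()  # times of recent misses

    def record(self, missed: bool, now: float = None) -> list:
        """Record one poll result; return any alert messages triggered."""
        now = time.time() if now is None else now
        if not missed:
            self.consecutive = 0  # a signed block resets the streak
            return []
        self.consecutive += 1
        self.timestamps.append(now)
        # drop misses that have fallen out of the time window
        while self.timestamps and now - self.timestamps[0] > self.window_secs:
            self.timestamps.popleft()
        alerts = []
        # '==' rather than '>=' so each alert fires once at the threshold
        if self.consecutive == self.max_consecutive:
            alerts.append(f"{self.consecutive} precommits missed in a row")
        if len(self.timestamps) == self.max_in_window:
            alerts.append(f"{self.max_in_window} precommits missed in the "
                          f"last {self.window_secs / 60:.0f} minutes")
        return alerts
```

Feeding `record()` the result of the per-block signature check shown earlier turns raw data points into the consecutive and timed alerts described above.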

Telegram alert examples

Telegram bots in P.A.N.I.C. serve two purposes. As mentioned above, they are used to send alerts. However, they can also accept commands that allow you to check the status of the alerter, snooze or unsnooze calls, and conveniently get Cosmos explorer links to validator lists, blocks, and transactions.

Telegram commands

We believe that P.A.N.I.C. is useful for anyone running a Cosmos node and will thus be open-sourcing it for the Cosmos community in the coming weeks!

Follow our Twitter page for news about the release of P.A.N.I.C. Also, feel free to reach out to Simply VC using the contact methods listed in the About Us section on our website, or directly to myself (Telegram: @MiguelDin).