An event-based automation & remediation platform should heal itself, right? Sure - but it’s still important to monitor
your system to validate that everything is working as expected. Monitoring is not just about faults either. It can
help you understand how much the system is being used, and when you need to scale it out.

These guidelines should help you understand what services, metrics and logs to monitor. They can be implemented
using any combination of common monitoring tools.

Note

These monitoring guidelines are just that: guidelines. You will need to modify them to suit your specific
environment. They are still a work in progress; we welcome feedback on ways to improve them, as well as
suggestions for specific monitoring system integrations.

BWC does not have one single API endpoint for checking system health. You can make a reasonable assumption about
current system status by using the API to execute a simple action, and then checking the response:
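For example, a health probe can run a trivial built-in action and check whether it succeeds. This is a sketch, assuming the st2 CLI is installed and authenticated against your BWC instance; it is guarded so it degrades gracefully on hosts without StackStorm:

```shell
# Health probe sketch: execute a trivial core.local action and record the result.
status="skipped"   # default when the st2 CLI is not installed on this host
if command -v st2 >/dev/null 2>&1; then
    if st2 run core.local cmd=true >/dev/null 2>&1; then
        status="healthy"
    else
        status="unhealthy"
    fi
fi
echo "BWC health check: ${status}"
```

The st2 CLI is a wrapper over the BWC REST API, so the same probe can be scripted directly against the API if you prefer not to install the CLI on your monitoring host.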

In a distributed deployment, only some of these processes will be running on each system. For example, a node where ChatOps is not configured will not run the st2chatops process.

Tools such as Nagios or check_mk can be used to monitor the process list. Note that some services spawn more than one process, and the exact number depends on your system configuration - e.g. st2actionrunner spawns additional processes on a multi-core system.
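As a rough illustration, the sketch below counts processes for a few standard service names. Adjust the list to the services expected on each node; every count is 0 on a host without StackStorm:

```shell
# Count processes per st2 service; a count of 0 for a service that is
# expected on this node indicates a fault worth alerting on.
for svc in st2api st2actionrunner st2rulesengine st2sensorcontainer; do
    count=$(pgrep -c -f "$svc" 2>/dev/null || true)
    echo "${svc}: ${count:-0} process(es)"
done
```

On a single node, `sudo st2ctl status` also prints a summary of service status, which is useful for spot checks.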

Key metrics for BWC administrators to watch are the number of running and scheduled actions, and the average execution time.
Busy systems will need to scale out the number of st2actionrunner processes.
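One way to collect these counts is via the st2 CLI, filtering the execution list by state. This is a sketch - the counting heuristic (matching `"id"` lines in the JSON output) is an approximation, and the snippet is guarded for hosts without the CLI:

```shell
# Sketch: count executions in the "running" and "scheduled" states.
report=""
if command -v st2 >/dev/null 2>&1; then
    for state in running scheduled; do
        n=$(st2 execution list --status "$state" -j 2>/dev/null | grep -c '"id"' || true)
        report="${report}${state}=${n:-0} "
    done
else
    report="st2-cli-missing"   # host without StackStorm installed
fi
echo "execution counts: ${report}"
```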

We recommend storing metrics in a time-series database, such as InfluxDB.
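For instance, a collector could emit each sample in InfluxDB 1.x line protocol. The measurement and tag names below are made up for illustration, and the value is hardcoded where a real collector would query the st2 API:

```shell
# Format one gauge sample in InfluxDB 1.x line protocol.
running=3
point="st2_executions,status=running value=${running}"
echo "$point"
# Writing it to a local InfluxDB (assumes a database named 'st2' exists):
#   curl -XPOST 'http://localhost:8086/write?db=st2' --data-binary "$point"
```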

The BWC audit logs record all executed actions, execution time and result. These logs should be stored in a system
like Splunk or Elasticsearch that allows for extraction of average run time and execution count.
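As an illustration, an Elasticsearch aggregation can compute the average run time per action. The index name (`st2-audit`) and field names (`action.keyword`, `elapsed_seconds`) are assumptions - map them to however your pipeline ingests the audit logs:

```shell
# Hypothetical Elasticsearch query body: average run time per action.
query='{
  "size": 0,
  "aggs": {
    "per_action": {
      "terms": { "field": "action.keyword" },
      "aggs": { "avg_runtime": { "avg": { "field": "elapsed_seconds" } } }
    }
  }
}'
printf '%s\n' "$query"
# To execute (assumes a local Elasticsearch with an st2-audit index):
#   curl -s 'http://localhost:9200/st2-audit/_search' \
#     -H 'Content-Type: application/json' -d "$query"
```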

By default, all BWC logs are stored in the /var/log/st2/ directory. See the Configure Logging
section for more information about logfile location, configuration and using syslog.

Note

We strongly recommend storing all BWC logs in a dedicated log management tool, such as Splunk,
Graylog or the ELK stack. You can also see some examples of Logstash
configuration and Kibana dashboards here: exchange-misc/logstash.

All log messages include a log level: DEBUG, INFO, WARNING, ERROR or CRITICAL. Messages at WARNING and above should be
escalated for investigation.
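A crude stand-in for a real log pipeline alert is to scan the default log directory for messages at WARNING and above. The directory may not exist on a host without BWC, so this sketch is written to tolerate that:

```shell
# Count WARNING/ERROR/CRITICAL messages in the BWC logs.
# /dev/null is included so grep always has at least one readable input.
logdir=/var/log/st2
matches=$(grep -hE 'WARNING|ERROR|CRITICAL' "${logdir}"/*.log /dev/null 2>/dev/null | wc -l | tr -d ' ')
echo "messages needing escalation: ${matches}"
```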

Most organisations will want to investigate failed action executions. This is an example of a failed execution in the
st2actionrunner logs: