Event Server Health Tests

Event Server Cloudera Manager Descriptor Age

This is an Event Server health test that checks if the Cloudera Manager descriptor has been refreshed recently. The Cloudera Manager descriptor is used to pass configuration information
from the Cloudera Manager Server to the Event Server. If the descriptor becomes stale, Event Server operation may be impacted because the Event Server will not receive information about new hosts, roles,
and services, or changes to existing hosts, roles, and services. A stale descriptor usually indicates problems communicating with the Cloudera Manager Server but can also indicate performance
problems or a bug. Consult the Event Server log and the Cloudera Manager Server log for more information. This test can be configured using the Cloudera Manager Descriptor Age
Thresholds Cloudera Manager Event Server setting.

Short Name: Cloudera Manager Descriptor Age

Property Name: Cloudera Manager Descriptor Age Thresholds
Description: The health test thresholds for monitoring the time since the Cloudera Manager descriptor was last refreshed.
Template Name: scm_descriptor_age_thresholds
Default Value: critical:120000.0, warning:60000.0
Unit: no unit
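The following is a minimal sketch, not Cloudera Manager's implementation, of how a threshold string in the format shown above might be parsed and applied to a descriptor age. The function names, the "Concerning" label for the warning level, the treatment of `never` and `any`, and the use of `>=` for comparison are all assumptions made for illustration. The default values look like milliseconds, but the table lists no unit, so that reading is an assumption too.

```python
def parse_thresholds(spec):
    """Parse 'critical:120000.0, warning:60000.0' into a dict.

    Numeric levels become floats; the keywords 'never' and 'any' stay strings.
    """
    levels = {}
    for part in spec.split(","):
        name, _, value = part.strip().partition(":")
        value = value.strip()
        levels[name.strip()] = value if value in ("never", "any") else float(value)
    return levels


def exceeds(value, level):
    """Assumed semantics: 'never' never fires, 'any' fires on any nonzero value."""
    if level == "never":
        return False
    if level == "any":
        return value > 0
    return value >= level


def classify(value, spec):
    """Return 'Bad', 'Concerning', or 'Good' for a metric where larger is worse."""
    levels = parse_thresholds(spec)
    if exceeds(value, levels.get("critical", "never")):
        return "Bad"
    if exceeds(value, levels.get("warning", "never")):
        return "Concerning"
    return "Good"


# Default value of scm_descriptor_age_thresholds, copied from the table above.
DEFAULT = "critical:120000.0, warning:60000.0"

print(classify(45_000, DEFAULT))   # below both thresholds
print(classify(90_000, DEFAULT))   # between warning and critical
print(classify(150_000, DEFAULT))  # above critical
```

The same parsing applies to the other threshold strings in this reference, although tests that monitor free space treat smaller values as worse and would need the comparison reversed.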

Event Server Event Server Index Directory Free Space

This is an Event Server health test that checks that the filesystem containing the Event Server Index Directory of this Event Server has sufficient free space. See the Event Server Index
Directory description on the Event Server configuration page for more information on this directory type. This test can be configured using the Event Server Index Directory Free
Space Monitoring Absolute Thresholds and Event Server Index Directory Free Space Monitoring Percentage Thresholds Event Server monitoring settings.

Property Name: Event Server Index Directory Free Space Monitoring Percentage Thresholds
Description: The health test thresholds for monitoring of free space on the filesystem that contains this role's Event Server Index Directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if an Event Server Index Directory Free Space Monitoring Absolute Thresholds setting is configured.
Template Name: eventserver_index_directory_free_space_percentage_thresholds
Default Value: critical:never, warning:never
Unit: PERCENT

Event Server Event Store Size

This is an Event Server health test that checks that the event store size has not grown too far above the configured event store capacity. A failure of this health test indicates that
the Event Server is having a problem performing cleanup. This may indicate a configuration problem or bug in the Event Server. This test can be configured using the Event Store
Capacity Monitoring Thresholds Event Server monitoring setting.

Short Name: Event Store Size

Property Name: Event Store Capacity Monitoring Thresholds
Description: The health test thresholds on the number of events in the event store. Specified as a percentage of the maximum number of events in the Event Server store.
Template Name: eventserver_capacity_thresholds
Default Value: critical:130.0, warning:115.0
Unit: PERCENT

Event Server File Descriptors

This Event Server health test checks that the number of file descriptors used does not rise above some percentage of the Event Server file descriptor limit. A failure of this health test
may indicate a bug in either Hadoop or Cloudera Manager. Contact Cloudera support. This test can be configured using the File Descriptor Monitoring Thresholds Event
Server monitoring setting.

Short Name: File Descriptors

Property Name: File Descriptor Monitoring Thresholds
Description: The health test thresholds of the number of file descriptors used. Specified as a percentage of the file descriptor limit.
Template Name: eventserver_fd_thresholds
Default Value: critical:70.0, warning:50.0
Unit: PERCENT
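As a rough illustration of the percentage this test is based on, the sketch below measures the current Python process rather than the Event Server. It assumes a Linux host (it reads `/proc/self/fd` and uses the Unix-only `resource` module), and the helper names and the `>=` comparison are hypothetical. Only the 50.0 and 70.0 defaults come from the table above.

```python
import os
import resource


def fd_usage_percent(open_fds, fd_limit):
    """Percentage of the file descriptor limit currently in use."""
    return 100.0 * open_fds / fd_limit


def current_fd_usage_percent():
    """Measure this process on Linux; return None if it cannot be measured."""
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_limit == resource.RLIM_INFINITY or soft_limit <= 0:
        return None  # no finite limit to compare against
    try:
        open_fds = len(os.listdir("/proc/self/fd"))
    except OSError:
        return None  # /proc is not available on this platform
    return fd_usage_percent(open_fds, soft_limit)


def fd_health(percent, warning=50.0, critical=70.0):
    """Apply the default eventserver_fd_thresholds values (larger is worse)."""
    if percent >= critical:
        return "Bad"
    if percent >= warning:
        return "Concerning"
    return "Good"


measured = current_fd_usage_percent()
if measured is not None:
    print(f"{measured:.2f}% of the limit in use: {fd_health(measured)}")
```

With the defaults, a process holding 512 of 1024 allowed descriptors sits exactly at the warning threshold in this sketch.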

Event Server GC Duration

This Event Server health test checks that the Event Server is not spending too much time performing Java garbage collection. It checks that no more than some percentage of recent time is
spent performing Java garbage collection. A failure of this health test may indicate a capacity planning problem or misconfiguration of the Event Server. This test can be configured using the
Garbage Collection Duration Thresholds and Garbage Collection Duration Monitoring Period Event Server monitoring settings.

Short Name: GC Duration

Property Name: Garbage Collection Duration Monitoring Period
Description: The period to review when computing the moving average of garbage collection time.
Template Name: eventserver_gc_duration_window
Default Value: 5
Unit: MINUTES

Property Name: Garbage Collection Duration Thresholds
Description: The health test thresholds for the weighted average time spent in Java garbage collection. Specified as a percentage of elapsed wall clock time.
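To make "percentage of elapsed wall clock time" concrete, here is a small sketch that computes the share of a monitoring window spent in garbage collection from cumulative GC time samples. The sample format and the function name are hypothetical, and a plain ratio over the window stands in for the weighted moving average that the settings describe, whose exact weighting is not documented here.

```python
def gc_time_percent(samples, window_seconds=300):
    """Percentage of wall clock time spent in GC over the trailing window.

    samples: list of (timestamp_seconds, cumulative_gc_milliseconds), oldest first.
    window_seconds: 300 matches the 5 minute default monitoring period.
    """
    if len(samples) < 2:
        return 0.0
    end_time, end_gc = samples[-1]
    # Keep only the samples that fall inside the trailing window.
    in_window = [s for s in samples if end_time - s[0] <= window_seconds]
    start_time, start_gc = in_window[0]
    elapsed_ms = (end_time - start_time) * 1000.0
    if elapsed_ms <= 0:
        return 0.0
    return 100.0 * (end_gc - start_gc) / elapsed_ms


# Invented samples taken once a minute: (seconds, cumulative GC milliseconds).
SAMPLES = [(0, 0), (60, 1000), (120, 2500), (180, 4000),
           (240, 9000), (300, 15000), (360, 19000)]

# 18000 ms of GC over the last 300 s of wall clock time is 6%.
print(gc_time_percent(SAMPLES))
```

Shortening the window makes the figure more sensitive to a recent burst of collections; with `window_seconds=120` the same samples give roughly 8.3%.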

Property Name: Heap Dump Directory Free Space Monitoring Absolute Thresholds
Description: The health test thresholds for monitoring of free space on the filesystem that contains this role's heap dump directory.
Template Name: heap_dump_directory_free_space_absolute_thresholds
Default Value: critical:5.36870912E9, warning:1.073741824E10
Unit: BYTES

Property Name: Heap Dump Directory Free Space Monitoring Percentage Thresholds
Description: The health test thresholds for monitoring of free space on the filesystem that contains this role's heap dump directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Heap Dump Directory Free Space Monitoring Absolute Thresholds setting is configured.
Template Name: heap_dump_directory_free_space_percentage_thresholds
Default Value: critical:never, warning:never
Unit: PERCENT

Event Server Host Health

This Event Server health test factors in the health of the host upon which the Event Server is running. A failure of this test means that the host running the Event Server is
experiencing some problem. See that host's status page for more details. This test can be enabled or disabled using the Event Server Host Health Test Event Server
monitoring setting.

Property Name: Log Directory Free Space Monitoring Absolute Thresholds
Description: The health test thresholds for monitoring of free space on the filesystem that contains this role's log directory.
Template Name: log_directory_free_space_absolute_thresholds
Default Value: critical:5.36870912E9, warning:1.073741824E10
Unit: BYTES

Property Name: Log Directory Free Space Monitoring Percentage Thresholds
Description: The health test thresholds for monitoring of free space on the filesystem that contains this role's log directory. Specified as a percentage of the capacity on that filesystem. This setting is not used if a Log Directory Free Space Monitoring Absolute Thresholds setting is configured.
Template Name: log_directory_free_space_percentage_thresholds
Default Value: critical:never, warning:never
Unit: PERCENT
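The absolute defaults of 5.36870912E9 and 1.073741824E10 bytes are 5 GiB and 10 GiB. The sketch below shows one way the two kinds of free space threshold could interact: the percentage thresholds are consulted only when no absolute thresholds are configured, as the descriptions state. The function name, the `(warning, critical)` tuple layout, and the strict `<` comparison are assumptions for illustration, not the product's behavior.

```python
import shutil

GIB = 2 ** 30

# Defaults from the absolute thresholds setting: warning 10 GiB, critical 5 GiB.
DEFAULT_ABSOLUTE = (10 * GIB, 5 * GIB)


def free_space_health(free_bytes, total_bytes, absolute=None, percentage=None):
    """Classify free space, where less free space is worse.

    absolute and percentage are (warning, critical) tuples, or None when a
    setting is not configured. Absolute thresholds take precedence.
    """
    if absolute is not None:
        warning, critical = absolute
        value = free_bytes
    elif percentage is not None:
        warning, critical = percentage
        value = 100.0 * free_bytes / total_bytes
    else:
        return "Good"  # nothing configured, so nothing can fire
    if value < critical:
        return "Bad"
    if value < warning:
        return "Concerning"
    return "Good"


# Check the filesystem that holds the current working directory.
usage = shutil.disk_usage(".")
print(free_space_health(usage.free, usage.total, absolute=DEFAULT_ABSOLUTE))
```

For example, a filesystem with 8 GiB free falls between the two absolute defaults and would be reported as "Concerning" by this sketch, regardless of any percentage thresholds.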

Event Server Process Status

This Event Server health test checks that the Cloudera Manager Agent on the Event Server host is heart beating correctly and that the process associated with the Event Server role is in
the state expected by Cloudera Manager. A failure of this health test may indicate a problem with the Event Server process, a lack of connectivity to the Cloudera Manager Agent on the Event Server
host, or a problem with the Cloudera Manager Agent. This test can fail either because the Event Server has crashed or because the Event Server will not start or stop in a timely fashion. Check the
Event Server logs for more details. If the test fails because of problems communicating with the Cloudera Manager Agent on the Event Server host, check the status of the Cloudera Manager Agent by
running /etc/init.d/cloudera-scm-agent status on the Event Server host, or look in the Cloudera Manager Agent logs on the Event Server host for more details. This test
can be enabled or disabled using the Event Server Process Health Test Event Server monitoring setting.

Short Name: Process Status

Property Name: Event Server Process Health Test
Description: Enables the health test that the Event Server's process state is consistent with the role configuration.
Template Name: eventserver_scm_health_enabled
Default Value: true
Unit: no unit

Event Server Swap Memory Usage

This Event Server health test checks the amount of swap memory in use by the role. A failure of this health test may indicate that your machine is overloaded. This test can be configured
using the Process Swap Memory Thresholds monitoring settings.

Short Name: Swap Memory Usage

Property Name: Process Swap Memory Thresholds
Description: The health test thresholds on the swap memory usage of the process.
Template Name: process_swap_memory_thresholds
Default Value: critical:never, warning:any
Unit: BYTES

Event Server Unexpected Exits

This Event Server health test checks that the Event Server has not recently exited unexpectedly. The test returns "Bad" health if the number of unexpected exits exceeds a critical
threshold. For example, if this test is configured with a critical threshold of 1, this test returns "Good" health if there have been no unexpected exits recently. If 1 or more unexpected exits
occurred recently, this test returns "Bad" health. The test also indicates whether any of the exits were caused by an OutOfMemory error if the Cloudera Manager Kill When Out of
Memory monitoring setting is enabled. This test can be configured using the Unexpected Exits Thresholds and Unexpected Exits Monitoring
Period Event Server monitoring settings.

Short Name: Unexpected Exits

Property Name: Unexpected Exits Monitoring Period
Description: The period to review when computing unexpected exits.
Template Name: unexpected_exits_window
Default Value: 5
Unit: MINUTES

Property Name: Unexpected Exits Thresholds
Description: The health test thresholds for unexpected exits encountered within a recent period specified by the unexpected_exits_window configuration for the role.
Template Name: unexpected_exits_thresholds
Default Value: critical:any, warning:never
Unit: no unit
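The interaction between the monitoring period and the thresholds can be sketched as a count of exits inside a trailing window. This is an illustrative model only: the function names and the timestamps are invented, and the default `critical:any` is modeled as a threshold of 1, which matches the example in the test description above (one or more recent exits gives "Bad" health).

```python
from datetime import datetime, timedelta


def recent_exit_count(exit_times, now, window_minutes=5):
    """Count unexpected exits that fall inside the trailing monitoring period."""
    cutoff = now - timedelta(minutes=window_minutes)
    return sum(1 for t in exit_times if cutoff <= t <= now)


def unexpected_exits_health(count, critical=1):
    """Return 'Bad' once the count reaches the critical threshold.

    The warning level defaults to 'never', so it is not modeled here.
    """
    return "Bad" if count >= critical else "Good"


now = datetime(2024, 1, 1, 12, 0, 0)
exits = [now - timedelta(minutes=12), now - timedelta(minutes=3)]

# Only the exit from 3 minutes ago falls inside the 5 minute window.
count = recent_exit_count(exits, now)
print(count, unexpected_exits_health(count))
```

Under this model an exit stops affecting the test once it is older than the monitoring period, so a single crash clears after five minutes with the default window.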

Event Server Web Server Status

This health test checks that the role's web server is responding quickly to requests by the Cloudera Manager Agent, and that the Cloudera Manager Agent can collect metrics from the web
server. Failure of this health test may indicate a problem with the web server of the Event Server, a misconfiguration of the Event Server, or a problem with the Cloudera Manager Agent. Consult the
Cloudera Manager Agent logs and the logs of the Event Server for more detail. If the test failure message indicates a communication problem, the Cloudera Manager Agent's HTTP requests to the Event
Server's web server are failing or timing out. If the test's failure message indicates an unexpected response, the Event Server's web server responded to the Cloudera Manager Agent's request, but the
response could not be interpreted for some reason. This test can be configured using the Web Metric Collection Event Server monitoring setting.

Short Name: Web Server Status

Property Name: Web Metric Collection
Description: Enables the health test that the Cloudera Manager Agent can successfully contact and gather metrics from the web server.
Template Name: eventserver_web_metric_collection_enabled
Default Value: true
Unit: no unit

Property Name: Web Metric Collection Duration
Description: The health test thresholds on the duration of the metrics request to the web server.
Template Name: eventserver_web_metric_collection_thresholds
Default Value: critical:never, warning:10000.0
Unit: MILLISECONDS

Event Server Write Pipeline

This Event Server health test checks that no messages are being dropped by the writer stage of the Event Server pipeline. A failure of this health test indicates a problem with the Event
Server. This may indicate a configuration problem or a bug in the Event Server. This test can be configured using the Event Server Write Pipeline Monitoring Time Period
monitoring setting.

Short Name: Write Pipeline

Property Name: Event Server Write Pipeline Monitoring Thresholds
Description: The health test thresholds for monitoring the Event Server write pipeline. This specifies the number of dropped messages that will be tolerated over the monitoring time period.
Template Name: eventserver_write_pipeline_thresholds
Default Value: critical:any, warning:never
Unit: no unit

Property Name: Event Server Write Pipeline Monitoring Time Period
Description: The time period over which the Event Server write pipeline will be monitored for dropped messages.
