Alarms reference

The following table lists all pre-configured StorageGRID Webscale system alarms. Responses are assigned according to the severity of the alarm. This can vary if you customize the alarm settings to fit your system management approach.

Code

Name

Service

Recommended action

ABRL

Available Attribute Relays

BADC, BAMS, BARC, BCLB, BCMN, BCMS, BLDR, BNMS, BSSM, BDDS

Restore connectivity to a service (an ADC service) running an Attribute Relay Service as soon as possible. If there are no connected attribute relays, the grid node cannot report attribute values to the NMS service. Thus, the NMS service can no longer monitor the status of the service, or update attributes for the service.

If the problem persists, contact technical support.

ACMS

Available Metadata Services

BARC, BLDR, BCMN

An alarm is triggered when an LDR or ARC service loses connection to a DDS service. If this occurs, ingest or retrieve transactions cannot be processed. If the unavailability of DDS services is only a brief transient issue, transactions can be delayed.

Check and restore connections to a DDS service to clear this alarm and return the service to full functionality.

ACTS

Cloud Tiering Service Status

ARC

Only available for Archive Nodes with a Target Type of Cloud Tiering - Simple Storage Service (S3).

If the ACTS attribute for the Archive Node is set to Read-Only Enabled or Read-Write Disabled, you must set the attribute to Read-Write Enabled.

If a major alarm is triggered due to an authentication failure, verify the credentials associated with the destination bucket and update the values, if necessary.

If a major alarm is triggered due to any other reason, contact technical support.

ADCA

ADC Status

ADC

If an alarm is triggered, select Support > Grid Topology. Then select site > grid node > ADC > Overview > Main and ADC > Alarms > Main to determine the cause of the alarm.

If the problem persists, contact technical support.

ADCE

ADC State

ADC

If the value of ADC State is Standby, continue monitoring the service. If the problem persists, contact technical support.

If the value of ADC State is Offline, restart the service. If the problem persists, contact technical support.

AITE

Retrieve State

BARC, BARC

Only available for Archive Nodes with a Target Type of Tivoli Storage Manager (TSM).

If the value of Retrieve State is Waiting for Target, check the TSM middleware server and ensure that it is operating correctly. If the Archive Node has just been added to the StorageGRID Webscale system, ensure that the Archive Node's connection to the targeted external archival storage system is configured correctly.

If the value of Archive Retrieve State is Offline, attempt to update the state to Online. Select Support > Grid Topology. Then select site > grid node > ARC > Retrieve > Configuration > Main, select Archive Retrieve State > Online, and click Apply Changes.

If the problem persists, contact technical support.

AITU

Retrieve Status

BARC, BARC

If the value of Retrieve Status is Target Error, check the targeted external archival storage system for errors.

If the value of Archive Retrieve Status is Session Lost, check the targeted external archival storage system to ensure it is online and operating correctly. Check the network connection with the target.

If the value of Archive Retrieve Status is Unknown Error, contact technical support.

ALIS

Inbound Attribute Sessions

ADC

If the number of inbound attribute sessions on an attribute relay grows too large, it can be an indication that the StorageGRID Webscale system has become unbalanced. Under normal conditions, attribute sessions should be evenly distributed amongst ADC services. An imbalance can lead to performance issues.

If the problem persists, contact technical support.

ALOS

Outbound Attribute Sessions

ADC

The ADC service has a high number of attribute sessions, and is becoming overloaded. If this alarm is triggered, contact technical support.

ALUR

Unreachable Attribute Repositories

ADC

Check network connectivity with the NMS service to ensure that the service can contact the attribute repository.

If this alarm is triggered and network connectivity is good, contact technical support.

AMQS

Audit Messages Queued

BADC, BAMS, BARC, BCLB, BCMN, BCMS, BLDR, BNMS, BDDS

If audit messages cannot be immediately forwarded to an audit relay or repository, the messages are stored in a disk queue. During heavy loads this queue can exceed 100,000 messages. If this occurs, monitor the queue to determine if messages are being forwarded.

If the alarm is triggered, check the load on the system. If there have been a significant number of transactions, this can be normal and will resolve itself over time. In this case, the alarm can be ignored and will clear itself.

If the alarm persists, view a chart of the queue size. If the number continues increasing without occasional decreases, contact technical support.

In rare instances, the disk queue can be large enough to cause a thread deadlock when the AMS service starts. If a thread deadlock occurs, contact technical support.
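
The check described above (does the queue size keep increasing without occasional decreases?) can be sketched in Python. This is a minimal illustration only, assuming you have copied queue-size samples from the chart into a list in time order; the function name is hypothetical and is not part of StorageGRID Webscale.

```python
def grows_without_decrease(samples):
    """Return True if the samples never decrease and end higher than they start.

    `samples` is a list of queue sizes in time order (an assumption of this
    sketch; StorageGRID Webscale does not provide this function).
    """
    if len(samples) < 2:
        return False
    never_decreases = all(later >= earlier
                          for earlier, later in zip(samples, samples[1:]))
    return never_decreases and samples[-1] > samples[0]
```

A result of True corresponds to the case in which the text recommends contacting technical support. A queue that shrinks at least occasionally returns False.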

AOTE

Store State

BARC, BARC

Only available for Archive Nodes with a Target Type of Tivoli Storage Manager (TSM).

If the value of Store State is Waiting for Target, check the external archival storage system and ensure that it is operating correctly. If the Archive Node has just been added to the StorageGRID Webscale system, ensure that the Archive Node's connection to the targeted external archival storage system is configured correctly.

If the value of Store State is Offline, check the value of Store Status. Correct any problems before moving the Store State back to Online.

AOTU

Store Status

BARC, BARC

If the value of Store Status is Session Lost, check that the external archival storage system is connected and online.

If the value of Store Status is Target Error, check the external archival storage system for errors.

If the value of Store Status is Unknown Error, contact technical support.

APMS

Multipath State

SSM

If the multipath state alarm appears as "Simplex" (select Support > Grid Topology, then select site > grid node > SSM > Events), do the following:

Plug in or replace the cable that does not display any indicator lights.

Wait one to five minutes.

Do not unplug the other cable until at least five minutes after you plug in the first one. Unplugging too early can cause the root volume to become read-only, which requires that the hardware be restarted.

Return to the SSM > Resources page, and verify that the "Simplex" Multipath status changed to "Nominal" in the Storage Hardware section.

ARCE

ARC State

ARC

The ARC service has a state of Standby until all ARC components (Replication, Store, Retrieve, Target) have started. It then transitions to Online.

If the value of ARC State does not transition from Standby to Online, check the status of the ARC components.

If the value of ARC State is Offline, restart the service. If the problem persists, contact technical support.
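
The relationship between component startup and ARC State described above can be modeled in a few lines of Python. This is an illustrative sketch, not StorageGRID Webscale code: the function name and the idea of passing the started components as a set are assumptions made for the example, and the Offline state is deliberately not modeled.

```python
# Component names as listed in the description above.
ARC_COMPONENTS = {"Replication", "Store", "Retrieve", "Target"}

def expected_arc_state(started_components):
    """Return "Online" once every ARC component has started, else "Standby"."""
    if ARC_COMPONENTS.issubset(started_components):
        return "Online"
    return "Standby"
```

If the function returns "Standby" for the components you observe, the component missing from the set is the one to check first.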

AROQ

Objects Queued

ARC

This alarm can be triggered if the removable storage device is running slowly due to problems with the targeted external archival storage system, or if it encounters multiple read errors. Check the external archival storage system for errors, and ensure that it is operating correctly.

In some cases, this error can occur as a result of a high rate of data requests. Monitor the number of objects queued as system activity declines.

ARRF

Request Failures

ARC

If a retrieval from the targeted external archival storage system fails, the Archive Node retries the retrieval as the failure can be due to a transient issue. However, if the object data is corrupt or has been marked as being permanently unavailable, the retrieval does not fail. Instead, the Archive Node continuously retries the retrieval and the value for Request Failures continues to increase.

This alarm can indicate that the storage media holding the requested data is corrupt. Check the external archival storage system to further diagnose the problem.

If you determine that the object data is no longer in the archive, the object will have to be removed from the StorageGRID Webscale system. For more information, contact technical support.

An alarm is triggered if the value of Audit Shares is Unknown. This alarm can indicate a problem with the installation or configuration of the Admin Node.

If the problem persists, contact technical support.

AUMA

AMS Status

AMS

If the value of AMS Status is DB Connectivity Error, restart the grid node.

If the problem persists, contact technical support.

AUME

AMS State

AMS

If the value of AMS State is Standby, continue monitoring the StorageGRID Webscale system. If the problem persists, contact technical support.

If the value of AMS State is Offline, restart the service. If the problem persists, contact technical support.

AUXS

Audit Export Status

AMS

If an alarm is triggered, correct the underlying problem, and then restart the AMS service.

If the problem persists, contact technical support.

BASF

Available Object Identifiers

CMN

When a StorageGRID Webscale system is provisioned, the CMN service is allocated a fixed number of object identifiers. This alarm is triggered when the StorageGRID Webscale system begins to exhaust its supply of object identifiers.

To allocate more identifiers, contact technical support.

BASS

Identifier Block Allocation Status

CMN

By default, an alarm is triggered when object identifiers cannot be allocated because ADC quorum cannot be reached.

Identifier block allocation on the CMN service requires a quorum (50% + 1) of the ADC services to be online and connected. If quorum is unavailable, the CMN service is unable to allocate new identifier blocks until ADC quorum is re-established. If ADC quorum is lost, there is generally no immediate impact on the StorageGRID Webscale system (clients can still ingest and retrieve content), as approximately one month's supply of identifiers is cached elsewhere in the grid; however, if the condition continues, the StorageGRID Webscale system will lose the ability to ingest new content.

If an alarm is triggered, investigate the reason for the loss of ADC quorum (for example, it can be a network or Storage Node failure) and take corrective action.

If the problem persists, contact technical support.
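
The quorum rule above can be worked through with a small Python sketch. It assumes that "50% + 1" means a strict majority of the ADC services; the function names are hypothetical and are used only for illustration.

```python
def adc_quorum_size(total_adc_services):
    """Smallest number of ADC services that is more than half of the total.

    Assumes "50% + 1" means a strict majority.
    """
    return total_adc_services // 2 + 1

def has_adc_quorum(online_and_connected, total_adc_services):
    """Return True if enough ADC services are online and connected."""
    return online_and_connected >= adc_quorum_size(total_adc_services)
```

Under this assumption, a grid with three ADC services needs two of them online and connected, and a grid with four needs three.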

BRDT

Module temperature

SSM

An alarm is triggered if the temperature of a StorageGRID Webscale appliance E5600SG controller exceeds a nominal threshold.

If the Storage Node is a StorageGRID Webscale appliance, StorageGRID Webscale indicates that the storage controller needs attention.

An alarm is triggered if the service time (seconds) differs significantly from the operating system time. Under normal conditions, the service should resynchronize itself. If the service time drifts too far from the operating system time, system operations can be affected. Confirm that the StorageGRID Webscale system’s time source is correct.

If the problem persists, contact technical support.

BTSE

Clock State

BADC, BLDR, BNMS, BAMS, BCLB, BCMN, BARC, BCMS

An alarm is triggered if the service’s time is not synchronized with the time tracked by the operating system. Under normal conditions, the service should resynchronize itself. If the time drifts too far from operating system time, system operations can be affected. Confirm that the StorageGRID Webscale system’s time source is correct.

If the problem persists, contact technical support.

CAHP

Java Heap Usage Percent

DDS

An alarm is triggered if Java is unable to perform garbage collection at a rate that allows enough heap space for the system to properly function. An alarm might indicate a user workload that exceeds the resources available across the system for the DDS metadata store. Check the ILM Activity in the Dashboard, or select Support > Grid Topology, then select site > grid node > DDS > Resources > Overview > Main.

If the problem persists, contact technical support.

CAIH

Number Available Ingest Destinations

CLB

This alarm clears when underlying issues of available LDR services are corrected. Ensure that the HTTP component of LDR services is online and running normally.

If the problem persists, contact technical support.

CAQH

Number Available Q/R Destinations

CLB

This alarm clears when underlying issues of available LDR services are corrected. Ensure that the HTTP component of LDR services is online and running normally.

If the problem persists, contact technical support.

CASA

Data Store Status

DDS

An alarm is raised if the Cassandra metadata store becomes unavailable.

Check the status of Cassandra:

At the Storage Node, log in as admin and su to root using the password listed in the Passwords.txt file.

Enter: service cassandra status

If Cassandra is not running, restart it: service cassandra restart

This alarm might also indicate that the metadata store (Cassandra database) for a Storage Node requires rebuilding.

This alarm is triggered when the Metadata Effective Space (CEMS) reaches 70% full (minor alarm), 90% full (major alarm), and 100% full (critical alarm).

If this alarm reaches the 90% threshold, a warning appears on the Dashboard in the Grid Manager. You must perform an expansion procedure to add new Storage Nodes as soon as possible. See the instructions for expanding a StorageGRID Webscale grid.

If this alarm reaches the 100% threshold, you must stop ingesting objects and add Storage Nodes immediately. Cassandra requires a certain amount of space to perform essential operations such as compaction and repair. These operations will be impacted if object metadata uses more than 100% of the allowed space. Undesirable results can occur.

Note: Contact technical support if you are unable to add Storage Nodes.

Once new Storage Nodes are added, the system automatically rebalances object metadata across all Storage Nodes, and the alarm clears.
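
The 70%, 90%, and 100% thresholds described above map to alarm severities as in the following Python sketch. The function name is hypothetical. The sketch assumes the default thresholds quoted in the text, and customized alarm settings would change the numbers.

```python
def metadata_space_severity(percent_full):
    """Map a metadata-space percentage to the severity named in the text.

    Returns None below the 70% threshold (no alarm).
    """
    if percent_full >= 100:
        return "critical"
    if percent_full >= 90:
        return "major"
    if percent_full >= 70:
        return "minor"
    return None
```

For example, a value of 92 falls in the major range, which is the point at which the text calls for an expansion as soon as possible.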

CLBA

CLB Status

CLB

If an alarm is triggered, select Support > Grid Topology, then select site > grid node > CLB > Overview > Main and CLB > Alarms > Main to determine the cause of the alarm and to troubleshoot the problem.

If the problem persists, contact technical support.

CLBE

CLB State

CLB

If the value of CLB State is Standby, continue monitoring the situation. If the problem persists, contact technical support.

If the state is Offline and there are no known server hardware issues (for example, the server is unplugged) or scheduled downtime, restart the service. If the problem persists, contact technical support.

CMNA

CMN Status

CMN

If the value of CMN Status is Error, select Support > Grid Topology, then select site > grid node > CMN > Overview > Main and CMN > Alarms > Main to determine the cause of the error and to troubleshoot the problem.

An alarm is triggered and the value of CMN Status is No Online CMN during a hardware refresh of the primary Admin Node when the CMNs are switched (the value of the old CMN State is Standby and the new is Online).

If the problem persists, contact technical support.

CMSS

CMS State

If an alarm is triggered, contact technical support.

CMST

CMS Status

CMS

If an alarm is triggered, contact technical support.

CPRC

Remaining Capacity

NMS

An alarm is triggered if the remaining capacity (number of available connections that can be opened to the NMS database) falls below the configured alarm severity.

If an alarm is triggered, contact technical support.

CPUT

CPU Temperature

SSM

An alarm is triggered if the temperature of a StorageGRID Webscale appliance E5600SG controller CPU exceeds a nominal threshold.

If the Storage Node is a StorageGRID Webscale appliance, the StorageGRID Webscale system indicates that the storage controller needs attention.

This alarm is triggered when the average time required to run a query against the metadata store through the service exceeds the value set in the Grid Manager.

To resolve this alarm, check for hardware and workload changes around the time the query latency increased. For example, hardware issues such as multiple failed disks and workload changes such as a sudden increase in ingests, can lead to an increase in query latency.

DNST

DNS Status

SSM

After installation completes, a DNST alarm is triggered in the SSM service. After the DNS is configured and the new server information reaches all grid nodes, the alarm is canceled.

ECCD

Corrupt Fragments Detected

LDR

An alarm is triggered when the background verification process detects a corrupt erasure coded fragment. If a corrupt fragment is detected, an attempt is made to rebuild the fragment.

Reset the Corrupt Fragments Detected and Copies Lost attributes to zero and monitor them to see if counts go up again. If counts do go up, there may be a problem with the Storage Node's underlying storage. A copy of erasure coded object data is not considered missing until the number of lost or corrupt fragments breaches the erasure code's fault tolerance; therefore, it is possible to have a corrupt fragment and still be able to retrieve the object.

If the problem persists, contact technical support.

ECST

Verification Status

LDR

This alarm indicates the current status of the background verification process for erasure coded object data on this Storage Node.

A major alarm is triggered if there is an error in the background verification process.

FOPN

Open File Descriptors

BADC, BAMS, BARC, BCLB, BCMN, BLDR, BNMS, BSSM, BDDS

FOPN can become large during peak activity. If it does not diminish during periods of slow activity, contact technical support.

HSTE

HTTP State

BLDR, BLDR

It is critical that the HTTP protocol be online and running without errors.

Check the state of the LDR service and the related Storage component. Ensure all are online.

Check that the HTTP component is configured to autostart when the service is restarted.

HSTU

HTTP Status

HTAS

Auto-Start HTTP

LDR

Specifies whether to start HTTP services automatically on start-up. This is a user-specified configuration option.

IQSZ

Number of Objects

Either objects are arriving for ingest faster than the ILM policy can evaluate them, or a large number of objects that require an ILM re-evaluation are being processed.

Plot the value of IQSZ over the course of a day or week, and check that at times of low system activity the number of objects drops, and tends towards zero.

Check system activity to confirm that there is an increase in system activity. An increase in system activity will result in an increase in attribute data activity. This increased activity will result in a delay in the processing of attribute data. This can be normal system activity and will subside.

Check for multiple alarms. An increase in average latency times can be indicated by an excessive number of triggered alarms.

If the problem persists, contact technical support.

LATW

Worst-Case Latency

NMS

Check for connectivity issues.

Check system activity to confirm that there is an increase in activity. An increase in system activity will result in an increase in attribute data activity. This increased activity will result in a delay in the processing of attribute data. This can be normal system activity and will subside.

Check for multiple alarms. An increase in average latency times can be indicated by an excessive number of triggered alarms.

If the problem persists, contact technical support.

LDRE

LDR State

LDR

If the value of LDR State is Standby, continue monitoring the situation. If the problem persists, contact technical support.

If the value of LDR State is Offline, restart the service. If the problem persists, contact technical support.

LOST

Lost Objects

DDS, LDR

Triggered when the StorageGRID Webscale system fails to retrieve a copy of the requested object from anywhere in the system. Before a LOST (Lost Objects) alarm is triggered, the system attempts to retrieve and replace a missing object from elsewhere in the system.

Lost objects represent a loss of data. The Lost Objects attribute is incremented whenever the number of locations for an object drops to zero without the DDS service purposely purging the content to satisfy the ILM policy.

Check the network connections of the servers hosting the NMS service and the external mail server. Also confirm that the NMS e-mail server configuration is correct.

MINS

E-mail Notifications Status

BNMS, BNMS

A minor alarm is triggered if the NMS service is unable to connect to the mail server. Check the network connections of the servers hosting the NMS service and the external mail server. Also confirm that the NMS e-mail server configuration is correct.

MISS

NMS Interface Engine Status

BNMS, BNMS

An alarm is triggered if the NMS interface engine on the Admin Node that gathers and generates interface content is disconnected from the system. Check Server Manager to determine if the server individual application is down.

MMQS

Peak Message Queue Size

BADC, BAMS, BARC, BCLB, BCMN, BLDR, BNMS, BSSM, BDDS

An alarm indicates that the grid node is overloaded, and might not be able to process operations at a high enough rate to support normal system operation. Client requests can time out when nodes are in this condition.

If the problem persists, contact technical support.

NANG

Network Auto Negotiate Setting

SSM

Check the network adapter configuration. The setting must match the preferences of your network routers and switches.

An incorrect setting can have a severe impact on system performance.

NDUP

Network Duplex Setting

SSM

Check the network adapter configuration. The setting must match the preferences of your network routers and switches.

An incorrect setting can have a severe impact on system performance.

NLNK

Network Link Detect

SSM

Check the network cable connections on the port and at the switch.

Check the network router, switch, and adapter configurations.

Restart the server.

If the problem persists, contact technical support.

NRER

Receive Errors

SSM

These errors can clear without being manually reset. If errors do not clear, check the network hardware.

Check that the adapter hardware and driver are correctly installed and configured to work with your network routers and switches.

If audit relays are not connected to ADC services, audit events cannot be reported. They are queued and unavailable to users until the connection is restored.

Restore connectivity to an ADC service as soon as possible.

If the problem persists, contact technical support.

NSCA

NMS Status

NMS

If the value of NMS Status is DB Connectivity Error, restart the service. If the problem persists, contact technical support.

NSCE

NMS State

NMS

If the value of NMS State is Standby, continue monitoring. If the problem persists, contact technical support.

If the value of NMS State is Offline, restart the service. If the problem persists, contact technical support.

NSPD

Speed

SSM

This can be caused by network connectivity or driver compatibility issues. If the problem persists, contact technical support.

NTBR

Free Tablespace

NMS

If an alarm is triggered, check how fast database usage has been changing. A sudden drop (as opposed to a gradual change over time) indicates an error condition. If the problem persists, contact technical support.

Adjusting the alarm threshold allows you to proactively manage when additional storage needs to be allocated.

If the available space reaches a low threshold (see alarm threshold), contact technical support to change the database allocation.
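
One way to tell a sudden drop from a gradual change is to look at the largest decrease between consecutive samples of Free Tablespace. The Python sketch below illustrates the idea. The function names and the `step_limit` parameter are assumptions for this example; the text does not define a numeric boundary between "sudden" and "gradual", so you must choose a limit that fits your own history.

```python
def largest_step_drop(samples):
    """Largest decrease between consecutive samples (0 if there is none)."""
    drops = [earlier - later for earlier, later in zip(samples, samples[1:])]
    return max(drops + [0])

def looks_like_sudden_drop(samples, step_limit):
    """Return True if any single step falls by more than step_limit.

    step_limit is a value you choose; it is not a StorageGRID setting.
    """
    return largest_step_drop(samples) > step_limit
```

With samples of 100, 98, 96, and 94 and a limit of 10, the decline is treated as gradual. With samples of 100, 98, and 60 and the same limit, the final step of 38 is flagged.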

NTER

Transmit Errors

SSM

These errors can clear without being manually reset. If they do not clear, check network hardware. Check that the adapter hardware and driver are correctly installed and configured to work with your network routers and switches.

If the frequency offset exceeds the configured threshold, there is likely a hardware problem with the local clock. If the problem persists, contact technical support to arrange a replacement.

NTLK

NTP Lock

SSM

If the NTP daemon is not locked to an external time source, check network connectivity to the designated external time sources, their availability, and their stability.

NTLR

Repair Completion Status

DDS

If a nodetool repair task for Cassandra stalls, the normal background process of checking for and repairing potential database inconsistencies cannot complete and is retried every hour.

Check the Cassandra log at /var/local/log/cassandra/system.log for errors, and correct any issues that you discover. For example, the Storage Node could be isolated due to network issues.

Contact technical support if you cannot identify or resolve the issue that prevents nodetool repair from completing.
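
When reviewing the Cassandra log for errors, a small filter can help narrow down the entries to read. The Python sketch below is an illustration, not a supported tool. It assumes that each log entry begins with its level, such as ERROR, which is a common Cassandra log layout but is not confirmed by this document, and the function name is hypothetical.

```python
def error_lines(lines):
    """Return the log lines that start with the ERROR level.

    Assumes each entry begins with its level; adjust the test if your
    log format differs.
    """
    return [line.rstrip("\n") for line in lines if line.startswith("ERROR")]

# Example use on a Storage Node (path taken from the text above):
# with open("/var/local/log/cassandra/system.log") as log:
#     for entry in error_lines(log):
#         print(entry)
```

The function accepts any iterable of lines, so you can pass it an open file or a list of lines copied from the log.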

NTOF

NTP Time Offset

SSM

If the time offset exceeds the configured threshold, there is likely a hardware problem with the oscillator of the local clock. If the problem persists, contact technical support to arrange a replacement.

NTSA

NTP Sources Available

SSM

If this server is configured to act as a primary NTP server for the StorageGRID Webscale system, this attribute tracks the number of external NTP time sources available. It is normal for this number to fluctuate if there are a large number of external time sources available.

If the server is configured to act as a secondary NTP time server or an NTP client, the server uses other servers as its NTP time sources. For more information about the StorageGRID Webscale system’s NTP configuration, see the Solution Design document for your deployment.

If the number of NTP time sources available falls below the configured minimum, the accuracy and consistency of local time on the server can suffer. If the number of NTP time sources falls to zero, local server time will drift out of synchronization with the time recorded by other services. In extreme cases, this can disrupt system operations. Correct the issue as quickly as possible.

NTSD

Chosen Time Source Delay

SSM

These values give an indication of the reliability and stability of the time source that NTP on the local server is using as its reference.

If an alarm is triggered, it can be an indication that the time source’s oscillator is defective, or that there is a problem with the WAN link to the time source.

NTSJ

Chosen Time Source Jitter

NTSO

Chosen Time Source Offset

NTSU

NTP Status

SSM

If the value of NTP Status is Not Running, contact technical support.

OCOR

Corrupt Objects Detected

LDR

The total number of corrupt replicated objects that the most recently run background verification process has detected on the Storage Node.

Any corrupt object should be investigated. More than 10 indicates a major problem.

Note that this value is persistent: it is not updated once the corrupt objects have been restored.

An alarm is triggered if the power of a StorageGRID Webscale appliance enclosure deviates from the recommended operating voltage.

Check Power Supply A or B status to determine which power supply is operating abnormally.

If necessary, replace the power supply.

OQRT

Objects Quarantined

LDR

After the objects are automatically restored by the StorageGRID Webscale system, the quarantined objects must be manually removed from the quarantine directory. Contact technical support.

After the quarantined objects are removed, the value of OQRT is updated and the alarm clears.

ORSU

Outbound Replication Status

BLDR, BARC

An alarm indicates that outbound replication is not possible: storage is in a state where objects cannot be retrieved. An alarm is triggered if outbound replication is disabled manually. Select Support > Grid Topology. Then select site > grid node > LDR > Replication > Configuration.

An alarm is triggered if the LDR service is unavailable for replication. Select Support > Grid Topology. Then select site > grid node > LDR > Storage.

PMEM

Service Memory Usage (Percent)

BADC, BAMS, BARC, BCLB, BCMN, BCMS, BLDR, BNMS, BSSM, BDDS

Can have a value of Over Y% RAM where Y represents the percentage of memory being used by the server.

Figures under 80% are normal. Over 90% is considered a problem.

If memory usage is high for a single service, monitor the situation and investigate.

If the problem persists, contact technical support.
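
The guidance above (under 80% is normal, over 90% is a problem) can be expressed as a short Python sketch. The function name is hypothetical, and the "monitor" label for the 80% to 90% band is an assumption of this example, because the text does not name that range.

```python
def classify_memory_usage(percent):
    """Classify Service Memory Usage (Percent) using the figures in the text."""
    if percent < 80:
        return "normal"
    if percent > 90:
        return "problem"
    # The text does not label the 80-90% band; "monitor" is an assumption.
    return "monitor"
```

A value of 85, for example, falls in the unlabeled band, where the sensible response is to keep watching the service rather than to act immediately.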

PSAS

Power Supply A Status

SSM

An alarm is triggered if power supply A of a StorageGRID Webscale appliance deviates from the recommended operating voltage.

If necessary, replace power supply A.

PSBS

Power Supply B Status

SSM

An alarm is triggered if power supply B of a StorageGRID Webscale appliance deviates from the recommended operating voltage.

If necessary, replace power supply B.

RDTE

Tivoli Storage Manager State

BARC, BARC

Only available for Archive Nodes with a Target Type of Tivoli Storage Manager (TSM).

If the value of Tivoli Storage Manager State is Offline, check Tivoli Storage Manager Status and resolve any problems.

Only available for Archive Nodes with a Target Type of Tivoli Storage Manager (TSM).

If the value of Tivoli Storage Manager Status is Configuration Error and the Archive Node has just been added to the StorageGRID Webscale system, ensure that the TSM middleware server is correctly configured.

If the value of Tivoli Storage Manager Status is Connection Failure, or Connection Failure, Retrying, check the network configuration on the TSM middleware server, and the network connection between the TSM middleware server and the StorageGRID Webscale system.

If the value of Tivoli Storage Manager Status is Authentication Failure, or Authentication Failure, Reconnecting, the StorageGRID Webscale system can connect to the TSM middleware server, but cannot authenticate the connection. Check that the TSM middleware server is configured with the correct user, password, and permissions, and restart the service.

If the value of Tivoli Storage Manager Status is Session Failure, an established session has been lost unexpectedly. Check the network connection between the TSM middleware server and the StorageGRID Webscale system. Check the middleware server for errors.

Replication alarms (Inbound Replications – Failed RIRF and Outbound Replications – Failed RORF) can occur during periods of high load or temporary network disruptions. After system activity reduces, these alarms should clear. If the count of failed replications continues to increase, look for network problems and verify that the source and destination LDR and ARC services are online and available.

Alarms can occur during periods of high load or temporary network disruption. After system activity reduces, this alarm should clear. If the count for queued replications continues to increase, look for network problems and verify that the source and destination LDR and ARC services are online and available.

RORF

Outbound Replications – Failed

BLDR, BARC

The threshold for a notice alarm is 10 objects, while greater than 50 objects triggers a minor alarm.

Replication alarms (Inbound Replications – Failed (RIRF) and Outbound Replications – Failed (RORF)) can occur during periods of high load or due to temporary network disruptions. After system activity reduces, these alarms should clear. If the count of failed replications continues to increase, look for network problems and verify that the source and destination LDR and the ARC services are online and available.
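
The thresholds quoted for this alarm can be illustrated with a short Python sketch. It assumes that a count of exactly 10 already triggers the notice alarm, which the wording "the threshold ... is 10 objects" suggests but does not state. The function name is hypothetical.

```python
def failed_replication_severity(failed_count):
    """Map a count of failed outbound replications to a severity.

    Assumes the notice alarm starts at 10 objects and the minor alarm
    at more than 50 objects, as quoted in the text.
    """
    if failed_count > 50:
        return "minor"
    if failed_count >= 10:
        return "notice"
    return None
```

Because the alarm is expected to clear after load subsides, the trend of the count over time matters more than any single reading.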

If the value of Status for a grid task being aborted is Error, retry aborting the grid task.

If the problem persists, contact technical support.

SCHR

Status

CMN

If the value of Status for the historical grid task is Aborted, investigate the reason and run the task again if required.

If the problem persists, contact technical support.

SHLH

Health

LDR

If the value of Health for an object store is Error, check and correct:

problems with the volume being mounted

file system errors

SLSA

CPU Load Average

SSM

The higher the value, the busier the system.

If the CPU Load Average persists at a high value, investigate the number of transactions in the system to determine whether the cause is heavy load at the time. To view a chart of the CPU load average, select Support > Grid Topology. Then select site > grid node > SSM > Resources > Reports > Charts.

If the load on the system is not heavy and the problem persists, contact technical support.

Note: If you use Linux and run multiple containers on a single host, you might want to change the trigger values for the CPU Load Average alarm to better reflect the host utilization. See Changing trigger values for CPU Load Average.

SMST

Log Monitor State

SSM

If the value of Log Monitor State is not Connected for a persistent period of time, contact technical support.

SMTT

Total Events

SSM

If the value of Total Events is greater than zero, check if there are known events (such as network failures) that can be the cause. Unless these errors have been cleared (that is, the count has been reset to 0), Total Events alarms can be triggered.

Note: To reset event counts, you must be a user who belongs to a group that has the Grid Topology Page Configuration permission enabled.

If the value of Total Events is zero, or the number increases and the problem persists, contact technical support.

SNST

Status

CMN

An alarm indicates that there is a problem storing the grid task bundles. If the value of Status is Checkpoint Error or Quorum Not Reached, confirm that a majority of ADC services are connected to the StorageGRID Webscale system (50 percent plus one) and then wait a few minutes.

If the problem persists, contact technical support.
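
The majority rule mentioned above (50 percent plus one) can be expressed as a small calculation. The following Python sketch is illustrative only: it assumes the rule means a strict majority (more than half) of ADC services, and the function names are hypothetical and not part of StorageGRID Webscale.

```python
def majority_needed(total_adc_services: int) -> int:
    """Smallest number of connected ADC services that is more than half of the total."""
    if total_adc_services < 1:
        raise ValueError("total_adc_services must be at least 1")
    return total_adc_services // 2 + 1


def has_majority(connected: int, total: int) -> bool:
    """Return True when the connected ADC services form a strict majority."""
    return connected >= majority_needed(total)
```

For example, under this assumption a grid with three ADC services needs two of them connected, and a grid with four ADC services needs three.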

SOSS

Storage Operating System Status

SSM

An alarm is triggered if SANtricity software indicates that there is a "Needs attention" issue with an E2700 controller or a StorageGRID Webscale appliance component.

If the value of SSM Status is Error, select Support > Grid Topology, then select site > grid node > SSM > Overview > Main and SSM > Overview > Alarms to determine the cause of the alarm.

If the problem persists, contact technical support.

SSME

SSM State

SSM

If the value of SSM State is Standby, continue monitoring, and if the problem persists, contact technical support.

If the value of SSM State is Offline, restart the service. If the problem persists, contact technical support.

SSTS

Storage Status

BLDR

If the value of Storage Status is Insufficient Usable Space, there is no more available storage on the Storage Node, and data ingests are redirected to other available Storage Nodes. Retrieval requests can continue to be delivered from this grid node.

Add storage. End-user functionality is not affected, but the alarm persists until additional storage is added.

If the value of Storage Status is Volume(s) Unavailable, a part of the storage is unavailable. Storage and retrieval from these volumes is not possible. Check the volume’s Health for more information: Select Support > Grid Topology. Then select site > grid node > LDR > Storage > Overview > Main. The volume's Health is listed under Object Stores.

If the value of Storage Status is Error, contact technical support.

SVST

Status

SSM

This alarm clears when other alarms related to a non-running service are resolved. Track the source service alarms to restore operation.

Select Support > Grid Topology. Then select site > grid node > SSM > Services > Overview > Main. When the status of a service is shown as Not Running, its state is Administratively Down. The service’s status can be listed as Not Running for the following reasons:

The service has been manually stopped (/etc/init.d/<service> stop).

There is an issue with the MySQL database and Server Manager shuts down the MI service.

A grid node has been added, but not started.

During installation, a grid node has not yet connected to the Admin Node.

If a service is listed as Not Running, restart the service (/etc/init.d/<service> restart).

Nodes running with less than 24 GiB of installed memory can lead to performance problems and system instability. The amount of memory installed on the system should be increased to at least 24 GiB.

TPOP

Pending Operations

ADC

A queue of messages can indicate that the ADC service is overloaded. Too few ADC services can be connected to the StorageGRID Webscale system. In a large deployment, the ADC service can require adding computational resources, or the system can require additional ADC services.

UMEM

Available Memory

SSM

If the available RAM gets low, determine whether this is a hardware or software issue. If it is not a hardware issue, or if available memory falls below 50 MB (the default alarm threshold), contact technical support.

VMFI

Entries Available

SSM

This is an indication that additional storage is required. Contact technical support.

VMFR

Space Available

SSM

If the value of Space Available gets too low (see alarm thresholds), investigate whether log files are growing out of proportion or objects are taking up too much disk space (see alarm thresholds), and reduce or delete them as needed.

If the problem persists, contact technical support.
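
When investigating which files are consuming space on a volume, a short script can help list the largest ones. The following Python sketch is one possible approach and is not a StorageGRID Webscale tool; the function name and the directory you pass in are assumptions for illustration.

```python
import os


def largest_files(root, limit=10):
    """Return up to `limit` (size_in_bytes, path) pairs under `root`, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                # The file was removed or is unreadable; skip it.
                continue
    sizes.sort(reverse=True)
    return sizes[:limit]
```

Calling largest_files with the mount point of the affected volume returns the biggest files first, which can show whether a log file is growing out of proportion.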

VMST

Status

SSM

An alarm is triggered if the value of Status for the mounted volume is Unknown. A value of Unknown or Offline can indicate that the volume cannot be mounted or accessed due to a problem with the underlying storage device.

VPRI

Verification Priority

BLDR, BARC

By default, the value of Verification Priority is Adaptive. If Verification Priority is set to High, an alarm is triggered because storage verification can slow normal operations of the service.

If the value of Object Verification Status is Verify Location Synchronize Failed, check that the LDR service is connected to at least one CMS service.

Also check the operating system for any signs of block-device or file system errors.

If the value of Object Verification Status is Maximum Number of Failures Reached, it usually indicates a low-level file system or hardware problem (I/O error) that prevents the Storage Verification task from accessing stored content. This alarm can also occur when there is a high number of content errors indicating that data was invalid.

If the value of Object Verification Status is Unknown Error, contact technical support.

XAMS

Unreachable Audit Repositories

BADC, BARC, BCLB, BCMN, BCMS, BLDR, BNMS

Check network connectivity to the server hosting the Admin Node.

If the problem persists, contact technical support.

More information

Changing trigger values for CPU Load Average
If you are using StorageGRID Webscale with Linux hosts and you are running multiple containers on a single host, you can change the trigger values for the CPU Load Average alarm to better reflect the host utilization.
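
To judge how a load average relates to the capacity of a host, it can help to compare it with the number of CPUs on that host. The following Python sketch shows that comparison using standard library calls (os.getloadavg and os.cpu_count, available on Linux). It does not show how StorageGRID Webscale computes or evaluates the alarm; the function names are hypothetical.

```python
import os


def load_per_cpu(load_average, cpu_count):
    """Divide a load average by the number of CPUs to get a per-CPU figure."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_average / cpu_count


def describe_host_load():
    """Return the 1-, 5-, and 15-minute load averages of this host, per CPU."""
    cpus = os.cpu_count() or 1  # cpu_count() can return None
    return tuple(load_per_cpu(value, cpus) for value in os.getloadavg())
```

For example, a load average of 8.0 on a host with 4 CPUs gives a per-CPU value of 2.0, while the same load average on a host with 16 CPUs gives 0.5. A comparison like this can inform which trigger values are reasonable for a host that runs several containers.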