Safeguarding, Monitoring and Providing an Automated fix for a Controlled Messaging System

I had a request this morning to monitor a bespoke application, the Tyrell Control Messaging application.

So I thought I’d document the application monitoring process for you all to see.

The server names and IP addresses have been sanitised to protect the innocent.

Monitoring is required to let the engineers know when the application has failed and to provide an automated fix.

So here is how I did it.

Find the node that the service runs on. Click “Service Control Manager”.

Note down each of the services you want to monitor.

Hovering over the service gives you the name of the service.

To monitor the Tyrell Guardian service, click on “Start monitoring this service”.

Add the rest of the services to the created application template that automatically appears when you add the first component monitor.

Be sure to rename the component names to something useful to match the services being monitored.

Once all the component monitors are in place, enter a suitable template description to let users know what the service application monitor is monitoring.

Click “Submit”.

Now go back to the node; the application template has been applied to it.

Click on the TYRELL – CONTROL MESSAGING application to see its application details summary.

This application monitoring template can now be used to drive an alert that restarts the service based on the threshold status, saving an engineer from a midnight call to say that the control messaging application has failed.
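The restart step itself is not shown in this post, so here is a rough sketch of what an external restart script might look like. It assumes a Windows host with the standard `sc` command available; the service name `TyrellGuardian`, the helper names and the wait time are all made up for illustration and are not taken from the real configuration.

```python
import subprocess
import time


def restart_commands(service_name, server=None):
    """Build the 'sc stop' and 'sc start' command lines for a service.

    server is an optional remote host; sc expects it as \\\\host before the verb.
    """
    target = [f"\\\\{server}"] if server else []
    return [
        ["sc", *target, "stop", service_name],
        ["sc", *target, "start", service_name],
    ]


def restart_service(service_name, server=None, wait_seconds=10, run=subprocess.run):
    """Stop then start a service; return True if the start command succeeded.

    'run' is injectable so the logic can be exercised without touching a real
    service. The fixed wait is a simplification: 'sc stop' returns before the
    service has fully stopped.
    """
    stop_cmd, start_cmd = restart_commands(service_name, server)
    run(stop_cmd, check=False)  # may fail if the service is already stopped
    time.sleep(wait_seconds)
    result = run(start_cmd, check=False)
    return result.returncode == 0


if __name__ == "__main__":
    # Hypothetical service name; only meaningful on a Windows host.
    print(restart_commands("TyrellGuardian"))
```

A production script would poll the service state rather than sleep for a fixed time, and would log the outcome so the alert history shows whether the fix worked.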

Creating a Control Messaging Alert to notify engineers when the control messaging application goes down and to restart the service.

Go into “Manage Alerts”.

Click on “Add New Alert”.

Enter the alert properties.

Do not enable the alert yet.

The trigger condition targets one specific application and fires when the node is UP but the application is DOWN. When it fires, the trigger actions specified in the alert are run.
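To make the logic concrete, here is a small sketch of the trigger and reset conditions as plain functions. It is only an illustration of the rule (node not down AND application down), not how SolarWinds evaluates alerts internally; the function names and status strings are assumptions.

```python
def should_trigger(node_status, app_availability):
    """True when the node is not down but the application is down."""
    node_up = node_status.strip().lower() != "down"
    app_down = app_availability.strip().lower() == "down"
    return node_up and app_down


def should_reset(node_status, app_availability):
    """The alert resets once the trigger condition is no longer true."""
    return not should_trigger(node_status, app_availability)


if __name__ == "__main__":
    print(should_trigger("Up", "Down"))    # application failed on a healthy node
    print(should_trigger("Down", "Down"))  # node outage, not an application fault
```

Requiring the node to be up stops this alert from firing during a full server outage, which would normally be covered by a separate node-down alert.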

The reset condition is set to reset the alert when the trigger condition is no longer true.

Escalation level 2 has been added to email a manager if the application does not come back up within an hour.
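The escalation timing can be sketched as follows. This is a simplified model, assuming level 1 applies as soon as the alert triggers and level 2 applies once the application has been down for a full hour; the function and constant names are hypothetical.

```python
from datetime import datetime, timedelta

# Delay before the manager is emailed (escalation level 2).
ESCALATION_DELAY = timedelta(hours=1)


def escalation_level(triggered_at, now, still_down):
    """Return the highest escalation level reached: 0 (reset), 1 or 2."""
    if not still_down:
        return 0  # alert has reset, nothing to escalate
    if now - triggered_at >= ESCALATION_DELAY:
        return 2  # down for an hour or more: email the manager
    return 1      # initial actions only


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 0, 0)
    print(escalation_level(start, start + timedelta(minutes=5), True))
    print(escalation_level(start, start + timedelta(hours=1), True))
```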

Email and event log actions are specified. The email is sent to the infrastructure team, and the event log action records the event in SolarWinds.

The reset action is the same as the trigger action.

Finally, enable the alert.

Summary of Alert Configuration

Please review the alert configuration before saving…

Name of alert: CONTROL MESSAGING ALERT

Description of alert: This alert will write to the event log and email the alerts mailbox when the CONTROL MESSAGING application goes down and when the application comes back up.

Type of Property to monitor: Application

Enabled (On/Off): ON

Evaluation Frequency of alert: Every minute

Severity of alert: Critical

Alert Custom Properties: (1) ResponsibleTeam:

Alert owner (user who created this alert): emt\atimberley

Alert Limitation Category: No Limitation

Trigger Condition:

Alert on all objects where: Application – Instance of Application – is – swis://SOLARWINDS/Orion/Orion.Nodes/NodeID=164/Applications/ApplicationID=28

The actual trigger condition: All child conditions must be satisfied (AND)

Node – Status – is not equal to – Down

Application Alerting Properties – Application Availability – is equal to – Down

Reset Condition: When the trigger condition is no longer true

Time of Day schedule: Alert is always enabled

Trigger Action:

Escalation Level 1

1. CONTROL MESSAGING Email/Page (Application “${N=SwisEntity;M=ApplicationAlert.ApplicationName}” on “${N=SwisEntity;M=Node.Caption}” is currently ${N=SwisEntity;M=ApplicationAlert.ApplicationAvailability})
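The `${N=...;M=...}` placeholders in the action text are filled in when the alert fires. The sketch below mimics that substitution with a regular expression, purely to show what the finished message looks like. It is not how SolarWinds resolves variables internally, and the sample values (including the node name `example-node`) are invented.

```python
import re

# Matches placeholders of the form ${N=<namespace>;M=<member>}.
VARIABLE = re.compile(r"\$\{N=(?P<namespace>\w+);M=(?P<member>[\w.]+)\}")


def render_alert_text(template, values):
    """Replace each known placeholder; leave unknown ones untouched."""
    def substitute(match):
        key = (match.group("namespace"), match.group("member"))
        return str(values.get(key, match.group(0)))
    return VARIABLE.sub(substitute, template)


if __name__ == "__main__":
    template = (
        'Application "${N=SwisEntity;M=ApplicationAlert.ApplicationName}" on '
        '"${N=SwisEntity;M=Node.Caption}" is currently '
        "${N=SwisEntity;M=ApplicationAlert.ApplicationAvailability}"
    )
    # Hypothetical sample values for illustration only.
    sample = {
        ("SwisEntity", "ApplicationAlert.ApplicationName"): "TYRELL - CONTROL MESSAGING",
        ("SwisEntity", "Node.Caption"): "example-node",
        ("SwisEntity", "ApplicationAlert.ApplicationAvailability"): "Down",
    }
    print(render_alert_text(template, sample))
```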
