Troubleshooting techniques in vRealize Operations components

vRealize Operations Manager helps manage the health, efficiency, and compliance of virtualized environments. This tutorial covers troubleshooting techniques for vRealize Operations: how to monitor and troubleshoot its key components and services.

The Watchdog service

Watchdog is a vRealize Operations service that maintains the necessary daemons/services and attempts to restart them if they fail. The vcops-watchdog is a Python script that the vcops-watchdog-daemon runs every 5 minutes to monitor the various vRealize Operations services, including CaSA.

The Watchdog service performs the following checks:

PID file of the service

Service status
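
As a rough illustration of the PID file check listed above, the sketch below tests whether a PID file points at a live process. It is not the actual vcops-watchdog code (which is a Python script); the function name `pid_file_alive` is hypothetical, and the check assumes it is run as root, since `kill -0` can fail for processes owned by another user.

```shell
# Hypothetical helper: succeeds only if the PID file is readable, holds a
# numeric PID, and a process with that PID is currently running.
pid_file_alive() {
  local pid_file="$1" pid
  [ -r "$pid_file" ] || return 1
  # Take the first line and strip whitespace.
  pid=$(head -n 1 "$pid_file" | tr -d '[:space:]')
  case "$pid" in
    ''|*[!0-9]*) return 1 ;;   # empty or non-numeric content
  esac
  # kill -0 sends no signal; it only tests whether the PID can be signalled.
  kill -0 "$pid" 2>/dev/null
}
```

A call such as `pid_file_alive /path/to/service.pid || echo "service is not running"` (the path is a placeholder) would flag a stale or missing PID file.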

Here are some useful service log locations:

| Log file name | Location | Purpose |
| --- | --- | --- |
| vcops-watchdog.log | /usr/lib/vmware-vcops/user/log/vcops-watchdog/ | Stores log information for the Watchdog service |
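
When reading a service log such as the one above, it often helps to pull out only the most recent problem entries. The sketch below assumes that problem lines contain the literal words `ERROR` or `WARN`; that is an assumption about the log format, not something the product documentation is quoted on here. The function name `recent_errors` is hypothetical.

```shell
# Hypothetical helper: print the last N lines of a log that mention
# ERROR or WARN (N defaults to 20).
recent_errors() {
  local log_file="$1" count="${2:-20}"
  [ -r "$log_file" ] || { echo "cannot read $log_file" >&2; return 1; }
  grep -E 'ERROR|WARN' "$log_file" | tail -n "$count"
}
```

For example, `recent_errors /usr/lib/vmware-vcops/user/log/vcops-watchdog/vcops-watchdog.log 10` would show the ten most recent matching lines from the Watchdog log.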

The Collector service

The Collector service sends a heartbeat to the controller every 30 seconds. By default, the Collector will wait for 30 minutes for adapters to synchronize.
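
To make the 30-second heartbeat concrete, the sketch below counts how many heartbeats have been missed since the last one was seen. The function names and the tolerance of three missed heartbeats are purely illustrative assumptions; they are not documented vRealize Operations values.

```shell
# Hypothetical helper: number of whole heartbeat intervals that have
# elapsed between the last heartbeat and "now" (both in epoch seconds).
missed_heartbeats() {
  local last="$1" now="$2" interval="${3:-30}"
  echo $(( (now - last) / interval ))
}

# Hypothetical helper: succeeds when more than "allowed" heartbeats
# were missed (default 3, an arbitrary illustration).
collector_stale() {
  local last="$1" now="$2" allowed="${3:-3}"
  [ "$(missed_heartbeats "$last" "$now")" -gt "$allowed" ]
}
```

With a recorded timestamp in `$last_seen`, `collector_stale "$last_seen" "$(date +%s)" && echo "collector looks stale"` would report a collector that has gone quiet.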

The collector properties, including enabling or disabling Self Protection, can be configured from the collector.properties properties file located in /usr/lib/vmware-vcops/user/conf/collector.
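
Before changing a setting in collector.properties, it is useful to read its current value. The sketch below assumes the file uses the common `key = value` properties layout with `#` or `!` comment lines; the function name `get_property` is hypothetical, and no real property keys are shown because the source does not name them.

```shell
# Hypothetical helper: print the value of a key from a properties file.
# Comment lines are skipped; if the key appears more than once, the last
# occurrence wins. Fails if the file is unreadable or the key is absent.
get_property() {
  local file="$1" key="$2"
  [ -r "$file" ] || return 1
  awk -v key="$key" '
    /^[ \t]*[#!]/ { next }                  # skip comment lines
    index($0, "=") > 0 {
      k = substr($0, 1, index($0, "=") - 1)
      v = substr($0, index($0, "=") + 1)
      gsub(/^[ \t]+|[ \t]+$/, "", k)        # trim whitespace around key
      gsub(/^[ \t]+|[ \t]+$/, "", v)        # trim whitespace around value
      if (k == key) { found = 1; value = v }
    }
    END { if (found) print value; else exit 1 }
  ' "$file"
}
```

Usage would look like `get_property /usr/lib/vmware-vcops/user/conf/collector/collector.properties someKey`, where `someKey` stands in for whichever property you are inspecting.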

Here are some useful service log locations:

| Log file name | Location | Purpose |
| --- | --- | --- |
| collector.log | /storage/log/vcops/log/ | Stores log information for the Collector service |

The Controller service

The Controller service is part of the analytics engine. The controller does the decision making on where new objects should reside. The controller manages the storage and retrieval of the inventory of the objects within the system.

The Controller service is responsible for the following:

Monitoring the collector status every minute

Determining how long a deleted resource remains available in the inventory

Determining how long a non-existing resource is stored in the database

Here are some useful service file locations:

| File name | Location | Purpose |
| --- | --- | --- |
| controller.properties | /usr/lib/vmware-vcops/user/conf/controller/ | Stores properties information for the Controller service |

Databases

As we learned, vRealize Operations contains quite a few databases, all of which are of great importance for the function of the product. Let’s take a deeper look into those databases.

Cassandra DB

Currently, Cassandra DB stores the following information:

User Preferences and Config

Alerts definition

Customizations

Dashboards, Policies, and View

Reports and Licensing

Shard Maps

Activities

Cassandra stores all the information which we see in the content folder; basically, any settings which are applied globally.

You are able to log into the Cassandra database from any Analytic Node. The information is the same across nodes.

There are two ways to connect to the Cassandra database:

cqlsh is a built-in command-line tool used to query the data within Cassandra in a SQL-like fashion; cqlshrc is its configuration file.

Once you are logged on to the Cassandra DB, you can run the following commands to see information:

| Command syntax | Purpose |
| --- | --- |
| Describe tables | To list all the relations (tables) in the current database instance |
| Describe <table name> | To list the content of that particular table |
| Exit | To exit the Cassandra command line |
| select command | To select any column data from a table |
| delete command | To delete any column data from a table |

Some of the important tables in Cassandra are:

| Table name | Purpose |
| --- | --- |
| activity_2_tbl | Stores all the activities |
| tbl_2a8b303a3ed03a4ebae2700cbfae90bf | Stores the shard mapping information of an object (the table name may differ in each environment) |
| supermetric | Stores the defined super metrics |
| policy | Stores all the defined policies |
| Auth | Stores all the user details in the cluster |
| global_settings | Stores all the configured global settings |
| namespace_to_classtype | Indicates what type of data is stored in which table under Cassandra |
| symptomproblemdefinition | Stores all the defined symptoms |
| certificates | Stores all the adapter and data source certificates |
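
When peeking into one of these tables, it is safer to bound the query than to select everything. The sketch below only builds a CQL statement as text; it does not connect to Cassandra. It assumes the correct keyspace is already selected in the Cassandra command line, and the function name `cql_peek` is hypothetical.

```shell
# Hypothetical helper: print a bounded CQL SELECT for a table.
# Rejects table names with anything other than letters, digits, or
# underscores, and limits that are not plain integers.
cql_peek() {
  local table="$1" limit="${2:-10}"
  case "$table" in
    ''|*[!A-Za-z0-9_]*) echo "invalid table name" >&2; return 1 ;;
  esac
  case "$limit" in
    ''|*[!0-9]*) echo "invalid limit" >&2; return 1 ;;
  esac
  printf 'SELECT * FROM %s LIMIT %s;\n' "$table" "$limit"
}
```

For instance, `cql_peek supermetric 5` prints a statement that can be pasted into the Cassandra command line to view the first five rows of the supermetric table.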

The Cassandra.yaml file stores certain information such as the default location to save data (/storage/db/vcops/cassandra/data). The file contains information about all the nodes. When a new node joins the cluster, it refers to this file to make sure it contacts the right node (master node). It also has all the SSL certificate information.

Cassandra is started and stopped via the CaSA service, but just because CaSA is running does not mean that the Cassandra service is necessarily running.

The service command to check the status of the Cassandra DB service from the command line (SSH) is:

Regardless of which method you choose to perform the health check, if any of the nodes has over 600 MB of load, you should consult VMware Global Support Services on the next steps to take and on how to alleviate the load issues.
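
The 600 MB threshold can be checked mechanically. The sketch below assumes the load figures come from `nodetool status`, Cassandra's standard status command, whose node rows start with a two-letter state followed by the address, the load value, and its unit. The text above does not say which tool reports the load, and the column layout can vary between Cassandra versions, so treat the parsing as an assumption. The function name `nodes_over_load` is hypothetical.

```shell
# Hypothetical helper: read nodetool-status-style text on stdin and print
# the address of every node whose load exceeds the limit in MB (default 600).
nodes_over_load() {
  awk -v limit="${1:-600}" '
    $1 ~ /^[UD][NLJM]$/ {                 # node rows, e.g. "UN", "DN"
      unit = toupper(substr($4, 1, 1))    # K, M, G, T, or B(ytes)
      mb = $3 + 0
      if (unit == "K") mb = mb / 1024
      else if (unit == "G") mb = mb * 1024
      else if (unit == "T") mb = mb * 1024 * 1024
      else if (unit == "B") mb = mb / (1024 * 1024)
      else if (unit != "M") next          # unrecognised unit: skip the row
      if (mb > limit + 0) print $2
    }'
}
```

Piping the status output through it, for example `nodetool status | nodes_over_load 600` (however nodetool is invoked on your appliance), lists only the nodes that warrant a support call.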

Central (Repl DB)

The Postgres database was introduced in 6.1. It has two instances in version 6.6. The Central Postgres DB, also called repl, and the Alerts/HIS Postgres DB, also called data, are two separate database instances under the database called vcopsdb.

The central DB exists only on the master and the master replica nodes when HA is enabled. It is accessible via port 5433 and it is located in /storage/db/vcops/vpostgres/repl.

Currently, the database stores the Resources inventory.

You can connect to the central DB from the command line (SSH). Log in on the analytic node you wish to connect to and run:

su - postgres

The command should not prompt for a password if run as root.

Once logged in, connect to the database instance by running:

/opt/vmware/vpostgres/current/bin/psql -d vcopsdb -p 5433

The service command to start the central DB from the command line (SSH) is:

Alerts/HIS (Data) DB

The Alerts DB, called data, exists on all the data nodes, including the master and master replica nodes.

Like the central DB, it was introduced in 6.1. Starting from 6.2, the Historical Inventory Service xDB was merged with the Alerts DB. It is accessible via port 5432 and is located in /storage/db/vcops/vpostgres/data.

Currently, the database stores:

Alerts and alarm history

History of resource property data

History of resource relationship

You can connect to the Alerts DB from the command line (SSH). Log in on the analytic node you wish to connect to and run:

su - postgres

The command should not prompt for a password if run as root.

Once logged in, connect to the database instance by running:

/opt/vmware/vpostgres/current/bin/psql -d vcopsdb -p 5432
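
The connection commands for the central DB and the Alerts DB differ only in the port (5433 for repl, 5432 for data). A small wrapper, sketched below, makes it harder to mix them up. The function names `vpostgres_port` and `vcops_psql` and the `PSQL_BIN` variable are hypothetical conveniences, not part of the product.

```shell
# Path to psql as given in the connection commands above.
PSQL_BIN=/opt/vmware/vpostgres/current/bin/psql

# Hypothetical helper: map an instance name to its port.
vpostgres_port() {
  case "$1" in
    repl) echo 5433 ;;   # Central DB
    data) echo 5432 ;;   # Alerts/HIS DB
    *) echo "unknown instance: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: run psql against the chosen instance, passing any
# extra arguments straight through to psql.
vcops_psql() {
  local instance="$1" port
  shift
  port=$(vpostgres_port "$instance") || return 1
  "$PSQL_BIN" -d vcopsdb -p "$port" "$@"
}
```

Run as the postgres user, `vcops_psql data` opens an interactive session on the Alerts DB, and `vcops_psql repl -c '\dt'` lists the tables in the central DB using psql's standard `-c` option.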

The service command to start the Alerts DB from the command line (SSH) is:

service vpostgres start
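
Before starting a database service, it can help to check whether it is already running. The sketch below assumes the service's init script supports a `status` action that exits with zero when the service is up, which is common but not confirmed here for vpostgres. The function name `ensure_started` and the `SERVICE_CMD` override are hypothetical.

```shell
# Command used to manage services; overridable for testing.
SERVICE_CMD="${SERVICE_CMD:-service}"

# Hypothetical helper: start a service only if its status check fails.
ensure_started() {
  local name="$1"
  if "$SERVICE_CMD" "$name" status >/dev/null 2>&1; then
    echo "$name already running"
  else
    "$SERVICE_CMD" "$name" start
  fi
}
```

Calling `ensure_started vpostgres` then either reports that the Alerts DB service is up or issues the start command shown above.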

FSDB

The File System Database (FSDB) contains all raw time series metrics and super metrics data for the discovered resources.

Key facts about FSDB in vRealize Operations Manager:

FSDB is a GemFire server and runs inside the analytics JVM.

FSDB uses the Sharding Manager to distribute data for new objects between nodes. (We will discuss what vRealize Operations cluster nodes are later in this chapter.)

FSDB is available on all the nodes of a vRealize Operations cluster deployment.

It has its own properties file.

FSDB stores the time series data collected by adapters, as well as data that is generated or calculated from analysis of that data (system, super, badge, and CIQ metrics, and so on).

If you are troubleshooting FSDB performance issues, you should start from the Self Performance Details dashboard, more precisely, the FSDB Data Processing widget. We covered both of these earlier in this chapter.

You can also take a look at the metrics provided by the FSDB Metric Picker:

You can access it by navigating to the Environment tab, vRealize Operations Clusters, selecting a node, and selecting vRealize Operations Manager FSDB. Then, select the All Metrics tab.

You can check the synchronization state of the FSDB to determine the overall health of the cluster by running the following command from the command line (SSH):

| Log file name | Location | Purpose |
| --- | --- | --- |
| ShardingManager_.log | /usr/lib/vmware-vcops/user/log | Can be used to get the total time the sync took |
| fsdb-accessor-.log | /usr/lib/vmware-vcops/user/log | Provides information on FSDB database cleanup and other disk-related information |

Platform-cli

Platform-cli is a tool by which we can get information from various databases, including the GemFire cache, Cassandra, and the Alerts/HIS persistence databases.

In order to run this Python script, you need to run the following command:

We discussed how to troubleshoot some of the most important components, such as services and databases, along with failures in the upgrade process. To learn more about self-monitoring dashboards and infrastructure compliance, check out the book Mastering vRealize Operations Manager – Second Edition.