Log Insight: Fixing Disconnected Nodes

I recently heard of an issue where multiple nodes from a Log Insight cluster had become disconnected. The steps performed in an attempt to resolve the issue were to remove the nodes from the Cluster page of the Administration section, manually delete some directories from the LI virtual appliance via the CLI and attempt to re-add the nodes. In this post, I will explain why this is not supported and how to properly address such issues.

Why Unsupported?

Let’s start with the removal of nodes from the UI. While a red X exists to perform the operation, removing a node from the UI is supported only for catastrophic issues where a node is permanently lost. In this case, the nodes were not lost, so removing them was unsupported. Also note that adding a removed node back to the cluster is not supported either.

Removing Multiple Nodes

As you may recall, Log Insight provides N+1 redundancy in most cases (this recently changed in LI 3.0), meaning if you remove one node you are OK, but if you remove more than one node you are not. What happens if you remove more than one node? If you remove them within about 5 minutes of one another, you will have data loss, which may include configuration and/or user data. Which data will be lost? It is impossible to say. Note, of course, that in all cases removing a node results in all events on that node being lost.

Removing Online versus Offline Nodes

If you remove a node, wait 5 minutes, and repeat, then you should not lose any configuration and/or user data. However, this assumes the nodes you are removing are currently online and, as I just mentioned, removing online nodes is not supported. This means if you permanently lose more than one node in a cluster and remove those nodes, whether at the same time or 5 minutes apart, some amount of configuration and/or user data will be lost. If this sounds concerning to you, then you should probably check out my posts on Log Insight backup and recovery to learn how to better protect configuration and/or user data.
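The rules from the last two sections can be summarized in a small Python sketch. This is only my reading of the behavior described above, not anything Log Insight exposes: the function name removal_risks, its parameters, and the 5-minute threshold taken from "about 5 minutes" are all illustrative assumptions.

```python
def removal_risks(removed_online, minutes_between_removals=0.0):
    """Return the risks of removing nodes, per the rules described in this post.

    removed_online: one bool per removed node (True = node was online).
    minutes_between_removals: shortest gap between two removals
        (hypothetical parameter; ignored when only one node is removed).
    """
    risks = []
    if not removed_online:
        return risks

    # Removing any node always loses the events stored on that node.
    risks.append("events stored on each removed node are lost")

    # Removal is only meant for permanently lost (offline) nodes.
    if any(removed_online):
        risks.append("removing an online node is unsupported")

    if len(removed_online) > 1:
        offline_count = removed_online.count(False)
        # Several lost nodes, or removals less than ~5 minutes apart,
        # put configuration and/or user data at risk.
        if offline_count > 1 or minutes_between_removals < 5:
            risks.append("configuration and/or user data may be lost")

    return risks
```

For example, removal_risks([False, False], 10) reports possible configuration loss even though the removals were 10 minutes apart, because both nodes were already offline.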

CLI Changes

Changes made from the CLI are not supported and may not get you to the desired state! In LI version 2.5 and older, rejoining a removed node was unsupported; you would need to deploy a new virtual appliance and add it to the cluster. Starting with LI 3.0, a successfully removed node returns to its bootstrap process, meaning if you go to the UI of the removed node you will have the ability to make it a standalone instance or join/rejoin an existing cluster, though this is unsupported today.

Previously Used IP

Finally, there are concerns about adding a worker node with the IP address of a previous worker node. A LI cluster does not like seeing a completely new node using the same IP as a previously connected node. As such, new nodes must use new IP addresses when joining a LI cluster. Failure to do so may lead to undesired behavior.
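If you keep your own record of the addresses that cluster members have used, a quick pre-check before deploying a new worker could look like the Python sketch below. The function name ip_is_reusable and the idea of a hand-maintained list of previous IPs are assumptions for illustration; Log Insight does not provide this check.

```python
import ipaddress


def ip_is_reusable(candidate, previous_ips):
    """Return True if candidate was never used by an earlier cluster node.

    candidate: IP address string planned for the new worker node.
    previous_ips: IP address strings of current and former nodes
        (a list you maintain yourself; hypothetical input).
    """
    # Parse addresses so that equivalent spellings compare equal.
    new_ip = ipaddress.ip_address(candidate)
    return all(new_ip != ipaddress.ip_address(ip) for ip in previous_ips)
```

Calling ip_is_reusable("10.0.0.12", ["10.0.0.11", "10.0.0.12"]) returns False, which signals that the new node needs a different address.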

Supported Process

So what is the supported process? Well, in the case of multiple nodes being offline, you need to fix the underlying issue. Take a look at /storage/var/loginsight/runtime.log on both an online node and an offline node to determine what is going on. Do NOT remove any nodes. Contact VMware support if needed. If you must remove nodes, note that removing more than one node will result in some (random) data loss: not ingested events, but configuration and/or user data.
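To make the runtime.log comparison between an online and an offline node a little easier, here is a minimal Python sketch that counts log lines by severity. It assumes severity keywords such as ERROR, WARN and FATAL appear as whole words in each line; the exact log format is an assumption on my part, and the function names are illustrative.

```python
import re
from collections import Counter

# Assumed severity keywords; adjust to whatever your runtime.log contains.
LEVEL_RE = re.compile(r"\b(FATAL|ERROR|WARN)\b")


def summarize_log(lines):
    """Count lines per severity keyword found in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = LEVEL_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def summarize_file(path):
    """Summarize a log file, tolerating undecodable bytes."""
    with open(path, errors="replace") as handle:
        return summarize_log(handle)
```

Running summarize_file("/storage/var/loginsight/runtime.log") on both nodes and comparing the counts gives a rough idea of where to start reading; it does not replace actually reading the errors.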