SolrCloud Autoscaling Fault Tolerance

The autoscaling framework uses several strategies to ensure it can still trigger actions in the event of unexpected changes to the system.

Node Added or Lost Markers

Since triggers execute on the node that runs the Overseer, the nodeLost event would be lost if the
Overseer node itself went down, because there would be no mechanism left to generate it. Similarly,
if a node was added before the Overseer leader change completed, the nodeAdded event would not be
generated.

For this reason Solr implements additional mechanisms to ensure that these events are generated
reliably.

With standard SolrCloud behavior, when a node joins the cluster its presence is marked by an ephemeral ZooKeeper path at /live_nodes/<nodeName>. In addition, an ephemeral path is now also created at /autoscaling/nodeAdded/<nodeName>.
When a new Overseer leader instance starts, it runs the nodeAdded trigger (if one is configured),
discovers the presence of this path, removes it, and generates a nodeAdded event.
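
To illustrate the mechanism, here is a minimal sketch of how such a marker might be created with the plain ZooKeeper client. The class and method names are invented for illustration; this is not Solr's actual implementation:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class NodeAddedMarker {
        // Create the ephemeral marker a joining node leaves behind so that a
        // future Overseer leader can detect that the node was added.
        static void createMarker(ZooKeeper zk, String nodeName)
                throws KeeperException, InterruptedException {
            String path = "/autoscaling/nodeAdded/" + nodeName;
            try {
                // Ephemeral: the marker disappears automatically if this node's
                // ZooKeeper session ends, mirroring /live_nodes behavior.
                zk.create(path, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE,
                          CreateMode.EPHEMERAL);
            } catch (KeeperException.NodeExistsException e) {
                // Marker already present (e.g., node re-registration); fine.
            }
        }
    }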

When a node leaves the cluster, up to three of the remaining nodes try to create a persistent ZooKeeper path
/autoscaling/nodeLost/<nodeName>, and eventually one of them succeeds. When a new Overseer leader instance
starts, it runs the nodeLost trigger (if one is configured), discovers the presence of this path, removes it,
and generates a nodeLost event.
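
The race between the surviving nodes is harmless because ZooKeeper guarantees that at most one create() on a given path can succeed; the losers simply observe that the marker already exists. A sketch under the same assumptions as above:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class NodeLostMarker {
        // Each surviving node races to create the same marker; exactly one
        // create() wins and the rest see NodeExistsException, as intended.
        static void markNodeLost(ZooKeeper zk, String lostNodeName)
                throws KeeperException, InterruptedException {
            String path = "/autoscaling/nodeLost/" + lostNodeName;
            try {
                // PERSISTENT: unlike the nodeAdded marker, this one must survive
                // session loss, since the node it describes is already gone.
                zk.create(path, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE,
                          CreateMode.PERSISTENT);
            } catch (KeeperException.NodeExistsException e) {
                // Another survivor created the marker first; nothing to do.
            }
        }
    }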

Trigger State Checkpointing

Triggers generate events based on their internal state. If the Overseer leader goes down just as a trigger is
about to generate a new event, that event would likely be lost, because a new trigger instance
running on the new Overseer leader would start from a clean slate.

For this reason, each time a trigger is executed its internal state is persisted to ZooKeeper, and
that state is restored when the Overseer starts.
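
Conceptually, checkpointing amounts to writing the trigger's serialized state to a well-known ZooKeeper path after each run and reading it back on startup. The sketch below assumes a hypothetical /autoscaling/triggerState/<triggerName> path and an opaque byte[] state; Solr's real path layout and serialization may differ:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class TriggerCheckpoint {
        // Hypothetical checkpoint path; illustrative only.
        private static String statePath(String triggerName) {
            return "/autoscaling/triggerState/" + triggerName;
        }

        // Persist the trigger's serialized state after each execution.
        static void saveState(ZooKeeper zk, String triggerName, byte[] state)
                throws KeeperException, InterruptedException {
            String path = statePath(triggerName);
            try {
                zk.setData(path, state, -1);         // -1 matches any version
            } catch (KeeperException.NoNodeException e) {
                zk.create(path, state, ZooDefs.Ids.OPEN_ACL_UNSAFE,
                          CreateMode.PERSISTENT);    // first checkpoint
            }
        }

        // Restore the state when a trigger instance starts on a new Overseer
        // leader; returns null if no checkpoint exists yet (clean slate).
        static byte[] restoreState(ZooKeeper zk, String triggerName)
                throws KeeperException, InterruptedException {
            try {
                return zk.getData(statePath(triggerName), false, null);
            } catch (KeeperException.NoNodeException e) {
                return null;
            }
        }
    }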

Trigger Event Queues

The autoscaling framework limits the rate at which events are processed using several mechanisms.
One is a locking mechanism that prevents concurrent
processing of events; another is a single-threaded executor that runs trigger actions.
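
The effect of a single-threaded executor can be shown in a few lines of Java. This is a conceptual sketch, not Solr's actual executor setup:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class TriggerActionRunner {
        // A single-threaded executor serializes trigger actions: at most one
        // event's actions run at a time, which bounds the processing rate.
        private final ExecutorService actionExecutor =
                Executors.newSingleThreadExecutor();

        void processEvent(Runnable actions) {
            // Actions submitted while another event is being processed wait
            // in the executor's internal queue instead of running concurrently.
            actionExecutor.submit(actions);
        }
    }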

This means that processing an event may take significant time, during which the
Overseer may go down. To avoid losing events that were already generated but not yet fully
processed, events are queued before processing starts.

A separate ZooKeeper queue is created for each trigger, and events produced by a trigger are put on its
queue. When a new Overseer leader starts, it first checks
these queues and processes any events accumulated there; only then does it resume running triggers
normally. Queued events that fail processing during this "replay" stage are discarded.
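
Such a per-trigger queue can be modeled with ZooKeeper's PERSISTENT_SEQUENTIAL znodes, whose numeric suffixes preserve insertion order. The sketch below uses an invented TriggerEventQueue class and a hypothetical /autoscaling/events/<triggerName> path (and assumes that path already exists); Solr's actual queue implementation differs in detail:

    import java.util.Collections;
    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class TriggerEventQueue {
        private final ZooKeeper zk;
        private final String queuePath;  // e.g., /autoscaling/events/<triggerName>

        TriggerEventQueue(ZooKeeper zk, String triggerName) {
            this.zk = zk;
            this.queuePath = "/autoscaling/events/" + triggerName;
        }

        // Enqueue before processing starts: PERSISTENT_SEQUENTIAL appends a
        // monotonically increasing suffix, preserving event order.
        void offer(byte[] serializedEvent)
                throws KeeperException, InterruptedException {
            zk.create(queuePath + "/evt-", serializedEvent,
                      ZooDefs.Ids.OPEN_ACL_UNSAFE,
                      CreateMode.PERSISTENT_SEQUENTIAL);
        }

        // Replay on Overseer startup: process queued events in order. Events
        // whose processing fails are deleted anyway, i.e., discarded.
        void replay(EventProcessor processor)
                throws KeeperException, InterruptedException {
            List<String> children = zk.getChildren(queuePath, false);
            Collections.sort(children);  // sequential suffixes sort in order
            for (String child : children) {
                String path = queuePath + "/" + child;
                byte[] data = zk.getData(path, false, null);
                try {
                    processor.process(data);
                } catch (Exception e) {
                    // Failed replay events are not retried; fall through.
                }
                zk.delete(path, -1);
            }
        }

        interface EventProcessor {
            void process(byte[] event) throws Exception;
        }
    }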