No failure isolation: If the platform upgrade fails, all nodes are down

No ability to resume from failure—we need to restart from the beginning. This also causes the upgrade to be slow

Error prone and not friendly for developers. Every time there is a need to upgrade some metadata or update services in a specific order, developers would need to edit the upgrade code path. This is something they may not be intimately familiar with. Also, since more people end up touching this critical code path, the chance of bugs goes up

Seeing that we were frequently hitting these limitations in the field, as well as to make our updates developer friendly, we decided to overhaul the entire upgrade infrastructure.

Self-upgrading cluster

Very early on in our design process, we made a decision to have upgrades be a first-class citizen of our platform. What we mean by that is that we wanted updates to be triggered by simply sending a ClusterUpdate RPC to Orion, our cluster manager.

This was mainly motivated by our observation that the cluster manager already manages the nodes and orchestrates the services, and therefore in a great position to orchestrate the update.

At this point, I should mention that the ThoughtSpot cluster manager is a stateless service running on each node. One of these is elected as the leader through Zookeeper, and then performs the job of cluster orchestration. This leader is canonically referred to as the “cluster manager”. The relevant state of the cluster manager is persisted in Zookeeper, for quick fail-over to another server in case the current leader fails.

Fault-tolerance

Since Orion already had a fail-over mechanism built in, we are able to seamlessly transfer the ownership of executing the update workflow to the new leader. This has helped us recover from a failure of the node where the update was triggered, a failure we couldn’t have tolerated in the earlier design.

Failure recovery and resumable upgrades

During the update, the progress is checkpointed at each step and stored in zookeeper to enable quick recovery once measures have been taken to fix any underlying issue which might have caused a failure.

Extensibility

We made the framework highly extensible by providing hooks at pre-update and post-update stages to run any command or binaries. The most important benefit of this is the ability to run a sequence of checks before an upgrade in order to ensure that the cluster is healthy before we upgrade it. In the future, we can give the ability to the user to skip a failed health check if we expect the issue to be fixed by the update itself.

Developer Friendly

Instead of editing the highly-critical update code path, developers are now able to perform both metadata upgrades as well as ensure an order in which services are updated.

This can be done in a declarative way by having one-off services which perform metadata and data upgrades, and by having services specify other services as dependencies, including any one-off services.

We then perform a topological sort on the services graph, and update the services such that a service is updated before updating a dependent service.

Failure isolation and Availability

We achieved failure isolation by using a “canary.” We selectively remove one of the nodes from the cluster, and then update it. This involves updating any system packages, configuration, python packages, as well as the cluster manager service itself. If this step fails, we stop the update, but let the rest of the cluster run as if nothing has changed.

This ensures that any failures that could have happened due to updating system or python packages, or from updating the cluster manager, are limited to this canary node.

Once any found issue is fixed, we can resume updates by simply sending a ResumeUpdate RPC to the cluster manager.

Speed

Once the canary is updated, it transfers the ownership of the cluster to itself. It then orchestrates the rest of the update.

This encompasses updating all other nodes to the new “platform” as well as updating all the services. Note that both of these can proceed in parallel, enabling the application to be up even as some nodes might not have been updated.

Fault tolerance

The new upgrade framework also tolerates failures. If there’s a failure in updating a service or in running a one-off service required by another long-running service, Orion just skips updating the service graph rooted at the failed service, and updates all other services, ensuring that the application is available even if the update of a non-critical service fails.

This was not the case earlier, since there was no explicit notion of service dependencies.

Limitations

There are a few limitations of this approach. We can’t upgrade Zookeeper and HDFS, services which the cluster manager depends on. These need to be updated out-of-band.

Also, monitoring the progress of an upgrade via logs is more challenging due to the distributed nature of the upgrade.

Fortunately we have some good ideas about how to address these challenges!

Future

So what’s next? Although our platform is already highly available, we continue to reduce necessary downtime to be as close to zero as possible. And we’re working on other augmentations, like the ability to stop or roll back upgrades seamlessly.