The proposal is to create an Ansible-based project named tripleo-ha-utils that
can be consumed by the various tools we use to deploy TripleO environments,
such as tripleo-quickstart or infrared, as well as by manual deployments.

The project will initially cover three principal roles:

stonith-config: a playbook used to automate the creation of fencing
devices in the overcloud;

instance-ha: a playbook that automates the seventeen manual steps needed
to configure instance HA in the overcloud, tests the result via rally and
verifies that instance HA works appropriately;

validate-ha: a playbook that runs a series of disruptive actions in the
overcloud and verifies that it still behaves correctly by deploying a
heat-template that involves all the overcloud components.

Today the project exists outside the TripleO umbrella, and it is named
tripleo-quickstart-utils [1] (see “Alternatives” for the historical reasons for
this name). It is used internally inside promotion pipelines, and has
also been tested successfully in RDOCloud.

The base principle of the project is to give people the ability to integrate
these roles with whatever kind of test they need. For example, today we’re
using a simple bash framework to interact with the cluster (pcs commands and
other interactions), rally to test instance-ha and Ansible itself to simulate
full power outage scenarios.
The idea is to keep this pluggable approach, leaving the end user free to
choose which tools to use.

One of the aims of this project is to be backward compatible with previous
OpenStack releases. Starting from Liberty, the instance-ha and stonith-config
Ansible playbooks cover all the releases.
The same holds for HA testing, since all the tests are plugged in depending
on the release.

While evaluating alternatives, the first thing to consider is that this
project aims to be a TripleO-centric set of tools for HA, not a generic
OpenStack one.
We want tools to help the user answer questions like “Is the Galera bundle
cluster resource able to tolerate a stop followed by a start without
affecting the environment capabilities?” or “Is the environment able to
evacuate instances after being configured for Instance HA?”. And the answer we
want is YES or NO.
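
To make the YES/NO idea concrete, the following is a minimal sketch, not the
project's actual implementation. It assumes the cluster is driven through
pcs resource disable and pcs resource enable with --wait; the function names,
the galera-bundle resource name and the environment_ok probe are hypothetical
and exist only for illustration. The command runner is passed in as a
parameter to mirror the pluggable approach described earlier.

```python
import subprocess
from typing import Callable, Sequence

# Hypothetical types: a runner executes a command and returns its exit code,
# a probe reports whether the environment still works as expected.
Runner = Callable[[Sequence[str]], int]
Probe = Callable[[], bool]


def run_local(argv: Sequence[str]) -> int:
    """Run a command on the local node and return its exit code."""
    return subprocess.run(list(argv)).returncode


def stop_start_check(resource: str, run: Runner, environment_ok: Probe) -> str:
    """Stop and then start a cluster resource, answering YES or NO.

    The pcs invocations are an assumption about how the cluster is driven;
    any other tool can be plugged in through the `run` argument.
    """
    steps = [
        ["pcs", "resource", "disable", resource, "--wait"],
        ["pcs", "resource", "enable", resource, "--wait"],
    ]
    for argv in steps:
        # Any failing step means the resource did not tolerate the action.
        if run(argv) != 0:
            return "NO"
    # The resource came back; now check the environment capabilities.
    return "YES" if environment_ok() else "NO"
```

On a controller node this could be called as
stop_start_check("galera-bundle", run_local, probe), where probe is whatever
check the user plugs in, for example one that deploys a heat-template and
reports whether the deployment succeeded.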

tripleo-validations: the most logical place to put this, at least
looking at the name, would be tripleo-validations. After talking with the
folks working on it, it turned out that the tripleo-validations project is
not meant to run disruptive tests. Integrating this work there would be out
of scope.

tripleo-quickstart-extras: this project is not meant just for quickstart
(it supports infrared and “plain” environments as well). Moreover, although
we initially started there, it turned out that nobody was looking at the
patches since nobody was able to verify them. The result was a series of
reviews stuck forever.
So moving back to extras would be a step backward.

None. The good thing about this solution is that there’s no impact for anyone
unless it gets integrated into an existing project. Since this will be
an external project, it will not affect any existing component.

Due to the disruptive nature of these tests, the TripleO CI should not be
updated to include them, mostly because of timing issues.
This project should remain optional, usable by people when needed or in
specific CI environments meant to support longer-than-usual jobs.