The various OpenStack projects have an ongoing requirement to perform
some set of actions atomically, where those actions are carried out by a
distributed set of applications on a distributed set of resources, without
leaving those resources in a corrupted state because the actions were
performed on them without the traditional concept of locking.

A DLM (distributed lock manager) is one such concept/solution that can help
with (but not entirely solve) these types of common resource manipulation
patterns in distributed systems. This specification attempts to define the
problem space, to understand what each project has currently done with
regard to creating its own DLM-like entity, and to show how we can improve
the situation by coming to consensus on a common solution that benefits
everyone (developers, operators and users of OpenStack projects). Such a
consensus will also influence the future functionality and capabilities of
OpenStack at large, so we need to be especially careful, thoughtful, and
explicit here.

Building distributed systems is hard. It is especially hard when the
distributed system (and the applications [X, Y, Z...] that compose the
parts of that system) manipulates mutable resources without the ability to
do so in a conflict-free, highly available, and scalable manner (for
example, application X on machine 1 resizes volume A while application
Y on machine 2 is writing files to volume A). In local applications
(running on a single machine) these types of conflicts are typically avoided
by using primitives provided by the operating system (pthreads, for example,
or filesystem locks) or CAS-like operations provided by the processor
instruction set. In distributed systems these solutions do not work, so
alternatives have to be either invented or provided by some other service
(for example, one of the many consensus algorithms that academia has
created, such as raft or the paxos variants, or services built from these
papers/concepts, such as zookeeper, chubby, one of the many raft
implementations, or the redis redlock algorithm).

Sadly, in OpenStack this has meant that there are now multiple
implementations/inventions of such concepts (most using some variation of
database locking), each using different techniques to achieve the same goal
(conflict-free, highly available, and scalable manipulation of resources).
To make things worse, some projects desire to have this concept but have
not yet reached the point where it is needed (or they have reached that
point but have been unable to achieve consensus around an implementation
and/or direction). Overall this diversity, while nice for inventors and
people who like to explore these concepts, does not appear to be the best
solution we can provide to operators, developers inside the community,
deployers and other users of the now diverse (and ever expanding) set of
OpenStack projects.
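
To make the single-machine case above concrete, here is a minimal sketch of a read-modify-write protected by an in-process lock. The `Volume` class and `run_local_demo` helper are hypothetical names used only for illustration. The point is that the lock object exists only in one process's memory, so an application on another machine cannot see or honour it.

```python
import threading


class Volume:
    """Hypothetical in-process stand-in for a mutable resource."""

    def __init__(self, size_gb):
        self.size_gb = size_gb
        # This lock lives in this process only; other machines cannot see it.
        self._lock = threading.Lock()

    def grow(self, delta_gb):
        # The read-modify-write is atomic only with respect to other
        # threads in the same process.
        with self._lock:
            current = self.size_gb
            self.size_gb = current + delta_gb


def run_local_demo(workers=8, increments=1000):
    """Grow one volume from several threads and return its final size."""
    volume = Volume(size_gb=0)

    def work():
        for _ in range(increments):
            volume.grow(1)

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return volume.size_gb
```

Once the writers are spread across machines, something outside any single process has to play the role of `_lock`, which is the gap a DLM is meant to fill.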

Cinder needs to prevent multiple entities from manipulating the same volume
resource(s) at the same time while still being scalable and highly available.

Solution:

Currently limited to file locks and basic volume state transitions. This
limits the scalability and reliability of cinder under failure/load; work
has been under way for a while to create a solution that will fix some of
these fundamental issues.

Ironic needs to prevent multiple conductors from manipulating the same
bare-metal instances and/or nodes at the same time while still being
scalable and highly available.

Other required/implemented functionality:

Track what services are running, supporting what drivers, and rebalance
work when service state changes (service discovery and rebalancing).

Sync state of temporary agents instead of polling or heartbeats.

Solution:

Partition resources onto a hash ring to allow ownership to be scaled
out among many conductors as needed. To prevent entities in that hash ring
from manipulating a resource/node that they may both co-own, a database
lock is used to ensure single ownership. Actions taken on nodes are performed
only after the lock (shared or exclusive) has been obtained (a state machine
built using automaton also helps ensure that only valid transitions
are performed).
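
The hash-ring partitioning described above can be sketched roughly as follows. This is an illustrative consistent-hash ring, not the project's actual implementation; the `HashRing` class, the `replicas` parameter and the conductor names are all assumptions. It shows why a lock is still needed: when more than one owner is returned for a node, the co-owners must still agree on who acts.

```python
import bisect
import hashlib


class HashRing:
    """Hypothetical consistent-hash ring mapping node ids to conductors."""

    def __init__(self, conductors, replicas=32):
        # Each conductor is placed on the ring at `replicas` points so that
        # ownership is spread fairly evenly.
        self._ring = []
        for conductor in conductors:
            for i in range(replicas):
                point = self._hash("%s-%d" % (conductor, i))
                self._ring.append((point, conductor))
        self._ring.sort()
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)

    def owners(self, node_id, count=1):
        """Return up to `count` distinct conductors responsible for node_id."""
        start = bisect.bisect(self._points, self._hash(node_id))
        found = []
        # Walk clockwise from the node's position, wrapping around the ring.
        for offset in range(len(self._ring)):
            _, conductor = self._ring[(start + offset) % len(self._ring)]
            if conductor not in found:
                found.append(conductor)
                if len(found) == count:
                    break
        return found
```

For example, `HashRing(["c1", "c2", "c3"]).owners("node-42", count=2)` returns two distinct conductors; both may act on `node-42`, so each would still have to take the database lock before doing so.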

Notes:

Has logic for shared and exclusive locks and provisions for upgrading
a shared lock to an exclusive lock as needed (only one exclusive lock
on a given row/key may exist at the same time).

Reclaim/take-over lock mechanism via periodic heartbeats into the
database (reclaiming is apparently a manual and clunky process).

Etcd has been proposed @ 179965; I believe this further validates the view
that we need a consensus on a uniform solution around DLM (vs. continually
having projects implement whatever suits their fancy/flavor of the week).
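
The shared/exclusive semantics mentioned in the notes above, including the upgrade from shared to exclusive, can be sketched as below. This is an in-memory illustration only; `NodeLockTable` and `ReservationError` are hypothetical names, and a real implementation would rely on the database to make each check-and-set atomic.

```python
class ReservationError(Exception):
    """Raised when a lock cannot be granted."""


class NodeLockTable:
    """Hypothetical in-memory stand-in for the database lock rows."""

    def __init__(self):
        self._shared = {}     # node_id -> set of holders
        self._exclusive = {}  # node_id -> single holder

    def acquire_shared(self, node_id, holder):
        if node_id in self._exclusive:
            raise ReservationError("%s is exclusively locked" % node_id)
        self._shared.setdefault(node_id, set()).add(holder)

    def acquire_exclusive(self, node_id, holder):
        # Only one exclusive lock on a given key may exist at a time.
        if node_id in self._exclusive:
            raise ReservationError("%s is exclusively locked" % node_id)
        others = self._shared.get(node_id, set()) - {holder}
        if others:
            raise ReservationError("%s has other shared holders" % node_id)
        # Upgrade path: the holder's own shared lock becomes exclusive.
        self._shared.get(node_id, set()).discard(holder)
        self._exclusive[node_id] = holder

    def release(self, node_id, holder):
        if self._exclusive.get(node_id) == holder:
            del self._exclusive[node_id]
        self._shared.get(node_id, set()).discard(holder)
```

In this sketch an upgrade succeeds only when the requester is the sole remaining shared holder; that choice is an assumption made for the example.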

In Heat, multiple engines may work on the same stack (or on a stack nested
inside it). The ongoing convergence rework may change this state of the
world (so in the future the problem space might be slightly different, but
the concept of requiring locks on resources will still exist).

Solution:

Lock a stack using a database lock and disallow other engines
from working on that same stack (or on a stack inside of it if nested);
expiry/staleness is used to allow other engines to claim a potentially
lost lock after a period of time.
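
A lock with expiry/staleness, as described above, might look roughly like the following. This is a sketch under assumptions: `StackLockTable`, `ttl_seconds` and the injectable `clock` are hypothetical, and an in-memory dict stands in for the database table.

```python
import time


class StackLockTable:
    """Hypothetical in-memory stand-in for a stack lock table with expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._rows = {}  # stack_id -> (engine_id, refreshed_at)

    def try_acquire(self, stack_id, engine_id):
        """Take the lock if it is free, already ours, or stale."""
        now = self._clock()
        row = self._rows.get(stack_id)
        if row is not None:
            owner, refreshed_at = row
            if owner != engine_id and now - refreshed_at < self._ttl:
                return False  # held by a live (non-stale) owner
        self._rows[stack_id] = (engine_id, now)
        return True

    def refresh(self, stack_id, engine_id):
        """Extend the lock; fails if another engine has since claimed it."""
        row = self._rows.get(stack_id)
        if row is None or row[0] != engine_id:
            return False
        self._rows[stack_id] = (engine_id, self._clock())
        return True

    def release(self, stack_id, engine_id):
        row = self._rows.get(stack_id)
        if row is not None and row[0] == engine_id:
            del self._rows[stack_id]
            return True
        return False
```

Note that a slow but healthy engine that fails to refresh in time loses the lock here without knowing it, which is exactly the liveness ambiguity that expiry alone cannot resolve.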

Notes:

Liveness of stack lock not easy to determine? For example is an engine
just taking a long time working on a stack, has the engine had a network
partition from the database but is still operational, or has the engine
really died?

To resolve this, an oslo.messaging ping is used to determine when a lock
(or its owner) may be dead; if an engine does not respond to pings/pongs
after a period of time (and its associated database entry has expired),
then stealing is allowed to occur.

Lacks simple introspection capabilities? For example it is necessary
to examine the database or log files to determine who is trying to acquire
the lock, how long they have waited and so on.

Lock releasing may fail (which is highly undesirable; IMHO it should
never be possible to fail releasing a lock). The implementation does not
automatically release locks on application crash/disconnect/other but
instead relies on ping/pongs and database updates (each operation in this
complex ‘stealing dance’ may fail or be problematic, so the mechanism is
not especially simple).
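
The stealing rule from the notes above (the owner must be unresponsive to pings and its database entry must have expired) can be sketched as a small decision function. The names `may_steal` and `ping_owner` are hypothetical; `ping_owner` stands in for whatever RPC ping is actually sent and is assumed to return a truthy value on a pong or raise `TimeoutError` when no reply arrives.

```python
def may_steal(lock_expired, ping_owner, attempts=3):
    """Decide whether another engine may steal a stack lock.

    lock_expired: True if the owner's database entry has expired.
    ping_owner:   hypothetical callable; truthy means the owner answered,
                  TimeoutError means no reply arrived in time.
    """
    # Both conditions are required; an unexpired entry is never stolen.
    if not lock_expired:
        return False
    for _ in range(attempts):
        try:
            if ping_owner():
                return False  # owner is alive, leave the lock alone
        except TimeoutError:
            continue  # no reply this time, try again
    return True
```

Even this small function has several ways to go wrong (a pong that arrives just after the last attempt, an owner partitioned from the database but not from the message bus), which illustrates why the dance is hard to get right.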

Select a distributed lock manager (one that is open source) and integrate
it deeply into OpenStack, work with the community that owns it to resolve
any issues (or fix any bugs found), and use it for lock management
functionality and service discovery...

Select an API (likely tooz) that will be backed by capable
distributed lock manager(s), integrate it deeply into OpenStack, and
use it for lock management functionality and service discovery...

Place all functionality behind tooz (as much as possible) and let the
operator choose which implementation to use. Do note that functionality that
is not possible in all backends (for example, consul provides a DNS interface
that complements its HTTP REST interface) will not be able to be exposed
through a tooz API, so this may limit a developer's ability to implement
some features using tooz.
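
The trade-off above can be illustrated with a minimal sketch of a pluggable lock layer. This is not the tooz API; `LockDriver`, `InMemoryDriver` and `get_driver` are hypothetical names, and selecting the driver from a URL scheme is only loosely modelled on how such layers are commonly configured. The abstract interface can expose only what every backend supports, so backend-specific extras have no place in it.

```python
import abc
import threading


class LockDriver(abc.ABC):
    """Hypothetical lowest-common-denominator interface for all backends."""

    @abc.abstractmethod
    def acquire(self, name, blocking=True):
        """Return True if the named lock was acquired."""

    @abc.abstractmethod
    def release(self, name):
        """Release the named lock."""


class InMemoryDriver(LockDriver):
    """Single-process driver, useful only for tests in this sketch."""

    def __init__(self):
        self._guard = threading.Lock()
        self._locks = {}

    def acquire(self, name, blocking=True):
        # Create the per-name lock on first use, under a guard.
        with self._guard:
            lock = self._locks.setdefault(name, threading.Lock())
        return lock.acquire(blocking)

    def release(self, name):
        self._locks[name].release()


# The operator picks the backend through configuration (a URL here).
_DRIVERS = {"memory": InMemoryDriver}


def get_driver(url):
    scheme = url.split("://", 1)[0]
    try:
        return _DRIVERS[scheme]()
    except KeyError:
        raise ValueError("no driver registered for %r" % scheme)
```

Application code written against `LockDriver` works with any registered backend, but a feature that only one backend offers cannot be reached without widening the interface for all of them.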

Compliance: further details about what each tooz driver must
conform to (with regard to how it operates, what functionality it must
support, and which consistency, availability, and partition tolerance scheme
it must operate under) will be detailed at: 240645

It is expected that, as a result of 240645,
certain existing tooz drivers will be deprecated and eventually removed
after a given number of cycles (due to their inherent inability to meet the
policy constraints created by that specification) so that the quality
and consistency of their operating policy can be guaranteed (this guarantee
reduces the divergence in implementations that makes plugins that much
harder to diagnose, debug, and validate).

Note

What needs to be understood about the tooz alternative is that tooz is a
tiny layer around the solutions mentioned above, which is an admirable goal
(I guess I can say this since I helped make that library), but it does
favor pluggability over picking one solution and making it better. This is
obviously a trade-off that IMHO must not be ignored (since X solutions mean
that it becomes that much harder to diagnose and fix upstream issues,
because X-Y solutions may not have the issue in the first place); TLDR:
pluggability comes at a cost.