Performs gradual, operator-triggered software updates of supervised services

Provides a limited form of service discovery

All the features are scriptable with clean and simple Lua code fragments

It builds on top of lithos (an isolation, containerization, and supervising
service) and cantal (a sub-real-time monitoring and node discovery service).

Verwalter is a framework for long-running services. It has abstractions to
configure running 10 instances of service X, or using 7% of capacity for
service Y. The resources stay allocated until the configuration changes.
Contrast this approach with Mesos or Yarn, which have a “start task A until it
completes” abstraction. (However, verwalter can run and scale a Mesos or Yarn
cluster.)

Lithos is essentially a process supervisor. Here is the basic workflow:

Read the configuration at /etc/lithos/sandboxes

For each sandbox, read the process configuration in /etc/lithos/processes

Prepare the sandbox (a.k.a. Linux container)

Start the process and keep restarting it if it fails

Add/remove processes when the configuration changes

Lithos provides all the necessary isolation for running processes (except that
it does not handle networking at the time of writing), but it is far simpler
than docker and mesos (i.e. mesos-slave), and even than systemd.

The security model of lithos is the foundation of the security of the whole
verwalter-based cluster, so let’s take a look:

It’s expected that sandbox configs are predefined by administrators and are
not changed dynamically (whether by verwalter or any other tool)

The sandbox config constrains folders, users, and a few other things that an
application can’t escape

The command line to run in the sandbox is defined in the application’s image

All this means that verwalter can only change the following things:

The image (i.e. the version of the image) to run the command from

The name of the command to run, from a limited set of options

The number of processes to run

In other words, whatever evil ends up in verwalter’s scripts, it can’t run an
arbitrary command line on any host. So it can’t install a rootkit, steal
users’ passwords, or do any other harm beyond taking down the cluster (which
is an expected permission for a resource scheduler). This is in contrast to
docker/swarm and mesos, which allow running anything.
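
Put differently, everything a verwalter script controls for a single service
boils down to a handful of fields, roughly like the following sketch (the
field names are illustrative, not verwalter’s actual schema):

    -- Illustrative: the only knobs a verwalter script can turn per service
    local service_config = {
        image = "myapp.v2.3.1",  -- which image version to run the command from
        command = "serve",       -- one of the commands predefined in the image
        instances = 4,           -- how many processes to run
    }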

Verwalter is the final piece of the puzzle for building a fully working,
auto-rebalancing cluster.

In particular, it does the following:

Establishes the leader of the cluster (or of a subcluster in the case of split-brain)

The leader runs a model of the cluster, defined by the sysadmin and augmented
with Lua scripts, to compute the number of processes to run on each machine
(and other important pieces of configuration).

The leader delivers the configuration to every other node

At every node, the configuration is rendered into local configuration files
(most importantly /etc/lithos/processes, but other types of
configuration are supported too), and the respective processes are notified.

All nodes serve a web frontend for reviewing the configuration. The frontend
also has action buttons for common maintenance tasks like upgrading software
or removing a node from the cluster

Unlike the popular combinations of etcd + confd, consul + consul-template, or
mesos with whatever framework, verwalter can make scheduling decisions in a
split-brain scenario, even in the minority partition. Verwalter is not a
database, so having two leaders is not a problem when used wisely.

Note

Yes, you can control how small a cluster must be for the cluster model to
work, and you can configure different reactions in the majority and the
minority partition. E.g. making any decisions on a single node isolated
from 1000 other nodes is useless, but switching off an external memcache
instance in favor of running a local one may be very useful if you have a
micro-service running on just two nodes.
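
As a sketch of how such a reaction could be scripted (the state fields, the
threshold, and the full_schedule helper are illustrative assumptions, not
verwalter’s actual API):

    -- Illustrative: react differently in minority and majority partitions
    function schedule(state)
        if state.reachable_nodes < state.total_nodes / 2 then
            -- minority partition: run a local memcache instead of the external one
            return {memcache = {instances = 1}}
        else
            -- majority partition: do normal scheduling
            return full_schedule(state)
        end
    end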

As all nodes are equal, you can issue a request to any node, and you can point
a new node at any existing node of the cluster; it doesn’t matter which. All
the info will quickly propagate to the other nodes via the gossip protocol.

As illustrated in the picture, the discovery is random, but it is tuned well
to cover the whole network efficiently.

As described above, verwalter operates in one of two modes: leader and
follower. It starts as a follower and waits until it is reached by a leader.
The leader in turn discovers followers through cantal, i.e. it assumes that
every cantal instance that joins the cluster has a verwalter instance alongside it.

Note

While cantal is joining the cluster and verwalter does its own bootstrapping
and possibly leader election, lithos continues to run. This means that if
there was any configuration for lithos before a reboot of the system, or
before any maintenance of verwalter/cantal, those processes are started and
supervised. Any processes that crash are restarted, and so on.

In case you don’t want processes to start on boot, you may configure the
system to clean lithos configs on reboot (for example, by putting them on a
tmpfs filesystem). Such a configuration is occasionally useful, but we
consider the default behaviour, starting all processes that were previously
running, more useful in most cases.
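
For instance, a single /etc/fstab entry can mount a tmpfs over the processes
directory so that it comes up empty after every boot (assuming configs live in
the default /etc/lithos/processes location):

    tmpfs   /etc/lithos/processes   tmpfs   defaults   0   0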

When a verwalter follower is not reached by a leader within a predefined time
(no matter whether at startup or after it had a leader), it starts an election
process. The election process is not described in detail here because it is a
work in progress; it will be described in detail later in other parts of the
documentation.

When verwalter is elected as a leader:

1. It connects to every node and ensures that every follower knows the leader

2. After establishing connections, it gathers the configurations of all
currently running processes on every node

3. It connects to the local cantal and requests statistics for all nodes

4. Then it runs the scheduling algorithm, which produces a new configuration
for every node

5. It delivers the configuration to the respective nodes

6. Repeat from step 3 at regular intervals (~10 sec)

In fact, steps 1-3 are done simultaneously. As outlined in the
cantal documentation, cantal gathers and aggregates metrics by itself, easing
the work for verwalter.
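
Put together, the leader’s cycle can be sketched in pseudocode like this (the
function names are illustrative, not verwalter’s internal API):

    -- Illustrative summary of the leader's cycle
    connect_to_followers()                        -- step 1: reach every node
    while is_leader() do
        local configs = gather_node_configs()     -- step 2: current processes
        local metrics = fetch_cantal_stats()      -- step 3: cluster statistics
        local new_cfg = run_scheduler(configs, metrics)  -- step 4: schedule
        deliver_to_nodes(new_cfg)                 -- step 5: push configuration
        sleep(10)                                 -- step 6: repeat (~10 sec)
    end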

Note that at the moment a new leader is elected, the previous one is probably
not accessible (or there were two of them, so no shared consistent configuration
exists). So it is important to gather all current node configurations to keep
the number of reallocations/movements of processes between machines at a
minimum. It also allows having persistent processes (i.e. processes that store
data on the local filesystem or in local memory, for example, database shards).
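
For example, a scheduler can honor this by preferring current placements (a
sketch; pick_least_loaded is a hypothetical helper, and current maps node
names to process counts for one service):

    -- Illustrative: keep processes where they already run, move only the difference
    function keep_stable(current, desired_total)
        local cfg, allocated = {}, 0
        for node, count in pairs(current) do
            local keep = math.min(count, desired_total - allocated)
            cfg[node] = keep
            allocated = allocated + keep
        end
        if allocated < desired_total then
            -- place the extra processes on the least loaded node
            local node = pick_least_loaded(current)
            cfg[node] = (cfg[node] or 0) + desired_total - allocated
        end
        return cfg
    end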

Having not only the old configuration but also statistics is crucial; we can
use them for the following things:

Detect failing processes

Find out the number of requests that are processed per second

Predict trends, i.e. whether traffic is going up or down

All this info is gathered continuously and asynchronously. Nodes come and go
all the time, so it is too complex to reason about them in a reactive manner.
So, from the SysOp’s point of view, the scheduler is a pure function from a
{set of currently running processes; set of metrics} to the new
configuration. Verwalter itself does all the heavy lifting of keeping all nodes
in contact, synchronizing changes, etc.

The input to the function, in simplified human-readable form, might look like
the following sketch (the service names and metric fields are illustrative):
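
    -- A sketch of the scheduler input; service names and fields are
    -- illustrative, not verwalter's actual schema
    {
        processes = {
            node1 = {django = 3, flask = 2},
            node2 = {django = 4, background_workers = 1},
        },
        metrics = {
            django = {rps = 350},
            flask  = {rps = 120},
        },
    }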

Furthermore, we have helper utilities to actually keep the matching processes
running. So in many simple cases the scheduler may just return the number of
processes it wants to run or keep running. In simplified form, it looks like
this:

    function schedule_simple(metrics)
        local cfg = {
            django_workers = metrics.django.rps / DJANGO_WORKER_CAPACITY,
            flask_workers = metrics.flask.rps / FLASK_WORKER_CAPACITY,
        }
        local total = cfg.django_workers + cfg.flask_workers
        if total > MAX_WORKERS then
            -- not enough capacity, but do our best
            cfg = distribute_fairly(cfg)
        else
            -- have some spare capacity for background tasks
            cfg.background_workers = MAX_WORKERS - total
        end
        return cfg
    end

    make_scheduler(schedule_simple, {
        worker_grow_rate = '5 processes per second',     -- start processes quickly
        worker_decline_rate = '1 process per second',    -- but stop at a slower rate
    })

Of course, the example is oversimplified; it is only here to convey the spirit
of what scheduling might look like.

By using a proper Lua sandbox, we ensure that the function is pure (has no side
effects), so if you need some external data, it must be provided to cantal or
verwalter by implementing their API. In the Lua script, we do our best to
ensure that the function is idempotent, so we can log all the data and the
resulting configuration for post-mortem debugging.

This also allows us to make “shadow” schedulers, i.e. ones that have no real
scheduling authority but are run on every occasion. The feature might be
useful for evaluating a new scheduling algorithm before putting it in production.

The follower is much simpler. When leadership is established, it receives
configuration updates from the leader. Configuration may consist of:

1. Application name and the number of processes to run

2. Host name to IP address mappings to provide to an application

3. Arbitrary key-value pairs needed for configuring the application

4. (Parts of) the configurations of other nodes

Note that items (1), (4), and partially (3) provide the limited form of
service discovery that was declared at the start of this guide. Item (2) is
there mostly for legacy applications which do not support service discovery.
Item (4) is mostly for proxy servers that need a list of backends, rather than
having the backends discovered by host name.
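
Assembled from those four kinds of items, a delivered update might look
roughly like this (a sketch with illustrative names):

    -- Illustrative follower update combining items (1)-(4)
    {
        applications = {django = {instances = 3}},    -- (1) name + process count
        hosts = {["db.local"] = "10.0.1.12"},         -- (2) host name to IP
        settings = {max_connections = 100},           -- (3) arbitrary key-value pairs
        peers = {node2 = {django = {instances = 4}}}, -- (4) other nodes' configs
    }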

Note

We use an extremely loose definition of “legacy” here, because even in 2015
most services don’t support service discovery out of the box, and most proxies
have a list of backends in the config. I mean not just old services that are
still widely used, but also services created in recent years. That is a
problem on its own, but not the one verwalter aims to solve; verwalter is just
designed to work with both good and old-style services.

Every configuration update is applied by verwalter locally. In the simplest
form this means:

Render textual templates into temporary file(s)

Run the configuration checker for the application

Atomically move the configuration file or directory to the right place

Signal the application to reload its configuration
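
Expressed as standalone pseudocode (not something the sandboxed, pure
scheduler would run; write_file, render_template, and the app-config-check
command are hypothetical), the four steps look like this:

    -- Illustrative: apply one configuration update locally
    local tmp = "/etc/app/config.yaml.tmp"
    write_file(tmp, render_template(template, config))    -- 1. render template
    assert(os.execute("app-config-check " .. tmp))        -- 2. run config checker
    assert(os.rename(tmp, "/etc/app/config.yaml"))        -- 3. atomic move
    os.execute("pkill -HUP app")                          -- 4. signal reload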

For some applications it might be more complex. For lithos, which is the most
common configuration target for verwalter, it’s just a matter of writing a
YAML/JSON config to a temporary location and calling the lithos_switch utility.

Note

We’re still evaluating whether it’s a good idea to support plugins for
complicated configuration scenarios, or whether files are a universal
transport and you should just implement a daemon of your own if you want
something out of scope. The common case might be making API calls instead of
reloading configuration, as you might need for docker or any cloud provider.
Lua scripting at this stage is also an option being considered.

When crossing data center boundaries, things get more complicated. In
particular, verwalter assumes:

Links between data centers are an order of magnitude slower than those inside
one (normal RTT between nodes inside a datacenter is ~1 ms, whereas between
DCs even on the same continent 40 ms is an expected value, and it may
sometimes be up to 120-500 ms). In some cases traffic is expensive.

The connection between datacenters is less reliable, and when it’s down,
clients might be serviced by a single data center too. It should be possible
to configure partial degradation.

Each DC has some spare capacity of its own, so moving resources between data
centers can be more gradual.

There are few data centers (i.e. it’s normal to have 100-1000 nodes,
but almost nobody has more than a dozen DCs).

So verwalter establishes a leader inside every datacenter. Across the
cross-data-center boundary, all verwalter leaders are treated equally; they
form a full mesh of connections, and when one of them experiences peak load,
it simply requests some resources from the others.

Let’s repeat that again: because verwalter is not a database, consistency is
not important here. I.e. if some resources are provided by DC1 to DC2, and for
some reason the latter loses connectivity or has some other reason not to use
the requested resources, we just release them on a timeout by looking at the
appropriate metrics. So a dialog between data center leaders, translated into
human language, may look like the following:
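
A purely illustrative sketch of such an exchange:

    DC1: I’m at peak load on service X; please run 20 more workers for me.
    DC2: OK, starting 20 workers of service X.
    (later, DC2 sees in the metrics that the borrowed workers are idle)
    DC2: Traffic on service X is back to normal; I’m releasing those workers.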

All things here are scriptable, so your logic may, for example, only move
background tasks across data centers, or use cloud APIs to request more
virtual machines.

Note

A quick note on the last sentence: you can’t access a cloud API directly
because of sandboxing, but you may produce a configuration for some
imaginary cloud provider management daemon that includes a bigger value in
the setting for the number of virtual machines to provision.
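
A minimal sketch of that idea (the cloud_manager section and its field are
hypothetical, something your own management daemon would consume):

    -- Illustrative: the sandboxed script only emits a bigger desired value;
    -- a separate (hypothetical) daemon reads it and performs the cloud API calls
    cfg.cloud_manager = {
        desired_virtual_machines = current_vms + 5,  -- scale up under peak load
    }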