Scheduling is a key component of container orchestration: it helps us maximise a workload's availability, whilst making the best use of the resources available to run those workloads. Automated scheduling removes the need for manual deployment of services, which would otherwise be an onerous task, especially when those services need to be scaled up and down horizontally.

Sometimes, however, it's important for an operator to be able to influence where workloads are scheduled, so we'll look at the means Swarm Mode provides for shaping how its scheduler places workloads across a cluster. We'll also see what action Swarm takes with regard to deployed services when failures are detected in the cluster.

Prerequisites

In order to follow the tutorial, the following items are required:

a four-node Swarm Mode cluster, as detailed in the first tutorial of this series,

a single manager node (node-01), with three worker nodes (node-02, node-03, node-04), and

direct, command-line access to node-01, or, access to a local Docker client configured to communicate with the Docker Engine on node-01.

The most straightforward configuration can be achieved by following the first tutorial.

Service Mode

Services in Swarm Mode are an abstraction of a workload, and comprise one or more tasks, which are implemented as individual containers.

Services in Docker Swarm have a mode, which can be set to one of two types. The default mode for a service when it is created is 'replicated', which means that the service comprises a configurable number of replicated tasks. This mode is useful when services need to be horizontally scaled to cater for load, and to provide resilience.

If a service is created without a mode being specified, it defaults to the replicated mode, and is created with just a single task. It is possible, however, to set the number of replicas when the service is created. For example, if a service needs to be scaled from the outset, we would create the service using the following command executed on a manager node (node-01):
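The original command isn't reproduced here; a representative invocation, assuming an nginx image and a service named nginx, would be:

```shell
# Create a service with three replicated tasks from the outset.
# The image and service name (nginx) are assumptions for illustration.
docker service create --replicas 3 --name nginx nginx
```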

The service is created with three tasks running on three of the four nodes in the cluster. We could, of course, achieve the same result by creating the service with the single, default replica, and then use the docker service scale command to scale the service to the required number of replicas.
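For example, assuming a service named nginx created with the default single replica, it could subsequently be scaled with:

```shell
# Scale the existing service from one replica to three.
docker service scale nginx=3
```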

Whilst a replicated service allows for any number of tasks to be created for the service, a 'global' service results in a single task on every node that is configured to accept tasks (including managers). Global mode is useful where it is desirable or imperative to run a service on every node — an agent for monitoring purposes, for example.

It's necessary to use the --mode global config option when creating the service:
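Assuming the same nginx image as before, a global service can be created like so:

```shell
# Create a global service; one task is scheduled on every eligible node.
docker service create --mode global --name nginx nginx
```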

This time, each task name gets a suffix which is the ID of the node it is scheduled on (e.g. nginx.d1euoo53in1krtd4z8swkgwxo), rather than a sequential number in the case of replicated tasks. This is because each task object is associated with a specific node in the cluster. If a new node joins the cluster, new tasks are scheduled on the node for each and every service with a global mode.

Whilst we could have used --mode replicated in conjunction with --replicas 3 in the first example above, because replicated mode is the default, it wasn't necessary to use this config option. Once the mode has been set for a service, it cannot be changed to its alternative. The service will need to be removed and re-created in order to change service mode.

Scheduling Strategy

The way that tasks or containers are scheduled on a Swarm Mode cluster is governed by a scheduling strategy. Currently, Swarm Mode has a single scheduling strategy, called 'spread'. The spread strategy attempts to schedule a service task based on an assessment of the resources available on cluster nodes.

In its simplest form, this means that tasks are evenly spread across the nodes in a cluster. For example, if we create a service with three replicas, each replicated task will be scheduled on a different node:
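Assuming a service named nginx has been created with three replicas, the placement of its tasks can be inspected with:

```shell
# List the service's tasks, including the node each one is scheduled on.
docker service ps nginx
```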

The one caveat to this simplistic approach to spread-based scheduling occurs when scaling an existing service. The scheduler will, where possible, place a new task on a node that is not already running a task for the same service, irrespective of how many tasks it is running for other services. If all the cluster nodes are running at least one task for the service, the scheduler selects the node running the fewest tasks for that service, before falling back to the general assessment of all tasks running across all nodes. This is informally referred to as 'HA scheduling'.

In the real world, workloads consume resources, and when those workloads cohabit, they need to be good neighbours. Swarm Mode allows the definition of a service with a reservation of, and limit to, CPU or memory for each of its tasks. Specifying a limit with --limit-cpu or --limit-memory ensures that a service's tasks do not consume more of the specified resource than the limit defines. In contrast to limits, reserving resources for tasks has a direct bearing on where tasks are scheduled.

Let's see how reserving resources works in practice. The four nodes in our cluster have 1 GB of memory each. If the nodes you are using to follow this tutorial have more or less memory, you will need to adjust the reserved memory values appropriately. First, we'll create a service with three replicas, and reserve 900 MB of memory for each task:
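A suitable command, assuming the nginx image and a hypothetical service name of 'hogger', might be:

```shell
# Reserve 900 MB of memory for each of the three tasks.
docker service create --replicas 3 --reserve-memory 900MB --name hogger nginx
```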

The service's tasks are scheduled on three different nodes, just as we'd expect with Swarm's use of the spread scheduling strategy. Now, let's deploy another service, this time with four replicas, and reserve 200 MB of memory for each task:
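Continuing with hypothetical names, the second service might be created with:

```shell
# Reserve 200 MB of memory for each of the four tasks.
docker service create --replicas 4 --reserve-memory 200MB --name modest nginx
```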

Ordinarily, with the spread scheduling strategy, we'd expect one task to end up on node-02, and the others to end up on node-01, node-03 and node-04. However, there is not enough memory available on any of node-01, node-03 and node-04, to reserve 200 MB, and as a result, the remaining tasks are scheduled on node-02, instead.

An amount of CPU can also be reserved for tasks, and is treated in exactly the same way with regard to scheduling. Note that it is possible to specify fractions of CPU (e.g. --reserve-cpu 1.5), as the reserve is based on a calculation which involves the CFS Quota and Period.
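A CPU reservation is expressed in the same fashion; for example (the service name and replica count are assumptions):

```shell
# Reserve one and a half CPUs for each of the service's tasks.
docker service create --replicas 2 --reserve-cpu 1.5 --name worker nginx
```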

Be aware that if the scheduler is unable to allocate a service task, because insufficient resources are available on cluster nodes, the task will remain in a 'pending' state until sufficient resources become available for it to be scheduled.

Service Constraints

Whilst the scheduling aspect of orchestration removes the headache of manually deploying container workloads, sometimes it's convenient (and, sometimes, imperative) to influence where workloads are scheduled. We might want manager nodes to be excluded from consideration. We may need to ensure a stateful service is scheduled on a node where the corresponding data resides. We might want a service to make use of specialist hardware associated with a particular node, etc.

Swarm Mode uses the concept of constraints, which are applied to services, in order to influence where tasks are scheduled. A constraint is applied with the --constraint config option, which takes an expression as a value, in the form <attribute><operator><value>. Swarm Mode has a number of in-built attributes, but it's also possible to specify arbitrary attributes using labels associated with nodes.

For the purposes of demonstrating the use of constraints, we can use the in-built node.role attribute, for specifying that we only want a service to be scheduled on worker nodes:
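Assuming the nginx image once more, the constraint is supplied when the service is created:

```shell
# Schedule a task on every node whose role is 'worker'.
docker service create --mode global --constraint 'node.role==worker' --name nginx nginx
```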

We used the 'global' mode for the service, and would normally have expected a task to be scheduled on every node, including the manager node, node-01. The constraint, however, limited the deployment of the service to the workers, only. We could have achieved the same using the constraint expression node.role!=manager.
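To demonstrate an arbitrary, label-based constraint, we can first apply a label to a node, and then constrain a service to nodes bearing that label. The label key and value (storage=ssd) are assumptions for illustration:

```shell
# Imbue node-03 with an arbitrary label.
docker node update --label-add storage=ssd node-03

# Constrain the service's single task to nodes carrying that label.
docker service create --replicas 1 --constraint 'node.labels.storage==ssd' --name nginx nginx
```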

The single replica for the task has been scheduled on node-03, which has been imbued with the label associated with the constraint. Any task or tasks associated with a service that has a constraint applied, which cannot be scheduled due to the imposition of other constraints or lack of resources, will remain in a 'pending' state, until such time that it is possible for the task or tasks to be scheduled.

Scheduling Preferences

Whilst constraints provide the ability to deterministically influence the scheduling of tasks, placement preferences provide a 'soft' means of influencing scheduling. Placement preferences direct the scheduler to account for expressed preferences, but if they can't be met due to resource limitations or defined constraints, then scheduling continues according to the normal spread strategy. The placement preference scheme was born from a need to schedule tasks based on topology.

Let's schedule a service based on the cluster's nodes, and their location in (pretend) availability zones. We'll place node-01 and node-02 in zone 'a', node-03 in zone 'b', and node-04 in zone 'c'. When we specify a placement preference based on a zone-related label, the tasks for the service in question will be scheduled equally across the zones. To create the labels for the nodes:
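The labels can be applied with docker node update, and the service then created with a placement preference that spreads tasks across the zone label (the replica count of twelve is an assumption for illustration):

```shell
# Assign each node to a pretend availability zone.
docker node update --label-add zone=a node-01
docker node update --label-add zone=a node-02
docker node update --label-add zone=b node-03
docker node update --label-add zone=c node-04

# Spread twelve tasks equally across the three zones.
docker service create --replicas 12 --placement-pref 'spread=node.labels.zone' --name nginx nginx
```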

The tasks have been scheduled equally amongst the three 'zones', with node-01 and node-02 acquiring two tasks apiece, whilst node-03 and node-04 have been allocated four tasks each.

The outcome of the deployment of this service would have been very different if we had applied a resource reservation in conjunction with the placement preference. As each node is configured with 1 GB of memory, if we created the service with --reserve-memory 300MB, the placement preferences could not physically be honoured by the scheduler, and each node would be scheduled with three tasks apiece, instead.

Multiple placement preferences can be expressed for a service, using --placement-pref multiple times, with the order of the preferences being significant. For example, if two placement preferences are defined, the tasks will be spread between the nodes satisfying the first expressed preference, before being further divided according to the second preference. This allows refined placement of tasks, to effect the high availability of services.
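For example, tasks could be spread first across zones, and then across racks within each zone; the rack label here is an assumption for illustration:

```shell
# Preferences are honoured in the order they are expressed:
# first spread across zones, then across racks within each zone.
docker service create --replicas 12 \
  --placement-pref 'spread=node.labels.zone' \
  --placement-pref 'spread=node.labels.rack' \
  --name nginx nginx
```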

Rescheduling on Failure

Those who have spent time with an ops-oriented hat on can identify with the adage, "Anything that can go wrong, will go wrong". Workloads will fail. Cluster nodes, or other infrastructure components, will fail, or become unavailable for periods of time. Ensuring the continued operation of a deployed service, and the recovery to a pre-defined status quo, is an important component of orchestration.

Swarm Mode uses a declarative approach to workloads, and employs 'desired state reconciliation' in order to maintain the desired state of the cluster. If components of the cluster fail, whether they be individual tasks, or a cluster node, Swarm's reconciliation loop attempts to restore the desired state for all workloads affected.

The easiest way for us to demonstrate this is to simulate a node becoming unavailable in the cluster. We can achieve this with relative ease, by changing the 'availability' of a node in the cluster for scheduling purposes. When we issue the command docker node ls, one of the node attributes reported on is 'availability', which normally yields 'Active':
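The command is issued against the manager node:

```shell
# Report on the cluster's nodes, including their availability.
docker node ls
```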

Now, let's set the availability of node-02 to 'drain', which will take it out of the pool for scheduling purposes, and terminate the task nginx.3. It will then get rescheduled on one of the other nodes in the cluster:
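The availability is changed with docker node update, and the effect on the service's tasks can then be observed (assuming the service is named nginx):

```shell
# Remove node-02 from the scheduling pool, draining its existing tasks.
docker node update --availability drain node-02

# Observe the task history, including any rescheduled tasks.
docker service ps nginx
```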

The output from docker service ps shows the history for the task in 'slot 3': a container was shut down on node-02, and then replaced with a container running on node-03.

Conclusion

This tutorial has provided an overview of Docker Swarm Mode's scheduling capabilities. Like most projects in the open source domain, Swarmkit, the project that Docker Swarm Mode is based on, continues to evolve with each new release, and it's probable that its scheduling capabilities will be further enhanced over time. In the meantime, we've highlighted:

Swarm's default spread scheduling strategy,

How resource reservations and constraints affect scheduling,

How it's possible to influence the scheduler, using placement preferences, and

Swarm's approach to rescheduling on failure.

In the next tutorial, we'll explore how deployed services are consumed, internally and externally.

If you have any questions and/or comments, feel free to leave them in the section below.