Overview

The default OKD pod scheduler determines the placement of new pods onto nodes
within the cluster. It reads data from the pod and tries to find a node that is
a good fit based on configured policies. The scheduler is completely
independent and exists as a standalone, pluggable solution. It does not modify
the pod; it only creates a binding that ties the pod to a particular node.

Generic Scheduler

The existing generic scheduler is the default platform-provided scheduler
engine that selects a node to host the pod in a three-step operation:

Filter the Nodes

The available nodes are filtered based on the constraints or requirements
specified. This is done by running each node through the list of filter
functions called predicates.

Prioritize the Filtered List of Nodes

Each node that passes the filters is run through a series of priority functions,
each of which assigns it a score from 0 to 10, with 0 indicating a bad fit and 10
indicating a good fit to host the pod. The scheduler configuration can also
specify a weight (a positive numeric value) for each priority function. The
score from each priority function is multiplied by its weight (the default
weight for most priorities is 1), and the weighted scores from all priority
functions are added to give the total score for the node. Administrators can
use the weight to give higher importance to some priorities.

Select the Best Fit Node

The nodes are sorted based on their scores and the node with the highest score
is selected to host the pod. If multiple nodes have the same high score, then
one of them is selected at random.
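
The prioritize and select steps can be written compactly as a weighted sum followed by a maximum. The following is only a sketch of that arithmetic: the symbols, the node names node-a and node-b, and the individual scores are made up for illustration and do not come from a real cluster.

```latex
% F: set of nodes that passed the predicates
% s_i(n) \in [0, 10]: score that priority function i gives node n
% w_i > 0: weight configured for priority function i
S(n) = \sum_{i=1}^{k} w_i \, s_i(n), \qquad n^{*} = \arg\max_{n \in F} S(n)

% Hypothetical example with two priority functions, w_1 = 1 and w_2 = 2:
S(\text{node-a}) = 1 \cdot 8 + 2 \cdot 3 = 14
S(\text{node-b}) = 1 \cdot 5 + 2 \cdot 6 = 17
% node-b has the higher total, so it is selected to host the pod.
```

In this example node-a wins on the first priority alone, but the weight of 2 on the second priority is enough to tip the decision to node-b.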

Scheduler Policy

The scheduler configuration file is a JSON file that specifies the predicates and priorities the scheduler
will consider.

In the absence of the scheduler policy file, the default configuration file,
/etc/origin/master/scheduler.json, gets applied.

The predicates and priorities defined in
the scheduler configuration file completely override the default scheduler
policy. If any of the default predicates and priorities are required,
you must explicitly specify the functions in the scheduler configuration file.
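
As an illustration of the override behavior, a policy file along the following lines would leave only the listed functions active, and every default predicate and priority not named in it would be dropped. This is a minimal sketch, assuming the kind and apiVersion fields of the OKD 3.x policy format; the selection of functions is arbitrary and chosen only for the example.

```json
{
    "kind" : "Policy",
    "apiVersion" : "v1",
    "predicates" : [
        {"name" : "PodFitsResources"},
        {"name" : "PodFitsHostPorts"}
    ],
    "priorities" : [
        {"name" : "LeastRequestedPriority", "weight" : 1}
    ]
}
```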

Modifying Scheduler Policy

The scheduler policy is defined in a file on the master, named
/etc/origin/master/scheduler.json by default, unless overridden by the
kubernetesMasterConfig.schedulerConfigFile field in the master configuration
file.

Available Predicates

Predicates are rules that filter out unqualified nodes.

There are several predicates provided by default in OKD. Some of
these predicates can be customized by providing certain parameters. Multiple
predicates can be combined to provide additional filtering of nodes.

Static Predicates

These predicates do not take any configuration parameters or inputs from the
user. These are specified in the scheduler configuration using their exact
name.

Default Predicates

The default scheduler policy includes the following predicates:

NoVolumeZoneConflict checks that the volumes a pod requests
are available in the zone.

{"name" : "NoVolumeZoneConflict"}

MaxEBSVolumeCount checks the maximum number of volumes that can be attached to an AWS instance.

CheckNodeMemoryPressure checks if a pod can be scheduled on a node with a memory pressure condition.

{"name" : "CheckNodeMemoryPressure"}

CheckNodeDiskPressure checks if a pod can be scheduled on a node with a disk pressure condition.

{"name" : "CheckNodeDiskPressure"}

NoVolumeNodeConflict

{"name" : "NoVolumeNodeConflict"}

Other Supported Predicates

OKD also supports the following predicates:

CheckVolumeBinding evaluates whether a pod can fit based on the volumes it requests, for both bound and unbound PVCs.
* For PVCs that are bound, the predicate checks that the corresponding PV’s node affinity is satisfied by the given node.
* For PVCs that are unbound, the predicate searches for available PVs that can satisfy the PVC requirements and checks that
the PV node affinity is satisfied by the given node.

The predicate returns true if all bound PVCs have PVs that are compatible with the node, and if all unbound PVCs can be matched with an available and node-compatible PV.

{"name" : "CheckVolumeBinding"}

The CheckVolumeBinding predicate must be enabled in non-default schedulers.

CheckNodeCondition checks if a pod can be scheduled on a node reporting out of disk, network unavailable, or not ready conditions.

{"name" : "CheckNodeCondition"}

PodToleratesNodeNoExecuteTaints checks if a pod's tolerations can tolerate a node's NoExecute taints.

{"name" : "PodToleratesNodeNoExecuteTaints"}

CheckNodeLabelPresence checks if all of the specified labels exist on a node, regardless of their value.

{"name" : "CheckNodeLabelPresence"}

checkServiceAffinity checks that ServiceAffinity labels are homogeneous for pods that are scheduled on a node.

{"name" : "checkServiceAffinity"}

General Predicates

The following general predicates check whether non-critical predicates and essential predicates pass. Non-critical predicates are the predicates
that only non-critical pods need to pass and essential predicates are the predicates that all pods need to pass.

Non-critical general predicates

PodFitsResources determines a fit based on resource availability
(CPU, memory, GPU, and so forth). The
nodes can declare their resource capacities and then pods can specify what
resources they require. Fit is based on requested, rather than used
resources.

{"name" : "PodFitsResources"}

Essential general predicates

PodFitsHostPorts determines if a node has free ports for the requested pod ports (absence
of port conflicts).

{"name" : "PodFitsHostPorts"}

HostName determines fit based on the presence of the Host parameter
and a string match with the name of the host.

Configurable Predicates

ServiceAffinity places pods on nodes based on the service running on that pod.
Placing pods of the same service on the same or co-located nodes can lead to higher efficiency.

This predicate attempts to place pods with specific labels in their node
selector on nodes that have the same label.

If the pod does not specify the labels in its
node selector, then the first pod is placed on any node based on availability
and all subsequent pods of the service are scheduled on nodes that have the
same label values as that node.

For example, if the first pod of a service has the node selector rack and is scheduled to a node with the label region=rack,
all subsequent pods belonging to the same service are scheduled on nodes
with the same region=rack label. For more information,
see Controlling Pod Placement.

Multiple-level labels are also supported. Users can also specify all pods for a service to
be scheduled on nodes within the same region and within the same zone (under the region).
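
A ServiceAffinity predicate for the region and zone case described above might be declared as follows. This is a hedged sketch: the name RegionZoneAffinity is a hypothetical user-defined name, and the argument layout assumes the serviceAffinity form of the OKD 3.x policy format, so check it against the release you are running.

```json
{
    "name" : "RegionZoneAffinity",
    "argument" : {
        "serviceAffinity" : {
            "labels" : ["region", "zone"]
        }
    }
}
```

The entry would go in the predicates list of the scheduler configuration file.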

The labelsPresence parameter checks whether a particular node has a specific label. The labels create node groups that the
LabelPreference priority uses. Matching by label can be useful, for example, where nodes have their physical location or status defined by labels.
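
As a sketch of how labelsPresence could be used, the following predicate entry would require a node to carry a zone label, whatever its value. The name RequireZone and the zone label are hypothetical, and the argument layout assumes the labelsPresence form of the OKD 3.x policy format.

```json
{
    "name" : "RequireZone",
    "argument" : {
        "labelsPresence" : {
            "labels" : ["zone"],
            "presence" : true
        }
    }
}
```

Under the same assumption, setting presence to false would instead require that the listed labels are absent from the node.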

Available Priorities

Priorities are rules that rank remaining nodes according to preferences.

A custom set of priorities can be specified to configure the scheduler.
There are several priorities provided by default in OKD.
Other priorities can be customized by providing certain
parameters. Multiple priorities can be combined and different weights
can be given to each in order to impact the prioritization.

Static Priorities

Static priorities do not take any configuration parameters from
the user, except weight. A weight is required to be specified and cannot be 0 or negative.

These are specified in the scheduler configuration,
by default /etc/origin/master/scheduler.json.

Default Priorities

The default scheduler policy includes the following priorities. Each priority
function has a weight of 1 except NodePreferAvoidPodsPriority, which has a
weight of 10000.

SelectorSpreadPriority looks for services, replication controllers (RC),
replica sets (RS), and stateful sets that match the pod,
then finds existing pods that match those selectors.
The scheduler favors nodes that have fewer existing matching pods, and schedules the pod on the node with the smallest number of
pods that match the same selectors as the pod being scheduled.

{"name" : "SelectorSpreadPriority", "weight" : 1}

InterPodAffinityPriority computes a sum by iterating through the elements of weightedPodAffinityTerm and adding
weight to the sum if the corresponding PodAffinityTerm is satisfied for that node. The node(s) with the highest sum are the most preferred.

{"name" : "InterPodAffinityPriority", "weight" : 1}

LeastRequestedPriority favors nodes with fewer requested resources. It
calculates the percentage of memory and CPU requested by pods scheduled on the
node, and prioritizes nodes that have the highest available/remaining capacity.

{"name" : "LeastRequestedPriority", "weight" : 1}

BalancedResourceAllocation favors nodes with balanced resource usage rate.
It calculates the difference between the consumed CPU and memory as a fraction
of capacity, and prioritizes the nodes based on how close the two metrics are to
each other. This should always be used together with LeastRequestedPriority.

{"name" : "BalancedResourceAllocation", "weight" : 1}

NodePreferAvoidPodsPriority ignores pods that are owned by a controller other than a replication controller.
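
Following the pattern of the other default priorities, and using the weight of 10000 stated earlier in this section, the entry for this priority would presumably look like this:

```json
{"name" : "NodePreferAvoidPodsPriority", "weight" : 10000}
```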

TaintTolerationPriority prioritizes nodes that have fewer intolerable taints for a pod. An intolerable taint is one that has the effect PreferNoSchedule.

{"name" : "TaintTolerationPriority", "weight" : 1}

Other Priorities

OKD also supports the following priorities:

EqualPriority gives an equal weight of 1 to all nodes, if no priority
configurations are provided. We recommend using this priority only for testing environments.

{"name" : "EqualPriority", "weight" : 1}

MostRequestedPriority prioritizes nodes with most requested resources. It calculates the percentage of memory and CPU
requested by pods scheduled on the node, and prioritizes based on the maximum of the average of the fraction of requested to capacity.

ServiceSpreadingPriority spreads pods by minimizing the number of pods
belonging to the same service that are placed on the same machine.

{"name" : "ServiceSpreadingPriority", "weight" : 1}

Configurable Priorities

You can configure these priorities in the scheduler configuration,
by default /etc/origin/master/scheduler.json, to add labels that affect
how the priorities function.

The type of a priority function is identified by the argument that it takes.
Since these are configurable, multiple priorities of the same type (but
different configuration parameters) can be combined as long as their
user-defined names are different.

ServiceAntiAffinity takes a label and ensures a good spread of the pods
belonging to the same service across the group of nodes based on the label
values. It gives the same score to all nodes that have the same value for the
specified label. It gives a higher score to nodes within a group with the least
concentration of pods.
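
A ServiceAntiAffinity priority that spreads the pods of a service across zones might be declared as follows. This is a sketch only: the name ZoneSpread, the zone label, and the weight of 2 are hypothetical, and the argument layout assumes the serviceAntiAffinity form of the OKD 3.x policy format.

```json
{
    "name" : "ZoneSpread",
    "weight" : 2,
    "argument" : {
        "serviceAntiAffinity" : {
            "label" : "zone"
        }
    }
}
```

Because the type is identified by the argument, a second entry with a different name and a different label (for example, rack) could sit alongside this one in the priorities list.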

In some situations, using ServiceAntiAffinity based on custom labels does not spread pods as expected.
See this Red Hat Solution.

The labelPreference parameter gives priority based on the specified label.
If the label is present on a node, that node is given priority.
If no label is specified, priority is given to nodes that do not have a label.
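
A labelPreference priority that favors nodes carrying a rack label could be sketched like this. The name RackPreferred, the rack label, and the weight are hypothetical, and the argument layout assumes the labelPreference form of the OKD 3.x policy format.

```json
{
    "name" : "RackPreferred",
    "weight" : 2,
    "argument" : {
        "labelPreference" : {
            "label" : "rack",
            "presence" : true
        }
    }
}
```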

Infrastructure Topological Levels

Administrators can define multiple topological levels for their infrastructure
by specifying labels on nodes, such as region, zone, and rack. These label names
have no particular meaning, and administrators are free to name their
infrastructure levels anything (for example, city/building/room). Administrators
can also define any number of levels for their infrastructure topology, with
three levels usually being adequate (such as: regions → zones → racks).
Administrators can specify affinity and anti-affinity rules at each of these
levels in any combination.

Affinity

Administrators should be able to configure the scheduler to specify affinity at
any topological level, or even at multiple levels. Affinity at a particular
level indicates that all pods that belong to the same service are scheduled
onto nodes that belong to the same level. This handles any latency requirements
of applications by allowing administrators to ensure that peer pods do not end
up being too geographically separated. If no node is available within the same
affinity group to host the pod, then the pod is not scheduled.

If you need greater control over where the pods are scheduled, see
Using Node Affinity and Using Pod Affinity and Anti-affinity.
These advanced scheduling features allow administrators to specify which node
a pod can be scheduled on and to force or reject scheduling relative to other
pods.

Anti Affinity

Administrators should be able to configure the scheduler to specify
anti-affinity at any topological level, or even at multiple levels.
Anti-affinity (or 'spread') at a particular level indicates that all pods that
belong to the same service are spread across nodes that belong to that
level. This ensures that the application is well spread for high availability
purposes. The scheduler tries to balance the service pods across all
applicable nodes as evenly as possible.

If you need greater control over where the pods are scheduled, see
Using Node Affinity and Using Pod Affinity and Anti-affinity.
These advanced scheduling features allow administrators to specify which node
a pod can be scheduled on and to force or reject scheduling relative to other
pods.

Sample Policy Configurations

The configuration below specifies the default scheduler configuration, if it
were to be specified via the scheduler policy file.
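
The following sketch is assembled only from the default predicates and priorities listed earlier in this document, with the weights stated there. It assumes the kind and apiVersion fields of the OKD 3.x policy format, and the default file shipped with a given release may contain additional entries, so treat it as an approximation rather than the exact default.

```json
{
    "kind" : "Policy",
    "apiVersion" : "v1",
    "predicates" : [
        {"name" : "NoVolumeZoneConflict"},
        {"name" : "MaxEBSVolumeCount"},
        {"name" : "CheckNodeMemoryPressure"},
        {"name" : "CheckNodeDiskPressure"},
        {"name" : "NoVolumeNodeConflict"}
    ],
    "priorities" : [
        {"name" : "SelectorSpreadPriority", "weight" : 1},
        {"name" : "InterPodAffinityPriority", "weight" : 1},
        {"name" : "LeastRequestedPriority", "weight" : 1},
        {"name" : "BalancedResourceAllocation", "weight" : 1},
        {"name" : "NodePreferAvoidPodsPriority", "weight" : 10000},
        {"name" : "TaintTolerationPriority", "weight" : 1}
    ]
}
```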

In all of the sample configurations below, the list of predicates and priority
functions is truncated to include only the ones that pertain to the use case
specified. In practice, a complete/meaningful scheduler policy should include
most, if not all, of the default predicates and priorities listed above.