Overview

A node must maintain stability when available compute resources are low.
This is especially important when dealing with incompressible resources such as
memory or disk. If either resource is exhausted, the node becomes unstable.

Using configurable eviction policies, administrators can proactively monitor for and prevent situations where the node runs out of compute and memory resources.

If swap memory is enabled for a node, that node cannot detect that it is under MemoryPressure.

To take advantage of memory-based evictions, operators must disable swap.
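A minimal sketch of disabling swap on a node host (host-level commands; assumes root privileges):

```shell
# Turn off all active swap devices and files immediately
swapoff -a
```

To keep swap disabled across reboots, also remove or comment out any swap entries in the host's /etc/fstab.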

Configuring Eviction Policies

An eviction policy allows a node to fail one or more pods when the node is running low on available resources.
Failing a pod allows the node to reclaim needed resources.

An eviction policy is a combination of an eviction trigger signal with a specific eviction threshold value that is set in the node configuration file or through the command line. Evictions can be either hard, where a node takes immediate action on a pod that exceeds a threshold, or soft, where a node allows a grace period before taking action.

By using well-configured eviction policies, a node can proactively monitor for and prevent
the total starvation of a compute resource.

When the node fails a pod, it terminates all containers in the pod, and
the PodPhase is transitioned to Failed.

When detecting disk pressure, the node supports the nodefs and imagefs file system partitions.

The nodefs, or rootfs, is the file system that the node uses for local disk volumes, daemon logs, emptyDir,
and so on (for example, the file system that provides /). The rootfs contains openshift.local.volumes,
by default /var/lib/origin/openshift.local.volumes.

The imagefs is the file system that the container runtime uses for storing images and
individual container-writable layers. By default, eviction thresholds are at 85% full for imagefs. The imagefs file system depends on the runtime and,
in the case of Docker, on which storage driver you are using.

For Docker:

If you are using the devicemapper storage driver, the imagefs is the thin pool.

You can limit the read/write layer for the container by setting the --storage-opt dm.basesize flag in the Docker daemon.

$ sudo dockerd --storage-opt dm.basesize=50G

If you are using the overlay2 storage driver, the imagefs is the file system that contains /var/lib/docker/overlay2.

For CRI-O, which uses the overlay driver, the imagefs is /var/lib/containers/storage by default.

If you do not use local storage isolation (ephemeral storage) and are not using XFS quota (volumeConfig), you cannot limit local disk usage by the pod.

Using the Node Configuration to Create a Policy

To configure an eviction policy, edit the node configuration file (the /etc/origin/node/node-config.yaml file) to specify the eviction thresholds under the eviction-hard or eviction-soft parameters.
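For example, a hard eviction policy in node-config.yaml might look like the following (the threshold values shown here are illustrative, not defaults):

```yaml
kubeletArguments:
  eviction-hard:
  - memory.available<500Mi
  - nodefs.available<10%
  - nodefs.inodesFree<5%
  - imagefs.available<15%
```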

Available disk space or inodes on either the node root file system or image file system has exceeded an eviction threshold. The following disk-based eviction trigger signals are supported:

nodefs.available

nodefs.available = node.stats.fs.available

nodefs.inodesFree

nodefs.inodesFree = node.stats.fs.inodesFree

imagefs.available

imagefs.available = node.stats.runtime.imagefs.available

imagefs.inodesFree

imagefs.inodesFree = node.stats.runtime.imagefs.inodesFree

Each of the above signals supports either a literal or percentage-based value. The percentage-based value is calculated relative to the total capacity associated with each signal.

A script derives the value for memory.available from your cgroup driver using the same set of steps that the kubelet performs. The script excludes inactive file memory (that is, the number of bytes of file-backed memory on inactive LRU list) from its calculation as it assumes that inactive file memory is reclaimable under pressure.

Do not use tools like free -m, because free -m does not work in a container.
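The derivation of memory.available described above can be sketched as a small shell function. This is a hypothetical helper, not part of the kubelet; the cgroup v1 files the inputs would normally come from are noted in the comments:

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the kubelet's memory.available
# calculation. On a cgroup v1 host the inputs are:
#   capacity      : MemTotal from /proc/meminfo, in bytes
#   usage         : /sys/fs/cgroup/memory/memory.usage_in_bytes
#   inactive_file : total_inactive_file from /sys/fs/cgroup/memory/memory.stat
memory_available() {
  local capacity=$1 usage=$2 inactive_file=$3
  # The working set excludes inactive file-backed memory, which is
  # assumed to be reclaimable under pressure.
  local working_set=$((usage - inactive_file))
  if [ "$working_set" -lt 0 ]; then
    working_set=0
  fi
  echo $((capacity - working_set))
}

# Example: 10Gi capacity, 4Gi cgroup usage, 1Gi of inactive file memory
memory_available $((10 * 1024 * 1024 * 1024)) \
                 $((4 * 1024 * 1024 * 1024)) \
                 $((1 * 1024 * 1024 * 1024))   # prints 7516192768
```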

If you store volumes and logs in a dedicated file system, the node will not
monitor that file system.

As of OpenShift Container Platform 3.4, the node supports the ability to trigger eviction
decisions based on disk pressure. Operators must opt in to enable disk-based
evictions. Prior to evicting pods due to disk pressure, the node also
performs container and image garbage collection. In future releases, garbage collection will be
deprecated in favor of a pure disk-eviction-based configuration.

Understanding Eviction Thresholds

You can configure a node to specify eviction thresholds, which trigger the node
to reclaim resources, by adding thresholds to the node configuration file.

If an eviction threshold is met, independent of its associated grace period, the
node reports a condition indicating that the node is under memory or disk pressure. This prevents the scheduler from scheduling any additional pods on the node while attempts to reclaim resources are made.

The node continues to report node status updates at the frequency specified by the node-status-update-frequency argument, which
defaults to 10s (ten seconds).
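If you need to change this frequency, it can be set through kubeletArguments in node-config.yaml (the default value is shown here):

```yaml
kubeletArguments:
  node-status-update-frequency:
  - "10s"
```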

Eviction thresholds can be hard, for when the node takes immediate action when a
threshold is met, or soft, for when you allow a grace period before
reclaiming resources.

Soft eviction usage is more common when you are targeting a certain level of
utilization but can tolerate temporary spikes. We recommend
setting the soft eviction threshold so that it is reached before the hard eviction
threshold; the grace period, however, can be operator-specific. The system reservation
should also cover the soft eviction threshold.

The soft eviction threshold is an advanced feature. You should configure a hard eviction threshold before attempting to use soft eviction thresholds.

The quantity value must match the quantity representation used by
Kubernetes and can be expressed as a percentage if it ends with the % token.

For example, if an operator has a node with 10Gi of memory, and that operator wants to induce eviction if available memory falls below 1Gi, an eviction threshold for memory can be specified as either of the following:

memory.available<1Gi
memory.available<10%

The node evaluates and monitors eviction thresholds every 10 seconds; this
housekeeping interval cannot be modified.

Understanding Hard Eviction Thresholds

A hard eviction threshold has no grace period and, if observed, the node takes immediate action to reclaim the associated starved resource. If a hard eviction
threshold is met, the node kills the pod immediately with no graceful termination.

Understanding Soft Eviction Thresholds

A soft eviction threshold pairs an eviction threshold with a required administrator-specified grace period. The node does not reclaim resources associated with the eviction signal until that grace period is exceeded. If no grace period is provided in the node configuration, the node errors on startup.

In addition, if a soft eviction threshold is met, an operator can specify a maximum allowed pod termination grace period to use when evicting pods from the
node. If eviction-max-pod-grace-period is specified, the node uses the lesser value among the pod.Spec.TerminationGracePeriodSeconds and the maximum-allowed grace period. If not specified, the node kills pods immediately with no graceful termination.

For soft eviction thresholds the following flags are supported:

eviction-soft: a set of eviction thresholds (for example, memory.available<1.5Gi) that, if met over a corresponding grace period, triggers a pod eviction.

eviction-soft-grace-period: a set of eviction grace periods (for example, memory.available=1m30s) that correspond to how long a soft eviction threshold must hold before triggering a pod eviction.

eviction-max-pod-grace-period: the maximum-allowed grace period (in seconds) to use when terminating pods in response to a soft eviction threshold being met.
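Taken together, these flags might be set in node-config.yaml as follows (the threshold and grace-period values are illustrative):

```yaml
kubeletArguments:
  eviction-soft:
  - memory.available<1.5Gi
  eviction-soft-grace-period:
  - memory.available=1m30s
  eviction-max-pod-grace-period:
  - "30"
```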

Configuring the Amount of Resource for Scheduling

You can control how much of a node resource is made available for scheduling in order to allow the scheduler to fully allocate a node and to prevent
evictions.

Set system-reserved equal to the amount of resource you want available to the scheduler for deploying pods and for system-daemons.
Evictions should only occur if pods use more than their requested amount of an allocatable resource.

A node reports two values:

Capacity: How much resource is on the machine.

Allocatable: How much resource is made available for scheduling.

To configure the amount of allocatable resources:

Edit the node configuration file (the /etc/origin/node/node-config.yaml file) to add or modify the system-reserved parameter for eviction-hard or eviction-soft.
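For example, a hypothetical reservation of 1Gi of memory for system daemons would look like the following in node-config.yaml:

```yaml
kubeletArguments:
  system-reserved:
  - memory=1Gi
```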

Restart the OpenShift Container Platform service for the changes to take effect:

# systemctl restart atomic-openshift-node

Controlling Node Condition Oscillation

If a node is oscillating above and below a soft eviction threshold, but not exceeding its associated grace period, the corresponding node condition
oscillates between true and false, which can cause problems for the scheduler.

To prevent this oscillation, set the eviction-pressure-transition-period parameter to control how long the node must wait before transitioning out of a pressure condition.

Edit or add the parameter to the kubeletArguments section of the node configuration file
(the /etc/origin/node/node-config.yaml file), specifying a duration value:

kubeletArguments:
  eviction-pressure-transition-period:
  - "5m"

The node toggles the condition back to false when the node has not observed an eviction threshold being met
for the specified pressure condition for the specified period.


Use the default value (5 minutes) before doing any adjustments.
The default choice is intended to allow the system to stabilize, and to prevent the scheduler from assigning new pods to the node before it has settled.

Restart the OpenShift Container Platform services for the changes to take effect:

# systemctl restart atomic-openshift-node

Reclaiming Node-level Resources

If an eviction criterion is satisfied, the node initiates the process of reclaiming the pressured resource until the signal goes below the defined threshold. During this time, the node does not support scheduling any new pods.

The node attempts to reclaim node-level resources prior to evicting end-user pods, based on whether the host system has a dedicated imagefs configured for the
container runtime.

With Imagefs

If the host system has imagefs:

If the nodefs file system meets eviction thresholds, the node frees up disk
space in the following order:

Delete dead pods/containers

If the imagefs file system meets eviction thresholds, the node frees up disk
space in the following order:

Delete all unused images

Without Imagefs

If the host system does not have imagefs:

If the nodefs file system meets eviction thresholds, the node frees up disk
space in the following order:

Delete dead pods/containers

Delete all unused images

Understanding Pod Eviction

If an eviction threshold is met and the grace period is passed, the node initiates the process of evicting pods until the signal goes below
the defined threshold.

The node ranks pods for eviction by their quality of service, and, among those with the same quality of service, by the consumption of the starved compute resource relative to the pod’s scheduling request.

The following table lists each quality of service (QoS) level and describes how the node ranks pods at that level for eviction.

Table 2. Quality of Service Levels

Quality of Service

Description

Guaranteed

Pods are not evicted because of another pod's resource consumption unless system
daemons exceed their reservations or only Guaranteed pods remain on the node. In
that case, the node evicts the Guaranteed pod that least impacts node stability.

Burstable

Pods that consume the highest amount of the starved resource relative to their
request for that resource are failed first. If no pod has exceeded its request,
the strategy targets the largest consumer of the starved resource.

BestEffort

Pods that consume the highest amount of the starved resource are failed
first.

A Guaranteed pod is never evicted because of another pod's resource consumption unless a system daemon (such as the node process, docker, or journald) is consuming more resources than were reserved through the system-reserved or kube-reserved allocations, or unless the node has only Guaranteed pods remaining.

If the node has only Guaranteed pods remaining, the node evicts a Guaranteed pod that least impacts node stability and limits the impact of the unexpected consumption to other Guaranteed pods.

Local disk is a BestEffort resource. If necessary, the node evicts pods one at a time to reclaim disk when DiskPressure is encountered. The node ranks
pods by quality of service. If the node is responding to inode starvation, it reclaims inodes by evicting pods with the lowest quality of service first.
If the node is responding to a lack of available disk, it ranks pods within a quality of service level by the amount of local disk they consume, and evicts
the largest consumers first.

Understanding Quality of Service and Out of Memory Killer

If the node experiences a system out of memory (OOM) event before it is able to reclaim memory, the node depends on the OOM killer to respond.

The node sets an oom_score_adj value for each container based on the quality of service of the pod.

If the node is unable to reclaim memory prior to experiencing a system OOM event, the oom_killer calculates an oom_score:

% of node memory a container is using + oom_score_adj = oom_score

The node then kills the container with the highest score.

Containers with the lowest quality of service that are consuming the largest amount of memory relative to the scheduling request are failed first.

Unlike pod eviction, if a pod's container is OOM killed, the node can restart the container based on the node restart policy.

Understanding the Pod Scheduler and Out-of-Resource Conditions

The scheduler views node conditions when placing additional pods on the node. For example, if the node has an eviction threshold like the following:

eviction-hard is "memory.available<500Mi"

and available memory falls below 500Mi, the node reports the MemoryPressure condition in Node.Status.Conditions as true.

Table 4. Node Conditions and Scheduler Behavior

Node Condition

Scheduler Behavior

MemoryPressure

If a node reports this condition, the scheduler will not place BestEffort pods on that node.

DiskPressure

If a node reports this condition, the scheduler will not place any additional pods on that node.

Example Scenario

Consider the following scenario.

An operator:

has a node with a memory capacity of 10Gi;

wants to reserve 10% of memory capacity for system daemons
(kernel, node, etc.);

wants to evict pods at 95% memory utilization to reduce
thrashing and incidence of system OOM.

Implicit in this configuration is the understanding that system-reserved should include the amount of memory covered by the eviction threshold.

To reach that threshold, either some pod must be using more than its request, or the system must be using more than 1Gi.

If a node has 10 Gi of capacity, and you want to reserve 10% of that capacity for the system daemons (system-reserved), perform the following calculation:

capacity = 10 Gi
system-reserved = 10 Gi * .1 = 1 Gi

The amount of allocatable resources becomes:

allocatable = capacity - system-reserved = 9 Gi

This means by default, the scheduler will schedule pods that request 9 Gi of
memory to that node.

If you want to turn on eviction so that eviction is triggered when the node
observes that available memory falls below 10% of capacity for 30 seconds, or
immediately when it falls below 5% of capacity, you need the scheduler to see
allocatable as 8Gi. Therefore, ensure your system reservation covers the greater
of your eviction thresholds.

This configuration ensures that the scheduler does not place pods on a node that would immediately induce memory pressure and trigger eviction, assuming those pods use
less than their configured request.
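The scenario above could be expressed in node-config.yaml as follows (the values are derived from the stated assumptions: 10Gi capacity, 1Gi for daemons, soft eviction at 1Gi available for 30 seconds, hard eviction at 500Mi):

```yaml
kubeletArguments:
  system-reserved:
  - memory=2Gi            # 1Gi for daemons + 1Gi covered by the soft threshold
  eviction-soft:
  - memory.available<1Gi
  eviction-soft-grace-period:
  - memory.available=30s
  eviction-hard:
  - memory.available<500Mi
```

With system-reserved set to 2Gi, the scheduler sees allocatable memory as 8Gi, as described above.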

Recommended Practice

DaemonSets and Out of Resource Handling

If a node evicts a pod that was created by a DaemonSet, the pod will
immediately be recreated and rescheduled back to the same node, because the node
has no ability to distinguish a pod created from a DaemonSet versus any other
object.

In general, DaemonSets should not create BestEffort pods, to avoid being
identified as candidates for eviction. Instead, DaemonSets should ideally
launch Guaranteed pods.