Overview

When you perform node management operations, the CLI interacts with node objects
that are representations of actual node hosts. The master uses the information
from node objects to validate nodes with health checks.

When listing nodes, you can specify a selector (label query) to filter the results. The =, ==, and != operators are supported.

To view node usage statistics, you must have cluster-reader permission and metrics must be installed.
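
For example, to list nodes matching a label and, if metrics are available, view usage statistics (the region=infra label is illustrative):

$ oc get nodes --selector='region=infra'
$ oc adm top node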

Adding nodes

To add nodes to your existing OKD cluster, you can run an Ansible playbook that
handles installing the node components, generating the required certificates,
and other important steps. See the advanced installation method for
instructions on running the playbook directly.
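
For example, an invocation might look like the following sketch; the inventory path and playbook location are illustrative and depend on your openshift-ansible version:

$ ansible-playbook -i /path/to/inventory \
    playbooks/openshift-node/scaleup.yml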

Deleting nodes

When you delete a node using the CLI, the node object is deleted in Kubernetes,
but the pods that exist on the node itself are not deleted. Any bare pods not
backed by a replication controller are inaccessible to OKD, pods backed by
replication controllers are rescheduled to other available nodes, and local
manifest pods must be deleted manually.
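
For example, to delete a node:

$ oc delete node <node>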

Pods should now be scheduled only on the remaining nodes that are in Ready
state.

If you want to uninstall all OKD content from the node host, including all pods
and containers, continue to Uninstalling Nodes and follow the procedure using
the uninstall.yml playbook. The procedure assumes a general understanding of
the advanced installation method using Ansible.

Marking nodes as unschedulable or schedulable

By default, healthy nodes with a Ready status are marked as schedulable,
meaning that new pods can be placed on the node. Manually marking a node as
unschedulable blocks any new pods from being scheduled on the node. Existing
pods on the node are not affected.
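
For example, to mark one or more nodes as unschedulable or schedulable:

$ oc adm manage-node <node1> <node2> --schedulable=false
$ oc adm manage-node <node1> <node2> --schedulable=true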

Alternatively, instead of specifying specific node names (e.g., <node1>
<node2>), you can use the --selector=<node_selector> option to mark selected
nodes as schedulable or unschedulable.
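
For example:

$ oc adm manage-node --selector=<node_selector> --schedulable=false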

Evacuating pods on nodes

Evacuating pods allows you to migrate all or selected pods from a given node or
nodes. Nodes must first be
marked unschedulable to
perform pod evacuation.

Only pods backed by a replication controller can be evacuated; the replication
controllers create new pods on other nodes and remove the existing pods from
the specified node(s). Bare pods, meaning those not backed by a replication
controller, are unaffected by default. You can evacuate a subset of pods by
specifying a pod selector. The pod selector is based on labels, so all pods
with the specified label are evacuated.
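
For example, to evacuate all eligible pods from one or more nodes:

$ oc adm drain <node1> <node2>

Depending on your oc client version, a subset of pods can be evacuated by label, for example:

$ oc adm drain <node1> <node2> --pod-selector=<pod_selector>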

Alternatively, instead of specifying specific node names (e.g., <node1>
<node2>), you can use the --selector=<node_selector> option to evacuate pods
on selected nodes.
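
For example:

$ oc adm drain --selector=<node_selector>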

To list objects that will be migrated without actually performing the evacuation,
use the --dry-run option and set it to true:

$ oc adm drain <node1> <node2> --dry-run=true

Rebooting nodes

To reboot a node without causing an outage for applications running on the
platform, it is important to first evacuate the
pods. For pods that are made highly available by the routing tier, nothing
else needs to be done. For other pods needing storage, typically databases, it
is critical to ensure that they can remain in operation with one pod
temporarily going offline. While implementing resiliency for stateful pods
is different for each application, in all cases it is important to configure
the scheduler to use node anti-affinity to
ensure that the pods are properly spread across available nodes.

Another challenge is how to handle nodes that are running critical
infrastructure such as the router or the registry. The same node evacuation
process applies, though it is important to understand certain edge cases.

Infrastructure nodes

Infrastructure nodes are nodes that are labeled to run pieces of the
OKD environment. Currently, the easiest way to manage node reboots
is to ensure that there are at least three nodes available to run
infrastructure. The scenario below demonstrates a common mistake that can lead
to service interruptions for the applications running on OKD when
only two nodes are available.

Node A is marked unschedulable and all pods are evacuated.

The registry pod running on that node is now redeployed on node B. This means
node B is now running both registry pods.

Node B is now marked unschedulable and is evacuated.

The service exposing the two pod endpoints on node B, for a brief period of
time, loses all endpoints until they are redeployed to node A.

The same process using three infrastructure nodes does not result in a service
disruption. However, due to pod scheduling, the last node that is evacuated and
brought back into rotation is left running zero registries. The other two nodes
will run two and one registries respectively. The best solution is to rely on
pod anti-affinity. This is an alpha feature in Kubernetes that is available for
testing now, but is not yet supported for production workloads.

Using pod anti-affinity

Pod anti-affinity is slightly different than
node anti-affinity. Node anti-affinity can be
violated if there are no other suitable locations to deploy a pod. Pod
anti-affinity can be set to either required or preferred.

Using the docker-registry pod as an example, the first step in enabling
this feature is to set the scheduler.alpha.kubernetes.io/affinity annotation on
the pod.
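
The following is a minimal sketch of what this might look like in the registry's deployment configuration pod template, assuming the pod carries the docker-registry=default label; the weight and topology key shown are illustrative:

spec:
  template:
    metadata:
      labels:
        docker-registry: default
      annotations:
        scheduler.alpha.kubernetes.io/affinity: |
          {
            "podAntiAffinity": {
              "preferredDuringSchedulingIgnoredDuringExecution": [{
                "weight": 100,
                "podAffinityTerm": {
                  "labelSelector": {
                    "matchExpressions": [{
                      "key": "docker-registry",
                      "operator": "In",
                      "values": ["default"]
                    }]
                  },
                  "topologyKey": "kubernetes.io/hostname"
                }
              }]
            }
          }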

The weight field specifies a weight for a preferred rule. The node with the highest weight is preferred.

The matchExpressions entry describes the pod label that determines when the anti-affinity rule applies. Specify a key and value for the label.

The operator field represents the relationship between the label on the existing pod and the set of values in the matchExpression parameters in the specification for the new pod. It can be In, NotIn, Exists, or DoesNotExist.

scheduler.alpha.kubernetes.io/affinity is internally stored as a string
even though the contents are JSON. The above example shows how this string can
be added as an annotation to a YAML deployment configuration.

This example assumes the Docker registry pod has a label of
docker-registry=default. Pod anti-affinity can use any Kubernetes match
expression.

The last required step is to enable the MatchInterPodAffinity scheduler
predicate in /etc/origin/master/scheduler.json. With this in place, if only
two infrastructure nodes are available and one is rebooted, the Docker registry
pod is prevented from running on the other node. oc get pods reports the pod
as unready until a suitable node is available. Once a node is available and all
pods are back in ready state, the next node can be restarted.
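
For example, the predicates list in /etc/origin/master/scheduler.json might include an entry like the following; only the relevant predicate is shown, and an actual policy file also lists the other default predicates and priorities:

{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    {"name": "MatchInterPodAffinity"}
  ]
}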

Handling nodes running routers

In most cases, a pod running an OKD router will expose a host port.
The PodFitsPorts scheduler predicate ensures that no router pods using the
same port can run on the same node, and pod anti-affinity is achieved. If the
routers are relying on
IP failover
for high availability, there is nothing else that is needed. For router pods
relying on an external service such as AWS Elastic Load Balancing for high
availability, it is that service’s responsibility to react to router pod
restarts.

In rare cases, a router pod may not have a host port configured. In those cases,
it is important to follow the recommended restart
process for infrastructure nodes.

Configuring node resources

You can configure node resources by adding kubelet arguments to the node
configuration file (/etc/origin/node/node-config.yaml). Add the
kubeletArguments section and include any desired options:
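
For example, the following sketch passes two illustrative kubelet options; each value must be supplied as a list of strings:

kubeletArguments:
  image-gc-high-threshold:
  - "90"
  image-gc-low-threshold:
  - "80"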

Setting maximum pods per node

See the
Cluster
Limits page for the maximum supported limits for each version of
OKD.

In the /etc/origin/node/node-config.yaml file, two parameters control the
maximum number of pods that can be scheduled to a node: pods-per-core and
max-pods. When both options are in use, the lower of the two limits the number
of pods on a node. Exceeding these values can result in:

Increased CPU utilization on both OKD and Docker.

Slow pod scheduling.

Potential out-of-memory scenarios (depends on the amount of memory in the node).

In Kubernetes, a pod that holds a single container actually uses two
containers. The second container sets up networking before the actual
container starts. Therefore, a system running 10 pods actually has 20
containers running.

pods-per-core sets the number of pods the node can run based on the number of
processor cores on the node. For example, if pods-per-core is set to 10 on
a node with 4 processor cores, the maximum number of pods allowed on the node
will be 40.

kubeletArguments:
  pods-per-core:
  - "10"

Setting pods-per-core to 0 disables this limit.

max-pods sets the number of pods the node can run to a fixed value, regardless
of the properties of the node. Cluster Limits documents the maximum supported
values for max-pods.

kubeletArguments:
  max-pods:
  - "250"

In the above examples, the value for pods-per-core is 10 and the value for
max-pods is 250, which are also the default values. This means that unless the
node has 25 cores or more, pods-per-core will be the limiting factor by default.

Resetting Docker storage

As you download Docker images and run and delete containers, Docker does not always free up mapped disk space. As a result, over time you can run out of space on a node,
which might prevent OKD from being able to create new pods or cause pod creation to take several minutes.

For example, pods can remain in the ContainerCreating state for several minutes, and the events log shows a FailedSync event.

One solution to this problem is to reset Docker storage to remove artifacts not needed by Docker.

On the node where you want to reset Docker storage:

Run the following command to mark the node as unschedulable:

$ oc adm manage-node <node> --schedulable=false

Run the following command to shut down Docker and the atomic-openshift-node service:

$ systemctl stop docker atomic-openshift-node

Run the following command to remove the local volume directory:

$ rm -rf /var/lib/origin/openshift.local.volumes

This command clears the local image cache. As a result, images, including ose-* images, will need to be re-pulled.
This might result in slower pod start times while the image store recovers.

Remove the /var/lib/docker directory:

$ rm -rf /var/lib/docker

Run the following command to reset the Docker storage:

$ docker-storage-setup --reset

Run the following command to recreate the Docker storage:

$ docker-storage-setup

Recreate the /var/lib/docker directory:

$ mkdir /var/lib/docker

Run the following command to restart Docker and the atomic-openshift-node service:

$ systemctl start docker atomic-openshift-node

Run the following command to mark the node as schedulable:

$ oc adm manage-node <node> --schedulable=true

Changing node traffic interface

By default, DNS routes all node traffic. During node registration, the master
receives the node IP addresses from the DNS configuration, and therefore
accessing nodes via DNS is the most flexible solution for most deployments.

If your deployment is using a cloud provider, then the node gets the IP
information from the cloud provider. However, openshift-sdn attempts to
determine the IP through a variety of methods, including a DNS lookup on the
nodeName (if set), or on the system hostname (if nodeName is not set).

However, you may need to change the node traffic interface. For example,
where:

OKD is installed in a cloud provider where internal hostnames are not configured/resolvable by all hosts.

The node’s IP from the master’s perspective is not the same as the node’s IP from its own perspective.

Configuring the openshift_set_node_ip Ansible variable
forces node traffic through an interface other than the default network
interface.

To change the node traffic interface:

Set the openshift_set_node_ip Ansible variable to true.

Set the openshift_ip to the IP address for the node you want to configure.
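
For example, a node host entry in the Ansible inventory might look like the following sketch; the host name and IP address are illustrative:

[nodes]
node1.example.com openshift_set_node_ip=true openshift_ip=192.168.1.100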

Although openshift_set_node_ip can be useful as a workaround for the
cases stated in this section, it is generally not suited for production
environments. This is because the node will no longer function properly if it
receives a new IP address.