This topic contains steps to verify the overall health of the OpenShift Container Platform
cluster and its various components, and describes their intended behavior.

Knowing the verification process for the various components is the first step to
troubleshooting issues. If you experience issues, you can use the checks provided
in this section to diagnose any problems.

Checking complete environment health

To verify the end-to-end functionality of an OpenShift Container Platform cluster, build and deploy an example application.

Procedure

Create a new project named validate, as well as an example application from the cakephp-mysql-example template:

$ oc new-project validate
$ oc new-app cakephp-mysql-example

You can check the logs to follow the build:

$ oc logs -f bc/cakephp-mysql-example

Once the build is complete, two pods should be running: a database and an
application.
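You can confirm this by listing the pods in the project. For example (pod names
and suffixes are illustrative):

$ oc get pods
NAME                            READY     STATUS      RESTARTS   AGE
cakephp-mysql-example-1-build   0/1       Completed   0          4m
cakephp-mysql-example-1-gz65j   1/1       Running     0          2m
mysql-1-9767d                   1/1       Running     0          4m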

Visit the application URL. The CakePHP framework welcome page should be
visible. The URL should have the following format:
cakephp-mysql-example-validate.<app_domain>.

Once the functionality has been verified, the validate project can be
deleted:

$ oc delete project validate

All resources within the project will be deleted as well.

Creating alerts using Prometheus

You can integrate OpenShift Container Platform with Prometheus to create visuals and alerts
that help detect and diagnose environment issues before they escalate. These
issues can include a node going down, a pod consuming too much CPU or memory,
and more.

Prometheus on OpenShift Container Platform is a Technology Preview feature only.
Technology Preview features are not supported with Red Hat production service
level agreements (SLAs), might not be functionally complete, and Red Hat does
not recommend using them in production. These features provide early access to
upcoming product features, enabling customers to test functionality and provide
feedback during the development process.
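To verify that all hosts in the cluster are running, list the nodes as a
cluster-admin user. For example (hostnames, ages, and versions are
illustrative):

$ oc get nodes
NAME               STATUS                     AGE       VERSION
ocp-infra-node-1   Ready                      1d        v1.6.1+5115d708d7
ocp-infra-node-2   Ready                      1d        v1.6.1+5115d708d7
ocp-infra-node-3   Ready                      1d        v1.6.1+5115d708d7
ocp-master-1       Ready,SchedulingDisabled   1d        v1.6.1+5115d708d7
ocp-master-2       Ready,SchedulingDisabled   1d        v1.6.1+5115d708d7
ocp-master-3       Ready,SchedulingDisabled   1d        v1.6.1+5115d708d7
ocp-node-1         Ready                      1d        v1.6.1+5115d708d7
ocp-node-2         Ready                      1d        v1.6.1+5115d708d7
ocp-node-3         Ready                      1d        v1.6.1+5115d708d7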

The above cluster example consists of three master hosts, three infrastructure
node hosts, and three node hosts. All of them are running. All hosts in the
cluster should be visible in this output.

The Ready status means that master hosts can communicate with node hosts and
that the nodes are ready to run pods (excluding the nodes on which scheduling is
disabled).

Before you run etcd commands, source the etcd.conf file:

# source /etc/etcd/etcd.conf

You can check the basic etcd health status from any master instance with the
etcdctl command:

# etcdctl --cert-file=$ETCD_PEER_CERT_FILE --key-file=$ETCD_PEER_KEY_FILE \
--ca-file=/etc/etcd/ca.crt --endpoints=$ETCD_LISTEN_CLIENT_URLS cluster-health
member 59df5107484b84df is healthy: got healthy result from https://10.156.0.5:2379
member 6df7221a03f65299 is healthy: got healthy result from https://10.156.0.6:2379
member fea6dfedf3eecfa3 is healthy: got healthy result from https://10.156.0.9:2379
cluster is healthy

To get more information about the etcd hosts, including the associated master
host, list the cluster members:
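For example, using the etcd v2 member list command (member names correspond to
the health output above; IDs and addresses will differ in your cluster):

# etcdctl --cert-file=$ETCD_PEER_CERT_FILE --key-file=$ETCD_PEER_KEY_FILE \
--ca-file=/etc/etcd/ca.crt --endpoints=$ETCD_LISTEN_CLIENT_URLS member list
59df5107484b84df: name=ocp-master-1 peerURLs=https://10.156.0.5:2380 clientURLs=https://10.156.0.5:2379 isLeader=false
6df7221a03f65299: name=ocp-master-2 peerURLs=https://10.156.0.6:2380 clientURLs=https://10.156.0.6:2379 isLeader=false
fea6dfedf3eecfa3: name=ocp-master-3 peerURLs=https://10.156.0.9:2380 clientURLs=https://10.156.0.9:2379 isLeader=true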

Multiple running instances of the container registry require backend storage
that supports writes by multiple processes. If the chosen infrastructure
provider does not offer this capability, running a single instance of a
container registry is acceptable.
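To check how many registry instances are deployed, inspect the registry
deployment configuration in the default project (a sketch, assuming the default
docker-registry name):

$ oc get dc docker-registry -n default
NAME              REVISION   DESIRED   CURRENT   TRIGGERED BY
docker-registry   1          3         3         config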

If OpenShift Container Platform is using an external container registry, the internal
registry service does not need to be running.

Network connectivity

Network connectivity involves two main layers: the cluster network for
node interaction, and the software-defined network (SDN) for pod interaction.
OpenShift Container Platform supports multiple network configurations, often optimized for a
specific infrastructure provider.

Due to the complexity of networking, not all verification scenarios are covered
in this section.

Connectivity on master hosts

etcd and master hosts

Master services keep their state synchronized using the etcd key-value store.
Communication between master and etcd services is important, whether those
etcd services are collocated on master hosts, or running on hosts designated
only for the etcd service. This communication happens on TCP ports 2379 and
2380. See the Host health section for methods to check this communication.
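As a quick low-level check from a master host, you can verify that the etcd
ports accept TCP connections. This is a generic bash technique, not an
OpenShift-specific tool, and the address is illustrative:

$ timeout 1 bash -c '</dev/tcp/10.156.0.5/2379' && echo "2379 open"
$ timeout 1 bash -c '</dev/tcp/10.156.0.5/2380' && echo "2380 open"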

SkyDNS

SkyDNS provides name resolution of local services running in OpenShift Container Platform.
This service uses TCP and UDP port 8053.

To verify the name resolution:

$ dig +short docker-registry.default.svc.cluster.local
172.30.150.7

If the answer matches the cluster IP address of the docker-registry service,
SkyDNS is working correctly. You can look up the service IP to compare:
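For example (the CLUSTER-IP value should match the dig answer above):

$ oc get svc/docker-registry -n default
NAME              CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
docker-registry   172.30.150.7   <none>        5000/TCP   1d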

Both the API service and web console share the same port, usually TCP 8443
or 443, depending on the setup. This port needs to be available within the
cluster and to everyone who needs to work with the deployed environment. The
URLs under which this port is reachable may differ for internal cluster and for
external clients.

The node host listens on TCP port 10250. This port on every node must be
reachable by all master hosts, and if monitoring is deployed in the cluster,
the infrastructure nodes must have access to this port on all instances as
well. Broken communication on this port can be detected with the following
command:
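For example (hostnames other than ocp-node-w135 are illustrative):

$ oc get nodes
NAME            STATUS     AGE
ocp-node-w134   Ready      1d
ocp-node-w135   NotReady   1d
ocp-node-w136   Ready      1d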

In the output above, the node service on the ocp-node-w135 node is
not reachable by the master services, which is represented by its NotReady
status.

The last service is the router, which is responsible for routing connections
to the correct services running in the OpenShift Container Platform cluster. Routers listen
on TCP ports 80 and 443 on infrastructure nodes for ingress traffic.
Before routers can start working, DNS must be configured:
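For example, resolve the wildcard application domain (the domain is
illustrative):

$ dig +short "*.apps.example.com"
35.xx.xx.92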

The IP address, in this case 35.xx.xx.92, should be pointing to the load
balancer distributing ingress traffic to all infrastructure nodes. To verify the
functionality of the routers, check the registry service once more, but this
time from outside the cluster:
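A hedged example, assuming the registry is exposed with the default route name
and <app_domain> is your wildcard application domain:

$ curl -kv https://docker-registry-default.<app_domain>/healthz

An HTTP 200 response confirms that the load balancer, the router, and the
registry service are all functioning.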

Node instances need at least 15 GB of space for the /var directory, and at
least another 15 GB for Docker storage (/var/lib/docker in this case). Depending
on the size of the cluster and the amount of ephemeral storage desired for pods,
a separate partition should be created for
/var/lib/origin/openshift.local.volumes on the nodes.
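You can verify the available space on a node with df, for example (sizes and
devices are illustrative):

$ df -h /var /var/lib/docker
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        40G  4.2G   36G  11% /var
/dev/sdb1        25G  2.8G   23G  11% /var/lib/docker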

Persistent storage for pods should be handled outside of the instances running
the OpenShift Container Platform cluster. Persistent volumes for pods can be provisioned by
the infrastructure provider, or with the use of Container-Native Storage or
Container-Ready Storage.

Docker storage

Docker storage can be backed by one of two options: a thin-pool logical volume
with device mapper or, since Red Hat Enterprise Linux version 7.4, an overlay2
file system. The overlay2 file system is generally recommended due to its ease
of setup and increased performance.

The Docker storage disk is mounted as /var/lib/docker and formatted with the
XFS file system. Verify that Docker storage is configured to use the overlay2
file system:
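For example (a sketch; the exact options may vary by installation):

# cat /etc/sysconfig/docker-storage
DOCKER_STORAGE_OPTIONS='--storage-driver overlay2'

You can also confirm the driver that is active at runtime:

# docker info | grep 'Storage Driver'
Storage Driver: overlay2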

The API service exposes a health check, which can be queried externally with:

$ curl -k https://master.example.com/healthz
ok

Controller role verification

The OpenShift Container Platform controller service,
atomic-openshift-master-controllers.service, is available across all master
hosts. The service runs in active/passive mode, meaning it should only be
running on one master at any time.

The OpenShift Container Platform controllers execute a procedure to choose which host runs
the service. The identity of the currently active host is stored in an
annotation on a special configmap in the kube-system project.

Verify the master host running the atomic-openshift-master-controllers service as a cluster-admin user:
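A sketch, assuming the openshift-master-controllers configmap and the leader
annotation used by OpenShift Container Platform 3.x (hostname is illustrative):

$ oc get configmap openshift-master-controllers -n kube-system -o yaml
apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"master-1.example.com",...}'
  name: openshift-master-controllers
  namespace: kube-system

The holderIdentity value names the master host currently running the
controllers service.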

Verifying correct Maximum Transmission Unit (MTU) size

Verifying the maximum transmission unit (MTU) prevents a possible networking
misconfiguration that can masquerade as an SSL certificate issue.

When a packet that is larger than the MTU size is transmitted over HTTP, the
physical network router can break the packet into multiple packets to transmit
the data. However, when a packet that is larger than the MTU size is
transmitted over HTTPS, the router is forced to drop the packet.
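To check for this condition, query the registry service with verbose curl from
a node. A truncated, illustrative success looks like the following (the service
IP and cipher will vary):

$ curl -v https://docker-registry.default.svc:5000/healthz
* About to connect() to docker-registry.default.svc port 5000 (#0)
*   Trying 172.30.150.7...
* Connected to docker-registry.default.svc (172.30.150.7) port 5000 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
* SSL connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
* Server certificate:
*       subject: CN=docker-registry.default.svc
...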

The example output above shows that the MTU size in use allows the SSL
connection to complete. The attempt to connect is successful, connectivity is
established, and the connection completes with initializing the NSS with the
certpath and all the server certificate information regarding the
docker-registry.
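By contrast, when the MTU is too large, an illustrative failure stalls after
the last line shown below and eventually times out:

$ curl -v https://docker-registry.default.svc:5000/healthz
* About to connect() to docker-registry.default.svc port 5000 (#0)
*   Trying 172.30.150.7...
* Connected to docker-registry.default.svc (172.30.150.7) port 5000 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb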

The above example shows that the connection is established, but it cannot
finish initializing NSS with the certpath. The issue indicates an improper MTU
size set within the /etc/origin/node/node-config.yaml file.

To fix this issue, adjust the MTU size within the
/etc/origin/node/node-config.yaml file to a value 50 bytes smaller than the MTU
size being used by the OpenShift SDN Ethernet device. First, check the MTU of
that device:
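For example (the interface name ens4 is illustrative; use the primary network
device on your node):

$ ip link show ens4
2: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT qlen 1000
    link/ether ...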

To change the MTU size, modify the /etc/origin/node/node-config.yaml file and
set a value that is 50 bytes smaller than the output provided by the ip command.

For example, if the MTU size is set to 1500, adjust the MTU size to 1450 within
the /etc/origin/node/node-config.yaml file:

networkConfig:
  mtu: 1450

Save the changes and reboot the node:
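For example:

# reboot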

You must change the MTU size on all masters and nodes that are part of the
OpenShift Container Platform SDN. Also, the MTU size of the tun0 interface must be the same
across all nodes that are part of the cluster.

Once the node is back online, confirm the issue no longer exists by re-running
the original curl command.

$ curl -v https://docker-registry.default.svc:5000/healthz

If the timeout persists, continue to reduce the MTU size in 50-byte increments
and repeat the process.