Triaging a malfunctioning cluster

Use this guide to troubleshoot an indeterminate failure and to determine which of the more in-depth guides will help solve the issue. Knowledge of the DNS records, the certificate authority, and the topology of the master and worker nodes is typically required to perform a proper diagnosis.

If you are having problems with a single Deployment or Pod, jump to Verify the control plane. For all other issues, work through this guide.

Verify the Tectonic Console

Open Tectonic Console in a browser, and use the following list to check its status and to determine if there are any networking issues between it and the cluster. Load the address where your Console is running and match your observed behavior below:

Browser returns an error related to certificates, similar to Your connection is not secure

Your cluster installation is using a certificate that is not trusted by your computer. This is common when using a corporate certificate authority or when using an authority generated by the Tectonic Installer. Trust the certificate to continue.
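To see which authority signed the certificate the cluster is presenting, inspect it with openssl (the domain below is a placeholder; substitute your own Console address):

```shell
# print the issuer, subject, and validity window of the Console's certificate
# east-coast.example.com is a placeholder for your Console address
echo | openssl s_client -connect east-coast.example.com:443 \
    -servername east-coast.example.com 2>/dev/null \
  | openssl x509 -noout -issuer -subject -dates
```

If the issuer is your corporate CA or the CA generated by Tectonic Installer, trusting that CA in your browser or OS keychain resolves the warning.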

Verify communication to the cluster

If Tectonic Console does not load because of connectivity issues like those diagnosed above, it does not necessarily mean that all cluster services are down. Use kubectl get nodes to evaluate the connection to the Kubernetes API:

If the Controller Manager and Scheduler are running, and new pods are being started successfully, there may be a misconfiguration that is affecting the cluster, but not causing anything to crash.
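As a sketch, assuming kubectl is configured with credentials for the cluster, the following checks both node registration and the control plane pods:

```shell
# list nodes; all should report STATUS Ready
kubectl get nodes

# the Controller Manager, Scheduler, and other control plane components
# run as pods in the kube-system namespace
kubectl get pods --namespace=kube-system
```

If kubectl cannot reach the API at all, continue to the connectivity sections below before debugging individual components.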

Verify etcd cluster

The cluster appears to be functioning, but it is showing signs that the etcd cluster is not healthy. Be aware that troubleshooting and recovery differ slightly based on how the etcd cluster was launched with Tectonic Installer. Make a note of which option was selected during installation:

Bring an external etcd cluster

Provision an etcd cluster

Create a self-hosted etcd cluster

First, determine the state of etcd by looking at the logs of the API server, which is the main consumer of the etcd cluster. If more than one API server is running, pick any one to inspect.
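A minimal sketch of that inspection, assuming the API server runs as a pod in the kube-system namespace (the pod name below is a placeholder):

```shell
# find the API server pod(s)
kubectl get pods --namespace=kube-system

# scan one API server's logs for etcd connection errors
kubectl logs --namespace=kube-system <api-server-pod-name> | grep -i etcd
```

Repeated connection-refused or timeout messages mentioning etcd endpoints point to an unhealthy etcd cluster rather than an API server problem.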

Troubleshooting connectivity to cluster

Connections to your cluster depend on a chain of network technologies that vary depending on the compute platform running Tectonic. Connection through Tectonic Console and through the Kubernetes API are similar in function, but may be configured differently, and therefore may act differently in an outage.

There are two main DNS records for your cluster, formed from the cluster name (e.g. east-coast) and the domain (e.g. example.com) you provided during installation: one for Tectonic Console, and one, with an -api suffix, for the Kubernetes API.

Correctly functioning DNS is the first part of the chain. Test your DNS records with dig:

$ dig east-coast.example.com
$ dig east-coast-api.example.com

Observed behavior: ANSWER SECTION: contains one or more IP addresses
Action: DNS appears to be configured to point either to your master nodes, or to a load balancer. Continue below.

Observed behavior: Response does not contain an ANSWER SECTION:, but instead contains an AUTHORITY SECTION:
Action: DNS records do not point to any master nodes or to a load balancer. Access to the cluster cannot function without these records.
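The same check can be scripted: dig +short prints only the answer records, so empty output corresponds to a missing ANSWER SECTION: (the domains below are placeholders for your own records):

```shell
# report any cluster record that resolves to no addresses
for host in east-coast.example.com east-coast-api.example.com; do
  if [ -z "$(dig +short "$host")" ]; then
    echo "no ANSWER SECTION for $host"
  fi
done
```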

Next, test connectivity to the Console, and any other applications using Tectonic Ingress:

$ curl -I https://east-coast.example.com/

Observed behavior: Response contains HTTP/1.1 200 OK
Action: Console can be reached from your computer.

Observed behavior: Response contains curl: (35) Server aborted the SSL handshake
Action: Console can’t be reached. The load balancer does not have any healthy backends.

Next, test connectivity to the Kubernetes API:

$ curl -I https://east-coast-api.example.com/

Observed behavior: Response contains HTTP/1.1 401 Unauthorized
Action: The Kubernetes API can be reached from your computer. Your request will appear unauthorized because the authentication headers have not been submitted.

Observed behavior: Response contains curl: (35) Server aborted the SSL handshake
Action: The Kubernetes API can’t be reached. The load balancer does not have any healthy backends.

Troubleshooting Tectonic Ingress

Tectonic Ingress routes traffic to your containers from outside the cluster. It also routes traffic to Tectonic components hosted on the cluster. If DNS passed validation in Troubleshooting connectivity to cluster above, the Ingress address is available and delivering traffic to the cluster.

When Ingress is not working you will not be able to use the Console, so we will rely on other tools. First, check the response from the Ingress address in a browser or curl:

$ curl -I https://east-coast.example.com/

Observed behavior: Browser times out
Action: All of the Ingress routing pods are unavailable. Use kubectl logs to troubleshoot.

Observed behavior: Response contains curl: (52) Empty reply from server
Action: All of the Ingress routing pods are unavailable. Use kubectl logs to troubleshoot.
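To find the Ingress routing pods and view their logs, something like the following works, assuming Tectonic components run in the tectonic-system namespace (the pod name below is a placeholder):

```shell
# list Tectonic components, including the Ingress controller pods
kubectl get pods --namespace=tectonic-system

# inspect one Ingress controller pod's logs for errors
kubectl logs --namespace=tectonic-system <ingress-pod-name>
```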

Troubleshooting etcd

etcd is a distributed database that holds the state of your Tectonic cluster. Clusters are typically 3 or more members that are constantly syncing and agreeing on the state of the world. A majority of members, called a "quorum", is required to maintain proper function of the cluster.

etcd clusters will automatically go into read-only mode when quorum is not reached, in order to protect the integrity of the data. This mode allows for some degraded functionality of the cluster. To reestablish quorum, add new healthy members to your cluster, and remove any failed members.
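The arithmetic behind quorum: a majority of N members is floor(N/2) + 1, so a cluster of N members tolerates floor((N-1)/2) failures. For example:

```shell
# quorum size and failure tolerance for common etcd cluster sizes
for n in 1 3 5; do
  echo "$n members: quorum $((n / 2 + 1)), tolerates $(((n - 1) / 2)) failure(s)"
done
```

Note that an even member count adds no tolerance (4 members still requires a quorum of 3 and tolerates only 1 failure), which is why odd sizes are recommended. When etcdctl is available on a member, etcdctl cluster-health (v2 API) or etcdctl endpoint health (v3 API) reports the state of each member directly.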

Troubleshooting Identity

Tectonic Identity is the source of authentication for your cluster and is in the critical path for all new sessions using the Console, Kubernetes API, or kubectl. The failure domains document explains in detail how it is architected to reduce downtime, as it is a critical part of the cluster.

Identity will not start if there is an error in its configuration, which is the most common error. View its logs to look for errors:
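For example, assuming Identity runs as the tectonic-identity deployment in the tectonic-system namespace (names may differ in your installation):

```shell
# view the Identity pod's logs, looking for configuration errors at startup
kubectl logs --namespace=tectonic-system deployment/tectonic-identity
```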

If Identity presents a "Database Error", this typically indicates a failure of the Kubernetes control plane, which is where Identity stores its access tokens and state. This affects automatic access token refreshing, signing key rotation, and related functions.