Tools

Tutorial - Tools for debugging applications on DC/OS

IMPORTANT: Mesosphere does not support this tutorial, associated scripts, or commands, which are provided without warranty of any kind. The purpose of this tutorial is purely to demonstrate capabilities, and it may not be suited for use in a production environment. Before using a similar solution in your environment, you should adapt, validate, and test.

DC/OS GUIs

DC/OS GUI

The DC/OS GUI is a great place to start debugging as it provides quick access to:

Cluster Resource Allocation to provide an overview of available cluster resources

Task Logs to provide insight into task failures

Task Debug Information to provide information about the most recent task offers and/or why a task did not start

Figure 1. Task debug interface

Mesos GUI

The DC/OS GUI shows the majority of the information you need for debugging. However, sometimes going a step further and accessing the Mesos GUI can be helpful, especially when checking failed tasks or registered frameworks. The Mesos GUI can be accessed via https://<cluster-address>/mesos.

Figure 2. Mesos GUI

ZooKeeper GUI

Because much of the cluster and framework state is stored in ZooKeeper, it can sometimes be helpful to check this state using the ZooKeeper/Exhibitor GUI. Frameworks such as Marathon, Kafka, and Cassandra store information in ZooKeeper, so this resource can be particularly useful when debugging such frameworks. For example, a failure while uninstalling one of these frameworks can leave entries behind. If you then experience difficulties reinstalling a framework you uninstalled earlier, checking this GUI could be very helpful. You can access it via https://<cluster-address>/exhibitor.

Figure 3. ZooKeeper/Exhibitor GUI

Logs

Logs are useful for seeing events and the conditions that preceded them. Logs often include error messages that can supply helpful information about the cause of an error. As logging is an important topic in its own right, we recommend the DC/OS logging documentation for more information.

DC/OS has a number of different sources for logs. In general, these are the most helpful logs for application debugging:

Task/Application Logs

Scheduler/Marathon Logs

Mesos Agent Logs

Mesos Master Logs

System Logs

In DC/OS, there are multiple options for accessing any of these logs: the DC/OS GUI, the DC/OS CLI, or HTTP endpoints. Moreover, DC/OS rotates logs by default to prevent them from using all available disk space.

NOTE: Need a scalable way to manage and search your logs? It could be worth building an ELK stack for log aggregation and filtering.

Sometimes it can help to temporarily increase the level of detail written to a log to obtain more troubleshooting information. For most components, this can be done by accessing an endpoint. For example, to increase the log level of a Mesos agent for 5 minutes after the server receives the API call, you could follow this two-step process:

Connect to Master Node

dcos node ssh --master-proxy --leader

Raise Log Level on Mesos Agent 10.0.2.219

curl -X POST "10.0.2.219:5051/logging/toggle?level=3&duration=5mins"
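The second step can also be wrapped in a small helper. The sketch below is illustrative only: the function names are hypothetical, and it assumes the agent endpoint is reachable over plain HTTP on port 5051, as in the command above. It passes the URL to curl as a single quoted argument, because an unquoted & would otherwise be interpreted by the shell as a background operator and the duration parameter would never reach the agent.

```shell
# Hypothetical helper names; these are not part of DC/OS or Mesos.

# Build the /logging/toggle URL for a Mesos agent.
# Assumption: plain HTTP on the agent port 5051 used in this tutorial.
mesos_log_toggle_url() {
    agent_ip=$1
    level=$2
    duration=$3
    printf 'http://%s:5051/logging/toggle?level=%s&duration=%s\n' \
        "$agent_ip" "$level" "$duration"
}

# POST to the endpoint. The URL is one quoted argument, so the shell
# does not treat "&" as a background operator.
raise_agent_log_level() {
    curl -X POST "$(mesos_log_toggle_url "$1" "$2" "$3")"
}

# Example (run from a node that can reach the agent):
# raise_agent_log_level 10.0.2.219 3 5mins
```

Keeping the URL construction separate from the request makes it easy to print and check the URL before sending anything to the agent.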

Task/Application Logs

Task/application logs are often helpful in understanding the state of the problematic application. By default, application logs are written (together with execution logs) to the STDERR and STDOUT files in the task work directory. When looking at the task in the DC/OS GUI, you can view the logs as shown below.

Figure 4. Task log

You can also do the same from the DC/OS CLI:

dcos task log --follow <service-name>
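When a task writes a lot of output, it can help to narrow the log down to lines that look like errors. The following is a minimal sketch: filter_errors is a hypothetical helper, and the pattern is only an assumption about what error lines look like in your application.

```shell
# Hypothetical helper: keep only lines that look like errors.
# The match is case-insensitive; adjust the pattern to your log format.
filter_errors() {
    grep -iE 'error|exception|fatal'
}

# Example against a live cluster:
# dcos task log <service-name> | filter_errors
```

Because the helper only reads standard input, you can try it on a saved log file before piping live output through it.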

Scheduler/Marathon Logs

Marathon is DC/OS’s default scheduler when starting an application. Scheduler logs, and Marathon logs in particular, are a great source of information to help you understand why or how something was scheduled (or not) on which node. Recall that the scheduler matches tasks to available resources. Because the scheduler also receives task status updates, the log also contains detailed information about task failures.

You can retrieve and view a scheduler log about a specific service through the list of services found in the DC/OS GUI, or via the following command:

dcos service log --follow <scheduler-service-name>

Note that since Marathon is the “Init” system of DC/OS, it runs as a systemd unit (as do the other DC/OS system components). For this reason, you need the CLI command to access its logs.

Mesos Agent Logs

Mesos agent logs are helpful for understanding how an application was started by the agent and how it may have failed. You can launch the Mesos GUI by navigating to https://<cluster_name>/mesos and examining the agent logs as shown below.

Figure 5. Mesos agent interface

Alternatively, you can view the agent logs from the DC/OS CLI. First use dcos node to locate the corresponding node ID, then enter dcos node log --mesos-id=<node-id>.

Mesos Master Logs

The Mesos Master is responsible for matching available resources to the scheduler. It also forwards task status updates from the Mesos Agents to the corresponding scheduler. This makes the Mesos Master logs a great resource for understanding the overall state of the cluster.

Be aware that there are typically multiple Mesos Masters for a single cluster, so you should identify the current leading Mesos Master to get the most recent logs. In some cases it might also make sense to retrieve logs from another Mesos Master, for example when a master node has failed and you want to understand why.

You can retrieve the master logs from the Mesos GUI via <cluster-name>/mesos, from the CLI via dcos node log --leader, or for a specific master node by connecting to that master over SSH and running journalctl -u dcos-mesos-master.

System Logs

We have now covered the most important log sources in the DC/OS environment, but there are many more logs available. Every DC/OS component writes a log. As mentioned above, each DC/OS component runs as a systemd unit. You can retrieve the logs directly on a particular node by SSHing into the node and then typing journalctl -u <systemd-unit-name>. Two of the more common systemd units to consider during debugging (besides Mesos and Marathon) are docker.service and dcos-exhibitor.service.

As an example, consider the systemd unit for the Docker daemon on the Mesos agent ffc913d8-4012-4953-b693-1acc33b400ce-S0 (recall that the dcos node command retrieves the Mesos ID).
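One possible way to do this, mirroring the two-step pattern used earlier in this tutorial, is sketched below. It assumes that dcos node ssh accepts --mesos-id to select the agent and that the Docker daemon runs as the docker systemd unit. The wrapper function is hypothetical and only prints the commands, so you can review them before running them yourself.

```shell
# Hypothetical helper: print the two commands needed to read a systemd
# unit's journal on a given Mesos agent. Nothing is executed here.
print_unit_log_steps() {
    mesos_id=$1
    unit=$2
    # Step 1: SSH to the agent through the master, selected by Mesos ID.
    printf 'dcos node ssh --master-proxy --mesos-id=%s\n' "$mesos_id"
    # Step 2: on the agent, read the journal for the unit.
    printf 'journalctl -u %s\n' "$unit"
}

print_unit_log_steps ffc913d8-4012-4953-b693-1acc33b400ce-S0 docker
```

The same helper works for other units mentioned above, for example by passing dcos-exhibitor as the second argument.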

Metrics

Metrics are useful because they help identify potential issues before they become actual bugs. For example, imagine a situation in which a container uses up all of its allocated memory. If you can detect this while the container is still running and has not yet been killed, you are much more likely to be able to intervene in time.
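The idea of catching a container before it exhausts its memory can be expressed as a simple threshold check. This is only a sketch with made-up numbers: mem_near_limit is a hypothetical function, and how you obtain the used and limit values depends on the metrics source you use.

```shell
# Hypothetical check: succeed (exit status 0) when used memory is at or
# above threshold_pct percent of the limit. All values are integers, and
# used and limit must share the same unit (for example MiB).
mem_near_limit() {
    used=$1
    limit=$2
    threshold_pct=$3
    [ $((used * 100)) -ge $((limit * threshold_pct)) ]
}

# Illustrative values only: 950 MiB used of a 1024 MiB limit, 90% threshold.
if mem_near_limit 950 1024 90; then
    echo "warning: container is close to its memory limit"
fi
```

Comparing used * 100 against limit * threshold_pct avoids floating-point arithmetic, which POSIX shell arithmetic does not provide.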

One way to leverage metrics to help with debugging is to set up a dashboard. This dashboard would include the most important metrics related to the services you want to monitor. For example, you could use Prometheus and Grafana to build a metrics dashboard.

Ideally, with the dashboard configured and functioning, you can identify potential problems before they become actual bugs. Moreover, when issues do arise, this sort of dashboard can be extremely helpful in determining the cause of the bug (e.g., a cluster that has no free resources). Each link from the endpoint item listed above provides recommendations for the metrics you should monitor for that endpoint.

Interactive

Sometimes the task logs provide insufficient help. In these cases, using your favorite Linux tools (such as curl, cat, and ping) to get an interactive point of view could be a worthwhile next step.

For example, if you are using the [Universal Container Runtime (UCR)](/latest/deploying-services/containerizers/ucr/), you can use dcos task exec as follows:

dcos task exec -it <mycontainerid> bash

and be presented with an interactive bash shell inside that container.

IMPORTANT: If you alter the state of the container when using dcos task exec in the manner above, you must update the stored app-definition and restart the container from that updated app-definition. If you fail to do so, then your changes will be lost the next time the container restarts.

Alternatively, when using the Docker containerizer, you can SSH into the node in question and run docker exec to investigate the running container.

HTTP Endpoints

DC/OS has a large number of additional endpoints that could be useful for debugging:

<cluster>/mesos/master/state-summary

state-summary

The state-summary endpoint returns a JSON-encoded summary of the agents, tasks, and frameworks inside the cluster. This is especially helpful when considering allocation of resources across the cluster, as it shows you whether there are resources already reserved for a particular role (there are more details on this in one of the debugging scenarios provided below).
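A minimal sketch of querying this endpoint from a workstation follows. It assumes you are logged in through the DC/OS CLI, so that dcos config show can supply the cluster URL and authentication token. The function names are hypothetical, and your cluster may need additional curl options (for example, to trust its TLS certificate).

```shell
# Hypothetical helper: build the state-summary URL from a cluster base
# URL, tolerating a single trailing slash.
state_summary_url() {
    base=${1%/}
    printf '%s/mesos/master/state-summary\n' "$base"
}

# Fetch the summary using the URL and token stored by the DC/OS CLI.
# Assumption: the CLI is already authenticated against the cluster.
fetch_state_summary() {
    curl -s \
        -H "Authorization: token=$(dcos config show core.dcos_acs_token)" \
        "$(state_summary_url "$(dcos config show core.dcos_url)")"
}

# Example:
# fetch_state_summary > state-summary.json
```

Saving the response to a file lets you inspect the reported agents and frameworks with whatever JSON tooling you prefer.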

Other Tools

There are other debugging tools as well, both internal to DC/OS and external, such as Sysdig or Instana. These tools can be especially helpful in diagnosing issues that are not specific to DC/OS (e.g., Linux kernel or networking problems).