Scenario 1

Tutorial - Resource Allocation

IMPORTANT: Mesosphere does not support this tutorial, associated scripts, or commands, which are provided without warranty of any kind. The purpose of this tutorial is purely to demonstrate capabilities, and it may not be suited for use in a production environment. Before using a similar solution in your environment, you should adapt, validate, and test.

Scenario 1: Resource Allocation

Setup

Check the application status in the DC/OS GUI. You should see something like the following:

Figure 1. DC/OS GUI showing app status

The application status is most likely “Waiting”, followed by a count of the form “x/1000”. “Waiting” is the overall application status, and “x” is the number of instances that have deployed successfully (6 in this example).

So now we know that some (6/1000) instances of the application have successfully deployed, but the overall deployment status is “Waiting”. But what does this mean?

Resolution

The “Waiting” state means that DC/OS (or, more precisely, Marathon) is waiting for a suitable resource offer. So it seems to be a deployment issue, and we should start by checking the available resources.

If we look at the DC/OS dashboard, we should see a pretty high CPU allocation similar to the following (of course, the exact percentage depends on your cluster):

Figure 2. DC/OS resource allocation display

Since we are not yet at 100% allocation, but we are still waiting to deploy, something interesting is going on. So let’s look at the recent resource offers in the debug view of the DC/OS GUI.

Figure 3. Recent resource offers

We can see that there are no matching CPU resources, even though the overall CPU allocation is only at 75%. More puzzling still, the ‘Details’ section further below shows that the latest offers, which come from different hosts, match almost all of our application's resource requirements. For example, the first offer, from host 10.0.0.96, matched the role, constraint (none are present in this app definition), memory, disk, and port requirements, but failed the CPU requirement. The offer before it also looked like it should have met the resource requirements. So although it looks like we have enough CPU resources available, the deployment seems to be failing for exactly this reason.
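To make this per-requirement matching concrete, here is a minimal Python sketch of how a single offer could be checked against an application's requirements. The `Offer` and `Requirement` classes, the `match_offer` function, and all numeric values are hypothetical and only illustrate the idea; this is not Marathon's actual matching code. The host address is the one from the example above.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    """Simplified, hypothetical view of one resource offer."""
    host: str
    role: str
    cpus: float
    mem: float
    disk: float


@dataclass
class Requirement:
    """Simplified, hypothetical view of what one app instance needs."""
    role: str
    cpus: float
    mem: float
    disk: float


def match_offer(offer: Offer, req: Requirement) -> dict:
    """Return a per-requirement verdict for a single offer."""
    return {
        "role": offer.role == req.role,
        "cpus": offer.cpus >= req.cpus,
        "mem": offer.mem >= req.mem,
        "disk": offer.disk >= req.disk,
    }


# Assumed values: the agent has plenty of memory and disk in role '*',
# but no CPU left in that role.
offer = Offer(host="10.0.0.96", role="*", cpus=0.0, mem=2048.0, disk=10000.0)
req = Requirement(role="*", cpus=0.1, mem=32.0, disk=0.0)

verdict = match_offer(offer, req)
usable = all(verdict.values())
print(verdict, usable)
```

In this sketch the offer matches on role, memory, and disk, yet `usable` is `False` because the CPU check fails. A single failing requirement is enough for the whole offer to be unusable for that instance.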

Let’s look at the details more closely.

Figure 4. Resource allocation details

According to this, some of the remaining CPU resources are allocated to a different Mesos resource role and so cannot be used by our application (it runs in role ‘*’, the default role).

When looking at the agent information we can see two different kinds of agent.

Figure 5. Cluster information

The first kind has no free CPU resources and also no reserved resources. Of course, this might be different if you had other workloads running on your cluster prior to these exercises. Note that these unreserved resources correspond to the default role ‘*’, the role in which we are trying to deploy our tasks.

The second kind has unused CPU resources, but these resources are reserved in the role ‘slave_public’.

We now know that the issue is that there are not enough resources in the desired resource role across the entire cluster. As a solution, we could either scale down the application (1000 instances does seem a bit excessive) or add more resources to the cluster.
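This reasoning can be sketched as a small capacity check: for each agent, count how many instances fit into the free CPU of a given role, then add up the results. The data layout, host names, and numbers below are hypothetical and much simpler than what Mesos actually reports, so treat this as an illustration under those assumptions.

```python
def instances_that_fit(agents, role, cpus_per_instance):
    """Count how many instances fit into the free CPUs of `role`.

    Each instance has to fit on a single agent, so we count per agent
    instead of dividing the cluster-wide total.
    """
    # Work in whole milli-CPUs to avoid floating-point rounding surprises.
    needed = round(cpus_per_instance * 1000)
    total = 0
    for agent in agents:
        free = round(agent["free_cpus"].get(role, 0.0) * 1000)
        total += free // needed
    return total


# Hypothetical cluster: one private agent that is full, and one public
# agent whose free CPUs are reserved for the role 'slave_public'.
agents = [
    {"host": "private-agent", "free_cpus": {"*": 0.0}},
    {"host": "public-agent", "free_cpus": {"slave_public": 1.0}},
]

default_role_capacity = instances_that_fit(agents, "*", 0.5)
public_role_capacity = instances_that_fit(agents, "slave_public", 0.5)
print(default_role_capacity, public_role_capacity)
```

With these assumed numbers, no further instance fits in role ‘*’, while two would fit in ‘slave_public’. The cluster as a whole is not fully allocated, but the role our application runs in is.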

General Pattern

NOTE: When your application framework (e.g. Marathon) is not accepting resource offers, check whether there are sufficient resources available in the respective resource role.

This was a straightforward scenario with too few CPU resources. Typically, resource issues are caused by more complex factors, such as improperly configured port resources or placement constraints. Nonetheless, this general workflow pattern still applies.
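The same workflow can be sketched in Python for the more complex cases: collect the per-requirement verdicts of the recent offers and count which requirement fails most often. The `tally_rejections` function and the sample verdicts are hypothetical; in practice you would read this information from the debug view of the DC/OS GUI.

```python
from collections import Counter


def tally_rejections(verdicts):
    """Count, per requirement, how many offers failed to match it."""
    failures = Counter()
    for verdict in verdicts:
        for requirement, matched in verdict.items():
            if not matched:
                failures[requirement] += 1
    return failures


# Hypothetical verdicts for three recent offers (True means "matched").
recent_offers = [
    {"role": True, "constraints": True, "cpus": False, "mem": True, "ports": True},
    {"role": False, "constraints": True, "cpus": True, "mem": True, "ports": True},
    {"role": True, "constraints": True, "cpus": False, "mem": True, "ports": True},
]

failures = tally_rejections(recent_offers)
top_requirement, top_count = failures.most_common(1)[0]
print(top_requirement, top_count)
```

Here the most frequent failure is `cpus`, which points to the resource to investigate first. A tally dominated by `ports` or `constraints` would instead point to the app definition.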