This past Friday I filed a related task, T202160: Evaluate different strategy for Docker CI instances, about consolidating executors onto larger instance types. After discussing it in the RelEng weekly meeting, it seems so closely related to this one that I might as well lick both cookies.

The current plan is:

Replace one or two ci1.medium instances with m1.xlarge. The latter has 4x the vCPU and memory of an m1.medium, so:

Allocate 4-5 Jenkins executors to the m1.xlarge instances. Using 5 executors would result in a slightly lower mem:executor ratio but a higher executor:vcpu ratio (see the low vCPU utilization noted in the related task).

Pool the nodes as m4executors and let jobs be scheduled/run normally for a week or so.

Compare job execution time and resource utilization of the m1.xlarge instances with those of the m1.mediums. (Is there an easy way to see mean execution time of jobs by the labels of the nodes they ran on?)

Adjust executor numbers accordingly.

If it makes sense to do so, request a different flavor from Cloud Services that will give us ratios of vcpu/memory more congruent with load.

Hopefully a week or two will give us enough data to compare these different configurations.

| instance type | executors (n) | vcpu/n | mem (G)/n |
|---------------|---------------|--------|-----------|
| m1.medium     | 1             | 2      | 4         |
| m1.xlarge     | 5             | 1.6    | 3.2       |
| bigmem        | 8             | 1      | 4.5       |
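For reference, the per-executor figures in the table follow directly from each flavor's totals. A quick sketch of the arithmetic; note that the total vCPU and memory values here are inferred from the ratios above and may not match the exact Cloud VPS flavor specs:

```python
# Per-executor resource ratios for each candidate configuration.
# Flavor totals (vcpus, mem_gb) are inferred from the table above,
# not taken from authoritative flavor definitions.
flavors = {
    "m1.medium": {"vcpus": 2, "mem_gb": 4, "executors": 1},
    "m1.xlarge": {"vcpus": 8, "mem_gb": 16, "executors": 5},
    "bigmem": {"vcpus": 8, "mem_gb": 36, "executors": 8},
}

for name, f in flavors.items():
    n = f["executors"]
    print(f"{name}: vcpu/n={f['vcpus'] / n:g}, mem(G)/n={f['mem_gb'] / n:g}")
```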

I'm looking into how best to collect useful metrics, currently thinking that we'll want to look at resource utilization for the instances but also job execution time grouping by (job name, zuul project, instance type).
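The grouping I have in mind can be sketched as follows; the record shape and field names are hypothetical, purely to illustrate the (job name, zuul project, instance type) segmentation:

```python
# Sketch: mean build duration grouped by (job name, zuul project,
# instance type). The `builds` records are hypothetical; in practice
# they would come from the Jenkins API plus node/label metadata.
from collections import defaultdict

def mean_duration_by_group(builds):
    """Group builds by (job, zuul project, instance type) and average durations (ms)."""
    groups = defaultdict(list)
    for b in builds:
        key = (b["job"], b["zuul_project"], b["instance_type"])
        groups[key].append(b["duration_ms"])
    return {k: sum(v) / len(v) for k, v in groups.items()}

builds = [
    {"job": "mediawiki-quibble-vendor-mysql-hhvm-docker",
     "zuul_project": "mediawiki/core", "instance_type": "bigmem",
     "duration_ms": 600000},
    {"job": "mediawiki-quibble-vendor-mysql-hhvm-docker",
     "zuul_project": "mediawiki/core", "instance_type": "bigmem",
     "duration_ms": 700000},
]
print(mean_duration_by_group(builds))
```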

Friday I removed the integration-slave-docker-1026 node since it was constantly running out of disk space and then self-recovering (concurrently running containers eating into /).

I replaced it with integration-slave-docker-1027, an m1.xlarge with 4 executors (the same as integration-slave-docker-1025); its per-executor resources are comparable to those of an m1.medium with 1 executor. Since we are currently at our provisioning capacity, I added an m1.xlarge (1027) to gain more capacity.

The statsd publisher was kind of a bust: it isn't collecting enough data, it lacks useful metadata by which to filter results, and refactoring the code each time we wish to change the collection/segmentation method is too cumbersome. I'll be rolling it back today.

However, I was able to collect some useful information for the mediawiki-quibble-vendor-mysql-hhvm-docker job by querying the Jenkins JSON API and populating a spreadsheet. A win for '90s tech.
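Roughly, the query looks like this (a sketch, not the exact script I used; the base URL is illustrative, while the `tree` fields (number, result, duration, builtOn) are standard Jenkins build attributes, with duration reported in milliseconds):

```python
# Sketch: pull each build's duration and the node it ran on for one job
# via the Jenkins JSON API, keeping successful builds only.
import json
from urllib.request import urlopen

def build_api_url(jenkins_base, job_name):
    # `tree` limits the response to just the build fields we need.
    return (f"{jenkins_base}/job/{job_name}/api/json"
            "?tree=builds[number,result,duration,builtOn]")

def successful_durations(builds):
    # Keep successful builds only; duration is in milliseconds.
    return [(b["builtOn"], b["duration"])
            for b in builds if b["result"] == "SUCCESS"]

def fetch_successful_durations(jenkins_base, job_name):
    with urlopen(build_api_url(jenkins_base, job_name)) as resp:
        return successful_durations(json.load(resp)["builds"])

# Offline demo with made-up build records:
sample = [
    {"builtOn": "integration-slave-docker-1025", "duration": 612000,
     "result": "SUCCESS"},
    {"builtOn": "integration-slave-docker-1027", "duration": 655000,
     "result": "FAILURE"},
]
print(successful_durations(sample))
# → [('integration-slave-docker-1025', 612000)]
```

A live call would be something like `fetch_successful_durations("https://integration.wikimedia.org/ci", "mediawiki-quibble-vendor-mysql-hhvm-docker")`.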

These preliminary results show that the average build duration for the bigmem configuration in the above matrix was much better than for the other two configurations. However, the recent disk-full failures (see T202457: mediawiki-quibble docker jobs fails due to disk full), which mostly (though not exclusively) affected the bigmem instance, may have skewed these results. The spreadsheet does consider durations for successful builds only, but when a node is taken offline due to a full disk, its in-flight jobs finish their runs with drastically more resources than they would have had while contending with newly scheduled builds, giving them quite an edge.

Despite the possibility of skewed results, I think repeating this duration comparison between different node configurations after we solve the disk-full issue would be worthwhile.

The full-disk issues have been resolved by giving Docker its own big chunk of the LVM volume group for images and running containers (see T203841: Provide dedicated storage space to Docker for images/containers) and by @hashar's improvements to Quibble's workspace cleanup. As a result, we were able to spin up a few more large instance nodes yesterday: 2 xlarge instances (1 replacing a failed node) and 2 bigmem instances, each configured with 4 executors.

Based on the current trends in Grafana, it seems there's enough unallocated memory on average to increase the number of executors on the bigmem instances, but in any case the current effective number of m4executors is 30, which should provide a big boost in capacity for now.

(BTW, an easy way to see this now that we have different configurations is to use the Groovy console:)