NOTE: the labvirt hosts have Hyper-Threading enabled. Diamond aggregates the CPU usage and divides it by the total number of CPUs, including the hyper-threaded ones. The graph multiplies the value by two to roughly account for Hyper-Threading.

Mentioned in SAL (#wikimedia-releng) [2017-11-03T13:30:00Z] <hashar> Unpool integration-slave-docker-1002 and integration-slave-docker-1003 . They are slow CPU wise, most probably due to the underlying labvirt being CPU starved. - T179378

Note that labvirt1015 is currently being stress tested for a new CPU (T171473), so the 20 VMs there are temporary. Ignoring those, we have 714 VMs running, which works out to 44.6 VMs per labvirt once we repool labvirt1015 (and ignoring actual usage and resource consumption). We should probably take a close look at any labvirt running more VMs than that and see if we can move some off to other nodes. We have a couple more labvirts on order (or nearly on order) that will hopefully help out a bit too when they arrive and join the pool in the coming months. Note, however, that naive allocation by VM count isn't a magic solution: labvirt1001 has the fewest VMs, but also currently has the fourth highest CPU load in the reports.

The labvirt hosts have Hyper-Threading enabled. Instructions from multiple programs end up multiplexed on the same physical core, which provides a speed boost. At the operating system level, each physical CPU is shown as two CPUs. Hence a server with 24 CPUs is reported as having 48 CPUs.

Diamond collects the CPU usage from the system, sums it, and divides it by the number of CPUs as reported by the operating system (48 in the example above).

So if you have 24 busy processes on such a server, Diamond reports 50% CPU usage. With Hyper-Threading the server is able to run slightly more programs, which shows up whenever the CPU usage goes past 50%. But one thing is sure: it will never reach 100%, and it will probably plateau around 70%.
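The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the averaging described here, not Diamond's actual code: the function names are made up for this example, and it assumes each busy process fully occupies one logical CPU.

```python
def reported_cpu_percent(busy_processes: int, logical_cpus: int) -> float:
    """Average usage over all logical CPUs, as a percentage.

    Assumption: every busy process saturates exactly one logical CPU.
    """
    busy = min(busy_processes, logical_cpus)
    return 100.0 * busy / logical_cpus


def doubled_for_hyperthreading(reported_percent: float) -> float:
    """Multiply the reported value by two, as the extra graph does.

    The result is not capped, so it can exceed 100%.
    """
    return 2.0 * reported_percent


# 24 physical CPUs shown as 48 logical CPUs by the operating system.
reported = reported_cpu_percent(busy_processes=24, logical_cpus=48)
print(reported)                              # 50.0
print(doubled_for_hyperthreading(reported))  # 100.0
```

With these numbers, a host whose physical CPUs are all busy shows up as 50% in the raw metric and as 100% once doubled.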

On https://grafana.wikimedia.org/dashboard/db/labs-capacity-planning the "CPU guest" graph is thus misleading. A server at 50% CPU is probably overloaded. I have added a graph that multiplies the CPU usage by 2 to better reflect how busy the hosts are. A better indicator might be load / physical CPUs. E.g. a load of 24 on 24 real CPUs is fine, but a load of 30 would indicate overloading.
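A rough sketch of the load / physical CPUs indicator, again in Python. The helper names and the threshold of 1.0 are assumptions chosen to match the example above (24 is fine, 30 is overloaded); they are not taken from any existing dashboard or tool.

```python
def load_per_physical_cpu(load_average: float, physical_cpus: int) -> float:
    """Ratio of system load to the number of real (non hyper-threaded) CPUs."""
    if physical_cpus <= 0:
        raise ValueError("physical_cpus must be positive")
    return load_average / physical_cpus


def looks_overloaded(load_average: float, physical_cpus: int,
                     threshold: float = 1.0) -> bool:
    """True when the load exceeds the assumed threshold per physical CPU."""
    return load_per_physical_cpu(load_average, physical_cpus) > threshold


print(looks_overloaded(24, 24))  # False: one runnable task per real CPU
print(looks_overloaded(30, 24))  # True: ratio is 1.25
```

The physical CPU count has to be supplied separately, since the operating system reports the doubled, hyper-threaded count.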