Performance and Scalability

Wednesday Apr 10, 2013

Today I conclude this series on M5-32 scalability [Part1, Part2] with
enhancements we made in the Scheduler, Devices, Tools, and Reboot
areas of Solaris.

Scheduler

The Solaris thread scheduler is little changed, as the architecture
of balancing runnable threads across levels in the processor resource
hierarchy, which I described when the T2 processor was introduced,
has scaled well. However, we have continued to optimize the clock
function of the scheduler.
Clock is responsible for quanta expiration, timeout processing, resource
accounting for every CPU, and miscellaneous housekeeping functions.
Previously, we parallelized quanta expiration and timeout expiration
(aka callouts).
In Solaris 11, we eliminated the need to acquire the
process and thread locks in most cases during quanta expiration and accounting,
and we eliminated or reduced the impact of several smallish O(N)
calculations that had become significant at 1536 CPUs. The net result
is that all functionality associated with clock scales nicely, and
CPU 0 does not accumulate
noticeable %sys CPU time due to clock processing.

Devices

SPARC systems use an IOMMU to map PCI-E virtual addresses to physical
memory. The PCI VA space is a limited resource with high demand. The
VA span is only 2GB to maintain binary compatibility with traditional
DDI functions, and many drivers pre-map large DMA buffer pools so that mapping
is not on the critical path for transport operations. Every CPU can
post concurrent DMA requests, thus demand increases with scale.
Managing these conflicting demands is a challenge. We reimplemented
DVMA allocation using the Solaris kmem_cache and vmem
facilities, with object size and quanta chosen to match common DMA
transfer sizes. This provides a good balance between contention-free
per-CPU caching and redistribution of free space in the back-end
magazine and slab layers. We also modified drivers to use DMA
pools more efficiently, and we modified the IOMMU code so that
2GB of VA is available per PCI function, rather than per PCI root port.

The net result for the end user is higher device throughput and/or
lower CPU utilization per unit of throughput on larger systems.
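
The trade-off between per-CPU caching and back-end redistribution can be
sketched in a few lines of Python. This is a toy model, not the Solaris
kmem/vmem code: the class DvmaCache, its parameters, and the refill and
return thresholds are all hypothetical, and each CPU index is assumed to be
used by only one thread at a time.

```python
import threading

class DvmaCache:
    """Toy allocator: per-CPU magazines in front of a shared, locked depot."""

    def __init__(self, base, span, obj_size, ncpu, mag_size=4):
        self.obj_size = obj_size
        self.mag_size = mag_size
        self.lock = threading.Lock()          # protects the shared depot only
        # Carve the VA span into fixed-size objects (hypothetical sizes).
        self.depot = list(range(base, base + span, obj_size))
        self.mags = [[] for _ in range(ncpu)]  # one private magazine per CPU

    def alloc(self, cpu):
        mag = self.mags[cpu]
        if not mag:
            # Slow path: refill the magazine from the depot under the lock.
            with self.lock:
                take = self.depot[-self.mag_size:]
                del self.depot[-self.mag_size:]
            mag.extend(take)
        # Fast path: no shared lock is taken when the magazine has entries.
        return mag.pop() if mag else None

    def free(self, cpu, addr):
        mag = self.mags[cpu]
        mag.append(addr)
        if len(mag) >= 2 * self.mag_size:
            # Return surplus to the depot so other CPUs can use it.
            with self.lock:
                self.depot.extend(mag[:self.mag_size])
            del mag[:self.mag_size]

if __name__ == "__main__":
    cache = DvmaCache(base=0, span=8 * 8192, obj_size=8192, ncpu=2)
    addr = cache.alloc(0)
    print(hex(addr), len(cache.depot), len(cache.mags[0]))
    cache.free(0, addr)
```

Most allocations and frees touch only the calling CPU's magazine. The shared
lock is taken only on refill or when surplus is handed back, which is how
free space cached on one CPU becomes available to the others.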

Tools

The very tools we use to analyze scalability may exhibit problems
themselves, because they must collect data for all the entities on a
system. We noticed that mpstat was consuming so much CPU time on
large systems that it could not sample at 1 second intervals and was
falling behind. mpstat collects data for all CPUs in every interval,
but 1536 CPUs is not a large number to handle in 1 second, so
something was amiss. Profiling showed the time was spent searching
for per-CPU kstats (see kstat(3KSTAT)), and every lookup searched the
entire kc_chain linked list of all kstats. Since both the number of
lookups and the number of kstats grow with NCPU, the overall algorithm
takes time O(NCPU^2), which explodes on the larger systems. We modified
the kstat library to build a hash table when kstats are opened, and
re-implemented kstat_lookup() on top of it. This reduced CPU consumption by 8X on our
"small" 512-CPU test system, and improves the performance of all
tools that are based on libkstat, including mpstat, vmstat, iostat,
and sar.
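
To make the quadratic cost concrete, here is a minimal Python sketch that
counts the work done by a chain walk versus a hash lookup. It is an
illustration only, not the libkstat implementation: the names Kstat,
make_chain, lookup_linear, and build_index are hypothetical, and the chain
holds just three made-up kstats per CPU. The one thing taken from the real
interface is that a kstat is identified by a (module, instance, name) triple.

```python
from typing import NamedTuple

class Kstat(NamedTuple):
    module: str
    instance: int
    name: str

def make_chain(ncpu):
    """Build a toy kstat chain with three per-CPU entries for each CPU."""
    chain = []
    for cpu in range(ncpu):
        for name in ("sys", "vm", "intrstat"):
            chain.append(Kstat("cpu", cpu, name))
    return chain

def lookup_linear(chain, module, instance, name):
    """Walk the chain from the head; return (kstat, entries examined)."""
    for steps, ks in enumerate(chain, start=1):
        if ks == (module, instance, name):
            return ks, steps
    return None, len(chain)

def build_index(chain):
    """Build the hash table once, when the chain is opened."""
    return {(ks.module, ks.instance, ks.name): ks for ks in chain}

def cost_linear(chain, ncpu):
    """Entries examined when one kstat per CPU is found by chain walk."""
    total = 0
    for cpu in range(ncpu):
        _, steps = lookup_linear(chain, "cpu", cpu, "sys")
        total += steps
    return total

def cost_hashed(index, ncpu):
    """Hash probes needed for the same set of lookups: one per CPU."""
    probes = 0
    for cpu in range(ncpu):
        if index.get(("cpu", cpu, "sys")) is not None:
            probes += 1
    return probes

if __name__ == "__main__":
    for ncpu in (512, 1536):
        chain = make_chain(ncpu)
        index = build_index(chain)
        print(ncpu, cost_linear(chain, ncpu), cost_hashed(index, ncpu))
```

In this toy model, tripling the CPU count raises the chain-walk cost roughly
ninefold, while the hashed cost only triples.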

Even dtrace is not immune. When a script starts, dtrace allocates
multi-megabyte trace buffers for every CPU in the domain, using a
single thread, and frees the buffers on script termination using a
single thread. On a T3-4 with 512 CPUs, it took 30 seconds to run a
null D script. Even worse, the allocation is done while holding the
global cpu_lock, which serializes the startup of other D scripts, and
causes long pauses in the output of some stat commands that briefly
take cpu_lock while sampling. We fixed this in Solaris 11.1 by
allocating and freeing the trace buffers in parallel using vmtasks,
and by hoisting allocation out of the cpu_lock critical path.
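
The shape of that fix can be sketched in Python as follows. This is a
user-level analogy under stated assumptions, not the dtrace code: a thread
pool stands in for vmtasks, a threading.Lock stands in for cpu_lock, and
the names alloc_buffer, start_script, and active_buffers, as well as the
small buffer size, are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

cpu_lock = threading.Lock()   # stand-in for the global cpu_lock
BUF_SIZE = 64 * 1024          # hypothetical size; real buffers are multi-megabyte
active_buffers = {}           # shared state that must be updated under the lock

def alloc_buffer(cpu):
    """Allocate one trace buffer for one CPU (the expensive step)."""
    return cpu, bytearray(BUF_SIZE)

def start_script(ncpu, workers=8):
    # Step 1: do the expensive allocations in parallel, with no lock held.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        buffers = dict(pool.map(alloc_buffer, range(ncpu)))
    # Step 2: hold the lock only for the short step that publishes them.
    with cpu_lock:
        active_buffers.update(buffers)
    return buffers

if __name__ == "__main__":
    bufs = start_script(32)
    print(len(bufs), "buffers allocated")
```

Other threads that need the lock briefly, like the stat commands mentioned
above, now wait only for the publish step rather than for every allocation.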

Large scale can impact the usability of a tool. Some stat tools
produce a row of output per CPU in every sampling interval, making it
hard to spot important clues in the torrent of data. In Solaris
11.1, we provide new aggregation and sorting options for the mpstat,
cpustat, and trapstat commands that allow the user to make sense of
the data. For example, the command

mpstat -k intr -A 4 -m 10 5

sorts CPUs by the interrupts metric, partitions them into quartiles, and
aggregates each quartile into a single row by computing the mean column
values within each. See the man pages for details.
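
The aggregation described above can be sketched in Python. This is an
illustration of the idea, not the mpstat implementation: the function
aggregate, the row layout, and the sample numbers are hypothetical, and the
sketch assumes ascending sort order and equal-sized contiguous groups.

```python
from statistics import mean

def aggregate(rows, key, groups):
    """Sort rows by `key`, split them into `groups` contiguous parts,
    and reduce each part to one row of per-column means."""
    ordered = sorted(rows, key=lambda r: r[key])
    n = len(ordered)
    out = []
    for g in range(groups):
        part = ordered[g * n // groups:(g + 1) * n // groups]
        if not part:
            continue  # fewer rows than groups
        # Average every metric column; the CPU id is not a metric.
        out.append({col: mean(r[col] for r in part)
                    for col in part[0] if col != "cpu"})
    return out

if __name__ == "__main__":
    # Made-up per-CPU samples: interrupts and %sys for 8 CPUs.
    intr = [5, 80, 10, 70, 20, 60, 30, 40]
    rows = [{"cpu": i, "intr": v, "sys": v // 10} for i, v in enumerate(intr)]
    for row in aggregate(rows, key="intr", groups=4):
        print(row)
```

With 1536 CPUs, the same call would reduce each sampling interval from 1536
rows to four.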

Reboot

Large servers take longer to reboot than small servers. Why?
They must initialize more CPUs, memory, and devices, but much of
the shutdown and startup code in firmware and the kernel is
single-threaded. We are addressing that. On shutdown, Solaris now scans
memory in parallel to look for dirty pages that must be flushed to disk.
The sun4v hypervisor zeroes a domain's memory in parallel, using CPUs
that are physically closest to memory for maximum bandwidth.
On startup, Solaris VM initializes per-page metadata using SPARC cache
initializing block stores, which speeds metadata initialization by more
than 2X. We also fixed an O(NCPU^2) algorithm in bringing CPUs online,
and an O(NCPU) algorithm in reclaiming memory from firmware.
In total, we have reduced the reboot time for M5-32 systems by many
minutes, and we continue to work on optimizations in this area.

In these few short posts, I have summarized the work of many people over
a period of years that has pushed Solaris to new heights of scalability,
and I look forward to seeing what our customers will do with their massive
T5-8 and M5-32 systems.
However, if you have seen the
SPARC processor roadmap, you know that our work is not done. Onward and upward!