Scaling Out Singleton Insert Workloads Using Containers: Part II

In the previous part of this blog post I discussed how containers could be used to scale out a singleton workload. Whereas my attempts to get my experiments to work ran into difficulties with Ubuntu Linux, Docker Community Edition on Windows Server 2016 had no such problems.

Anything that relies on a single point of serialization to ensure that an entity can be recovered to a consistent state in the event of a crash incurs a scalability penalty, because of the synchronization that needs to take place around this mechanism. This is just one of three main areas that inhibit transactional throughput:

XDESMGR
Unless you are using SQL Server’s in-memory engine, you are likely to encounter contention around the spinlock which synchronizes access to the part of the engine that gives out transaction ids.

LOGCACHE_ACCESS
Under a heavy OLTP load across two CPU sockets, spinlock contention will be encountered on LOGCACHE_ACCESS, the entity which serialises access to the log buffer. This is less of a problem with the in-memory engine, which does away with write-ahead logging and is extremely aggressive when batching together the logical operations that are logged.

Cache Line Ping Pong

A cache line is the unit of transfer between main memory and the CPU cache. I could write an entire blog post on the nuances of the CPU cache and cache coherency; suffice it to say there are two pieces of information which are vital to know. Firstly, each spinlock has a cache line associated with it. To acquire a spinlock, the cache line associated with it has to travel to the CPU core running the thread that wishes to acquire it, a compare and swap operation is then performed on the cache line, and it is then returned to the CPU core from where it came. Secondly, the further these cache lines have to travel, the more CPU cycles are burned up. Remember that all memory access is CPU intensive. Code and software that leads to cache lines bouncing from CPU core to core, hence the term “cache line ping pong”, is highly inefficient.
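To make the ping pong effect a little more concrete, here is a toy Python model. It is only a sketch under heavy assumptions: it ignores the real cache coherency protocol, the return trip of the cache line and the distance between sockets, and simply counts how many times a lock's cache line has to move to a different core. The function name `count_line_transfers` and the core numbering are invented for illustration.

```python
def count_line_transfers(acquiring_cores):
    """Count how often a spinlock's cache line has to move to another core.

    acquiring_cores is the sequence of core ids that acquire the lock,
    in order. Every time the acquiring core differs from the core that
    last held the line, the line has to travel.
    """
    transfers = 0
    owner = None  # core that currently holds the cache line
    for core in acquiring_cores:
        if owner is not None and core != owner:
            transfers += 1
        owner = core
    return transfers


# Four cores take turns acquiring one shared spinlock, 1000 times each.
shared_lock = [core for _ in range(1000) for core in range(4)]

# One core acquires its own private spinlock 4000 times.
private_lock = [0] * 4000

shared_transfers = count_line_transfers(shared_lock)
private_transfers = count_line_transfers(private_lock)

print("shared lock transfers: ", shared_transfers)
print("private lock transfers:", private_transfers)
```

In this model the shared lock moves its cache line on almost every acquisition, whereas the lock that is only ever touched by one core never moves it at all.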

Why Not Just Use The In-Memory Engine ?

You will note that I have mentioned that the in-memory engine solves a lot of the problems the legacy engine encounters around spinlocks. A reasonable question to ask might therefore be: why not just use the in-memory engine and end the blog post here ? Consider a scenario in which you have to ingest a large amount of data, IoT data for example. You may soon run out of memory, considering that, as a rough rule of thumb, each memory optimized table requires twice as much memory as the data stored in it.
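The rule of thumb is easy to turn into a back-of-the-envelope check. The sketch below is illustrative only: the factor of two comes from the rule of thumb above, the 512GB figure is the memory in the test server used later in this post, and the data volumes are made-up numbers, not measurements.

```python
def memory_needed_gb(data_gb, overhead_factor=2.0):
    """Rough memory requirement for a memory optimized table.

    overhead_factor=2.0 encodes the rule of thumb of twice the data size.
    """
    return data_gb * overhead_factor


def fits_in_memory(data_gb, server_memory_gb, overhead_factor=2.0):
    """True if the estimated requirement fits in the server's memory."""
    return memory_needed_gb(data_gb, overhead_factor) <= server_memory_gb


SERVER_MEMORY_GB = 512  # memory in the test server

for data_gb in (100, 200, 300):  # hypothetical data volumes
    needed = memory_needed_gb(data_gb)
    verdict = "fits" if fits_in_memory(data_gb, SERVER_MEMORY_GB) else "does not fit"
    print(f"{data_gb}GB of data needs roughly {needed:.0f}GB: {verdict}")
```

This ignores everything else competing for memory on the server, so the real ceiling will be lower.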

Scaling A Singleton Workload With A Single Instance

For this test I will be using:

A two socket Dell R720 with 12 Xeon 2.7 GHz V2 cores per socket and 512GB of memory.

Windows Server 2016

SQL Server 2017 CU3

Hyper-threading turned off

Socket 0 core 0 removed from the CPU affinity mask

Trace flags 1117, 1118 and 2330 enabled

Each thread in the workload performs 250,000 inserts into the following table:

Where Is The Bottleneck ?

A stack trace gathered for the singleton insert workload using a clustered GUID key, as viewed through Windows Performance Analyzer, reveals that most of the CPU time is burned up by the SQLServerLogMgr::AppendLogRequest function:

Drilling down into the call stack reveals more information about this function:

It appears that this has something to do with the LOGCACHE_ACCESS spinlock:

Tuning For Scalability 101

When faced with contention on a singleton resource, the obvious solution is to create more of that resource. However, there are no knobs, levers or dials that will allow us to create more instances of the LOGCACHE_ACCESS (or XDESMGR) spinlocks. What to do ?

. . . Sharding To The Rescue !!!

Using containers we can create a number of SQL Server instances and shard the singleton workload across all of our available CPU cores. Below is the PowerShell script which will allow us to do this:

What Does The Script Do ?

The script creates 24 containers, each with its own SingletonInsert database. Inside each SingletonInsert database it creates a clustered index and a stored procedure to perform inserts. 250,000 singleton inserts are performed inside each container using a single session, and the penultimate loop in the script executes this workload with 1 through to 24 sessions. The --cpuset-cpus flag is of particular interest because it provides the means by which the script spins up containers on specific logical processors. In this example, each container is effectively bound to a single CPU core, meaning there is no cache line ping pong involved around the XDESMGR, LOGCACHE_ACCESS and LOGFLUSHQ spinlocks. Two types of clustered index are provided by the script, one using a GUID based key and the other using a sequential key.
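The core-pinning idea can be sketched in a few lines of Python. This is not the PowerShell script from this post, only an illustration of how one `docker run` command per core might be assembled. The `--cpuset-cpus`, `-d`, `--name`, `-e` and `-p` flags are real Docker options, but the image name, container names, port layout and placeholder password are assumptions made for the example. The code only builds the command lines; it does not run Docker.

```python
CONTAINER_COUNT = 24


def docker_run_args(index,
                    image="microsoft/mssql-server-linux:2017-latest",  # assumed image name
                    base_port=1433,                                     # assumed port layout
                    sa_password="<placeholder-password>"):              # placeholder only
    """Build the docker run arguments for the container pinned to CPU `index`."""
    return [
        "docker", "run", "-d",
        "--name", f"sqlshard{index:02d}",   # hypothetical naming scheme
        f"--cpuset-cpus={index}",           # bind the container to one logical CPU
        "-e", "ACCEPT_EULA=Y",
        "-e", f"SA_PASSWORD={sa_password}",
        "-p", f"{base_port + index}:1433",  # one host port per instance
        image,
    ]


def port_for_session(session, base_port=1433):
    """Shard sessions across instances: session n talks to container n."""
    return base_port + (session % CONTAINER_COUNT)


commands = [docker_run_args(i) for i in range(CONTAINER_COUNT)]

for args in commands[:3]:
    print(" ".join(args))
```

Because each session is routed to its own instance, and each instance is confined to its own logical CPU, no two sessions ever compete for the same spinlock.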

The Results

Firstly, for the sequential “Spid offset” key we have this graph:

For the GUID key we have:

In both cases, whilst the throughput when using a single instance plateaus off, the containerized approach keeps following a respectable curve.

Where To Next ?

I would like to repeat this exercise with the Windows container image and compare the results with those for the Linux image. I would also like to see what difference changing the number of logical CPUs per container makes to the throughput curves.