Demanding Database Storage Performance

Client Objectives:
Client was suffering excessive database-level storage latency, dramatically slowing down critical database performance, while storage team reported no elevated latencies on the all-flash SAN, and that all is well.

Solution:
Heraflux reviewed their entire IT architecture, including all-flash storage, virtualization, and SQL Server databases, and determined both the virtual disk presentation could be improved and that the storage abstraction device in front of the SAN was creating a significant bottleneck to performance.

Results:
Client achieved improved database performance because the storage latencies dropped below two milliseconds from within the database engine.

While there are many consulting firms out there, we found Heraflux Technologies to exhibit all of the traits of a true partner, rather than just another firm.

A number of the business-critical databases that empower daily business operations were recording messages in the error logs about significantly higher than expected disk latencies. SQL Server “long I/O alerts,” which are indications that a single I/O operation took longer than 15 seconds, were growing in frequency during routine database operations, and during these periods of time, the databases were performing well below expectations.

The database administration team spent many hours attempting to track down the root cause of the storage latency issues, including measure and adjusting the SQL Server and Windows-layers to better distribute the database workloads across the presented storage. However, the improvements were incremental, but the overall trend was that the volume of long I/O alerts were growing. The storage team, on the other hand, reviewed the storage performance as reported by the all-flash SAN, and determined that the SAN was performing quite well, with latencies never exceeding one millisecond.

The Client needed an outside specialist to review the architecture and inner workings of the datacenter to determine where in the architecture the source of the database latency was originating. The entire datacenter architecture for the manner that storage was presented to the database server VMware-based virtual machines needed to be reviewed to determine where the storage performance bottleneck was at, and remediate the problem.

Solution

Heraflux learned quickly that the datacenter infrastructure and applications had some additional layers in the architecture that were previously unreported.

Heraflux performed a deep-dive systems review of the VMware virtualization systems platform powering the database and application servers, as well as the all-flash storage array. First, Heraflux performed a performance analysis of the current platform to establish a performance baseline for the SQL Servers. Windows Perfmon and SQL Server dynamic management views (DMVs), both included with the respective layers, were used to establish this baseline, so no additional costs were incurred to establish this baseline.

Heraflux subsequently found that the virtual disk configuration on the database servers were suboptimal, with I/O bottlenecks originating in the Windows layer because of how Windows performs I/O coalescing by virtual disk controller and not by virtual disk. The storage requests were stacking up because Windows only had one available route to get storage requests in and out of the operating system. Three additional controllers were added to the virtual machine, and the workload balanced among all available controllers, and the storage performance significantly improved. However, it was nowhere near the low latencies that the SAN was reporting.

Heraflux then reviewed the storage architecture, and found that a storage abstraction device, called a SAN volume controller, was in-line with the storage device. These devices are designed to make SAN administration less time consuming. All storage requests were funneled through this device. An extensive analysis of this device found that the controllers and cache memory inside these devices were being overwhelmed during times of elevated I/O requests, and as a result, the device was slowing down inbound I/O while it processed the requests. Additional storage abstraction devices were added to this architecture to better scale out the workload, and the storage performance then improved across the entire datacenter to match the very low storage performance latencies that the SAN was reporting.

The end result was that the SQL Server databases were now able to access the storage subsystem faster than ever, and every single database transaction’s performance increased substantially. The databases were performing better than ever, the storage abstraction layer performing as designed, and the long I/O alerts disappeared.

Results

The SQL Server database layer no longer logged long I/O alerts each maintenance cycle, and the latencies as measured by SQL Server were in the 1-2ms range.

The storage abstraction device was no longer a bottleneck to storage performance, and the all-flash SAN performed well under the higher volumes.

The SQL Server databases now performed as expected, shifting the performance bottleneck back onto the application servers where it belongs.