
Having fun with Go slices

Go offers precious few data structures that can be generically typed: the channel (essentially a typed queue), the map, and the slice (including the array). This often leaves developers using slices, particularly ordered slices, for all sorts of use cases. I thought it might be fun to create some utilities for slice manipulation, starting with a faster stable sort. I hope that in Go 2 we get generics, so that data structures like binary trees will be safer.

For those who aren't aware, a stable sort is a sort in which two elements with the same key retain the relative positions they had in the input. Several sorting algorithms support stable sorting, including merge sort, insertion sort, and bubble sort. Efficient implementations of the ever-popular quicksort are not stable.

I've had several use cases for a stable sort in the past, so I thought it might be fun to take the stable sort from the standard library and attempt to make it concurrent to improve performance. The standard library's sort is a divide-and-conquer algorithm, so making it concurrent is somewhat straightforward.

The standard library's stable sort breaks the input slice into small subslices of length 20. The subslices are sorted using an insertion sort, which is stable. The sorted subslices are then iteratively merged into ever larger subslices using an algorithm called symmerge. Symmerge, invented by Kim and Kutzner, is a space-efficient algorithm for merging two sorted arrays. It uses symmetric comparison and rotation to achieve an optimal number of comparisons, O(m log(n/m)), where m is the length of the shorter array and n is the length of the longer array. The explanation is quite a bit more complex than the algorithm itself, and I won't rehash it here, but I highly encourage you to visit the paper and give it a quick read.

Combining these techniques, Go's standard library is able to sort slices quickly, efficiently, and stably. My goal was to improve performance by simply making the current algorithm concurrent. The initial insertion sorts can safely happen across the slice concurrently, and the final merge stages can also happen concurrently within each block-size iteration. To ensure consumers can drop this in as a replacement, the stable sort function must also accept a sort.Interface (stuttering is ugly, Go!). To ensure I wasn't paying a dynamic dispatch penalty, I confirmed that the comparison function on the interface was inlined using -gcflags="-m".

The results

To begin, I tested the algorithm with varying input sizes and various chunk sizes for the initial insertion sort. All tests were run on a MacBook Pro with a quad-core 2.2 GHz i7 (Haswell). All code used in this test can be found at https://github.com/dzyp/slice.

The vertical axis is logarithmic, and all algorithms display linear scaling. The Go authors have done their research: a chunk size of 20-30 is optimal, with larger sizes paying too great a penalty in the insertion sort (insertion sort has an average comparison complexity of O(n^2)). The logarithmic scale makes the final times appear much closer than they actually are; the standard library sorts 10,000,000 structs in 1859 ms, while the concurrent sort accomplishes the same in 898 ms.

Next, I wanted to see how the number of available CPUs affected performance. I performed this test by modifying GOMAXPROCS, though in recent versions of Go this is automatically set to runtime.NumCPU().
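A harness for that kind of measurement might look like the sketch below. Note that sort.SliceStable here is only a stand-in: the standard library's sort is single-threaded, so GOMAXPROCS won't change its timing; you would substitute the concurrent implementation to observe the scaling:

```go
package main

import (
	"fmt"
	"math/rand"
	"runtime"
	"sort"
	"time"
)

// timeStableSort pins the scheduler to n OS threads, stably sorts a copy
// of data, and reports how long the sort took.
func timeStableSort(n int, data []int) time.Duration {
	prev := runtime.GOMAXPROCS(n)
	defer runtime.GOMAXPROCS(prev) // restore the previous setting

	xs := make([]int, len(data))
	copy(xs, data)
	start := time.Now()
	sort.SliceStable(xs, func(i, j int) bool { return xs[i] < xs[j] })
	return time.Since(start)
}

func main() {
	data := make([]int, 1<<20)
	for i := range data {
		data[i] = rand.Int()
	}
	for _, n := range []int{1, 2, 4, runtime.NumCPU()} {
		fmt.Printf("GOMAXPROCS=%d: %v\n", n, timeStableSort(n, data))
	}
}
```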

Good scaling was seen going from one to four CPUs, but little additional performance was gained going from four to eight. Whether this is because the CPU has only four physical cores (the remainder being virtual, due to hyperthreading) or because we're running up against Amdahl's law is a subject for further investigation.

Further work

To reduce the number of allocations, I use pools so that the messages sent down queues to workers can be reused. However, if the CPU-local cache implemented by the pool is cold, a thread can be subjected to a global lock. This can reduce performance if many consumers are trying to sort simultaneously. I have a stash that implements the same functionality using a ring buffer instead, but it had a marginal impact on performance and was slightly more complex, so the initial version uses the simpler approach.

Most remaining performance gains will likely come from an improved sorting algorithm, although the options are somewhat limited by being restricted to the sort.Interface methods. Another option is pipelining the insertion sorts directly into the merging process, bypassing the wait at the end of the insertion-sort stage.

Up next

Simply making the current stable sorting algorithm concurrent can yield fairly dramatic improvements in sort time. Beyond improving sort, I'd like to explore quicker insertions (using the sort) and faster deletions without the use of tombstones. My sincerest hope, however, is that these will be made redundant by an improved type system or generic binary trees in the standard library :).


