The algorithm implemented here performs best on large arrays (N > 1e6).
It uses multiple kernel launches, and the latency they introduce is only
amortized when each call sorts a large amount of data. For batches of
smaller arrays, use segmented_sort instead.
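
To make "batches of smaller arrays" concrete, here is a minimal CPU-only sketch of what a segmented sort computes: many small arrays are packed into one flat buffer and each segment is sorted independently in a single call. The function name `segmented_sort_reference` and the `segment_starts` parameter are hypothetical names chosen for illustration; they are not the signature of segmented_sort, and the sketch makes no attempt to model GPU execution.

```python
def segmented_sort_reference(keys, segment_starts):
    """Sort each segment of `keys` independently (pure-Python reference).

    `segment_starts` holds the starting index of every segment; a segment
    runs up to the next start, and the last one runs to the end of `keys`.
    Both names are illustrative assumptions, not a library API.
    """
    out = list(keys)
    bounds = list(segment_starts) + [len(out)]
    for begin, end in zip(bounds[:-1], bounds[1:]):
        # Each small array is sorted on its own; values never cross
        # a segment boundary.
        out[begin:end] = sorted(out[begin:end])
    return out


# Three small arrays packed back to back: [3, 1, 2], [9, 7, 8], [5, 4]
batch = [3, 1, 2, 9, 7, 8, 5, 4]
starts = [0, 3, 6]
result = segmented_sort_reference(batch, starts)
```

Packing the batch this way lets one call cover every small array, instead of paying the multi-launch latency once per array.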