Sorting can be a major bottleneck in Perl programs. Performance
can vary by orders of magnitude, depending on how the sort is
written. In this paper, we examine Perl's sort function in
depth and describe how to use it with simple and complex data. Next we
analyze and compare several well-known Perl sorting optimizations
(including the Orcish Maneuver and the Schwartzian Transform). We then
show how to improve their performance significantly, by packing multiple
sortkeys into a single string. Finally, we present a fresh approach,
using the sort function with packed sortkeys and without a
sortsub. This provides much better performance than any of the other
methods, and is easy to implement directly or by using a new module we
created, Sort::Records.

NOTE: Sort::Records died during development, but five years later
Sort::Maker was released and does all that was promised and more. Find
it on CPAN.

What is sorting and why do we use it?

Sorting is the rearrangement of a list into an order defined by a monotonically increasing or decreasing sequence of sortkeys, where each sortkey is a single-valued function of the corresponding element of the list. (We will use the term sortkeys to avoid confusion with the keys of a hash.)

Sorting is used to reorder a list into a sequence suitable for further processing or searching. In many cases the sorted output is intended for people to read; sorting makes it much easier to understand the data and to find a desired datum.

Sorting is used in many types of programs and on all kinds of data. It is such a common, resource-consuming operation that sorting algorithms and the creation of optimal implementations comprise an important branch of computer science.

This paper is about creating optimal sorts using Perl. We start with a brief overview of sorting, including basic algorithm theory and notation, some well-known sorting algorithms and their efficiencies, sortkey processing, and sorting outside of Perl. Next we will describe Perl's sort function [1] and basic ways to use it. Then we cover handling complex sortkeys, which raises the question of how to optimize their processing. Finally we introduce a relatively new method, which moves all the sortkey processing out of the sort function, and which produces the most efficient Perl sort. A new module is also described, which implements this sorting technique and which has powerful support for sortkey extraction (the processing of the input data to produce the sortkeys).

Algorithm and sorting theory

A complete discussion of algorithm and sorting theory is beyond the scope of this paper. This section will cover just enough theory and terminology to explain the methods that we use to compare sort techniques.

The complexity of an algorithm is a measure of the resources needed to execute the algorithm -- typically there is a critical operation that needs to be executed many times. Part of algorithm theory is figuring out which operation is the limiting factor, and then formulating a function that describes the number of times the operation is executed. This complexity function is commonly written with the big-O notation -- O(f(N)) -- where 'O' is read as 'order of' and 'f(N)' is some function of N, the size of the input data set.

O(f(N)) comparisons have some unusual properties. The actual size of N is usually irrelevant to the correct execution of an algorithm, but its influence on the behavior of f(N) is critical. If an algorithm's order is O(N*logN + N), when N is large enough the effect of the N term on the function's value is negligible compared to the N*logN term. So that algorithm's order is just O(N*logN). In many cases the calculated order function for an algorithm is a polynomial in N, but only the term with the highest power is retained, and no coefficient is shown. Similarly, if two algorithms have the same order but one does more work for each operation, they are still equivalent in order space, even though there may be a substantial difference in real-world speeds. That last point is crucial in the techniques we will show to optimize Perl sorts, all of which have the same big-O function, O(N*logN).

Here are some well-known algorithms and their order functions (adapted from [2]):

Notation      Name          Example

O(1)          constant      array or hash index
O(logN)       logarithmic   binary search
O(N)          linear        string comparison
O(N*logN)     n log n       advanced sort
O(N**2)       quadratic     simple sort
O(N**3)       cubic         matrix multiplication
O(2**N)       exponential   set partitioning

Sorting's critical operation is determining in which order to put pairs of elements of the data. The comparison can be as simple as finding whether two numbers are equal or which is greater than the other (or doing similar operations on strings), or it can be quite complex.

Simple sorting algorithms (bubble or insertion sorts) compare each element to each of the others repeatedly, so their complexity is O(N**2). Even with the triangle optimization ($x is equal to $x, and $x compared to $y is the negative of $y compared to $x), which reduces the function to O((N * (N-1))/2), the complexity is still O(N**2), as explained above.

But these algorithms have their uses. When N is very small, they can actually be faster than the other methods, because the O(1) and O(N) overhead of the advanced sorts may outweigh the O(N**2) behavior of the simple sorts. "Fancy algorithms are slow when N is small, and N is usually small. Fancy algorithms have big constants." [3] The really important cases, which are worth care in the coding, occur when N is large.

Advanced sorting methods repeatedly partition the records to be sorted into smaller sets, to reduce the total number of comparisons needed. Their complexity is O(N*logN), which can be much less than O(N**2) for sufficiently large values of N. These algorithms include 'tree sort', 'shell sort', and 'quicksort'. [4]

Some specialized sort algorithms (such as 'radix sort') work by comparing pieces of numeric sortkeys, and can achieve linear complexity (O(N)) [5]. These methods are not general-purpose, so we will not address them further.

One property of sort algorithms is whether they are stable. A stable sort preserves the order in the sorted data of two elements that compare equal. Some sorting problems require stability. The simple sorting algorithms are generally stable; the advanced ones are not. We will show how to make Perl's advanced sort behave stably if required.

An important sorting variation is when the original data elements can't conveniently be moved around by the sort algorithm's shuffling. So instead of sorting the elements directly, you sort their index numbers. You then use the sorted indexes to create a list of sorted elements. Some sort operators in other languages (APL comes to mind) simply return sorted indexes, and it is up to the programmer to use them correctly. We will show how to create an efficient Perl index sort and where it is useful.

Sortkeys

If you are sorting a set of scalar-valued elements where the comparison looks at the entire element, the sortkey is simply the entire element. More generally, the sortkey is based on some properties that are functions of all or part of the element. Such subkeys may be extracted from internal properties of parts of the element (fields) or derived from external properties of the element (such as the modification date of a file named by the element, which is quite expensive to retrieve from the file system).

To avoid repeated computation of the sortkeys, the sort process has to retain the association between records and their extracted or derived sortkeys. Sorting theory and algorithms usually ignore the cost of this association, as it is typically a constant factor of the comparison operation. But as we will see later, in the real world, removing that overhead or reducing it from O(N*logN) to O(N) is very valuable, especially as N grows.

Complex sortkeys can add tremendously to the overhead of each comparison. This occurs where the records have to be sorted by primary, secondary, and lower-order subkeys. This is also known as doing a subsort on the lower keys. Extracting and comparing complex sortkeys can be costly and error-prone.

No general-purpose implementation of a sort algorithm can efficiently support extracting and comparing different types of sortkeys. Therefore, most sort implementations provide a simple interface to call a sortsub -- a custom comparison subroutine which is passed two operands. These operands can be the records themselves, or references to or indexes of complex records. The comparison returns a negative, zero, or positive value, depending on the ordering of the sortkeys of the two records. The programmer is responsible for any preprocessing of the records to generate the sortkeys and any postprocessing to retrieve the sorted data. The generic sort function only manages the comparisons and shuffles the operands into sorted order.

As Perl's sort function is O(N*logN), efficiency must come from extracting and comparing the sortkeys using the least amount of work. Much of this paper will be about methods to make sortkey extraction and comparison as efficient as possible.

External sorting

Every popular commercial operating system offers a sort utility. Unix/POSIX flavors typically have a sort command which is fast and fairly flexible with regard to sortkey extraction from text files. In some cases, the Unix/POSIX sort command may be easier to code and more efficient than using the Perl sort function.

Several vendors sell highly optimized commercial sort packages that have received decades of attention and can handle massive amounts of data. But they are very expensive and not suitable for use inside a Perl program.

All of these are capable of dealing efficiently with very large amounts of data, using external media such as disk or tape files for intermediate storage when needed. In contrast, the Perl sort function requires that the entire list of operands be in (real or -- much more expensively -- virtual) memory at the same time. So Perl is not the appropriate tool to use for huge sorts (where huge is defined by your system´s memory limits), which we shall not consider further.

Internal sorting

The Perl sort function uses an implementation of the quicksort algorithm that is similar to (but more robust than) the qsort function in the ANSI/ISO Standard C Library [6]. In the simplest use, the Perl sort function requires no sortsub:

@out = sort @in;

This default sorts the data in ascending lexicographic order, using the fast C memcmp function as the comparison operation. If a locale is specified, it substitutes the more complicated and somewhat slower C strcoll function.

If you want any kind of ordering other than this, you must provide a custom comparison sortsub. The sortsub can be specified either as a code block, the name of a subroutine, or a typeglob that refers to a subroutine (a coderef). In Perl 5.6, a scalar variable that contains a coderef can also be used to specify the sortsub.

In order to optimize the calling of the sortsub, Perl bypasses the usual passing of arguments via @_, using instead a more efficient special-purpose method. Within the sortsub, the special global package variables $a and $b are aliases for the two operands being compared. The sortsub must return a number less than 0, equal to 0, or greater than 0, depending on the result of comparing the sortkeys of $a and $b. The special variables $a and $b should never be used to change the values of any input data, as this may break the sort algorithm.

Even the simplest custom sort in Perl will be less efficient than using the default comparison. The default sort runs entirely in C code in the perl core, but any sortsub must execute Perl code. A well-known optimization is to minimize the amount of Perl code executing and to try to stay inside the perl core as much as possible. Later we will see various optimization techniques that will reduce the amount of Perl code executed.

The primary goal of this paper is to perform all sorts using the default comparison. Here is how an explicit ascending lexicographic sort would be done using a sortsub:

@out = sort { $a cmp $b } @in;

For a simple measurement, compare Default and Explicit in Benchmark A1 of Appendix A. The default method is about twice as fast as the explicit method.

Trivial sorts

We call trivial sorts those that use the entire record as the sortkey and do only a minimal amount of processing of the record. To do trivial Perl sorts other than ascending lexicographic, you just need to create an appropriate sortsub. Here are some common ones that perform useful functions.

The simplest such example is the ascending numeric sort, which uses the picturesquely monikered 'spaceship' operator:

@out = sort { $a <=> $b } @in;

A numeric sort capability is required because the lexicographic order of, say, (1, 2, 10) does not correspond to the numeric order.

If you want the sort to be in descending order, there are three techniques you can use. The worst is to negate the result of the comparison in the sortsub. Better is to reverse the order of the comparison by swapping $a and $b. This has the same speed as the corresponding forward sort.
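For example, the second technique, applied to a lexicographic sort:

```perl
# Descending lexicographic sort: swap $a and $b in the comparison.
@out = sort { $b cmp $a } @in;
```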

The best method is to apply the reverse function to the result of a default ascending lexicographic sort.

@out = reverse sort @in;

Note that this is faster than using the explicit descending lexicographic sort, for the reason discussed above: the default sort is faster than using a sortsub. The reverse function is efficient because it just moves pointers around.

Another common problem is sorting with case insensitivity. This is easily solved using the lc or uc function. Either one will give the same results.

@out = sort { lc $a cmp lc $b } @in;

Benchmark A1 analyzes these examples as a function of the input size. The O(N*logN) behavior is apparent, as well as the cost of using even a simple built-in function like lc in the sortsub.

Fielded and record sorts

The above trivial sorts sort the input list using as the sortkey the entire string (for a lexicographic sort) or the first number in each datum (for a numeric sort). More typically, the sortkey is based on some property that is a function of all or part of each datum. Several individual subkeys may be combined into a single sortkey or may be compared in pairs individually.

A complex string may be divided into fields, some of which may serve as subkeys. For example, the Unix/POSIX sort command provides built-in support for collation based on one or more fields of the input; the Perl sort function does not, and the programmer must provide it. One CPAN module focuses on fielded sorts [7].

If your data are records which are complex strings or references to arrays or hashes, you have to perform comparisons on selected parts of the records. This is called record sorting. (Fielded sorts are a subset of record sorts.)

In the code examples that follow, KEY() is meant to be substituted with some Perl code that performs sortkey extraction. It is best that it not be an actual subroutine call, because subroutine calls within sortsubs can be expensive. Calls to built-in Perl functions (such as the calls to lc in the example above) are like Perl operators, thus relatively less expensive.

When sorting string records, $a and $b are set to those strings, so to extract the sortkeys you generally perform various string operations on the records. Functions commonly used for this include split, substr, unpack, and m//. Here is one example, sorting a list of password-file lines by user name using split. The fields are separated by colons, and the user name is the first field.
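A sketch of such a sortsub. Note that split is executed on both operands at every comparison, an inefficiency we will return to shortly:

```perl
# Sort password-file lines by user name, the first colon-separated field.
@out = sort {
    (split /:/, $a)[0] cmp (split /:/, $b)[0]
} @in;
```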

In some cases you need to sort records by a primary subkey, then for all the records with the same primary subkey value, you need to sort by a secondary subkey. One horribly inefficient way to do this is to sort first by the primary subkey, then get all the records with a given subkey and sort them by the secondary subkey. The standard method is to do a multi-key sort. This entails extracting a subkey for each field, and comparing paired subkeys in priority order. So if two records with the same primary subkey are compared, they will actually be compared based on the secondary subkey. Sorting on more than two subkeys is done by extending the logic.

Perl has a very nice feature which makes multi-key sorts easy to write. The || (short-circuit or) operator returns the actual value of the first logically true operand it sees. So if you use || to concatenate a set of key comparisons, the first comparison is the primary subkey. If a pair of primary subkeys compare equal, the sortsub's return value will be the result of the secondary subkey comparison.

An example will illustrate this 'ladder' of comparisons better than more text. Here is a three-subkey sort:
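In this sketch, KEY1(), KEY2(), and KEY3() stand in for the per-subkey extraction code, as before; the third subkey is compared numerically:

```perl
@out = sort {
    KEY1($a) cmp KEY1($b)    # primary subkey
        ||
    KEY2($a) cmp KEY2($b)    # secondary subkey
        ||
    KEY3($a) <=> KEY3($b)    # tertiary (numeric) subkey
} @in;
```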

In the two previous examples, we showed a sort with relatively expensive sortkey extraction (via split), and a multi-subkey sort. Let's combine them. For concreteness, we shall deal with a problem that has received much attention in comp.lang.perl.misc -- sorting a list of IP addresses in 'dotted-quad' form. Each element of the list is a string of the form "nnn.nnn.nnn.nnn\tabc.xyz.com\n", where nnn represents a decimal integer between 0 and 255, with or without leading zero-padding.

In the most naive approach, we sort on each of these four numeric fields as individual subkeys, in succession.
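A sketch of this naive sort (the regex used for the extraction is our own choice):

```perl
# Extract the four octets of each operand and compare them numerically,
# in succession. The extraction is redone at every comparison.
@out = sort {
    my @x = $a =~ /^(\d+)\.(\d+)\.(\d+)\.(\d+)/;
    my @y = $b =~ /^(\d+)\.(\d+)\.(\d+)\.(\d+)/;
    $x[0] <=> $y[0] ||
    $x[1] <=> $y[1] ||
    $x[2] <=> $y[2] ||
    $x[3] <=> $y[3]
} @in;
```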

Benchmark A2 shows that comparing the subkeys in pairs is less efficient than packing them and comparing the packed strings. This observation applies to all sorting methods. In further benchmarks of advanced sorts for this problem, we will always use packed sortkeys.

Nevertheless, naive sorting is still woefully inefficient, because both sortkeys are recomputed every time one input operand is compared against another. What we need now is a way to compute each sortkey once only and to remember the result.

Advanced sorts

As all sorts in Perl use the built-in sort function and therefore the same quicksort algorithm, all Perl sorts are of order O(N*logN). We can't improve upon that, so we have to address other issues to gain efficiency. As the complexity is fixed, tackling the constant factors can be fruitful and, in the real world, can produce significant improvements in efficiency. When a sortsub needs to generate a complex sortkey, that is normally done O(N*logN) times, but there are only N records, hence N sortkeys. What if we were to extract the sortkey only once per record, and keep track of which sortkey belonged to which record?

Caching the sortkeys

The obvious way to associate sortkeys with the records from which they were created is to use a hash. The hash can be created in a preprocessing pass over the data. If the approximate size of the data set is known, preallocating the hash improves performance.

my %cache;
keys(%cache) = scalar @in;   # preallocate the hash buckets
$cache{$_} = KEY($_) for @in;

The following sets up the cache more efficiently, using a hash slice:

my %cache;
keys(%cache) = scalar @in;   # preallocate the hash buckets
@cache{@in} = map KEY($_), @in;

Then the sortsub simply sorts by the values of the cached sortkeys.

@out = sort {
    $cache{$a} cmp $cache{$b}
} @in;

In essence, we have replaced lengthy computations in the sortsub by speedy (O(1)) hash lookups.

If you want to do a complex multi-key comparison, you either have to use a separate cache for each subkey or combine subkeys in a similar way to the packed-sort optimizations we will describe later. Here is an example of the former:
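A sketch with two subkey caches (KEY1() and KEY2() are the usual placeholders):

```perl
my (%cache1, %cache2);
$cache1{$_} = KEY1($_) for @in;
$cache2{$_} = KEY2($_) for @in;

@out = sort {
    $cache1{$a} cmp $cache1{$b}
        ||
    $cache2{$a} cmp $cache2{$b}
} @in;
```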

An important point about cached sorts is that no postprocessing is needed to retrieve the sorted records. The method sorts the actual records, but uses the cache to reduce the sortkey extraction to O(N).

The Orcish Maneuver (OM)

The Orcish Maneuver (invented by Joseph N. Hall [8]) eliminates the preprocessing pass over the data, which might save keeping a copy of the data if they are being read directly from a file. It does the sortkey extraction only once per record, as it checks the hash to see if it was done before. The test and storage of the sortkey is done with the ||= operator (short-circuit or-assignment), which will evaluate and assign the expression on the right to the lvalue on the left, if the lvalue is false. The name 'orcish' is a pun on 'or-cache'. The full statement in the sortsub looks like this:
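In outline (KEY() again stands in for the extraction code):

```perl
my %or_cache;
@out = sort {
    ( $or_cache{$a} ||= KEY($a) )
        cmp
    ( $or_cache{$b} ||= KEY($b) )
} @in;
```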

The OM has some minor efficiency flaws. An extra test is necessary after each sortkey is retrieved from the or-cache. Furthermore, if an extracted sortkey has a false value, it will be recomputed every time. This usually works out all right, because the extracted sortkeys are seldom false. However, except when the need to avoid reading the data twice is critical, the explicit cached sort is always slightly faster than the OM. (See Benchmark A3.)

The Schwartzian Transform (ST)

A more efficient approach to caching sortkeys, without using named temporary variables, was popularized by Randal L. Schwartz, and dubbed the Schwartzian Transform [9, 10]. (It should really be called the Schwartz Transform, after the model of the Fourier and Laplace Transforms, but it is too late to fix the name now.)

The significant invention in the ST is the use of anonymous arrays to store the records and their sortkeys. The sortkeys are extracted once, during a preprocessing pass over all the data in the list to be sorted (just as we did before in computing the cache of sortkeys).

The ST doesn't sort the actual input data. It sorts the references to anonymous arrays that contain the original records and the sortkeys. So we have to postprocess to retrieve the sorted records from the anonymous arrays.
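A sketch of the ST for the IP-address data, using a packed sortkey (the pack format is our own choice; pack 'C4' turns the four octets into a 4-byte string whose lexicographic order matches their numeric order):

```perl
# Read from the bottom up:
#   1. wrap each record and its packed sortkey in an anonymous array
#   2. sort the array refs on the packed sortkey
#   3. retrieve the original records from the anonymous arrays
@out = map  { $_->[0] }
       sort { $a->[1] cmp $b->[1] }
       map  { [ $_, pack('C4', /^(\d+)\.(\d+)\.(\d+)\.(\d+)/) ] }
       @in;
```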

Using the ST for a multi-subkey sort is straightforward. Just store each successive extracted subkey in the next entry in the anonymous array. In the sortsub, do an or between comparisons of successive subkeys, as with the OM and the naive sorts.

For a very illuminating deconstruction and reconstruction of the ST, see [11].

The packed-default sort

Each of the advanced sorting techniques described above saves the operands to be sorted together with their sortkeys. (In the cached sorts, the operands are the keys of a hash and the sortkeys are the values of the hash; in the Schwartzian Transform, the operands are the first elements of anonymous arrays, the sortkeys are the other elements of the arrays.) We now extend that idea to saving the operands to be sorted together with packed-string sortkeys, using concatenation.

This little-known optimization improves on the ST by eliminating the sortsub itself, relying on the default lexicographic sort, which as we showed earlier is very efficient. This is the method used in the new Sort::Maker module.

To accomplish this goal, we modify the ST by replacing its anonymous arrays by packed strings. First we pack into a single string each subkey followed last by the operand to be sorted. Then we sort lexicographically on those strings, and finally we retrieve the operands from the end of the strings.

Several methods can be used, singly or in combination, to construct the packed strings, including concatenation, pack, or sprintf. Several methods can be used to retrieve the operands, including substr (shown here), which is likely to be the fastest, split, unpack or a regex.
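A sketch for the IP-address data (the pack/substr choices are ours; because the packed sortkey has a fixed length of 4 bytes, substr can strip it cheaply):

```perl
# Read from the bottom up:
#   1. prepend the packed sortkey to each operand
#   2. sort the packed strings with the fast default comparison
#   3. strip the 4-byte sortkey to recover the sorted operands
@out = map { substr $_, 4 }
       sort
       map { pack('C4', /^(\d+)\.(\d+)\.(\d+)\.(\d+)/) . $_ }
       @in;
```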

Multiple subkeys are simply concatenated, suitably delimited if necessary. Techniques for computing subkeys of various types are presented in Appendix B.

Benchmarks of the packed-default sort

Benchmark A4 compares the two most advanced general-purpose sorting techniques, ST and packed-default. These multi-stage sorts are measured both as individual stages with saved intermediate data and as single statements.

The packed-default sort is about twice as fast as the ST, which is the fastest familiar Perl sorting algorithm.

Earlier, we showed a trivial sort using the lc function. Even for that case, the packed-default sort provides better performance when more than a few data items are being sorted. See Benchmark A5, which shows quasi-O(N) behavior for the packed-default sort (because the sorting time is small relative to the sortkey extraction).

Sorting a list of arrays or hashes

Consider the common problem of sorting a two-dimensional data structure, a list of references to arrays or to hashes, where the sortkeys are functions of the values of the submembers.

If we were to use the packed-default method, the references would be converted to strings and appended to the sortkeys. After sorting, the operands could be retrieved as strings, but would no longer be usable as references. Instead, we must use the indexes of the list members as the operands to be sorted.

The following benchmark compares a packed-sortkey ST sort with an indexed sort that uses the packed-default approach. The list being sorted comprises references to arrays, each of which has two elements: an IP address (which serves as the primary sortkey), and a domain name (which serves as the secondary sortkey). These are the same data as used in the above benchmarks, split into two array elements.
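A sketch of the indexed sort (the packing choices are ours). Each element of @in is a reference to [IP address, domain name]; the packed string ends with a fixed-length index, which is used after the sort to fetch the original references:

```perl
my $i = 0;
@out = map  { $in[ unpack 'N', substr($_, -4) ] }
       sort
       map  { pack('C4', $_->[0] =~ /(\d+)\.(\d+)\.(\d+)\.(\d+)/)  # primary subkey
              . $_->[1] . "\0"                                     # secondary subkey
              . pack('N', $i++) }                                  # index tie-breaker
       @in;
```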

The indexed sort is faster than the ST once again. (See Benchmark A6.)

Indexed sorts and stable sorts

In the indexed sort, the auto-incrementing index $i ensures that no array records will have identical packed sortkeys. It also ensures that the sort will be stable.

Any Perl sort can be stabilized by using such an index as the final tie-breaking subkey. For an indexed sort, the index is actually the operand being sorted. This fact offers another possible performance advantage for the indexed sort. The actual records to be sorted (which may be long strings) need not be appended to the sortkeys, which would create a second copy of each record. Using the indexed sort, the records may be recovered after the sort from the original data, using the sorted indexes.

The Sort::Maker module

Sort::Maker is on CPAN and implements the GRT (the Guttman-Rosler Transform, the packed-default technique described above) for all types of Perl values.

Conclusions

Packing of subkeys into strings that can be compared lexicographically improves the performance of all sorting techniques, relative to the usual method of comparing the individual subkeys in pairs.

Packing the operands with the sortkeys allows the sort to be done using the default ascending lexicographic comparison (without a sortsub). This yields a markedly faster sort than the Orcish Maneuver or the Schwartzian Transform. The sorting process may approximate O(N) behavior, because the O(N*logN) time for the sort itself is small compared to the time required to extract the sortkeys.

The packed-sortkey sort may be written explicitly, or the new Sort::Maker module may be used.

Acknowledgments

This idea was brought to our attention by Michal Rutka [12]. John Porter participated in initiating this project and reviewed a draft of the paper.

Appendix A: Benchmarks

A caveat: Useful benchmarking depends on judicious isolation of relevant variables, both in the algorithms being benchmarked and in the data sets used. Different implementations may give different relative results even with the same algorithms and data. Thus all such results should be verified under your own conditions. In short, your mileage may vary.

In the following benchmarks, all data represent the time (in microseconds) per line in the input data, which averages 35 characters per line. All named arrays and hashes are preallocated, which reduces the variance in the measurements due to storage allocation.

Appendix B: Constructing subkeys

To create and combine the subkeys and the operand to be sorted, any combination of concatenation, interpolation, pack, or sprintf may be used, the latter two with simple or compound formats.

Fixed-length strings (ascending):

simple interpolation

pack('... An ...', ...)
sprintf('... %s ...', ...)

Fixed-length strings (descending):

Bit-complement the string first.

$subkey = $string ^
"\xFF" x length $string

Then handle as an ascending fixed-length string.

Null bytes ("\0") are used to terminate string subkeys of varying length, as that ensures lexicographic ordering. If a string subkey may contain a null byte, then it must be of fixed length. If any of the operands to be sorted may contain null bytes, then every subkey must have fixed length.

Varying-length strings (ascending):

Terminate the string with a null byte, to separate it from succeeding subkeys or the operand.

interpolation:

"$string\0"

pack('... A* x ...', ...)

sprintf("... %s\0 ...", ...)

Varying-length strings (descending):

Make a prepass over the data to find the length of the longest string.