Archives

Categories

Subscriptions & Feedback

Faster Command Line Tools in D

Jon Degenhardt is a member of eBay’s search team focusing on recall, ranking, and search engine design. He is also the author of eBay’s TSV utilities, an open source data mining toolkit for delimited files. The toolkit is written in D and is quite fast. Much of its performance derives from approaches like those described here.

This post will show how a few simple D programming constructs can turn an already fast command line tool into one that really screams, and in ways that retain the inherent simplicity of the original program. The techniques used are applicable to many programming problems, not just command line tools. This post describes how these methods work and why they are effective. A simple programming exercise is used to illustrate these optimizations. Applying the optimizations cuts the run-time by more than half.

Task: Aggregate values in a delimited file based on a key

It’s a common programming task: Take a data file with fields separated by a delimiter (comma, tab, etc), and run a mathematical calculation involving several of the fields. Often these programs are one-time use scripts, other times they have longer shelf life. Speed is of course appreciated when the program is used more than a few times on large files.

The specific exercise we’ll explore starts with files having keys in one field, integer values in another. The task is to sum the values for each key and print the key with the largest sum. For example:

A 4
B 5
B 8
C 9
A 6

With the first field as key, second field as value, the key with the max sum is B, with a total of 13.

Fields are delimited by a TAB, and there may be any number of fields on a line. The file name and field numbers of the key and value are passed as command line arguments. Below is a Python program to do this:

The program follows a familiar paradigm. A dictionary (collections.Counter) holds the cumulative sum for each key. The file is read one line at a time, splitting each line into an array of fields. The key and value are extracted. The value field is converted to an integer and added to the cumulative sum for the key. After the program processes all of the lines, it extracts the entry with the largest value from the dictionary.

The D program, first try

It’s a common way to explore a new programming language: write one of these simple programs and see what happens. Here’s a D version of the program, using perhaps the most obvious approach:

Processing is basically the same as the Python program. An associative array (long[string] sumByKey) holds the cumulative sum for each key. Like the Python program, it splits each line into an array of fields, extracts the key and value fields, and updates the cumulative sum. Finally, it retrieves and prints the entry with the maximum value.

We will measure performance using an ngram file from the Google Books project: googlebooks-eng-all-1gram-20120701-0 (ngrams.tsv in these runs). This file is 10.5 million lines, 183 MB. Each line has four fields: the ngram, year, total occurrence count, and the number of books the ngram appeared in. Visit the ngram viewer dataset page for more information. The file chosen is for unigrams starting with the digit zero. Here are a few lines from the file:

The time command was used to measure performance. e.g. $ time max_column_sum_by_key.py ngrams.tsv 1 2. On the author’s MacBook Pro, the Python version takes 12.6 seconds, the D program takes 3.2 seconds. This makes sense as the D program is compiled to native code. But suppose we run the Python program with PyPy, a just-in-time Python compiler? This gives a result of 2.4 seconds, actually beating the D program, with no changes to the Python code. Kudos to PyPy, this is an impressive result. But we can still do better with our D program.

Second version: Using splitter

The first key to improved performance is to switch from using split to splitter. The split function is “eager”, in that it constructs and returns the entire array of fields. Eventually the storage for these fields needs to be deallocated. splitter is “lazy”. It operates by returning an input range that iterates over the fields one-at-a-time. We can take advantage of that by avoiding constructing the entire array, and instead keeping a single field at a time in a reused local variable. Here is an augmented program that does this, the main change being the introduction of an inner loop iterating over each field:

The modified program is quite a bit faster, running in 1.8 seconds, a 44% improvement. Insight into what changed can be seen by using the --DRT-gcopt=profile:1 command line option. This turns on garbage collection profiling, shown below (output edited for brevity):

(Note: The --DRT-gcopt=profile:1 parameter is invisible to normal option processing.)

The reports show two key differences. One is the ‘max pool memory’, the first value shown on the “GC summary line”. The significantly lower value indicates less memory is being allocated. The other is the total time spent in collections. The improvement, 145ms, only accounts for a small portion of the 1.4 seconds that were shaved off by the second version. However, there are other costs associated with storage allocation. Note that allocating and reclaiming storage has a cost in any memory management system. This is not limited to systems using garbage collection.

Also worth mentioning is the role D’s slices play. When splitter returns the next field, it is not returning a copy of characters in the line. Instead, it is returning a “slice”. The data type is a char[], which is effectively a pointer to a location in the input line and a length. No characters have been copied. When the next field is fetched, the variable holding the slice is updated (pointer and length), a faster operation than copying a variable-length array of characters. This is a remarkably good fit for processing delimited files, as identifying the individual fields can be done without copying the input characters.

Third version: The splitter / Appender combo

Switching to splitter was a big speed win, but came with a less convenient programming model. Extracting specific fields while iterating over them is cumbersome, more so as additional fields are needed. Fortunately, the simplicity of random access arrays can be reclaimed by using an Appender. Here is a revised program:

The Appender instance in this program works by keeping a growable array of char[] slices. The lines:

fields.clear;
fields.put(line.splitter(delim));

at the top of the foreach loop do the work. The statement fields.put(line.splitter(delim)) iterates over each field, one at a time, appending each slice to the array. This will allocate storage on the first input line. On subsequent lines, the fields.clear statement comes into play. It clears data from the underlying data store, but does not deallocate it. Appending starts again at position zero, but reusing the storage allocated on the first input line. This regains the simplicity of indexing a materialized array. GC profiling shows no change from the previous version of the program.

Copying additional slices does incur a performance penalty. The resulting program takes 2.0 seconds, versus 1.8 for the previous version. This is still a quite significant improvement over the original program (down from 3.2 seconds, 37% faster), and represents a good compromise for many programs.

Fourth version: Associative Array (AA) lookup optimization

The splitter / Appender combo gave significant performance improvement while retaining the simplicity of the original code. However, the program can still be faster. GC profiling indicates storage is still being allocated and reclaimed. The source of the allocations is the following two lines in the inner loop:

The first line converts fields.data.[keyFieldIndex], a char[], to a string. The string type is immutable, char[] is not, forcing the conversion to make a copy. This is both necessary and required by the associative array. The characters in the fields.data buffer are valid only while the current line is processed. They will be overwritten when the next line is read. Therefore, the characters forming the key need to be copied when added to the associative array. The associative array enforces this by requiring immutable keys.

While it is necessary to store the key as an immutable value, it is not necessary to use immutable values to retrieve existing entries. This creates the opportunity for an improvement: only copy the key when creating the initial entry. Here’s a change to the same lines that does this:

The expression key in sumByKey returns a pointer to the value in the hash table, or null if the key was not found. If an entry was found, it is updated directly, without copying the key. Updating via the returned pointer avoids a second associative array lookup. A new string is allocated for a key only the first time it is seen.

The updated program runs in 1.4 seconds, an improvement of 0.6 seconds (30%). GC profiling reflects the change:

This indicates that unnecessary storage allocation has been eliminated from the main loop.

Note: The program will still allocate and reclaim storage as part of rehashing the associative array. This shows up on GC reports when the number of unique keys is larger.

Early termination of the field iteration loop

The optimizations described so far work by reducing unnecessary storage allocation. These are by far the most beneficial optimizations discussed in this document. Another small but obvious enhancement would be to break out of the field iteration loops after all needed fields have been processed. In version 2, using splitter, the inner loop becomes:

This produced a 0.1 second improvement. A small gain, but will be larger in use cases excluding a larger number of fields.

The same optimization can be applied to the splitter / Appender combo. The D standard library provides a convenient way to do this: the take method. It returns an input range with at most N elements, effectively short circuiting traversal. The change is to the fields.put(line.splitter(delim)) line:

Putting it all together

The final version of our program is below, adding take for early field iteration termination to version 4 (splitter, Appender, associative array optimization). For a bit more speed, drop Appender and use the manual field iteration loop shown in version two (version 5 in the results table at the end of this article).

Summary

This exercise demonstrates several straightforward ways to speed up command line programs. The common theme: avoid unnecessary storage allocation and data copies. The results are dramatic, more than doubling the speed of an already quick program. They are also a reminder of the crucial role memory plays in high performance applications.

Of course, these themes apply to many applications, not just command line tools. They are hardly specific to the D programming language. However, several of D’s features proved especially well suited to minimizing both storage allocation and data copies. This includes ranges, dynamic arrays, and slices, which are related concepts, and lazy algorithms, which operate on them. All were used in the programming exercise.

The table below compares the running times of each of the programs tested:

very nice 🙂 !
Just a question of a newbie: Lets assume I have the code of max_column_sum_by_key_v1.d at hand and the compiler option –DRT-gcopt=profile:1 indicates that the GC is “heavily” used. However, I do not know in which parts or even lines the GC is used. So I am wondering if there is a (profiling) tool to figure out which part/lines of the code trigger the GC?

Thanks for the reading. The code you posted is addressing a more narrow set of use cases than intended in the blog post. The more narrow case is certainly legitimate and interesting, but it is different and not comparable. In particular, the blog post assumes arbitrary strings as keys, as illustrated in the example file shown in the task description. Your code makes two assumptions that are more narrow. First, that keys are integers, and second, that the range of integer values is known in advance and is small enough to fit into an array. These are very significant differences.

Parallelization on the single log level barely have any sense:
1) You can’t split compressed stream as easily.
2) You rather parallelize via parallel parsing of different files than trying to parse a single one in parallel.

In other words, all technics used here are from real world practice. Parallelization on parsing of the single source is totally artificial.

It’s really quite cool that my post inspired all the follow-ups. I hadn’t seen the Kotlin post, thanks for bringing it to my attention. The Kotlin post is rather nice in that it shows a step-by-step approach to analyzing performance bottlenecks and optimizing accordingly, something that can benefit programmers beyond this one example.

I don’t plan to update the blog post to introduce new techniques. There are of course ways to speed up the program further. Indeed, D makes it easy write the same low-level code one can write in C, with equivalent performance. However, employing such low-level techniques would not fit the objective of the post, which was to gain efficiency while retaining simplicity.

Of course, parallelization is an important technique for high level programs also. I can understand your suggestion to add it to the post. However, I believe there are much better use cases for illustrating the techniques involved. My experience is that parallelization is often not a good choice for command line tools. An example where the benefits don’t have such qualifications would be better.