[This page is completely unreadable. Would it make sense to put the programs into their own sub-pages?]

About the k-nucleotide benchmark

Each program should

read line-by-line a redirected FASTA format file from stdin

extract DNA sequence THREE

define a procedure/function to update a hashtable of k-nucleotide keys and count values, for a particular reading-frame, even though we'll combine k-nucleotide counts for all reading-frames (grow the hashtable from a small default size)

write the code and percentage frequency, for all the 1-nucleotide and 2-nucleotide sequences, sorted by descending frequency and then ascending k-nucleotide key

count all the 3- 4- 6- 12- and 18-nucleotide sequences, and write the count and code for specific sequences

We use FASTA files generated by the fasta benchmark as input for this benchmark. Note: the file may include both lowercase and uppercase codes.

Correct output for this 100KB input file (generated with the fasta program, N = 10000) is:

In practice, less brute-force approaches would be used to calculate k-nucleotide frequencies; see for example Virus Classification using k-nucleotide Frequencies and A Fast Algorithm for the Exhaustive Analysis of 12-Nucleotide-Long DNA Sequences. Applications to Human Genomics (105KB pdf).
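To make the steps above concrete, here is a minimal sketch using Data.Map as a portable stand-in for the hashtable the spec mentions. The names extractThree and kmerCounts are invented for this sketch, and only the 2-nucleotide frequency output is shown.

    import Data.Char (toUpper)
    import Data.List (isPrefixOf, sortBy, tails)
    import qualified Data.Map as M

    -- Pull out the sequence labelled THREE, upcasing the codes,
    -- since the input mixes lower- and uppercase.
    extractThree :: String -> String
    extractThree = map toUpper . concat
                 . takeWhile (not . isPrefixOf ">")
                 . drop 1
                 . dropWhile (not . isPrefixOf ">THREE")
                 . lines

    -- Count every k-length window of the sequence.
    kmerCounts :: Int -> String -> M.Map String Int
    kmerCounts k s = M.fromListWith (+)
      [ (w, 1) | t <- tails s, let w = take k t, length w == k ]

    main :: IO ()
    main = do
      dna <- fmap extractThree getContents
      let counts = M.toList (kmerCounts 2 dna)
          total  = fromIntegral (sum (map snd counts)) :: Double
          -- descending frequency, then ascending key
          ordered = sortBy (\(ka, na) (kb, nb) ->
                      compare (negate na, ka) (negate nb, kb)) counts
      mapM_ (\(key, n) ->
              putStrLn (key ++ " " ++
                        show (100 * fromIntegral n / total)))
            ordered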

This shootout benchmark has really nailed a deficiency in the standard libraries. We could work around not having fast ASCII strings with a few extra functions, but not having a fast associative data structure is another matter. We cannot win this one on speed; this entry's approach cannot win on lines of code; and memory usage is dominated by the data. That is why I feel the main goal is making a new entry that reads well and avoids the memory leak of the old entry. -- ChrisKuklewicz

3 On mutable data structures in Haskell

This benchmark is completely bottlenecked by Data.HashTable (in GHC 6.4.1), and this discussion is based on the assumption that I am using HashTable optimally. I downloaded the GDC 0.17 compiler and the DMD entry to benchmark on my machine. The DMD entry uses the "associative arrays" built into the language ("int[char[]] frequencies") and places second (running in 3.0s). The winning entry is interesting: since the C language does not come with a hash table, the code uses #include "../../Include/simple_hash.h", which pulls in a dead-simple, inline, string-to-int hashtable, and runs in 2s.

The entry below runs 16 times slower than the DMD entry on my PowerBook G4. Profiling shows the bottleneck. I downloaded simple_hash.h and shamelessly optimized it to replace Data.HashTable for exactly this benchmark code. This sped up the proposed entry by a factor of 4.1, so that it is now about a factor of 4 slower than the DMD entry. This shows that Data.HashTable is doing *at least* four times more work than is needed for this usage pattern. And even with my over-specialized hash table, Haskell is almost 50% slower than OCaml's "module H = Hashtbl.Make(S)" (note that my code uses the same hash function as OCaml). Unfortunately I cannot submit this optimized hash table entry to the shootout.
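For reference, the usage pattern being measured is essentially just this. A sketch written from memory against the GHC 6.4-era Data.HashTable API (new takes an equality test and a hash function); bump and countAll are names made up here.

    import qualified Data.HashTable as H

    -- One counted key: look it up, then insert or overwrite.
    -- H.update replaces an existing entry and reports whether
    -- one was already there.
    bump :: H.HashTable String Int -> String -> IO ()
    bump ht key = do
      mv <- H.lookup ht key
      case mv of
        Nothing -> H.insert ht key 1
        Just n  -> H.update ht key (n + 1) >> return ()

    -- H.hashString is the library's own string hash.
    countAll :: [String] -> IO (H.HashTable String Int)
    countAll keys = do
      ht <- H.new (==) H.hashString
      mapM_ (bump ht) keys
      return ht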

The only mutable data structure that comes with GHC besides arrays is Data.HashTable, which is not competitive with OCaml's Hashtbl or DMD's associative arrays (unless there is something great hidden under Data.Graph). Is there any hope for GHC 6.6? Does anyone have pointers to an existing library at all? Perl, Python, and Lua also have excellent built-in hashtable capabilities. Where is a good library for Haskell?

4 Data.Map #1

I built the following first using naive IO, very closely following the OCaml implementation. It ran in about 85 seconds. I then lifted the Seq and advancePtr code from Chris and Don's "Haskell #2" entry. It now runs in under 4 seconds on my machine, faster than anything but the C and D entries. Note that "updateCount" is there to satisfy the shootout's demand for an updater function. --BrianSniffen
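The shape of the spec-mandated updater, sketched with Data.Map (updateCount here is a one-liner over insertWith and need not match the entry's exact code):

    import Data.List (foldl')
    import qualified Data.Map as M

    -- The spec-mandated updater is a one-liner over insertWith.
    updateCount :: M.Map String Int -> String -> M.Map String Int
    updateCount m k = M.insertWith (+) k 1 m

    -- Fold the updater across every k-length window; foldl' keeps
    -- the fold strict in the accumulating map.
    countAll :: Int -> String -> M.Map String Int
    countAll k s = foldl' updateCount M.empty
      [ take k (drop i s) | i <- [0 .. length s - k] ]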

Note: don't submit entries with {- -} style comments. They aren't
lexed correctly on the shootout, and end up contributing towards our
line count score. Use -- comments only -- DonStewart

4.2 Data.Map #3 (ByteString)

I cobbled this together based on some other entries. It could be a little clearer and certainly a little faster, but is probably a good starting point.
It runs faster on the file that I have than any of the other versions I benchmarked.
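The core trick is that ByteString windows are O(1) slices into one shared buffer, which is where the win over String comes from. A minimal sketch, assuming Data.ByteString.Char8 and Data.Map (kmerCounts is a name invented here):

    import qualified Data.ByteString.Char8 as B
    import qualified Data.Map as M

    -- B.take and B.drop do not copy; every window is a view into
    -- the same underlying buffer.
    kmerCounts :: Int -> B.ByteString -> M.Map B.ByteString Int
    kmerCounts k s = M.fromListWith (+)
      [ (B.take k (B.drop i s), 1) | i <- [0 .. B.length s - k] ]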

5 KetilMalde: Trie-based entry

I guess this is considered cheating, since it is the easy and natural way to do it :-) While the specification says "hashtable", I interpret that to mean "standard associative data structure", so using Data.Map would be okay, and perhaps even the true Haskell way. The reason I still consider this entry cheating is that it relies on laziness to compute only the requested frequencies, instead of, as the benchmark specifies, calculating -- and subsequently discarding -- all frequencies of the given word lengths.

If I were a mean-spirited person (and I am not, no, really), I would point this out as yet another benchmark artificially penalizing a language where it is easy and natural to avoid unnecessary computation. As it is, this can perhaps go in the "Interesting Alternatives" category (as Chris points out).

Note that most of the time is now spent parsing input; if somebody wanted to improve it further, using FPS or something similar would be the way to go.
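A minimal sketch of the lazy-trie idea (an illustration, not Ketil's actual entry): each node counts the suffixes reaching it, and children are forced only along the paths of the k-mers actually queried, so unrequested frequencies are never computed.

    import Data.List (tails)

    -- Four-way suffix trie over the DNA alphabet.  Node n a c g t:
    -- n suffixes reach this node; a, c, g, t are the lazy subtries.
    data Trie = Node Int Trie Trie Trie Trie

    build :: [String] -> Trie
    build ss = Node (length ss)
                    (build [t | 'A':t <- ss])
                    (build [t | 'C':t <- ss])
                    (build [t | 'G':t <- ss])
                    (build [t | 'T':t <- ss])

    -- Occurrences of a word = number of suffixes having it as a
    -- prefix; only the queried branches are ever forced.
    count :: Trie -> String -> Int
    count (Node n _ _ _ _) []     = n
    count (Node _ a c g t) (x:xs) = case x of
      'A' -> count a xs
      'C' -> count c xs
      'G' -> count g xs
      'T' -> count t xs
      _   -> 0

    -- e.g.  count (build (tails dna)) "GGTATT"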

In my opinion, we can always exploit laziness. If a spec requires us to compute something that isn't used, we should just let our language take care of this. The spec can always be changed if they want to explicitly penalise lazy languages (which I don't think they do -- they just don't think of the alternatives). So, laziness is definitely ok. -- Don

5.1 (FASTEST) Generalized Trie with Assoc List #2

This improves on the amount of heap allocation. "insertAll" constructs the entries in the association lists exactly once, and the TValues are constructed exactly once. This is unlike the iterative construction in the previous code: Trie, Trie w/map, and Trie w/assoc list. -- ChrisKuklewicz
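A rough sketch of the construct-once idea (an illustration, not the entry's actual insertAll): building each level of the trie in a single pass means every association-list pair and every count is allocated exactly once, with no intermediate versions to throw away.

    -- Counting trie with association-list branches, built level by
    -- level; each (Char, Trie) pair and each count is made once.
    data Trie = Trie Int [(Char, Trie)]

    build :: Int -> [String] -> Trie
    build 0 ws = Trie (length ws) []
    build k ws = Trie (length ws)
      [ (c, build (k - 1) [t | c':t <- ws, c' == c]) | c <- "ACGT" ]

    -- e.g.  build 6 [take 6 t | t <- tails dna]  counts all 6-mers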

6 Current entry

This is an attempt to make the standard HashTable perform well enough. On my PowerBook G4 it is slightly faster than the current shootout entry and slightly faster than Don's packed-buffer version below. It uses 43 MB of RSIZE. The winning C entry uses a hash table at the shootout site; perhaps we should FFI the same hash table code.

7.3 Bjorn Bringert

I think that the entries above are cheating a bit. In writeCount, they use filter to count the occurrences of the given nucleotide, but do not count all the other ones. The benchmark description says: "count all the 3- 4- 6- 12- and 18-nucleotide sequences, and write the count and code for specific sequences". The description also says that the nucleotide sequences should be "sorted by descending frequency and then ascending k-nucleotide key", but the entries above only sort by frequency.
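For reference, the two-level ordering the description asks for is easy to state (a sketch; sortCounts is a name made up here):

    import Data.List (sortBy)

    -- Descending count first, then ascending k-nucleotide key.
    sortCounts :: [(String, Int)] -> [(String, Int)]
    sortCounts = sortBy cmp
      where
        cmp (k1, n1) (k2, n2) = case compare n2 n1 of
                                  EQ    -> compare k1 k2
                                  other -> other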

My solution uses less memory (193 MB vs 268 MB on my machine), but takes more than twice as much time (74 s vs 32 s). I have tried using Data.HashTable, but the performance seems to be about the same as when using Data.Map. Using FastPackedString reduces the memory use to 91 MB and the time to 52 s, but FastPackedString is not available on the benchmark machine, I guess.