Saturday, August 14, 2010

Reading F# Projects, Part II: F# Set

The design patterns in a data structure project like F# Set is simple. However, understanding it really needs time as data structures involve heavy generic programming and the type system in F# (or any other static FP language) is very complex.

In this post, we will explore two aspects of F# Set:

1. The general implementation structure for a generic data structure. After knowing the pattern, we could write other data structures in F# and these data structures are generic enough.

2. Know something about generics in F# and type inference. We will come to this part in later readings.

In F# source code, the immutable set implementation (set.fs) is a good one for us to start for data structures. Why not more fundamental data structures like F# sequences? Because those ones are more involved in .Net platform and F# Set is nearly pure F# code. Also because:

I love trees!

F# Set and Map are implemented using balanced binary search tress, to be exact, AVL tress. Balanced BSTs support insert, delete and find operations all in O(log n). So it is widely used for implementing dynamic sets.

In general, trees are the most beautiful data structures and have so many variants. They are also efficient, e.g. R tree and friends are the fundamental data structures used in every database. Quadtrees are used for computational geometry in theory and spatial database in practice.

The F# Set is implemented by an immutable AVL tree. Here immutable means that every node, once created, its values could be be changed. In CS literature, researchers use “persistent” to name immutable data structures. If you do 100 insertions to an empty tree, you can have 101 version of trees, where common nodes are shared to reduce the space cost. It is proved that in such a tree, inserting operation still costs O(log n) time, and O(log n) space for duplicating some nodes in the tree (usually along the insertion path).

As this post is not intended to be a tutorial on data structures, I won’t go into details for the tree implementation. We mainly study how an industry-strength data structure is implemented in F#.

The main structure

set.fs contains about 840 lines of code. The code skeleton is shown below:

Under Microsoft.FSharp.Collections namespace, there are three things: SetTree<’T> type, Set module and SetTree module.

SetTree is a type constructor, the parameter is a comparable type ‘T.

Most of the code is in SetTree internal module:

the first part contains the code to manipulate the tree structure, e.g. inserting a node, deleting a node, etc.

Type SetIterator<’T> and related functions are for implementing IEnumerable interface. Set<’T> (comparer, tree) is the implementation for Set.

Set module contains the common function to manipulate a set.

In a summary, A container, like Set, usually follows this pattern:

1. An implementation of the underlying data structure, here it is AVL tree.

2. A module for manipulating the data structures; members functions also can do this. Usually we keep functions in both places.

3. Implement the IEnumerable and IEnumerable<’T> interface, so that the container/data structure could be treated as a sequence.

These are kind of rules for implementing new containers in F#, e.g. F# Map.

Detail 1. The generic

It is meaningless if the Set implementation only supports integers, or floats. We want it to be generic, i.e., support any type. Since the underlying implementation is based on trees, so the base type should also support comparison.

The other part of generic, which is more subtle, is the comparer. There are situations, we have different string sets. However, the rules to order them are not the same, i.e., strings maybe compared using different rules. So we should add another parameter into consideration. (Refer to Functor in OCaml if you want to see a more elegant more to deal with this situation.)

Although F# Set currently does not support different comparers. Its internal implementation keeps comparer in mind. (Because F# supports OOP, the different comparers problem could be partially solved by inherit the original element type. Anyway, this is non-elegant.)

We know that once a Set has some concrete values in it, the type is clear. Here are two ways to initialize a Set:

We can see that they get the comparer for type ‘T and use that comparer to pass into the constructor of type Set<’T>.

Detail 2. The comparer

The above code get the comparer for type ‘T using FastGenericComparer, which is in a big file prim-types.fs. The primitive type module should be discussed separately and thoroughly. There are quite a few ad hoc tricks in the implementation.

Here we just need to know that it looks up a table to find a comparer and if the type is in the primitive types, then there are optimizations for them.

Detail 3. From the empty to concrete

F# allows you to write [], Set.Empty. These empties seem to be dynamic. Actually no. You cannot write sth like:

let empties = Array.create 100 Set.empty

the compiler gives the following error:

error FS0030: Value restriction. The value 'empties' has been inferred to have generic type

val empties : Set<'_a> [] when '_a : comparison

Because the compiler cannot generalize these empties into a single type. It is safe to write