Introduction

The Haskell standard libraries currently provide Map and Set
implementations based on balanced binary trees. It is possible
to improve on the performance of these marginally (by a factor
of two or so at best, less typically) by using better and cheaper
tree type, such as AVL trees. However, the fundamental problem with
all such implementations is that the search used by common lookup
and insert operations still requires O(log n) potentially
expensive comparison operations (n is the number of entries
in the Map). To understand how inefficient this can be, imagine
what is involved in lookup on a large map where the keys are all filepaths
that just happen to share exactly the same 20 character prefix.

This solution to this problem for character strings and similar
list like keys has been known for a long time, that is to use
some form of trie. Generalised tries extend this idea to provide
effiecient Maps for arbitrary "sum of product" types (a typical
algebraic data type in Haskell). The basic idea is that for
sum types you maintain separate maps for each data constructor
and for products you use nested maps (maps of maps of maps..).

Googling will reveal many interesting papers on this subject, but
my personal impression is that with many implementations simplicity
and elegance has been given priority over performance and API
completeness. In reality producing a useable and efficient
generalised trie implementation is not at all simple. The sum of
product rules given above will yield a rather inefficient
implementation for lists, for example.

Implementations should also bear in mind (and ideally exploit) the
fact that with nested maps of maps.. singleton maps will be very
common.

There is also an important invariant that must be maintained with
nested maps of maps.. This is that empty "inner" maps are illegal.
If some operation yields an empty "inner" map the corresponding
entry in the "outer" map should be deleted (which may cause the
outer map to also become empty).

Immediate Project Goals

This is what I think could reasonably be achieved in a summer of
code project.

1- Define the generalised trie class API (or APIs)

The class definition should make appropriate use of the new type
families extension provided by ghc. The class API needs to be
complete, in the sense that it must contain all the methods needed
to create new instances from existing instances (not just those
that a typical Map user would need). Writing a few common instances
by hand will help in discover exactly what a complete API should
look like.

It may be that several closely related classes are needed.
For example, tries generally don't store copies of their keys.
This has some efficiency advantages but also means that "withKeys"
operations can be expensive as the actual keys must be reconstructed
from the trie. It may be advantageous to have a separate class for
tries which do store copies of the keys, and this class is where
various "withKeys" methods would live.

There is also a potential issue with having an Ord class constraint
on keys. This has definite performance advantages for some operations and also enables fast trie based sort implementations. But some people feel that Ord instances should be meaningful in some
deeper philosophical sense, not just the arbitrary total ordering of
concrete data types that you would get by deriving Ord. There is
also a practical problem in that despite the Ord constraint, a trie
implementation will not actually use any Ord class methods. It's
ordering will be consistent with that that would have been defined
by a wholly derived Ord instance, but how do we ensure ordering
consistency with hand written Ord instances?

One possible solution might be for the generalised trie class to
define it's own private total ordering of keys (and corresponding
compare method) which may or may not agree with the ordering
defined by the Ord instance (if it exists).

2- Produce a few hand written instances of the class API(s)

These should include common standard types like Ints, Lists,
Strings and Byte Strings. Also a common default implementation
of the same APIs for any Ord instance is needed (based on
AVL trees for example).

The purpose of this exercise is to help ensure that the defined
class APIs are reasonably complete and usable (in the sense
described above) and also to provide the community with something
that is actually useable in a reasonable time scale.

This isn't the perfect solution, but even providing a uniform Map
API with efficient specialised (type specific) implementations for
this limited range of types will still be a big improvement on
the current situation.

3- Write a test and benchmarking suite for any generalised trie class instance

It is obviouly desirable that as far as is possible we ensure
that defined class instances are correct. Being able to assess
relative efficiencies of multiple possible candidate
implementations would be good too.

Future Project Goals

In the long run what we need is some way to automatically derive
generalised trie instances (from data type definitions). I'm
reluctant to make this part of this years SOC project definition
as I think this is not realistically achievable in the time
available (given the likely size and complexity of the generalised
trie API). It might be realistic in future years assuming all
the necessary preliminaries described above are in place.

But I'd be happy to be proven wrong about this too! So if anyone
wants to give it a shot that would be very welcome.

Starting Point (where we're at now - 2008)

..you can find my own initial attempts at this. This includes a preliminary
generalised trie class API (called GT) using functional dependencies, a few
hand written instances for common prelude types and a default instance for
any Ord instance (which uses AVL trees).

This may be modified, finished (or just ignored) at your discretion :-)

I would encourage interested parties to at least look at the implementation
for Lists (called ListGT) and understand how and why it differs from that
that would be obtained from the simple sum of product rules for generalised
tries. Maybe the same idea could be used for arbitrary product types. By
that I mean treat an arbitrary product type as if it were an heterogeneous
list of known length (using tuples perhaps).

Starting Point (where we're at now - 2009)

There seems to have been some interest in this project from students for 2009. Anyone wishing to submit a proposal for this should be aware that Jamie Brandon did a lot of work on this
for 2008. But it's still not finished, so there is room for
2009 applications and further work.

Currently we have preliminary class definitions, a few hand
written instances and a test suite as a package on hackage (gmap).
This makes use of functional dependencies.

The darcs repository for the library is in the process of being converted to use type families (instead of functional dependencies), but can't be compiled at the moment (apparently there's a problem with ghc type families implementation).

One thing that seems to be missing from this post is that we need
a way of deriving instances for arbitrary key types. The class API
is somewhat huge and writing instances by hand is not really a practical proposition for most users.