Fun with Distributed Hash Tables

May 13, 2009

The keen reader might have previously noticed my interest in distributed hash tables (DHTs). This interest is strong enough to motivate me to build a DHT library in Haskell, which I discuss here at an absurdly high level.

A DHT is a sort of peer to peer network typically characterized with no central or “master” node, random node addresses, and uses a form of content addressable storage. Usually implemented as an overlay network and depicted as a ring, we say each node participating in the DHT “has an address on the ring”.

To locate data first you need an address; addresses are generated in different ways for any given DHT but is commonly a hash of either the description (“Fedora ISO”), file location/name (“dht://news”), or even a hash of the file contents themselves. The lookup message is then sent to this address and in doing so will get routed to the node with the closest matching address. The routing is fish-eye: nodes have more knowledge about the nodes with closer addresses and sparse knowledge of further addresses. The result is that the average number of hops to locate a node is logarthmic to the size of the network but so are the size of any one nodes routing table, so the burden isn’t too much.

To ensure correct operation, DHTs keep track of the closest addressed nodes (typically the closest 8 on each side). These nodes make up the ‘leaf set’ and are often used for special purposes such as redundantly storing data if the DHT is for file sharing/storage. It’s “easy” to keep this correct because the final step when joining a ring is to contact the node who’s currently closest to your desired address.

Functionallity of the Haskell DHT Library
Originally I worked toward Pastry like functionallity, but then I read a paper on churn and opted to implement Chord like policies on join, stabilize and key space management.

I’ve implemented the basic operations of routing, periodic leaf set matainence, join operations requiring an atomic operation on only a single (successor) node, and IPv4 support via network-data. This is built on Control-Engine, so you can instruct nodes to route or deliver using as many haskell threads as you want. Beyond that, adding hooks is what Control-Engine is built for, so its easily to plug in modules for load conditioning, authentication, statistics gathering, and arbitrary message mutation.

Using the DHT Library
The main job in starting a node is building the ‘Application’ definition. The application must provide actions to act on delivered messages and notifications about leaf set changes and forwarded messages (giving it the opportunity to alter forwarded messages). Additionally, the app is provided with a ‘route’ method to send messages of its own. I won’t go in to depth on this right now as I’m not yet releasing the library.

Tests/Benchmarks
Using the hook instructions (and a custom Application definition) I’ve instrumented the code to log when it deals with join requests (sending, forwarding, finishing) and to show the leafset when it changes. Using the resulting logs I produced graphical representations of the ring state for various simulations (graphical work was part of a Portland State course on Functional Languages).

My student site has several simulations, but the most instructive one is LeafSet50 (22MB OGG warning!). Joining nodes are shown in the top area, active nodes are displayed in the ring at the center, thick lines are join requests being forwarded around, and groups of thin lines show the latest leaf set. Aside from revealing corrupt states caused by a broken stabilize routine, you can see some interesting facts for such a crude rendering:

A) Some areas are completely void while others are so dense that four or five nodes are overlapping almost perfectly. This tells us that, at least for sparsly populated DHTs, random assignment is horrible and can result in nodes having many orders of mangnitude larger area of responsibility than their counterparts. If I had bothered to read papers about applications based on DHT libraries then I might have known exactly how bad the situation could be, but it’s interesting to see this visually as well.

B) The simulated storm of joining nodes combined with periodic stabilization results in leaf sets being massively out of date. This concerns me less given the less-than realisic cause and the fact that everything eventually settles down (in ~120 seconds), but might bite me when I start thinking about join performance.

Other Ponderings
While I know that simulating 500 nodes takes 40% of my CPU time in the steady state (there is lots of sharing of leaf sets to make sure nothing has changed), this can be dramatically decreased by making the LeafSet data structure more efficient. Other than that, I’m just now considering the types of tests I desire to run. There are no serious benchmarks yet, but I hope to understand much more about the network performance such as:
1) Number of messages for joins
2) Number of bytes used for joins
3) Bandwidth needed for the steady state
4) Message rate in the steady state

Future Alterations:
Its not named! So job one is for me to figure out a name.

Code changes needed:
Remove nodes that don’t respond to stabilize messages (they aren’t deleted currently – they just persist).
Check the liveness of nodes appearing in the route table (only leaf set nodes are checked right now)
Generalize about the network – don’t mandate IPv4!
Polymorphic ‘Message a’ type – don’t assume H-DHT does the serialization to/from lazy bytestrings!
Stabilize is rudementary – check the sequence number to make sure responses are current!
Basic security/correctness mechanisms are still needed. When other nodes send route table rows and leaf sets we just add those entries into our own structure without any confirmation.

Chris:
I’ve barely thought about node failure. Pastry and Chord basically say you can drop failed nodes from your {leaf set, routing table} once discovered and not think on them again. For route tables this is fine as it only effects the local node – agreement amoung multiple nodes wouldn’t make sense.

For leaf sets the Pastry and Chord papers I know are extremely unhelpful here, for example “A Pastry node is considered failed when its immediate neighbors in the nodeId space can no longer communicate with the node”

As for Paxos – it looks a bit expensive wrt the number of messages it takes, but I’ll read about it more carefully. When I get around to true hardening there will be lots of work: node failure agreement, authentication, flash crowds, and join/leave attacks to name a few.

Wren:
I’ve actually used bloom filters in my previous post when introducing Control-Engine by making a toy web crawler. At the same time, I don’t see how they can be used in a generic DHT library (perhaps for certain applications) – let me know if you have a specific idea here, I’d be interested.

If I were to have a case where I want to forward a certain packet only once then I can absolutely see the use of a bloom-filter. Not only would I have a use but an awesome way to implement it as well – each Control-Engine worker thread could add to its own bloom filter that is only periodically recombined with the master filter provided with every packet. This is possible because the union operation is defined for bloom filters and the downside is only a second forwarding of a packet (presumably that doesn’t cause the world to end) while the up side is zero data contention on the filter.

Nothing too specific in mind. You mentioned the problem of redundantly sending the same (large) set of data between peers to ensure consistency, and a common bloomfilter approach is to send the filter over the wire as a cheap proxy for determining if the whole data needs to be sent. The only caveat is that the false-positive rate for element lookup converts into a false-negative rate for when the whole set needs sending. There was a post about this on Planet Haskell in the last few months (perhaps yours?).

wren:
I have to send the leaf sets frequently as they are part of the keep-alive messages, so no filtering them out. What I could do is be smarter about when I take another nodes leaf set and update my own – right now I just add each node individually, but because the received leaf set is ordered there are much cheaper ways to do this.

Chris,
Perhaps I didn’t understand Paxos properly, but it appeared to require proposer to N acceptors, N acceptors back to proposer and then proposer broadcasting the result to the interested parties – so basically 3*N messages with a latency of RTTW + 0.5*RTT where RTTW is the worst round trip time (with a maximum of your timeout).

As for security as an afterthought not being sufficent – I agree completely. What I have built thus far is the bare minimum required for any operations and I have kept it fairly modular. Here are some tentative security plans for the curious (time allowing):

1) Remove nodes from the leaf set in a sensible manner. All the infrastructure is there to keep track of the leaf set itself and entries responses, so now I just need something to do with those responses.
2) Periodically check the route table entries / update. At this point I’m strongly considering Awerbuchs provably competitive adaptive routing. Again, the route table structure is needed no matter what path I take, but perhaps I’ll need to refactor the code so each entry carries more information.
3) Contact new nodes before adding to RT or LS. This is simple, but I have considered the ramifications of omitting this step (thus saving on communication) when we have a timestamped signature from said node.
4) Add public key based addressing, key sharing, per-message authentication and perhaps confidentiallity.

Morten:
I haven’t touched it much. I’ve been giving some thought to what Chris said and perhaps I have put off a full/secure design a little too long – namely I shouldn’t have been so hasty to switch to a Chord join algorithm as the Pastry algorithm appears better for reliability (but worse for churn – there is some contention between these two issues).

Feel free to ping me again if I don’t update the site; I do intend to do more hacking on this on a free weekend. I just wish I could pause the world to give me that time.

Hi, have you make any advance on the DHT library?
I would really like to take a look at the code. I have make a naive implementation of Chord in Haskell. But I’m a beginner Haskell programmer and I would like to see more advanced code to improve my knowledge.