P2P Routing

Authors

Introduction

There are many peer-to-peer networks in existence today, each using different
techniques to route data from one peer to another, to discover resources
on the network and to adapt to changing topologies. This webpage describes
the routing techniques employed by four distinct schemes to overcome these
challenges.

GNUtella

Before one can give a detailed outline of the routing algorithms used by GNUtella
it is essential to give a broad outline of what such a system is trying to
achieve in general. Most operations on the Internet work on a client-server
model. This system means that individual client machines interact with a central
server to retrieve information stored on that server. This means that the
majority of information is moving from the server to the client. The GNUtella
protocol works on the basis that servers can act as clients and vice versa
i.e. it is a decentralized system. These individual systems are called GNUtella
servents (a blend of SERVer and cliENT). A servent can accept queries from
another servent which is trying to access files on its system, while at the
same time requesting files from another servent's file system. Communication
between servents is carried out using the TCP/IP protocol.

Once a servent is connected to the network it can send information into
the network to find out about other servents in the system. The other servents
can respond to this request by sending information about their own state,
including their IP addresses, the number of files they have decided to share
on the network and the total size of those files. A servent can then query
the network for files meeting certain search criteria. If a servent has
files meeting the criteria, it will respond with a list of appropriate file
details. The servent that made the initial request can then ask an appropriate
servent for a particular file and have that file routed to it.

Servents communicate with each other using a set of descriptors.
There are currently 5 descriptors defined: Ping, Pong, Query, QueryHit
and Push. These will be described in detail later.

Each descriptor is preceded by a Descriptor Header which has the
following format:

Descriptor Header

Byte Positions    Contents
0 - 15            Descriptor ID
16                Payload Descriptor
17                Time To Live
18                Hops
19 - 22           Payload Length

Descriptor ID: This is used to uniquely identify the particular message
on the network. It is created by the servent and must (in theory) be unique,
so that other servents can detect when they are seeing a message that they
have processed before.

Payload Descriptor: This defines the type of descriptor which is
following the header. It can be any of the 5 choices.

Time To Live (TTL): This field specifies the maximum number of servents
that the message can be routed through before it must be discarded. Each
time the message passes through a servent this field is decremented, and
when the value reaches 0 the message is discarded. This ensures that a
particular message will not be routed around the network indefinitely,
which would rapidly degrade network performance if all servents were sending
out many packets in succession that were never being removed from the network.

Hops: This is a count of the number of servents through which the
message has been passed. The count is incremented each time it passes through
a new servent.

Payload Length: The length of the payload immediately following this
header. This field is important because the protocol does not define any
flags to mark where one payload ends and the next descriptor begins. If
the fields in the descriptor header are invalid, the message cannot be
routed: it becomes impossible to find the beginning of the next descriptor,
and the connection must be terminated.
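As a concrete illustration, the header layout above can be parsed in a few lines. This is a sketch in Python; the little-endian byte order for Payload Length is an assumption drawn from common implementations, not stated above:

```python
import struct

def parse_descriptor_header(data: bytes):
    """Parse the 23-byte descriptor header laid out in the table above.
    Little-endian Payload Length is an assumption, not part of this text."""
    if len(data) < 23:
        raise ValueError("a descriptor header is 23 bytes long")
    descriptor_id = data[0:16]        # bytes 0-15: unique message ID
    payload_descriptor = data[16]     # byte 16: Ping/Pong/Query/QueryHit/Push
    ttl = data[17]                    # byte 17: remaining hops before discard
    hops = data[18]                   # byte 18: hops taken so far
    payload_length, = struct.unpack_from("<I", data, 19)  # bytes 19-22
    return descriptor_id, payload_descriptor, ttl, hops, payload_length
```

A servent then reads exactly Payload Length bytes after the header; if these fields are corrupt, resynchronisation is impossible and the connection is dropped, as described above.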

Ping, Pong, Query, QueryHit and Push

The following are the 5 descriptors which can follow the descriptor header
outlined above:

Ping

A servent uses a ping descriptor as the mechanism to find out about other
servents on the network. This is a message sent with no payload. This means
that the descriptor header contains all the necessary information.

Pong

Byte Positions    Contents
0 - 1             Port
2 - 5             IP Address
6 - 9             File Count
10 - 13           Total File Size

This is a response that a servent may send upon receiving a ping. The response
contains the following fields:

Port: This is the port on which the servent will listen for communication

IP Address: IP address of the servent

File count: This is a count of the number of files that the servent
is sharing with other servents.

Total File Size: The total size of all the files that the servent
is sharing on the network in kilobytes.
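A matching sketch for the Pong payload, following the table above. The byte order is again an assumption (little-endian integers, IP address byte-for-byte in network order):

```python
import struct

def parse_pong(payload: bytes):
    """Parse the 14-byte Pong payload from the table above.
    Little-endian integers are an assumption, not part of this text."""
    port, = struct.unpack_from("<H", payload, 0)        # bytes 0-1
    ip = ".".join(str(b) for b in payload[2:6])         # bytes 2-5, network order
    file_count, = struct.unpack_from("<I", payload, 6)  # bytes 6-9
    total_kb, = struct.unpack_from("<I", payload, 10)   # bytes 10-13, in KB
    return port, ip, file_count, total_kb
```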

Query

Byte Positions    Contents
0 - 1             Minimum Speed
2 - N             Search Query

When a servent wants to find a file on the network it sends out a query descriptor,
which contains the following fields:

Minimum Speed: If a particular servent is to communicate with the
requesting servent it must be able to communicate at or above the specified
minimum speed. Only servents that meet this criterion can send a response
to this query.

Search Query: This is a string of search data that contains the
request of the servent. This string must be null terminated to mark the
end of the descriptor.

QueryHit

Byte Positions    Contents
0                 Number of Hits
1 - 2             Port Number
3 - 6             IP Address
7 - 10            Speed
11 - (N-1)        Result Set
N - (N+16)        Servent ID

This descriptor is sent in response to a query placed on the network, but
only if the responding servent can communicate at the speed outlined in the
minimum speed field of a Query and if it contains a file which matches the
search string. It contains the following fields:

Number of hits: The number of files that the servent has which match
the query string. The individual file details are grouped in sets. These
set structures will be described below.

Port Number: The port on which the servent will listen for communication.

IP Address: The address of the servent.

Speed: The speed of the servent in KB per second.

Result Set: The result set contains the details of each matching file:
the index of the file stored on the host (a number given to the file by
GNUtella), the size of the file (in bytes) and a string giving the name
of the file found, terminated by a double null. The file index and file
size are both 4 bytes long; the file name string is at least 1 byte long.

Servent ID: This uniquely identifies a servent on a network.
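The result-set layout described above (4-byte index, 4-byte size, double-null-terminated name) can be walked like this. It is an illustrative sketch; little-endian integers are assumed:

```python
import struct

def parse_result_set(data: bytes, number_of_hits: int):
    """Walk the QueryHit result set: each entry is a 4-byte file index,
    a 4-byte file size in bytes, and a name ending in a double null."""
    results, offset = [], 0
    for _ in range(number_of_hits):
        index, size = struct.unpack_from("<II", data, offset)
        offset += 8
        end = data.index(b"\x00\x00", offset)  # entry ends at the double null
        results.append((index, size, data[offset:end].decode("ascii")))
        offset = end + 2
    return results
```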

Push

Byte Positions    Contents
0 - 15            Servent ID
16 - 19           File Index
20 - 23           IP Address
24 - 25           Port

If the servent holding a file is behind a firewall that does not allow
incoming connections, the requesting servent can use a push descriptor to
ask it to establish the connection itself and push the data out. The
descriptor contains the following fields:

Servent ID: This is the unique ID of the servent which holds the
required files. It is obtained from the QueryHit descriptor sent back by
that particular servent.

File Index: This is the mechanism by which the requesting servent indicates
which file it wants from the servent identified by Servent ID. The index
is obtained from the QueryHit descriptor.

IP Address: Address of the host to which the file is to be transferred.

Port: Port number to which the file is to be transferred.

Routing Rules for descriptors

To try to minimise the amount of traffic and the number of nodes involved
in transferring packets around the network, each servent must maintain some
form of routing table. This table is used by a servent to determine whether
a particular
descriptor has been seen before. In this way the servent can decide if a
message needs to be forwarded or removed from the network. It is best to
describe the process using an example:

Animation 1.1

Take the situation where node A is joining the network and wants to find
out about the other nodes on the network. It does this by sending a ping
to node B. On receipt of a ping node B looks up its routing table to see
if it has seen that particular ping before. Initially it has not, so B enters
details about the message into the routing table. It responds to A by sending
a pong descriptor containing its details. It then forwards the ping from
A to the other nodes to which B is connected i.e. C, E & F. These nodes
carry out the same operation, storing the details in the routing table if
necessary, sending back a response to A and forwarding the message on. Due
to the fact that a loop exists B will receive the ping message back around
the loop. This scenario is dealt with when C & E check their routing
tables and recognise that they have seen that same ping message before.
In these cases the message can be discarded.

When node B receives the pong messages destined for A, it can look up its
routing table, see that it routed the corresponding ping message for these
pongs, and forward them on to A.

If the network were further expanded the ping message from A would propagate
its way through the network until the TTL on the ping message expired, at
which point the message would be removed from the network. This would happen
if the network were large in relation to the TTL.

The same form of routing is used for querying the network and receiving
back queryHits.
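The forwarding and duplicate-suppression rules above can be sketched as a small simulation. The topology letters follow the example; the node and link structures here are illustrative, not the wire protocol:

```python
from collections import defaultdict

# Hypothetical topology from the example: B connects A to C, E and F,
# and a loop exists between C and E.
links = {"A": ["B"], "B": ["A", "C", "E", "F"],
         "C": ["B", "E"], "E": ["B", "C"], "F": ["B"]}

seen = defaultdict(set)   # per-node routing table of descriptor IDs seen
pongs = []                # (responder, destination) pairs

def forward_ping(node, origin, descriptor_id, ttl, came_from):
    if ttl == 0 or descriptor_id in seen[node]:
        return                        # expired, or already routed: discard
    seen[node].add(descriptor_id)     # record the message in the routing table
    pongs.append((node, origin))      # respond with a pong toward the origin
    for neighbour in links[node]:
        if neighbour != came_from and neighbour != origin:
            forward_ping(neighbour, origin, descriptor_id, ttl - 1, node)

forward_ping("B", "A", "ping-1", ttl=7, came_from="A")
# Every node responds exactly once, despite the C-E loop.
assert sorted(n for n, _ in pongs) == ["B", "C", "E", "F"]
```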

Some Advantages

The major advantage of the GNUtella routing system is that unlike the well-known
Napster protocol, GNUtella does not use a centralized server. Less reliance
on central servers means that failures will affect very few servents.
GNUtella was designed to transfer any type of file, unlike Napster which
would only allow the transfer of sound files.

Some Disadvantages

The major disadvantage is that descriptors such as Ping and Query must be
forwarded by each node to all its neighbours until the TTL reaches 0. This
means that a single ping message causes each neighbouring servent to send
ping messages to its neighbours, which in turn may generate thousands of
pong messages going back to the servent which issued the original ping.
The responses to these descriptors are sent in a more efficient manner,
along the path travelled by the initial request.

Another disadvantage is that each servent must store some information
about what packets it has seen over a period of time. This is something
of a dilemma: to keep a very good history of packet routing a servent
would need to use a lot of memory, but if the servent has a small memory
store it may end up forwarding packets that it had previously forwarded,
creating unnecessary network traffic. The right size for this history
depends on the size of the network, and the GNUtella network does not
follow any particular structure, which means that the size of the network
is always changing.

The protocol depends on a servent producing a globally unique identifier
for each message. In practice this cannot be guaranteed, and the overall
system breaks down if two messages that are supposed to be unique,
i.e. sent from different servents, end up having the same identifier.

There will also be problems if one servent is about to fetch files from
another servent and the sending servent leaves and rejoins the network
with a different file index. The receiving servent may then index
incorrectly into the file system, requesting a file that does not exist
or retrieving the wrong file. In this case the servent would also have
lost all of its history.

JXTA

Introduction

JXTA (pronounced "juxta") was developed by Sun Microsystems and launched on 25 April 2001.
It was intended to be an open standard platform which could be used for the
development of a wide and varying range of distributed applications. In the
relatively short space of time it has been in use it has become very popular.
This platform provides tools and utilities allowing devices such as mobile
phones, PCs, PDAs and servers to communicate and share information over a
virtual network. Each device connected to the network is known as a peer.
Peers are able to set up virtual networks in order to access, find and use
the resources of other peers.

Sun had three main goals for JXTA: interoperability, platform independence
and ubiquity.

Interoperability: this allows peers to locate each other easily as well
as allow them to set up community activities and share resources seamlessly.

Platform Independence: peers should be able to communicate and operate
regardless of what operating system, language or networking protocol is
used.

Ubiquity: JXTA is designed to be usable on any digital device.

Unlike most other P2P systems, JXTA approaches the network from a much
lower level, defining an entirely new infrastructure unrelated to any
existing P2P scheme. Peers are divided into peer groups. Essentially
the JXTA virtual network is a collection of peer groups. This approach is
adopted for communication, security and performance reasons. Six core protocols
have been defined by JXTA to allow peers to detect one another as well as
set up peer groups without the need for network management. It can also
operate through firewalls. Message passing is achieved using unidirectional
pipes and the data passed between peers is in the form of XML documents.

Fig. 2.1: JXTA Virtual Network

JXTA Messages

Communication between peers is achieved using XML documents. Messages are
sent from one peer to another as an ordered sequence of bytes in a single
message unit. Messages are sent between peer endpoints. An endpoint is a destination
on any network capable of receiving datagram-style messages. The mapping of
endpoints to a physical address is performed at runtime by a messaging layer.
This layer uses the transport specified by the URI for message passing. JXTA
can use connection-oriented transport protocols (TCP/IP) as well as
connectionless ones (UDP), and can also support emerging transport protocols.

Messages can be viewed as an envelope with a stack of protocol headers.
The envelope contains a header, message digest, source endpoint and destination
endpoint. The source endpoint is optional. The header consists of a protocol
tag and message body length. The protocol tag specifies the protocol in
use. The body's size is dependent on the protocol in use. When using an
unreliable transport protocol messages may arrive at the endpoint more than
once, in a different order than sent or may not arrive at all. Higher layers
above the messaging layer are responsible for message re-ordering and duplicate
message removal.

Also contained within the body of a message is a credential key that is
used at the endpoint to identify a sender. It also ensures that the sender
has a right to send a message to that specific endpoint. This credential
must be used each time a message is sent.

Message Format

jxta:// Envelope Version
Destination Address
Source Address
Message Digest (Kind, Length)
Message Body Header (Protocol Tag)
Message Body (Text, XML, ...)

JXTA Pipes

Pipes form the virtual communication channels used to send and receive messages
between peers. They are unidirectional, stateless and provide abstraction
from the type of network being used. A pipe can connect one or more endpoints.
At each endpoint it is assumed that there is software that will manage the
sending and receiving of messages using the pipe. However this is not obligatory.
The peer endpoints each have two associated pipes; an input pipe and an output
pipe. Using the pipe binding protocol pipes are bound at runtime. When a peer
sends a message into a pipe all other peers connected to that pipe will receive
it. Pipes that are currently connected can be obtained using the pipe binding
protocol.

Pipes can support two modes of communication. Firstly there is the standard
point-to-point mode. Using this mode exactly two pipes are connected together,
an input pipe and an output pipe. The input pipe will receive messages sent
through the output pipe. There is no support for a reply operation. The
additional information in the message envelope like the credential key is
used to thread messages together. The second mode used is the propagate
mode. This is where multiple input and output pipes are connected to network
endpoints. Using this a message can be sent to all listening input pipes.
Using propagate pipes may cause messages to be duplicated. As a result of
this they are not always favoured, especially when optimal speeds and times
are required for communication.
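The propagate mode described above can be illustrated with a toy model. The class and method names here are illustrative, not the JXTA API:

```python
class PropagatePipe:
    """Toy model of a JXTA propagate pipe: every bound input pipe
    receives a copy of each message sent into the pipe."""
    def __init__(self):
        self.input_pipes = []

    def bind_input(self, handler):
        # In JXTA, binding happens at runtime via the pipe binding protocol.
        self.input_pipes.append(handler)

    def send(self, message):
        for handler in self.input_pipes:   # all listening inputs get a copy
            handler(message)

received = []
pipe = PropagatePipe()
pipe.bind_input(lambda m: received.append(("peer1", m)))
pipe.bind_input(lambda m: received.append(("peer2", m)))
pipe.send("hello")
assert received == [("peer1", "hello"), ("peer2", "hello")]
```

Point-to-point mode is the degenerate case with exactly one bound input pipe; the duplication visible here is why propagate pipes are avoided when optimal speed is required.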

Animation 2.1

JXTA Message Routing Protocols

As mentioned earlier JXTA uses six protocols specifically designed to work
together to allow for discovery, monitoring, organization and communication
between peers. The protocols are used by peers to advertise and discover available
resources. Messaging routing is cooperative allowing all peers to be connected.
The protocols are:

Peer Discovery Protocol

Peer Resolver Protocol

Peer Information Protocol

Pipe Binding Protocol

Endpoint Routing Protocol

Rendezvous Protocol

Peer Discovery Protocol

This protocol allows peers to advertise their resources and find out what
resources are available from other peers. Advertised resources are represented
as XML documents. This protocol supports finding peers without knowing their
name. Using a process known as screening it can also be used to probe network
peer groups for peers that belong to a specific group. Screening works by
presenting each possible candidate with a peer group name. Any that belong
to the group will respond to the message. A list of peers may be cached allowing
for unicast rendezvous communication instead of using broadcast messages.
Messages to get advertisements from a certain part of the network will not
always return an exhaustive list and could even be empty.

Messages used:

Discovery Query Message - used to find peers or peer groups. Message
contains a threshold value indicating the maximum number of matches required.

Peer Resolver Protocol

This allows peers to send a query to one or more peers and receive a response
(or responses) to the query. The query is sent with a unique ID that is used
in the response to identify it as the response to that query. Such queries
can be sent to a whole peer group or to specific peers. This protocol performs
confirmation of messages and drops any rogue messages.

Messages used:

Resolver Response Message - used to respond to a resolver query message.
The response is sent with the unique ID of the sender.

Peer Information Protocol

Allows peers to find out status information about other peers, i.e. state,
uptime and capabilities. The message sent by the original sender contains
a unique ID which must be used in the response. In the responding message
each property of interest is listed as a name/value pair.

Messages used:

Ping Message - sent to check if the peer is active and get information
about the peer. The ping option specifies whether a simple response or
full response is required.

PeerInfo Message - sent in response to a ping message.

Pipe Binding Protocol

This is a mechanism that allows a peer to set up a virtual communication
channel (pipe) between itself and one or more other peers. Essentially it
binds the two ends of a pipe, creating this virtual channel. Pipes are the
most common mechanism used to allow peers to communicate.

Messages used:

Pipe Binding Answer Message - the response sent back to the requesting
peer by each peer bound to the end of that pipe.

Endpoint Routing Protocol

This protocol allows a peer to discover a route between itself and another
peer; the route is the one that would be used to send a message to that peer.
If peers X and Y wish to communicate but there is no direct link between the
two, this protocol has to be used. It finds the intermediary peers through
which the message will have to be routed. If that route is later broken, the
protocol can be used to find other possible routes.

Messages used:

Route Query Request - sent by one peer router to another requesting
route information.

Route Answer Request - sent in response to a route query request.

Rendezvous Protocol

This protocol allows peers to become members of a propagation service. As
a member of a propagation service peers will receive messages and information
sent out to all members of that service. Thus the protocol allows peers to
send messages to all members of the service. To become a member of a peer
group a form must be obtained listing the requirements that must be met in
order to join. This form is a structured document.

Messages used:

Membership Apply Message - message sent to the application authenticator.
A response will contain an application credential and a group advertisement
that lists its membership service.

Membership Join Message - sent by a peer to the application authenticator.
This is an application to join the group. In order to do so the peer must
pass an application credential. A successful response will include a full
membership credential.

Membership ACK Message - this message is sent by the membership authenticator
to indicate to a peer whether it has been granted an application or full
membership.

Membership Renew Message - sent to renew access to a peer group.

Membership Cancel Message - sent to cancel membership to a peer group.

Fig. 2.2: Message Passing Protocols

These protocols can then be used by the peers and peer groups to discover
one another, distribute resources, partition themselves into smaller virtual
networks, communicate and monitor each other. The protocols do not, however,
require peers to use a particular computing language, network transport
system or special encryption method, thus achieving Sun's three original
goals for JXTA.

Tapestry

Introduction

Traditional approaches to location services and routing over networks place
a large strain on components within the network infrastructure. Tapestry
aims to provide an alternative to the traditional network method. Using
only point-to-point links and no centralized resources, Tapestry is an overlay
location and routing infrastructure providing location-independent routing
of messages directly to the closest copy of an object or service. Tapestry
can boast features such as load balancing, fault-resilience, robustness,
scalability, and self-organisation.

The Tapestry routing architecture

Building on the Plaxton location and routing mechanism, Tapestry aims for
adaptivity, self-management and fault-resilience. Tapestry's routing architecture efficiently
routes requests to content even in the presence of node faults and heavy
loads on the network. Through the use of randomness it achieves routing
locality and load distribution. Tapestry will deal with network problems
by bypassing failed routes, removing nodes under attack from the service,
transparently masking faulty components, and rapidly adapting communication
topologies to current circumstances depending on the type of problem being
encountered.

Due to Tapestry having its roots firmly in the Plaxton mechanisms, it
is appropriate to outline the Plaxton method in order to gain an
understanding of the Tapestry routing mechanisms.

The Plaxton routing architecture

Plaxton routing mechanism

The Plaxton routing mechanism gives each node local routing maps, called
neighbour maps, that incrementally route overlay messages digit by digit
towards the destination ID. Each node has a neighbour map with multiple
levels, and any level in the neighbour map has a number of entries equal
to the base of the ID: the i-th entry in the j-th level holds the ID and
location of the closest node whose ID matches the current node's ID in the
last j - 1 digits and has i as the next digit. To find the next hop for a
message that has already matched n digits, a node looks at its (n + 1)-th
level map and looks up the entry matching the value of the next digit in
the destination ID. Using this routing method in a namespace of size N
with IDs of base b guarantees that any existing unique node in the system
will be found in at most log_b(N) logical hops, assuming consistent
neighbour maps in each node.

Because every neighbour map at each node assumes that the preceding digits
of the ID all match the current node's suffix, each node only needs to keep
a small constant number (b) of entries at each routing level:

NeighbourMapSize = (entries/map) * (number of maps)
                 = b * log_b(N)

so the neighbour map at each node grows only logarithmically with the size
of the network.

Fig. 3.1: Plaxton routing example.
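The digit-by-digit suffix matching can be illustrated with hexadecimal IDs (base b = 16). This is a sketch of the idea only, not Plaxton's actual data structures:

```python
def route_suffixes(dest_id: str):
    """At routing level j the next hop must share the last j digits with
    the destination ID, so each hop resolves one more digit of the suffix."""
    return [dest_id[-level:] for level in range(1, len(dest_id) + 1)]

# Routing toward node 42AD resolves the suffix one digit per hop:
# any node -> *D -> *AD -> *2AD -> 42AD
assert route_suffixes("42AD") == ["D", "AD", "2AD", "42AD"]

# Neighbour map size: b entries per level, log_b(N) levels.
b, levels = 16, len("42AD")        # N = 16**4, so log_b(N) = 4
assert b * levels == 64            # NeighbourMapSize = b * log_b(N)
```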

Plaxton location mechanism

The Plaxton location mechanism allows a client to locate and send messages
to a named object residing on a server. This is achieved by the server advertising
that it has an object. It does this by routing a message to the root node,
a unique node in the network that is used to hold the root of the object.
Due to the root node's critical purpose and being the only one of its kind
it becomes a single point of failure. Publishing is a process whereby a
message is sent to the root node, and at each node along the path the <Object-ID
(O), Server-ID (S)> location information is stored.

On the other hand a location query allows clients to send messages to objects.
As the message progresses through its path, it checks the nodes it passes
through. The message is redirected to the server containing the object if
the message encounters a node that has the location information for the
object sought. Otherwise, the message is forwarded one step closer to the
root. The location information or mapping for the object is guaranteed to
be found upon the message reaching the root. The Plaxton location mechanism
chooses root nodes using a globally consistent deterministic algorithm.

Plaxton routing architecture - Review

Overall the Plaxton mechanism provides many good features for routing architecture.
These include:

· Simple Fault Handling

· Scalability

· Exploitation of Locality (Plaxton has an explicit notion of locality)

· Proportional Route Distance (Total network distance travelled
by a message during both location and routing phases is proportional to
the underlying network distance)

However, there are some bad features to its design which include:

· Global Knowledge (Plaxton requires global knowledge of the network
which complicates the addition and removing of nodes)

· Root Node Vulnerability (Having a critical purpose and being the
only one makes the root node a single point of failure)

Tapestry location and routing mechanisms

The main features of the Tapestry mechanism are based heavily on the Plaxton
mechanisms. Just like the Plaxton routing mechanism every node in the network
has a neighbour map with multiple routing levels and each entry describes
the location of its closest neighbour. The Tapestry location mechanism is
similar to that of Plaxton's.

In the Plaxton method each node only stores the location information: <Object-ID
(O), Server-ID (S)> stores the closest replica object to the node. Tapestry
on the other hand stores the location information of all object replicas.
This has the effect of increasing semantic flexibility.

Fig. 3.2: A single Tapestry node. The complete components of a Tapestry
node that acts as a client, object server, and router. Components include
a neighbour map, hotspot monitor, object location pointers, and a store
of objects.

The next section explains how Tapestry mechanisms detect, operate under,
and recover from faults affecting routing and location functionality.

Fault Handling
A Tapestry design decision was that components tackle the problem of fault
adaptivity by using soft state to maintain cached content for graceful fault
recovery, rather than hard state, which would require guarantees to be
provided. The cached content is updated by regular refresh messages. Rather
than treating them as a special case, Tapestry handles faults as part of its
normal operation.

Fault-tolerant Routing
Neighbour map table corruption, server failures and link failures are types
of expected faults that need to be handled by the network. The system should
ideally quickly detect failures, operate under them, and recover router
state when failures are repaired. Tapestry relies on TCP timeouts in order
to detect link and server failures during normal operations. Each node in
the network contains back pointers that send regular UDP packets to its
closest node neighbour. This is done to ensure that routing can be processed
through that node. Corrupted tables can be quickly and easily detected by
using a message to check the ID of each node it passes through. The neighbour
map of a node maintains two backup neighbours in every entry in addition
to its closest neighbour. If the primary neighbour of a node ever fails
then the node uses the backup neighbours instead. When a node fails, it
is usually found and repaired in a short time. Instead of removing the
failed node from its neighbour map, a node therefore marks it as invalid,
and routing is done through an alternative route until the node is repaired.
However, the node will only remain in the neighbour map in its invalid state
for a fixed time (e.g. one day), to give it a chance to be repaired. Messages
from the failed route are checked to see when it has been repaired. If the
node has not been fixed within the specified time, it is removed from the
neighbour map. Marking rather than removing avoids the costly reinsertion
of recovered nodes after the failure has been repaired.

Fault-tolerant Location
In the Plaxton mechanism the root node was vulnerable in that it was a single
point of failure. Tapestry instead uses multiple roots: each object is
assigned several root nodes, which is achieved by appending a constant
sequence of numbers, called salt values, to each object ID. During the
publishing process described previously, the roots are used to insert
location information into the nodes, and the storage servers republish the
location information of stored objects at regular intervals.

Surrogate Routing
Tapestry uses a surrogate routing mechanism, which is really just a distributed
algorithm, that incrementally computes a unique root node. This algorithm
is deterministic, scalable, and can arrive at consistent results from any
point in the network. Surrogate routing operates by choosing an object's
root node to have the same name as its ID. It is, however, unlikely that
this node will actually exist, due to the sparseness of the node namespace.
Tapestry operates as if the node does exist by attempting to route to it.
A route to a non-existent identifier will encounter empty neighbour entries
at various positions along the way. Here, the algorithm selects an existing
link that acts as an alternative to the desired link. This selection is
done with a deterministic selection among existing neighbour pointers. Routing
finishes once a neighbour map has been reached where the only non-empty
entry belongs to the current node. That node is then designated as the surrogate
root for the object.
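The deterministic selection among existing neighbour pointers can be sketched as follows. The scan-upward-and-wrap rule here is one plausible selection rule for illustration, not Tapestry's exact algorithm:

```python
def surrogate_next_digit(wanted: int, occupied_entries: set, base: int = 16) -> int:
    """When the neighbour-map entry for the wanted digit is empty, pick
    the next occupied entry deterministically (scan upward, wrapping).
    Illustrative rule only, not Tapestry's exact algorithm."""
    for offset in range(base):
        candidate = (wanted + offset) % base
        if candidate in occupied_entries:
            return candidate
    raise LookupError("no neighbours at this level")

# Entry 7 is empty, so the route falls through to entry 9:
assert surrogate_next_digit(7, {2, 9, 12}) == 9
# Scanning wraps around the digit space, so 14 falls through to 2:
assert surrogate_next_digit(14, {2, 9, 12}) == 2
```

Because every node applies the same rule to the same inputs, all routes toward a non-existent ID converge on the same surrogate root.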

Kademlia

Introduction

Kademlia is a new (circa March 2002) peer-to-peer information system that
returns search results from the entire network in O(log n) time, where n
is the number of nodes in the network. Obviously this scheme is extremely
scalable, but it must be realised from the outset that a given piece of
content must reside on a particular node in order for it to be found and
retrieved. Given that no account is taken of the bandwidth available to
a node, a PC with a 56k dial-up connection may be expected to host, for
example, a large media file, and to make it available for download to anyone
who wants it. Clearly the system is not suitable for large-scale Napster-style
file sharing; rather, it works best as a distributed file store on a
high-speed, reasonably homogeneous network. Kademlia's searching and routing
techniques are novel,
and are worthy of investigation.

Keys & Content

In order to decide where on the network a piece of content should reside,
a 160-bit hash of the content is first obtained. Hashes are assumed to be
unique for different files. We can imagine a hashing scheme where a file
is split into 160 pieces, and the first bit of each piece is recorded. These
160 bits could be used as our hash, as it would be very unusual to find
two different files with the same bits in the given positions. Kademlia
uses a technique similar to this; a hash is obtained, which it refers to
as the key. The file (or other content) is called the value, and in Kademlia
the key/value pair is stored on the node whose 160-bit nodeID is closest
to the key. The concept of "closeness" will be discussed later.
The content must reside on this particular node so that it can easily be
found. For this reason Kademlia is referred to as a Distributed Hash Table.
In order to find a file Kademlia simply obtains its key and retrieves the
file from the node with ID closest to the key. The responsibility of having
to host someone else's content makes it very unlikely that Kademlia could
ever be used as a replacement for a music-sharing network like Napster or
GNUtella.
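The key derivation and owner selection can be sketched as follows. SHA-1
is assumed here as the 160-bit hash (a common choice for Kademlia-style
keys); the node IDs are purely illustrative, and "closest" uses the XOR
metric defined in the next section:

```python
import hashlib

def key_for(content: bytes) -> int:
    """Derive a 160-bit key for a piece of content (the 'value')."""
    return int.from_bytes(hashlib.sha1(content).digest(), "big")

def closest_node(key: int, node_ids):
    """The node responsible for `key`: the one at smallest XOR distance."""
    return min(node_ids, key=lambda nid: nid ^ key)

# Three hypothetical node IDs, derived the same way for the example.
nodes = [key_for(b"node-a"), key_for(b"node-b"), key_for(b"node-c")]
k = key_for(b"contents of some shared file")
owner = closest_node(k, nodes)   # the key/value pair is stored on this node
```

Looking a file up later repeats the same computation: hash the content to
recover the key, then contact the node whose ID is closest to it.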

Distance

Given that a piece of content can have any 160-bit key, but
only a tiny fraction of 160-bit nodeIDs are used, there must be a way of
deciding where to store content whose key is not exactly equal to any nodeID.
Kademlia solves this problem by inventing a measure of "closeness"
between two 160-bit numbers (in this case, one a key and the other a nodeID).
This is referred to as the XOR metric, because the two numbers are simply
XORed together and the resulting value is treated as a distance. For example,
with two 6-bit numbers, 110101 XOR 100110 = 010011, a distance of 19.

A key/value pair is stored at the nodeID closest to the key, in the terms
defined above. This metric is also used in determining how close two nodes
are to one another, which is important when searching.
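A minimal illustration of the metric in Python, using two 6-bit numbers
(the specific values here are our own example):

```python
# The XOR metric: the bitwise XOR of a key and a nodeID, read as an
# integer, is the distance between them.
key, node_id = 0b110101, 0b100110
distance = key ^ node_id
print(f"{distance:06b} = {distance}")   # 010011 = 19
```

Note that XOR satisfies the usual properties expected of a distance: it is
zero only when the two IDs are identical, and it is symmetric.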

Searching

Each node in the network keeps contact information for only O(log n) other
nodes, its contacts. The information is made up of the nodeID, IP address
and UDP port of the node. We can think of the network as a binary tree of
nodes organised according to their nodeIDs (see Fig. 4.1). Any given node
must have a contact in each subtree in which it itself is not contained
(in Fig. 4.1 the black node has a contact in each of the grey ovals).

Fig. 4.1: from [Kademlia 1]

In reality each node has multiple (typically 20) contacts in each subtree.
For each subtree the contacts are stored in a bucket. Fig. 4.2 shows an
example where each bucket holds at most 2 contacts.

Fig. 4.2: from [Kademlia 2]

In order to find a particular key, the searching node first determines
the subtree that will contain the required node (based on the start of the
key). It then asks all its contacts in that subtree to return their contacts
that are in the Target subtree. The searching node "homes in"
on the Target by iteratively finding contacts closer to it (see Animation
4.1). Each step reduces the pool of candidate nodes by 50%.

Animation 4.1

The algorithm given above will find the node with ID closest to the key
sought. By definition this is where the required key/value pair resides.

Nodes Joining and Leaving

Whenever one node asks another for its contacts, the called node stores
the contact information of the caller. This is the primary mechanism by
which each node's routing table is kept up to date. The symmetric property
of XOR (X ⊕ Y = Y ⊕ X) ensures that a node will get the right
distribution of contacts, i.e. it will get equal numbers of contacts for
each oval in Fig. 4.1. Given that each contact bucket has finite capacity
it becomes necessary to discard some contacts when recording new ones. Kademlia
implements a "least recently seen" eviction policy, removing contacts
that have not been heard from for the longest period of time.
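A toy bucket with this eviction policy might look like the following
(capacity 2, as in Fig. 4.2; note that the real protocol first pings the
least recently seen contact and evicts it only if it fails to respond,
a refinement omitted in this sketch):

```python
from collections import deque

K = 2  # bucket capacity

def record_contact(bucket: deque, node_id: int) -> None:
    """Record a contact, keeping the bucket ordered least- to most-recently seen."""
    if node_id in bucket:
        bucket.remove(node_id)   # move to the most-recently-seen position
    elif len(bucket) >= K:
        bucket.popleft()         # evict the least recently seen contact
    bucket.append(node_id)

bucket = deque()
for nid in [0xA, 0xB, 0xA, 0xC]:
    record_contact(bucket, nid)
print([hex(n) for n in bucket])  # ['0xa', '0xc']: 0xB was least recently seen
```

Here 0xB is evicted when 0xC arrives, because 0xA was re-seen after it and
so moved to the most-recently-seen end of the bucket.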

When a node joins the network it takes some of the contacts of an arbitrary
node and uses them as its own. It then does a search for itself. This results
in other nodes being called, which makes them aware of the new node's existence.
Because a new nodeID has been assigned, the new node may have become the
closest node to certain keys. Upon becoming aware of this, the previous
closest nodes will replicate the appropriate key/value pairs to the new
node. In this manner all key/value pairs will be found at the expected nodes.
Ignoring replication, the cost of a node joining is only O(log n) messages.

When a node leaves the network no action is required. If a searching node
tries to contact the departed node, the request will time out and the dead
contact will be removed from the searcher's routing table. Given that a
searching node has multiple contacts that can be pursued in parallel, the
impact on performance is minimal.

Conclusions

Kademlia is a very new peer-to-peer information system that is at present
largely untested. Although it is not suited to arbitrary file-swapping its
architecture could provide a sound basis for a distributed file store. Its
O(log n) scalability is its most attractive advantage, and if it can be
modified to take account of different node and network types (as its inventors
suggest) it could become a major architecture in peer-to-peer systems.

Conclusion

The vastly different approaches to data
routing, resource discovery and changing topologies described above show that
in many respects peer-to-peer is in its infancy. Each routing scheme has advantages
and disadvantages, which means that at present there is no perfect peer-to-peer
technology. However, if the history of the Internet and its growth is anything
to go by, the one attribute that a routing technology must possess above all
others is scalability.