Peer to Peer Routing

Introduction

Peer to Peer (P2P) systems can take many forms. Email, Internet Relay
Chat and Napster are all examples of P2P systems. Routing on these networks
is either centralised or statically configured and is therefore unproblematic.
Another class of P2P networks is the overlay network. Overlay networks
build a virtual topology on top of the physical links of the network. Nodes
leave and join this network dynamically and the average uptime of individual
nodes is relatively low. The topology of an overlay network may change
all the time. Once a route is established, there is no guarantee of the
length of time that it will be valid. Routing in these networks is therefore
very problematic and will be the focus of our report. Some of the issues
facing designers of P2P routing algorithms are:

Scalability

Complexity

Anonymity

Scalability is a measure of how a system performs when the number of nodes
and/or number of messages on the network grows. Complexity is the order
of steps required for a packet to travel from one host to another in a
worst case scenario. Anonymity is not a requirement of most P2P networks,
however if a network is to be designed to provide anonymity then this is
a problem that must be solved at the routing level. We will look at a few
examples of routing algorithms from each of these perspectives.

Gnutella Routing

Gnutella was arguably the first mainstream overlay network to enjoy widespread
use. The concept behind it was simple. To join the network a client had
to know the address of at least one node already on the network. Once the
client had a connection to this node it could then broadcast a ping to
find the addresses of other nodes. The basic idea is that each node maintains
a connection to a number of other nodes, normally about five. To search
the network for a resource the client sends a "Query" message to each of
the nodes it's connected to. They then forward the message and when a resource
is found the result i.e. resource name and address is propogated back along
the path. The number of nodes that get queried can be somewhat controlled
using a Time-To-Live counter. This type of routing is the simplest kind
possible for an overlay network, it is not however without its problems.

Gnutella style routing, or flooding, works well for small to medium sized
networks. It has been shown that the cost of searching on a Gnutella style
network increases superlinearly as the number of nodes increases. When
node saturation occurs the network can become fragmented. Gnutella, therefore,
while a simple solution does not scale well. Searching in a Gnutella network
is roughly of exponential complexity, as each search will take about n^d
steps where n is the time to live and d is the number of peers per node.
Flooding is obviously not the most optimal solution for routing, and it
wasn't long before other P2P routing algorithms emerged which were more
efficient. We will discuss a number of these including semantic routing
and distributed hash tables. Gnutella wasn't designed to be anonymous and
discovering who is making specific resources available is a simple matter
of performing a search. One network which was designed specifically to
be anonymous is Freenet. Freenet's routing algorithm was designed completely
with anonymity in mind. We will take a look at the issues that anonymity
presents in terms of scalability and complexity.

Chords main advantage is the guarantee that you will get a reply within log(n) time. Also a significant advantage is the lack of redundant overhead. These both give it a huge edge on any flooding algorithm. But in general, given that DHT algorithms store their data references in an organized way, they will always beat flooding algorithms in this area. Chord scales well, given that the search is of order log(n), and also has relatively low complexity because of this.

Distributed Hash Tables

Distributed hash table (DHT) algorithms are useful for sharing files or
other data across a peer-to-peer network. A hash function takes a variable
length string of bytes and returns a number that it generates from this.
DHT algorithms work by hashing all file/data identifiers and storing their
locations in a giant hash table, which is distributed across the participating
nodes. One research group in MIT has developed a system called Chord, which
is a good example of a DHT algorithm.

Chord works as follows. Assume we have a single function 'hash', which
functions as above. All files/data items in the network will have an identifier,
which will be hashed using this function to give a key for that particular
resource. If a node needs a file/data, they will hash its name and then
send a request using this key.

All n nodes also use this function to hash their IP address, and conceptually,
the nodes will form a ring in ascending order of their hashed IP. Obviously
not all possible nodes will exist, but some will actually be real nodes,
and so all real nodes deal with requests to any nodes before them in the
ring, down to the previous real node (and of course requests to themselves).
This gives us the notion of a successor, which is the next actual node
in the circle. So, successor(x) is the next actual node in the logical
ring after x. When a node wants to share a file or some data, it hashes
the identifier to generate a key, and sends its IP and the file identifier
to successor(key). These are then stored at this node. In this way, all
resources are indexed in a large distributed hash table across all participating
nodes. If there are two or more nodes that hold a given file or resource,
the keys will be stored at the same node in the DHT, giving the requesting
node a choice.

So when a node wants something, it will hash its identifier and send a
request to successor(key), which will reply with the IP of the node that
holds the actual data. But how does a node request information from successor(key),
when it doesn't know its IP, but only the key? Chord has a system whereby
every node holds what is known as a finger table. This contains a list
of keys and their successor IP's, and is organized such that each node
holds the IP of a exponential sequence of nodes that follow it, i.e. entry
i of node k's finger table holds the IP of node k + 2^i.

Searching for a node is therefore a log(n) procedure. Each node knows the
IP of the next real node in the ring. If the key lies between k and the next
real node, then successor(key) is the next node. Otherwise the finger table
is searched for the closest predecessor of the key. The request is forwarded
to this node and it repeats the same procedure until the node is found. The
request contains the IP of the requesting node so when the search terminates
at the key node, it can reply instantly, with no back propagation required.

When a node joins or leaves the network, a series of messages are sent
to re-distribute parts of the hash table. On joining, a node hashes its
IP to form a number and sends a request to any node on the network to find
successor(number). It then takes a certain amount of its successors hast
table and a message is sent to the predecessor to inform it of its new
successor. When a node leaves the network, it sends its table to its successor
and also tells its predecessor of its departure. On both these events,
finger tables will end up with wrong values, and so each node runs a periodic
process, which sends messages around the ring to find new nodes.

Chords main advantage is the guarantee that you will get a reply within
log(n) time. Also a significant advantage is the lack of redundant overhead.
These both give it a huge edge on any flooding algorithm. But in general,
given that DHT algorithms store their data references in an organized way,
they will always beat flooding algorithms in this area.

Node 1 finger table

X

Successor(X)

3

4

5

6

Node 6 finger table

X

Successor(X)

0

0

2

4

Semantic Routing:

Semantic Routing is a method of routing which is more focused on the nature
of the query to be routed than the network topology. Essentially semantic
routing improves on traditional routing by prioritising nodes which have
been previously good at providing information about the types of content
referred to by the query.

In order to be able to search for information on a p2p network semantically
the data needs to have a semantic description associated with it, one popular
solution is the use of RDF meta-data[1] for this purpose.
Tagging documents/data with RDF would provide a rich 'semantic web' which
could be structured in a p2p fashion. A schema-based p2p network such as
this would benefit greatly from semantic routing. Semantic routing differs
fundamentally from other routing techniques because prospective nodes are
selected because of another node's confidence in their ability to respond
correctly to a given query irrespective of their position within the network.

There have been several different projects started with a view to examining
the power of semantic routing, for instance: Neurogrid[2],
Nejdl,Wolpers,Siberski et al[2] and Tempich,Staab,Wranik[4].
Arguably the most interesting of these is the Remindin[4] system
which incorporates a semantic routing algorithm developed with the intention
of mimicking social networks. In addition to this many of the techniques
developed for collaborative filtering[5] are equally
applicable to the ranking of peers within a semantically routed network.

Each time a node answers a query its peers adjust their confidence in
that node with regard to that type of query. The nature of this adjustment
depends upon whether the node answered correctly i.e. if the search result
was selected by the searcher. The information associated with 'types' depends
greatly on the kind of semantic data being dealt with by the network and
the strictness of the peer confidence ranking algorithm. How this data
is stored and how the semantic routing algorithm uses this data is beyond
the scope of this article, however it has been shown[4] that
automated relaxation of queries can lead to more robust searches.

An important factor in order for the routing algorithm to be effective
in the long term is persistence. Nodes must have a constant identifier
within the network's namespace if they are to retain their confidence ratings.
There is an initial forming stage where none of the peers have ratings
for any nodes, and nodes might be returned randomly (or indeed using traditional
routing methods). It has been observed[5] that a certain
amount of random responses to all requests (even after the forming stage)
can avoid an effect called 'overfitting', which is when the confidence
data associated with the nodes becomes too rigid and inflexible.

Semantic routing is a reasonably new idea and as such is in its infancy
as a technique. A universal consensus on a specific algorithmic approach
has not been reached and there is not yet a paradigm under which research
into p2p semantic routing is conducted. The advent of semantic networking
is making this an important topic, and one that is sure to grow and flourish
in future.

Freenet

Introduction:

Freenet is a peer-to-peer networking system with very specific goals -
both political (anonymity, and freedom of speech) and technical (decentralisation
of all network functions and efficient dynamic storage and routing of information).

System Architecture

Each node on the network acts as a data store, and has no say in what data it contains - any other node may read from or write to this. This enables the network to act as a distributed file-system. The data on each node is encrypted, providing further anonymity.

Search protocol:

Files on the system are referenced and located using hash-keys. In order
to search for a file a user must know it's hash-key, or at least how to
calculate it. Once this is known the user sends a request to their own
freenet node. Each request has a hops-to-live value, and a randomly generated
id so it can be recognised and rejected by nodes which have seen it before.
When a node receives a request it first checks it's own data-store to see
if it contains the relevant file. If it does then it returns the file along
with a message identifying it as the source of the data. If not it decrements
the hops-to-live value and forwards the request to one of the neighbours
in it's routing table. If this request fails then it asks another neighbour
from the routing table. If none of it's neighbours return a positive response
then it returns a failed message itself. If the hops to live value is exceeded
then a failed message is also returned. If a positive response is received
from a node then this node sends the data to the node it received the request
from, and caches a copy of the data in it's own data-store. If the data-store
is already full, then the least recently used files are deleted to make
room for the new one. In this way, more popular data will be cached on
multiple nodes throughout the network, and data that no-one requests will
eventually be removed entirely.

Inserting Data into the Network:

In order to insert a file into the network, a request is made using the
file's hash-key. If the file is found by one of the other nodes before
the hops-to-live expires, then the file is returned. Thus there will be
no need to re-insert the data into the network. If the request fails however,
the user can then send the data onto the network. This data will follow
the same path as the initial request, and will be copied by all nodes along
the way. Each node can claim that it or any other node is the source of
the data, thus providing anonymity for the original source.

Routing algorithm:

Choosing the next neighbour to request content from is an important part
of freenet's behaviour, as it is this that it allows it to adapt to usage
patterns and network changes.

Each node has a routing table which contains the address of it's neighbours,
and also their performance in returning particular keys. This performance
rating was initially just an indication of how successful the node had
been in retrieving the key, but has been improved to include response-time
and transfer time. When a node receives a request for a particular key
it searches it's table to find the node which was most successful in returning
a similar key. It is important to note that it is the keys which are being
compared for closeness, not the files, so there is no semantic similarity
implied. It then requests the file from this node.

When a node receives a successful response from another node, it will
create a new entry in it's table associating the source node with the requested
key. An emergent property of this is that nodes will start to specialise
in particular keys. When the network is set up there will be no performance
records for any nodes, so requests are sent to node which is essentially
chosen at random. If it successfully returns the requested file, then an
entry will be created for that node with the relevant key. Requests for
similar keys will then be routed towards the successful node, and as a
result it will eventually start to specialise in these keys.

Advantages:

Popular data is replicated throughout the system.

Data is distributed anonymously and freely, the main goals of the
system.

The routing algorithm is designed to adapt and improve efficiency
over time.

Disadvantages:

People are hesitant to donate part of their hard-drive to the system,
particularly when it could be used to store information they don't approve
of.

The network is not easily searchable - users need to know the key to
find a file.

The network is slow, as the data must pass through all intermediate
nodes. Searches can be particularly slow because they don't use multicasting.