Storage and beyond

The first elliptics distributed network draft.

POHMELFS was created as a client for this kind of distributed network (and, besides that, it is already a very high-performance network filesystem in its own right). The network itself was called the elliptics network.

Each node has an ID in a flat space with modulo arithmetic, i.e. the ID space wraps around into a ring.

Each object has an ID from the same finite space. An object can be identified only by its ID; there are no directories, files or anything of that sort.

A server with ID N hosts objects in the range (P; N], where P is the ID of its neighbour node (the predecessor on the ring).
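This range assignment can be sketched as a lookup over a sorted list of node IDs (the function name and the sample IDs below are illustrative, not from the draft):

```python
import bisect

def responsible_node(node_ids, object_id):
    """Return the ID of the node that hosts object_id: the first node
    whose ID is >= object_id, wrapping around the flat ID space, so
    node N covers the range (P; N] for its predecessor P."""
    nodes = sorted(node_ids)
    i = bisect.bisect_left(nodes, object_id)
    return nodes[i % len(nodes)]

nodes = [100, 4000, 30000, 61000]
print(responsible_node(nodes, 4000))   # → 4000 (a node hosts its own ID)
print(responsible_node(nodes, 4001))   # → 30000
print(responsible_node(nodes, 62000))  # → 100 (wrap-around)
```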

Any object can have multiple IDs, in which case it will be copied to different nodes.
Each additional ID can be derived from the first one via a known mathematical operation (such as addition in the finite field). IDs (including the additional ones) must not collide between different objects.
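A minimal sketch of deriving the additional IDs by modular addition; the step value and the (deliberately small) space size are assumptions, since the draft only requires a known operation that keeps IDs of different objects from colliding:

```python
ID_SPACE = 2**16  # small space for illustration; real IDs are far larger

def replica_ids(primary_id, num_replicas):
    """Derive additional object IDs from the primary one via addition
    in the finite space; each extra ID places a copy on another node."""
    step = ID_SPACE // (num_replicas + 1)  # hypothetical even spread
    return [(primary_id + k * step) % ID_SPACE
            for k in range(1, num_replicas + 1)]

print(replica_ids(12345, 2))  # → [34190, 56035]
```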

Information about an object and its clones is stored both in the object itself and in the object that corresponds to the parent directory in a classical path-based lookup. Thus, to get an object, a client creates an ID from its path and fetches the data; if the data is not available, the client can create an ID from the parent directory's path and request information about the clones from that object.
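The lookup path can be sketched as follows; `path_to_id`, the `fetch` callback and the `clones` field are hypothetical, since the draft fixes neither the hash nor the object format:

```python
import hashlib
import posixpath

def path_to_id(path):
    # hypothetical transform: hash the path into the flat ID space
    return int.from_bytes(hashlib.sha256(path.encode()).digest()[:8], "big")

def lookup(fetch, path):
    """Fetch an object by its path-derived ID; if it is unavailable,
    ask the parent directory's object for the clone IDs."""
    data = fetch(path_to_id(path))
    if data is not None:
        return data
    parent_obj = fetch(path_to_id(posixpath.dirname(path)))
    if parent_obj is not None:
        for clone_id in parent_obj.get("clones", {}).get(path, []):
            data = fetch(clone_id)
            if data is not None:
                return data
    raise FileNotFoundError(path)

# In-memory stand-in for the network: the primary copy of /a/b is gone,
# but its parent directory's object knows a clone under ID 42.
store = {path_to_id("/a"): {"clones": {"/a/b": [42]}}, 42: "data"}
print(lookup(store.get, "/a/b"))  # → data
```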

Each write operation is transaction-based: each transaction is a separate object in the filesystem, stored on different servers according to the client's settings (how many clones). Information about how to apply transactions on top of some object is stored in the parent's object, in the same way as described above.

A transaction is committed (and can be removed) only after all clones have applied it; otherwise it must live in the network until explicitly removed.
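This lifecycle can be modelled as below; the class and field names are illustrative, not part of the draft:

```python
from dataclasses import dataclass, field

@dataclass
class Transaction:
    txn_id: int
    target_id: int          # ID of the object the transaction modifies
    clone_ids: frozenset    # IDs of the object's copies
    applied_on: set = field(default_factory=set)

    def mark_applied(self, clone_id):
        self.applied_on.add(clone_id)

    def committed(self):
        # removable only once every clone has applied the transaction
        return self.applied_on >= self.clone_ids

t = Transaction(1, target_id=500, clone_ids=frozenset({500, 21000}))
t.mark_applied(500)
print(t.committed())   # → False: must stay in the network
t.mark_applied(21000)
print(t.committed())   # → True: may now be removed
```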

Node join protocol.
The joining node looks up the pair of nodes whose IDs surround its own (called the next and prev nodes here, according to how the IDs compare). It sends a request to the next node to return the list of objects (and their attributes) that fall into the joining node's ID range. Let's assume the next node has ID N, the prev node P, and the joining node J.
1. Joining node J sends a request to next node N to grab the list of objects it hosts in (P; J] and to get the next node's routing table.
1a. The next node forwards all write requests to the joining node J, which blocks them until step 3 is completed.
2. Joining node J runs through the received list and selects the objects that are newer than those present on the joining node itself.
3. Those objects are fetched from the next node.
3a. All write requests forwarded from the next node are applied.
4. Joining node J connects to previous node P and announces that it is now in the network, so that the previous node updates its routing table. This announce can be sent to all nodes in the routing table received from the next node in step 1.
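The steps above can be sketched as an in-memory simulation; the Node class and its methods are made up for illustration and versions stand in for the object attributes the draft mentions:

```python
class Node:
    """Minimal in-memory node for illustrating the join protocol."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.versions = {}          # object ID -> version
        self.objects = {}           # object ID -> data
        self.routing_table = {}     # node ID -> node
        self.forwarded_writes = []  # writes queued during the join (1a)

    def list_objects(self, lo, hi):
        # objects this node hosts in (lo; hi]
        return {oid: v for oid, v in self.versions.items() if lo < oid <= hi}

    def fetch(self, oid):
        return self.objects[oid], self.versions[oid]

    def announce(self, node):
        self.routing_table[node.node_id] = node

def join_network(j, next_node, prev_node):
    # 1. Grab objects in (P; J] and the next node's routing table.
    remote = next_node.list_objects(prev_node.node_id, j.node_id)
    j.routing_table.update(next_node.routing_table)
    # 2-3. Keep and fetch only objects newer than our local copies.
    for oid, ver in remote.items():
        if ver > j.versions.get(oid, -1):
            j.objects[oid], j.versions[oid] = next_node.fetch(oid)
    # 3a. Apply writes the next node forwarded while we were joining.
    for oid, data, ver in j.forwarded_writes:
        j.objects[oid], j.versions[oid] = data, ver
    # 4. Announce so the previous node updates its routing table.
    prev_node.announce(j)

prev, nxt, joining = Node(100), Node(30000), Node(20000)
nxt.objects.update({15000: "a", 25000: "b"})
nxt.versions.update({15000: 1, 25000: 1})
join_network(joining, nxt, prev)
print(sorted(joining.objects))  # → [15000]: only IDs in (100; 20000] move
```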

Each node has a routing table, which corresponds to a tree indexed by node IDs. Each object in the tree holds the address of a remote node. Even a fairly large routing table will not take much RAM, and this redundancy (storing more than just the addresses of the immediate neighbours) greatly reduces the number of lookup messages needed to find the appropriate node. When a node receives a request that does not correspond to the IDs it hosts, it should forward the request to another node according to its routing table. If the routing table does not contain a node corresponding to the given ID, the request should be forwarded to the node nearest to that ID.
Some heuristics should be introduced to determine when to stop a lookup and return a no-entry error; for example, when the next node in the routing table has an ID less than the requested ID, an error should be returned. This relies on a correct routing table, which must be updated when the next node leaves or when nodes adjacent to the next node join.
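A sketch of forwarding with such a partial routing table; the class and method names are hypothetical:

```python
import bisect

class RoutingTable:
    """Partial view of the network: sorted node IDs with addresses.
    A request for an ID we do not host is forwarded to the first known
    node at or past that ID; if none is known, to the nearest known
    node below it."""
    def __init__(self):
        self.ids = []    # sorted node IDs
        self.addrs = {}  # node ID -> address

    def add(self, node_id, addr):
        if node_id not in self.addrs:
            bisect.insort(self.ids, node_id)
        self.addrs[node_id] = addr

    def forward_to(self, object_id):
        i = bisect.bisect_left(self.ids, object_id)
        if i < len(self.ids):
            return self.addrs[self.ids[i]]  # node that may host the ID
        return self.addrs[self.ids[-1]]     # nearest known node below it

rt = RoutingTable()
rt.add(4000, "10.0.0.1")
rt.add(30000, "10.0.0.2")
print(rt.forward_to(10000))  # → 10.0.0.2
print(rt.forward_to(50000))  # → 10.0.0.2 (nearest known node)
```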

When a node joins and fetches the list of updated objects from the next node, it may notice that some objects were changed and request the transaction list from other nodes in order to apply it. The transaction list can be obtained from the object that corresponds to the parent directory in the path-based object name.