Oh no, a White Paper!

The crypto currency/blockchain world loves its white papers, and IPFS is no exception. It started with the famous Bitcoin: A Peer-to-Peer Electronic Cash System white paper by Satoshi Nakamoto (which in turn references to another white paper you don't want to know about). We will have a look at the Bitcoin white paper in a future post when we dive into crypto currencies.

Two things you need to know about white papers:

They are distributed in PDF format (because it is harder to change than HTML!)

white papers are hard, scary and contain at least one mathematical formula.

To understand how IPFS works it's a good idea to walk through this white paper step by step. I'll limit this post to chapter 2 of the white paper, having a look at the underlying technologies:

Distributed Hash Tables

Block Exchanges - BitTorrent

Version Control Systems - Git

Self-Certified Filesystems - SFS

Mapped on a ISO like stack:

OK, here we go!

Distributed Hash Tables (DHT)

What is a DHT?

A DHT is like a Python dict or Perl hash (if you have the key, you can retrieve the value), but the data is distributed over multiple nodes. The Wikipedia article Distributed hash table gives a good introduction.

In the case of IPFS, the key is a hash over the content. So ask an IPFS node for the content with hash QmcPx9ZQboyHw8T7Afe4DbWFcJYocef5Pe4H3u7eK1osnQ and the IPFS node will lookup in the DHT which nodes have the content.

How the specific value is found efficiently (fast, with a little as possible network requests) and how to manage the DHT so that changes (nodes that enter/leave the network, or new entries to the table) are absorbed easily is different for the various DHT implementations that exist.

One such an implementation is called Pastry, and I like these two videos where the routing (how to find values) and the dynamics (how to handle node additions/removals) are explained really well.

Kademlia and friends

Back to the white paper. Three DHT implementations are mentioned explicitly, starting with Kademlia. Kademlia has a white paper of its own but let's not go there...

Kademlia is the DHT protocol that is used in almost all popular P2P systems and again the Wikipedia article on Kedemlia is a great introduction.

In short, Kedemlia uses the ID of the nodes to get step by step closer to the node with the desired hash (from the Wikipedia article):

When searching for some value, the algorithm needs to know the associated key and explores the network in several steps. Each step will find nodes that are closer to the key until the contacted node returns the value or no more closer nodes are found. This is very efficient: Like many other DHTs, Kademlia contacts only O(log(n)) nodes during the search out of a total of n nodes in the system.

The details can start to get quite complicated and I don't think it won't add to our goal of understanding IPFS, so we move on. If you like this sort of thing: this is a nice introduction and if you want to dig deeper the link to the Kademlia white paper is ^^^.

IPFS mentions two other DHT implementations that are added to standard Kedemlia:

So now we have that text there we would like to access it from the DHT, but apparently, the only ipfs dht get request that is supported is for keys starting with /ipns/.

OK, and so we create an IPNS and see if we can query that directly:

$ ipfs name publish QmfQKcXMLGvCxx6opNDwb1ptD1LJER6MPHdsMHCB1CXpFF # point a IPNS address to our content
Published to QmYebHWdWStasXWZQiXuFacckKC33HTbicXPkdSi5Yfpz6: /ipfs/QmfQKcXMLGvCxx6opNDwb1ptD1LJER6MPHdsMHCB1CXpFF
$ ipfs resolve QmYebHWdWStasXWZQiXuFacckKC33HTbicXPkdSi5Yfpz6 # check it# never returns, hmm IPNS doesn't seem to be ready for production
$ ipfs dht get /ipns/QmYebHWdWStasXWZQiXuFacckKC33HTbicXPkdSi5Yfpz6 # same thing but directly# does return binary data starting with 4/ipfs/QmfQKcXMLGvCxx6opNDwb1ptD1LJER6MPHdsMHCB1CXpFF

It sort of does work, but it is limited to IPNS for now it seems.

What else can we do with the DHT?

Use the DHT to find out which nodes can provide some data:

$ ipfs dht findprovs QmfQKcXMLGvCxx6opNDwb1ptD1LJER6MPHdsMHCB1CXpFF
QmYebHWdWStasXWZQiXuFacckKC33HTbicXPkdSi5Yfpz6
QmRf4ERGvYpVo6HRa2VueZT8pWi8YvyLS3rW6ad2y83tdN
# ^^^ there are both my nodes, so that works# Now ask the DHT for the address of the first peer that was returned
$ ipfs dht findpeer QmYebHWdWStasXWZQiXuFacckKC33HTbicXPkdSi5Yfpz6
/ip4/176.92.234.78/tcp/4001
/ip4/85.74.239.218/tcp/38689
/ip4/127.0.0.1/tcp/4001
/ip4/192.168.1.30/tcp/4001
/ip6/::1/tcp/4001
/ip4/192.168.1.3/tcp/4001
/ip4/176.92.234.78/tcp/40443
# Oooh nice! We will visit this address notation in the next episode.

All in all, we saw a bit of the DHT in action, but not as intuitive and functional as I hoped for.

Time to move on to the next exciting technology applied by IPFS:

BitTorrent

How does BitTorrent work?

We all know about BitTorrent, but some of us (yeah, me too) need to dig deeper to really get it. This presentation is a great introduction: Feross Aboukhadijeh: WebTorrent - JSConf.Asia 2014. The speaker talks about how he implemented a BitTorrent client that can run in the browser, thanks to WebRTC (which is a cool technology that might come in handy later on for our project). It also explains the working of the DHT again to drive it home.

BitTorrent and IPFS

The exchange of data (blocks) in IPFS is inspired by BitTorrent but is not 100% BitTorrent. The white paper mentions two BitTorrent features that IPFS uses:

tit-for-tat strategy (if you don't share, you won't receive either)

get rare pieces first (improves performance and more, see the first PDF above)

A notable difference is that where in BitTorrent each file has a separate swarm of peers (forming a P2P network with each other) where IPFS is one big swarm of peers for all data. The IPFS BitTorrent variety is called BitSwap and I'll go over that in the next episode.

Let's take a break

Pfew, that was a lot of theory and dry material, and we are not even halfway! So to let it all sink in and to free up the brain for the rest go for a walk in the park and listen to this podcast: The Quiet Master of Cryptocurrency — Nick Szabo. It goes all over the place, but I found it very interesting. It zooms out from bit level to things like "what is money?".

Highly recommended listening, and when you're done...

Version Control Systems - Git

The white paper's entire section on Version Control Systems is reproduced here:

Version Control Systems provide facilities to model files changing over time and distribute different versions efficiently.

The popular version control system Git provides a powerful Merkle DAG object model that captures changes to a filesystem tree in a distributed-friendly way.

Git only adds data. So blobs, trees, and commits are immutable. What the last version of either of these is, is determined by special references (see point 4 and 5).

Ad 2. Objects are content-addressed, by the cryptographic hash of their contents.

In git, the file or directory name is not used when they are referenced. Git hashes the content (or listing or commit) with SHA1 and uses these hashes in its database.
This article makes it very insightful how that works: Git under the hood (Start reading at "Now let’s look at how Git does all this")

Ad 3. Links to other objects are embedded, forming a Merkle DAG. This provides many useful integrity and workflow properties.

Oh my, a Merkle DAG! No worries, it is actually not as scary as it sounds.

A Merkle tree is a binary tree where the parent contains the hash of the concatenation of the hashes of the two children. This explains the integrity property: any change in a data block results in a change of the root node. With just a little bit of meta-data (uncles and parents, which can be untrusted) and a trusted root node, we can verify the validity of the block.