Storage and beyond

POSIX filesystem interface for Elliptics distributed storage

We in Reverbrain create storages. Starting from single-node search engines to multi-petabyte clouds spawning hundreds of servers. All of them are accessed mostly via HTTP API and sometimes using programming interfaces for Python, Go and C/C++

But there is a huge number of clients who do not need complexity of the HTTP API, instead they want to have locally connected storage directory with virtually unlimited free space. They will store and load data using existing applications which operate with files. It is very convenient for user to mount remote storage into local directory and run for example backup application which will copy local files into this directory and this will end up in several physically distributed copies spread around the world. Without any additional steps from the user.

The first incarnation of the POSIX filesystem interface for some simpler distributed storage we created was a network block device which contained a connection layer and depending on the block offset it selected a server to work with. That was a bit ugly way to work, since block device doesn’t know which object is being stored or read, and what locking should be performed. Locking was always either too coarse or too fine, it ended up performing a lot of operations for simple block transfer, it became obvious that locking has to be performed on the higher layer, namely in the filesystem. This distributed block device was named DST and it lived in linux kernel for couple of years.

The second approach was to implement a filesystem. We created POHMELFS – Parallel Optimized Host Message Exchange Layered FileSystem. Its first commits were imported at the very beginning of January 2008. Actually POHMELFS was not very active and it clearly became visible that existing Linux VM stack doesn’t really scale to billions of objects, we do not have enough resources to change that – that’s a huge task both from technical and political sides. We implemented several features which were then found in Parallel NFS and Ceph. POHMELFS lived in linux kernel for several years.

We removed both project from the linux kernel back then and concentrated on Elliptics distributed storage. And now its time to resurrect POSIX filesystem interface.

But we decided to move another way. Native filesystem interface is fast, but you have to implement it for every OS, this requires a lot of resources which will be wasted supporting different versions for different OSes. Do you know inode allocation differences between Windows 8 and 10?

We found that our clients do not need this, instead they want network attached directory which works pretty well using WebDAV protocol. Well, not exactly, since Windows clients do not support authenticated webdav, and some applications like NetDrive has to be installed, but it happened to be almost standard application for NAS/SAN surprisingly.

We implemented WebDAV server which supports HTTP authentication and connects to Elliptics storage. There are limitations both in WebDAV protocol and in our server, in particular we do not allow locking to be transferred among servers, i.e. if client connected to storage via one gateway and the reconnected using the other, interlocking will not see each other. But that should not be a problem, since webdav prohibits parallel update of any object.

We can create many private folders for every user, it is even possible to add features on top of user files like indexing for search, version control and so on, but that’s a different story.

The main advantage is that this distributed storage is cheap per gigabyte. You can add many commodity servers into Elliptics cluster this will just increase the size of the storage without interruption – system scales linearly to ~4Tb/day writes in our setups and 200+Tb/day in Yandex for example. You can also put replicas of your data into different datacenters – this is inherent feature of Elliptics, and if connection to one datacenter drops down, client will just work with the other replicas.

And all those features are now accessible via usual filesystem interface. It is possible to access data via HTTP or other APIs though.