
Now that we have the storage interface nailed down, let us talk about how we are going to expose it over the wire. The storage interface we defined is really low level, and not something that I would care to work with directly if I were writing user level code.

Because we are in the business of creating databases, we want to allow this to run as a server, not just as a library. So we need to think about what the wire interface is going to look like.

There are several options here. We can expose this as a ZeroMQ service, sending messages around. That looks really interesting, and would certainly be fun research to do, but it is a foreign concept for most people, and it would have to be exposed as such through any API we build on top of it, so I am afraid we won’t be doing that.

We can write a custom TCP protocol, but that has its own issues. Chief among them: just like with ZeroMQ, we would have to deal, from scratch, with things like:

Authentication

Tracing

Debugging

Hosting

Monitoring

Easily “hackable” format

Replayability

And probably a whole lot more on top of that.

I like building the wire interface over some form of HTTP interface (I intentionally don’t use the overly laden REST word). We get a lot of things for free; in particular, we can easily pipe stuff through Fiddler and debug it. That also influences decisions such as how to design the wire format itself (hint: human readable is good).

I want to be able to easily generate requests and see them in the browser. I want to be able to read the Fiddler output and figure out what is going on.

We actually have several different requirements that we need to support:

Enter data

Query data

Do operations

Entering data is probably the most interesting aspect. The reason is that we need to handle inserting many entries in as efficient a manner as possible. That means reducing the number of server round trips. At the same time, we want to keep other familiar aspects, such as the ability to easily predict when the data is “safe” and the storage transaction has been committed to disk.

Querying data and handling one off operations is actually much easier. We can handle it via a simple REST interface:
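As a sketch of what that might look like, here is the kind of query request I have in mind. The URL layout, parameter names, and values below are my own illustration, not a fixed API:

```python
# Hypothetical shape of the JSON-over-HTTP query side. Nothing here is the
# actual API; it just shows the flavor: one GET per query, everything in the
# URL, results coming back as JSON.
from urllib.parse import urlencode

def query_url(base, series, start, end):
    """Build the GET url for reading a range of entries from one series."""
    return f"{base}/series/{series}?{urlencode({'start': start, 'end': end})}"

url = query_url("http://localhost:8080", "cpu-usage",
                "2014-01-01T00:00:00", "2014-01-02T00:00:00")
```

The point is that the result is something you can paste into a browser’s address bar, or spot at a glance in a Fiddler trace.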

Nothing much to it, so far. We expose it as JSON endpoints, and that is… it.

For entering data, we need something a bit more elaborate. I chose to use websockets for this, mostly because I want two way communication, and that pretty much forces my hand. The reason I want a two way communication mechanism is that I want to enable the following story:

The client sends the data to the server as fast as it is able to.

The server aggregates the data and decides, based on its own considerations, when to actually commit the data to disk.

The client needs some way of knowing when the data it just sent has actually been flushed, so it can rely on it.

The connection should stay open (potentially for a long period of time), so we can keep sending new stuff in without having to create a new connection.

As far as the server is concerned, by the way, it will decide to flush the pending changes to disk whenever:

over 50 ms passed waiting for the next value to arrive, or

it has been at least 500 ms from the last flush, or

there are over 16,384 items pending, or

the client indicated that it wants to close the connection, or

the moon is full on a Tuesday
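Leaving the lunar clause aside, that decision is simple enough to sketch. The thresholds are the ones from the list above; the function and its parameter names are illustrative, not the actual server code:

```python
# Sketch of the server-side flush decision. The thresholds (50 ms, 500 ms,
# 16,384 pending items) come from the list above; everything else here is
# an illustrative stand-in.
def should_flush(ms_since_last_item, ms_since_last_flush,
                 pending_items, client_closing):
    return (ms_since_last_item > 50 or
            ms_since_last_flush >= 500 or
            pending_items > 16_384 or
            client_closing)
```

A client that streams without ever pausing still gets its data flushed at least twice a second, while a client that stops sending for a moment gets flushed almost immediately.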

However, how does the server communicate that to the client?

The actual format we’ll use is something like this (binary, little endian format):

enter-data-stream =
    data-packet+

data-packet =
    series-name-len [4 bytes]
    series-name [series-name-len, utf8]
    entries
    no-entries

entries =
    count [2 bytes]
    entry[count]

no-entries =
    count [2 bytes] = 0

entry =
    date [8 bytes]
    value [8 bytes]

The basic idea is that this format should convey just enough information to send the data we need, while being easy to parse and work with. I am not sure if this is a really good idea, because it might be premature optimization; we could just send the data as JSON strings, which would certainly be more human readable. I’ll probably go with that and leave the “optimized binary approach” for when/if we actually see a real need for it. The general idea is that computer time is much cheaper than human time, so it is worth the effort to make things human readable.
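For completeness, here is roughly how writing a single data-packet in that binary format could look. Two assumptions the grammar above does not spell out: the date is a signed 64-bit tick count and the value is a 64-bit float, both little endian like the length fields:

```python
# Sketch of encoding one data-packet from the grammar above. Assumes (not
# stated in the grammar) that date is a signed 64-bit tick count and value
# is a 64-bit IEEE float; everything is little endian.
import struct

def write_data_packet(series_name, entries):
    name = series_name.encode("utf-8")
    buf = struct.pack("<I", len(name)) + name      # series-name-len + series-name
    buf += struct.pack("<H", len(entries))         # count [2 bytes]
    for date_ticks, value in entries:              # entry = date + value
        buf += struct.pack("<qd", date_ticks, value)
    return buf
```

Note that a packet with an empty entry list comes out with a zero count, which is exactly the no-entries case.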

The server keeps track of the number of entries that were sent to it on each connection, and whenever it flushes the buffered data to disk, it sends a notification to that effect to the client. The client can then decide “okay, I can notify my caller that everything I have sent has been properly saved to disk”. Or it can just ignore it (the usual case for a long running connection).
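That confirmation scheme can be sketched like so; the class and message shape are mine, only the counting idea is from the text:

```python
# Sketch of the flush confirmation scheme: the server counts entries per
# connection, and the notification sent after a flush carries that running
# count, so the client knows everything up to it is safely on disk.
# Class and message shape are illustrative.
class Connection:
    def __init__(self):
        self.received = 0    # entries accepted so far on this connection

    def on_entries(self, count):
        self.received += count

    def flush_notification(self):
        # sent to the client after the storage transaction commits
        return {"flushed-up-to": self.received}

conn = Connection()
conn.on_entries(100)
conn.on_entries(50)
note = conn.flush_notification()   # {"flushed-up-to": 150}
```

The client just compares the count in the notification against its own running total of what it has sent.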

And that is about it as far as the wire protocol is concerned. Next, we will take a look at some of the operations that we need.

Rafal,
A large part of the time is spent in just establishing the connection.
In TCP, you have to do the 3 way dance, in HTTP, you have to do auth from scratch, etc.
If you keep the stream open, you can bypass a lot of those costs.

I read everywhere that a connection should be opened as late as possible and closed ASAP. How will you manage when you have 500 clients connected to your server?
And is "human readable" really an advantage here? Ticks are not human readable anyway.

Remi,
Human readable is a huge advantage in things like debugging, devops, troubleshooting, etc.
The actual on disk format is rarely of interest, but keeping the actual wire format human readable is essential.

And 500 clients what?
In this case, we are actually talking about persistent connections per server. It is unlikely that you'll have direct connections from 500 machines. You usually go through a web server for that. And even if you do, so what?
It isn't hard to hold that many connections.

I understand the ease of debugging via the wire with a human readable format, intercepting the message as it moves across the HTTP wire. I have two problems with this approach. First, you sacrifice performance by serializing and deserializing binary data to readable text at client and then server, respectively. Sometimes this is a huge cost. Second, my personal opinion is that debugging on the wire is the wrong place to debug. You're not in the code on the client or the server, so you really cannot diagnose the problem. The only thing you can do is determine if your serialization is working and whether your message is correct. And I would argue that this is better done on the client or the server, rather than in the middle. Just my two cents.

Ayende and Karhgath, I guess we'll have to agree to disagree on this one. For me, it seems that observing requests and responses is more properly done at one end or the other with logging. Debugging, at least for me, connotes being able to observe execution flow, including the use of break points and watching specific values. The value of observing data across the wire in human readable format when I have proper logging enabled and switchable, seems redundant and of little value, constraining your choices as to wire format. And with respect to tacking on an optimized binary protocol later, it has been my experience that such intentions are rarely ever realized.

You should go with ZeroMQ.
1. Performance is crazy.
2. Separating the data stream from the acks stream is a piece of cake; you just need to create more sockets.
3. Almost all buffering is handled for you (e.g., if you write 1M messages from the client, they are sent as batches on the wire).
4. The support for message parts allow you to create a binary format which is very fast and very simple to work with for the data insertion.
5. For the query interface, you could use json over the sockets.
6. Will be easy to scale to multiple machines.
7. Has support for UDP connections.

Ohadr,
You'll be surprised just how much you can figure out just from the request URI, when you are doing debugging.
If there is no simple solution, it really isn't an option for me here. I consider this to be pretty much a top priority requirement.