Before I start discussing this topic, I want to talk a bit about the speed of light. That pesky limit basically means that there in an inherent lag in passing information between any two points in space. On your daily life, you can mostly ignore it. The human brain is far too slow to perceive it, and even if you are working with computers, you can usually ignore the speed of light for anything less than about 500 miles.

But the speed of light is merely the hard upper limit of our ability to send information from one location to another. In practice, the lag time between any two computers connected to a network is much higher. In fact, if you are a gamer, you are very well acquainted with that fact.

Let’s assume that we have a need to do the following:

Hold a mutable state of some kind.

Modify it in a consistent manner.

Distribute it to many locations, where many is at least hundreds and potentially tens of thousands of locations.

Let’s break this up a bit to its component parts. A mutable state basically means that we can modify it over time, and that these modifications shows up in all locations. The speed of light (and network lag) ensures that we can’t have this happen instantaneously. And, of course, the killer requirement here is that we need to do this in a consistent manner.

I find it really hard to talk about abstract problems concretely so let’s use an example. We have a set of tax rules that determine how certain products should be taxed. The ruleset is big, changes on a regular basis and it is quite important to get things right. Oh, and we also need to do push those rules to tens of thousands of locations that would independently compute taxes for purchases.

In this case, the solution is quite easy. We have a single source of truth that all the locations can pull the tax ruleset from (directly or through mirrors). We’ll also ensure that tax rules are applied only at some future date from their creation, to allow time for all these locations to be updated. If a location hasn’t had an updated ruleset for over a certain amount of time, they can refuse to process any payments until they get an update. This example has really very little for us to do in term of design.

A better example would have multiple actors changing the data as we work with it. An actual example for when this is required can be when we have an ad provider that needs to run ads in many (physical locations). Each location is actually an IoT device that includes a computer, a screen and a camera. Each such device decides independently what ads should be shown (for example, if the camera see a baby stroller, they show ads for baby toys).

I actually had to search hard for an example that would work for this scenario but I think this is a good one. We obviously have a lot of activity on the ad provider with many people registering and modifying ads settings. We need that portion of our business to be consistent (this is what we are charging people for, after all). However, given the probably high amount of ad devices spread all over, we can’t rely on them always being connected to a central server or even on them being connected at all.

And even if there is no connectivity and nothing new to show, we must show something. Even if we aren’t getting paid for showing the ad, not showing something or (actually worse) showing an error is something that we can’t tolerate.

Consider the graph on the right. I already talked about this graph when I wrote about permission based graph queries.

In this post, I want to show off another way to deal with the same problem, but without using graph queries, and using only the capabilities that we have in RavenDB 4.1.

The idea is that, given a user, I want to be able to issue a query for all the issues that this user has access to, either directly (like Sunny in the graph), via a group (like Max, via project-x group) or via a recursive group, like (Nati, via project-x –> team-nati groups).

As you can image from the name of this post, this requires recursion. You can read the documentation about this, but I thought to spice things up and use several features all at once.

Let’s look at the following index (Issues/Permissions):

This is an JS index, which has a map() function over the Issues collection. For each of the issues, we index the Users for the issue and the groups (recursively) that are allowed to access it.

Here is what the output of this index will be for our issue in the graph:

Now, let’s look at how we can query this, shall we?

This query has two clauses, either we are assigned directly or via a group. The key here is in the recurse_groups() and inside that, the load() call in the index. It scans upward through the defined groups and their parents until we have a simple structure in the index that is easily searchable.

RavenDB will ensure that whenever a document that is referenced by a load() in the index is updated, all the documents that are referencing it will be re-indexed. In the case we have here, whenever a group is updated, we’ll re-index all the relevant issues to match the new permissions structure.

One of the core principles of RavenDB is that you can push more work to the indexing and keep your queries fast and simple. This is a good example of how we can arrange the data in such a way that we can push work to background indexing in a pretty elegant manner.

We test RavenDB as thoroughly as we can. Beyond just tests, longevity runs, load test and in general beating it with a stick and seeing if it bleats, we also routinely push early build of RavenDB to our own production systems to see how it behaves on real load.

This has been invaluable in catching several hard to detect bugs. A few days after we pushed a new build to production, we started getting errors about running out of disk space. The admin looked, and it seemed to have fixed itself. Then it repeated the issue.

It took a bit of time to figure out what was going on. A recent change caused RavenDB to hang on to what are supposed to be transient files until an idle moment. Where idle moment is defined as 5 seconds without any writes. Our production systems don’t have an idle moment, here is a typical load on our production system:

We’ll range between lows of high 30s requests per seconds to peeks of 200+ requests per second.

So we never have an idle moment to actually clean up those files, so they gather up. Eventually, we run out of disk space. At this point, several interesting things happen. Here is what our topology looks like:

We have three nodes in the cluster, and currently, node B is the leader of the cluster as a whole. Most of our operations relate to a single database, and for that database, we have the following topology:

Note that in RavenDB, the cluster as a whole and a particular database may have different topologies. In particular, the primary node for the database and the leader of the cluster can and are often different. In this case, Node A is the primary for the database and node B is the leader of the cluster.

Under this scenario, we have Node A accepting all the writes for this database, and then replicating the data to the other nodes. Here is what this looks like for node C.

In other words, Node A is very busy handling requests and writes. Node C is just handling replicated writes. Just for reference, here is what Node A’s CPU looks like, the one that is busy:

In other words, it is busy handling requests. But it isn’t actually busy. We got a lot of spare capacity at hand.

This make sense, our benchmark scenarios is starting with tens of thousands of requests per seconds. A measly few hundreds per second aren’t actually meaningful load. The servers in questions, by the way, are t2.large with 2 cores and 8GB of RAM.

So we have a bug in releasing files when we aren’t “idle”. The files just pile up and eventually we run out of disk space. Here is what this looks like in the studio:

And clicking on details gives us:

So this looks bad, but let’s us see how this actually turns out to work in practice.

We have a failure on one node, which causes the database to be unloaded. RavenDB is meant to run in an environment where failure is likely and is built to handle that. Both clients and the cluster as a whole know that when such things happen, either to the whole node or to one particular database on it, we should respond accordingly.

The clients automatically fail over to a secondary node. In the case of the database in question, we can see that if Node A is failure, the next in like would be Node C. The cluster will also detect that are re-arrange the responsibilities of the nodes to indicate that one of them failed.

The node on which the database has failed will attempt to recover from the error. At this point, the database is idle, in the sense that it doesn’t process any requests, and will be able to cleanup all the files and delete them. In other words, the database goes down, restarts and recover.

Because of the different write patterns on the different nodes, we’ll run into the out of disk space error at different times. We have been running in this mode for a few days now. The actual problem, by the way, has been identified and resolved. We just aren’t any kind of pressure to push the fix to production.

Even under constant load, the only way we can detect this problem is through the secondary affects, the disk space on the machines that is being monitored. None of our production systems has reported any failures and monitoring is great across the board. This is a great testament to the manner in which expecting failure and preparing for it at all levels of the stack really pays off.

My production database is routinely running out of space? No worries, I’ll fix that on the regular schedule, nothing is actually externally visible.

One of the things that I’m going to try to do while I move the code over is properly handle all error conditions. We’ll see how that goes.

I already have the concept of a server_state_run() method that handles all the network activity, dispatching, etc. So that should make it easy. I’m going to start by moving all the libuv code there. I’m also going to take the time to refactor everything to an API that is more cohesive and easier to deal with.

There is some trouble here, with having to merge together two similar (but not quite identical) concepts. My libuv & openssl post dealt with simply exposing a byte stream to the calling code. My network protocol code is working at a higher level. Initially I tried to layer things together, but that quickly turned out to be a bad idea. I decided to have a single layer that handles both the reading from the network, using OpenSSL and parsing the commands over the network.

The first thing to do was to merge the connection state, I ended up with this code:

There are a few things that are interesting here. On the one hand, I want to keep the state of the connection private, but on the other, we need to expose this out to the user to use some parts of it. The way libuv handles it is with comments denoting what are considered public / private portions of the interface. I decided to stick it in a dedicated struct. This also allowed me to get the size of the private members, which is important for what I wanted to do next.

The connection state struct have the following sections:

private / reserved – 64 bytes

available for user to use – 64 bytes (and aligned on 64 bytes boundary)

msg buffer – 8,064 bytes

The idea here is that we give the user some space to keep their own data in, and that the overall connection state size is exactly 8KB, so can fit in two OS pages. On Linux, in most cases, we’ll not need a buffer that is over 3,968 bytes long, we can even save the second page materialization (because the OS lazily allocate memory to the process). I’m using 64 bytes alignment for the user’s data to reduce any issues that the user have for storing data about the connection. It will also keep it nicely within the data the user need to handle the connection nearby the actual buffer.

I’m 99% sure that I won’t need any of these details, but I thought it is best to think ahead, and it was fun to experiment.

Here is how the startup code for the server changed:

I removed pretty much all the functions that were previously used to build it. We have the server_state_init_t struct, which contains everything that is required for the server to run. Reducing the number of functions to build this means that I have to do less and there is a lot less error checking to go through. Most of the code that I had to touch didn’t require anything interesting. Take the code from the libuv/openssl project, make sure it compiles, etc. I’m going to skip talking about the boring stuff.

I did run into an couple of issues that are worth talking about. Error handling and authentication. As mentioned, I’m using client certificates for authentication, but unlike my previous code, I’m not explicitly calling SSL_accept(), instead, I rely on OpenSSL to manage the state directly.

This means that I don’t have a good location to put the checks on the client certificate that is used. For that matter, our protocol starts with the server sending an: “OK\r\n” message to the client to indicate successful connection. Where does this go? I put all of this code inside the handle_read() method.

This method is called whenever libuv has more data to give us on the connection. The actual behavior is on ensure_connection_intialized(), where we check a flag on the connection, and if we haven’t done the initialization of the connection, we check i OpenSSL consider the connection established. If it is established, we validate the connection and then send the OK to start the ball rolling.

You might have noticed a bunch of work with flags CONNECTION_STATUS_WRITE_AND_ABORT and CONNECTION_STATUS_INIT_DONE. What is that about?

Well, CONNECTION_STATUS_INIT_DONE is self explanatory, I hope. This just tells us whatever the connection has already been checked or not. This save us the cost of validate the client cert of each packet. Usually, SSL handshake means that we could do this check only inside the “need to read more from the network”, but I think that there are certain communication patterns in which the SSL handshake could be completed and the packet will already have additional encrypted information for the connection. For example, I’m pretty sure that TLS 1.3 0-RTT is one such case. This is why the ensure_connection_initialized() is called twice in the code.

Of more interest is the CONNECTION_STATUS_WRITE_AND_ABORT flag. This is set in one of two locations. First, if we fail to validate the certificate for the connection. Second, if we failed to process the message that was sent to us (inside read_message()).

In either case, we want to close the connection, but we have a problem: Error handling. We use libuv for all I/O, and that is asynchronous in nature. We want to write an error to the other side, to be nice, and in order to do that, we need to process the write and keep the connection around long enough that we’ll actually send it to the other side. Because of this, when this flag is set, we have the following behaviors:

Any newly available data on that connection is immediately discarded

The next write will flush all the data to the network, wait for confirmation that this was sent and close the connection.

This works very nicely to allow me to abort on an error and still get really nice errors on the other side.

I want to move my simple blocking socket based code to use libuv, so to allow more than a single connection per thread. The catch is that I also want to do that with TLS, and that seems to be much harder. There are a bunch of GitHub projects that talks about this, but as I know nothing about libuv (and very little about OpenSSL) I decided to write own TLS echo server with libuv to get better understanding of how it all play together.

Sit tight, this might take a while to explain. This is a complex topic and it took me a couple of nights of hacking to get it work, and then a lot of thinking into simplifying this to something that I actually like.

There seems to be great documentation for libuv, which is awesome. I went over the simple echo server sample and it seems relatively straightforward. Making the jump to using TLS is a bit harder. OpenSSL make it really easy to setup SSL on a socket file descriptor and read/write to it. There is even support for non blocking operations, but I didn’t want to be forced to write my own select()/poll() code, so how can I integrate these two libraries?

OpenSSL has the notion of a BIO abstraction, which stands for Basic I/O. Basically, this is a stream abstraction. One of the options that OpenSSL has available is the memory BIO. So the overall idea is to:

Setup libuv to accept a connection

Setup OpenSSL with the server side configuration

When a new connection comes through, setup a new SSL instance from SSL

Read data from the socket and pass it to the SSL instance and vice versa

Enjoy encrypted communication

The devil is in the details, naturally. The most complex part, after getting the initial handshake to work, in my experience, is the fact that you can get re-negotiation at any time which mean that a write request will fail with need more read data. That really complicate the amount of state that you have to manage.

Basically, on every SSL_write when managing your own state, you may need to do SSL_read and then retry to previous write. The simplest scenario that we have here is when SSL_accept() on the connection, which results in the following code to manage this state:

To handle a read, we need to check, after every read if the act of reading caused us to need to write (client wants to renegotiate the connection, so OpenSSL needs to send data on the connection, which we need to orchestrate) before we can do the actual read. For writes, we need to remember what we are writing, read and write from the network and then repeat our read. This is awkward to do when using synchronous calls, but the amount of state that we have to keep in async and callback driven programming is a lot. I got it working, but it was really hard and mostly a big house of cards.

I really didn’t like that approach, and decided that I should go about it in a very different way. I realized that I had a very conceptual error in how I approach libuv. Unlike standard async programming in C#, for example, libuv is based on the idea of a loop. In other words, unlike in the code above, you aren’t going to setup the next read from the network after each one. That is already done for you. You just call un_read_start() and you’ll get served the data from the network whenever it is available. You can also inject your own behaviors into the loop, which make things really interesting for ourselves.

Here is the logic, we continuously read from the network and pass the buffer to OpenSSL. We then try to read the decrypted data from SSL_read(). This can fail because we are waiting for more data, and that is fine. We’ll be called again when there is such data. However, we’ll also add a step at the end of the I/O loop to check if there are any pending buffers that needs to be flushed to the network. For writes, if we fail to do the write because we need to read, we’ll register the write to be executed later and wait for the network to send us the read operation.

Given that C isn’t an OO language, I think that I’ll start explaining what is going on from the structs that hold the system together and then the operations that are invoked on them:

The first thing to note here is that we have clear layers in the code. We have the connection_handler_t in here, which is a bunch of function pointers that allow higher level code to work with a connection abstraction. The first portion of the code defines the interface that I expect callers to use. As you can see, we have a few functions that deal with creating, establishing and tearing down a connection. We also have the most common operations, reads and writes.

The write method is pretty obvious, I think. You give it a buffer and it takes care of writing it to the other side. Note that this is an asynchronous process, and if there are any errors in the process, you’ll get them in the connection_closed callback. Reading, on the other hand, is completely out of your hands and will be invoked directly by the lower level code whenever it feels like it. This inversion of control may feel strange for people who are used to invoking I/O directly, but it likely allow you better overall performance.

Now that we have the interface, let’s build a TLS echo server with it. Here is how that looks like:

You can see that there isn’t really much done here. On connection creation, we simply allocate a space for tls_uv_connection_state_t. This is a callback because your code might want to allocate more space for whatever stuff you want to do in the per connection structure. When the connection is established (after the SSL negotiation, etc), you get a chance to initiate things from the server side. In the code above, we simply let the client know that the connection has been successful. From that point on, we simply echo back to the client anything that they send us.

The SSL and libuv initialization are the bare bones stuff and not really interesting. The nice bits happen in the end of the snippet, where we define the overall server state and wire together the protocol definition.

That is great, but where the part where stuff actually gets done?

A note about this code. I’m writing this primarily for ease of reading / understanding. I’m ignoring a lot of potential errors that in production code I would be obliged to handle. That would significantly complicate the code, but must be done if you want to use this code for anything but understanding the overall concept.

Let’s finish setting up the libuv machinery before we jump to any other code, shall we. Here is what this looks like:

This is fairly straightforward. We are listening to a socket and binding any incoming connection to the on_new_connection() callback. There is also the after_io preparation stuff, which we use to handle delayed operations (I’ll talk about this later). For now, I want to focus on accepting new connections and processing them.

There is quite a lot that is going on this method, and not all of it is obvious. First, we handle accepting the connection and binding its input to the libuv event loop. Then we create a connection and setup some of the SSL details.

We create an SSL instance for this connection and create two Basic I/O instances that reside in memory. One for the incoming stream and one for the outgoing stream. We’ll be using them to pass data through the OpenSSL encryption, negotiation, etc. We also mark this as a server instance.

Once that is done, we invoke the connection_established() callback and then tell the libuv event loop to start pumping data from this socket to the handle_read() callback. For now, I want to ignore the connection_established() callback, it isn’t important to understand the flow of the code at this point (but we’ll circle back to it). It is important to understand that by the time we call to this callback, the connection is ready to use and can receive and send data. Well, not receive, because we don’t provide a way to pull data from the connection, we’ll be pushing that data to the provided callback. This will happen by libuv calling to the handle_read() method whenever there is data on the socket. Here is how we handle this:

When libuv calls us with some data, we write this data into the read buffer for OpenSSL and then call SSL_read() to get the unencrypted data that was sent to us. There are some issues here. First, the SSL/TLS has framing, and the amount of data that your read from the network isn’t going to be the amount of unencrypted bytes that you get in the end. Another issue is that we need to be careful about re-negotiations, which are generally permitted at any point, but can cause a read to do a write (and may require a write to read).

You might have noticed that this code contains absolutely no indication of this. Instead, we call SSL_read() to get the plaintext data from OpenSSL. We continue to do this until we get an error from SSL_read(). This can be either a real error or an indication that we need to read more from the network. Whenever I get some bytes from OpenSSL, I pass them directly to the read() callback that was provided to us.

If you examine the code carefully, you’ll see that when we run out of data to read, we try to flush the SSL state of the connection. Let’s look at what that method do:

We check if the connection is already in the queue and if it isn’t we check whatever it should be added. There are two reasons why a connection should be added to the pending_writes queue. First, we may have data buffered in the write buffer of the SSL connection, which needs to be sent over the network. Or, we may have failed writes that we need to retry after we read more data into the SSL connection.

You might notice that we are doing some pointer hopping in the process of registering the connection in the queue. This is basically using a double linked list and will be important later. If we are putting stuff into a queue, what is going to be reading from this queue?

Remember that when we setup the libuv stuff, we used the after_io prepare handle? This is called as the first step in the loop, just before we check if there is any I/O to process. This give us the chance to deal with the confusing read on write and write on read nature of OpenSSL in a more structure manner. Let’s first look at the code, and then see how this all play together.

This is what actually handle writing to the network. We take data from the SSL write buffer and send it to the network. Once the write is done, we free buffers that were held for this operation and check if there was any issue with the write (if so, we abort the connection). This is all being driven by this method, which is called before we check for available I/O.

There is quite a lot that is going on in here. First, we iterate through the pending writes for all the connections we have. For each of the connections, we flush the SSL buffer and then check if we have pending writes to process. If we don’t, we can remove the connection from the queue, our work is done. If we do have any pending writes, we need to handle them.

I do that by using SSL_write(), which will write them into in memory buffer. I continue doing so until one of the following happens:

I run out of pending writes.

I run out of buffer space and need to flush.

I need to re-negotiate and need to read from the network

In the first case, I’ve successfully pushed the data to the SSL buffer, so I can call flush_ssl_buffer() and then remove the connection from the queue. In the second case, I’ll flush the SSL write buffer and try again.

However, in the last case, I’m just aborting the writes. I need to do a read, and that will be handled on the next iteration of the libuv loop. There is some bookkeeping there to make sure that if we successfully wrote data into the SSL buffer, we won’t be writing that again, but this is pretty much it. You’ll note that I’m playing games with pointers to pointers there to get clean code on the code that consumes the queue but allow me to skip one of the steps in the linked list without removing it from the list.

This is pretty much it, I have to say. We now have a system where both writes and reads work in conjunction to get the proper SSL behavior, even when we have renegotiation going on.

One thing you’ll not find in this code is a call to SSL_accept(), or indeed any behavior related to explicitly managing the SSL state. I’m letting OpenSSL handle all of that are rely on the fact that I SSL_write() and SSL_read() will handle renegotiations on their own for me.

Let’s do a simple walk through of what is going on with the connection of the TLS echo server.

On connection established (and before we read anything from the network), we call to connection_write():

This is fairly straightforward. We try to write to the buffer, and if we are successful, great. The check_if_need_to_flush_ssl_state() will take care of actually sending that to the client.

If the write buffer is full, we empty it and try again. The interesting thing happen when we need to read in order to complete this write. In this case, we copy the data to write and store it on the side, then we proceed normally and wait or the libuv to deliver the next read buffer for this connection. When that is done, we’ll be sending the deferred write to the client.

It may be easier to explain the flow with a real example. When a new connection comes into the server, we create a new SSL context and then we call:

connection_write(connection, "OK\r\n", 4);

This is the very first time that we actually interacts with the SSL instance and the call to SSL_write() is going to fail (because we haven’t established the SSL connection) with a SSL_ERROR_WANT_READ message. In response for this, we’ll copy the buffer we got and place it into the pending_writes of this connection. We also start listening to new data on the connection. The client will send the ClientHello message, which we’ll read and then feed into the SSL instance. That will cause us to write the SeverHello to the in memory buffer. When the check_if_need_to_flush_ssl_state() will be called, it will flush that message to the client.

Eventually, we’ll get the connection established and at this point we’ll be sending the deferred write to the client.

There are a bunch of other details, but they aren’t crucial to understanding this approaching. You can find the whole code sample here. I’ll reiterate again that it doesn’t have proper error handling, but it is less than 350 lines of C code that does something that is quite nice and expose an API that should be quite interesting to consume.

I’m really interested in feedback on this blog post, both on whatever this approach make any sense and what do you think about the code.

Now that I’m actually doing real work with input from the network, I thought it would be a good time to stop and take a look at whatever I’m exposing stuff. C is known for buffer overruns and security issues, and compounding that with network software that accepts untrusted input, that is something that we should take a look at.

The first line of defense is to use Valgrind and see if it reports any errors. It reported a memory leak (I didn’t free the command’s buffer, it seemed), which was easy to fix. But it also reported a much more serious issue:

Conditional jump or move depends on uninitialised value(s)

This is not something that I wanted to see. The nice thing about Valgrind is that it prints a nice stack trace, even if in this case, it didn’t make any sense. It was deep inside the strcmp() function, which I assume is fine. I dug around on how this warning in implemented before I got what was going on. Basically, I was handing strcmp memory that was never initialized, which caused this warning.

Here is the relevant piece of the code:

Take a look and see if you can see the error.

It might be helpful if I told you that this is an error that would only be raised if I’m writing to the stream manually, not via any client API.

It took me a while to figure out what was going on. This piece of code is meant to be called multiple times, building a single buffer (of up to 8KB in size) from potentially multiple network reads.

I had a small optimization there to avoid scanning from the beginning of the string and just scan from where we already scanned. This is the to_scan variable. As it turned out, this darling had nasty consequences.

Look at line 7 in the code sample, I’m telling strnstr() to start reading from the string from the specified position, but I pass the original size. I read past the end of the buffer. Likely still in my own memory, but that would almost certainly have caused issues down the road, and it is easy to construct a sequence of operations that would cause me to thing that a message is over when I haven’t finished actually sending it (reading the \r\n\r\n divider from a previous message).

Once that was fixed, it was green across the board. But I’m not a C programmer, and I’m not sure if there are other stuff that I should be looking at. I’m using string functions with explicit length, doing proper error checking, etc. Code review for this hasn’t show any issue, but I’m sure that there are more stuff there.

The actual code is about 100 lines of C code that I think is fairly straightforward. I would be very happy to hear what else I can do to this piece of code to increase my confidence in it.

So far, I did a whole lot of work around building the basic infrastructure of just building a trivial echo server with SSL. But the protocol I have in mind is a lot more complex, let’s get started with actually implementing the parsing of messages.

To start with, we need to implement parsing of lines. In C, this is actually a decidedly non trivial operation, because you need to read the data from the network into someplace and parse it. This area is rife with errors, so that is going to be fun.

To make things simple for myself, I’m going to define the maximum size of a message as 8KB (this is the common size in HTTP as well). Here is how I decided to represent it in memory:

The key here is that I want to minimize the amount of work and complexity that I need to do. That is why the entire message is limited to 8KB. I’m also simplifying how I’m going to be handling things from an API perspective. All the strings are actually C strings, null terminated, and I’m using the argv, argc convention for naming, just like in the main function.

This means that I can simply read from the network until I find a “\r\n\r\n” in there. Here is how I do this:

There is a bit of code here, but the gist of it is pretty simple. The problem is that I need to handle partial state. That is, a single message may come in two separate packets, or multiple messages may come in a single packet. I don’t have a way to control that, so I need to be careful about tracking past state. The connection has a buffer that is used to hold the state in memory, whose size is large enough to hold the largest possible message. I’m reading from the network to a buffer and then scanning to find the message separator.

If I couldn’t find it, I’m recording the last location where it could be starting, and then issuing another network read and will try searching for \r\n\r\n again. Once that is found, the code will call to the parse_commnad() method that operates over the entire command in memory (which is much easier). With that done, my message parsing is actually quite easy, from a conceptual point of view, although I’ll admit that C make it a bit long.

I’m copying the memory from the network buffer to my own location, this is important because the read_message() function will overwrite it in a bit, and it also allow me to modify the memory more easily, which is required for using strtok(). This basically allow me to tokenize the message into its component parts. First on a line by line basis (with splitting on space for the first line and then treating this as headers lines).

I added the ability to reply to a command, which means that we are pretty much almost done. You can see the current state of the code here.

You might be familiar with Moore’s law, which states that the number of transistors in a dense integrated circuit doubles about every two years. In effect, that performance doubles every 24 months. For many years, that has certainly held true. But that hasn’t been the case for the past 10 years or so. Even when Moore’s law held true, there was a snag. Wirth’s law is also in effect (as an aside, read his article “A Plea for Lean Software”, 23 years, and it holds true today) and Wirth’s law states that software is slower quicker than hardware is becoming faster. It’s good to be a software developer, because even when the CPU clock speed doesn’t jump all the time, we still get more CPUs to play with. A common approach to handle performance issues today is just to throw more parallelism at the problem until it shuts up.

To a certain extent, this make a perfect economic sense. In the calculus between developers’ salaries and the cost of hardware, you’ll usually find that buying a couple extra servers is drastically cheaper than spending another 6 months improving the performance of the system. Jeff Atwood wrote about the topic a decade ago and I think that he is still very much correct, to a degree. There are other factors to consider, which is the overhead over time and in particular in the cloud. One of the major factors to take into account here is that when you are running in the cloud, you aren’t running on your servers and you are charged on usage. That can change the math by quite a bit.

For example, if I bought a couple of servers, the number of IO operations that I make is pretty much meaningless to me. If I hitting the disk ten times a second or ten thousands times a second, I’m paying the same cost. Oh, sure, I might need to buy a better hard disk to get 10,000 IOPS, but that is a one time cost, and usually not that meaningful in the grand scheme of things. But when you are on the cloud, getting higher IOPS cost more, and it cost more over time. In the same sense, in my data center, the cost of querying a database is zero. In the cloud, you will typically be charged some (miniscule) amount on a per query basis. Nothing to worry about, except that your software still work according to Wirth’s law, so you are making more queries than you should, which means that you are charged for each and every one of them.

I used to make my living by being a database performance consultant. I would go to a customer, look at how they are using their database and optimize the access pattern. It was common to see 90%+ savings in the number of queries for common operations. I was doing that because that directly translated to better responsiveness of the application. Applications that respond faster are more pleasant to use and there are numerous studies about faster applications generating higher revenues. I remember talking to clients and explaining to them why they should invest in the overall performance of the application before they hit the total resource depletion that would take them down.

Today, in the cloud, I would have a much simpler task. Let’s assume a simple application using CosmosDB, as an example. With 200 page views / sec on the site, and each page view generating 80 requests to the database, that gives us a total of 16,000 requests a second, which translates to an end of the month bill of about 10,000$. I’m using the 80 queries / page view as a reasonable low ball estimate, mind. Drupal, for example, does 300 – 400 queries per page view and it is easy to get to these numbers without paying attention. Dropping the number of queries per page to 10 (which is usually pretty easy to do with proper queries, attention to details and some caching) gives you a database bill that is around than 1,000$.

Over the span of a year, that is enough to pay a full time developer that can go and find other places where your software can be improved. And unlike before, you don’t need to justify with studies or any indirect causation. You can point directly to the bottom line in an invoice and show how much money is saved.

“I’m getting a 403 Forbidden error” is one of the more annoying things to debug. Something, somewhere, in a distributed system, has decided that a request is not authorized and blocked it.

In RavenDB 3.5, we supported OAuth and Windows Authentication to authenticate clients talking to RavenDB. That meant that we had to field support questions exactly like the one above. That was not fun. First of all, there seem to be an inclination in the security community to hide errors really well. “I don’t like you because the difference between our clocks, let’s send an EPERM to this socket and close it, and if I’m feeling nice, log it to /var/logs/obscure/dev/null”.

I have scars from trying to work out “Why doesn’t Windows Auth work for this user” that involved talking to a DBA who wasn’t neither the domain admin nor even aware of the domain topology in the organization (Fallacies: There is one administrator). At one point, we had a production issue that had RavenDB refusing all access because Windows couldn’t validate credentials because of a disconnect in a linked domain that was down for scheduled maintenance and the credential cache expired.

When we designed RavenDB 4.0, we decided that for security, we need something that would be debuggable. That means:

Secured – that goes without saying, if it is security measure that is debuggable but isn’t actually secured…

Rely only on two parties – the client & server in each connection.

Provide enough information to solve the problem

We selected TLS / SSL for this, with client certificate as the authentication mechanism. This answers the first requirement, because TLS has been analyzed enough that I’m certain that it is secured. It is also well known and familiar to administrators.

TLS uses PKI, so it isn’t technically a two parties solution, you may have to deal with certificates revocation, trust chains, etc. But those are well understood and you have really good errors on those. On the client certificate authentication side, however, we require that we’ll have the actual list of trusted certificates, so there is no need to check anywhere else.

We have also taken debuggable security a couple of steps further. For example, we’ll accept an unsecured connection and let it establish itself. Then send a single message down the line, explaining why we aren’t going to use it.

I decided that this is time to take my network protocol and make it cross platform, so I tried to compile it on the Linux subsystem for Windows. As an aside, the fact that I could edit everything in Visual Studio while compiling using GCC and having immediate feedback is amazing, given the usual workflow that this entails. And I would very much like to hear about an IDE that is comparable to Visual Studio out there.

I got to say, the situation for dependencies on C/C++ is flat out horrible. I’m depending on OpenSSL for this code, and I have used VCPkg for actually setting up the dependency. On WSL, on the other hand, the OpenSSL packages are from 2014(!), so they were incompatible. I re-compiled the lastest stable on WSL and tried to get it to work. It didn’t, first it didn’t put the new code in the right place and when I forced it to use the right paths, it failed with missing methods. It looks like I’m spoiled from the point of backward compatibility, but several methods has been deprecated or flat our removed.

I fixed that stuff only to find out that what the WSL version requires is actually newer than what the OpenSSL version I have on the Winodws machine has. I could compile the latest on Windows as well but for now it was just easier to conditional compile the stuff that I needed. I find it sad, but I want to get things done.

After a lot of grunt work, I’m happy to say that this looks like it is actually working. You can find the code for that in this link. I also wrote a small C# code to connect to the server, which looks like:

I gotta admit, the difference in the lines of code and complexity between the two code bases is pretty staggering.

I have to admit, I had an ulterior motive behind wanting to run on Linux, I wanted to see just how badly I managed to manage memory in this short bit of C code, and Valgrind is one of the more common tools to do that.

Of course, Valgrind doesn’t run on WSL, so I had to build that from source as well. I have to say, I was really careful about the actual memory usage and freeing everything. Valgrind was still able to find an issue within the first two minutes of me running it. I got to say, I’m really impressed. Here is the leak:

As you can see, this is happening because I’m using client certificates and it is deep inside OpenSSL (narrator voice: it wasn’t a problem with OpenSSL).

This was a bit of a problem to try to figure out, because I was really careful and properly match each allocation call with the call to release it (in this case, SSL_new() with SSL_free()). I went over the code to establish and tear down a connection many times, but couldn’t see any problem.

The issue here is that Valgrind, in this case, shows me where the memory was allocate, but the leaks is actually elsewhere. OpenSSL implements reference counting for many objects, and as part of the client certificate usage I’m getting the client’s certificate an examining it. The act of getting the certificate increment the ref count on that certificate. I have to explicitly release that certificate once I’m done with it. Naturally Valgrind wasn’t able to figure all of that and just told me that I actually have a leak.

Lesson learned, look at the allocation stack, but don’t marry it or assume it is known correct. Here is the latest drop of code, which is able to pass Valgrind with no leaks detected.