And once it was pointed out, I was cringing inside because it was both obvious what was wrong and I knew it was wrong, and we still had this in the code (which is why audits are great).

What is the problem here? Conceptually, this is what this method does:

This works perfectly. Except that it leaks information to a potential attack. In particular, it leaks timing data. You might expect that the time spent comparing a couple of 32 bytes values wouldn’t matter, given that an attack is probably over a network connection, but it does. There is a great paper that shows how you can use such timing attacks to get private keys information.

In short, anything that uses security needs to use constant time comparisons. Again, conceptually, this means changing the code to be:

So regardless of the result, we’ll always do the same amount of work and won’t expose details through different processing times. In general, by the way, algorithms and execution of code in cryptography attempt to avoid anything that branches on the secret information, because that leaks details to attackers. This is also why AES cannot be securely implemented in software, you always leak in that mode.

The fix, by the way, was to use the sodium_memcmp, which ensures constant time operation.

The second finding in the security report was Inaccurate Key Size. Basically, we initialized a 512 bytes buffer as the master key we will use if the user didn’t give us one. The problem is that our encryption algorithms were using 256 bits out of this range.

The 512 bytes value wasn’t selected at random. This is the minimum sector size on all hard disks that you are likely to encountered and it was chosen to ensure that writing this value to disk would be atomic.

Fortunately, nothing in the code actually depended on that atomic property and it was fairly misleading. While it isn’t as if we are likely to run out of random numbers, reading the code and understanding that you use a 4096 bits buffer as the encryption key but only expect the first 256 bits to be used (note the difference between bits & bytes in this post, they matter) is confusing.

Another issue was that we reused this value to several different encryption algorithms, all of them were taking the same size key, but while that works, for code clarity and ensuring the longevity of the code, it is better to explicitly separate them.

In this case, it was very much the case of premature planning causing us to use something that was a fair bit more complex in practice than what we needed. The solution there was to be explicit and separate the requirements, even if we had to write a bit more code, the cost over time would be much lower because it is not clearer what is going on and you don’t have to guess from prior knowledge.

Continuing our discussion on nonce reuse issues that were raised in the security report, I want to talk about the way we encrypt the most important thing, your data.

RavenDB uses an algorithm called XChaCha20Poly1305 with 256 bits key to encrypt the data. But as we have learned, just using a key is not good enough, we need to use a nonce as well. This is easy when you need to encrypt a message in one go, but the way encryption in RavenDB works, we need to encrypt pieces of the data (randomly, depending on the way users are accessing the system).

In order to do that, RavenDB encrypt each page (usually 8KB in size) independently of each other. We actually use a different key for each page, derived from the master key, but I’ll touch on that in a different post. Here, I want to talk about nonce.

Encryption today is not just about hiding data, it is also about being able to detect if the value has been tampered with, typically called authenticated encryption (AEAD). The algorithm we use requires 16 bytes for the message authentication code (MAC). The problem is that we need to store that MAC somewhere, and the nonce as well. And that value cannot be in the page itself, since that is encrypted.

Luckily for us, we have the page header, a 64 bytes that are reserved at the beginning of each page. And we planned things accordingly to ensure that RavenDB will use only 32 bytes out of the header, giving us 32 bytes free for the encryption to use. The problem is that the XChaCha20Poly1305 algorithm uses a 16 bytes MAC and a 24 bytes nonce. And that is a bit too much to fit in 32 bytes, as you can imagine. Here is the kind of space allocation we have:

Increasing the size of the page header will have repercussions throughout our system, very late in the game, so we didn’t want to do that. Instead, we cheated. The 16 bytes of the nonce are generated using a cryptographic random number generator, but we pass a pointer to the page header 8 bytes before the nonce, so the encryption algorithm also takes the last 8 bytes of the page header itself into account in the nonce. We are guaranteed at least 128 bits of strong randomness there, and the page header itself will change from time to time, obviously, but we rely on the nonce random bytes to ensure uniqueness.

In this manner, we are able to fit the encryption requirements into our existing file structure and have strong encryption without uprooting everything.

Nonce reuse was an issue in four separate locations in the RavenDB security report. But what is a nonce? And what does this matter? A cryptographic nonce is a number that can only be used once.

Let’s consider what encryption does. Given some initial state (a key, for example) it takes an input and outputs what to an outside observe should look like completely random noise. Let’s assume that I have the following secret message that I want to send: “Attack at dawn”. I run it through my most sophisticated encryption algorithm (with a pre-shared key) and get the following secret message:

✏

Assume that I have an adversary that is capable of intercepting such messages, even if they don’t have the key. What can do with this knowledge?

Well, if I’m always using the same key, and encryption is a pure mathematical computation, that means that encrypting the same string twice with the same key is going to result in the same encrypted output. Now, assume that I have some way to get you to encrypt a message of my choosing. For example, if I know that in reaction to something that I will do you’ll send a message saying “Attack imminent”, I can move some troops and then watch for a message to go by:

✏

By comparing the two messages I can deduce that this: “✏” = “Attack”. From there, I can probably crack everything else in a short order.

Now, to be fair, anything above is very far from how things actually behave, but it should allow you to build a mental model of what it going on and why this is important. If you are interested in learning cryptography, I highly recommend the book Serious Cryptography.

One way of avoid these issues to to not generate the same output for the same input each time. In order to do that we need to add something to the mix, and that is the nonce. The nonce is some number that is added to the state of the encryption / decryption and will ensure that two identical messages are not going to generate the same output (because they aren’t going to use the same nonce).

It’s important to understand that without a nonce, you don’t actually need to have identical inputs. In fact, the risk is that an attacked will get two different encrypted messages with the same key. At which point, depending on the exact encryption algorithm used, the attacker can get quite far into breaking the encryption. Again, I’m skipping over a lot of details here, trying to give you the general idea rather than the details.

Pretty much all cryptographic protocol have the notion of a nonce. Something it is called IV, but that generally has the same purpose and it seems like nonce is a more popular term these days.

That leads to an interesting issue, if you reuse the same (key, nonce) pair to encrypt two different messages, it is game over, from a cryptographic point of view. So you really want to avoid that. In general, there are two ways to do that. Either use a counter and increment that each time you encrypt a message or generate a random number that is big enough that collisions don’t matter (usually, 192 bits number).

The first finding in the report was the use of a 64 bits randomly generated nonce. The problem is that this is suspect to a birthday attack and a 64 bits value gives us only 232 level of security, and that is low, given today’s standards. A proper way to handle that is to use a 192 bits number. In order to attack that you’ll need 296 attempts, and that is 79,228,162,514,264,300,000,000,000,000 attempts, which is safe. The answer here was to change the encryption algorithm to one that indeed uses a 192 bits nonce and generate that using a cryptographically secured random number generator.

The third finding in the report had the same issue of 64 bits value, but in a somewhat nastier form. We were accepting the secret and entropy from our callers, and that gave them too much control over what we can do. We changed the code so we’ll only accept the secret to be encrypted and handled all the cryptographic details (now using 192 bits randomly generated nonce) directly, instead of exposing details that can be incorrectly used.

The final nonce reuse is a bit more complex to explain, and I’ll dedicate a post just for that.

In Jan 2018 we asked Edge Security to do a thorough review of RavenDB security and cryptography usage. We wanted to get an outside expert opinion before the RTM release, to make sure that we put out a system that is robust and secured.

As an aside, I strongly recommend doing such a thing on major version releases (at least). Especially if you need an expert opinion, and security is certainly one area in which you want to have things verified.

In the spirit of full transparency, we have made the full report available here. I want to point out that all the issues that were raised in the report were fixed before the RTM release, but I think that it it worth going over each of the items that were brought up in the report and explore them. We have tried our best to create a secured system and it was… humbling to get the report and see fifteen different locations where we failed to do so.

Security is hard to do and even harder to get right. The good news from our perspective was that all those issues were high risk in terms of their impact on the security of the product, but minor in terms of their effect on the overall architecture, so we were able to fix them quickly.

I’m going to take the time now to address each type of failure that was brought up in the report, discuss what kind of risk it represents and how it was resolved. I’ll deal with that in the next posts in this series.

The most important parts the report are quoted below:

RavenDB deploys cryptography essentially on two different fronts: symmetric
cryptography of all data on disk, and asymmetric cryptography via X.509
certificates as a means of authentication between clients and servers.

All symmetric encryption uses Daniel J. Bernstein’s XChaCha20Poly1305 algorithm, as implemented in libsodium, with a randomized 192-bit nonce.
While opting for XChaCha20 over ChaCha20 means more calls to the RNG
and a computation of HChaCha20, it also means that there is no possibility
of nonce-reuse, which means that it is considerably more resilient than adhoc
designs that might make a best-effort attempt to avoid nonce-reuse,
without ensuring it. Symmetric encryption covers the database main data
store, index definitions, journal, temporary file streams, and secret handling.

Such secret handling uses the Windows APIs for protected data, but only
for a randomly generated encryption key, which is then used as part of the
XChaCha20Poly1305 AEAD, to add a form of authentication. All long-term
symmetric secrets are derived from a master key using the Blake2b hash
function with a usage-specific context identifier.

At setup time, client and server certificates are generated. Clients trust
the server’s self-signed certificate, and the server trusts each client based on a
fingerprint of each client’s certificate. All data is exchanged over TLS, and TLS
version failures for certificate failures are handled gracefully, with a webpage
being shown indicating the failure status, rather than aborting the TLS
handshake. Server certificates are optionally signed by Let’s Encrypt using a
vendor-specific domain name. Certificates are generated using BouncyCastle
and are 4096-bit RSA.

Keys, nonces, and certificate private keys are randomly generated using
the operating system’s CSPRNG, either through libsodium or through
BouncyCastle.

If you aren’t familiar with cryptographic terms, this can be pretty scary. There are lots of terms and names that are thrown around. I want to increase the knowledge of my readers, and after seeing the reactions of the guys internally to the report, I think it would do a lot of good to actually go over a real world report and its mitigations and discuss how we resolved them. Along the way, I’ll attempt to cover many of these cryptographic terms and dechiper (pun intended) their meaning.