Easy Windows and Linux cross-compilers for macOS
Wed, 07 Feb 2018

tl;dr: you can install cross-compiler toolchains to compile C/C++ for Windows or Linux from macOS with these two Homebrew Formulas.

brew install FiloSottile/musl-cross/musl-cross
brew install mingw-w64

Cross-compiling C and C++ is dreadful.

While in Go you just need to set an environment variable, for C you need a whole separate toolchain, which might itself require an intermediate toolchain to build, and you need to know your target platform very well.

musl-cross-make

Thankfully, Rich Felker built a Makefile set to build musl-based cross-compilers, musl-cross-make. It took a few patches, but it runs well on macOS.

musl-cross-make builds beautifully self-contained cross-compilers, so you don't have to worry about pointing to the right libraries path or about where you keep the toolchain. Also, it can target Linux running on a number of different architectures.

Maybe most importantly, it's based on the musl C standard library. This means that the binaries will only run on a musl-based system, like Alpine. However, if you build them as static binaries by passing -static as a LDFLAG they will run anywhere, including in scratch Docker containers. musl is specifically engineered to support fully static binaries, which is not recommended with glibc.

homebrew-musl-cross

Still, I'm a big Homebrew fan. It lets you build software in a well defined sandbox, and only the binaries are linked into your PATH, GNU Stow style. Also, it manages resources and offers powerful dev tools.

So, I wrapped up musl-cross-make in a Homebrew Formula, FiloSottile/homebrew-musl-cross. It takes a long time to build, but it generates a full cross-compiler toolchain, and links into /usr/local/bin just the prefixed binaries, like x86_64-linux-musl-gcc.

brew install FiloSottile/musl-cross/musl-cross

It comes with a precompiled Homebrew Bottle for High Sierra, so if you want to build everything from source use brew install --build-from-source.

Other architectures are supported as well; for example, the formula can build a Raspberry Pi cross-compiler.

Live streaming Cryptopals
Sat, 14 Oct 2017

tl;dr: I'm livecoding the Cryptopals in Go on Twitch, one set every Sunday. The recordings are on YouTube.

This is an experiment, as I've never live-streamed before. I'm planning to solve one set per week, ramping up in difficulty, until Set 8 which will be a proper speed-run. Set 1 aired on October 1st and Set 2 will air on October 15th (tomorrow!) at 2pm ET / 11am PT / 7pm BST.

For context, the Cryptopals, once known as the Matasano Crypto Challenges, are an extraordinary set of exercises that have you build and break cryptosystems with attacks that did and do apply to real-world implementations. I personally owe a lot to them, as they made me discover I liked building and breaking crypto, revealed that what I had was in fact a marketable skill, and for almost a year sat at the top of my CV.

I remember the thrill of playing them early on when unlocking each set required mailing the solutions to the previous one, and the satisfaction in being the first to complete Set 7 in a 30-hour run while at Recurse Center in Fall 2013. It feels only right to redo them now that I'm attending a new batch of Recurse. I really look forward to Set 8, which I haven't played yet.

The recordings will be on YouTube and the code on GitHub. Below you can see the recording of the first set. Instead of donations, I hope to raise money for the Internet Archive, in particular with the Set 8 speed-run; receipts can be sent to the gmail address filippo.donations to be tracked and thanked on-stream on an honour basis.

The scrypt parameters
Wed, 04 Oct 2017

The recommended scrypt parameters in the Go docs were recently brought up for discussion given they haven't changed since 2009.

Even if at this point I memorized the three numbers (N=16384, r=8, p=1) I only have a vague understanding of their meaning, so I took some time to read the scrypt paper.

It's an enjoyable and witty read, even if mathy at times, with lots of future predictions and reality modeling. It really drives home what a fine piece of engineering scrypt is. Also, it's single column with numbered pages, which earns it 100 points in my book.

The definitions are nested, each building on top of the previous one. In this post I summed up how each parameter impacts the whole scrypt algorithm. Finally, I had a look at what parameters you should use in 2017.

𝑟

𝑟 is the second parameter, but we start with it because it's used by the deepest nested function, BlockMix.

BlockMix turns a hash function with 𝑘-bit long inputs and outputs into a hash function with 2𝑟𝑘-bit long inputs and outputs. That is, it makes the core hash function in scrypt 2𝑟 wider.

It does that by iterating the hash function 2𝑟 times, so both memory usage (to store the hash values) and CPU time scale linearly with it. That is, if 𝑟 doubles the resources double.

That's useful because scrypt applies the hash to "random" memory positions. CPUs load memory in fixed-size blocks called cache lines. If the hash block size is smaller than the cache line, all the rest of the loaded line will be wasted memory bandwidth. Also, it dilutes the memory latency cost. Percival predicted both cache line sizes and memory latencies would increase over time, so made the hash size tunable to prevent scrypt from becoming latency-bound.

I have read that 𝑟 tunes memory usage, and believed it meant it is a memory-only work factor. That is incorrect, because both CPU and memory scale with 𝑟. Also, while 𝑟 acts as a work factor, it's unclear whether increasing it provides the same security as 𝑁 (since there is no added randomization in memory accesses, see below), so it shouldn't be used as one.

𝑁

𝑁 is the one and only work factor.

Memory and CPU usage scale linearly with 𝑁. The mixing function, ROMix, stores 𝑁 sequential hash results in RAM, to then load them in a random order and sequentially xor and hash them.

The reason 𝑁 must be a power of two is that to randomly select one of the 𝑁 memory slots at each iteration, scrypt converts the hash output to an integer and reduces it mod 𝑁. If 𝑁 is a power of two, that operation can be optimized into simple (and fast) binary masking.

Estimating scrypt memory usage

scrypt requires 𝑁 times the hash block size memory. Because of BlockMix, the hash block size is 2𝑟 times the underlying hash output size. In scrypt, that hash is the Salsa20 core, which operates on 64-byte blocks.

So the minimum memory requirement of scrypt is:

𝑁 × 2𝑟 × 64 = 128 × 𝑁 × 𝑟 bytes

For 𝑁 = 16384 and 𝑟 = 8 that would be 16 MiB. It scales linearly with 𝑁 and 𝑟, and some implementations or APIs might cause internal copying doubling the requirement.

𝑝

𝑝 is used in the outmost function, MFcrypt. It is a parallelization parameter. 𝑝 instances of the mixing function are run independently and their outputs concatenated as salt for the final PBKDF2.

𝑝 > 1 can be handled in two ways: sequentially, which does not increase memory usage but requires 𝑝 times the CPU and wall-clock time; or in parallel, which requires 𝑝 times the memory and effective CPU time, but does not increase wall-clock time.

So 𝑝 can be used to increase CPU time without affecting memory requirements when handled sequentially, or without affecting wall-clock time when handled in parallel. However, it offers attackers the same opportunity to optimize for processing or memory.

Parameters for 2017

We apply the same methodology of the paper to pick recommended 𝑁 values for interactive logins and file encryption: the biggest power of two that will run in less than 100ms and 5s respectively on "the CPU in the author's laptop" (a 3.1 GHz Intel Core i5).

Curiously enough, the execution time of 𝑁 = 2^20 is exactly the same as in the paper's Table 1, while the sub-100ms value went from 2^14 to 2^15.

Cache line sizes have not significantly increased since 2009, so 8 should still be optimal for 𝑟.

If we really wanted to insist that CPUs have changed in 10 years we could say that more cores are now available, and increase the 𝑝 factor. However, common implementations don't spread the load of 𝑝 and instead compute each instance sequentially. Also, many use cases involve processing multiple parallel requests, so the available cores are not idle. So it seems ok to leave 𝑝 at 1.

Final miscellaneous notes

Since the final output of scrypt is generated by PBKDF2(HMAC‑SHA256, Password, MixingOutput, 1), even if everything else about scrypt were broken, it would still be a secure KDF as long as PBKDF2 with 1 iteration is. (While scrypt uses PBKDF2, it doesn't use it for its work factor.)

Best quote from the paper:

those few organizations which have the resources and inclination to design and fabricate custom circuits for password-cracking tend to be somewhat secretive

Session Tickets, specified in RFC 5077, are a technique to resume TLS sessions by storing key material encrypted on the clients. In TLS 1.2 they speed up the handshake from two round-trips to one.

Unfortunately, a combination of deployment realities and three design flaws makes them the weakest link in modern TLS, potentially turning limited key compromise into passive decryption of large amounts of traffic.

How Session Tickets work

A modern TLS 1.2 connection starts like this:

The client sends the supported parameters;

the server chooses the parameters and sends the certificate along with the first half of the Diffie-Hellman key exchange;

the client sends the second half of the Diffie-Hellman exchange, computes the session keys and switches to encrypted communication;

the server computes the session keys and switches to encrypted communication.

This involves two round-trips between client and server before the connection is ready for application data.

The Diffie-Hellman key exchange is what provides Forward Secrecy: even if the attacker obtains the certificate key and a connection transcript after the connection ended they can't decrypt the data, because they don't have the ephemeral session key.

Forward Secrecy also translates into security against a passive attacker. An attacker that can wiretap but not modify the traffic has the same capabilities of an attacker that obtains a transcript of the connection after it's over. Preventing passive attacks is important because they can be carried out at scale with little risk of detection.

Session Tickets reduce the overhead of the handshake. When a client supports Session Tickets, the server will encrypt the session key with a key only the server has, the Session Ticket Encryption Key (STEK), and send it to the client. The client holds on to that encrypted session key, called a ticket, and to the corresponding session key. The server forgets about the client, allowing stateless deployments.

The next time the client wants to connect to that server it sends the ticket along with the initial parameters. If the server still has the STEK it will decrypt the ticket, extract the session key, and start using it. This establishes a resumed connection and saves a round-trip by skipping the key negotiation. Otherwise, client and server fall back to a normal handshake.

Fatal flaw #1

The first problem with 1.2 Session Tickets is that resumed connections don't perform any Diffie-Hellman exchange, so they don't offer Forward Secrecy against the compromise of the STEK. That is, an attacker that obtains a transcript of a resumed connection and the STEK can decrypt the whole conversation.

The specification solves this by stating that STEKs must be rotated and destroyed periodically. I now believe this to be extremely unrealistic.

Session Tickets were expressly designed for stateless server deployments, implying scenarios where there are multiple servers serving the same site without shared state. These servers must also share STEKs or resumption wouldn't work across them.

As soon as a key requires distribution it's exposed to an array of possible attacks that an ephemeral key in memory doesn't face. It has to be generated somewhere, and transmitted somehow between the machines, and that transmission might be recorded or persisted. Twitter wrote about how they faced and approached exactly this problem.

Moreover, an attacker that compromises a single machine can now decrypt traffic flowing through other machines, potentially violating security assumptions.

Finally, if a key is not properly rotated it allows an attacker to decrypt past traffic upon compromise.

TLS 1.3 solves this by supporting Diffie-Hellman along with Session Tickets, but TLS 1.2 was not yet structured to support one round trip Diffie-Hellman (because of the legacy static RSA structure).

Fatal flaw #2

Session Tickets contain the session keys of the original connection, so a compromised Session Ticket lets the attacker decrypt not only the resumed connection, but also the original connection.

This potentially degrades the Forward Secrecy of non-resumed connections, too.

The problem is exacerbated when a session is regularly resumed, and the same session keys keep getting re-wrapped into new Session Tickets (a resumed connection can in turn generate a Session Ticket), possibly with different STEKs over time. The same session key can stay in use for weeks or even months, weakening Forward Secrecy.

TLS 1.3 addresses this by effectively hashing (a one-way function) the current keys to obtain the keys for the resumed connection. While hashing is a pretty obvious solution, in TLS 1.2 there was no structured key schedule, so there was no easy agnostic way to specify how keys should be derived for each different cipher suite.

Fatal flaw #3

The NewSessionTicket message containing the Session Ticket is sent from the server to the client just before the ChangeCipherSpec message.

The ChangeCipherSpec message enables encryption with the session keys and the negotiated cipher, so everything exchanged during the handshake before that message is sent in plaintext.

This means that Session Tickets are sent in the clear at the beginning of the original connection.

(╯°□°)╯︵ ┻━┻

An attacker with the STEK doesn't need to wait until session resumption is attempted. Session Tickets containing the current session keys are sent at the beginning of every connection that merely supports Session Tickets. In plaintext on the wire, ready to be decrypted with the STEK, fully bypassing Diffie-Hellman.

TLS 1.3 solves this by... not sending them in plaintext. There is no strong reason I can find for why TLS 1.2 wouldn't wait until after the ChangeCipherSpec to send the NewSessionTicket; the two messages are sent back to back in the same flight. Someone suggested it might be to avoid complicating implementations that do not expect encrypted handshake messages (except Finished).

1 + 2 + 3 = dragnet surveillance

The unfortunate combination of these three well known flaws is that an attacker that obtains the Session Ticket Encryption Key can passively decrypt all connections that support Session Tickets, resumed and not.

It's grimly similar to a key escrow system: just before switching to encrypted communication, the session keys are sent on the wire encrypted with a (somewhat) fixed key.

Passive attacks are the enablers of dragnet surveillance, what HTTPS aims to prevent, and the same actors that are known to engage in dragnet surveillance have specialized in surgical key extraction attacks.

There is no proof that these attacks are currently performed and the aim of this post is not to spread FUD about TLS, which is still the most impactful security measure on the Internet today despite all its defects. However, war-gaming the most effective attacks is a valuable exercise to ensure we focus on improving the important parts, and Session Tickets are often the single weakest link in TLS, far ahead of the CA system that receives so much more attention.

Session Tickets in the real world

The likelihood and impact of the described attacks change depending on how Session Tickets are deployed.

In some cases, the same STEK can be used across national borders, putting it under multiple jurisdictional threats. A single compromised machine then enables an attacker to decrypt traffic passively across the whole world by simply exfiltrating a short key every rotation period.

Mitigating this by using different STEKs across geographical locations involves a trade-off, since it disables session resumption for clients roaming across them. It does however increase the cost for what appears to be the easiest dragnet surveillance avenue at this time, which is always a good result.

Attack surface. Even if it worked reliably, you wouldn't want to use the OS automatic captive portal browser for security reasons. Since it can be triggered by a network attacker without user interaction, it's the perfect target. It has had vulnerabilities before, and there is no information about whether it's up to date and sandboxed. There's also no option to disable JavaScript or install security extensions.

DNS. I want to pick my own DNS server, and more specifically run my own unbound, for a number of reasons: clean results, local zones, overrides, DNSSEC... However most captive portals will block UDP traffic to anything except their DNS resolver (or would be trivially bypassed). So every time, getting past a captive portal involves opening Network Settings, removing the custom DNS, logging in, and hopefully (that is, rarely) remembering to put the custom DNS server back in.

HTTP. Finally, since a captive portal literally relies on a MitM attack, it becomes inaccessible when using HTTPS Everywhere in "Block all unencrypted requests" mode.

A dedicated Chrome captive browser

To scratch this itch I decided to make my own captive portal browser based on Chrome, such that it can be secure and configured as I please.

The main challenge is reaching the DHCP-provided captive portal DNS resolver without changing system settings. Chrome lacks the ability to configure DNS upstreams, but supports SOCKS5 which proxies name resolution.

With 100 lines of Go I built a small SOCKS5 proxy based on github.com/armon/go-socks5 that handles name resolution via a custom net.Resolver that always dials a fixed IP for the DNS server.

It automatically discovers the DHCP DNS server on macOS with this command:

ipconfig getoption en0 domain_name_server

Finally, it starts a Chrome instance configured to use the SOCKS5 proxy and waits for it to quit.

The 1600 LoC kTLS patch, merged in Linux 4.13, allows userspace to pass the kernel the encryption keys for an established connection, making encryption happen transparently inside the kernel.

The only cipher suite supported is AES-128-GCM as per RFC 5288, meaning it only supports TLS version 1.2. Most modern TLS connections on the Internet use that.

The kernel only handles the record layer, that is, it only takes care of encrypting packets. Handshake, key exchange, certificate handling, alerts and renegotiation are left out of kernelspace. The userspace application, like OpenSSL, will do all that and then delegate to the kernel once the keys are established.

Moreover, only encryption is supported, not decryption. This wasn't clear to me until I failed to find the TLS_RX constant.

These limitations are very good to contain complexity and attack surface, but they mean that kTLS won't replace any userspace complexity as you still need a TLS library to do the handshake, for all other cipher suites, and for the receiving side of the connection. That makes kTLS purely a performance feature.

Keys are passed to the kernel with a setsockopt(2) call on the TCP socket. Once that call is made everything written to that socket is transparently encrypted. A TLS record is made for each send(2) call unless a flag is used to request buffering. Alerts and any other messages which are not application data are sent via CMSG.

The main motivation seems to be to allow use of sendfile(2) on TLS connections. sendfile(2) allows data to be transferred from a file descriptor (like a file) to another (like a TCP connection) without paying the price of a copy round-trip through user space. If the kernel is handling TLS, sendfile(2) can be used also for encrypted connections. The original paper by Facebook claims a significant improvement in 99th percentile performance.

I also saw a mention of using BPF filtering rules on plaintext data, which is clever.

It seems all very sensibly executed, but I can't help being terrified nonetheless. First because even just the record layer of TLS 1.2 with AEAD has significant legacy baggage and attack surface, and secondly because any compatibility issue introduced by this code will add a dimension to the compatibility matrix.

Moreover, the end goal seems to be to do TLS offloading on dedicated hardware managed by kernel drivers, and poorly-implemented hard-to-update hardware is exactly why TLS 1.3 hasn't been deployed yet.

So yeah, not a huge fan. But of course, it wouldn't be me if I didn't fork the Go crypto/tls package to work with it.

To test kTLS I'll need a Linux VM running Linux 4.13. My favorite Linux distribution (Alpine) is not that cutting-edge with kernel versions, but one can of course trust Arch to have an up-to-date linux-mainline package. 3 painful hours of learning Arch and compiling Linux from AUR follow...

I tried toying with mkerrors.sh to get the TLS constants into the syscall package, but eventually gave up and redefined them in crypto/tls. There is a CL for golang.org/x/sys/unix to update to Linux 4.13, but it's compiled without kTLS.

The easiest place to hook kTLS into crypto/tls seemed to be (*halfConn).changeCipherSpec(). That's where encryption is enabled for that half (receiving or sending) of the connection. However, that happens before sending the Finished handshake message, which we would then have to send with a CMSG.

Instead, I added a hook at the end of both client and server handshake. The new function (*Conn).enableApplicationDataEncryption() checks that the cipher is a *fixedNonceAEAD (i.e. AES-GCM) with the right key length, and that the connection is a *net.TCPConn, and then invokes kTLSEnable() with all the key material.

kTLSEnable constructs the tls_crypto_info structure as per the docs, hoping Go won't add padding for alignment purposes. It then uses syscall.RawConn.Control and syscall.SetsockoptString to make the setsockopt syscalls to pass it to the kernel. There's a lot of unsafe involved, but no need for cgo (I just redefine the types manually).

For a moment I thought it would be enough to then switch the cipher internally to unencrypted, but while it's not documented, for sendfile(2) to work the data must be sent with no framing at all. Even unencrypted TLS packets are framed. So as a last hack I added a dummy kTLSCipher that strips the record header instead of encrypting the record before sending it on the wire.

I ran a simple HTTPS web server with net/http, loaded a page on Chrome, and instead of causing a kernel panic...

Restic cryptography
Tue, 29 Aug 2017

tl;dr: this is not an audit nor an endorsement and I take no responsibility, but I had a quick look at the crypto and I think I'm going to use restic for my personal backups.

I keep hearing good things about restic. I am redoing my storage solution, and restic seems to tick all the boxes for my personal backups:

Open Source

written in Go

runs on OpenBSD

B2 backend

good docs

verifiable backups

reversible format

encryption

But it does not look like the encryption has ever been audited.

Today I have to wait a couple hours to get a passport (I'm Italian, this involved rolling dice for Charm Person) so I figured I would have a look at it.

Important: this does NOT qualify as a professional audit, nor am I endorsing restic's encryption beyond "I looked at it in a noisy waiting room for an hour, I guess". I am available to review, design or implement cryptosystems, but this is not that.

This post also does not attempt to fully explain all the cryptography it mentions, so if you find something particularly curious, confusing, or fascinating do let me know and I'll try to write properly about it.

Data model

restic backs up to repositories. A repository is stored at a given remote. There's deduplication across the repository.

Repository contents are content-addressed by SHA-256 at the encrypted file level, not at the backed up file level. That's good not to leak hashes of files, but I wonder how deduplication works.

Everything else is abstractions built out of JSON inside those files. This is very reminiscent of Camlistore. I wonder if there are caches/indexes or if it knows how to walk the tree from scratch. (Turns out, both! See below.)

Threat model

It speaks more about implementation than adversaries, but now I'm asking too much.

Essentially: confidentiality and authentication for stored files and metadata against an adversarial repository. Rollbacks and snapshot correlation are possible. Checks out.

I imagine file/backup sizes are not protected, but it's not mentioned.

Encryption

Every in-repo file (the ones of which the SHA-256 is taken) is encrypted with AES-256-CTR-Poly1305-AES. The format is "16-byte random IV + ciphertext + MAC": encrypt-then-MAC, good.

Unusual choice, not using AES-GCM. To be clear: GCM is awful, but for TLS reasons it's the AEAD with the fastest implementations. Case in point, in Go AES-CTR is much slower with no good reason. Also unusual is not using the TLS pairing ChaCha20-Poly1305. Choosing AES makes sense to use hardware acceleration, but then in practice ChaCha20-Poly1305 might be faster than AES-CTR for the reason above. All in all, it's a rational choice in theory, but unusual and a bit sub-optimal in practice. While still secure, a "self-rolled" AEAD is a bit of a yellow flag (but not a red flag).

I admit I haven't studied Poly1305-AES before, but it's the original design of Poly1305. The Poly1305 authenticator as I know it and as implemented by Go uses one-time keys; Poly1305-AES uses a fixed-key AES operation to encrypt a nonce to make the one-time keys. Poly1305 keys are 32 bytes, used in two halves as r (which is fixed in Poly1305-AES and gets masked) and s (which is the only part derived with AES in Poly1305-AES). It might be important to check that this Poly1305-AES implementation is keeping the right half fixed, since golang.org/x/crypto/poly1305 simply takes an undocumented *[32]byte "one-time key", and I suspect swapping the derived and fixed halves could be catastrophic. The Go package does not document which half is which.

Good news is that even if the paper says...

There are safe ways to reuse k for encryption, but those ways are not analyzed in this paper.

... restic took the high road and uses separate keys for encryption and Poly1305-AES.

Password to key derivation is classic double-wrapping: plaintext key metadata files (which include hostname/username) contain scrypt parameters, salt and a data blob. scrypt with the supplied password is used as a KDF to generate the keys to decrypt (AES256-Poly1305-AES) the blob, which contains the master key. An attacker can prevent password revocation, so changing a password after it got compromised doesn't protect future backups to that repository. This could be improved by making a new master key for subsequently generated blobs.

I wonder if it's possible, manipulating N/r/p/salt, to make an unknown password generate a predictable key. If it is, an attacker can force a client to make a backup with keys they control. It probably isn't, but I don't have time to figure out which scrypt property it boils down to. (Exercise for the reader!)

Deduplication

Data and tree blobs are encrypted individually and packed together in pack files with an encrypted header. A full index of all plaintext blobs is kept cached by the client, and encrypted in the repository.

The data from each file is split into variable-length Blobs cut at offsets defined by a sliding window of 64 bytes. The implementation uses Rabin Fingerprints for implementing this Content Defined Chunking (CDC). An irreducible polynomial is selected at random and saved in the file config when a repository is initialized, so that watermark attacks are much harder.

Hmm. So this is how deduplication happens. Like Camlistore.

Not a fan of the sentence "attacks are much harder". I know little about Rabin Fingerprints, but I can imagine the attack relies on leaks by the chunker algorithm through blob sizes. And packs don't help because I bet you can spot the Poly1305 authenticators in them, allowing you to split the blobs up without reading the header.

I'll add to the TODO to learn more about CDC. (There's a post about them on the restic blog!) In the meantime I'll trust this irreducible polynomial to make leaks not too obvious, and remember not to back up potentially attacker-supplied data in a reliable manner.

That's important also because attacker-supplied data can lead to straightforward fingerprinting attacks on any kind of deduplicated system. (This should have been in the threat model.)

Implementation

I started from github.com/restic/restic/internal/crypto, at commit 3559f9c7760ffadd32888c531d0d08f0c1aa98e3.

Random bytes are generated with crypto/rand with panic() on error. Good.

scrypt parameters are checked with (*github.com/elithrar/simple-scrypt.Params).Check which is reassuring since they are attacker controlled.

There are explicit Key/EncryptionKey/MACKey structures with separate K and R for Poly1305-AES, which is good. The amount of indirection in filling them with random data makes me uncomfortable—one missing * and pass-by-value would result in an empty key—but it seems implemented right. Also, hey, that's what (*Key).Valid() is checking for, neat!

The Poly1305-AES implementation seems to put r and AES(k)(s) in the right order for golang.org/x/crypto/poly1305 (see above), with r in the first 16 bytes of the key.

However, the application code insists on masking r as the paper says before calling poly1305, even if that package does not require it. And that got me worried, because the masking done by poly1305 looks different. What if poly1305 decided to use a different representation (since it can, since the key parts and format are unspecified) and now restic is masking off meaningful bits? 😱

And now I'm reading 130-bit arithmetic mapped to 26-bit integer registers to figure out what DJB's magic mask actually does to the numbers (déjà vu... cough, clamping, cough). This is what happens when you roll your own crypto. You make cryptographers hurt. Think about your cryptographer.

Turns out that it's the same masking in 26-bit little-endian windows. Here's the ASCII art sketch I had to make to figure it out. (It compares the paper masks, above, and the poly1305 ones, below, after reversing the bit shifts and swapping endianness).

Anyway applying the mask is pointless, dangerous, and took 45+ minutes to audit. I'm going to submit a PR to remove it when I recover.

I simplified the rest of Encrypt/Decrypt while reviewing it. The Decrypt API (as opposed to the Encrypt one) does not return a new slice, but only an int for the caller to slice the plaintext with. That felt like an easy and bad thing to forget, so I inspected the callers with Sourcegraph. Found no issues.

(FWIW, I'm a fan of append()-like interfaces, where you can still reuse buffers by slicing to [:0] but can't forget to reassign the resulting slice.)

There seems to be a problem with the Decrypt overlap rules, though. To save memory you might want to use the same buffer for plaintext and ciphertext, which is allowed with some constraints: the cipher.Stream interface implemented by CTR mode spells them out in the documentation of its XORKeyStream method.

However, calling Decrypt with the same buffer for plaintext and ciphertext (as done throughout the code and allowed by the docs) means that buf[16:] is decrypted into buf[:] because of the IV. That's theoretically(?) not ok.

I opened an issue about it, but I don't think it's a problem right now (though it might become one as the CTR implementation is optimized). I suggested adopting the standard cipher.AEAD interface, which is append-style and makes exact overlap easy to get right.

I finally had a quick look at internal/repository/key.go. I made sure scrypt is the only KDF, and clarified that the data field in a key object is just encrypted like any other blob. The design docs had the MAC size wrong and got me confused.

Conclusion

The design might not be perfect, but it's good. Encryption is a first-class feature, the implementation looks sane and I guess the deduplication trade-off is worth it.

Go has good support for calling into assembly, and a lot of the fast cryptographic code in the stdlib is carefully optimized assembly, bringing speedups of over 20 times.

However, writing assembly code is hard, reviewing it is possibly harder, and cryptography is unforgiving. Wouldn't it be nice if we could write these hot functions in a higher level language?

This post is the story of a slightly-less-than-sane experiment to call Rust code from Go fast enough to replace assembly. No need to know Rust, or compiler internals, but knowing what a linker is would help.

Why Rust

I'll be upfront: I don't know Rust, and don't feel compelled to do my day-to-day programming in it. However, I know Rust is a very tweakable and optimizable language, while still more readable than assembly. (After all, everything is more readable than assembly!)

Go strives to find defaults that are good for its core use cases, and only accepts features that are fast enough to be enabled by default, in a constant and successful fight against knobs. I love it for that. But for what we are doing today we need a language that won't flinch when asked to generate stack-only functions with manually hinted away safety checks.

So if there's a language that we might be able to constrain enough to behave like assembly, and to optimize enough to be as useful as assembly, it might be Rust.

Finally, Rust is safe, actively developed, and not least, there's already a good ecosystem of high-performance Rust cryptography code to tap into.

By using the C ABI as lingua franca of FFIs, we can call anything from anything: Rust can compile into a library exposing the C ABI, and cgo can use that. It's awkward, but it works.

We can even use reverse-cgo to build Go into a C library and call it from random languages, like I did with Python as a stunt. (It was a stunt folks, stop taking me seriously.)

But cgo does a lot of things to enable that bit of Go naturalness it provides: it will set up a whole stack for C to live in, it makes defer calls to prepare for a panic in a Go callback... this could be a whole post of its own.

As a result, the performance cost of each cgo call is way too high for the use case we are thinking about—small hot functions.

Linking it together

So here's the idea: if we have Rust code that is as constrained as assembly, we should be able to use it just like assembly, and call straight into it. Maybe with a thin layer of glue.

We don't have to work at the IR level: since Go 1.3, the Go compiler converts both Go code and its high-level assembly into machine code before linking.

This is confirmed by the existence of "external linking", where the system linker is used to put together a Go program. It's how cgo works, too: it compiles C with the C compiler, Go with the Go compiler, and links it all together with clang or gcc. We can even pass flags to the linker with CGO_LDFLAGS.

After all, underneath all the safety features of cgo there must be a plain cross-language function call.

It would be nice if we could figure out how to do this without patching the compiler, though. First, let's figure out how to link a Go program with a Rust archive.

Thankfully go/build is nothing but a frontend! Go offers a set of low level tools to compile and link programs, go build just collects files and invokes those tools. We can follow what it does by using the -x flag.

I built this small Makefile by following a -x -ldflags "-v -linkmode=external '-extldflags=-v'" invocation of a cgo build.

That looks like an interesting pragma! //go:linkname just creates a symbol alias in the local scope (which can be used to call private functions!), and I'm pretty sure the byte trick is only cleverness to have something to take the address of, but //go:cgo_import_static... this imports an external symbol!

Armed with this new tool and the Makefile above, we have a chance to invoke this Rust function (hello.rs)

Well, it crashes when it tries to return. Also that $2048 value is the whole stack size Rust is allowed (if it's even putting the stack in the right place), and don't ask me what happens if Rust tries to touch a heap... but hell, I'm surprised it works at all!

Calling conventions

Now, to make it return cleanly, and take some arguments, we need to look more closely at the Go and Rust calling conventions. A calling convention defines where arguments and return values sit across function calls.

The Go calling convention is described here and here. For Rust we'll look at the default for FFI, which is the standard C calling convention.

The caller, seen above, does very little: it places the arguments on the stack in reverse order, at the bottom of its own frame (rsp to 16(rsp); remember that the stack grows down) and executes CALL. The CALL pushes the return pointer to the stack and jumps. There's no argument cleanup by the caller; the callee just returns with a plain RET.

Then there's the rsp management, which subtracts 0x108, making space for the entire 0x100 bytes of frame and the 8 bytes of frame pointer in one go. So rsp points to the bottom (the end) of the function frame, and is callee managed. Before returning, rsp is restored to where it was (just past the return pointer).

Last, the frame pointer: it is effectively pushed to the stack just after the return pointer, with rbp updated to point at it. So rbp is also callee saved, and should be updated to point at where the caller's rbp is stored, to enable stack trace unrolling.

Finally, from the body itself we learn that return values go just above the arguments.

Virtual registers

The Go docs say that SP and FP are virtual registers, not just aliases of rsp and rbp.

Indeed, when accessing SP from Go assembly, the offsets are adjusted relative to the real rsp so that SP points to the top, not the bottom, of the frame. That's convenient because it means not having to change all offsets when changing the frame size, but it's just syntactic sugar. Naked access to the register (like MOVQ SP, DX) accesses rsp directly.

The FP virtual register is simply an adjusted offset over rsp, too. It points to the bottom of the caller frame, where arguments are, and there's no direct access.

Note: Go maintains rbp and frame pointers to help debugging, but then uses a fixed rsp and omit-frame-pointer-style rsp offsets for the virtual FP. You can learn more about frame pointers and not using them from this Adam Langley blog post.

We care little about this, since in Go all registers are caller-saved.

The stack must be aligned to 16 bytes.

(I think this is why JMP worked and CALL didn't: we failed to align the stack!)

Frame pointers work the same way (and are generated by rustc with -g).

Gluing them together

Building a simple trampoline between the two conventions won't be hard. We can also look at asmcgocall for inspiration, since it does approximately the same job, but for cgo.

We need to remember that we want the Rust function to use the stack space of our assembly function, since Go ensured for us that it's present. To do that, we have to roll back rsp from the end of the stack.

CALL on macOS

CALL didn't quite work on macOS. For some reason, there the function call was replaced with an intermediate call to _cgo_thread_start, which is not that incredible considering we are using something called cgo_import_static and that CALL is virtual in Go assembly.

callq 0x40a27cd ; x_cgo_thread_start + 29

We can bypass that "helper" by using the full //go:linkname incantation we found in the standard library to take a pointer to the function, and then calling the function pointer, like this.

Is it fast?

The point of this whole exercise is to be able to call Rust instead of assembly for cryptographic operations (and to have fun). So a rustgo call will have to be almost as fast as an assembly call to be useful.

Benchmark time!

We'll compare incrementing a uint64 inline, with a //go:noinline function, with the rustgo call above, and with a cgo call to the exact same Rust function.

Rust was compiled with -g -O, and the benchmarks were run on macOS on a 2.9GHz Intel Core i5.

To build the .a we use cargo build --release with a Cargo.toml that defines the dependencies, enables frame pointers, and configures curve25519-dalek to use its most efficient math and no standard library.

Packaging up

Now we know it actually works, that's exciting! But to be usable it will have to be an importable package, not forced into package main by a weird build process.

This is where //go:binary-only-package comes in! That annotation allows us to tell the compiler to ignore the source of the package, and to only use the pre-built .a library file in $GOPATH/pkg.

If we can manage to build a .a file that works with Go's native linker (cmd/link, referred to also as the internal linker), we can redistribute that and it will let our users import the package as if it was a native one, including cross-compiling (provided we included a .a for that platform)!

The Go side is easy, and pairs with the assembly and Rust we already have. We can even include docs for go doc's benefit.

//go:binary-only-package

// Package edwards25519 implements operations on an Edwards curve that is
// isomorphic to curve25519.
//
// Crypto operations are implemented by calling directly into the Rust
// library curve25519-dalek, without cgo.
//
// You should not actually be using this.
package edwards25519

import _ "unsafe"

//go:cgo_import_static scalar_base_mult
//go:linkname scalar_base_mult scalar_base_mult
var scalar_base_mult uintptr
var _scalar_base_mult = &scalar_base_mult

// ScalarBaseMult multiplies the scalar in by the curve basepoint, and writes
// the compressed Edwards representation of the resulting point to dst.
func ScalarBaseMult(dst, in *[32]byte)

The Makefile will have to change quite a bit—since we aren't building a binary anymore we don't get to keep using go tool link.

A .a archive is just a pack of .o object files in an ancient format with a symbol table. If we could get the symbols from the Rust libed25519_dalek_rustgo.a library into the edwards25519.a archive that go tool compile made, we should be golden.

.a archives are managed by the ar UNIX tool, or by its Go internal counterpart, cmd/pack (as in go tool pack). The two formats are ever-so-subtly different, of course. We'll need to use the platform ar for libed25519_dalek_rustgo.a and the Go cmd/pack for edwards25519.a.

(For example, the platform ar on my macOS uses the BSD convention of calling files #1/LEN and then embedding the filename of length LEN at the beginning of the file, to exceed the 16 bytes max file length. That was confusing.)

To bundle the two libraries I tried doing the simplest (read: hackish) thing: extract libed25519_dalek_rustgo.a into a temporary folder, and then pack the objects back into edwards25519.a.

Well, it almost worked. We cheated. The binary would not link unless we pointed it at libresolv. To be fair, the Rust compiler tried to tell us. (But who listens to everything the Rust compiler tells you anyway?)

note: link against the following native artifacts when linking against this static library
note: the order and any duplication can be significant on some platforms, and so may need to be preserved
note: library: System
note: library: resolv
note: library: c
note: library: m

Now, linking against system libraries would be a problem, because it would never work with internal linking and cross-compilation...

But hold on a minute, libresolv?! Why does our no_std, "should be like assembly", stack only Rust library want to resolve DNS names?

I really meant no_std

The problem is that the library is not actually no_std. Look at all that stuff in there! We want nothing to do with allocators!

So how do we actually make it no_std? This turned out to be an entire side-quest, but I'll give you a recap.

If any dependency is not no_std, your no_std flag is nullified. One of the curve25519-dalek dependencies had this problem; cargo update fixed that.

Actually making a no_std staticlib (that is, a library for external use, as opposed to for inclusion in a Rust program) is more like making a no_std executable, which is much harder as it must be self-contained.

A friend thankfully suggested making sure that I was using --gc-sections to strip dead code, which might reference things I don't actually need. And sure enough, this worked. (That's three layers of flag-passing right there.)

But umh, in the Makefile we aren't using a linker at all, so where do we put --gc-sections? The answer is to stop hacking .as together and actually read the linker man page.

We can build a .o containing a given symbol and all the symbols it references with ld -r --gc-sections -u $SYMBOL. -r makes the object reusable for a later link, and -u marks a symbol as needed, or everything would end up garbage collected. $SYMBOL is scalar_base_mult in our case.

Why wasn't this a problem on macOS? It would have been if we linked manually, but the macOS compiler apparently does dead symbol stripping by default.

The last missing piece is internal linking on Linux. In short, it was not linking the Rust code, even though compilation seemed to succeed: the relocations were not happening, and the CALL instructions in our Rust function were left pointing at meaningless addresses.

At that point I felt like it had to be a silent linker bug, the final boss in implementing rustgo, and reached out to people much smarter than me. One of them was guiding me in debugging cmd/link (which was fascinating!) when Ian Lance Taylor, the author of cgo, helpfully pointed out that //go:cgo_import_static is not enough for internal linking, and that I also wanted //go:cgo_import_dynamic.

I still have no idea why leaving it out would result in that issue, but adding it finally made our rustgo package compile both with external and internal linking, on Linux and macOS, out of the box.

Redistributable

Now that we can build a .a, we can take the suggestion in the //go:binary-only-package spec, and build a tarball with .as for linux_amd64/darwin_amd64 and the package source, to untar into a GOPATH to install.

Once installed like that, the package will be usable just like a native one, cross-compilation included (as long as we packaged a .a for the target)!

The only thing we have to worry about is that if we build Rust with -Ctarget-cpu=native it might not run on older CPUs. Thankfully benchmarks (and the curve25519-dalek authors) tell us that the only real difference is between post and pre-Haswell processors, so we only have to make a universal build and a Haswell one.

As the cherry on top, I made the Makefile obey GOOS/GOARCH, converting them as needed into Rust target triples, so if you have Rust set up for cross-compilation you can even cross-compile the .a itself.

Turning it into a real thing

Well, this was fun.

But to be clear, rustgo is not a real thing that you should use in production. For example, I suspect I should be saving g before the jump, the stack size is completely arbitrary, and shrinking the trampoline frame like that will probably confuse the hell out of debuggers. Also, a panic in Rust might get weird.

To make it a real thing I'd start by calling morestack manually from a NOSPLIT assembly function to ensure we have enough goroutine stack space (instead of rolling back rsp) with a size obtained maybe from static analysis of the Rust function (instead of, well, made up).

It could all be analyzed, generated and built by some "rustgo" tool, instead of hardcoded in Makefiles and assembly files. cgo itself is little more than a code-generation tool after all. It might make sense as a go:generate thing, but I know someone who wants to make it a cargo command. (Finally some Rust-vs-Go fighting!) Also, a Rust-side collection of FFI types like, say, GoSlice would be nice.

#[repr(C)]
struct GoSlice {
    array: *mut u8,
    len: i64, // Go's int is 64 bits on amd64
    cap: i64,
}

Or maybe a Go or Rust adult will come and tell us to stop before we get hurt.

EDIT: It was pointed out to me that if we simply named the Rust object file libed25519_dalek_rustgo.syso, we could skip all the go tool invocations and simply use go build which automatically links .syso files found in the package. But what's the fun in that.

Thanks (in no particular order) to David, Ian, Henry, Isis, Manish, Zaki, Anna, George, Kaylyn, Bill, David, Jess, Tony and Daniel for making this possible. Don't blame them for the mistakes and horrors, those are mine.

P.S. Before anyone tries to compare this to cgo (which has many more safety features) or pure Go, it's not meant to replace either. It's meant to replace manually written assembly with something much safer and more readable, with comparable performance. Or better yet, it was meant to be a fun experiment.

https://blog.filippo.io/cleaning-up-my-gopath-with-homebrew/ (Sat, 12 Aug 2017)

tl;dr: use the script at the bottom to go get into the Homebrew "Cellar" and keep your GOPATH clean.

I personally like GOPATH and import paths, but while trying to reduce my laptop to a thin reproducible client, I felt the pain of keeping track of the hundreds of repositories that end up in there.

The problem is that there are too many reasons for things to be in there:

code you're actively working on

dependencies of that code

tools you installed and their dependencies

tools your editor installed and their dependencies

1 belongs there. So much so that after cleaning it up I set my GOPATH to $HOME and got rid of my ~/code folder. All my code--whatever the language--is now in ~/src, with unpublished code in ~/src/filippo.io.

2 doesn't, and should just go directly into the vendor folder. That's why I made gvt back then, and I'm happy to see that dep will do the same, if I understand correctly.

4 can be fixed, for example for Visual Studio Code:

"go.toolsGopath": "~/.vscode/gopath"

That leaves only 3. The repositories that show up because you want to install some nice Go tool.

The fact is, go get was never meant to be a software distribution tool. Homebrew is. Homebrew also comes with hashes for everything, so when a tool is available from Homebrew I now prefer installing it from there. But sometimes it isn't.

(I know I could have made a Homebrew command instead, but Ruby. If you feel like making one, it would be super cool if it could also handle upgrades automatically!)

https://blog.filippo.io/reproducing-go-binaries-byte-by-byte/ (Sun, 23 Apr 2017)

Fully reproducible builds are important because they bridge the gap between auditable open source and convenient binary artifacts. Technologies like TUF and Binary Transparency provide accountability for what binaries are shipped to users, but that's of limited utility if there is no way (short of reverse engineering) of proving that the binary is in fact the result of compiling the intended source.

That's why the Debian project is putting tremendous effort into making packages reproducible. The good news is that Go builds are reproducible by default.

Prerequisites

There are a few common sense requirements.

Of course, the builds must be reproducible in the weaker sense: that means the source code must match perfectly.

This includes dependencies, so the project has to vendor them strictly. This is important beyond binary reproducibility: you don't want "version 1.3" of a piece of software to mean different things based on when it was built.

The compiler version must be the same.

GOPATH and GOROOT must match (#16860), annoyingly, as they are all over the binary in debug file paths.

Note: the default GOROOT, the one that the compiler will use if the environment variable is not set, must also match, since it will be copied into binaries (#17943). You can only change that by recompiling the toolchain in the right directory.

Detecting parameters

To start, we need to figure out the GOPATH and GOROOT values the binary was built with. This is easy using debug/gosym and the debug information to query the file paths of known functions. (PE support is... left as an exercise to the reader.)

Debugging

The first thing to look at is the Build ID. The Build ID is a hash of the filenames of the compiled files, plus the version of the compiler (and other things in zversion.go, like the default GOROOT). See pkg.go.

What got me with rclone was not rebuilding the compiler in the new location to get the right default GOROOT—the make.bash step of the Dockerfile. If you enjoy debugging, here's the tootstorm on Mastodon.

FileVault 2 is the full-disk encryption system of macOS. Normally, it's turned on from System Preferences, and locks the disk with the passwords of all the users allowed to unlock the machine.

Overloading the login/unlock/sudo password is an understandable UX simplicity choice, but it makes the security tradeoff very hard to manage: you want an easy-to-type password for login (which can't be bruteforced offline), but a long, complex passphrase for FDE.

There is no documented way of setting different passwords for the disk encryption and the OS user. However, support for it is present in the firmware.

How FileVault 2 works and boots

macOS can do a lot of things before booting the main system. If you boot pressing ⌘-R, for example, it will boot into a recovery mode capable of reinstalling the system.

It does that with an EFI firmware and the support of a couple of hidden partitions, one of them called Recovery HD. This is the system that decrypts the main partition and then boots from it when you have FDE enabled.

So the default FileVault 2 FDE setup involves an unencrypted hidden Recovery HD, and an encrypted container partition with your actual partitions inside. The FileVault 2 encryption is controlled by the resident OS and unlockable by a set of username/password accounts.

FileVault 2 can also be used to encrypt external drives. In that case there is no Recovery HD, and there is a single encrypted partition, which can be unlocked not by username/password pairs, but by plain disk passwords.

Setting a custom passphrase

What we want is a mix of the external drive encryption with its custom passphrase and the Recovery HD boot process. Getting there is not easy, but once we do the firmware just happily asks for our "Disk Password", unlocks the disk, and continues booting.

Here are the two easiest ways I found.

Fresh install

If you are installing a new machine and don't care about wiping the entire thing, it's fairly easy.

First, boot into recovery mode by pressing ⌘-R while starting the machine, and select Disk Utility.

Then, select the Macintosh HD partition (not the whole drive, you don't want to kill the Recovery HD and make the installer shrink your partition to make a new one) and click Erase. Choose "Mac OS Extended (Journaled, Encrypted)" and select your FDE passphrase.

Existing unencrypted macOS

This is a little trickier, and I wouldn't do this without a backup.

You first need to boot from an external drive. The easiest way to do this is by using Carbon Copy Cloner to make a bootable drive. CCC is also excellent to manage the Recovery HD partition if you end up nuking it.

Boot pressing option to select the boot disk.

Once booted into an external drive, open Finder, right click on the Macintosh HD disk in the sidebar and select Encrypt.

Don't forget to securely wipe the external drive.

Existing encrypted macOS

This is the hardest. (Really make a backup.) You have two options: either unencrypt and follow the instructions above, or wipe the drive and use Carbon Copy Cloner.

If you choose the latter, then:

use CCC to make a bootable copy of your system on an external drive

boot into the external drive

use Disk Utility to Erase the target drive, making a single "Mac OS Extended (Journaled)" partition

https://blog.filippo.io/finding-ticketbleed/ (Thu, 09 Feb 2017)

Ticketbleed (CVE-2016-9244) is a software vulnerability in the TLS stack of certain F5 products that allows a remote attacker to extract up to 31 bytes of uninitialized memory at a time, which can contain any kind of random sensitive information, like in Heartbleed.

If you suspect you might be affected by this vulnerability, you can find details and mitigation instructions at ticketbleed.com (including an online test) or in the F5 K05121675 article.

In this post we'll talk about how Ticketbleed was found, verified and reported.

JIRA RG-XXX

It all started with a bug report from a customer using Cloudflare Railgun.

Matthew was unable to replicate by using a basic TLS.Dial in Go so this seems tricky so far.

A bit of context on Railgun: Railgun speeds up requests between the Cloudflare edge and the origin web site by establishing a permanent optimized connection and performing delta compression on HTTP responses.

The Railgun connection uses a custom binary protocol over TLS, and the two endpoints are Go programs: one on the Cloudflare edge and one installed on the customer servers. This means that the whole connection goes through the Go TLS stack, crypto/tls.

That connection failing with local error: unexpected message means that the customer’s side of the connection sent something that confused the Go TLS stack of the Railgun running on our side. Since the customer is running an F5 load balancer between their Railgun and ours, this points towards an incompatibility between the Go TLS stack and the F5 one.

However, when my colleague Matthew tried to reproduce the issue by connecting to the load balancer with a simple Go crypto/tls.Dial, it succeeded.

PCAP diving

Since Matthew sits at a desk opposite mine in the Cloudflare London office, he knew I'd been working with the Go TLS stack for our TLS 1.3 implementation. We quickly ended up in a joint debugging session.

Here's the PCAP we were staring at.

So, there's the ClientHello, right. The ServerHello, so far so good. And then immediately a ChangeCipherSpec. Oh. Ok.

A ChangeCipherSpec is how TLS 1.2 says "let's switch to encrypted". The only way a ChangeCipherSpec can come this early in a 1.2 handshake is if session resumption happened.

And indeed, by focusing on the ClientHello we can see that the Railgun client sent a Session Ticket.

A Session Ticket carries some encrypted key material from a previous session to allow the server to resume that previous session immediately instead of negotiating a new one.

RFC diving

When presenting a ticket, the client MAY generate and include a
Session ID in the TLS ClientHello. If the server accepts the ticket
and the Session ID is not empty, then it MUST respond with the same
Session ID present in the ClientHello.

So a client that doesn't want to guess whether a Session Ticket is accepted or not will send a Session ID and look for it to be echoed back by the server.

The code in crypto/tls, clear as always, does exactly that.

func (hs *clientHandshakeState) serverResumedSession() bool {
	// If the server responded with the same sessionId then it means the
	// sessionTicket is being used to resume a TLS session.
	return hs.session != nil && hs.hello.sessionId != nil &&
		bytes.Equal(hs.serverHello.sessionId, hs.hello.sessionId)
}

Uh oh. Wait. Those are not zeroes. That's not padding. That's... memory?

At this point the impression of dealing with a Heartbleed-like vulnerability got pretty clear. The server is allocating a buffer as big as the client's Session ID, and then always sending back 32 bytes, bringing along whatever uninitialized memory was in the extra bytes.

Browser diving

I had one last source of skepticism: how could this not have been noticed before?

The answer is banal: all browsers use 32-byte Session IDs to negotiate Session Tickets. Together with Nick Sullivan I checked NSS, OpenSSL and BoringSSL to confirm. Here's BoringSSL for example.

/* Generate a session ID for this session based on the session ticket. We use
 * the session ID mechanism for detecting ticket resumption. This also fits in
 * with assumptions elsewhere in OpenSSL. */
if (!EVP_Digest(CBS_data(&ticket), CBS_len(&ticket),
                session->session_id, &session->session_id_length,
                EVP_sha256(), NULL)) {
  goto err;
}

BoringSSL uses a SHA256 hash of the Session Ticket, which is exactly 32 bytes.

(Interestingly, from speaking to people in the TLS field, there was an idle intention to switch to 1-byte Session IDs but no one had tested it widely yet.)

As for Go, it’s probably the case that client-side Session Tickets are not enabled that often.

Disclosure diving

After realizing the security implications of this issue we compartmentalized it inside the company, made sure our Support team would advise our customer to simply disable Session Tickets, and sought to contact F5.

After a couple of misdirected emails that were met with requests for Serial Numbers, we got in contact with the F5 SIRT, exchanged PGP keys, and provided a report and a PoC.

The report was escalated to the development team, and confirmed to be an uninitialized memory disclosure limited to the Session Ticket functionality.

It's unclear what data might be exfiltrated via this vulnerability, but Heartbleed and the Cloudflare Heartbleed Challenge taught us not to make assumptions of safety with uninitialized memory.

In planning a timeline, the F5 team was faced with a rigid release schedule. Considering multiple factors, including the availability of an effective mitigation (disabling Session Tickets) and the apparent triviality of the vulnerability, I decided to adhere to the industry-standard disclosure policy adopted by Google's Project Zero: 90 days with 15 days of grace period if a fix is due to be released.

As it happens, today marks both the expiration of those terms and the scheduled release of the first hotfix for one of the affected versions.

I'd like to thank the F5 SIRT for their professionalism, transparency and collaboration, which were in pleasant contrast with the stories of adversarial behavior we hear too often in the industry.

The issue was assigned CVE-2016-9244.

Internet diving

When we reported the issue to F5 I had tested the vulnerability against a single host, which quickly became unavailable after disabling Session Tickets. That left me with both low confidence in the extent of the vulnerability and no way to reproduce it.

This was the perfect occasion to perform an Internet scan. I picked the toolkit that powers Censys.io by the University of Michigan: zmap and zgrab.

zmap is an IPv4-space scanning tool that detects open ports, while zgrab is a Go tool that follows up by connecting to those ports and collecting a number of protocol details.

I added support for Session Ticket resumption to zgrab, and then wrote a simple Ticketbleed detector: zgrab sends a 31-byte Session ID and compares it with the one returned by the server.

By picking 31 bytes I ensured that at most a single byte of sensitive information would leak per probe, keeping the disclosure negligible.

I then downloaded the latest zgrab results from the Censys website, which thankfully included information on what hosts supported Session Tickets, and completed the pipeline with abundant doses of pv and jq.

After getting two hits in the first 1,000 hosts from the Alexa top 1m list in November, I interrupted the scan to avoid leaking the vulnerability and postponed to a date closer to the disclosure.

While producing this writeup I completed the scan, and found between 0.1% and 0.2% of all hosts to be vulnerable, or 0.4% of the websites supporting Session Tickets.

Read more

For more details visit the F5 K05121675 article or ticketbleed.com, where you'll find a technical summary, affected versions, mitigation instructions, a complete timeline, scan results, IPs of the scanning machines, and an online test.

https://blog.filippo.io/tls-1-3-at-33c3/ — Wed, 01 Feb 2017

Nick Sullivan and I gave a talk about TLS 1.3 at 33c3, the latest Chaos Communication Congress. Here's the Fahrplan entry.

We spoke about the flow of TLS 1.2 vs. TLS 1.3, how it manages to save a round trip, resumption and 0-RTT, forward secrecy and replays, all the things that were removed, all the things that were made more robust, and finally about the history and implementation of the spec.