Dec 5: Reverse ACLs - making IMAP LIST fast

This is a very technical post about recent improvements to the Cyrus IMAP server run at FastMail.

The Cyrus server is the core of our infrastructure. We segment data into three categories: original work product, master copy and cache. We store some user data in our replicated Mysql database, and files in the filestorage object pool, but more and more of your data is stored in Cyrus - all email, calendars, contacts and notes are now in the Cyrus spools. Plus I'm technical lead for Cyrus, so I write about it a lot! It's an ancient codebase written in the C language, which is being slowly modernised. This post will cover some of those structural changes.

Speed matters

Many years ago now, I rewrote the internals of the Cyrus mailbox handling code. There were surprisingly few on-disk format changes required, mostly just adding CRC32s to detect corruption early. But one thing I did change was delaying EXPUNGE handling. When you mark a message with the \Deleted flag (IMAP does two-phase delete) it's just a flag update in the index file. In series 2.3 and of Cyrus and earlier, if you then ran the EXPUNGE command (even UID expunge to just remove a single message) it caused a complete cyrus.index repack. Quite expensive.

Amongst other changes I made, I moved the contents of the cyrus.expunge file back into cyrus.index, and made EXPUNGE be just as cheap as a flag update. It sets the special internal flag FLAG_EXPUNGED, and updates some counters. It meant that instead of just a msgno offset into the index, there's now a separate recno offset - this won't make much sense unless you read the code - and maybe not even then!

The good thing is, EXPUNGE is now very cheap in current Cyrus, particularly if there's nothing to actually remove. This turns out to have been very handy now that Mail.app does an EXPUNGE every time it drops out of IDLE in the default configuration (compact mailboxes ticked). This is causing quite a lot of pain for those stuck on 2.3.

To tell you the truth, speeding EXPUNGE wasn't even the primary goal - the primary goal was making the QRESYNC (previously RFC5162) handling be exactly correct. Not only for QRESYNC and CONDSTORE themselves, but because I wanted to use the underlying change tracking mechanism to make cross-datacentre replication more efficient. But the good design has paid off.

LIST is slow

Back when MyOpera was still a thing, we create a user account for every single active MyOpera user, about a million accounts in all. We spread them over a bunch of backend servers, but we did notice that LIST commands were running slower. One of the reasons we split our users across hundreds of separate stores is to speed the LIST command. Like many things in Cyrus, an O(N) algorithm scales surprisingly well, but we've changed the mailboxes.db format to a less efficient but more expressive DList based format, and it does spend a bit of CPU just parsing the records to check the ACLs.

And it's wasted CPU. Very few users have any shared folders at all. Even those who do, it's only a handful of folders. But here we hit another problem. The global shared namespace. It has no prefix in the default namespace, and it has no prefix in the mailboxes.db - so we wind up scanning the ENTIRE mailboxes.db twice, one for users, and again for shared. And neither time do we return more than a few records in even the most actively shared account. There's even a Cyrus config option called foolstupidclients to work around the fact that LIST is slow on big sites.

Most of the complexity in the LIST code (particularly mboxlist_findall) has been added to try and reduce the amount of DB work required.

This is clearly something that's ripe for optimisation.

Building infrastructure

The LIST code was a mess. If you look through the release notes for the 2.4 series, nearly every release's major claim to fame was "finally fixing LIST". Except it broke in interesting ways for other users. With three different possible values for virtdomains, altnamespace, unixhierarchysep and on disk hashing too, plus the subscriptions database and the horrors of LIST-EXTENDED there were almost infinite ways to screw up. I managed to try most of them.

Luckily we now have Cassandane and a ton of LIST command tests. But LIST was still a mess of special case magic. Optimising it was going to be a nightmare. So I fixed little bits - an abstract datatype, mbentry_t, for database records - an abstract datatype, mbname_t, for mailbox names and their various representations - methods for converting between them - caching representations... tons of small things.

A lot of the work was done as part of making the final LIST-EXTENDED fixes that I did during the marathon CardDAV work.

And during the beers we had to celebrate surviving DDoS week, I was asked how long it would take to actually build the cross-domain sharing I'd been talking about, now that most of the parts were in place.

Attempting murder

The Cyrus aggregation protocol is called murder, after a murder of crows. Computer people have a sense of humour, honest.

Right now, all users in a family or business are on a single store. We have an audit task to ensure this, but it does mean that we can't support a business with more than 1TB of email. We need to solve this, so aggregating the entire mailbox listing of all users into a single namespace has to happen at some time.

Also, users can only share to other users if the are in the same domain. Two reasons, one is that it's hard to represent domains with '.' as the separator. The other is that it's cheaper to only scan the folders within a single domain. Again, it's expensive to do a linear scan of a big database, so there were lots of tricks added over time to make it cheaper.

Work smarter, not harder

We knew that we had to do a reverse lookup by ACL instead of scanning the entire namespace, it's the only thing that makes sense. But the problem looked quite intractable when the LIST code did tons of different things at once, and there were no nice APIs for handling the entries and names. So we kept putting it off.

But I had made a drunken boast on Friday that I could get crossdomains working, and I knew it had to have reverse ACLs in place for it to be efficient enough. Cyrus doesn't have an indexed database type, only key value stores with foreach on prefixes. So I had to build on top of that. The only sane place to store this data is the mailboxes.db, so it can be updated in the same transaction as any change.

crossdomain: new listing mode that supports folders across domains
NOTE: this is very much recommended only for unixhierarchysep: yes,
otherwise you can't work with domains with dots in them!
Usage:
crossdomains: yes
unixhierarchysep: yes
virtdomains: userid
There's still no way to support shared folders in other domains.

mboxlist: implement reverse ACLs
To enable, set 'reverseacls: yes' in imapd.conf and restart the
server. Since it requires a complete table sweep in a transaction,
this will only be done by ctl_cyrusdb -r at startup. You can also
remove the reverse ACL support by removing the config key - at next
startup it will remove all keys starting with $RACL.
To enable reverse ACLs, it creates a key called '$RACL' If this key
is present, then the reverse ACLs are used for looking up records in
NAMESPACE_USER and NAMESPACE_SHARED.
NOTE: reverse ACLs are not used by admin connections at all, since
admins tend to have read access to everything, and it would be
pointless and create bloat. Likewise, reverse ACLs aren't use for
the user's own folders (INBOX and friends), because they are always
accessible to their owner.
When the $RACL key is present, all updates from mboxlist_update or
mboxlist_delete will cause the ACL to be examined, and for each
user who:
a) is not the mailbox owner
b) is not an admin (NOTE: making an existing user not an admin will break things)
c) has the 'l' right
A record will be created or removed. If the folder is in NAMESPACE_USER
then the record will have key:
$RACL$U$<acluserid>$<intname>
If it's NAMESPACE_SHARED, then it will be:
$RACL$S$<acluserid>$<intname>
The record has a zero-length value. This works in all decent DB types, because
it's also the method used by the subscriptions database.
When a LIST (mboxlist_findall) command is run, then if it's a non-admin
request, and $RACL is present in the mailboxes.db, the reverse ACLs will
be read into a strarray_t, and each record will be called with cyrusdb_forone.

The crossdomains option isn't switched on in production, though it has Cassandane tests. It's waiting for us to switch on imap.fastmail.com and make a couple more small changes to allow the classic interface to remain functional.

Reverse ACLs on the other hand, they took about a day of testing, and they were ready to go. The code turns out to have been quite simple to write once all the groundwork was laid. LIST is now fast for all FastMail customers - and if a client comes out that runs LIST every time it does something, our servers will cope!

Total DB records read, less than 30 - no matter how many users are on the server. And each single operation is O(log N) on a twoskip database, so very fast.

Before the RACL changes, it looked like this:

FETCH brong.net!user.test4
-> one record which is processed directly
FOREACH brong.net!user.test4.
-> about 10 records usually, folders, addressbook, calendars, notes
which are processed directly
FOREACH brong.net!user.
-> some hundreds of records, not so bad since brong.net is small, but
if it was a fastmail.com user, thousands of records. Reject records
starting with brong.net\!user.test4[\0.]
FOREACH brong.net!
-> all the same records again, but reject any starting with brong.net\!user.

Total DB records read, approximately 2N, where N is the number of user folders in the same domain (or total number of folders on the server once we turn on crossdomain)

Exactly the same records are actually accepted either way, but the two giant scans over everything in the domain are removed, and hence removing the domain prefix is also possible.

Next steps

Aggregated murder is still very much TODO - I'm seriously considering something like Redis or Cassandra as the global database rather than the current system of a single-point-of-failure MUPDATE master. It also needs to handle mailboxes being on multiple servers to integrate with the replication system, and be designed for high update rates, because I want to store the highestmodseq and sync_crc of every folder as well to allow some real frontend optimisations and global consistency checks.