Let's say you want to build an SSL server in C++. Transmitting data is going to
be a major part of your application, so you need it to be fast and to use system
resources, especially processor cores, efficiently.

You may have heard of Asio (possibly better known as Boost.Asio). Asio is a
"cross-platform C++ library for network and low-level I/O programming that
provides developers with a consistent asynchronous model". It's widely used and
mature, and in the future may become part of the C++17 standard library. Note
that this post is not a tutorial or an introduction to Asio, but rather a study
on how scalable it is in our use case, why it scales poorly and how to improve
it.

Benchmarking

Asio includes SSL support built on the OpenSSL library. I've created a
"naive" example benchmark that sets up a server with a given number of
threads, creates a given number of connections and measures the time it takes
each of them to send M messages of size N. The code uses Asio 1.10.6, which
is the current stable version. Here's how I compile it on OS X:

As you can see, the chart shows an increase in bandwidth for two threads, which
we would expect even for a single connection (as one endpoint has to encrypt
the data and the other has to decrypt it). After that point, the bandwidth
rapidly falls off
with the number of threads, reaching about 200 MB/s when used with multiple
connections. Connections can potentially run in parallel, so the ideal chart
would show a linear speedup with the number of threads, capping at the ratio of
2 threads/connection.
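
To put a number on "ideal", here is a rough sketch. It assumes perfect scaling and that a connection can keep at most two threads busy (one encrypting, one decrypting). The function name is mine and is not part of the benchmark.

```cpp
#include <algorithm>

// Upper bound on speedup relative to one thread, assuming perfect scaling and
// at most two busy threads per connection (one encrypting, one decrypting).
// Hypothetical helper for illustration; it is not part of the benchmark code.
constexpr unsigned idealSpeedup(unsigned threads, unsigned connections) {
    return std::min(threads, 2u * connections);
}
```

Under these assumptions a single connection stops gaining anything beyond two threads, while 20 connections would keep scaling linearly up to 40 threads.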

Summarizing the experiment, Asio with OpenSSL not only does not scale with the
number of threads, it actually slows down considerably when the number of
concurrent operations is high enough. This suggests heavy lock contention.
Let's check our theory in a profiler:

Inverted call tree

Most of our application's time is spent waiting for locks, most of the
locking occurs in OpenSSL's ERR_get_state function, and it can most often be
found in the call tree of asio::ssl::detail::engine::perform. A look at the
engine::perform implementation in engine.ipp quickly shows us that...
everything is fine. Sure, you can maybe make a few small optimizations, but in
the larger picture OpenSSL is used correctly (see the updates dated 2015-08-17
and 2015-08-20 below). There's no error there that would result in the lock
contention. We conclude that the bottleneck in scalability lies in OpenSSL
itself, more specifically in its error handling functions. So what now?

Trying BoringSSL

After Heartbleed, a few OpenSSL forks like LibreSSL and BoringSSL have
appeared with a vision to trim down, modernize and secure OpenSSL's code while
remaining mostly source-compatible with the original. A quick look at
LibreSSL's err.c file shows it's largely unchanged from OpenSSL's
version. But BoringSSL's err.c has obviously been reworked
and now uses thread-local storage instead of locking a global mutex for data
access!
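
To make the difference concrete, here is a minimal sketch of the two designs. It is not code from OpenSSL or BoringSSL: the class names and the plain integer error codes are made up for illustration. The first class keeps per-thread queues in one shared table behind a mutex, so every access from every thread serializes on the same lock. The second keeps each queue in thread-local storage and needs no lock at all.

```cpp
#include <deque>
#include <mutex>
#include <thread>
#include <unordered_map>

// Hypothetical illustration only; neither class is taken from OpenSSL or
// BoringSSL. Error codes are plain ints and 0 means "no error queued".

// Design 1: one shared table of per-thread queues behind a global mutex.
// Every push and pop, from any thread, contends for the same lock.
class LockedErrorStore {
public:
    void push(int code) {
        std::lock_guard<std::mutex> lock(mutex_);
        queues_[std::this_thread::get_id()].push_back(code);
    }

    // Returns the calling thread's oldest error, or 0 if it has none.
    int pop() {
        std::lock_guard<std::mutex> lock(mutex_);
        auto& q = queues_[std::this_thread::get_id()];
        if (q.empty()) return 0;
        const int code = q.front();
        q.pop_front();
        return code;
    }

private:
    std::mutex mutex_;
    std::unordered_map<std::thread::id, std::deque<int>> queues_;
};

// Design 2: each thread owns its queue through thread-local storage, so no
// synchronization is needed to reach it.
class TlsErrorStore {
public:
    static void push(int code) { queue().push_back(code); }

    // Returns the calling thread's oldest error, or 0 if it has none.
    static int pop() {
        auto& q = queue();
        if (q.empty()) return 0;
        const int code = q.front();
        q.pop_front();
        return code;
    }

private:
    static std::deque<int>& queue() {
        thread_local std::deque<int> q;
        return q;
    }
};
```

Both versions give every thread its own view of the errors; only the first one pays for a lock on each call, which is the kind of cost the profile above points at.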

As I mentioned, BoringSSL is mostly source-compatible with OpenSSL.
Unfortunately, there are a few changes to make before we
can use it with Asio. There are open issues in Asio's
GitHub repository to integrate these changes upstream, but even then
compatibility with BoringSSL is currently a moving target as the code is still
being cleaned up.

Well, that's much better. You may stop reading here, knowing that at the time
of writing OpenSSL's error handling causes it to scale poorly with the number
of threads, while BoringSSL scales much better. But there's still a
fall-off in bandwidth when the number of threads increases, so let's try to
answer that as well.

Cores, threads and io_services

If you want to find a bottleneck, profiling is usually the best answer:

Inverted call tree with BoringSSL

Looks like Asio's internal thread synchronization is the bottleneck this time.
There are two main approaches to get scalability in Asio: thread-per-core and
io_service-per-core. In the "naive" example I'm using the thread-per-core
approach as it always seemed more natural to me - we're basically creating a
threadpool that - if needed - can dedicate one or more threads to a single
connection, as opposed to io_service-per-core where each connection would be
served by at most one thread. But in the light of our profiling results I've
modified the example to use the second approach.

I've added a new class, IoServices, whose objects hold multiple
io_service instances. When we need an io_service - e.g. for a new client- or
server-side socket - we call ioServices.get(), which returns one of the
stored io_service objects on a round-robin basis.
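
The round-robin part can be sketched as follows. This is not the benchmark's IoServices class: to keep the example compilable without Asio, the pool is a template and `DummyService` stands in for asio::io_service. The names `RoundRobinPool` and `DummyService` are my own.

```cpp
#include <atomic>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <vector>

// Sketch of a round-robin pool. `Service` stands in for asio::io_service so
// that the example compiles without Asio; this is not the benchmark's code.
template <typename Service>
class RoundRobinPool {
public:
    explicit RoundRobinPool(std::size_t count) {
        if (count == 0) throw std::invalid_argument("pool must not be empty");
        for (std::size_t i = 0; i < count; ++i)
            services_.push_back(std::make_unique<Service>());
    }

    // Hands out the stored services in turn. The atomic counter makes it safe
    // to call get() from several threads at once.
    Service& get() {
        const std::size_t ticket = next_.fetch_add(1, std::memory_order_relaxed);
        return *services_[ticket % services_.size()];
    }

    std::size_t size() const { return services_.size(); }

private:
    std::vector<std::unique_ptr<Service>> services_;
    std::atomic<std::size_t> next_{0};
};

// Stand-in type used only for demonstration.
struct DummyService {};
```

In the io_service-per-core setup each stored service would be run by exactly one thread, so handlers of a connection bound to that service never run concurrently with each other.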

Let's put the results into a final chart. This time we'll just focus on 8
threads, 20 connections, and directly compare our different benchmark
applications. Note that I'm cheating here, if just a little bit:
io_service-per-core will perform better when there are multiple connections
per io_service, as it will result in a more balanced CPU load. Since we're
considering a server scenario, though, 20 connections is still a very low
number.

This still isn't the ideal chart (of course, threading overhead means that no
implementation could produce the ideal results), but it actually
scales up with the number of threads used.

That's it for the post. Now that you know how to make an Asio-based SSL server
scale, you can go build faster and safer applications.

Thanks for reading!

UPDATE 2015-08-17: While it's true that Asio uses the OpenSSL API
correctly, it's possible to write multi-threaded code that uses OpenSSL
and is not constrained by the error-handling bottleneck. Asio calls
ERR_clear_error() before each call to SSL_*() functions, as the OpenSSL
documentation states "The current thread's error queue must be empty
before the TLS/SSL I/O operation is attempted". To avoid the bottleneck,
instead of clearing the error queue before each operation, you have to
make sure to clear the queue after an error has occurred. This is
something that can be done in Asio code.

UPDATE 2015-08-20: The update above turned out to be incorrect, as
OpenSSL will still call locking functions internally. While the solution I
described above will decrease lock contention, it doesn't completely
remove the bottleneck.

So here comes the end of Google Summer of Code 2013. It's the perfect moment to
talk about what was planned, what was done, and - traditionally - what the
future holds. Oh, and let's get some stats and pictures (because that's what
we're here for)!

The difference between these numbers tells us a story: at least about ⅖ of the
code I wrote was removed afterwards through heavy refactoring. (Alright,
maybe this wasn't much of a story.) In other words, I not only overshot my
initial plans, I also made sure I was doing it in style. ;) Of course
the numbers take into account only the changes that were finally committed -
there were a lot more in between.

Let's make some code size comparisons between old and new importers.

I really wanted some graphs in this post.

As you can see, the numbers are similar. Rhythmbox and iTunes importers are made
bigger by XML-processing code (oh how I hate it), and Amarok 2.x and FastForward
importers by custom, rich configuration widgets. The simplest importers,
Clementine and Banshee, are small and pretty.

Oh, but that's not the whole story here, is it? All of the new importers also
contain write capabilities - they can sync the statistics back to the foreign
media player. Without it a new importer can easily fit inside 100 lines, as
demonstrated in one of my previous posts. Mission accomplished.

As an aside, I find it interesting to note that the number of lines does
translate very well into the number of characters. The average number of
characters per line for the measured code is 38.17, with a standard deviation of 2.60
characters between files.

Oh, and I have proof that I was actually doing something during GSoC. Do take
a look at this video that I made, if only for the amazing soundtrack (720p
recommended). For details, please see the video's description.

The future

So, it's the end of Google Summer of Code, but it's not the end of the project
nor my contribution to Amarok. Arguably the most important event in the
project's lifetime - code review - still lies ahead. Not only that, but I
already have some further refactorings in mind.

Other than the project, there's always a lot to do for Amarok, and the great
community around it makes it hard to leave - so I'm not. There's just too much
fun to be had. ;)

Well, that's it for the post. I'm going to take a few days off, and then the
academic term starts and I'm back to my day job - I'll need some time to
adjust my schedule. Thanks for sticking around. It's been - and continues to be
- a pleasure!

Last week I asked my mentor if I could skip that week's report and make the next
one (i.e. this one) a double one instead. The reason for this was that I was
working almost exclusively on tests, and tests more often than not make for a
dull post.

Well, after two weeks I have even more tests. I'd go as far as to say that
things are satisfyingly tested.

I made a base test suite that relies on convention: importers are expected to
have tracks with certain metadata in their databases, and the tests make sure
that this data is imported correctly. Tracks with the right metadata are
pre-created and stored in the source tree, so creating a test database is a
matter of adding them to the media player's collection.
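
The idea can be sketched in a few lines. This is not the actual Amarok test code: every name and every value below is made up for illustration, and tracks are keyed by title only to keep the example short.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// All names and values below are hypothetical; they only illustrate the idea.
struct TrackStats {
    int rating = 0;
    int playCount = 0;
    bool operator==(const TrackStats& other) const {
        return rating == other.rating && playCount == other.playCount;
    }
};

// Keyed by track title for brevity.
using TrackTable = std::map<std::string, TrackStats>;

// The convention: every importer's test database contains these tracks.
inline const TrackTable& expectedTracks() {
    static const TrackTable expected = {
        {"title0", {2, 5}},
        {"title1", {8, 0}},
    };
    return expected;
}

// Interface that each importer-specific fixture implements.
class ImporterFixture {
public:
    virtual ~ImporterFixture() = default;
    virtual TrackTable importTracks() = 0;
};

// Shared check, written once and run against every importer. Returns the
// titles that are missing or were imported with the wrong statistics.
inline std::vector<std::string> findMismatches(ImporterFixture& fixture) {
    const TrackTable imported = fixture.importTracks();
    std::vector<std::string> mismatches;
    for (const auto& entry : expectedTracks()) {
        const auto it = imported.find(entry.first);
        if (it == imported.end() || !(it->second == entry.second))
            mismatches.push_back(entry.first);
    }
    return mismatches;
}

// Stand-in fixture that returns canned data instead of reading a database.
class FakeFixture : public ImporterFixture {
public:
    explicit FakeFixture(TrackTable tracks) : tracks_(std::move(tracks)) {}
    TrackTable importTracks() override { return tracks_; }

private:
    TrackTable tracks_;
};
```

With this shape, supporting a new importer in the tests means writing one small fixture; the expectations themselves stay in a single place.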

There isn't much more to tests that's interesting, so let's skip to the next
topic. GSoC 2013 is in the homestretch. September 16 is "soft pencils-down"
and September 23 a "hard pencils-down" date. We're expected to have all the code
done on September 16, and to spend the following week on documentation. Having
that in mind, this is my much-more-detailed-than-usual plan for this week:

Since I've been documenting as I went, the week between September 16 and 23 will
be devoted to small bugfixes, design tweaks and typo fixes. That's also the week
in which I remove the old media-player importing capabilities and take a minute
or two to celebrate.

As always, you can check out my progress on my public Amarok clone. The branch
is named gsoc-importers.

Thanks for reading!

The iTunes database will then have to be imported back into iTunes.
But hey, it's synchronization. Kind of.

Fixing

Last week I fixed the Amarok 2.x embedded database importer. There were quite a
few problems with handling an external database process:

if the QProcess object received commands (start(), kill(),
terminate()) from a thread other than the one that created it, it
resulted in wrong behavior, often in the form of a crash (can't get much
wronger than that)

the QProcess object does not encapsulate the process; it's only an
interface. When a QProcess object was destroyed while the process was still
running, it issued a warning and killed the process with SIGKILL. Normally
you'd want to stop an ongoing process with a SIGTERM, so manual lifetime
management is needed

if the server had stopped (which it does, after a period of inactivity),
calling QSqlDatabase::removeDatabase() resulted in a SIGPIPE signal from
inside the MySQL driver

related to that, an old QSqlDatabase connection silently failed to work if
the server had been restarted. There would be no warnings on
QSqlDatabase::open().
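
The "SIGTERM first, SIGKILL only as a last resort" policy from the second point can be sketched like this. The Amarok code is built on QProcess; the sketch swaps in plain POSIX calls so that it compiles without Qt, and it does not address the thread-affinity problem from the first point. `ChildProcess` and `spawnSleeper` are hypothetical names, and the three-second grace period is an arbitrary choice.

```cpp
#include <chrono>
#include <thread>

#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// POSIX-only sketch; ChildProcess and spawnSleeper are hypothetical names and
// this is not the Amarok implementation, which is built on QProcess.
class ChildProcess {
public:
    explicit ChildProcess(pid_t pid) : pid_(pid) {}
    ChildProcess(const ChildProcess&) = delete;
    ChildProcess& operator=(const ChildProcess&) = delete;

    // The owner decides how the child is shut down when it goes out of scope.
    ~ChildProcess() { stop(std::chrono::seconds(3)); }

    bool running() const { return pid_ > 0; }

    // Sends SIGTERM, polls for exit until the grace period ends, and only then
    // falls back to SIGKILL.
    void stop(std::chrono::milliseconds gracePeriod) {
        if (!running()) return;
        ::kill(pid_, SIGTERM);
        const auto deadline = std::chrono::steady_clock::now() + gracePeriod;
        while (std::chrono::steady_clock::now() < deadline) {
            if (::waitpid(pid_, nullptr, WNOHANG) == pid_) {  // exited, reaped
                pid_ = -1;
                return;
            }
            std::this_thread::sleep_for(std::chrono::milliseconds(10));
        }
        ::kill(pid_, SIGKILL);
        ::waitpid(pid_, nullptr, 0);
        pid_ = -1;
    }

private:
    pid_t pid_;
};

// Starts a long-running child for demonstration purposes.
inline pid_t spawnSleeper() {
    const pid_t pid = ::fork();
    if (pid == 0) {
        ::execlp("sleep", "sleep", "60", static_cast<char*>(nullptr));
        ::_exit(127);  // only reached if exec fails
    }
    return pid;
}
```

A caller would wrap the result of spawnSleeper() in a ChildProcess and let the destructor handle the shutdown when the object goes out of scope.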

There was also the question of waiting for the mysqld process to be ready to
serve. After some research I decided to adopt the approach that MySQL startup
scripts (including the "official" script, mysql.server) use, which is to wait
for the server's PID file to be modified. Overall, I'm quite satisfied with the
results; it's reliable and fast enough.

Creating

By the way, there's a new statistics synchronization target: Clementine.

I know how much you like pictures

This brings the total number of importers to six, and marks the end of
implementing new importers. From now on, I intend to focus on existing code
including, but not limited to, read-write capabilities and - of course -
testing.

Simplifying

You may have noticed one additional target on the screenshot above: "Example."
Another thing I focused on this week was code deduplication and simplifying
creation of new importers. To show off, I prepared a basic "Example" importer.
Below is the full C++ code. Bear in mind that aside from the code, importers
need a plugin's *.desktop file and a CMakeLists.txt file. Also bear in mind that
with this code the importer is already fully reconfigurable and instances of it
can be created and removed at the user's leisure.

Last week was free of surprises. I applied some more polish to both the
Importers framework and concrete importers themselves, deduplicated some code -
just general maintenance. I had a very technical to-do list containing - mostly
- very minor entries, so there's nothing to write about in the post. The bottom
line is: the overall quality of my project is improving.

The list keeps gaining new entries, so hopefully I'll still have plenty to
do.

One thing that continues to give me a bit of trouble is the Amarok 2.x embedded
database importer. There are surprisingly many problems with managing a single
server process, shared between method calls, in a multithreaded environment.
On top of that, the QProcess class has its own set of quirks which need to be
taken into account, especially when it comes to destroying the object from a
different thread. I've got ideas, but it's something for the next
post.

In other news: Banshee importer!

It still has some issues, but nothing that won't be solved by this time next week.

Alright, that's me done for the post. It's been a short one, but as I said - no
major problems, no major changes. A steady march to high quality.

As always, you can check out my progress on my public Amarok clone. The branch
is named gsoc-importers.