This year's Internet Problem Solving Contest is over. This year was
special for me, because for the first time I was the root responsible for
making sure that shit doesn't happen during the contest and in case shit
does happen, making it go away.

Historically, we always had server outages during the contest due to the
relatively high number of HTTP requests from contestants, especially at
the immediate beginning of the contest, when everyone is downloading the
problem statements, and then also as the end draws near, when everyone
keeps spamming F5 on the results page.

After I took charge of our servers a year and a half ago, I decided to put
an end to these outages. Unfortunately, last year I was busy around the
time of the contest (writing my Bachelor's thesis), so I didn't really
have the time to tune our servers. This year, however, I finally got
around to boosting the performance of the IPSC website.

Tuning the server

Usually the server outages were caused by Apache spawning too many server
processes and causing the server to swap itself into oblivion. The
previous version of the web interface was a PHP thing, and it used to be
hosted on our main server that we use for everyday tasks: SVN/git hosting,
people's personal pages, SSH access, etc.

Because the server hosts people's personal pages, often using PHP and
things like .htaccess with mod_rewrite, we can't really ditch Apache on
that server, and getting that mammoth to behave is no easy task. (It's no
secret I have little love for Apache, or even PHP, for that matter.)

Last year, Tomi Belan rewrote the whole web frontend in Python as a WSGI
application. Given my involvement in the Python web development community,
I obviously consider this a step in the right direction. My immediate
intention was to put this app on a separate virtual machine behind nginx
and tune it to accept more than just a few (<10) requests per second.

So that's what we did. We used a spare virtual machine that had been
sitting on our physical server unused for many years, and set up nginx and
uWSGI to serve the app. Tomi implemented certain measures to make the most
expensive pages (the results pages) load much faster.
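The wiring itself is the usual nginx-in-front-of-uWSGI affair. A minimal
sketch – the socket path, module name and process count here are all made
up, not our actual config:

# uWSGI side (ini config) – run the WSGI app behind a Unix socket
[uwsgi]
socket = /run/uwsgi/ipsc.sock
module = ipsc.wsgi:application
processes = 4

# nginx side – hand every request over to the uWSGI socket
server {
    listen 80;
    location / {
        include uwsgi_params;
        uwsgi_pass unix:/run/uwsgi/ipsc.sock;
    }
}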

Then, about a week before the actual contest, I ran some benchmarks to
find out how high a load the server would be able to handle. I played
around with httperf, autobench and ab for a while, and found out that if
the server was hit with too many requests (about 500 per second), it would
just respond to most of them with a 502 error. I found that weird – I
certainly didn't expect the server to swat away nearly all requests for a
page that takes about one millisecond to render with a 502 Bad Gateway
error.
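For the record, a typical run looked something like the following – the
URL and the numbers here are illustrative, not the exact ones I used:

ab -n 10000 -c 500 http://ipsc.example.org/

That is, 10,000 requests in total with 500 of them in flight at any given
moment.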

So I tried to look at the logs, and in the nginx error log I found the
following:

The first candidate for a solution I found by googling was to increase the
net.netfilter.nf_conntrack_max sysctl. However, that didn't help, so
I kept searching until I found the real cause.

Each socket keeps a queue of incoming connections that have not yet been
accepted by the application listening on the socket. By default on Linux,
the queue is limited to 128 incoming connections, which, I think, is
surprisingly low. uWSGI sets it even lower for its socket: it only allows
100 incoming connections. In uWSGI the relevant setting is listen,
which I raised to 2048. However, this alone isn't enough as Linux won't
allow any process to raise this above the global limit, which is set in
the net.core.somaxconn sysctl whose default value is, as I already
mentioned, 128. So I raised this to 2048 as well.
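In config terms, the two changes boil down to something like this (the
file locations are whatever your distribution uses; these are just
examples):

# /etc/sysctl.conf (or a drop-in under /etc/sysctl.d/) – the global cap
net.core.somaxconn = 2048

# uWSGI ini config – ask for a longer backlog than the default 100
[uwsgi]
listen = 2048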

Next, nginx started to complain about insufficient worker connections:

This is a symptom of an insufficient limit on the number of open file
descriptors per process. If you google for these error messages, you'll
come across a lot of suggestions to modify /etc/security/limits.conf,
raise the system-wide limit on open files to 4096 or whatever (both soft
and hard), add pam_limits.so to the common-session config file of PAM, log
out and back in, and restart the server. While this may work the moment
you start the server manually, the higher limit won't apply if the server
is started at bootup by init, since in that case it doesn't go through any
PAM mechanism that would apply the settings from limits.conf.

Instead, the actual solution on Debian is to enable ULIMIT="-n 4096"
in /etc/default/nginx or add the following to your nginx.conf:

worker_rlimit_nofile 4096;
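The related per-worker limit lives in the events block of nginx.conf, and
it makes sense to keep the two in sync, since a worker can't use more
connections than it has file descriptors:

events {
    worker_connections 4096;
}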

After going through all this, I was satisfied with the server setup. It
could serve about 600 requests for the main page per second, and when the
request rate grew beyond that, it would buffer a reasonable number of
requests before responding with 5xx codes, instead of failing immediately.

I also did a few benchmarks on the main results page. The page itself
renders within 10 milliseconds, which is quite fast, considering it's half
a megabyte long. That means this page can be served at more than 100
requests per second, which should be sufficient: at around 125 requests
per second, half-megabyte responses add up to a bandwidth of about
62 MB/s – one half of the bandwidth of the server's gigabit connection.

The morning before the contest

The day of the contest I got up at about 9:00 a.m. The contest itself was
starting at 12:00 p.m. in our timezone. I ate my breakfast, checked my
email, read my RSS feeds and was getting ready to go out to take care of a
few things before the start of the contest. As I was about to pack my
laptop, I noticed that the SSH connection to our server had timed out.
This is nothing exceptional – our faculty, where our servers are housed,
has connectivity outages quite often.

Unfortunately, as I was leaving, Usamec told me the IPSC server was down
as well, while other machines in the same room were accessible. Crap.

So I hurried to school to threaten the server with a hammer to make it
behave. As I was on my way, Miňo Plžík called me and said he was
investigating the cause. When I arrived at school and connected a display
and a keyboard to the machine, I was greeted by this backtrace:

While I was on my way, Miňo found out that the problem was most likely
the one described in this bug report in the “Xen made easy!” bug tracker.
Later I found the corresponding Debian bug report. In short, it's a bug
in the Linux netback/netfront implementation used by Xen, triggered when
the DomU's kernel tries to offload the fragmentation of a packet bigger
than 64 KB to the virtual network interface. When this happens, the
netback driver running in Dom0 disables the affected network interface
altogether, with no way to restore it other than shutting down the DomU
and creating it again. In addition, it occasionally tries to dereference
a null pointer, which has other side effects on the whole Xen stack.

Anyway, according to the first bug report, a workaround should be to
disable GSO in the affected DomU, so we rebooted the whole physical
machine (because of the kernel oops), started up all the DomUs and
disabled GSO.
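Disabling it is a one-liner; assuming the interface inside the DomU is
eth0, it looks like this:

# turn off generic segmentation offload on the virtual interface
ethtool -K eth0 gso off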

This was the >40-minute outage you might have seen during the practice
session.

The contest itself

Now for the thing we were the most curious about: how the server would
handle the load during the actual contest.

I made a plot of the number of requests per minute throughout the contest.

[Graph: number of requests per minute throughout the contest]

The beginning at 12:00 p.m. was the first big challenge. Just to be on
the safe side, though, Mišof cooked up two password-protected zipfiles
containing all the problem statements; people could download those from
another server beforehand, and at 12:00 he published the password. This
presumably reduced the load on the main server a little bit.

Still, at the beginning of the contest, the number of requests
spiked at more than 5200 per minute. The server didn't really notice,
though, with a load of about 0.25. To put this into perspective: on the
server that previously hosted the IPSC website, the load would at this
point jump into the double digits and the machine would be swapping like
crazy. If there were any significant lags this year, they can only have
been caused by insufficient bandwidth of the internet connection.

Within a few minutes the request rate dropped significantly and settled
slightly above 1000 requests per minute where it would have stayed
throughout the whole contest, had things gone without any unexpected
trouble. However, as anyone who's been around computers knows, shit just
does happen.

At 12:13:33, the requests just stopped coming in. That was really weird:
I was still connected to the server from the local network, and the public
network interface appeared to be up, only no packets were going through.
At first I was somewhat clueless; however, the syslog on the hypervisor
revealed the issue to be the same one that had hit us in the morning:

What the hell? How? I thought I had implemented the workaround (disabled
GSO) on all relevant DomUs. As it turns out, nope – I had forgotten about
this particular virtual machine. As a matter of fact, it didn't even have
ethtool installed. A quick shutdown, xm create, try to install ethtool.
However, as soon as the machine booted up, it was flooded with incoming
HTTP requests from contestants nervously spamming the F5 button, and the
interface crashed before I could install ethtool and disable GSO. (I
guess I'm too slow at typing. d-: ) That's the small spike you can see in
the plot. So, to get the thing running again, I had to give it another
reboot.

Finally, I was able to get the machine to run with disabled GSO. I
thought we were safe and that there was nothing worse that could happen to
us. Again, I was proven wrong when the exact same thing happened at
12:30:53.

This time, I was really desperate. We had already applied the workaround
suggested in the “Xen made easy!” bug report and still it didn't help. I
was considering two options: either try to migrate the whole virtual
machine to a physical computer as fast as possible, or try to perform a
dist-upgrade – the machine was still running Debian squeeze, and I thought
that wheezy might already contain the fix mentioned in the bug report.
(This was before I found the Debian bug, which makes it clear this wasn't
true.)

After a short while, Miňo suggested trying to disable every possible kind
of offloading on the network interface with the following short shell
script:
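Roughly like this – the exact feature list and the interface handling are
my assumptions, but the idea is to turn off every kind of offloading
ethtool can toggle:

#!/bin/sh
# Disable every offload feature on the given interface (default eth0).
IFACE="${1:-eth0}"
for feature in rx tx sg tso ufo gso gro lro; do
    ethtool -K "$IFACE" "$feature" off
done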

Being as desperate as I was, I set this as the post-up script of both
Ethernet interfaces and rebooted the machine once again. From then on, I
was constantly watching the HTTP logs for any silent period longer than a
second. We even had a real-time visualization with Logstalgia on a
projector. Luckily, after that incident, the interface didn't crash
again.
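For reference, hooking a script up as post-up on Debian happens in
/etc/network/interfaces; the script path and the addressing here are made
up:

# /etc/network/interfaces
auto eth0
iface eth0 inet dhcp
    post-up /usr/local/sbin/disable-offloading eth0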

The last big challenge was the end of the contest, when everyone keeps
reloading the results page. This is somewhat different from the beginning
of the contest, because in the beginning, everyone wants to download the
problem statements in the first possible second, which means a large
number of requests in one second, but the content is mostly static. At
the end of the contest, though, the load tends to build up more gradually
as more and more teams give up during the last few minutes and just keep
reloading the results page. Also, the results page is generated
dynamically.

To be honest, the contents of the results page are only regenerated every
15 seconds, not on each request, but still, the page has to go through a
little bit of Python processing before it is returned to the client.
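By the way, a similar effect can be achieved purely at the nginx layer
with micro-caching. This is not what our frontend does – ours regenerates
the page inside the app – but as a sketch (zone name, paths and location
all made up):

# cache successful responses from uWSGI for 15 seconds
uwsgi_cache_path /var/cache/nginx/results keys_zone=results:1m;

server {
    location /results/ {
        include uwsgi_params;
        uwsgi_pass unix:/run/uwsgi/ipsc.sock;
        uwsgi_cache results;
        uwsgi_cache_valid 200 15s;
    }
}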

Anyway, again, there was a spike of about 3500 requests in the first
minute after the end of the contest, which faded pretty quickly, and,
again, the server didn't even notice – its load was lower than at some
points during the contest.

To sum up, I think the contest went fairly well, at least from my point of
view. There was the hiccup caused by a kernel bug, and partly by myself,
which I apologize for. For the rest of the contest, though, the server
handled the load beyond all expectations.

Looking forward to IPSC 2014 already. (-;

As a bonus, feel free to watch this Logstalgia visualization of the
contest.