Some time ago we discovered that certain very slow downloads were getting abruptly terminated and began investigating whether that was a client (i.e. web browser) or server (i.e. us) problem.

Some users were unable to download a binary file a few megabytes in length. The story was simple—the download connection was abruptly terminated even though the file was in the process of being downloaded. After a brief investigation we confirmed the problem: somewhere in our stack there was a bug.

Describing the problem was simple, reproducing the problem was easy with a single curl command, but fixing it took surprising amount of effort.

In this article I’ll describe the symptoms we saw, how we reproduced it and how we fixed it. Hopefully, by sharing our experiences we will save others from the tedious debugging we went through.

Two things caught our attention in the bug report. First, only users on mobile phones were experiencing the problem. Second, the asset causing issues—a binary file—was pretty large, at around 30MB.

After a fruitful session with tcpdump one of our engineers was able to prepare a test case that reproduced the problem. As so often happens, once you know what you are looking for reproducing a problem is easy. In this case setting up a large file on a test domain and using the --limit-rate option to curl was enough:

Clearly our server was doing something wrong. The RST packet coming from CloudFlare server is just bad. The client behaves, sends ACK packets politely, consumes the data at its own pace, and then we just abruptly cut the conversation.

We are a heavy users of NGINX. In order to isolate the problem we set up a basic off-the-shelf NGINX server. The issue was easily reproducible locally:

This proved the problem was not specific to our setup—it was a broader NGINX issue!

After some further poking we found two culprits. First, we were using the reset_timedout_connection setting. This causes NGINX to close connections abruptly. When NGINX wants to time out a connection it sets SO_LINGER without a timeout on a socket, followed by a close(). This triggers the RST packet, instead of a usual graceful TCP finalization. Here’s an strace log from NGINX:

We could just have disabled the reset_timedout_connection setting, but that wouldn’t have solved the underlying problem. Why was NGINX closing the connection in the first place?

After further investigation we looked at the send_timeout configuration option. The default value is 60 seconds, exactly the timeout we were seeing.

http {
send_timeout 60s;
...

The send_timeout option is used by NGINX to ensure that all connections will eventually drain. It controls the time allowed between successive send/sendfile calls on each connection. Generally speaking it’s not fine for a single connection to use precious server resources for too long. If the download is going on too long or is plain stuck, it’s okay for the HTTP server to be upset.

In the config we ordered NGINX to use sendfile to transmit the data. The call to sendfile succeeds and pushes 5MB of data to the send buffer. This value is interesting—it’s about the amount of space we have in our default write buffer setting:

$ sysctl net.ipv4.tcp_wmem
net.ipv4.tcp_wmem = 4096 5242880 33554432

A minute after the first long sendfile the socket is closed. Let’s see what happens when we increase the send_timeout value to some big value (like 600 seconds):

After the first large push of data, sendfile is called more times. Between each successive call it transfers about 1.7 MB. Between these syscalls, about every 180 seconds, the socket was constantly being drained by the slow curl, so why didn’t NGINX refill it constantly?

A motto of Unix design is “everything is a file”. I prefer to think about this as: “in Unix everything can be readable and writeable when given to poll“. But what exactly does “being readable” mean? Let’s discuss the behavior of network sockets on Linux.

The semantics of reading from a socket are simple:

Calling read() will return the data available on the socket, until it’s empty.

poll reports the socket as readable when any data is available on it.

One might think this is symmetrical and similar conditions hold for writing to a socket, like this:

Calling write() will copy data to the write buffer, up until “send buffer” memory is exhausted.

poll reports the socket is writeable if there is any space available in the send buffer.

Surprisingly, the last point is not true.

It’s very important to realize that in the Linux Kernel, there are two separate code paths: one for sending data and another one for checking if a socket is writeable.

On the other hand, the conditions for a socket to be reported as “writeable” by poll are slightly narrower:

The last condition is critical. This means that after you fill the socket send buffer to 100%, the socket will become writeable again only when it’s drained below 66% of send buffer size.

Going back to our NGINX trace, the second sendfile we saw:

08:24:21 sendfile(5, 9, [6024754], 45748730) = 1768041

The call succeeded in sending 1.7 MiB of data. This is close to 33% of 5 MiB, our default wmem send buffer size.

I presume this threshold was implemented in Linux in order to avoid refilling the buffers too often. It is undesirable to wake up the sending program after each byte of the data was acknowledged by the client.

With full understanding of the problem we can decisively say when it happens:

The socket send buffer is filled to at least 66%.

The customer download speed is poor and it fails to drain the buffer to below 66% in 60 seconds.

When that happens, the send buffer is not refilled in time, it’s not reported as writeable, and the connection gets reset with a timeout.

There are a couple of ways to fix the problem.

One option is to increase the send_timeout to, say, 280 seconds. This would ensure that given the default send buffer size, consumers faster than 50Kbps will never time out.

Another choice is to reduce the tcp_wmem send buffers sizes.

The final option is to patch NGINX to react differently on timeout. Instead of closing the connection, we could inspect the amount of data remaining in the send buffer. We can do that with ioctl(TIOCOUTQ). With this information we know exactly how quickly the connection is being drained. If it’s above some configurable threshold, we could decide to grant the connection some more time.

My colleague Chris Branch prepared a Linux specific patch to NGINX. It implements a send_minimum_rate option, which is used to specify the minimum permitted client throughput.

The Linux networking stack is very complex. While it usually works really well, sometimes it gives us a surprise. Even very experienced programmers don’t fully understand all the corner cases. During debugging we learned that setting timeouts in the “write” path of the code requires special attention. You can’t just treat the “write” timeouts in the same way as “read” timeouts.

It was a surprise to me that the semantics of a socket being “writeable” are not symmetrical to the “readable” state.

Tuning a CDN to work well for all the users takes a lot of work. This write up is a result of hard work done by four engineers (special thanks to Chris Branch!). If this sounds interesting, consider applying!

Original URL: http://feedproxy.google.com/~r/feedsapi/BwPx/~3/WJB5NQIe8Po/

Recent Posts: <CONTENT /> v.6

Lets have a look on what we have on this article: Setting up Apache 2 in Linux Subsystem for Windows 10 Setting up MySQL Server in Linux Subsystem for Windows 10 Setting up PHP in Linux Subsystem for Windows 10 Visual Studio Code and PHP Extensions Few tweaks to work on PHP, Visual Studio Code …

There are some good pointers and excellent tips in Web Development and Advanced Techniques with Linux on Windows (WSL) that are useful if you’re using WSL as part of your dev toolkit. Getting the same envs, including PATH, running across both systems is handy. It’s cool to be able to develop in Windows and test …

Virtually nowhere to be found were the 90 percent of lawyers who practice outside biglaw, the business clients who do not run mega-corporations, the access-to-justice community, or those disenfranchised from the legal system. — Five Days, Two Conferences, One Echo Chamber | LawSites https://www.lawsitesblog.com/2019/02/five-days-two-conferences-one-echo-chamber.html