Tag: performance

We recently had the need to make sure our front end apache httpd reverse proxy and ssl termination server could handle the larger number of websocket connections we are going to use with it. Given websockets are longer lived connections, this is a different use of apache httpd and we want to get it right. The proxied service is capable of handling tens of thousands of concurrent connections, if not hundreds of thousands or more.

First, our testing tool is custom made, it makes all the websocket connections first and then proceeds to ping. This is important as it exercises the concurrent connections capabilities of httpd. When using it, the client system needs the ability to create enough sockets. The first limit I encountered was with my test client system. The shell environment defaults to 1024 open files limited. It is a soft limit, so use ulimit -S to adjust the limit. Even ab will show an error of “socket: Too many open files (24)” if you use -n 1050 and -c 1050 options.

Now, your testing tool can create more than 1024 connections. The next limit I ran into was that of connections on the httpd server. Even mpm_event uses thread per request (do not let the event name fool you). The default ubuntu apache2 mpm_event configuration allows for 150 concurrent connections:

Now turn up the slowhttptest numbers. Change the -c parameter to 15000 and the -r to 1500. It should take 10sec to ramp up the connections. In my use case I could not create that many connections so quickly. slowhttptest was maxing out a CPU core.

All of the above apache httpd config was done using the mpm_event processing module. The next issue I ran into was a case of mpm_worker not behaving as I expected. I have a doubly proxied system, because this is super real world where we route http things all over the place, sometimes in ways we shouldn’t but because we are lazy, or it is easier or… anyway…

In ubuntu/trusty with apache httpd 2.4.7 mpm_worker has a limit of 64 ThreadsPerChild even if you configure it with a larger number. There is no warning. You’d never know unless you take a look at the number of processes in a worker: $ ps -uwww-data -opid,ppid,nlwp The fix is to switch from mpm_worker to mpm_event.

tl;dr : Consider optimizing uwsgi by setting `threads-stacksize = 64` or some small value in your uwsgi config. Python apps which do not use many C modules do not use the C stack very much. A smaller stack size mean threads use less memory and you can safely have more of them servicing requests.

Long story:

Years ago I was deploying a new flask web service using uwsgi. I needed it to scale to thousands of connections. I read a blog post (I searched and cannot find it now) which suggested 10 processes with 10 threads each to be able to serve 100 concurrent connections. After testing and tuning this particular app, we settled on 10 processes and 100 threads per process. It ran well.

Recently, a production app, which I helped deploy, fell on its face. It was performing very poorly, seemingly out of nowhere. This app was originally deployed with the same 10 processes, 100 threads per process configuration which I had used so successfully in the past. The ops team had already reduced the process count to 4 due to excessive memory use of the application. This means the application was only able to service 400 concurrent connections.

I still cannot entirely explain why the app ran for many months and then suddenly had problems. I’m guessing it is because of recent announcements driving more traffic to the site. The 400 threads were actually being used instead of sitting idle waiting for connections.

In the process of trying to restore service, our ops team wisely used a tool which I was not likely to have used (huge thanks to them). The tool is pmap and it shows mapped memory for a given process. I noticed something interesting in the output of pmap:

This was repeated with the same memory increment 50 times for a total of 100. It occurred to me that the default stack size of a thread in Linux is 8MB and that these memory maps were the stack of each thread. I was able to confirm this suspicion by running the app myself and adjusting the size by configuring uwsgi with –threads-stacksize.

I started by moving to 1MB which I know is the default Windows thread stack size, guessing it would still be plenty. Then I started to play limbo and see how low can I go. I started to get pretty happy when I broke the 256KB mark and our app was still functioning. Our app has the luxury of not having any deep calls. I might have been able to go lower, but once I got to 64KB, I didn’t see my point. Every order of magnitude decrease was smaller and smaller an improvement.

Moving from 8MB to 1MB took memory usage from 3.2GB to 400MB. Every halving of stack size halved overall memory usage of the thread stacks by this app. First 512KB/thread for 200MB, then 256KB/thread for 100MB, then 128KB/thread for 50MB, then 64KB/thread for 25MB. At this point, everything about the app was running exactly the same, the only difference being that I wasn’t wasting 3.2GB of memory in unused thread stacks.