Wednesday, June 3, 2009

apc futex_wait lockdown makes your apache freeze over

We had the typical LAMP setup going on, with Drupal as the base CMS and APC for bytecode cache. We needed a good caching engine, so I figured why not use APC's user cache. Well, we tried the APC Cache Drupal Module which, with minor fixes, proved to work very nicely. That is, until we actually put the whole thing into production.

The first symptom was Apache hanging and not responding to any user requests. We suspected network issues, especially since netstat -na showed that all the Apache processes were hanging on SYN_WAIT. However, since an Apache restart solved the issue, I started to suspect this was something else.
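For anyone chasing a similar hang, tallying socket states makes this kind of pile-up easy to spot. The snippet below is a minimal Python sketch, not something we ran at the time. It assumes the usual Linux netstat -na layout, where TCP rows start with "tcp" and the state is the last column. The function name count_tcp_states is made up for illustration.

```python
from collections import Counter


def count_tcp_states(netstat_output: str) -> Counter:
    """Tally TCP connection states from `netstat -na` style text."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # TCP rows look like: proto recv-q send-q local foreign state.
        # Header rows and UDP rows (which carry no state) are skipped.
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            states[fields[-1]] += 1
    return states
```

Feed it the captured text of netstat -na and look at which state dominates. A large count in a single handshake state while the server is idle is a hint that the workers are stuck rather than busy.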

To make a long (very long) story short, I got strace on our prod machines to find out that apache was either hanging on futex_lock(....FUTEX_WAIT...) or doing infinite loops on the same functions.
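If you save the strace output to a file, counting the futex waits is a quick way to tell the two cases apart. Again, this is only a rough Python sketch under assumptions: it expects strace's default one-call-per-line output, and count_futex_waits is a hypothetical helper name, not part of any tool.

```python
import re

# Matches a futex syscall whose operation is FUTEX_WAIT or a variant
# such as FUTEX_WAIT_PRIVATE, as strace prints it.
FUTEX_WAIT_RE = re.compile(r"\bfutex\(.*?\bFUTEX_WAIT")


def count_futex_waits(strace_output: str) -> int:
    """Count lines of strace output that show a futex wait call."""
    return sum(
        1 for line in strace_output.splitlines() if FUTEX_WAIT_RE.search(line)
    )
```

A process that is simply blocked will typically show one wait that never returns, while a process that loops on the lock will rack up a large count in a short capture from strace -p on its pid.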

To make an even longer story short, I got gdb installed on those machines, and the backtrace clearly indicated that the locks came from APC user-cache calls.

We decided to abandon the APC user-cache and switch to memcached, which proved faster and had fewer lockdowns.

The funny thing is that when we talked about this over dinner the same evening, a developer from another team pointed me to this article by one of the APC leads (or something), How to Dismantle an APC Bomb, which has been around for over a year. I am surprised and shocked that such information is hidden so well and not mentioned anywhere in the docs. Moreover, I went through the APC code again after reading this post (I went through it once when I started analysing the problem) and it seems that this is not even close to being resolved. There are no patches, no TODOs, nothing of the sort. From reading the code, the entire user-cache needs a major rewrite. What gives?

(This is a post I wrote a couple of months ago and never had time to finish. Unfortunately this is still not fixed afaik.)

EDIT (07/2009): http://pecl.php.net/bugs/bug.php?id=15179 reports this to be fixed. If anyone can confirm this, please send me a note so that I can update this post.

4 comments:

Seeing the same problem with stuck Apache processes. Just upgraded to APC 3.1, going to see what happens. No time to properly implement Memcache at the moment so I'm first trying this stop-gap solution. So far so good. I'll try to remember to report back in a couple of days :)

Interesting blog post. We are also experiencing similar problems using APC v3.0.9. I'm curious if anyone else had reported success on making the upgrade.

On a side note, we began by caching everything in memcached, but ran into problems caching very large objects b/c of the latency in the network connections between our servers. Therefore, we moved a lot of large static objects to APC and saw big improvements in reliability and performance. Funny that we moved in the opposite direction. :)

Hey Halbert, thanks for the note. I did not see anyone who got this problem fixed, but then it has been at least a year since I last checked the issue, and I am sure some progress has been made by someone (at least I hope so).

If you do happen to have stuff working properly I would be honored if you could report back here.

As for the memcached issue: we had the same issue and solved it by moving most of the large stuff to memcached running on localhost, so there is no network latency.