Apache mod_deflate and mod_cache issues

The Problem: Using Apache mod_deflate and mod_disk_cache (or other mod_cache) together can create far too many cached files.

The Background: Apache is a web server with many different modules you can load in to enhance it. Two common ones are mod_deflate and mod_cache (or mod_disk_cache).

Mod_deflate compresses content that the web server sends to the client, using gzip. It can take 100k of HTML, CSS, or JavaScript and compress it down to ~10k before transmitting it to the user’s browser. The browser then uncompresses it and displays the page. Most web servers (depending on how your site/application is structured, anyway) are not CPU limited. Therefore, you can spend some extra CPU doing the compression and get much faster content delivery times to your users, who are often bandwidth limited. Not only does this make pages load faster for your users, it also allows request-handling threads to complete sooner, letting your web server handle more requests.

Some web browsers are not able to handle gzipped content correctly, so it’s important to add logic that only sends gzipped content to browsers that can handle it. Also, some types of files, such as images and video, are already compressed, so trying to gzip them is a waste of time and resources.

“For files under /”
“Compress them”
“Unless it’s Netscape 4.x, then only compress text/html files”
“Or, if it’s Netscape 4.06-4.08, then don’t compress any files”
“But if it’s IE, don’t compress any files” – NOTE: this is different from the common version you see floating around, which turns compression back on for IE. If you are loading content from a Flash swf within IE 6, that content can’t be compressed, even though IE 6 handles it fine. Flash doesn’t for some reason. So this setting is safer. If you aren’t using Flash, feel free to change this.
“but if it’s IE7, undo the no compression settings we made before, activating compression”
“but don’t compress already compressed files like images and video”
“Set the response Vary header to User-Agent so that any upstream caching or proxying won’t cache the wrong version and send a compressed version to a browser which can’t handle it, or an uncompressed version to a browser that should have gotten the compressed file”
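In Apache configuration terms, the rules above come out roughly like this. This is a sketch based on the common mod_deflate recipes (the file-extension list and regexes are illustrative, not a definitive set):

```apache
<Location />
    # For files under /, compress them
    SetOutputFilter DEFLATE

    # Netscape 4.x mishandles compressed CSS/JS: only compress text/html
    BrowserMatch ^Mozilla/4 gzip-only-text/html

    # Netscape 4.06-4.08 can't handle compression at all
    BrowserMatch ^Mozilla/4\.0[678] no-gzip

    # IE also identifies itself as "Mozilla/4"; leave it uncompressed
    # entirely (content loaded by Flash inside IE 6 can't be compressed)
    BrowserMatch \bMSIE no-gzip

    # IE 7 is fine: undo the no-compression setting above
    BrowserMatch "MSIE 7" !no-gzip

    # Don't waste CPU on already-compressed content
    SetEnvIfNoCase Request_URI \.(?:gif|jpe?g|png|swf|flv|zip|gz)$ no-gzip dont-vary

    # Tell upstream caches/proxies the response varies by User-Agent
    Header append Vary User-Agent env=!dont-vary
</Location>
```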

Confused yet? :)

Mod_disk_cache allows you to specify various files to be cached on the web server, lets you set a cache expiration time, etc… It’s of great value when those files are being served out of a web application rather than coming from the local disk. For instance, if Apache is serving files from an ATG instance, mod_disk_cache lets the web server cache images, css, js, videos, etc… from your WAR. There’s also a memory-based cache, mod_mem_cache, but it’s more trouble than it’s worth, and you can trust the Linux kernel to cache recently accessed files in memory anyhow.
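A minimal disk-cache setup looks something like this (a sketch using Apache 2.2 directive names; the path and expiry values are illustrative):

```apache
# Cache responses for / on local disk
CacheEnable disk /
CacheRoot /var/cache/www
CacheDirLevels 3
CacheDirLength 1

# Serve cached copies for up to an hour when no expiry is given,
# and never longer than a day regardless of origin headers
CacheDefaultExpire 3600
CacheMaxExpire 86400
```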

Got it?

So this is where it gets tricky.

If a response has a Vary header set, mod_disk_cache will cache a different version of that file for each value of the Header that Vary references.

So for a file compressed as above, there will be a different cached version for each User-Agent. In theory this means that browsers which support gzip-compressed content will get the compressed content, and browsers which don’t will get the uncompressed version.
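For example, a response cached this way carries headers along these lines (illustrative):

```http
HTTP/1.1 200 OK
Content-Type: application/javascript
Content-Encoding: gzip
Vary: User-Agent
```

Mod_disk_cache then stores one variant per distinct value of the header Vary names – here, the full User-Agent string.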

In practice, due to the endless tiny variations in full User-Agent strings, you end up with thousands of copies of the same file in your cache. On a disk cache only a few hours old, there were over 4,400 cached copies of the same javascript file, each with a slightly different User-Agent string, even though fewer than 10 actual browser types were represented.

This is a problem for several reasons. Firstly, you end up using far more disk space than you really need. Secondly, you negate the kernel’s in-memory file caching: with 4,000+ versions of the single file being accessed, the kernel can’t simply keep the two different files (compressed and uncompressed) in memory. Thirdly, you make cleaning out the cache much slower, since you have to delete these thousands of extra files and their containing directories.

20 Comments

Thanks for posting this. I was noticing very high iowait times on my server (community website), which turned out to be due to htcacheclean not being able to keep up with all the subdirectories that are being produced as a result of the combination of mod_deflate and mod_cache (I run a caching reverse proxy with mod_perl backend). Investigating, it soon became clear that there were many more directory sublevels than I had specified with CacheDirLevels (3), all because of that Vary header – which your post explained.

I have now disabled mod_deflate, since I think this is perhaps not so relevant today – most people have faster connections and probably won’t notice the difference between 20KB and 80KB of text. I expected the server to use more bandwidth, but so far that doesn’t seem to be happening in practice (according to munin, anyway).

It took more than 3 hours to delete the old cache, which had climbed to more than 10GB (htcacheclean just couldn’t keep up with the growth). This is on a quad core Opteron with 4x10k SCSI disks in RAID0. After I enabled the new cache without mod_deflate, things seemed much more sane – only the 3 levels of directory I specified with CacheDirLevels, and du now only takes a second or less to traverse the 1 GB. Htcacheclean can now keep up with the -i -n options, and the iowait on my server has dropped to almost zero. Before, htcacheclean was running all the time to keep up, and failing.

So the lesson seems to be that if you are running a dynamic website which can produce lots of files, then maybe avoid using mod_deflate in combination with mod_disk_cache!

@Neil: I’m glad I’m not the only one. I saw big issues with htcacheclean. I’ve taken your lead and turned off mod_deflate on a couple of sites. I love the concept, but the mod_disk_cache was providing a bigger performance improvement, and if I can’t run them both together….

I am back to square one currently, due to the fact that on my community website, I have to enable the Vary header for cookies. If I don’t do that, then people get the wrong type of page – their cookie contains options for things like ad display, pic size and more, and when people go to the home page, this needs to be taken into account. So mod_disk_cache will only store one copy of it if there is no Vary header (based on cookie). If I turn on the Vary header for cookies, then I get the right caching behavior, but we’re back again to the insane number of subdirectories.
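For reference, the Vary-on-cookie behavior described here comes down to a single header, whether set by a directive like this or emitted by the application itself (a sketch):

```apache
# Ask caches to store a separate variant per distinct Cookie header
Header append Vary Cookie
```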

It’s bad – I just had a new server built, one which is pretty well specced – quad core Intel Xeon with an 8-way RAID card and 8 10k Raptor SATA disks as RAID10. This is a pretty fast machine. But after a few days of running with the cache in Vary mode, it takes a good few minutes just to do a du on the cache directory. I had wondered before if the delay was because of a problem with the old server, but here’s a brand new, very fast server, and it’s having real issues even scanning that number of subdirs. Very depressing.

I am currently trying to run htcacheclean without the -n “nice” option (I think perhaps it never catches up if the server is busy, which it is most of the time). I also turned on the -t option to delete empty directories, in the hope that will help the traversal time. Finally, I decreased the limit from 1000MB to 100MB. Currently the cache is at 3.6GB according to du. I’ll let it run for a while in the new config (-d60) and see if it is able to bring that down. If not, then I’m going to have to start talking to the Apache devs, because this thing is clearly kind of broken. I’m really surprised, given how Apache is supposed to be the premier webserver on the planet.

@Neil: I think I know what site you are referring to. If so, then no to my session cookies question; however, you have about 55,296,000 possible cookie values (as far as my basic math goes), which could make your cache very, very large:) Although I’d guess the majority of your users don’t have a cookie set.

Correct me if I’m wrong, but you’re only setting the Vary header on the html pages, so all the media should be getting cached just once, which is good.

So are you basically seeing lots of folks with custom cookied options causing multiple cachings of the various html pages?

How did you arrive at the number 55,296,000 for possible cookie values? The site I am talking about here is http://www.crazyguyonabike.com, and sister site(s) topicwise.com. All the software on there was implemented by me (at the app level anyway) and so is fully controllable by me. Basically there are two possible cookies – opt (for options) and id (for user logins). I don’t set automatic session cookies.

I think the problem here is simply that mod_disk_cache wants to create all those subdirs, for every variation on a request. I haven’t looked at their code, but I assume they are creating a hash of the URL and using that to key off where the cached file goes. The trouble comes when they then want to do another three layers of subdir below the initial levels. It’s not scalable, as I’m finding out. My iowait is going through the roof at the moment.

One option I’m considering is to simply ditch htcacheclean altogether. Let the cache grow. At midnight, I already rotate the server logs, briefly bringing the webserver down in the process (for a matter of seconds). What I could do in addition is to rotate the cache dir. So when the server is down, move /var/cache/www/* to /var/cache/trash/. Then once the server is back up, do a delete of all the files on /var/cache/trash. I have to do this asynchronously, since doing *anything* with this type of cache takes a lot of time.
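The rotate-then-purge idea sketched above could look something like this. The demo paths stand in for /var/cache/www and /var/cache/trash, the demo content stands in for a populated cache, and the server stop/start steps are shown only as comments, since they depend on your setup:

```shell
#!/bin/sh
# Sketch of the nightly "rotate the cache dir, delete it later" idea.
CACHE=${CACHE:-./demo/www-cache}
TRASH=${TRASH:-./demo/www-trash}

# Demo content standing in for a populated cache
mkdir -p "$CACHE/a/b" && touch "$CACHE/a/b/entry.data"

# ... web server briefly stopped here (e.g. apachectl stop) ...
mkdir -p "$TRASH"
mv "$CACHE" "$TRASH/cache.$$"   # a cheap rename on the same filesystem
mkdir -p "$CACHE"               # fresh, empty cache for the server
# ... web server started again (e.g. apachectl start) ...

# The slow part, deleting thousands of dirs, runs in the background
nice rm -rf "$TRASH" &
wait
```

The key point is that the move is a rename, so the downtime window stays tiny, while the expensive recursive delete happens after the server is back up.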

Alternatively, I might just go back to switching off the Vary header for cookies again, and tell my users that they should use a bookmark which has a special ‘o’ parameter – all generated links on my site incorporate a special ‘o’ param which embodies the user’s current options. This was done originally to try to get around broken web caches which couldn’t distinguish requests by cookie. So with a different URL, people with different options would generate pages containing slightly different links. If people bookmark http://www.crazyguyonabike.com/?o=Fd12, then that should be cached differently from http://www.crazyguyonabike.com/?o=OIPLx. In theory, anyway – I’ve noticed on occasion that even with the param, I seem to get pages which were obviously generated for a different cookie. So I don’t know if the mod_disk_cache hashing algorithm maybe isn’t as robust as it could be (i.e. seems to be getting conflicts).

I’m honestly annoyed to be having to work around this crap. I would have thought apache would have a better disk cache that doesn’t make your iowait explode if you try to use it in the way it was intended.

p.s. Sorry, I didn’t really address some of your questions… I need to cache all generated html content which isn’t already user-specific. So the home page definitely has to be cached, since one of the purposes of using the cache at all is to be able to resist popular sites like slashdot, reddit, digg etc linking to my site, and having thousands of people all clicking on the same link at the same time. Having that content cached gives you orders of magnitude more requests per second than if everything has to come from mod_perl.

I don’t cache images or any other non-mod_perl content in mod_disk_cache – not much point in that. I have over 75GB of images, it wouldn’t make any sense to store those again in the cache. They are already served directly from my front end reverse proxy apache (I use two custom builds, one lightweight front end and the heavyweight back end mod_perl).

Setting the dir_levels to 2 would result in more files per directory. I’m not sure how that would work in terms of performance; I just went with what they recommended on the apache site – I assumed that they had done some tests and knew that this was empirically the best configuration. However now I’m starting to question how much they know about this stuff in reality, so I may go and try a different dir_level value like you suggest.

Thanks! I can hardly believe you and I are the only ones experiencing this issue. Maybe not that many people think this much about their website performance. I guess most sites just run the heavy app server on the front end and serve custom pages to everybody. Maybe that’s why so many of those sites seem to go down so hard when they get unexpected traffic…

@Neil: I got the 55 million number by multiplying the option value counts, which should give the maximum number of possible cookie values for the opt cookie.

The mv and purge later plan isn’t a bad one. If you try it, let me know. Or you could just not do any cleanup.

I’ve found that a more shallow dir structure performs much better, even though it means more files per dir. I don’t know about you, but I was ending up with very few files in the actual end-node directories, but there were a ton of directories. The Apache recommendations don’t seem to apply if you’re using any sort of Vary header. Reducing the directory tree size by a few orders of magnitude only led to a hundred or so files per dir, which is very manageable. So that might be worth a try.

Another thing you could do, if you’re mostly worried about a slashdotting, is to only cache the index page, and let the rest be served from perl. That would keep your cache very small, and solve your primary concern. Or you could cache the index page via mod_mem_cache or something.

I mostly use mod_disk_cache to keep my J2EE apps from having to serve out static files from the war repeatedly, so our uses are a bit different.

How do you know what the total number of possible option values is for my website? Did you try setting the option and then peek at the cookie? Even so, that doesn’t cover the number of possible Vary values, since there is also the id cookie which is set for users – and that can have a huge number of permutations since it’s stored as an encrypted hash digest (i.e. long string of gibberish). So you’d need to multiply the opts permutations by the id permutations – and even then, that only gives you the number of possible versions of a single file. There are hundreds of thousands of dynamically generated pages on the site. It wouldn’t work to just have the caching on the home page (to handle slashdottings) since they can link to any interesting page – e.g. a particularly funny pic in a journal, or a day page that talks about something interesting etc etc ad infinitum. It’s really endless, so I need to cache all dynamically generated content, really.

Anyway, I tried setting CacheDirLevels to 2 rather than 3, and already I think I see an improvement. I do have more files now in each of the level 2 subdirs, but at present it only seems to be of the order of 10 or so, which is nothing much – you only see problems when you go to tens of thousands of files in a single directory, as I recall. Even then, you could enable dir_index for the partition to get the hash tree for filename lookup (but I don’t want to do that since it tends to require dodgy multiple reboots with e2fscks and I’ve read that it can cause problems on some systems).

As it stands now, I have the following for htcacheclean:

htcacheclean -i -t -n -d60 -p/var/cache/www -l1000M

And this morning, I can do a du -hs on the cache directory and it seems to complete in a few seconds (much better). I’ll have to retest randomly, since it’s possible that the directory tree just happened to be mostly in cache – it’s much more telling to do it cold, so the disks actually have to be read. It’s interesting that htcacheclean doesn’t seem to take into account the full size of the directories, since du -hs gives me a size of 1.9GB, almost twice the requested size for htcacheclean. I guess htcacheclean maybe only looks at the actual filesize of the content, and just ignores the size of the directories themselves (which are files, and take up some space). Maybe it also doesn’t take into account the header files, I dunno.

I thought about increasing CacheDirLength to 2, because that would result in an even broader and more shallow directory structure. But looking at the characters it seems to use for the current directories, a-z, A-Z, 0-9, _ and @, then that would give us 64^2 = 4096 possible subdirs in each directory level. I think it’s probably better the way it is right now – something on the order of tens of files is certainly going to be better than thousands in terms of lookup performance.

I do still see little spikes in iowait (using munin to monitor), but I’ll leave this for a day or two to see how it pans out over time. Maybe just removing one level of directories (CacheDirLevels from 3 to 2) is enough to make the difference here – after all, it reduces the exponent by one, cutting the possible number of leaf subdirectories by a factor of 64. We’ll see…
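A quick check of that arithmetic, assuming the 64 possible characters per path component noted above:

```shell
# Subdirectory counts for the disk cache's hashed directory tree:
# each path component draws from 64 characters (a-z, A-Z, 0-9, _, @)
echo "CacheDirLength 2 would mean $((64 * 64)) possible subdirs per level"
echo "CacheDirLevels 2 gives $((64 * 64)) possible leaf directories"
echo "CacheDirLevels 3 gives $((64 * 64 * 64)) possible leaf directories"
```

So dropping one level really does shrink the worst-case tree by a factor of 64, from 262,144 possible leaf directories to 4,096.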

@Neil: I just went to your options page and counted up the various options. It looked like each option was given a numeric value and concatenated together for the cookie value? Anyhow, not a big deal:)

[…] can read more about this configuration, plus some issues that arise when using mod_disk_cache and mod_deflate together here. You can get around this by only gzipping html output from your JSPs, which you won’t be caching […]

[…] will use all available RAM to cache those files in memory for rapid serving. If you plan on using mod_gzip and mod_disk_cache together, please read my post on the issues encountered using them toget…. Share and […]

Anyway, I was wondering if you have found a solution to this. We have an apache (2.2.11)/mod_jk/jboss (4.2.2) setup wherein we want non-user-specific xml files (like http://www.tennisearth.com/widget/displayLiveScores.htm) requested from jboss to be cached for 10 seconds. To account for the fact that the last-modified time of these files might be only a few milliseconds in the past, we had even set CacheLastModifiedFactor to 10000.
I guess there is some other header that we should ignore that should do the trick. We do not want to remove mod_deflate.

Having run into this problem recently, we have worked with the Apache folks, and by the looks of it the result is to move mod_cache from a quick handler to a normal content handler. See http://www.gossamer-threads.com/lists/apache/dev/374883 for more details. It looks like this is going into Apache 2.3, and a backport to 2.2 is being planned.

I tried this and seem to have a conflict in the configuration. Without CacheEnable, my CSS files correctly get gzipped by Apache. But with CacheEnable (using disk cache), Apache seems to ignore the “Accept-Encoding: gzip, deflate” and respond with an uncompressed version of the CSS file (I presume from cache). I have “CacheIgnoreHeaders Accept Accept-Language Accept-Charset Cache-Control User-Agent Cookie Host Referer” so I think it should incorporate the “Accept-Encoding” information into the cache hash. Please – what am I missing?