Fixing WooCommerce + Memcached performance

Just want the summary and the fix? Scroll down to the tl;dr at the bottom of this post.

Update 25-02-2014: Mike Jolley from WooThemes added a fix to the upcoming WooCommerce version, making the maximum expire on long-lived transients 30 days, fixing the issue entirely in WooCommerce. The issue below is still relevant, though I have yet to find other plugins heavily leaning on longer-than-30-day transients.

We recently ran into a performance problem with a client of ours we had just moved over to WP Engine.

After we solved the original problem, WP Engine alerted us that there was an issue with WooCommerce on the site. It was causing 2,500+ database queries on category pages.

You read that right: 2,500+ queries.

WooCommerce is one the most popular WordPress plugins out there, powering over 600,000 sites. So surely WP Engine must be right in thinking that there is something horribly wrong with our setup!

It turned out there was a problem affecting all websites that use WooCommerce with Memcached. In this post, I’ll explain the problem and how I found it. To make this post more accessible, especially if you are less familiar with caching concepts, I added extra explanations in grey.

Starting the investigation

Before I checked what queries were causing the problems, I started by looking for any systemic failures: if a page is making that many queries, odds are that the object cache is broken.

To get the right content to display on every page, WordPress gets the content and all your options from a MySQL database. Getting the data out (a query) is considered a slow operation, so developers will always try to avoid making unnecessary queries. A common strategy is to cache the data: if you have to get the same information twice during a single page load, it’s better to keep the results in your cache than to ask it from the database twice. To improve this mechanism, persistent caches are used. These keep frequently needed information cached also between requests. Examples of persistent caches are Memcache, APC or Redis.

Using Debug Bar Console, I first made sure that the object cache was still working by storing a value, querying it immediately, and then making another request to query it:

All seemed fine there. I did the same with transients.

Transients are a way to temporarily store information that takes time to generate. This can be information downloaded from an external website, or something that takes a lot of time to compute – for example the results of faceted search queries. These are normally stored in the database, but if you have a persistent object cache, these are stored in the cache instead.

Finding the culprit with SAVEQUERIES

To monitor the queries made on your site, you can tell WordPress to log your queries by enabling SAVEQUERIES in wp-config.php:

[php]
define( ‘SAVEQUERIES’, true ); [/php]

With this enabled, Debug Bar will show all the queries made in detail.

This seemed strange: WooCommerce would never run so many queries uncached. So I opened up the WooCommerce source code and had a look at the problematic lines. And sure enough, the terms are cached in a transient. The transient should be valid for a whole year even!

Again, I turned to debug bar (have I mentioned how awesome debug bar is?) to check if keys starting with wc_ln_count_ were being added to the cache. It turns out, wc_ln_count_ is gotten from the cache, but never added. This is what you’d expect when the cache is working properly. Not when you have 2,500+ queries on your page.

I tried out some of the cache keys above in debug bar, and confirmed that the value did not get stored in the cache.

I rewrote the problematic part of the function for debugging, and it confirmed the behaviour.

Though it felt like I was getting nowhere, I was getting there! I confirmed that:

Transients are working

The object cache is working

The transient in WC_Widget_Layered_Nav isn’t stored

So I needed to find out what the difference is between my test transient and the transient in the layered nav widget.

I already spoiled what the problem turned out to be at the beginning of this post, but before I got there I tried checking if key was maybe too long, or if the type of object made a difference. Of course, these tests came back negative.

The last thing I tried was taking away the YEAR_IN_SECONDS expiry argument. And (surprise, surprise) the transient worked as it should.

Figuring out why expiring transients weren’t stored

Now that I knew that there was a problem with the $expire parameter in the transient API on my hosting setup, I started looking into the Memcached object cache implementation of WP Engine, which is pretty much the one from the Memcached Object Cache plugin.

What happens in the plugin is quite straight forward: it instantiates a Memcache object with a connection to your Memcached server and provides an API from the WordPress object cache to Memcached.

$expire is passed through directly to the Memcache::set() and Memcache::add() methods, so I opened up the php.net documentation to find out what is going on.

And there it was, plain and clear:

expire: Expiration time of the item. If it’s equal to zero, the item will never expire. You can also use Unix timestamp or a number of seconds starting from current time, but in the latter case the number of seconds may not exceed 2592000 (30 days).

Just to double check, I opened the Codex to see what WordPress expects for the $expire parameter:

$expire: (optional) this defines how many seconds to keep the cache for. Only applicable to some functions. Defaults to 0 (as long as possible).

So WordPress always treats $expire as time to live (TTL).

The world makes sense again! YEAR_IN_SECONDS (31536000) is treated as a Unix timestamp, which is seconds since the 1st of January 1970. So the Memcached server thought the WooCommerce transients should expire the 1st of January 1971. As a good piece of software, it did what is was told, and forgot the data yesterday.

Most of the time explicit expiry times are much less than 30 days in the future, so this problem won’t occur often, and will be hard to notice. However, WooCommerce stores transients for a year regularly throughout the plugin. With such a far future expiry date, you can assume that the programmers of WooCommerce figured it’s very important these transients stay saved for a long time. Yet, it is exactly that data that is forgotten immediately. See the irony there?

Writing a fix for the Memcached plugin

We know this:

WordPress always treats $expire as time to live

Memcache treats $expire as a timestamp when it’s more than 30 days in seconds

The obvious fix is to always pass Unix timestamps instead of TTL to Memcache. We can safely do this by simply adding the current timestamp to $expire.

In the WP_Object_Cache::set() method of the Memcache Object Cache plugin, I added a line that always adds the current timestamp to $expire. That way, the behaviour is consistent, whether $expire is 5 minutes away, or 10 years:

And that fixed it. The queries went down from 2,529 to 150 on the category pages. Still a lot, but at least in a sane range. Woohoo!

I also added the fix to the WP_Object_Cache::add() method and set the default expiry in the constructor. You can find the fixed version of the plugin at the bottom of this post.

Publishing the fix

I checked the newer Memcached Redux plugin for WordPress, which uses the Memcached class, and found that it suffered from exactly the same problem. The PHP.net docs are also a bit more explicit here.

I have contacted the authors of the plugin, as well as WP Engine (who install their own modified version of the plugin automatically) to get this fix into their respective plugins.

You can find my patched versions at the bottom of this page.

Just to make sure: this problem is not caused by WooCommerce. WooCommerce is playing by the books and doing things exactly how WordPress expects them to do it cache-wise. (On a side-note, they are doing a pretty awesome job in general) However, many sites are running Memcached and upgrading the object-cache.php file is not as easy as updating a plugin. This is why I have also opened a pull request with WooCommerce to patch this problem for users that are using Memcached.

tl;dr: The expiry parameter in MemCached is treated as a Unix timestamp instead of a TTL if it’s higher than 30 days in seconds.

In WordPress the transient and object cache API always assume the expiry parameter is TTL, no matter how high it is. So all those values with a YEAR_IN_SECONDS expiry are only stored until new year’s eve 1971. So not at all. Patched versions of the Memcached plugins are here:

Are you using the script that is optimized for WP Engine (the last button on this page)? The other ones won’t work and could quite likely lead to an error.

To figure out what is causing the white screen, you can turn on error reporting. Open up wp-config.php and add define( "WP_DEBUG", true );. When you reload the page, you should see the error, which will point you in the right direction.

Feel free to send me a PM at mike (at) remaintenance.io if you can’t find the source of the problem!

This is an interesting find, and I realise it is old news now and hopefully not affecting people who have kept up-to-date.

It raises two question to me: firstly, why is the application even trying to do 2500 queries in the first place? That aside, are the number of queries listed, the queries that are *actually run against the database*, and not those that are caught and taken from the cache first? Should our aim for performance be to reduce the number of queries listed, meaning that more of them are coming from the cache? From a developer’s POV avoiding 2500 queries in the first place is key, but from a site settup perspective I would just like to understand better what the listed number of queries means practically.

WooCommerce was very quick to get the problem fixed (see the commit here).

The queries were actually run against the database, as WP doesn’t cache the queries themselves (this wouldn’t make much sense anyway). The data returned should have been cached, but because of the problem described in the post, this didn’t happen. All in all, it just shows how easily a side-effect like this can creep in!