Technology, Startups and Strategy

This is part two in my caching series. Part one covered the concept behind full page caching as well as potential problems to keep in mind. This part focuses on implementing the concept in actual PHP code. By the end of it you’ll have a working implementation that can cache full pages and invalidate them intelligently when an update happens.

Requirements

I’ll provide a fully functional framework with the simple application I used to get my benchmark figures. You’ll need the following software to be able to run it.

Nginx. I’m not sure which exact version but I generally use and recommend the latest development version.

PHP 5.3.0 or later. I recommend at least 5.3.3, which bundles PHP-FPM for your FastCGI process management.

MySQL

Memcached

The Framework

I’ll be referencing the code on github instead of pasting it in this post to keep the size down, so you will probably want to download it.

The framework uses a three-tiered setup like most of the popular frameworks. It consists of controllers, libraries and views.

A controller is what handles the flow of the request. It parses the input provided and decides on what action to take. Only one controller will ever be loaded during a request.

The libraries handle the brunt of the work. They’re usually the ones to access the database and generate the actual data for the controller to handle. Several libraries might be used during a request.

Views hold the template logic. They’re not run through a template parser and use plain PHP for their logic.

The index.php file handles the routing. There are a few settings there, but nothing really interesting for this blog post. The only thing you need to know if you want to mess around with the sample application is that URIs map directly to files. There is no manual routing available.

The Caching Logic

Just so we’re on the same page, the goal here is to define a way of invalidating cached pages that use stale data. Since the cached pages are served directly we have to invalidate cached pages when the data is being changed. So before we begin the implementation of this we’ll need a few concepts to help us keep things straight.

Cache keys. This is how pages will be identified in the cache. Since the framework uses a direct URI to controller mapping, it makes sense to use the URI as the cache key. So if I refer to the URI or the cached page, I mean the key under which it’s cached.

DataKeys. These are essentially identifiers for data. The goal is to prevent stale data so we obviously need a way to identify and reference the data we’ll be working with.

With the cache keys and dataKeys concepts defined we can now begin to implement the invalidation logic. For this we need to track the data and establish a relation between data and cache keys. As we established in part one, there might be multiple controllers using the same data, so we need to map what data every controller uses. Furthermore, we need each controller to report which cache keys it generates so that we can invalidate them.

This is where cachetracker.php comes in, you can find it in the core directory. All caching logic is handled by this file. If you look at the top of it you’ll see an interface called ControllerCacheable. Every controller which handles cached data needs to implement this interface.

ControllerCacheable defines two methods, dataKeyReads() and dataKeyInvalidates(). The former handles mapping data to controllers and the latter handles mapping data to cache keys.

dataKeyReads() should return an array of the dataKeys a controller will read from. This allows us to easily iterate over every controller and generate a dependency mapping of data -> controller.

dataKeyInvalidates() accepts the dataKey to invalidate and an optional payload (an example follows in the sample application section). When given a dataKey, this method should return an array of the cache keys that use this dataKey. These cache keys will then be invalidated.
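To make the contract concrete, here is a minimal sketch of what such an interface and one implementing controller could look like. The real definition lives in cachetracker.php and may differ: the static methods, the AboutController class, the 'staff' dataKey and the /about/ URI are all assumptions made for illustration.

```php
<?php
// Sketch only: the real interface is in core/cachetracker.php and may differ.
interface ControllerCacheable
{
    /** @return string[] dataKeys this controller reads from */
    public static function dataKeyReads();

    /**
     * @param string $dataKey dataKey whose data changed
     * @param mixed  $payload optional detail about the change
     * @return string[] cache keys (URIs) to invalidate
     */
    public static function dataKeyInvalidates($dataKey, $payload = null);
}

// Hypothetical controller, used only for illustration.
class AboutController implements ControllerCacheable
{
    public static function dataKeyReads()
    {
        return array('staff');
    }

    public static function dataKeyInvalidates($dataKey, $payload = null)
    {
        // This controller renders a single page, so one cache key is enough.
        return $dataKey === 'staff' ? array('/about/') : array();
    }
}

print_r(AboutController::dataKeyInvalidates('staff'));
```

Declaring the methods static is a design guess: it would let the tracker query a controller without having to instantiate it outside of a request.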

The CacheTracker generates the dependency mapping in the getDataKeysAccessors() method. It iterates through the controllers directory and calls dataKeyReads() on every controller that implements the ControllerCacheable interface. After covering all the cacheable controllers, the mapping list is stored in a file, ‘deplist.txt’, in the root directory relative to index.php. Please note that if you change the dataKeys a controller uses you’ll have to delete this file so that it’ll be regenerated.
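The mapping step can be sketched roughly as follows. This is not the code from getDataKeysAccessors(): the function name buildDependencyMap() and both controller classes are hypothetical, the interface shape is assumed, and the class list is passed in instead of being discovered by scanning the controllers directory, so that the example runs on its own.

```php
<?php
// Assumed shape of the interface; see core/cachetracker.php for the real one.
interface ControllerCacheable
{
    public static function dataKeyReads();
    public static function dataKeyInvalidates($dataKey, $payload = null);
}

// Two hypothetical controllers; only the first one is cacheable.
class NewsController implements ControllerCacheable
{
    public static function dataKeyReads()
    {
        return array('news', 'comments');
    }

    public static function dataKeyInvalidates($dataKey, $payload = null)
    {
        return array('/news/');
    }
}

class AdminController
{
}

/**
 * Build a dataKey => [controller class, ...] map from a list of class names.
 */
function buildDependencyMap(array $controllerClasses)
{
    $map = array();
    foreach ($controllerClasses as $class) {
        if (!in_array('ControllerCacheable', class_implements($class), true)) {
            continue; // not cacheable, nothing to track
        }
        foreach ($class::dataKeyReads() as $dataKey) {
            $map[$dataKey][] = $class;
        }
    }
    return $map;
}

$map = buildDependencyMap(array('NewsController', 'AdminController'));
print_r($map);
```

Keying the map by dataKey means a later invalidation only needs a single array lookup to find the controllers it has to ask.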

The second method of interest in the CacheTracker is triggerDataKeyInvalidation(). This is the method to call whenever data has been changed. It checks the dependency mapping list and calls dataKeyInvalidates() on the controllers which use the dataKey. At this point we’ve essentially managed to get the cache keys used by every controller which uses the piece of data we’ve just updated. Time to see how this translates into a real world example.
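The invalidation step might look something like the sketch below. It assumes a dependency map shaped as dataKey => list of controller classes, uses a plain array as a stand-in for Memcached, and invents the collectCacheKeys() helper and both controllers purely for illustration. The actual triggerDataKeyInvalidation() may be structured differently.

```php
<?php
// Hypothetical controllers exposing the invalidation method described above.
class NewsController
{
    public static function dataKeyInvalidates($dataKey, $payload = null)
    {
        return array('/news/');
    }
}

class FrontpageController
{
    public static function dataKeyInvalidates($dataKey, $payload = null)
    {
        return array('/', '/news/'); // overlaps with NewsController on purpose
    }
}

/**
 * Collect the cache keys to drop for one changed dataKey.
 * $map is a dataKey => [controller class, ...] dependency map.
 */
function collectCacheKeys(array $map, $dataKey, $payload = null)
{
    $keys = array();
    $controllers = isset($map[$dataKey]) ? $map[$dataKey] : array();
    foreach ($controllers as $class) {
        $keys = array_merge($keys, $class::dataKeyInvalidates($dataKey, $payload));
    }
    // Several controllers may report the same key; delete each one only once.
    return array_values(array_unique($keys));
}

// Stand-in for Memcached: cache key => rendered page.
$cache = array(
    '/'       => '<html>front</html>',
    '/news/'  => '<html>news</html>',
    '/about/' => '<html>about</html>',
);
$map = array('news' => array('NewsController', 'FrontpageController'));

foreach (collectCacheKeys($map, 'news') as $key) {
    unset($cache[$key]); // the real tracker would delete the key from Memcached
}

print_r(array_keys($cache));
```

Only the /about/ page survives in this run, since neither controller reports it as depending on the news dataKey.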

The Sample Application

The application I’ve included in the download is quite simple, as it’s intended to showcase the concept only. It’s not a valid measurement of how fast a real world application would be. With that out of the way, have a look at the news.php controller. It’s got everything a news script really requires: news and comments! The actual news and comments implementation is not overly interesting, so scroll to the bottom of the file and check out the methods defined by our ControllerCacheable interface.

DataKeyReads() defines an array with elements news and comments. These are the dataKeys that this entire controller deals with.

DataKeyInvalidates() converts a DataKey into the cache keys pages are stored under. The code pretty much speaks for itself but I do want to point out the use of $payload as this is a good example of how the payload information can be used to pinpoint the exact cache keys to invalidate. Without it we would have had to invalidate all the news posts.
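As a rough illustration of the payload idea (not the code from news.php), the sketch below assumes the payload is an array carrying the id of the affected news post and that posts live under /news/view/<id>/. Both the payload shape and the URIs are assumptions.

```php
<?php
// Hypothetical version of the method; URIs and payload fields are assumptions.
class NewsController
{
    public static function dataKeyInvalidates($dataKey, $payload = null)
    {
        $id = isset($payload['id']) ? (int) $payload['id'] : null;

        switch ($dataKey) {
            case 'news':
                // The listing always changes; the post page only if we know which one.
                $keys = array('/news/');
                if ($id !== null) {
                    $keys[] = "/news/view/$id/";
                }
                return $keys;

            case 'comments':
                // A comment only affects the page of the post it belongs to.
                // Without an id, a real implementation would have to fall back
                // to invalidating every post page; this sketch returns nothing.
                return $id !== null ? array("/news/view/$id/") : array();

            default:
                return array();
        }
    }
}

print_r(NewsController::dataKeyInvalidates('comments', array('id' => 42)));
```

With the id in the payload, a new comment on post 42 drops exactly one cached page instead of every post page.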

Next in the sample application is the news library. It’s located in the cachetest folder under libraries. The interesting part here is the call to CacheTracker::triggerDataKeyInvalidation() whenever the library changes the data.
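The pattern in the library boils down to "change the data, then report the change". A minimal sketch, assuming triggerDataKeyInvalidation() takes a dataKey and an optional payload: the CacheTracker below is only a stub that records calls, and NewsLibrary with its addComment() method is a hypothetical stand-in that keeps comments in an array rather than in MySQL.

```php
<?php
// Stub standing in for the real CacheTracker; it only records what was triggered.
class CacheTracker
{
    public static $triggered = array();

    public static function triggerDataKeyInvalidation($dataKey, $payload = null)
    {
        self::$triggered[] = array($dataKey, $payload);
    }
}

// Hypothetical library; an array replaces the MySQL table.
class NewsLibrary
{
    private $comments = array();

    public function addComment($newsId, $text)
    {
        // 1. Change the data.
        $this->comments[$newsId][] = $text;

        // 2. Report the change, passing the post id as payload.
        CacheTracker::triggerDataKeyInvalidation('comments', array('id' => $newsId));
    }
}

$news = new NewsLibrary();
$news->addComment(42, 'First!');
print_r(CacheTracker::$triggered);
```

Keeping the trigger inside the library rather than the controller means every code path that writes the data also invalidates the pages built from it.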

If you want to try out the sample application you need to configure a few things first. Inside the includes directory there is a config.php file. The various configuration options should speak for themselves. There is also an .sql file in the root which contains the table definitions and some sample data.

The Nginx Configuration

The final part of the puzzle is to make sure nginx serves the cached pages instead of sending them to PHP. The configuration is as follows:
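The listing below is a minimal sketch of such a configuration rather than the exact one used for the benchmarks. The upstream addresses, the server name, the root path, the "cachetest:" namespace and the keepalive connection count are all assumptions; the @nocache location and the backend upstream name follow the snippet quoted in the comments.

```nginx
upstream memcached_cache {
    server 127.0.0.1:11211;
    keepalive 32;               # assumed value; remove this line if keepalive is not compiled in
}

upstream backend {
    server 127.0.0.1:9000;      # assumed PHP-FPM address
}

server {
    listen 80;
    server_name example.com;    # placeholder
    root /var/www/cachetest;    # placeholder

    location @nocache {
        include fastcgi_params;
        fastcgi_param SCRIPT_FILENAME $document_root/index.php;
        fastcgi_pass backend;
    }

    location / {
        default_type text/html;
        # Must match the key PHP stores the page under: namespace plus URI.
        set $memcached_key "cachetest:$uri";
        memcached_pass memcached_cache;
        # 404 is a cache miss; 405 is what the memcached module answers to
        # non-GET requests such as form posts.
        error_page 404 405 = @nocache;
    }
}
```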

The caching part is towards the end. We set the memcached key, which is the namespace plus the URI, and if we get a 404 Not Found we pass the request to the FastCGI backend instead. It’s really that simple on the nginx end. The only thing to note is that I’m using the Upstream Keepalive 3rd party module for Memcached keepalives, which removes some of the overhead of connecting to Memcached. If you do not have this module compiled in you can simply remove the keepalive line from the upstream block.

Limitations

I covered this aspect in part 1 but I feel it’s worth pointing out again. The method I’ve used in this framework allows for easy mapping of data->controller->cache-key relations, but only in cases where the cache key is predictable based on the data. In my daily usage I find that the large majority of my applications have a predictable relationship between data and the cached pages. However, there are common situations where it’s simply impossible to avoid stale data.

The most obvious example of this is if you have built-in search. Say someone searches for Platypus on your blog about animals; since the Platypus is an awesome animal, you have a lot of posts about it. The URI (and thus our cache key) for this search page is /search/platypus/. Now if you add another article about the Platypus or edit an existing one, your search results will be outdated. You can obviously invalidate /search/platypus/, but what about the URI /search/duck/ or /search/australia/, which also return your Platypus articles? Suddenly mapping the relations becomes downright impossible.

I haven’t really been able to think of a way to actually solve this problem other than simply accepting stale data and doing TTL caching, or accepting the performance hit and not caching at all. Thankfully, with the use of tools like Sphinx or another dedicated search daemon, the performance and scalability should be easy enough to handle.

Another obvious limitation is, of course, truly dynamic data. If your page contains the string “Welcome $username!” then you might not want to cache that page unless everyone goes by the same username, naturally. There are ways to handle caching of dynamic data as well, though. Edge Side Includes is one such way and I plan to play around with that myself and possibly write a part 3 in this series. Until then I’d love to hear how useful you think this method of cache invalidation is.

On a side note, what do you guys actually think of the framework as a whole? It’s written to be really lightweight, provide options for streamlining the development process but otherwise stay out of your way. It does create some limitations such as no manual routing, but do people actually ever need this instead of just straight file to URI mapping?

Yoba

Posted: September 17, 2012

I think the problem is that he writes on a level that assumes basic knowledge of Nginx.

To use an analogy, imagine an advanced class in playing the piano, and Prof. Mozart is explaining the toughest trick of how to play six keyboards simultaneously while blindfolded. When he finishes the lecture and asks the class if there are any questions, one student who was taking the most notes asks "Professor, you never explained what the black keys on the piano are used for."

Hello, I am trying to do some caching using nginx for an IIS server running ASP applications. Do you know anyone who has implemented this? Is full page caching a good idea in this case? I am not very optimistic, because I know nginx is more PHP friendly.
Thank you

Hi! Nginx is very backend agnostic. It does not care what's in the back and it's only PHP friendly because it's FastCGI friendly. However, nginx is not very Windows friendly with quite a few known limitations documented here: http://nginx.org/en/docs/windows.html

Trying to use nginx as a caching proxy on Windows is probably not a good idea. So to do this you'd need to run a Linux server with just nginx and then HTTP proxy to your IIS server. In this scenario full page caching can certainly work so long as your backend is coded to support it.

bytetrain

Posted: September 30, 2014

In order for the above mentioned configuration to work I had to include the fastcgi params as follows:

location @nocache {
    include fastcgi_params; # this line must be there for php to kick in
    fastcgi_pass backend;
}

Actually it just needs to be added somewhere in a higher context. I prefer to include fastcgi.conf in the http context so I don't need it in multiple server blocks or even location blocks, which is also why it's not in my config.