The digest calculation in the Portal "contenthandler" service can render content URL caching in Apache ineffective if various services such as WebSEAL (LTPA junctions) or Google Analytics insert cookies into the request stream. This blog entry presents a method for dealing with Apache caching in the presence of these cookies.

Introduction

WebSphere Portal now relies heavily on the "contenthandler" component to address and render portal resource requests. When you render a Portal page, you will see many requests with a request URL beginning with "/wps/contenthandler" or "/wps/mycontenthandler". These typically (but not necessarily) are resources required by the Portal theme.

Typically, the items returned are CSS files, JavaScript files, JSON files and other content that is static in nature. These items are also typically invariant between users. Given that, caching these responses is highly desirable. The cache that makes the most sense here is the web server (IHS or Apache) URL cache.

However, the contenthandler service inserts a "digest" calculation into the response to ensure that, if and when contenthandler responses should be unique to a user, those responses have different response URLs. That way, URL caching services such as Apache mod_disk_cache or Akamai return the correct response for an individual user. Some services, such as WebSEAL with the "LTPA junction" type, insert cookies into the request headers that erroneously force uniqueness on the Portal contenthandler responses unless appropriately dealt with.

WebSEAL LTPA Junctions

Once a user is authenticated to WebSEAL, WebSEAL does a Set-Cookie to the user's browser for a cookie whose name begins with "PD_STATEFUL", for example "PD_STATEFUL_00bcef52-0c5a-11e4-98a1-a224e2a50102=%2Fwasapp". When a request arrives at WebSEAL with this cookie, the cookie identifies the user associated with the request so that WebSEAL can insert an LTPA token into the request before it reaches Portal.

WebSphere Portal Digest Calculations

By default, Portal will calculate a "digest" for all contenthandler requests. The result of this calculation can be found in the Portal responses. Here is an example response URL for a contenthandler request:

/wps/contenthandler/!ut/p/digest!dIUs4TDXUuNN4g3szsXj1Q/mashup......

In this URL, the section after "digest!" is the digest calculated by Portal for this URL. By default, the digest calculation by Portal will exclude several cookies. This is controlled by settings in the "WP_ConfigService" resource environment provider in the WebSphere Application Server instance hosting Portal. The setting name is "cookie.ignore.regex". As you can guess from the name, this is a Java regex that excludes all the cookies it matches.
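Since the setting is an ordinary Java regex applied to cookie names, its behavior is easy to verify outside of Portal. Here is a minimal sketch; the regex value and cookie names below are illustrative, not necessarily your Portal's actual configuration:

```java
import java.util.regex.Pattern;

public class CookieIgnoreDemo {
    public static void main(String[] args) {
        // Illustrative value -- check WP_ConfigService for your real setting.
        Pattern ignore = Pattern.compile("LtpaToken|LtpaToken2|JSESSIONID|PD_STATEFUL.*");

        // Cookies matching the pattern are excluded from the digest:
        System.out.println(ignore.matcher("JSESSIONID").matches());                     // true
        System.out.println(ignore.matcher("PD_STATEFUL_00bcef52-0c5a-11e4").matches()); // true

        // Any other cookie still feeds the digest calculation:
        System.out.println(ignore.matcher("someAppCookie").matches());                  // false
    }
}
```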

By default, the exclusion list covers the LTPA, LTPA2 and JSESSIONID cookies. These cookies carry the SSO and session IDs of the user. To exclude other cookies from the digest calculation, append them to these cookies in the WP_ConfigService resource environment provider. So, to exclude the WebSEAL PD_STATEFUL cookies, add the following name/value pair to WP_ConfigService:
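Something along these lines; the value shown is a sketch (the exact spelling of the default cookie names can vary by Portal version, so append to the existing value rather than replacing it):

```
# Illustrative custom property for the WP_ConfigService resource
# environment provider in the WAS admin console:
cookie.ignore.regex = LtpaToken|LtpaToken2|JSESSIONID|PD_STATEFUL.*
```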

When I am helping customers with performance, there are three things that I tell them they must do relative to IHS:

1. Set the "Expires" and "Cache-Control" headers properly. Note that Portal does NOT set these headers on some types of content.
2. Set IHS to compress (gzip) content so that it travels the network efficiently.
3. Use mod_disk_cache to cache statics and off-load this work from the Portal servers.

Their next question is usually "how do I do this?"

So, I've attached a copy of the httpd.conf that I use in my cluster to do all three. If you use my copy, you will still need to adapt it to your specific setup, but it should get you started.

There is one line like this: "CacheEnable disk /". That simply means "cache everything". Normally, you would not want to cache anything that starts with "my", such as /wps/myportal or /wps/mycontenthandler. But Portal places a "Pragma: no-cache" header along with "Cache-Control: no-cache" on content that is not allowed to be cached.
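For reference, the three items boil down to a handful of httpd.conf directives. This is only a sketch, assuming IHS/Apache 2.2 module names; the paths and expiry times are illustrative and must be adapted to your environment:

```apache
LoadModule deflate_module    modules/mod_deflate.so
LoadModule expires_module    modules/mod_expires.so
LoadModule cache_module      modules/mod_cache.so
LoadModule disk_cache_module modules/mod_disk_cache.so

# 1. Expires/Cache-Control on static types Portal may not set them for
ExpiresActive On
ExpiresByType text/css               "access plus 1 day"
ExpiresByType image/gif              "access plus 1 day"
ExpiresByType application/javascript "access plus 1 day"

# 2. Compress text content on the wire
AddOutputFilterByType DEFLATE text/html text/css application/javascript

# 3. Cache everything; no-cache responses are skipped automatically
CacheEnable disk /
CacheRoot   /var/cache/apache/cacheroot
CacheDirLevels 2
CacheDirLength 1
```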

Another common question is "how do I know content is caching in IHS?" A simple trick is to place the following stanza in the LogFormat option: "Age in Cache: %{Age}o". This will print the age of the content in the cache (in the access_log). If the content is served from Portal, there is no "Age" header and thus you will see "Age in Cache: -". When the content is served from the IHS cache, you will see the "-" replaced by the age of the content in the IHS cache (in seconds).
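Put together, the logging change looks like this; the format name and the rest of the combined format are illustrative:

```apache
# Append the cached-content age to the access log. "-" means the
# response came from Portal; a number means it came from the cache.
LogFormat "%h %l %u %t \"%r\" %>s %b \"Age in Cache: %{Age}o\"" cachetrace
CustomLog logs/access_log cachetrace
```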

Note also that when using the reverse proxy cache (as I have done here), you would normally use htcacheclean to ensure that your cache doesn't get too large and that stale entries are periodically removed from the disk. I have put the following in a cron job to do just that:
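A crontab entry along these lines; the cache path and run time are assumptions, so substitute your own CacheRoot:

```
# Nightly at 02:00: -n runs nicely, -t removes empty directories,
# -p points at the CacheRoot, -l caps the cache size.
0 2 * * * /usr/sbin/htcacheclean -n -t -p /var/cache/apache/cacheroot -l 10M
```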

This runs every night to force the cache to occupy no more than 10Meg of space.

There was a bug in mod_cache whereby s-maxage was not honored when making caching decisions. However, IHS has a post-2.2.8 s-maxage fix (PK98225) that resolves this.

Performance: I recently read a blog article showing that when using Linux (ext3/4) as the filesystem for your IHS mod_disk_cache, you can see performance improvements by using "noatime" and "data=writeback" as fstab options. Linked here is the blog article.
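As a sketch, an /etc/fstab entry for a dedicated cache filesystem might look like this; the device and mount point are assumptions for illustration only:

```
# noatime avoids an inode write on every cache read;
# data=writeback relaxes ext3/4 journaling for this data-only volume.
/dev/sdb1  /var/cache/apache  ext4  noatime,data=writeback  0  2
```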

As I've discussed in the past, rendering a Portal page, especially in later versions of Portal, requires a lot of "statics". These statics are images, CSS files, JavaScript files, etc.

These statics are not security aware. In other words, a lot of these statics are delivered on the same URL regardless of your security rights. The statics are, generally, not considered secure content.

Given that CPU cycles are considered "expensive" on the Portal/WAS servers and cheaper on the IHS/Apache servers, it is very desirable to reverse proxy these statics on the IHS servers using the mod_cache facilities.

Prior to IHS version 7, the only choice available was the "mod_mem_cache" module. Mod_mem_cache provides an RFC 2616 compliant reverse proxy cache. IHS version 7 added support for the "mod_disk_cache" option in addition to mem_cache.

In short, the correct answer is to use mod_disk_cache. Let's look at some attributes of each type of cache.

mod_mem_cache:

1. Cache is "per process": Apache spawns processes to handle inbound HTTP(S) requests, and an instance of the mem_cache is created for each process. There is duplication of cache entries in this scenario (i.e. wasted memory).
2. Cache size limitations: Because of "1", the cache instances must necessarily be limited in size so as not to exhaust main memory. There are several mod_mem_cache directives to help limit the size of responses that can be stored in the cache.
3. Occasionally inefficient replacement algorithm: Because of "1" and "2" together, responses that are near the size limit allowed in the cache may force removal of responses better left in the cache.
4. Limited exposure to stale pages: Because the cache is limited in size and because it gets regenerated with each new process instantiation, there is very little chance of stale responses being in the cache.
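The size limits mentioned in item 2 are set with directives like these; the values are illustrative and would be tuned to your own memory budget:

```apache
<IfModule mod_mem_cache.c>
    MCacheSize           51200    # total cache per process, in KB
    MCacheMaxObjectCount 1000     # max number of cached responses
    MCacheMinObjectSize  1        # bytes
    MCacheMaxObjectSize  262144   # bytes; larger responses bypass the cache
</IfModule>
```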

mod_disk_cache

1. Cached responses are shared among all processes: There is one instance of the cache system-wide. Therefore, there is less wasted space in memory.
2. Disk_cache takes advantage of Unix/Linux file buffering: See the commentary below for discussion of this item.
3. Need to use a clean-up utility - htcacheclean: mod_disk_cache does not automatically clean stale items from the cache. This can result in wasted disk space. More troubling, some responses from the response owner (i.e. Portal/WAS) may not have proper cache-control headers indicating how long responses are allowed to live in proxy caches. Therefore, the cache can potentially return the wrong, stale version of a response. The htcacheclean utility must therefore be run periodically (via "cron", for example) to ensure stale responses are removed from the cache.
4. Need to allocate disk space: Since the responses are stored on disk, there is always the potential to exhaust disk space. As on all production Unix machines, monitoring policies need to be in place to ensure you don't let this happen.

When first considering which type of caching to use, most would immediately suggest mem_cache as the better option. From a performance perspective, serving from memory is obviously better than serving from disk. In reality, though, if you understand how Unix/Linux buffers file I/O, the benefits of disk_cache become apparent. Unix allocates unused portions of memory to buffer files as they are read. So, the initial read request starts reading the file into memory. Subsequent read requests for the same file are served from memory without even touching the disk. So, with the exception of the initial load, disk_cache performs as well as mem_cache, and there is only one instance of the response in memory as opposed to the "per process" duplication of mem_cache. Because memory utilization is more efficient, cache hit ratios can be much higher with disk_cache.