Ehcache replicated caching with JMS, AWS, SQS, SNS & Nevado

Recently I’ve been researching ways to GSLB a very large app that relies on Ehcache for numerous things internally, such as cached page output, cached lists, etc. Anyone who has experience getting large-scale applications to function active-N in a GSLB’d setup knows the challenges such topologies present. The typical challenge is how to bridge events that occur in a locally clustered application in one data center (DC-A) with another logical instance footprint of the same application living in DC-B. This extends from the data-source all the way up the stack.

So for example, let’s say user A is accessing the application and hitting instances of it residing in DC-A. This user updates some inventory data that is cached locally in the cluster in DC-A; the same inventory data is also cached in the cluster running in DC-B (which is being accessed by different users in some other region). When user A updates this inventory data, the local application instance writes it to the data-source and then does some sort of cache operation, such as a cache key remove or put (replace). Forgetting the entirely separate issue of how the data-source write itself becomes visible across both DCs, the point is that the cache update in DC-A is visible only to participating instances in DC-A. DC-B’s cache knows nothing of this; only its data-source is aware of the new information. So we need a way to make DC-B aware of this cache event. There are a few ways this can happen: for example, we could just configure the caches to rely solely on LRU/TTL-driven expiry, or we could actually respond to events in a near-real-time fashion.

Now before we go on, I’ll state up-front that although what I am about to describe would work (to an extent), ultimately I likely will NOT go with this setup due to the inefficiencies involved, particularly the amount of data being moved across WANs if you use the Ehcache JMS replicated caching feature alone (i.e. with that feature the cached data itself is moved around, rather than just operation events).

With that caveat out of the way, one thing I started looking at was the Ehcache JMS Replicated Caching feature. Basically this feature permits you to configure any cache to utilize JMS (Java Message Service) for publishing cache events. Ehcache wires up a cache listener that responds to PUT/REMOVE operations and relays these events (including the cached data on puts) to a JMS topic. Any other Ehcache node configured w/ this same setup can then subscribe to that topic and receive those events. Couple this with a globally accessible messaging system and you have a backbone for distributing these events across multiple disparate data-centers... but who in the hell wants to set up their own globally accessible, fault-tolerant messaging system? Not me.
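To make the flow concrete, here is a minimal sketch of copy-based replication using only the JDK. It is NOT Ehcache or JMS code: `Topic`, `CacheNode` and `CacheEvent` are hypothetical names, and the in-memory `Topic` is just a stand-in for a real JMS topic. The thing to notice is that a PUT event carries the cached value itself.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

/** Toy model of copy-based cache replication; all names here are hypothetical. */
class ReplicationSketch {

    enum Action { PUT, REMOVE }

    /** Event relayed to peers; on PUT it carries the cached value itself. */
    static final class CacheEvent {
        final String originNode;
        final Action action;
        final String key;
        final String value; // null for REMOVE

        CacheEvent(String originNode, Action action, String key, String value) {
            this.originNode = originNode;
            this.action = action;
            this.key = key;
            this.value = value;
        }
    }

    /** In-memory stand-in for a JMS topic: every subscriber sees every event. */
    static final class Topic {
        private final List<CacheNode> subscribers = new CopyOnWriteArrayList<>();

        void subscribe(CacheNode node) { subscribers.add(node); }

        void publish(CacheEvent event) {
            for (CacheNode node : subscribers) {
                node.onMessage(event);
            }
        }
    }

    /** One cache instance: publishes its own changes, applies remote ones. */
    static final class CacheNode {
        private final String name;
        private final Topic topic;
        private final Map<String, String> store = new ConcurrentHashMap<>();

        CacheNode(String name, Topic topic) {
            this.name = name;
            this.topic = topic;
            topic.subscribe(this);
        }

        void put(String key, String value) {
            store.put(key, value);
            topic.publish(new CacheEvent(name, Action.PUT, key, value));
        }

        void remove(String key) {
            store.remove(key);
            topic.publish(new CacheEvent(name, Action.REMOVE, key, null));
        }

        String get(String key) { return store.get(key); }

        void onMessage(CacheEvent event) {
            if (event.originNode.equals(name)) {
                return; // ignore events we published ourselves
            }
            if (event.action == Action.PUT) {
                store.put(event.key, event.value);
            } else {
                store.remove(event.key);
            }
        }
    }
}
```

Create two `CacheNode`s on the same `Topic`, call `put` on one, and the other ends up holding the same value. Across a WAN, that value is exactly the payload you are paying to ship around.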

Note the code is at the end of this post, and yes, the code is very basic and NOT production ready/tested; it was just for a prototype/research and is a bit hacked together. Also note this code is reliant upon this PATCH to Nevado, which is pending discussion/approval.

The first step was creating an Ehcache CacheManagerPeerProviderFactory (NevadoJMSCacheManagerPeerProviderFactory), which returns a JMSCacheManagerPeerProvider to Ehcache that is configured to use Nevado on the backend.

The NevadoJMSCacheManagerPeerProviderFactory boots a little Spring context that sets up the NevadoTopic etc.

I then created a little test program (below), EhcacheNevadoJMSTest. I ran several instances of this concurrently w/ breakpoints to validate that events in one JVM/Ehcache instance were indeed being broadcast over JMS -> AWS -> back to other Ehcache instances in other JVMs.

The first thing I noticed was that while the events were indeed being sent over JMS to AWS and received by other Ehcache peers, the actual cached data (Element) embedded within the JMSEventMessage was NOT being included, resulting in NullPointerExceptions in the Ehcache peers that received the event.

Once I created a patch for Nevado to use ObjectOutputStream, things worked perfectly.
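For illustration only (this is NOT the Nevado patch), here is a small sketch of the general technique: round-tripping a Serializable payload through ObjectOutputStream so the embedded object survives the trip. SQS message bodies are text, so the sketch Base64-encodes the bytes; that encoding choice and the `PayloadCodec` name are my assumptions, not something taken from Nevado.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Base64;

/** Hypothetical helper: Java serialization to and from a text message body. */
class PayloadCodec {

    /** Serializes the payload and returns it as Base64 text. */
    static String encode(Serializable payload) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    /** Decodes Base64 text and deserializes the object it contains. */
    static Object decode(String body) {
        byte[] raw = Base64.getDecoder().decode(body);
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("payload class not on classpath", e);
        }
    }
}
```

Anything passed to `encode` has to be Serializable all the way down, and the receiving JVM needs the same classes on its classpath. In real code you would also want to think hard about deserializing data that arrives from outside the JVM.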

CAVEATS

Again, this code was for research/prototyping.

Moving the actual cached element up to AWS, across WANs, and back down to other data-centers is likely far from optimal. It would work, but under high volume you could spend a lot of $$ and bandwidth.

Ideally, all one really cares about is “what happened”, meaning Ehcache KEY-A was PUT or REMOVED etc. Then let the receiving DC decide what to do (i.e. remove the cached KEY locally and let the next user-driven request re-populate it from the primary source, the real data-source). This results in much smaller message sizes. The latter is what I’m now looking at; using the Ehcache listener framework w/ some custom calls to SNS/SQS should suffice for this kind of implementation.
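A rough sketch of that “what happened” approach, again JDK-only and with hypothetical names (`InvalidationEvent`, `LocalCache`). It does not use the Ehcache listener API or SNS/SQS. It just shows the receiving side: a key-only message arrives, the local entry is dropped, and the next read re-populates it from the data-source.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Toy model of key-only invalidation; all names here are hypothetical. */
class InvalidationSketch {

    /** Small message: says which key changed, never carries the value. */
    static final class InvalidationEvent {
        final String cacheName;
        final String key;

        InvalidationEvent(String cacheName, String key) {
            this.cacheName = cacheName;
            this.key = key;
        }
    }

    /** Read-through cache for one data center. */
    static final class LocalCache {
        private final String cacheName;
        private final Function<String, String> dataSource;
        private final Map<String, String> store = new ConcurrentHashMap<>();

        LocalCache(String cacheName, Function<String, String> dataSource) {
            this.cacheName = cacheName;
            this.dataSource = dataSource;
        }

        /** Returns the cached value, loading it from the data-source on a miss. */
        String get(String key) {
            return store.computeIfAbsent(key, dataSource);
        }

        /** Called when an event from another DC arrives: just evict the key. */
        void onInvalidation(InvalidationEvent event) {
            if (cacheName.equals(event.cacheName)) {
                store.remove(event.key);
            }
        }
    }
}
```

The message size here is bounded by the cache name plus the key, no matter how big the cached value is. The trade-off is that each invalidation costs the receiving DC one extra data-source read on the next request for that key.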