
Tutorial: How to Enable a "Maintenance Mode switch" for an API?

We at Apigee think of APIs as a digital channel that should be "always on", so that users of the API, whether they are employees, business partners, or customers, can always get access to the things the API exposes: product catalogs, data services, order processing, or whatever it may be.

The commercial Apigee Edge cloud service is guaranteed to deliver three-nines of availability (99.9% available), or four-nines if the organization uses multiple regions.

But the reality is that large, complex systems still go offline periodically for maintenance. Even if Apigee Edge is 99.99% available, somewhere in the distributed chain of systems behind it that implements the API, there are systems that get taken offline regularly for maintenance and upgrades.

During those times it would be *Really Handy* if the Operations team could easily change a setting in Edge to put the system into "maintenance mode". When in that mode, the API proxy, operating at the edge of the network, would not proxy requests into the configured backend API implementation, but rather would respond with a 503 status or similar, and a response payload stating that the system is in maintenance mode.

Hmmmm, what would the specific requirements be here?

Requirements

It should be easy to turn maintenance mode on or off, administratively.

Modifications to the maintenance-mode setting should take effect relatively quickly.

Checks for maintenance mode should not affect performance at scale.

It should be possible to restrict the scope of maintenance mode to particular proxies or products.

Not only is it possible to do this in Apigee Edge, it's drop-dead simple. Here's how.

Implementation

Apigee Edge includes a Key-Value Map (KVM). This is a persistent store of data that proxies can reference at runtime. It's perfect for storing a maintenance-mode setting: the key-value map can be changed administratively, and can be read by proxies.

I've built a sample proxy to illustrate how to use the KVM to implement a maintenance mode switch.

It works by relying on a key-value map with a single key, "maint_mode". The value is a string. If the value is "true", the proxy responds with a canned 503 response, sent via RaiseFault. If the value is not "true", the proxy responds as normal.
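As a rough sketch of how the pieces fit together (the map name, flow-variable name, and policy names here are illustrative, not necessarily those used in the sample proxy):

```xml
<!-- 1. Read the maintenance flag from the KVM into a flow variable.
     The map name "settings" and variable "maintmode" are illustrative. -->
<KeyValueMapOperations name="KVM-Get-MaintMode" mapIdentifier="settings">
  <Scope>environment</Scope>
  <Get assignTo="maintmode">
    <Key>
      <Parameter>maint_mode</Parameter>
    </Key>
  </Get>
</KeyValueMapOperations>

<!-- 2. Return a canned 503 when maintenance mode is on. -->
<RaiseFault name="RF-MaintenanceMode">
  <FaultResponse>
    <Set>
      <StatusCode>503</StatusCode>
      <ReasonPhrase>Service Unavailable</ReasonPhrase>
      <Payload contentType="application/json">{"error":"system is in maintenance mode"}</Payload>
    </Set>
  </FaultResponse>
</RaiseFault>

<!-- 3. Attach both in the proxy PreFlow; the fault fires only
     when the KVM value is the string "true". -->
<PreFlow>
  <Request>
    <Step><Name>KVM-Get-MaintMode</Name></Step>
    <Step>
      <Name>RF-MaintenanceMode</Name>
      <Condition>maintmode = "true"</Condition>
    </Step>
  </Request>
</PreFlow>
```

Because the RaiseFault step carries a Condition, normal traffic passes through to the configured backend untouched; only when the flag is "true" does the proxy short-circuit with the 503.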

It's a good idea. I tried to implement a similar "Lock Down" function using Spike Arrest: all proxies read a KVM variable and apply a Spike Arrest policy in the PreFlow, using the KVM variable as the Rate value. If we want to bring the entire platform down (for example, if our back-end is under strain), we can just flick the KVM value to 0ps.
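That lock-down variant might be sketched like this (again, the map name, key, and variable names are illustrative):

```xml
<!-- Read the platform-wide rate from the KVM into a flow variable. -->
<KeyValueMapOperations name="KVM-Get-Rate" mapIdentifier="settings">
  <Scope>environment</Scope>
  <Get assignTo="kvm.rate">
    <Key>
      <Parameter>platform_rate</Parameter>
    </Key>
  </Get>
</KeyValueMapOperations>

<!-- Spike Arrest takes its Rate from that variable; flipping the
     KVM entry to "0ps" effectively rejects all traffic. -->
<SpikeArrest name="SA-LockDown">
  <Rate ref="kvm.rate"/>
</SpikeArrest>
```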

However...

I'm finding the caching of KVMs within Apigee too unpredictable. The KVM data seems to be cached in the message processors, and there doesn't seem to be a way to force a refresh. So if you do flick it into maintenance mode, it will be an indeterminate amount of time before the change is replicated to the message processors, even if the Edge UI is showing the new value. This is even worse when you want to switch it back, because you have the business screaming at you.

Unless I have missed something, and there's a way to clear the MP cache?

There is no way to explicitly clear the cache used for KVM entries. You have to wait for the entries to expire from the cache. (Not quite true - you can use a KVM-Put policy, as described here).

There IS a way for you to tell Apigee Edge how long to cache the entry. When you configure the KVM policy with a GET operation, there is an optional ExpiryTimeInSecs element you can specify in the policy configuration.
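For example (the map name and assigned variable here are illustrative; the ExpiryTimeInSecs element and key1 are from the discussion above):

```xml
<KeyValueMapOperations name="KVM-Get-Key1" mapIdentifier="settings">
  <!-- Cache the retrieved value on the message processor for 60 seconds. -->
  <ExpiryTimeInSecs>60</ExpiryTimeInSecs>
  <Get assignTo="myvar">
    <Key>
      <Parameter>key1</Parameter>
    </Key>
  </Get>
</KeyValueMapOperations>
```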

This tells Apigee Edge to cache the data it reads for key1 for 60 seconds. You could call it the "time to live", or TTL.

I believe that if you use a value of zero or less than zero (or omit the element entirely, which implies -1), Apigee uses the default cache TTL, which is 300 seconds. (The Apigee doc says -1 implies "never expire", but according to my tests this is incorrect. I will ask to have the docs updated...)

So what should you set the expiry to? That depends. If your API volume is low, or if you don't care whether an API proxy call takes an extra 3-4 ms, then set it to 10 seconds, or lower if you like.

If the request volume on your API is high, then you should take a little more care in choosing. The KVM store is backed by Cassandra. A low TTL on a high-volume API means Cassandra will stay pretty busy satisfying reads for the KVM Get. Cassandra is "pretty fast", probably under 3 ms for a read, but that may be a cost you'd like to avoid if possible.