But if you read over the GAE tutorials/guides for their various service APIs (Memcache, Mail, Datastore, etc.), they all reiterate that you should always code for the possibility that one of their services is down. GAE even provides a CapabilitiesService that you can check before calling any service method to see whether that service is currently enabled.

So I ask: is there ever a chance that a JUL logging operation will fail?

logger.info("Can I ever fail and not get logged?");

If not, why? And if so, what can I do to "failover" in the case that JUL has choked? Thanks in advance.

3 Answers

I've run into this same problem, and yes, the logging service can fail without errors. The best you're going to get (until GAE improves the logging service API) is to schedule a cron job that wakes up, say, every minute and performs a logger.info(...).

Then run a LoggingService#fetchLogs(...), filtered to only retrieve the AppLogLine containing the most recent logger call, and check to make sure you can retrieve it. If you can't, then the logger.info(...) failed, and you can have your app react however you like.
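A minimal sketch of that write-then-read-back check, in case it helps. Everything named here (LogCanary, InMemoryLogStore, the logContains lookup) is hypothetical and not part of any GAE API: on App Engine the lookup would wrap whatever log-fetching call you use, and the in-memory handler is only there so the sketch runs locally. It also ignores the delay between writing a line and being able to fetch it, which a real check would probably need to allow for.

```java
import java.util.List;
import java.util.UUID;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Predicate;
import java.util.logging.Handler;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

/** Hypothetical canary: logs a uniquely tagged line, then checks it can be read back. */
final class LogCanary {
    private final Logger logger;
    // Assumption: returns true if the stored logs contain the given marker.
    // On GAE this would be backed by the log-fetching call described above.
    private final Predicate<String> logContains;

    LogCanary(Logger logger, Predicate<String> logContains) {
        this.logger = logger;
        this.logContains = logContains;
    }

    /** Returns true if the canary line was written and could be found again. */
    boolean check() {
        // A unique marker so an older canary line can never satisfy this check.
        String marker = "log-canary-" + UUID.randomUUID();
        logger.info(marker);
        return logContains.test(marker);
    }
}

/** In-memory stand-in for the log store, only for trying the sketch locally. */
final class InMemoryLogStore extends Handler {
    private final List<String> lines = new CopyOnWriteArrayList<>();

    @Override
    public void publish(LogRecord record) {
        lines.add(record.getMessage());
    }

    @Override
    public void flush() { }

    @Override
    public void close() { }

    boolean contains(String marker) {
        return lines.stream().anyMatch(line -> line.contains(marker));
    }
}
```

The cron handler would call check() and, on false, do whatever "failover" means for your app (for example, write the failure to the Datastore or send a mail), since at that point logging the failure is exactly what you can't rely on.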

I always expose a secure servlet on my GAE apps that pings the Capabilities Service and asks for a status check on each service. If the service is disabled or down for maintenance, I have an external monitor (that checks this URL every 5 mins) send me a text message. You can tie this "log checking" cron job into that kind of a service check.
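Roughly, the logic behind that status URL could look like the sketch below. StatusReport and its method names are made up for illustration; each registered check would wrap a Capabilities Service query or the log check described above, and the servlet would just render the result for the external monitor.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

/** Hypothetical aggregator behind a status URL: runs each named check and records the result. */
final class StatusReport {
    private final Map<String, BooleanSupplier> checks = new LinkedHashMap<>();

    /** Registers a named check; returns this so calls can be chained. */
    StatusReport register(String name, BooleanSupplier check) {
        checks.put(name, check);
        return this;
    }

    /** Runs every check in registration order; a check that throws counts as failed. */
    Map<String, Boolean> run() {
        Map<String, Boolean> results = new LinkedHashMap<>();
        for (Map.Entry<String, BooleanSupplier> entry : checks.entrySet()) {
            boolean ok;
            try {
                ok = entry.getValue().getAsBoolean();
            } catch (RuntimeException ex) {
                ok = false;
            }
            results.put(entry.getKey(), ok);
        }
        return results;
    }

    /** True only if no check reported a failure. */
    static boolean allHealthy(Map<String, Boolean> results) {
        return !results.containsValue(false);
    }
}
```

Treating an exception as "down" keeps one broken check from taking out the whole status page, which matters when the thing being monitored is the platform itself.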

Thanks @Alexander Pogrebynak (+1) - however, your answer doesn't address a few things. First, the GAE docs make no mention of these 4 possibilities (or any others that could arise), which tells me that their engineers might have somehow accounted for them and that a GAE developer does not need to worry about them. Second, the other half of my question was "how can I failover when JUL chokes?" It doesn't look like GAE provides a way to check for a fail condition. Thanks again.
– IAmYourFaja, Feb 4 '13 at 22:37

In most systems the Uptime is 100% minus the summation of the downtime of
all other systems. The exception to this rule is logging. When Logging
fails to record the downtime, Uptime goes up. As a result Google has been
working hard to build a logging system that goes down just ahead of all
other systems, and comes up shortly after.