The Curious Incident of the Java Timezones

7 AM
May 30, 2006

I finished reading The Curious Incident of the Dog in the Night-time less than twenty-four hours ago, so it was a pleasant surprise when, this morning, I was presented with my very own puzzle-mystery to solve. Better still for one who has just read “The Curious Incident”, the players in this puzzle-mystery are not people, but computer programs!

Chapter 0. The Scene Set

We have a production process, logsrv, that takes log records from IBM MQ, interprets each one and then inserts it into a database table. A second process, publisher, provides a SOAP web service that reads the database when requested by clients.

This is all written in Java and deployed to Windows. logsrv and publisher run as Windows services (using JavaService), and the database is Microsoft SQLServer. While it is a little complex – and there are good reasons for that, trust me – this system is rock-solid stable. The publisher process, for instance, has a record up-time of 89 days, with no sign of memory leaks or performance. (I think it could run for a year or longer, but unfortunately had to take it down recently, to put in a one-line code change. Oh well.)

Chapter 1. The Morning Ruined

All was going fine, until this morning, when I got The Email That Changed The Course Of My Day. Log messages shown at the client were timestamped for ten hours later than they should have been.

If you’re a Java developer, and you live on the east coast of Australia, then that ten hours ought to make you suspicious. Ten hours is the time difference between Australian Eastern Standard Time and UTC.

Chapter 2. The Psychic Database

So the Simon The Sys Admin and I had a look in the database and found A Clue. The database records both WHEN_OCCURRED and WHEN_RECORDED. WHEN_OCCURRED is the timestamp from the record that taken from MQ, while WHEN_RECORDED is the time that the record was received at the logsrv process. According to the database, the logs were being RECORDED ten hours before they OCCURRED.

We checked the Windows system time and timezone, and found they were set correctly. A quick check of the log4j debug output confirmed that the logsrv process was ten hours behind normal time.

So now we had two problems. First, logsrv had the wrong timezone, and second, the web service was publishing the WHEN_RECORDED timestamp instead of the WHEN_OCCURRED timestamp.1

In accordance with the ancient and mystic precepts of Issue Resolution, I asked the Simon The Sys Admin to stop and restart logsrv. This did not correct the issue. The clients still showed the wrong timestamp. Again in accordance with the precepts, the sysadmin elected to Reboot The Server. This time, when the server came up, the clients began to show correct timestamps once more.

Feeling a little better about it all, I wrote an email to the IT manager, reporting that all was now well, and went hunting through the publisher code for this showing-the-wrong-timestamp bug.

Chapter 3. All Mixed Up

Funny. I couldn’t find a bug in the publisher code. The publisher code correctly ignored the WHEN_RECORDED value, only using the WHEN_OCCURRED value.

There was only place where the two values could have been confused: in Hibernate. Now, this wasn’t quite as silly as it sounds. The timestamps had been stored using a Hibernate UserType, something which is not as common as straight data types. We were also using an older version (2.1.6), and hadn’t upgraded. I downloaded 2.1.8 and read through the release notes, but there was nothing appropriate there.

I decided it was time for a cuppa. While waiting for the jug to boil, I realised that I was grasping at straws, and decided to put this particular theory away and see if I could think of something a little more likely.

Chapter 4. Log4J Shows The Way

Just then, a good thing happenned. Simon The Sys Admin emailed me the log4j debug output for the whole day so far, plus some of yesterday. Looking at the logsrv output, I could see that, after restarting the service, it had the wrong idea of system time. However, after the reboot, logsrv suddenly had the right time. Curious. How the service was started should have no effect on the timezone.

I happenned to have a copy of the SQLServer database on hand – one that I’d made only yesterday. A quick couple of queries showed that logsrv only started to put the wrong WHEN_RECORDED timestamp starting yesterday at – doh! – approximately the time I’d manually restarted the logsrv service, breaking the 89 day uptime record.

Still, the logsrv was working fine, apart from putting in the wrong WHEN_RECORDED date.

Then it occurred to me that the publisher likely had the same problem. Sure enough, I had had to restart it yesterday too, and after the restart it had the wrong timezone.

Fortunately, these applications dump the contents of System.properties() to log4j on startup. I could see right there on every start up the value of the user.timezone property. It was set to GMT whenever the service was restarted manually, and Australia/Sydney after the reboot. GMT is what Java defaults to when it can’t determine the timezone, that explained the ten hour difference.

The simple fix is to configure the logsrv and publisher services to pass -Duser.timezone=Australia/Sydney to java.exe on startup. Easy. I can do that next week, since I’ll be deploying a new release then anyway.

Production issue resolved, I went back to my development task.

Chapter 5. The Mystery Resolved

However, something about this just didn’t make sense. Sure, it works fine when all the processes have the correct timezone, but still I wondered what exactly was causing the client to show the wrong times when the timezone is wrong. logsrv was inserting the correct timestamp into the WHEN_OCCURRED column of the database table. publisher was not supposed to be modifying any dates or times at all.

I nutted it out while I was at the shops. This is what happens when the timezone in the logsrv and publisher is wrongly set to GMT:

The MQ record holds the correct “occurred time”, without timezone information.

Being a Java process, and using Java Date objects, logsrv needs to know the timezone that applies to any given point in time. Since there is no explicit timezone, it interprets the “occurred time” as a GMT time.

logsrv then writes the occurred time to the SQLServer database.

Like most databases, SQLServer is not aware of the timezone of date and time data. It accepts the date from logsrv verbatim, so WHEN_OCCURRED ends up with exactly the same time as was in the MQ record. So far so good.

publisher then reads the data from the WHEN_OCCURRED column, assuming it to be a GMT time.

publisher then constructs a SOAP message to send to the client. It marshals the “occurred time” as an XMLSchema dateTime. XMLSchema dateTimes include the timezone.

The client then reads the SOAP message, interprets the occurred time as a GMT time, but displaying it as a local time, hence it appears ten hours into the future. Q.E.D.

Now I’m much happier, because I know that if I have the correct timezone on logsrv, publisher and client, this problem is derinitely solved. I also have a better understanding of how this part of the application will behave if the timezones ever get of sync in future.

——

1 Observant readers will note that I was reasoning backwards. WHEN_RECORDED was certainly off by ten hours, but it was ten hours early, whereas the reported symptom was that times where ten hours late. I’m kind of glad I didn’t notice this, though, as then I would have been really stumped, and may not have persevered past rebooting the PC.