1) must be able to continue operating after node failure
2) logs must be recoverable after node failure - no data loss
3) must be able to scale out
4) must be transactional - when a message is logged, I need a guarantee that it is persisted to disk

This is similar to a previous question of mine, but I just realized the importance of the transactional feature. This is for a medical app; we cannot afford to lose any log messages.

Now this setup does not specifically handle automatic fail-over between multiple log servers. I personally didn't worry about it as each client sending logging data would queue the data on their end until the logging server was back up. And I had monitors in place that would notify me of the logging server being down.
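As a rough illustration of the client-side queueing described above, here is a minimal Python sketch. It is not the code I actually ran: the class name `SpoolingLogClient`, the spool file layout, and the `send` callable are all hypothetical, and it assumes that `send` raises `OSError` (for example `ConnectionError`) when the logging server is unreachable.

```python
import json
import os


class SpoolingLogClient:
    """Queues log messages on local disk until the log server accepts them."""

    def __init__(self, spool_path, send):
        self.spool_path = spool_path
        # Hypothetical transport: a callable that raises OSError when the server is down.
        self.send = send

    def log(self, message):
        # Append and fsync first, so the message survives a crash of this client.
        with open(self.spool_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(message) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.flush()

    def pending(self):
        """Return the messages that are still waiting on disk, oldest first."""
        if not os.path.exists(self.spool_path):
            return []
        with open(self.spool_path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    def flush(self):
        """Try to deliver queued messages in order; keep whatever was not sent."""
        queued = self.pending()
        sent = 0
        try:
            for message in queued:
                self.send(message)
                sent += 1
        except OSError:
            pass  # server unreachable: leave the rest queued for the next attempt
        self._rewrite(queued[sent:])
        return sent

    def _rewrite(self, remaining):
        # Write the remaining messages to a temp file, then swap it into place atomically.
        tmp_path = self.spool_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            for message in remaining:
                f.write(json.dumps(message) + "\n")
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, self.spool_path)
```

Note that this gives at-least-once delivery: if the client crashes after a successful `send` but before the spool is rewritten, the message is sent again on the next flush, so the server side would need to tolerate duplicates.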

If you already have a DB system with appropriate fail-over and high-availability in place, you could set up two logging servers and use a heartbeat system (perhaps linux-ha) to do automatic take-over of the IP from the live logging server.
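To make the heartbeat idea concrete, here is a small Python sketch of the decision the standby server has to make. This is only an illustration of the logic, not how linux-ha is implemented or configured; the class `StandbyMonitor`, its timeout parameter, and the injectable clock are all made up for the example, and the actual IP take-over is left as a comment.

```python
import time


class StandbyMonitor:
    """Decides when a standby log server should take over from the live one."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self.last_heartbeat = clock()
        self.active = False  # True once this standby has taken over

    def heartbeat_received(self):
        # Call this whenever a heartbeat arrives from the live server.
        self.last_heartbeat = self.clock()

    def check(self):
        """Return True if this standby is (now) acting as the live server."""
        silent_for = self.clock() - self.last_heartbeat
        if not self.active and silent_for > self.timeout_seconds:
            # A real setup would claim the shared IP address at this point.
            self.active = True
        return self.active
```

The standby would call `check()` on a timer. Once it has taken over it stays active even if heartbeats resume, so that the two servers do not fight over the IP; failing back would be a deliberate manual step in this sketch.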

Thanks, Brian. Sadly, these are Windows servers, so I can't run the rsyslog client on them. Otherwise, this would be a nice solution. I am thinking of using Flume (github.com/cloudera/flume) if I can build it for Windows.
– Jacko, Aug 3 '10 at 1:39

From what you described above, what you want is continuous computing. There are two kinds of software on Windows-based platforms that can provide what you are looking for. I'm not too familiar with any transactional logging applications, but with either of the HA/FT solutions below you could use just about any of them (as long as it runs on Windows).

Neverfail is an HA solution that protects your application from data loss. In the event of a server outage, failover between the two servers is seamless (it does not require human interaction): the passive server takes over operations, with the outcome depending on how much data was still in memory on the active server and had not yet been written to disk. This would provide you with close to 99.99% uptime.

Marathon is similar to Neverfail, but it adds component-level protection. With its FT feature, if a failure occurs on your server, such as a disk failure or even a network failure, your application keeps running, so there is no data loss or interruption to business.