Effective log management in the cloud with Splunk

In the early days of CloudMagic development, we realized that traditional log management techniques would not work in the cloud. We were generating a huge volume of log entries across several servers, and debugging was turning out to be a pain because the information was scattered. There was no way to know when a server component had encountered an error.

We thought about what goals our log management tool should meet and came up with the following list:

Custom log patterns:

The tool should watch logs from all components in our stack and notify us of any erratic behaviour. Our stack includes a number of third-party applications, each logging in its own format. Hence we required the capability to extract information from custom log patterns.
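As a rough illustration of what extracting information from custom log patterns involves, the sketch below keeps one regular expression per component and normalizes every match into the same set of fields. The component names, log formats and field names here are hypothetical and chosen only for the example; they are not CloudMagic's actual formats.

```python
import re

# Hypothetical patterns: one regular expression per component, each using
# named groups so that every format yields the same set of fields.
PATTERNS = {
    "webserver": re.compile(
        r"^\[(?P<timestamp>[^\]]+)\] \[(?P<level>\w+)\] (?P<message>.*)$"
    ),
    "indexer": re.compile(
        r"^(?P<timestamp>\S+ \S+) (?P<level>[A-Z]+): (?P<message>.*)$"
    ),
}


def parse_line(component, line):
    """Return a dict of extracted fields, or None if the line does not match."""
    pattern = PATTERNS.get(component)
    if pattern is None:
        return None  # no pattern registered for this component
    match = pattern.match(line.rstrip("\n"))
    if match is None:
        return None  # line is not in the expected format
    fields = match.groupdict()
    fields["component"] = component
    fields["level"] = fields["level"].upper()  # normalize severity names
    return fields
```

Supporting a new third-party application then only means adding one more entry to the pattern table, which also keeps the rules easy to put under version control.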

Manageable notifications:

We wanted fine-grained control over alert notifications. The tool should decide whom to notify depending on which component has encountered an error. We didn’t want the developers to be buried under an unmanageable pile of notifications. A notification should be sent only when something requires action, eliminating false positives and duplicates. And we wanted each alert to be tracked until closed.
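To make the routing and duplicate-suppression goals concrete, here is a minimal sketch of the idea. It is not our actual notification code: the routing table, the e-mail addresses, the `AlertRouter` class and the ten-minute throttle window are all assumptions made for the example.

```python
import time

# Hypothetical routing table: component name -> people to notify.
OWNERS = {
    "webserver": ["frontend-team@example.com"],
    "indexer": ["search-team@example.com"],
}
DEFAULT_OWNERS = ["oncall@example.com"]


class AlertRouter:
    """Chooses recipients per component and suppresses repeated alerts."""

    def __init__(self, throttle_seconds=600, clock=time.time):
        self.throttle_seconds = throttle_seconds
        self.clock = clock  # injectable so the window can be tested
        self._last_sent = {}  # (component, signature) -> time of last notification

    def route(self, component, signature):
        """Return the recipients to notify, or an empty list if throttled."""
        key = (component, signature)
        now = self.clock()
        last = self._last_sent.get(key)
        if last is not None and now - last < self.throttle_seconds:
            return []  # same alert seen within the throttle window
        self._last_sent[key] = now
        return list(OWNERS.get(component, DEFAULT_OWNERS))
```

The `signature` is whatever identifies "the same problem", for example an error code or a normalized message. Keying the throttle on both component and signature means a new kind of error on the same component still gets through.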

Ease of manageability:

We didn’t want to spend time managing the tool itself. It had to be easy to operate, with the ability to version control the rules defined for log processing and alerting.

Centralized and easily searchable:

Debugging should not require access to multiple machines. All logs should be available at a central location in real time. Log transfer should be reliable and robust to network or machine failures. We should be able to search the logs efficiently at any time.

Minimal footprint on CloudMagic servers:

CloudMagic has been designed to make optimum use of computing resources on every machine. Hence the solution had to be very lightweight on CloudMagic servers.

Scalable architecture:

The solution should scale as CloudMagic grows.

With the above goals in sight, we tried a number of tools: SEC, Facebook Scribe and Logstash, to name a few. None of these suited our requirements. SEC was resource intensive on our servers and gave us no control over duplicate notifications. Facebook Scribe looked like a dead project, with no activity in the past couple of years; although it was great for log aggregation, it did not meet our other goals. Logstash was underdeveloped and unreliable, with limited features, no timely bug fixes and no support available.

The closest match we found was Splunk. Setting up Splunk wasn’t much of a challenge; it was up and running in no time. It could consume data in real time regardless of format or location. Once the data was in Splunk, its powerful search language let us get right at the data with full control over it. We could monitor, analyze and generate advanced visual reports on our data, set up real-time alerts and receive notifications via email or RSS or by executing a custom script. Notifications could be throttled based on a variety of threshold-based and trend-based conditions and other complex searches. Splunk also scales quite well.

However, we were not entirely satisfied with the level of control Splunk gave us over the content and throttling rules of alert notifications. We scripted our own notification module in order to have absolute control. We also plan to integrate it with our own task management system, IssueBurner, which will enable us to track every notification until it is closed.

Splunk is indeed the best log management tool you can ask for. The tedious job of log management is fun with Splunk on your side. Do you face challenges in handling your logs in the cloud? Share them in the comments section below.