
Why we switched from Papertrail to SplunkStorm

A few months ago, during the great search for Loggly alternatives, I ran across two major solutions: Papertrail and SplunkStorm. At first we used Papertrail, since it was very familiar to us as developers (in fact it's really just "tail" with an optional UI), but it turned out to be nearly useless when we wanted real detail from our logs, more than just a simple "what's going on right now". We upgraded our plan, shoved about 50GB/month into the system, and it soon became apparent that Papertrail could not reliably handle that much logging data. What's worse, there's no built-in graphing support.

We realized that SplunkStorm, despite being a lot more "enterprise-esque", offered us the one thing we really needed from logging: traceability. There's a big difference between accepting log messages and understanding log messages. For example, we include timing metrics in some of our log messages to help us see which areas need attention for performance tuning. We were able to parse this data into meaningful fields, and then run a query like this:

index_time > 10 | timechart count

This showed us how many of our events took more than 10 seconds, plotted out in a very nifty little time chart.

What's even better, SplunkStorm is actually cheaper than Papertrail, and it's incredibly fast. With Papertrail, you can tell they're not indexing events properly. Specifically, if you search for something that happened days ago, it can take hours to get any sort of response (in our case we have upwards of 50GB of log files, and it appears to simply be grepping through them in reverse chronological order). When you need to see a pattern, that's simply not acceptable.

The folks at Papertrail certainly did a good job with tailing log files; they do that far better than Splunk does, and they even have a CLI client for it. However, that one feature does not make up for the fact that it fails very badly at any sort of searching, which really is the point of log aggregation, isn't it?

If you want more from your statistics, if you can't wait to dive into details directly within your logging platform and really find out what your logs mean, not just what they say, then take a look at Splunk. It's powerful enough for the advanced developer (can you say "regex" everywhere?), yet simple enough for our non-technical users to understand. Being able to produce sexy graphs on demand from statistics you didn't even know you had? Yeah, that's Splunk.
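As a taste of that regex support, Splunk's search language lets you extract fields on the fly with `rex` and chart them immediately. This is just a sketch (the source name and pattern are illustrative, reusing the `index_time` field from the earlier example):

```
source="app_logs" | rex "index_time=(?<index_time>\d+\.?\d*)" | timechart avg(index_time)
```

Even if a field was never parsed at ingest time, a named capture group like this pulls it out at search time.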

There are two major features that Splunk is working on which are the reason we switched over now: a real-time streaming API for searching (think CLI client for tailing Splunk logs) and, more importantly for most, alerts. While Papertrail does both of these things today, its alerts aren't very good, and its lack of fast searching through your log history does not make up for its current features.

What do you think? Do you use Splunk? What are some of the cool things you've done with your logging systems?

Logos/graphics are trademarked and copyrighted by Splunk and used with their consent.
