The Great Search for Syslog Services

I've been spending a lot of time lately looking for a good replacement for Loggly, ever since they've been having so many problems with uptime and availability. The most important feature of any log management platform is obviously making sure it's available when I need it, and always collecting my logs. If the service drops 50% of my logs, it's not very useful for tracking down those little bugs; I still have to log into all of my servers to see everything.

Features of Log Management Solutions

What we really need out of a log management solution is something that integrates with our services transparently. By that I mean something that doesn't require extra code-level development work in order to use. We use Python with the standard logging module, which we then push to syslog. Syslog (or rsyslog, which we use now) lets us ship logs off to another remote server, which is really what we like to do. That means that no matter what language we use, all of our logs can be stored locally and shipped off to the log management solution, without any integration work in our actual programs. We also like to see logs from native Linux apps like SSH, so the only real solution for us is something that integrates with syslog.
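As a rough sketch of what this looks like on the application side, here is the standard logging module pushing to a local syslog daemon. The address, facility, and logger name are assumptions for a typical rsyslog setup, not anything vendor-specific; rsyslog would then forward these messages to the remote log management service.

```python
import logging
import logging.handlers

# Application logger; rsyslog handles remote shipping, so the app
# never needs to know about the log management vendor at all.
logger = logging.getLogger("myapp")
logger.setLevel(logging.INFO)

# ("localhost", 514) is the conventional syslog UDP endpoint.
# LOCAL0 is an arbitrary choice of facility for app logs.
syslog_handler = logging.handlers.SysLogHandler(
    address=("localhost", 514),
    facility=logging.handlers.SysLogHandler.LOG_LOCAL0,
)
syslog_handler.setFormatter(
    logging.Formatter("myapp: %(levelname)s %(message)s")
)
logger.addHandler(syslog_handler)

logger.info("delivery complete for client=42")
```

Because the handler speaks plain syslog, swapping log management providers is an rsyslog configuration change, not a code change.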

Scale and Search

It also needs to scale. We follow the rule of "log everything, even if you don't think you'll need it." It's not abnormal for us to have over 1 million log events, or 1GB of log data, in a single day. If you log everything, you can sort through things later to more easily find out what happened.

Since we do log everything, what we really need is the ability to search for something across all of our systems. We need to be able to trace down something that may have gone wrong and figure out exactly where it went wrong.

Alerts

We also want to be able to use our logging solution to alert us if something is going wrong. Specifically we have two types of alerts:

If more than X events appear in Y minutes (error threshold)

If FEWER than X events appear in Y minutes (heartbeat)

Specifically, we deliver to clients using FTP, HTTP, or other methods, and we need to be alerted if we haven't delivered in over 10 minutes, or if we've received more than a few errors per client in an hour. It's not abnormal for a client system to go down for a few minutes, or to see the occasional error (the internet is not perfect, after all), but if there are, say, 500 errors delivering to client Z in an hour, then someone should be alerted. It's also nice to be able to schedule timeframes for when these alerts run, but that's just icing on the cake.
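Both alert styles boil down to counting events in a sliding time window. Here's a minimal sketch of that idea, assuming you had to roll it yourself rather than relying on a vendor's alerting; the class name and thresholds are illustrative, not from any real product's API.

```python
import time
from collections import deque

class RateAlert:
    """Sliding-window event counter covering both alert styles:
    too many events (error threshold) or too few (heartbeat)."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()  # timestamps of recent events

    def record(self, timestamp=None):
        self.events.append(time.time() if timestamp is None else timestamp)

    def count(self, now=None):
        # Drop anything older than the window, then count what's left.
        now = time.time() if now is None else now
        while self.events and self.events[0] < now - self.window:
            self.events.popleft()
        return len(self.events)

    def too_many(self, threshold, now=None):  # error threshold
        return self.count(now) > threshold

    def too_few(self, threshold, now=None):   # heartbeat
        return self.count(now) < threshold
```

For the delivery heartbeat described above, you'd use a 600-second window and page someone when `too_few(1)` turns true, meaning no delivery in the last 10 minutes.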

Graphing

Another feature we would like to have is the ability to graph events. Take, for example, graphing how many errors you've had over the past day. Even better, how many deliveries have you had over the past month? This could show us which days put the heaviest load on our services, and we could then use that information to determine when we need to have more servers available (after all, we're in the cloud, so that can really be automated).
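Even without a vendor graphing feature, the underlying aggregation is simple: bucket log lines by day and count them. A toy sketch, assuming ISO dates at the start of each line (which is an assumption about the log format, not something any particular service guarantees):

```python
from collections import Counter
from datetime import datetime

def events_per_day(lines):
    """Count log lines per calendar day, ready to feed a plotting tool."""
    counts = Counter()
    for line in lines:
        # Assumes each line begins with a YYYY-MM-DD timestamp.
        day = datetime.strptime(line[:10], "%Y-%m-%d").date()
        counts[day] += 1
    return counts

sample = [
    "2012-03-01 12:00:01 delivery error client=7",
    "2012-03-01 12:05:44 delivery error client=7",
    "2012-03-02 09:13:02 delivery error client=9",
]
```

The resulting per-day counts are exactly the series you'd chart to spot heavy-load days.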

Uptime is key

The most important piece of any log management solution, though, is consistent uptime and availability. If your logs are lost, if you can't get to them in an emergency, or if alerts suddenly fail because the service isn't available, then the service is completely useless.

What services have we looked at?

We've looked at quite a few different services to solve our logging problems. Obviously we don't want to roll our own service, since that's really not our core business, and we don't want to maintain yet another system that doesn't make us any money. Here's a list of what we've looked at:

Loggly

We were previously using Loggly. It's cheap: only about $200/month for up to 1GB/day and 90 days of search history. It gives you graphs, search, and easy integration. With the addition of AlertBirds, it also has very robust alert management. AlertBirds isn't actually integrated directly into Loggly, but it's a free companion service they also provide.

So why did we recently switch away from them? They violated the first rule of log management: they had horrible uptime. Whenever we went to their service, there was always a message along the lines of "Sorry, we're working on backfilling the servers with your logs". That kind of transparency is nice, but it doesn't make up for constant issues. The final straw was a huge outage that lost all logs for several days, followed by a weak apology in which they basically blamed AWS for reboots. Amazon announces up front that reboots are not uncommon; their policy has always been that they may have to reboot your instance at any time, and while they try to let you know in advance, sometimes that's just not possible (what if a server suddenly dies?). Loggly half-heartedly accepted responsibility while simultaneously blaming AWS. Handling reboots is one of the major points that anyone seriously interested in cloud computing should address from the beginning; it's one of the few rules of doing business in the cloud.

Loggr.net

While digging around, I also took a look at Loggr.net. While at first glance it does appear very nice (they offer tons of analytics tools), it again violates one of my primary rules: it requires you to use their API to push logs, and it doesn't integrate seamlessly with syslog or any other common standard. Additionally, they charge per log event, and the highest plan they appear to have allows 20,000 log events per day. We blow through that in a few minutes, so they're clearly not designed to scale to what we need.

Papertrailapp

Papertrailapp is our current log management system of choice. They have one feature I really wanted from the beginning with Loggly: being able to see a live tail of your logs. They do lack alerting and graphing, but they also offer a very nice API, which means you could build those yourself, or simply wait for them to build them, as they seem to be very concerned about their customers. I've had several email interactions with the folks there, which is quite frankly the entire reason we're still staying with them. What's even nicer is that, unlike with Loggly, I don't have to keep logging in every 5 minutes to view my logs. Automatic login that actually works is very nice.

Although you might not think much of it, being able to watch a live tail of a saved search is very important for debugging issues, or simply for watching how well an upgrade went. Although Papertrailapp doesn't support more than four weeks of search history, this live tail of events gives you a "live look" at how things are going in the system. They also offer a nice archival option to your own S3 bucket, which means you can do your own work with your log events after the 30 days of search results in Papertrailapp are gone.

What other solutions are people looking at?

Have a better solution, or use something different? Please let me know in the comments!
