Sysdig + Logs: Advanced Log Analysis Made Easy

Log collection and analysis is one of the most powerful tools in the hands of developers and operations teams. Inspecting logs is useful in a number of areas, including security, monitoring, debugging and troubleshooting. Often, however, dealing with logs is frustrating, because they give just a hint about a problem without providing enough context to understand what the problem actually is or how to fix it. Addressing the inherent shortcomings of logs has always been one of our many goals with sysdig. And the latest release of sysdig, in particular, adds some pretty cool functionality for log collection, inspection and analysis.

This blog post will demonstrate a few interesting ways to use sysdig with logs, including:

Tailing all of a machine’s log files with a single command line, without the need to configure anything

Easily correlating log entries with the relevant system activity

Troubleshooting a failed PHP request with sysdig + logs

Note: this post will likely make more sense if you have a basic familiarity with sysdig. If you’ve never heard of sysdig, I suggest you start with the website or the wiki.

Analyzing syslog with sysdig

Since sysdig sees all the I/O on the system, it captures all syslog activity by default. A sysdig chisel called spy_syslog makes it easy to isolate and display that syslog activity. For example, if I run spy_syslog while manually generating a bit of syslog noise, I get output like this:

$ sudo sysdig -c spy_syslog
local3.emerg logger[14024] Jul 30 16:35:53 root: This is an emergency message
local3.alert logger[14025] Jul 30 16:35:53 root: This is an alert message
local3.crit logger[14026] Jul 30 16:35:53 root: This is a critical message
local3.err logger[14027] Jul 30 16:35:53 root: This is an error message
local3.warn logger[14028] Jul 30 16:35:53 root: This is a warning message
local3.notice logger[14029] Jul 30 16:35:53 root: This is a notice message
local3.info logger[14030] Jul 30 16:35:53 root: This is an info message
local3.debug logger[14031] Jul 30 16:35:53 root: This is a debug message
user.err python[14032] root: This is an error message from ptyhon
user.warn python[14032] root: This is a warning message from ptyhon
user.debug python[14032] root: This is a debug message from ptyhon

As you can see, sysdig displays each syslog message as it occurs on my system (in this case, logs from my ‘logger’ and ‘python’ processes), and it uses color to indicate the severity of the message. And as always, this chisel is integrated with the sysdig filtering language, so that I can dig into syslog activity using arbitrary expressions. For example, I can capture only messages with high severity coming from the process ‘logger’:
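$ sudo sysdig -c spy_syslog "proc.name=logger and syslog.severity<4"

(Here I’m leaning on sysdig’s syslog.severity filter field, which follows the standard syslog numbering: lower values mean higher severity, so everything below 4 covers emergency through error.)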

As always, this works on a live system, but also on a trace file that I captured previously. By the way, if you are interested in syslog messages only, this command will create a compact trace file that includes just those:

$ sudo sysdig -F -w trace.scap evt.is_syslog=true
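You can then replay the saved messages at any time by reading the trace back through the chisel:

$ sudo sysdig -r trace.scap -c spy_syslog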

Analyzing application logs with sysdig

Another chisel that is very useful for log spelunking is spy_logs. It works in a similar way to spy_syslog, but instead of showing messages written to /var/log/messages, it shows messages that are written to files that contain “.log” or “_log” in their name. (Note, if you want to tune this, just edit the FILE_FILTER constant at the beginning of the chisel Lua code). spy_logs automatically unpacks multi-line writes and renders each of them nicely on the screen, with some basic pattern-matching-based coloring.

Think about it as a filterable super-tail that shows all the log files in the system, with no configuration required. Just run it, and you get all of the logs in a single stream. It’s sort of magic. Don’t believe me? Here, for example, I ran the chisel for a minute or two on one of my VMs:
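$ sudo sysdig -c spy_logs

(That’s the whole invocation; no arguments or configuration needed.)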

The first item in each line is the process name, the second is the name of the log file, and the rest of the line contains the actual message.

As always, this chisel can be applied to a live system (in which case it behaves very much like tail) or to a trace file that has been captured previously. And, as always, you can use filters. This command line, for example, will capture log writes made by httpd that contain the word GET:
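$ sudo sysdig -c spy_logs "proc.name=httpd and evt.buffer contains GET"

(proc.name and evt.buffer are standard sysdig filter fields; we’ll use evt.buffer contains again in a moment.)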

Using logs and system events to troubleshoot a failed HTTP request

Now let’s get into the cool stuff.

Sysdig really starts to shine when the log collection capabilities presented in the previous sections are married with sysdig’s innate ability to do deep system inspection. For example, the spy_logs chisel has a very powerful feature: it can save a snapshot of the activity of the thread that wrote a log event, around the time when the event was written.

As usual, I’ll use a real world example to show you what I mean.

One of the pages on my web server is failing with a generic “database error”. I want to understand what’s causing the failure. And, by the way, I have to admit I don’t really know PHP.

I’m going to start by taking a sysdig capture while I load the page, hoping that the problem will show up in some log. I start the capture with:

$ sudo sysdig -s2000 -w system.scap

…and stop the capture with CTRL+C after refreshing my browser window. Now I have a full snapshot of my system while the problem happened. I start by taking a look at the log activity:
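$ sudo sysdig -r system.scap -c spy_logs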

Bingo. The second line is the page access log, and confirms that the URL is the right one, while the first line is the error we’re looking for: “Database error”. Not surprisingly, it’s not very helpful. And normally, that is all you have, which is the typical issue with logs. This time, however, I have a shiny new sysdig trace!

Let’s start by using that powerful feature I mentioned above: I’ll isolate the relevant system activity around the time of the failed request into a separate sysdig file. Here’s the command:
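$ sudo sysdig -r system.scap -c spy_logs "request.scap 1000" "evt.buffer contains Database"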

Since this command is a little bit more advanced, let’s look at it piece by piece:

-r system.scap indicates the trace file that we want to read from

-c spy_logs “request.scap 1000” means: for each event selected, save 1000ms of system activity before it happened and 1000ms of activity after it happened, but just for the thread that generated the event, and save it to request.scap

“evt.buffer contains Database” is our filter for selecting events, which we use to isolate the specific log entry we’re interested in (in fact, sysdig confirms the filter worked by printing the one log message we wanted to isolate)

Now we have a small file that contains just the activity before and after that log message was written, without noise. So let’s take a look. Note that I’m going to interject my comments into the trace output, to help the untrained eye spot the interesting parts.
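Reading the new trace back is as simple as:

$ sudo sysdig -r request.scap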

So, to summarize: a script called db_interface.php, imported by userstats.php, is trying to connect to MySQL, but is using the wrong credentials. MySQL refuses the connection and the request fails.

The solution: go check the password!

Conclusion

This blog post only scratches the surface of what sysdig can do with your system logs, but hopefully it’s a good start to get a basic idea and start hacking.

The chisels that I presented here focus on (1) easy system-wide log collection and (2) the ability to dive deeper by correlating log entries with relevant system information. The more I work on sysdig, however, the more I’m amazed at how much useful information is buried in every system. The trick is to bubble it up in a way that is easy and productive. So stay tuned, because we are working on several more powerful features here.

And, as usual, we’d love to hear what you think. If you have any thoughts or suggestions, please let us know in the comments or on Twitter: @sysdig.