Posted by kdawson on Saturday November 13, 2010 @10:25AM
from the truth-of-kernel dept.

francis-giraldeau writes "Linux Trace Toolkit next generation (LTTng) provides high-performance kernel tracing for Linux. This is the killer app for system-level debugging and performance tuning. It's now easier than ever to install, with packages released for Ubuntu Maverick. The short introduction to kernel tracing shows how to interpret a simple kernel trace and relate it to strace. I would like to ask Slashdot readers what they would expect as features for a kernel tracing analysis tool, because I'm starting my PhD on this topic and looking for ideas. Also, I wonder why LTTng is not mainline yet. Will Linus Torvalds see the light in 2011?"

With DTrace, you have to know what you are looking for in advance, while LTTng can trace in the background in flight recorder mode and record everything that is going on. Afterward, you have all the information you need, which is invaluable when you have a hard-to-reproduce bug!
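For anyone unfamiliar with the term, here is a toy sketch of the flight recorder idea: events go into a fixed-size ring buffer, the oldest ones get overwritten, and you only pull a snapshot out after something goes wrong. This is not LTTng's implementation or API; the `FlightRecorder` class and the event names are made up for illustration.

```python
from collections import deque


class FlightRecorder:
    """Toy flight recorder (hypothetical, not LTTng code).

    Keeps only the most recent `capacity` events in memory.
    """

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest entry once it is full,
        # which gives ring-buffer behaviour.
        self._events = deque(maxlen=capacity)

    def record(self, timestamp, name):
        # Store one event; cheap enough to leave running all the time.
        self._events.append((timestamp, name))

    def snapshot(self):
        # Copy out whatever is still in the buffer, oldest first.
        return list(self._events)


# Made-up event names, recorded continuously in the background.
recorder = FlightRecorder(capacity=3)
events = ["sys_open", "sys_read", "sched_switch", "sys_write", "sys_close"]
for ts, name in enumerate(events):
    recorder.record(ts, name)

# Something went wrong: dump the last few events for analysis.
trace = recorder.snapshot()
```

The trade-off is the buffer size: a bigger buffer reaches further back in time but costs more memory.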

Here is another idea for you. How about hardware-assisted "dynamic" (aka dynamically hooked) tracepoints via a custom Xen-like bare-metal hypervisor? The OS, and therefore any malware it contains, would know nothing of the inspection process, and best of all it could be OS-independent if done at the hardware level. The control/diagnostics software could be running in a VM right next to the OS under test. Boot the hypervisor from CD and then load the original machine's OS. Stealth rootkits would be a thing of the past. Simply boot the monitor before loading the OS under test and have a blast uncovering all kinds of malware in any OS of your choice.

"Not that much point having a tracing tool if an inexperienced admin cannot safely use it on a live system which has a problem."

Right. Because everyone knows the best place to develop, debug, and profile code is on a production machine, and the person doing the development should be a system administrator, preferably with minimal experience.

I would say many people do know that the best place to understand the performance of a system in production is in production. If the vendor's support techs can give an admin commands to run and know that a typo here or there will not result in a panic, then that is a very useful feature.

Sorry, I'm not going to bother registering - I read /. quite steadily but don't usually feel the need to add more than what's already said. You can google me around, though, I'm easy to find.

FWIW, I introduced LTT in 1999 and lobbied kernel developers for inclusion for 6 years before giving maintainership to someone else. LTTng is in fact a complete rewrite of LTT and I've got little to do with the project these days. I had little to do with its authoring and it likely has none of my code.

I do take issue, however, with your posturing. The fact that LTT was there before DTrace and that Linux still lacks equivalent functionality today speaks volumes about some of the lesser-known aspects of the kernel development process: namely that disruptive changes are insanely hard to mainline. It's one thing to ask for proof of the need. It's another to ignore the proof that's already out there and the project's history.

Francis doesn't need to finish his PhD to prove usefulness: combined, half a dozen PhDs and Master's degrees have already been done on Linux kernel tracing.

Actually... While I was the maintainer, IBM had a team of people working on LTT for a period of 3 years before pulling the plug on their involvement because they saw that all the money they were pouring in there wasn't leading to mainlining.

Why were they interested in kernel tracing? Well... When a customer of theirs has one of his 10,000 servers misbehaving in production, they can't afford to tell him to just take it offline for diagnostics. They have to find (and fix) the problem in the field. There are very few tools that allow you to do that. Oh, and having the source code and being able to rebuild is just not an option in those cases. After I passed on maintainership, Google did some work on kernel tracing with the LTTng developers with goals very similar to IBM's: misbehaving machines in server farms should not need to be taken offline for diagnostics.

As for shooting the performance, I suggest you read up on LTTng's literature. The current team has done a stupendous job at deconstructing that myth.

You're right, Linux is being taken very seriously. Hence the need for these kinds of tools.

He has expressed similar sentiments more recently as well (e.g., from 2007, on git's use of C vs. C++):

C++ is a horrible language. It's made more horrible by the fact that a lot
of substandard programmers use it, to the point where it's much much
easier to generate total and utter crap with it. Quite frankly, even if
the choice of C were to do *nothing* but keep the C++ programmers out,
that in itself would be a huge reason to use C.