Zack's Kernel News

Avoiding Bloat in Test Kernels

Borislav Petkov updated Kconfig's DEBUG_INFO description to warn users that it would cause "huge bloat" and slow down the build. David Rientjes, however, pointed out that this was only a marginal improvement over the previous description, because it didn't give a clear explanation of what "huge bloat" actually meant. Specifically, David pointed out that the compile option did not result in a larger vmlinux binary. So, where was the bloat?

Linus Torvalds pointed out that the object files would be much larger with DEBUG_INFO enabled. As an example, he said his fs/built-in.o file went from 2.8MB without that option enabled, to 11.8MB with DEBUG_INFO (and DEBUG_INFO_REDUCED) enabled. He said, "The object files being much bigger really does screw you especially on laptops that often have less memory and less capable IO subsystems. The final links in particular tend to be much bigger and use much more memory."

Linus added, "I suspect a lot of people are in denial about just how *horrible* the overhead of debug builds are. And yeah, if you have oodles of memory to cache things, it's not too bad. But you really want to have *lots* of memory, because otherwise you won't be caching all those object files in RAM, and your build easily becomes IO bound at link time. A factor of four size difference will not be helping your poor slow laptop disk."
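Linus's "factor of four" follows directly from the two sizes he quoted for fs/built-in.o. The short Python sketch below just redoes that arithmetic; the function name is made up for illustration, and the figures are the ones from his message (measured with both DEBUG_INFO and DEBUG_INFO_REDUCED enabled).

```python
def debug_overhead(without_mb: float, with_mb: float) -> tuple[float, float]:
    """Return (size ratio, extra megabytes) for an object file built with debug info."""
    ratio = with_mb / without_mb
    extra_mb = with_mb - without_mb
    return ratio, extra_mb


if __name__ == "__main__":
    # Sizes Linus reported for fs/built-in.o, in MB.
    ratio, extra_mb = debug_overhead(2.8, 11.8)
    print(f"{ratio:.1f}x larger, {extra_mb:.1f} MB extra for this one object file")
```

Every object file in the tree pays a similar penalty, which is why the build's working set can outgrow the page cache on a machine with little RAM.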

Andrew Morton affirmed all of that and added that he always disabled DEBUG_INFO when testing allmodconfig builds (which otherwise enable as many options as possible, building them as modules where they can be). But, he suggested that the DEBUG_INFO help text "should make clear that the bloat is a compile-time impact and not a runtime one." Instead of just saying that DEBUG_INFO caused "huge bloat," Andrew suggested saying, "It hugely bloats object files' on-disk sizes and slows the build."

Borislav Petkov liked that version more and offered to send an updated patch, but David pointed out that "CONFIG_DEBUG_INFO_REDUCED requires you to say yes here and that config option actually specifies the tools for which it is effective. Are you suggesting that we decouple CONFIG_DEBUG_INFO from CONFIG_DEBUG_INFO_REDUCED and make CONFIG_DEBUG_INFO select the reduced variant?"

In response to this, Ingo Molnár suggested an entirely new SAVE_TIME_AND_DISK_SPACE config option, which would compile with no debug info in the object files. Linus replied that the kernel already had something like this: the COMPILE_TEST option, which was explicitly for testing the compilation process and not for running the resulting executable.

So, because the whole discussion was really about testing compile options, Linus made his own patch to make the DEBUG_INFO option depend on disabling the COMPILE_TEST option. This, in theory, would save testers from the bloat but still allow them to test out the build system with all options enabled.

Andi Kleen pointed out that he actually did sometimes run the binary produced by an allyesconfig build, mostly to compile statistics on the running kernel, and that Linus's patch would mess that up for him. He suggested providing a simple way for experts to bypass the constraint and compile with all the bloated debug data built in. Linus replied, however, that Andi's use-case was rare enough that it didn't need to be handled by the build system directly, especially because Andi could edit the build files by hand after doing make allyesconfig.
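To see why the dependency has the effect described above, here is a toy Python model of an "enable everything" configuration pass. It is not how Kconfig is implemented: the option table, the `all_yes` helper, and the DEBUG_KERNEL dependency are assumptions for illustration. Only the idea that DEBUG_INFO requires COMPILE_TEST to be off comes from the discussion.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

Config = Dict[str, bool]
Option = Tuple[str, Callable[[Config], bool]]

# Toy option table, listed in evaluation order. Each entry pairs an option
# name with a predicate saying whether its dependencies are satisfied.
OPTIONS: List[Option] = [
    ("COMPILE_TEST", lambda cfg: True),
    ("DEBUG_KERNEL", lambda cfg: True),  # assumed prerequisite, for illustration
    ("DEBUG_INFO", lambda cfg: cfg["DEBUG_KERNEL"] and not cfg["COMPILE_TEST"]),
]


def all_yes(options: List[Option], force_off: FrozenSet[str] = frozenset()) -> Config:
    """Say yes to every option whose dependencies allow it, except those forced off."""
    cfg: Config = {}
    for name, deps_ok in options:
        cfg[name] = name not in force_off and deps_ok(cfg)
    return cfg


if __name__ == "__main__":
    # An "everything on" build turns COMPILE_TEST on, which keeps DEBUG_INFO off.
    print(all_yes(OPTIONS))
    # Turning COMPILE_TEST off by hand makes DEBUG_INFO selectable again.
    print(all_yes(OPTIONS, force_off=frozenset({"COMPILE_TEST"})))
```

The second call corresponds loosely to the manual route Linus suggested for Andi's case: start from the all-yes result, then switch the blocking option off by hand.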

The thread ended about there. These are odd discussions to watch, because they don't really concern regular users and are more about developer convenience than anything else. But, the debate is fascinating.

Easy Virtual Machines

Andy Lutomirski announced virtme, a project intended to make it easier to test kernels in virtual machines. Andy was inspired by the difficulty of getting all the ducks lined up for testing in plain KVM. He described the experience: "to test a kernel change, I would scrounge around for a disk image, try to remember the magic QEMU incantation to boot my kernel with that image, and spend a while cursing at the initramfs that would inevitably fail to find my 'hard disk'. Then I'd curse more because it's a real PITA to share files between the host and a guest. At some point I'd accidentally corrupt my image, and then I'd start over."

When virtFS came along, Andy wrote some scripts to do what he needed, and these scripts turned into virtme. It does essentially all the legwork of setting up a bootable environment with enough tools to test the build from the inside, or download more tools, as the case may be.

Andy mapped out a course for future development. One idea involved support for automatically running scripts within the virtual system. This way an entire test suite could be run, and results gathered, without the user ever having to interact with the virtual machine directly. A single command would start the system, run all the tests, collect the results, and exit.

Other ideas involved improving portability (it had only been tested on Fedora 20 at that point) and robustness.

There was no discussion at the time, but the project seems pretty cool. Eventually, whether using virtme or some similar system, it might take just a single command to download the kernel source, build it, test it, and send the results to a designated server. If that ever happened, you'd probably see massive numbers of regular users setting up their crontabs to do this while they slept. And, because they'd have nothing to lose, many of them would certainly test the tip of the Git tree, which typically doesn't get as much testing as -rc releases, which don't get as much testing as official versions. All of a sudden, the tip of the tree would likely get thousands or millions of test runs every day. I vote yay!

Inode vs. File Path for Some ACL Features

Ilya Dryomov pointed out that a couple of ACL (access control list) patches that had gone into the kernel actually conflicted with each other. In particular, he said, this affected the Ceph code. Ceph is a distributed filesystem that aims to be robust and scalable. Ilya posted a patch to bring Ceph in line with the more generic of the two conflicting patches.

Linus Torvalds replied that he had already committed some of the conflicting code to the source tree but was ready to merge Ilya's new Ceph code once he got the appropriate "Signed-off-by" line from Ilya.

However, while Linus was looking at the relevant code in the VFS (virtual filesystem), he noticed a new problem that he thought needed to be addressed.

Specifically, Linus noticed that one of the ACL functions passed file information around as a dentry structure. A dentry (directory entry) ties a name within a directory to a file's inode, so the path that was used to reach the file can be recovered from it, which is not possible from the inode alone. Linus remarked that although an inode might be sufficient for normal filesystems, a distributed filesystem like Ceph would benefit more from having the actual file path available instead of just an inode.

The reason for this is that inodes are unique and relatively easy to deal with on a single hard drive, whereas distributed filesystems can't make certain useful guarantees about inode uniqueness.

Linus wanted Al Viro and Christoph Hellwig to take a look and see if perhaps there was a clean way to have the relevant ACL functions look up the file path from the dentry and pass that path around instead of the dentry.

Christoph replied and said that, actually, although the set_acl() function would be relatively straightforward to modify, the get_acl() function (which Linus had written himself) did not have an obvious way to get at the desired path data. He asked Linus to take a look, because it was his own code, and see if anything looked doable.

Linus replied, "You're right, it looks like an absolute nightmare to add the dentry pointer through the whole chain. Damn." He added that there might be a horribly kludgy workaround but said, "I definitely see your argument. It may just not be worth the pain for this odd ceph case."

Linus suggested that "if the ceph people decide to try to bite the bullet and do the required conversions to pass the dentry to the permissions functions, I think I'd take it unless it ends up being *too* horribly messy."

Christoph pointed out, however, that this wasn't just an "odd ceph case." He said the Plan9 filesystem also needed the same file path fix at a fundamental level, or it wouldn't be able to make use of the new ACL helper functions. After that, he said, "we can get rid of the lower level interfaces entirely and eventually move ACL dispatching entirely into the VFS."

Likewise, Christoph said that CIFS needed the same fix.

With other filesystems affected, Linus took a deeper look at the code and decided it wouldn't be such an absolute nightmare after all. He posted a patch that managed to bring the file paths deeper into the function call chain, although not quite all the way. He remarked, "interestingly, replacing the inode pointer with a dentry pointer ended up actually simplifying several of the call-sites, so while doing this patch I got the feeling that this was the better interface anyway, and we should have done this long ago."

He also said that trying to bring the file paths any deeper in the call chain might actually start to cause difficulties. Some of the necessary data just wasn't available at those deep levels. But, he said, the data could be made to go just far enough to address the problem at hand.

Al didn't like Linus's code at all. He thought much of the patch was pointless, and that other parts worked hard to handle apparent special cases whose conditions would in fact always hold at that level. In particular, Al said that Linus was trying to produce something special for a job that an inode could already handle.

Linus, however, pointed out that the whole point was to avoid using inodes and use file paths instead. Linus said:

Some network filesystems pass the *path* to the server. Any operation that needs to check something on the server needs the *dentry*, not the inode.

This whole 'the inode describes the file' mentality comes from old broken UNIX semantics. It's fundamentally true for local Unix filesystems, but that's it. It's not true in general.

Sure, many network filesystems then emulate the local Unix filesystem behavior, so in practice, you get the unix semantics quite often. But it really is wrong.

If the protocol is path-based (and it happens, and it's actually the *correct* thing to do for a network filesystem, rather than the idiotic 'file handle' crap that tries to emulate the unix inode semantics in the protocol), then the inode is simply not sufficient.

And no, d_find_alias() is not correct or sufficient either. It can work in practice (and probably does perfectly fine 99.9% of the time), but it can equally well give the *wrong* dentry: yes, the dentry it returns would have been a valid dentry for somebody at some time, but it might be a stale dentry *now*, and it might be the wrong dentry for the current user (because the current user may not have permissions to that particular path, even if the user has permissions through his *own* path).

Part of the disagreement between Al and Linus was that Al thought the problem was really simpler than Linus did. Al argued that filesystems like CIFS didn't have hard links, which made it easier to look up file locations over the network.

But Christoph and Linus both pointed out that CIFS actually did support hard links, which surprised Al at first. Al said, "How the hell does it … Oh, right – samba on Unix server. I really wonder how well do Windows clients deal with those."

Meanwhile (and as part of that disagreement), Linus explained why the existence of hard links in a networked filesystem was relevant. He said:

Do this: create a hardlink in two different directories. Make the *directory* permissions for one of the directories be something you cannot traverse. Now try to check the permissions of the *same* inode through those two paths. Notice how you get *different* results.

Really.

Now, imagine that you are doing the same thing over a network. On the server, there may be a single inode for the file, but when the client gives the server a pathname, the two pathnames to that single inode ARE NOT EQUIVALENT.

And the fact is, filesystems with hardlinks and path-name-based operations do exist. cifs with the unix extensions is one of them.
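Linus's experiment is easy to reproduce on a local Unix filesystem. The Python sketch below is one way to do it under some assumptions: it runs on a Unix-like system whose temporary directory supports hard links, and the directory names, the `Result` type, and the helper functions are invented for illustration. It creates one file with two names in two directories, removes all permissions from one directory, and then tries to reach the same inode through each path.

```python
import os
import tempfile
from typing import NamedTuple


class Result(NamedTuple):
    same_inode: bool      # both names refer to one inode
    via_open_dir: bool    # reachable through the traversable directory
    via_closed_dir: bool  # reachable through the non-traversable directory


def can_stat(path: str) -> bool:
    """Return True if the current user can walk to `path` and stat it."""
    try:
        os.stat(path)
        return True
    except PermissionError:
        return False


def hardlink_demo() -> Result:
    with tempfile.TemporaryDirectory() as root:
        open_dir = os.path.join(root, "open")
        closed_dir = os.path.join(root, "closed")
        os.mkdir(open_dir)
        os.mkdir(closed_dir)

        first = os.path.join(open_dir, "file")
        second = os.path.join(closed_dir, "file")
        with open(first, "w") as f:
            f.write("hello\n")
        os.link(first, second)  # second name for the same inode

        same_inode = os.stat(first).st_ino == os.stat(second).st_ino

        os.chmod(closed_dir, 0o000)  # drop search (traverse) permission
        try:
            return Result(same_inode, can_stat(first), can_stat(second))
        finally:
            os.chmod(closed_dir, 0o700)  # restore so cleanup can succeed


if __name__ == "__main__":
    print(hardlink_demo())
```

Run as an ordinary user, the two paths should give different answers even though they name a single inode. Run as root, both succeed, because root bypasses directory search permissions.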

Al was convinced – though he still hated the ugly necessity of Linus's proposed change. Linus, meanwhile, split up his patch and sent it out for review, acknowledging, "I do actually agree that the second patch isn't exactly pretty. Passing in both dentry and inode is redundant, and calling the function 'inode_permission()' is now a misnomer." He also pointed out some technical reasons why it would be better to do the patch as-is and clean it up separately later.

Al offered various harsh technical criticisms and reiterated his hatred of the patch, but he acknowledged that the actual changes did seem necessary. Al also cursed the name of Andrew Tridgell for supporting those awful hard links. But, Jeremy Allison replied, "Actually you have to blame me for that. Tridge always *HATED* the UNIX extensions :-)."