From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch 00/2] improve .text size on gcc 4.0 and newer compilers
Date: Thu, 05 Jan 2006 19:26:56 UTC
Message-ID: <fa.fu9j3rc.g0q6qq@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0601051112070.3169@g5.osdl.org>
On Thu, 5 Jan 2006, Martin Bligh wrote:
>
> There are tools already around to do this sort of thing as well - "profile
> directed optimization" or whatever they called it. Seems to be fairly commonly
> done with userspace, but not with the kernel. I'm not sure why not ...
> possibly because it's not available for gcc ?
.. and they are totally useless.
The fact is, the last thing we want to do is to ship a magic profile file
around for each and every release. And that's what we'd have to do to
get consistent and _useful_ performance increases.
That kind of profile-directed stuff is useful mainly for commercial binary
releases (where the release binary can be guided by a profile file), or
speciality programs that can tune themselves a few times before running.
A kernel that people recompile themselves simply isn't something where it
works.
What _would_ work is something that actually CHECKS (and suggests) the
hints we already have in the kernel. IOW, you could have an automated
test-bed that runs some reasonable load, and then verifies whether there
are branches that go only one way that could be annotated as such, or
whether some annotation is wrong.
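As a rough illustration of the checking described above (a sketch, not code from this thread): each annotated branch gets a counter pair, and a test run can then flag hints that were contradicted more often than they held. It assumes GCC's `__builtin_expect`; the names `branch_stat`, `checked_likely`, `checked_unlikely` and `hint_looks_wrong` are hypothetical.

```c
/* Hypothetical per-annotation counters; not an existing kernel interface. */
struct branch_stat {
    unsigned long hit;   /* outcome agreed with the hint */
    unsigned long miss;  /* outcome contradicted the hint */
};

/* Record whether the observed outcome matches the expected one,
 * then pass the condition through unchanged. */
static inline int branch_check(struct branch_stat *s, int cond, int expected)
{
    if (cond == expected)
        s->hit++;
    else
        s->miss++;
    return cond;
}

/* Instrumented variants of the usual hints (assumes GCC's __builtin_expect). */
#define checked_likely(stat, x) \
    __builtin_expect(branch_check((stat), !!(x), 1), 1)
#define checked_unlikely(stat, x) \
    __builtin_expect(branch_check((stat), !!(x), 0), 0)

/* Crude verdict: the hint was wrong more often than it was right. */
static inline int hint_looks_wrong(const struct branch_stat *s)
{
    return s->miss > s->hit;
}
```

After a representative load, a report over all `branch_stat` entries would list the annotations worth removing or flipping; how the entries get collected is left out of this sketch.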
That way the "profile data" actually follows the source code, and is thus
actually relevant to an open-source project. Because we do _not_ start
having specially optimized binaries. That's against the whole point of
being open source and trying to get users to get more deeply involved with
the project.
Linus

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch 00/2] improve .text size on gcc 4.0 and newer compilers
Date: Thu, 05 Jan 2006 19:44:50 UTC
Message-ID: <fa.g0pj43g.igq62u@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0601051126570.3169@g5.osdl.org>
On Thu, 5 Jan 2006, Linus Torvalds wrote:
>
> That way the "profile data" actually follows the source code, and is thus
> actually relevant to an open-source project. Because we do _not_ start
> having specially optimized binaries. That's against the whole point of
> being open source and trying to get users to get more deeply involved with
> the project.
Btw, having annotations obviously works, although it equally obviously
will limit the scope of this kind of profile data. You won't get the same
kind of granularity, and you'd only do the annotations for cases that end
up being very clear-cut. But having an automated feedback cycle for adding
(and removing!) annotations should make it pretty maintainable in the long
run, although the initial annotations might only end up being for really
core code.
There's a few papers around that claim that programmers are often very
wrong when they estimate probabilities for different code-paths, and that
you absolutely need automation to get it right. I believe them. But the
fact that you need automation doesn't automatically mean that you should
feed the compiler a profile-data-blob.
You can definitely automate this on a source level too, the same way
sparse annotations can help find user access problems.
There's a nice secondary advantage to source code annotations that are
actively checked: they focus the programmers themselves on the issue. One
of the biggest advantages (in my opinion) of the "struct xyzzy __user *"
annotations has actually been that it's much more immediately clear to the
kernel programmer that it's a user pointer. Many of the bugs we had were
just the stupid unnecessary ones because it wasn't always obvious.
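For readers unfamiliar with the annotation, here is a minimal userspace sketch of how a `__user` pointer shows up in a function signature. The `__CHECKER__` definitions mirror what the kernel headers use for sparse; `fake_copy_from_user` and `get_xyzzy_value` are hypothetical stand-ins, and the copy is a plain `memcpy` so the example can run outside the kernel.

```c
#include <string.h>

/* Under sparse (__CHECKER__), __user marks a pointer as living in a
 * separate address space that must not be dereferenced directly.
 * For a normal compile, the annotations vanish. */
#ifdef __CHECKER__
# define __user  __attribute__((noderef, address_space(1)))
# define __force __attribute__((force))
#else
# define __user
# define __force
#endif

struct xyzzy {
    int value;
};

/* Hypothetical stand-in for copy_from_user(); returns bytes NOT copied. */
static unsigned long fake_copy_from_user(void *dst, const void __user *src,
                                         unsigned long n)
{
    memcpy(dst, (__force const void *)src, n);
    return 0;
}

/* The signature alone tells the reader that uptr is a user pointer. */
int get_xyzzy_value(const struct xyzzy __user *uptr, int *out)
{
    struct xyzzy tmp;

    if (fake_copy_from_user(&tmp, uptr, sizeof(tmp)))
        return -1;
    *out = tmp.value;
    return 0;
}
```

Writing `uptr->value` directly in `get_xyzzy_value()` is the kind of access a checker that understands the annotation is meant to flag.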
The same is likely true of rare functions etc. A function that is marked
"rare" as a hint to the compiler to put it into another segment (and
perhaps optimize more aggressively for size etc rather than performance)
is also a big hint to a programmer that he shouldn't care. On the other
hand, if some branch is marked as "always()", that also tells the
programmer something real.
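A sketch of what such markers could look like, assuming a GCC-compatible compiler: the `cold` function attribute (added in GCC releases later than this thread) asks the compiler to treat a function as rarely executed, and `__builtin_expect` carries the branch hint. The names `__rare`, `always`, `handle_corrupt_header` and `parse_header` are hypothetical.

```c
#include <stdio.h>

/* Hypothetical marker for rarely executed functions; relies on the
 * GCC/Clang "cold" and "noinline" attributes. */
#define __rare __attribute__((cold, noinline))

/* Hypothetical "this branch is essentially always taken" hint. */
#define always(x) __builtin_expect(!!(x), 1)

/* Error path: the marker tells both compiler and reader not to worry
 * about its speed. */
static __rare int handle_corrupt_header(int magic)
{
    fprintf(stderr, "corrupt header, magic 0x%x\n", (unsigned int)magic);
    return -1;
}

int parse_header(int magic)
{
    if (always(magic == 0x1234))
        return 0;                        /* hot path */
    return handle_corrupt_header(magic); /* rare path */
}
```

Whether the compiler actually moves the cold function into a separate section depends on the compiler version and options, so that part is best verified by inspecting the object file.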
So explicit source hints may be more work, but they have tons of
advantages. Ranging from repeatability and distribution to just the
programmer being aware of them.
In other projects, maybe people don't care as much about the programmer
being aware of what's going on - garbage collection etc silent automation
is all wonderful. In the kernel, I'd rather have people be aware of what
happens.
Linus

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch 00/2] improve .text size on gcc 4.0 and newer compilers
Date: Thu, 05 Jan 2006 20:18:42 UTC
Message-ID: <fa.g0p93ji.igg6ig@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0601051208510.3169@g5.osdl.org>
On Thu, 5 Jan 2006, Martin Bligh wrote:
>
> Hmm. if you're just going to do it as binary on/off ...is it not pretty
> trivial to do a crude test implementation by booting the kernel, turning
> on profiling, running a bunch of different tests, then marking anything
> that never appears at all in profiling as rare?
Yes, I think "crude" is exactly where we want to start. It's much easier
to then make it smarter later.
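The crude first pass could be as simple as the following sketch: given per-function hit counts from a profiling run, collect every function that never showed up. The `func_profile` layout and `collect_rare_candidates` are hypothetical; real profile data would first have to be reduced to this form.

```c
#include <stddef.h>

/* Hypothetical reduced profile record: one entry per function. */
struct func_profile {
    const char *name;
    unsigned long hits;  /* profile events attributed to the function */
};

/* Store the names of functions with zero hits into out[] (at most
 * out_cap entries) and return how many were stored. */
size_t collect_rare_candidates(const struct func_profile *prof, size_t n,
                               const char **out, size_t out_cap)
{
    size_t found = 0;

    for (size_t i = 0; i < n && found < out_cap; i++) {
        if (prof[i].hits == 0)
            out[found++] = prof[i].name;
    }
    return found;
}
```

The result is only a list of candidates: a function that is never hit under one set of tests may still be hot under a different load, which is why the thread talks about running a bunch of different tests.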
> Not saying it's a good long-term approach, but would it not give us enough
> data to know whether the whole approach was worthwhile?
Yes. And it's entirely possible that "crude" is perfectly fine even in the
long run. I suspect this is very much a "5% of the work gets us 80% of the
benefit", with a _very_ long tail with tons of more work to get very minor
incremental savings..
> OTOH, do we have that much to gain anyway in kernel space? all we're doing is
> packing stuff down into the same cacheline or not, isn't it?
> As we have all pages pinned in memory, does it matter for any reason
> beyond that?
The cache effects are likely the biggest ones, and no, I don't know how
much denser it will be in the cache. Especially with a 64-byte one..
(although 128 bytes is fairly common too).
There are some situations where we have TLB issues, but those are likely
cases where we don't care about placement performance anyway (ie they'd
be in situations where you use the page-alloc-debug stuff, which is very
expensive for _other_ reasons ;)
Linus

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch 00/2] improve .text size on gcc 4.0 and newer compilers
Date: Thu, 05 Jan 2006 20:23:36 UTC
Message-ID: <fa.fv9l3rd.h0s6qr@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0601051213270.3169@g5.osdl.org>
On Thu, 5 Jan 2006, Linus Torvalds wrote:
>
> The cache effects are likely the biggest ones, and no, I don't know how
> much denser it will be in the cache. Especially with a 64-byte one..
> (although 128 bytes is fairly common too).
Oh, but validating things like "likely()" and "unlikely()" branch hints
might be a noticeably bigger issue.
In user space, placement on the macro level is probably a bigger deal, but
in the kernel we probably care mostly about just single cachelines and
about branch prediction/placement.
Linus

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch 00/2] improve .text size on gcc 4.0 and newer compilers
Date: Fri, 06 Jan 2006 00:05:08 UTC
Message-ID: <fa.fv9v4ji.h0e7ig@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0601051548290.3169@g5.osdl.org>
On Fri, 6 Jan 2006, Ingo Molnar wrote:
>
> i frequently validate branches in performance-critical kernel code like
> the scheduler (and the mutex code ;), via instruction-granularity
> profiling, driven by a high-frequency (10-100 KHz) NMI interrupt. A bad
> branch layout shows up pretty clearly in annotated assembly listings:
Yes, but we only do this for routines that we look at anyway.
Also, the profiles can be misleading at times: you often get instructions
with zero hits, because they always schedule together with another
instruction. So parsing things and then matching them up (correctly) with
the source code in order to annotate them is probably pretty nontrivial.
But starting with the code-paths that get literally zero profile events is
definitely the way to go.
> Especially with 64 or 128 byte L1 cachelines our codepaths are really
> fragmented and we can easily have 3-4 times of the optimal icache
> footprint, for a given syscall. We very often have cruft in the hotpath,
> and we often have functions that belong together ripped apart by things
> like e.g. __sched annotators. I haven't seen many cases of wrongly judged
> likely/unlikely hints, what happens typically is that there's no
> annotation and the default compiler guess is wrong.
We don't have likely()/unlikely() that often, and at least in my case it's
partly because the syntax is a pain (it would probably have been better to
include the "if ()" part in the syntax - the millions of parentheses just
drive me wild).
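To make the syntax complaint concrete, here is a sketch of hint macros that fold in the "if ()" part. The names `if_likely` and `if_unlikely` are hypothetical, as is `clamp_len`; the macros assume GCC's `__builtin_expect`.

```c
/* Hypothetical macros that include the "if" itself, so the caller
 * writes one pair of parentheses instead of two. */
#define if_likely(cond)   if (__builtin_expect(!!(cond), 1))
#define if_unlikely(cond) if (__builtin_expect(!!(cond), 0))

/* Compare with: if (unlikely(len > max)) return max; */
int clamp_len(int len, int max)
{
    if_unlikely(len > max)
        return max;
    return len;
}
```

One trade-off of this form is that the hint can no longer be applied to a sub-expression or used outside an `if`, which the plain likely()/unlikely() macros allow.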
So yeah, we tend to put likely/unlikely only on really obvious stuff, and
only on functions where we think about it. So we probably don't get it
wrong that often.
> the dcache footprint of the kernel is much better, mostly because it's
> so easy to control it in C. The icache footprint is a lot more elusive.
> (and also a lot more critical to execution speed on nontrivial workloads)
>
> so i think there are two major focus areas to improve our icache
> footprint:
>
> - reduce code size
> - reduce fragmentation of the codepath
>
> fortunately both are hard and technically challenging projects
That's an interesting use of "fortunately". I tend to prefer the form
where it means "fortunately, we can trivially fix this with a two-line
solution that is obviously correct" ;)
Linus

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch 00/2] improve .text size on gcc 4.0 and newer compilers
Date: Fri, 06 Jan 2006 00:32:38 UTC
Message-ID: <fa.fva143g.h0862u@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0601051626290.3169@g5.osdl.org>
On Fri, 6 Jan 2006, Ingo Molnar wrote:
>
> Especially if there are enough profiling hits, it's usually a quick glance
> to figure out the hotpath:
Ehh. What's a "quick glance" to a human can be quite hard to automate.
That's my point.
If we do the "human quick glances", we won't be seeing much come out of
this. That's what we've already been doing, for several years.
I thought the discussion was about trying to automate this..
Linus