Correct but baby steps.
The gcc upgrade means rebuilding all the C++ code for the ABI change. I suspect that the out-of-date gcc is hinting at an out-of-date install, thus it's only the tip of the iceberg.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

Why do you consider this foolish? The kernel's position of authority makes it an attractive target for all kinds of attacks. The kernel has had bugs before where untrusted user input could cause it to copy too many bytes from user memory to kernel memory. Not all of those necessarily would result in a stack overwrite, but as a general rule, programmers seem to keep introducing bugs that let the attacker specify how many bytes to write. The stack smash protector is a relatively cheap way to guard against this recurring class of bug, and has the desirable properties that it consumes relatively little memory (one pointer-aligned integer cookie per function) and only a few instructions for setup/teardown. Some of the stack-protector variants are fairly conservative in what functions they instrument, which can help considerably with minimizing code bloat.

Why do you consider this foolish? The kernel's position of authority makes it an attractive target for all kinds of attacks.

Exactly; which is why it has to be pedantic in verifying user parameters.
Once you've verified the memory regions are valid (for this case) there is no need to mess about with checking your stack internally.
If you haven't verified it correctly, then all bets are off anyhow -- which is why this is already a done thing, in small utility functions.

Quote:

The kernel has had bugs before where untrusted user input could cause it to copy too many bytes from user memory to kernel memory. Not all of those necessarily would result in a stack overwrite, but as a general rule, programmers seem to keep introducing bugs that let the attacker specify how many bytes to write.

For a kernel, that's insane. If you don't verify inputs stringently, then you shouldn't be writing kernel code.

Quote:

The stack smash protector is a relatively cheap way to guard against this recurring class of bug, and has the desirable properties that it consumes relatively little memory (one pointer-aligned integer cookie per function) and only a few instructions for setup/teardown. Some of the stack-protector variants are fairly conservative in what functions they instrument, which can help considerably with minimizing code bloat.

It doesn't address the underlying issue, which has already been taken care of at the syscall interface.

Nor does it address the real issue, which is that you should not mix up the control stack with the data stack (something my boss researched back in 2002/3, so I have heard quite a lot on this topic.)

For a kernel, that's insane. If you don't verify inputs stringently, then you shouldn't be writing kernel code.

I think the main audience for these hardening switches isn't the end user, or the one writing the code, but people whose full-time job it is to reject crap on LKML. Replying with "this patch doesn't even boot under allyesconfig" is easier than writing a tailored e-mail explaining why idiot@dodgy-wifi.cn's code is a pile of shit and will never work.

Exactly; which is why it has to be pedantic in verifying user parameters.
Once you've verified the memory regions are valid (for this case) there is no need to mess about with checking your stack internally.

Sure. Once everyone swears not to write code vulnerable to buffer overflows, and follows through on that promise, we can skip enabling the buffer overflow detector. Until then, I like the trade that the compiler will ensure that such bugs are a denial-of-service, and nothing worse.

steveL wrote:

If you haven't verified it correctly, then all bets are off anyhow -- which is why this is already a done thing, in small utility functions.

Stack smash protection doesn't promise to recover from kernel bugs. It promises to (try to) limit how much damage is done when someone screws up and doesn't verify the inputs correctly.

steveL wrote:

For a kernel, that's insane. If you don't verify inputs stringently, then you shouldn't be writing kernel code.

Agreed. That's why we only let veteran developers who would never make such a mistake work on kernel code.

steveL wrote:

It doesn't address the underlying issue, which has already been taken care of at the syscall interface.

You're assuming all kernel stack overflows are triggered by user input that came through at the syscall level. That's probably a common path, but it's not the only one. Suppose a user passes in a string that is copied safely into an adequately sized long-term buffer. Suppose the kernel then later copies that long-term buffer into an undersized stack buffer. You get a user-influenced stack overrun, long after the syscall returned and possibly even after the user program exited. Yes, it's still a data validation issue, but it's not one that the original syscall necessarily should have detected. The buffer that syscall used was perfectly capable of handling the large string.

steveL wrote:

Nor does it address the real issue, which is that you should not mix up the control stack with the data stack (something my boss researched back in 2002/3, so I have heard quite a lot on this topic.)

Agreed. Complain to the 8088 designers not to implement their stack that way.

steveL wrote:

Fuzzing the kernel ABI is a much better idea, IMO.

Why can't we do both? Even better, isn't it nicer when running a fuzzer to do it on a kernel that has relatively deterministic failure modes? Personally, I hate debugging buffer overruns in code that was built without -fstack-protector, because by the time the program crashes, the stack is a mess and the debugger cannot reliably show me which function did the bad copy. By comparison, when -fstack-protector traps the overrun and kills the program, there is enough context that I can see exactly which function performed the bad copy.

Exactly; which is why it has to be pedantic in verifying user parameters.
Once you've verified the memory regions are valid (for this case) there is no need to mess about with checking your stack internally.

Hu wrote:

Sure. Once everyone swears not to write code vulnerable to buffer overflows, and follows through on that promise, we can skip enabling the buffer overflow detector. Until then, I like the trade that the compiler will ensure that such bugs are a denial-of-service, and nothing worse.

Eh? For this case (as qualified). I'll get to the rhetoric later.

steveL wrote:

If you haven't verified it correctly, then all bets are off anyhow -- which is why this is already a done thing, in small utility functions.

Hu wrote:

Stack smash protection doesn't promise to recover from kernel bugs. It promises to (try to) limit how much damage is done when someone screws up and doesn't verify the inputs correctly.

Yes, I know what it is, thanks; and I never said it would.
Input verification is essential everywhere; but the syscall ABI is the one place it absolutely must happen.
My point was simply that in verification terms, you're better off fuzzing the ABI, and then you can be much more confident about input handling.

I'm not saying that SSP has no place; indeed my boss codes using something very similar when working in asm. My point is not about the "bloat" required (a word of memory is piffling nowadays.)
My point is about putting it in-place within a kernel as standard, rather than doing the obvious and separating stack, which can be used as a basis for provable security.

SSP in production-kernels gives a false sense of security ("it will only be a denial of service") and nurtures a next-generation of "developers" who assume that buffer-overflows are no longer an issue (as the "SSP will catch it when it happens, and it will only ever be a denial of service, right?")

IOW we'll repeat the cycle of presumed, rather than verified, correctness.

So my rhetorical point is about the (cargo-)culture, and stopping it.

It's similar to the debates about X and buggy applications a couple of decades ago; we'd have been a lot better killing bad apps, rather than accommodating shit code, which would have been fixed in quick order if it crashed immediately. (Yes, I see the similarity with SSP.) The end-result is a metric tonne of "developers" with no discipline, and even less craft.

steveL wrote:

For a kernel, that's insane. If you don't verify inputs stringently, then you shouldn't be writing kernel code.

Hu wrote:

Agreed. That's why we only let veteran developers who would never make such a mistake work on kernel code. ;)

*sigh*
If you go dual-stack (in the original sense) then you take away the possibility of an overflow leading to loss of execution-control.

Solve the problem; don't paper over a bad design with more cycles. Remove the vector; it was never a good idea in any case (old-school asm-heads find it dreadful.)
"There is nothing so wasteful, as doing with great efficiency, that which does not have to be done at all."

Hu wrote:

You're assuming all kernel stack overflows are triggered by user input that came through at the syscall level.

Not at all.
I just prefer to solve the underlying problem for good, including your example of an internal validation problem (which again is better caught by testing and review, and would be even more of a howler for a supposed kernel coder.)

steveL wrote:

Fuzzing the kernel ABI is a much better idea, IMO.

Hu wrote:

Why can't we do both? Even better, isn't it nicer when running a fuzzer to do it on a kernel that has relatively deterministic failure modes?

Absolutely.
But that's not the same as SSP on production kernels, is it? You're into actual testing, which was my point. (valgrind is nice ;)

Keeping your control structures and your data separate is fundamental, though clearly not appreciated (or even heard of) by most "developers".

I don't think we're too far apart in practice; I have no issue with SSP in and of itself, nor do I care what someone else uses.
Where it leads, and a lack of consideration of alternatives, is troubling.

In x86 assembly, there are instructions for calling to a function and instructions for returning from a function (yes, plural in both cases). Call implicitly pushes $ip onto the stack, then transfers control to the target. Return implicitly pops from the stack into $ip. The stack for return addresses is designated by $esp/$rsp (x86/x86_64), and you cannot change that without changing the CPU ISA. You could, in theory, design a software ABI where $esp/$rsp is never used for any locals, effectively reserving the stack pointer to be only for return addresses. You would then need to spend an additional register tracking your stack-of-locals. It's not technically impossible to fix this in software, but it's very ugly. It gets worse when you consider some special purpose low level code that needs access to temporary storage, but runs in a context where the stack may be the only storage available.

I didn't notice this problem on most of my machines, save one... the one running an old hardened-sources. I wound up patching the makefile, because I already have stack protector turned off. I don't know if this is the correct way, but it does seem to work.

Code:

kernel/bounds.c:1:0: error: code model kernel does not support PIC mode

Code:

# The all: target is the default when no target is given on the
# command line.
# This allow a user to issue only 'make' to build a kernel including modules
# Defaults to vmlinux, but the arch makefile usually adds further targets
all: vmlinux

# The arch Makefile can set ARCH_{CPP,A,C}FLAGS to override the default
# values of the respective KBUILD_* variables
ARCH_CPPFLAGS :=
ARCH_AFLAGS :=
ARCH_CFLAGS :=
include arch/$(SRCARCH)/Makefile

In x86 assembly, there are instructions for calling to a function and instructions for returning from a function (yes, plural in both cases). Call implicitly pushes $ip onto the stack, then transfers control to the target. Return implicitly pops from the stack into $ip. The stack for return addresses is designated by $esp/$rsp (x86/x86_64), and you cannot change that without changing the CPU ISA.

Yes, this is all the same across architectures.
I'm not getting at you: I understand that you're establishing the ground rules. However, this was considered as given in earlier discussion.

So in the same spirit, this is all the same across archs: we have a control stack for call/ret, since this is a requirement of doing subroutines at all. Even in an HLL like C, which does not mandate use of the stack like this, a LIFO is required.

As any coder will tell you, and I am sure you know, a LIFO is a stack is a LIFO.

Quote:

You could, in theory, design a software ABI where $esp/$rsp is never used for any locals, effectively reserving the stack pointer to be only for return addresses.

What do you mean, "in theory"? FORTH (I think it is) has required a dual-stack since the beginning.
This has all been done before, IOW; including on the Z80 (an 8-bit processor which has been around since the 1970s.)

Quote:

You would then need to spend an additional register tracking your stack-of-locals. It's not technically impossible to fix this in software, but it's very ugly.

Not at all. Much uglier is shoving data on your control-stack, then overflowing it, and expecting someone else to correct your insanity with moar cycles.

Quote:

It gets worse when you consider some special purpose low level code that needs access to temporary storage, but runs in a context where the stack may be the only storage available.

Sorry, I'm calling bullshit on this part.
IRQ context needs to be set up before any interrupts are handled, whichever way you look at it. (Or: where did you get that stack from? ;)

Historically, x86 has been considered register-starved relative to most other ISAs. x86_64 made this suck less. As a consequence of this starvation, any scheme that reserves extra register(s) needs to have a very good justification or people will dismiss it due to the performance problems caused by the increased spills. I approach new schemes with that same skepticism.

As a point of clarification, which categories are you advocating go on which stacks? We need to track return addresses, function parameters, and function locals. We could group all function locals together or subdivide them into scalar locals (which only turn into overruns when people do something very stupid and treat a scalar as an array) and array locals (which overrun merely by forgetting to check the array length). In the latter case, even some relatively complex functions could be done entirely on a single stack. I replied on the basis that a data stack stores all data, and a codeflow stack stores only return addresses, and nothing else.

Assuming you want the control-stack to be exclusively return addresses, with all data moved to another stack, you get into a practical irritation. The CPU has lots of nice instructions for managing a combined data/codeflow stack via esp/rsp. If you want to separate them, and you don't redesign the ISA, there are various instructions you no longer get to use because they're hardwired to use $sp as their memory reference (push, call, ret, pop being the most common always-$sp instructions). So either you designate $sp as your data stack, then forgo easy use of call and return, or you designate $sp as your codeflow stack, then forgo easy use of push/pop. Both are possible, but may generate worse code compared to using the instructions designed for this access, hence "very ugly."

I say "in theory" because I am not aware of any production or seriously-proposed-for-production x86_32 or x86_64 ABI which works as you describe. I am aware that this is not new in the abstract, but as far as I know, it has never been done in production on x86 chips.

Regarding IRQs, my point was that when an IRQ handler triggers, the CPU is responsible for preparing the registers before it begins executing the IRQ handler. Some of that preparation is based on loading values that the OS prepared when the IRQ handler was installed, but my concern, which I did not research at the time, is whether the OS can prearrange enough state that an IRQ handler, using only whatever the CPU did autonomously and any prearrangements from earlier OS code, can get both a usable data stack and usable code stack, and do so without needing to write to whichever one it starts without.

Based on your comment about not wanting a data stack in IRQ context, you may have intended that a "data stack" is only for that data which is more readily misused. I interpreted it much more strictly, as being that the data stack is used for everything except return addresses. If so, there's very little that can be done to obtain a data stack because there is very little scratch space, so the handler would need to receive one preinitialized by the OS, and discoverable with very little code in the handler.

If we say that the control-flow stack can also be used to store spilled registers and function arguments, an IRQ handler can do quite a bit of useful work using only the control-flow stack, and might well be able to obtain a data stack if one were needed.

Historically, x86 has been considered register-starved relative to most other ISAs. x86_64 made this suck less. As a consequence of this starvation, any scheme that reserves extra register(s) needs to have a very good justification or people will dismiss it due to the performance problems caused by the increased spills. I approach new schemes with that same skepticism.

Sure. Try for a second to turn that same skepticism on the idea of intermingling machine control information, with random user input; or more accurately, with random external observable data.

I pointed out the prior art wrt Forth on Z80, to show that it's been proven feasible on an architecture with a much smaller address range and register file.

Hu wrote:

As a point of clarification, which categories are you advocating go on which stacks? .. I replied on the basis that a data stack stores all data, and a codeflow stack stores only return addresses, and nothing else.

Yes, that's right; though it also stores temporary register values from standard PUSH and POP insns.
By "standard" I mean what an asm-coder would use without concern, as a matter of course.

Hu wrote:

Assuming you want the control-stack to be exclusively return addresses, with all data moved to another stack, you get into a practical irritation. The CPU has lots of nice instructions for managing a combined data/codeflow stack via esp/rsp. If you want to separate them, and you don't redesign the ISA, there are various instructions you no longer get to use because they're hardwired to use $sp as their memory reference (push, call, ret, pop being the most common always-$sp instructions). So either you designate $sp as your data stack, then forgo easy use of call and return, or you designate $sp as your codeflow stack, then forgo easy use of push/pop. Both are possible, but may generate worse code compared to using the instructions designed for this access, hence "very ugly."

I think I just answered this above.
It seems you're missing the perspective of an asm-coder, to whom not using PUSH and POP for temporaries would be like trying to walk without feet. (Never mind the idea of foregoing CALL/RET.)

Oh, notadev is forcing me to add that you can then ban a whole class of insns in user code (anything that accesses SP directly, for example) which opens up pre-filtering before execution. (blah, blah blah.. ;)

Hu wrote:

Regarding IRQs, my point was that when an IRQ handler triggers, the CPU is responsible for preparing the registers before it begins executing the IRQ handler. Some of that preparation is based on loading values that the OS prepared when the IRQ handler was installed, but my concern, which I did not research at the time

The issue is that you need to research this (not an issue in and of itself): it's all abstract to you, so you're trying to frame the problem at the same time as "attack" our position; but you don't grok the problem, yet.

Look at how any OS has to setup the MMU at bootup (on any modern architecture: you wouldn't have needed to on a Z80 ;) Compare and contrast amd64 bringup with ARM32, in /usr/src/linux, for example, and take a look at a few bootloaders.

Quote:

If we say that the control-flow stack can also be used to store spilled registers and function arguments,

OFC we do: well, for spilled registers. Not for function arguments: use registers, and spill to DS.

Quote:

an IRQ handler can do quite a bit of useful work using only the control-flow stack

absolutely; that's what "stackless" coding actually means: no data-stack.
OFC the idea gets lost (and we contend, is essentially futile) if you don't actually have a separate data-stack to be "stackless" from, in the first place.
When you do, your thinking is very different.

Quote:

and might well be able to obtain a data stack if one were needed.

My gut says this is a terrible idea, but it's no different from kzalloc (with GFP_ATOMIC, I think) in IRQ context.
But then, we don't like that, either ;-)

Sorry for using the term "bullshit" earlier; I should have just said "Nonsense."

If you'd take some advice, I highly recommend at least 6-12 months coding ASM on a decent architecture (or if you must use Intel, then use nasm, with standard dest, src ordering: "Intel syntax", not "AT&T", yuck.)
It teaches you an awful lot.

I highly recommend Seyfarth's excellent book on amd64 coding; it's a steal. (Though you must spend a few weeks on binary and hex, till they feel more natural than algebra.)

An overarching point I must make: for an asm coder, the stack is sacrosanct.
Before you even get into all that theoretical discussion above, you have to first understand that what "modern" compilers do, horrifies an asm coder, a large proportion of whose time is spent on keeping that stack correct (which is why it's so nice to use an HLL) -- or you find out what "undefined behaviour" really means.

Then you want to come along and shove loads of random crap on there? And run loops across the data that rely on the data for control?! You guys are nuts. Period.
Especially when Forth already solved this back in the 1970s, and it was running well on a Z80.. (Seriously: do a search for "Forth data-stack Z80" if you are researching this.)

Start there, from that perspective on what the CPU does, and how it operates when we code to it directly (which is still what happens at runtime on every machine out there, even if it is being given bulshytt code.)
Then work forward; not backwards from the end-result of decades of accretion from going along with what someone else decided was "good enough."

Certainly, before you start advocating mandatory extra cycles on and after every subroutine call, for every machine -- at kernel-level no less -- to deal with the result of that bad design choice; which as discussed is not at all mandated by any CPU architecture out there: it is purely a software implementation decision. (Sugar coating notwithstanding: you can't sugar-coat a turd, and we can use those insns for a DSP instead.)

After all, these are "modern" times, I keep hearing, and we have vastly more address space, and certainly more registers, than the Z80 (lovely as it is.)

Live a little: spend a register on a data stack, obviate the fundamental buffer-overflow problem once and for all, and save all those machine cycles, as well as improving your security.
Then we can get round to all those gaping-hole function pointers^W^W "methods" all over the shop.. ;-)

Pretending it's anything other than garbage in terms of machine-control information, and relying on human programmers who cannot even see the asm, is both wildly stupid and wilfully blind to Law 0, and Law 1 of Computing: GIGO.