* [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist
@ 2020-02-14 13:39 Thomas Gleixner
  2020-02-14 13:39 ` [RFC patch 01/19] sched: Provide migrate_disable/enable() inlines Thomas Gleixner
  ` (20 more replies)
From: Thomas Gleixner @ 2020-02-14 13:39 UTC (permalink / raw)
To: LKML
Cc: David Miller, bpf, netdev, Alexei Starovoitov, Daniel Borkmann,
Sebastian Sewior, Peter Zijlstra, Clark Williams, Steven Rostedt,
Juri Lelli, Ingo Molnar
Hi!
This is a follow up to the initial patch series which David posted a while
ago:
https://lore.kernel.org/bpf/20191207.160357.828344895192682546.davem@davemloft.net/
which was (while non-functional on RT) a good starting point for further
investigations.
PREEMPT_RT aims to make the kernel fully preemptible except for a few low
level operations which have to be unpreemptible, like scheduler, low level
entry/exit code, low level interrupt handling, locking internals etc.
This is achieved by the following main transformations (there are some
more minor ones, but they don't matter in this context):
1) All interrupts which are not explicitly marked IRQF_NO_THREAD are
   forced into threaded interrupt handlers. This applies to almost all
   device interrupt handlers.
2) hrtimers which are not explicitly marked to expire in hard interrupt
   context are expired in soft interrupt context.
3) Soft interrupts are also forced into thread context. They run either
in the context of a threaded interrupt handler (similar to the !RT
mode of running on return from interrupt), in ksoftirqd or in a
thread context which raised a soft interrupt (same as in !RT).
Soft interrupts (BH) are serialized per CPU so the usual bottom half
protections still work.
4) spinlocks and rwlocks are substituted with 'sleeping spin/rwlocks'.
The internal representation is a priority inheritance aware rtmutex
which has glue code attached to preserve the spin/rwlock semantics
vs. task state. They neither disable preemption nor interrupts, which
is fine as the interrupt handler they want to be protected against is
forced into thread context.
The true hard interrupt handlers are not affecting these sections as
they are completely independent.
The code paths which need to be atomic use raw_spinlocks, which
disable preemption (and/or interrupts).
On a non-RT kernel spinlocks are mapped to raw_spinlocks, so everything
works as usual.
As a consequence allocation of memory is not possible from truly atomic
contexts on RT, even with GFP_ATOMIC set. This is required because the slab
allocator can run into a situation even with GFP_ATOMIC where it needs to
call into the page allocator, which in turn takes regular spinlocks. As the
RT substitution of regular spinlocks might sleep, it's obvious that memory
allocations from truly atomic contexts on RT are not permitted. The page
allocator locks cannot be converted to raw locks as the length of the
resulting preempt/interrupt disabled sections are way above the tolerance
level of demanding realtime applications.
So this leads to a conflict with the current BPF implementation. BPF
disables preemption around:
1) The invocation of BPF programs. This is required to guarantee
that the program runs to completion on one CPU
2) The invocation of map operations from sys_bpf(). This is required
to protect the operation against BPF programs which access the
same map from perf, kprobes or tracing context because the syscall
operation has to lock the hashtab bucket lock.
If the perf NMI, kprobe or tracepoint happens inside the bucket lock
held region and the BPF program needs to access the same bucket then
the system would deadlock.
The mechanism used for this is to increment the per CPU recursion
protection which prevents such programs from being executed. The same per
CPU variable is used to prevent recursion of the programs themselves,
which might happen when e.g. a kprobe attached BPF program triggers
another kprobe attached BPF program.
In principle this works on RT as well as long as the BPF programs use
preallocated maps, which is required for trace type programs to prevent
deadlocks of all sorts. Other BPF programs which run in softirq context,
e.g. packet filtering, or in thread context, e.g. syscall auditing, have no
requirement for preallocated maps. They can allocate memory when required.
But as explained above, on RT memory allocation from truly atomic context is
not permitted, which required a few tweaks to the BPF code.
The initial misunderstanding on my side was that the preempt disable around
the invocations of BPF programs is not only required to prevent migration
to a different CPU (in case the program is invoked from preemptible
context) but is also required to prevent reentrancy from a preempting
task. In hindsight I should have figured out myself that this is not the
case because the same program can run concurrently on a different CPU or
from a different context (e.g. interrupt). Alexei thankfully enlightened me
recently over a beer that the real intent here is to guarantee that the
program runs to completion on the same CPU where it started. The same is
true for the syscall side as this just has to guarantee the per CPU
recursion protection.
This part of the problem is trivial. RT provides a true migrate_disable()
mechanism which does not disable preemption. So the preempt_disable /
enable() pairs can be replaced with migrate_disable / enable() pairs.
migrate_disable / enable() maps to preempt_disable / enable() on a
not-RT kernel, so there is no functional change when RT is disabled.
But there is another issue which is not as straightforward to solve:
The map operations which deal with elements in hashtab buckets (or other
map mechanisms) have spinlocks to protect them against concurrent access
from other CPUs. These spinlocks are raw_spinlocks, so they disable
preemption and interrupts (_irqsave()). Not a problem per se, but some of
these code paths invoke memory allocations with a bucket lock held. This
obviously conflicts with the RT semantics.
The easy way out would be to convert these locks to regular spinlocks which
can be substituted by RT. Works like a charm for both RT and !RT for BPF
programs which run in thread context (includes soft interrupt and device
interrupt handler context on RT).
But there are also the BPF programs which run in truly atomic context even
on a RT kernel, i.e. tracing types (perf, perf NMI, kprobes and trace).
For obvious reasons these contexts cannot take regular spinlocks on RT
because RT substitutes them with sleeping spinlocks. might_sleep() splats
from NMI context are not really desired and on lock contention the
scheduler explodes in colourful ways. Same problem for kprobes and
tracepoints. So these program types need a raw spinlock which brings us
back to square one.
But, there is an important detail to the rescue. The trace type programs
require preallocated maps to prevent deadlocks of all sorts in the memory
allocator. Preallocated maps never call into the memory allocator or other
code paths which might sleep on a RT kernel. The allocation type is known
when the program and the map is initialized.
This allows differentiating between lock types for preallocated and
run-time allocated maps. While not pretty, the proposed solution is to have
a lock union in the bucket:
	union {
		raw_spinlock_t raw_lock;
		spinlock_t lock;
	};
and have init/lock/unlock helpers which handle the lock type depending on
the allocation mechanism, i.e. for preallocated maps it uses the raw lock
and for dynamic allocations it uses the regular spinlock. The locks in the
percpu_freelist need to stay raw as well, as they nest into the bucket lock
held section, which works for both the raw and the regular spinlock
variant.
I'm not proud of that, but I really couldn't come up with anything better
aside from completely splitting the code paths, which would be even worse
due to the resulting code duplication.
The lock in the LPM trie map needs to become a regular spinlock as well,
as the trie map is based on dynamic allocation. But again, not a problem, as
this map cannot be used for the critical types anyway.
I kept the LRU and the stack map locks raw for now as they do not use
dynamic allocations, but I haven't done any testing on latency impact
yet. The stack map critical sections are truly short, so they should not
matter at all, but the LRU ones could. That can be revisited once we have
numbers. I assume that LRU is not the right choice for the trace type
programs anyway, but who knows.
The last and trivial to solve issue (Dave solved that already) is the
non-owner release of mmap_sem, which is forbidden on RT as well. So this
just forces the IP fallback path.
On a non RT kernel this does not change anything. The compiler optimizes
the lock magic out and everything maps to the state before these changes.
This survives the selftests and a bunch of different invocations of
bpftrace on mainline and on a backport to 5.4-rt.
But of course I surely have missed some details here and there, so please
have a close look at this.
The diffstat of hashtab.c is so large because I sat down and documented the
above write up in condensed form in a comment on top of the file as I
wanted to spare others the dubious experience of having to reverse engineer
the inner workings of BPF.
Thanks,
tglx
8<----------------
 include/linux/bpf.h          |   8 +-
 include/linux/filter.h       |  33 ++++++--
 include/linux/kernel.h       |   7 +
 include/linux/preempt.h      |  30 +++++++
 kernel/bpf/hashtab.c         | 164 ++++++++++++++++++++++++++++++++-----------
 kernel/bpf/lpm_trie.c        |  12 +--
 kernel/bpf/percpu_freelist.c |  20 ++---
 kernel/bpf/stackmap.c        |  18 +++-
 kernel/bpf/syscall.c         |  16 ++--
 kernel/bpf/trampoline.c      |   9 +-
 kernel/events/core.c         |   2 
 kernel/seccomp.c             |   4 -
 kernel/trace/bpf_trace.c     |   6 -
 lib/test_bpf.c               |   4 -
 net/bpf/test_run.c           |   8 +-
 net/core/flow_dissector.c    |   4 -
 net/core/skmsg.c             |   8 --
 net/kcm/kcmsock.c            |   4 -
 18 files changed, 248 insertions(+), 109 deletions(-)

* Re: [RFC patch 14/19] bpf: Use migrate_disable() in hashtab code
  2020-02-14 19:56 ` Thomas Gleixner
@ 2020-02-18 23:36 ` Alexei Starovoitov
  2020-02-19  0:49   ` Thomas Gleixner
  2020-02-19 15:17   ` Mathieu Desnoyers
From: Alexei Starovoitov @ 2020-02-18 23:36 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Mathieu Desnoyers, LKML, David Miller, bpf, netdev,
Alexei Starovoitov, Daniel Borkmann, Sebastian Sewior,
Peter Zijlstra, Clark Williams, Steven Rostedt, Juri Lelli,
Ingo Molnar
On Fri, Feb 14, 2020 at 08:56:12PM +0100, Thomas Gleixner wrote:
>
> > Also, I don't think using __this_cpu_inc() without preempt-disable or
> > irq off is safe. You'll probably want to move to this_cpu_inc/dec
> > instead, which can be heavier on some architectures.
>
> Good catch.
Overall looks great.
Thank you for taking time to write commit logs and detailed cover letter.
I think s/__this_cpu_inc/this_cpu_inc/ is the only bit that needs to be
addressed for it to be merged.
There were a few other suggestions from Mathieu and Jakub.
Could you address them and resend?
I saw patch 1 landing in tip tree, but it needs to be in bpf-next as well
along with the rest of the series. Does it really need to be in the tip?
I would prefer to take the whole thing and avoid conflicts around
migrate_disable() especially if nothing in tip is going to use it in this
development cycle. So just drop patch 1 from the tip?
Regarding
	union {
		raw_spinlock_t raw_lock;
		spinlock_t lock;
	};
yeah. it's not pretty, but I also don't have better ideas.
Regarding migrate_disable()... can you enable it without the rest of RT?
I haven't seen its implementation. I suspect it's scheduler only change?
If I can use migrate_disable() without RT it will help my work on sleepable
BPF programs. I would only have to worry about rcu_read_lock() since
preempt_disable() is nicely addressed.

* Re: [RFC patch 14/19] bpf: Use migrate_disable() in hashtab code
  2020-02-18 23:36 ` Alexei Starovoitov
@ 2020-02-19  0:49 ` Thomas Gleixner
  2020-02-19  1:23   ` Alexei Starovoitov
  2020-02-19 15:17 ` Mathieu Desnoyers
From: Thomas Gleixner @ 2020-02-19 0:49 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Mathieu Desnoyers, LKML, David Miller, bpf, netdev,
Alexei Starovoitov, Daniel Borkmann, Sebastian Sewior,
Peter Zijlstra, Clark Williams, Steven Rostedt, Juri Lelli,
Ingo Molnar
Alexei,
Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> Overall looks great.
> Thank you for taking time to write commit logs and detailed cover letter.
> I think s/__this_cpu_inc/this_cpu_inc/ is the only bit that needs to be
> addressed for it to be merged.
> There were a few other suggestions from Mathieu and Jakub.
> Could you address them and resend?
I have them fixed up already, but I was waiting for further
comments. I'll send it out tomorrow morning as I'm dead tired by now.
> I saw patch 1 landing in tip tree, but it needs to be in bpf-next as well
> along with the rest of the series. Does it really need to be in the tip?
> I would prefer to take the whole thing and avoid conflicts around
> migrate_disable() especially if nothing in tip is going to use it in this
> development cycle. So just drop patch 1 from the tip?
I'll add patch 2 to a tip branch as well and I'll give you a tag to pull
into BPF (which has only those two commits). That allows us to further
tweak the relevant files without creating conflicts in next.
> Regarding
> 	union {
> 		raw_spinlock_t raw_lock;
> 		spinlock_t lock;
> 	};
> yeah. it's not pretty, but I also don't have better ideas.
Yeah. I really tried hard to avoid it, but the alternative solution was
code duplication which was even more horrible.
> Regarding migrate_disable()... can you enable it without the rest of RT?
> I haven't seen its implementation. I suspect it's scheduler only change?
> If I can use migrate_disable() without RT it will help my work on sleepable
> BPF programs. I would only have to worry about rcu_read_lock() since
> preempt_disable() is nicely addressed.
You have to talk to Peter Zijlstra about this as this is really
scheduler relevant stuff. FYI, he fundamentally hates migrate_disable()
from a schedulability POV, but as with the above lock construct the
amount of better solutions is also close to zero.
Thanks,
tglx

* Re: [RFC patch 14/19] bpf: Use migrate_disable() in hashtab code
  2020-02-19  0:49 ` Thomas Gleixner
@ 2020-02-19  1:23 ` Alexei Starovoitov
From: Alexei Starovoitov @ 2020-02-19 1:23 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Mathieu Desnoyers, LKML, David Miller, bpf, netdev,
Alexei Starovoitov, Daniel Borkmann, Sebastian Sewior,
Peter Zijlstra, Clark Williams, Steven Rostedt, Juri Lelli,
Ingo Molnar
On Wed, Feb 19, 2020 at 01:49:57AM +0100, Thomas Gleixner wrote:
> Alexei,
>
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> > Overall looks great.
> > Thank you for taking time to write commit logs and detailed cover letter.
> > I think s/__this_cpu_inc/this_cpu_inc/ is the only bit that needs to be
> > addressed for it to be merged.
> > There were a few other suggestions from Mathieu and Jakub.
> > Could you address them and resend?
>
> I have them fixed up already, but I was waiting for further
> comments. I'll send it out tomorrow morning as I'm dead tired by now.
>
> > I saw patch 1 landing in tip tree, but it needs to be in bpf-next as well
> > along with the rest of the series. Does it really need to be in the tip?
> > I would prefer to take the whole thing and avoid conflicts around
> > migrate_disable() especially if nothing in tip is going to use it in this
> > development cycle. So just drop patch 1 from the tip?
>
> I'll add patch 2 to a tip branch as well and I'll give you a tag to pull
> into BPF (which has only those two commits).
That works too.
> > Regarding
> > 	union {
> > 		raw_spinlock_t raw_lock;
> > 		spinlock_t lock;
> > 	};
> > yeah. it's not pretty, but I also don't have better ideas.
>
> Yeah. I really tried hard to avoid it, but the alternative solution was
> code duplication which was even more horrible.
>
> > Regarding migrate_disable()... can you enable it without the rest of RT?
> > I haven't seen its implementation. I suspect it's scheduler only change?
> > If I can use migrate_disable() without RT it will help my work on sleepable
> > BPF programs. I would only have to worry about rcu_read_lock() since
> > preempt_disable() is nicely addressed.
>
> You have to talk to Peter Zijlstra about this as this is really
> scheduler relevant stuff. FYI, he fundamentally hates migrate_disable()
> from a schedulability POV, but as with the above lock construct the
> amount of better solutions is also close to zero.
I would imagine that migrate_disable is like temporary cpu pinning. The
scheduler has to have all the logic to make scheduling decisions quickly in
presence of pinned tasks and additional migrate_disable shouldn't introduce
slowdowns or complexity to critical path. Anyway, we'll discuss further
when migrate_disable patches hit mailing lists.

* Re: [RFC patch 14/19] bpf: Use migrate_disable() in hashtab code
  2020-02-18 23:36 ` Alexei Starovoitov
  2020-02-19  0:49   ` Thomas Gleixner
@ 2020-02-19 15:17   ` Mathieu Desnoyers
  2020-02-20  4:19     ` Alexei Starovoitov
From: Mathieu Desnoyers @ 2020-02-19 15:17 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Thomas Gleixner, linux-kernel, David S. Miller, bpf, netdev,
Alexei Starovoitov, Daniel Borkmann, Sebastian Andrzej Siewior,
Peter Zijlstra, Clark Williams, rostedt, Juri Lelli, Ingo Molnar
----- On Feb 18, 2020, at 6:36 PM, Alexei Starovoitov alexei.starovoitov@gmail.com wrote:
[...]
> If I can use migrate_disable() without RT it will help my work on sleepable
> BPF programs. I would only have to worry about rcu_read_lock() since
> preempt_disable() is nicely addressed.
Hi Alexei,
You may want to consider using SRCU rather than RCU if you need to sleep while
holding a RCU read-side lock.
This is the synchronization approach I consider for adding the ability to take page
faults when doing syscall tracing.
Then you'll be able to replace preempt_disable() by combining SRCU and
migrate_disable():
AFAIU eBPF currently uses preempt_disable() for two reasons:
- Ensure the thread is not migrated,
-> can be replaced by migrate_disable() in RT
- Provide RCU existence guarantee through sched-RCU
-> can be replaced by SRCU, which allows sleeping and taking page faults.
I wonder if it would be acceptable to take a page fault while migration is
disabled though?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

* Re: [RFC patch 14/19] bpf: Use migrate_disable() in hashtab code
  2020-02-19 15:17 ` Mathieu Desnoyers
@ 2020-02-20  4:19 ` Alexei Starovoitov
From: Alexei Starovoitov @ 2020-02-20 4:19 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Thomas Gleixner, linux-kernel, David S. Miller, bpf, netdev,
Alexei Starovoitov, Daniel Borkmann, Sebastian Andrzej Siewior,
Peter Zijlstra, Clark Williams, rostedt, Juri Lelli, Ingo Molnar
On Wed, Feb 19, 2020 at 10:17:28AM -0500, Mathieu Desnoyers wrote:
> ----- On Feb 18, 2020, at 6:36 PM, Alexei Starovoitov alexei.starovoitov@gmail.com wrote:
>
> [...]
>
> > If I can use migrate_disable() without RT it will help my work on sleepable
> > BPF programs. I would only have to worry about rcu_read_lock() since
> > preempt_disable() is nicely addressed.
>
> Hi Alexei,
>
> You may want to consider using SRCU rather than RCU if you need to sleep while
> holding a RCU read-side lock.
>
> This is the synchronization approach I consider for adding the ability to take page
> faults when doing syscall tracing.
>
> Then you'll be able to replace preempt_disable() by combining SRCU and
> migrate_disable():
>
> AFAIU eBPF currently uses preempt_disable() for two reasons:
>
> - Ensure the thread is not migrated,
> -> can be replaced by migrate_disable() in RT
> - Provide RCU existence guarantee through sched-RCU
> -> can be replaced by SRCU, which allows sleeping and taking page faults.
bpf is using normal rcu to protect map values
and rcu+preempt to protect per-cpu map values.
srcu is certainly under consideration. It hasn't been used due to performance
implications. atomics and barriers are too heavy for certain use cases. So we
have to keep rcu where performance matters, but cannot fork map implementations
to rcu and srcu due to huge code bloat. So far I've been thinking of introducing
an explicit helper bpf_rcu_read_lock() and letting programs use it directly instead of
implicit rcu_read_lock() that is done outside of bpf prog. The tricky part is
teaching verifier to enforce critical section.

* Re: [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist
  2020-02-14 17:53 ` [RFC patch 00/19] bpf: Make BPF and PREEMPT_RT co-exist David Miller
@ 2020-02-14 18:36 ` Thomas Gleixner
  2020-02-17 12:59 ` [PATCH] bpf: Enforce map preallocation for all instrumentation programs Thomas Gleixner
From: Thomas Gleixner @ 2020-02-14 18:36 UTC (permalink / raw)
To: David Miller
Cc: linux-kernel, bpf, netdev, ast, daniel, bigeasy, peterz,
williams, rostedt, juri.lelli, mingo
David Miller <davem@davemloft.net> writes:
> From: Thomas Gleixner <tglx@linutronix.de>
> Date: Fri, 14 Feb 2020 14:39:17 +0100
>
>> This is a follow up to the initial patch series which David posted a
>> while ago:
>>
>> https://lore.kernel.org/bpf/20191207.160357.828344895192682546.davem@davemloft.net/
>>
>> which was (while non-functional on RT) a good starting point for further
>> investigations.
>
> This looks really good after a cursory review, thanks for doing this work.
>
> I was personally unaware of the pre-allocation rules for MAPs used by
> tracing et al. And that definitely shapes how this should be handled.
Hmm. I just noticed that my analysis only holds for PERF events. But
that's broken on mainline already.
Assume the following simplified callchain:
  kmalloc() from regular non BPF context
    cache empty
      freelist empty
        lock(zone->lock);
          tracepoint or kprobe
            BPF()
              update_elem()
                lock(bucket)
                  kmalloc()
                    cache empty
                      freelist empty
                        lock(zone->lock);  <- DEADLOCK
So really, preallocation _must_ be enforced for all variants of
intrusive instrumentation. There is no if and but, it's simply mandatory
as all intrusive instrumentation has to follow the only sensible
principle: KISS = Keep It Safe and Simple.
The above is a perfectly valid scenario and works with perf and tracing,
so it has to work with BPF in the same safe way.
I might be missing some magic enforcement of that, but I got lost in the
maze.
Thanks,
tglx