Paul: What would break if we stop processing rcu entries in (cpu) order?

The head->func(head) in rcu_do_batch() is probably a nightmare for the branch target predictor.

What about:
- shrinking struct rcu_head to just a pointer (let's start with the goodie)
- Adding a register_rcu_callback() function. It allocates the per-cpu storage for the rcu grace period lists, with separate lists for each registered callback, so there is no need to copy the callback target into each rcu_head structure. It returns a pointer/handle to these lists.
- call_rcu() gets that handle instead of the plain function pointer.
- rcu_do_batch() enumerates all registered callbacks: first all callback_struct->func(head) calls for the first registered callback, then the calls for the 2nd callback, etc.

Better for the icache, better for the branch predictor.
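To make the proposal concrete, here is a rough userspace sketch of the idea, not kernel code. All sketch_* names are made up for illustration, the "per-cpu" lists are modelled as one plain list per registered callback, and grace-period tracking and locking are left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: rcu_head shrunk to a single pointer, no func member. */
struct sketch_rcu_head {
	struct sketch_rcu_head *next;
};

typedef void (*sketch_rcu_func_t)(struct sketch_rcu_head *head);

/* One entry per registered callback; the list would be per-cpu in the kernel. */
struct sketch_rcu_class {
	sketch_rcu_func_t func;
	struct sketch_rcu_head *list;
	struct sketch_rcu_head **tail;
};

#define SKETCH_MAX_CLASSES 8
static struct sketch_rcu_class sketch_classes[SKETCH_MAX_CLASSES];
static int sketch_nr_classes;

/* Register a callback once and get back the handle used for queueing. */
struct sketch_rcu_class *sketch_register_rcu_callback(sketch_rcu_func_t func)
{
	struct sketch_rcu_class *c;

	if (sketch_nr_classes >= SKETCH_MAX_CLASSES)
		return NULL;
	c = &sketch_classes[sketch_nr_classes++];
	c->func = func;
	c->list = NULL;
	c->tail = &c->list;
	return c;
}

/* Queue head on the list belonging to the handle; no function pointer stored. */
void sketch_call_rcu(struct sketch_rcu_class *c, struct sketch_rcu_head *head)
{
	assert(c != NULL);
	head->next = NULL;
	*c->tail = head;
	c->tail = &head->next;
}

/* Run all entries of the first callback, then the second, and so on. */
int sketch_rcu_do_batch(void)
{
	int i, n = 0;

	for (i = 0; i < sketch_nr_classes; i++) {
		struct sketch_rcu_class *c = &sketch_classes[i];
		struct sketch_rcu_head *head = c->list;

		c->list = NULL;
		c->tail = &c->list;
		while (head) {
			/* Read next first: func may free the object. */
			struct sketch_rcu_head *next = head->next;

			c->func(head);
			head = next;
			n++;
		}
	}
	return n;
}

/* Demo: two callback types queued interleaved, invoked grouped by type. */
static char sketch_trace[16];
static int sketch_trace_len;

static void sketch_free_a(struct sketch_rcu_head *h)
{
	(void)h;
	if (sketch_trace_len < 15)
		sketch_trace[sketch_trace_len++] = 'A';
}

static void sketch_free_b(struct sketch_rcu_head *h)
{
	(void)h;
	if (sketch_trace_len < 15)
		sketch_trace[sketch_trace_len++] = 'B';
}

const char *sketch_demo(void)
{
	struct sketch_rcu_head h[4];
	struct sketch_rcu_class *a, *b;

	sketch_nr_classes = 0;
	sketch_trace_len = 0;
	a = sketch_register_rcu_callback(sketch_free_a);
	b = sketch_register_rcu_callback(sketch_free_b);

	/* Queued in A, B, A, B order ... */
	sketch_call_rcu(a, &h[0]);
	sketch_call_rcu(b, &h[1]);
	sketch_call_rcu(a, &h[2]);
	sketch_call_rcu(b, &h[3]);

	/* ... but invoked as A, A, B, B. */
	sketch_rcu_do_batch();
	sketch_trace[sketch_trace_len] = '\0';
	return sketch_trace;
}
```

Note that entries queued in A, B, A, B order get invoked as A, A, B, B. That reordering across callback types is exactly the change in processing order asked about at the top.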

Paul: Do you have a test case that is suitable for benchmarking rcu? Any workloads where rcu appears significantly in oprofile? And: Do you know how many rcu entries are typically alive? How much memory is used for the function pointers?