Hi Paul,

Thanks for entering a discussion of my idea! Even though you are busy and critical of it, you take the time to answer. Thanks!

On Wed, 26 Jul 2006, Paul E. McKenney wrote:

> On Thu, Jul 27, 2006 at 02:39:07AM +0100, Esben Nielsen wrote:
>> On Tue, 25 Jul 2006, Paul E. McKenney wrote:
>>> Not for inclusion, should be viewed with great suspicion.
>>>
>>> This patch provides an NMI-safe realtime RCU. Hopefully, Mathieu can
>>> make use of synchronize_sched() instead, since I have not yet figured
>>> out a way to make this NMI-safe and still get rid of the interrupt
>>> disabling. ;-)
>>>
>>>                                                 Thanx, Paul
>>
>> I must say I don't understand all this. It looks very complicated. Is it
>> really needed?
>>
>> I have been thinking about the following design:
>>
>> void rcu_read_lock()
>> {
>>         if (!in_interrupt())
>>                 current->rcu_read_lock_count++;
>> }
>> void rcu_read_unlock()
>> {
>>         if (!in_interrupt())
>>                 current->rcu_read_lock_count--;
>> }
>>
>> Somewhere in schedule():
>>
>>         rq->rcu_read_lock_count += prev->rcu_read_lock_count;
>>         if (!rq->rcu_read_lock_count)
>>                 forward_waiting_rcu_jobs();
>>         rq->rcu_read_lock_count -= next->rcu_read_lock_count;
>
> So rq->rcu_read_lock_count contains the sum of the counts of all
> tasks that were scheduled away from this CPU.
>
> What happens in face of the following sequence of events?
> Assume that all tasks stay on CPU 0 for the moment.
>
> o  Task A does rcu_read_lock().
>
> o  Task A is preempted. rq->rcu_read_lock_count is nonzero.
>
> o  Task B runs and does rcu_read_lock().
>
> o  Task B is preempted (perhaps because it dropped the lock
>    that was causing it to have high priority). Regardless of
>    the reason for the preemption, rq->rcu_read_lock_count is
>    nonzero.
>
> o  Task C runs and does rcu_read_lock().
>
> o  Task C is preempted. rq->rcu_read_lock_count is nonzero.
>
> o  Task A runs again, and does rcu_read_unlock().
>
> o  Task A is preempted. rq->rcu_read_lock_count is nonzero.
>
> o  Task D runs and does rcu_read_lock().
>
> o  Task D is preempted. rq->rcu_read_lock_count is nonzero.
>
> And so on. As long as at least one of the preempted tasks is in an
> RCU critical section, you never do your forward_waiting_rcu_jobs(),
> and the grace period goes on forever. Or at least until you OOM the
> machine.
>
> So what am I missing here?

The boosting idea below covers that: tasks A-D must then be RT tasks for this to happen. And the machine must run out of RT tasks eventually anyway, or it will effectively lock up.

> I could imagine introducing a pair of counters per runqueue, but then
> we end up with the same issues with counter-flip that we have now.
>
> Another question -- what happens if a given CPU stays idle? Wouldn't
> the callbacks then just start piling up on that CPU?

How can a CPU stay idle? There is a tick every 2.5 ms. Even without that, the previous CPU can make it schedule if it sees the jobs piling up. Or, if that is considered too expensive, it can take over the jobs and forward them to the next CPU itself.

>> Now what should forward_waiting_rcu_jobs() do?
>>
>> I imagine a circular data structure of all the CPUs. When call_rcu() is
>> run on a CPU, the job is first added to a list of jobs for that CPU.
>> When forward_waiting_rcu_jobs() is called, all the pending jobs are
>> forwarded to the next CPU. The next CPU will bring them along to the
>> next CPU in the circle together with its own jobs. When jobs hit their
>> original CPU they will be executed. Or rather, when the CPU just before
>> calls forward_waiting_rcu_jobs(), it sends the jobs belonging to the
>> next CPU to the RCU task of the next CPU, where they will be executed,
>> instead of to the scheduler (runqueue) of the next CPU, where they
>> would just be sent out on a new round trip along the circle.
>>
>> If you use a structure like the plist, the forwarding procedure can be
>> done in O(number of online CPUs) time worst case - much less in the
>> usual case where the lists are almost empty.
>>
>> Now the problem is: what happens if a task holding an RCU read-side
>> lock is migrated? Then the rcu_read_lock_count on the source CPU will
>> stay in plus while on the target CPU it will go in minus. This ought
>> to be simply fixable by adding task->rcu_read_lock_count to the target
>> runqueue before migrating and subtracting it from the old runqueue
>> after migrating. But there is another problem: RCU jobs referring to
>> data used by the task being migrated might have been forwarded from
>> the target CPU. Thus the migration task has to go back along the
>> circle of CPUs and move all the relevant RCU jobs back to the target
>> CPU to be forwarded again. This is also doable in (number of CPUs
>> between source and target) times O(number of online CPUs) (see above)
>> time.
>
> So if I have the right (or wrong) pattern of task migrations, the RCU
> callbacks never get to their originating CPU?

In principle, yes. But if the machine starts migrating tasks that wildly, it won't get any work done anyway, because all its time is spent doing migration.

> Alternatively, if the task residing in the RCU read-side critical
> section is forwarded around the loop of CPUs, callbacks circulating
> around this loop might execute before the RCU read-side critical
> section completes.

That is why some of the callbacks (those which have passed the target CPU but not yet the source CPU) have to be moved back to the target CPU.

I just came up with an even simpler solution: delay the subtraction of task->rcu_read_lock_count from srcrq->rcu_read_lock_count until the task calls rcu_read_unlock(). That can be done by flagging the task (do task->rcu_read_lock_count |= 0x80000000) and doing a simple

        if (unlikely(current->rcu_read_lock_count == 0x80000000))
                fix_rcu_read_lock_count_on_old_cpu();

in rcu_read_unlock(). Now the task can be migrated again before calling fix_rcu_read_lock_count_on_old_cpu(). The relevant RCU jobs still can't get past the original CPU before the task has called fix_rcu_read_lock_count_on_old_cpu(), so all subsequent migrations can just do the count-down on the intermediate CPUs right away.

>> To avoid a task in a read-side lock being starved for too long the
>> following line can be added to normal_prio():
>>
>>         if (p->rcu_read_lock_count)
>>                 p->prio = MAX_RT_PRIO;
>
> But doesn't this have the same impact on latency as disabling preemption
> in rcu_read_lock() and then re-enabling it in rcu_read_unlock()?

No, RT tasks can still preempt the RCU read-side lock; only SCHED_OTHER and SCHED_BATCH tasks can't. You could also make the RCU read-side boosting priority dynamic and let the system adjust it, or just let the admin adjust it.

Of course. I don't know about hotplug, though. But it sounds simple enough to migrate the tasks away, take the CPU out of the circle, and then forward the last RCU jobs from that CPU.

>> I don't have time to code this, nor an SMP machine to test it on. But
>> I can give the idea to you anyway in the hope you might code it :-)
>
> I am beginning to think that it will not be at all simple by the time I
> code up all the required fixups. Or am I missing something?

Of course, implementing something is always a lot harder than writing the idea down. Anyway, we have already worked out some of the hardest details :-)