task ordering for multistage algorithms in CAL

Many multistage parallel algorithms reuse buffers in subsequent stages. For example, an algorithm may process data from buffer A to buffer B in odd steps and from B to A in even steps. Running tasks via calCtxRunProgram() just queues them in a CAL context. Waiting for each stage to finish guarantees that the next task won't start until the previous one has finished, but this is not very efficient.
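To make the buffer reuse pattern concrete, here is a minimal CPU-only sketch of the A/B alternation in plain C. It does not use CAL at all: run_stage() and run_ping_pong() are hypothetical names, and the per-stage computation (each element plus its left neighbour) is only a placeholder. In the real code each call to run_stage() would correspond to rebinding the input and output resources and issuing calCtxRunProgram(); the sketch only shows why stage k+1 must not start before stage k has finished writing.

```c
#include <stddef.h>

/* Hypothetical stand-in for one kernel stage: reads src, writes dst.
 * Placeholder computation: each output is the input element plus its
 * left neighbour (wrapping around at index 0). */
static void run_stage(const float *src, float *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        size_t left = (i + n - 1) % n;
        dst[i] = src[i] + src[left];
    }
}

/* Runs `stages` stages, swapping buffer roles after each one:
 * step 1 goes A -> B, step 2 goes B -> A, and so on.
 * Returns a pointer to whichever buffer holds the final result. */
static float *run_ping_pong(float *a, float *b, size_t n, unsigned stages)
{
    float *src = a;
    float *dst = b;
    for (unsigned s = 0; s < stages; ++s) {
        /* Stage s reads what stage s-1 wrote, so it depends on the
         * previous stage having completed. */
        run_stage(src, dst, n);

        /* Swap roles: the buffer just written becomes the next input. */
        float *tmp = src;
        src = dst;
        dst = tmp;
    }
    return src;
}
```

Here the ordering is trivially guaranteed because the loop is sequential. The question is whether the same guarantee holds when the stages are only queued in a CAL context.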

Is it safe to queue all the tasks, then call calCtxFlush() and wait for the last task to finish? If not, how can I avoid waiting for each task separately, or at least queue the next task while the previous one is running?

I see that Brook+ probably does this in some way, but I wasn't able to dig through its runtime library to see exactly what it does.

Another question: is it possible to obtain the domain dimensions (as passed to calCtxRunProgram()) inside the kernel function, or do I need to pass them explicitly?