On 19 October 2011 19:19, mark florisson <markflorisson88 at gmail.com> wrote:
> On 19 October 2011 06:01, Robert Bradshaw <robertwb at math.washington.edu> wrote:
>> On Fri, Oct 14, 2011 at 1:07 PM, mark florisson
>> <markflorisson88 at gmail.com> wrote:
>>> On 14 October 2011 19:31, Robert Bradshaw <robertwb at math.washington.edu> wrote:
>>>> On Wed, Oct 12, 2011 at 7:55 AM, mark florisson
>>>> <markflorisson88 at gmail.com> wrote:
>>>>>>> I ultimately feel things like that are more important than 100% coverage of
>>>>>>> the OpenMP standard. Of course, OpenMP is a lot lower-hanging fruit.
>>>>>>
>>>>>> +1 Prange handles the (coarse-grained) SIMD case nicely, and a
>>>>>> task/futures model based on closures would, I think, flesh this out to
>>>>>> the next level of generality (and complexity).
>>>>>
>>>>> Futures are definitely nice. I suppose I really like "inline
>>>>> futures", i.e. OpenMP tasks. I realize that futures may look more
>>>>> pythonic. However, as mentioned previously, I also see issues with
>>>>> that. When you submit a task, you expect a future object back, which
>>>>> you might want to pass around. But we don't have the GIL for that. I
>>>>> personally feel that futures are something that should be done by a
>>>>> library (such as concurrent.futures in Python 3.2), and inline tasks
>>>>> by a language. It also means I have to write an entire function or
>>>>> closure for perhaps only a few lines of code.
>>>>>
>>>>> I might also want to submit other functions that are not closures, or
>>>>> I might want to reuse closures both for tasks and for something else.
>>>>> So what if my tasks contain more parallel constructs? E.g. what if I
>>>>> have a task closure that I return from my function, and that closure
>>>>> itself generates more tasks? Would you just execute them sequentially
>>>>> outside of the parallel construct, or would you simply disallow that?
>>>>> Also, do you restrict future "objects" to only the parallel section?
>>>>>
>>>>> Another problem is that you can only wait on tasks of your direct
>>>>> children. So what if I get access to my parent's future object
>>>>> (assuming you allow tasks to generate tasks), and then want the result
>>>>> of my parent?
>>>>> Or what if I store these future objects in an array or list and access
>>>>> them arbitrarily? You will only know at runtime which task to wait on,
>>>>> and OpenMP only has a static, lexical taskwait.
>>>>>
>>>>> I suppose my point is that without either a drastic rewrite (e.g., using
>>>>> pthreads instead of OpenMP) or quite a few constraints, I am unsure
>>>>> how futures would work here. Perhaps you guys have some concrete
>>>>> syntax and semantics proposals?
>>>>
>>>> It feels to me that OpenMP tasks took a different model of parallelism
>>>> and tried to force it into the OpenMP model/constraints, and so it'd
>>>> be even more difficult to fit them into a nice pythonic interface.
>>>> Perhaps to make progress on this front we need a concrete
>>>> example to look at. I'm also wondering if the standard threading
>>>> module (perhaps with overlay support) used with nogil functions would
>>>> be sufficient--locking is required for handling the queues, etc., so
>>>> the fact that the GIL is involved is not a big deal. It is possible
>>>> that this won't scale down to very small work units, but the overhead
>>>> should be minimal once your work unit is of sufficient size (which is
>>>> probably quite small), and it's already implemented and well
>>>> documented/used.
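
For concreteness, a rough (untested) sketch of that threading-module
approach: a fixed set of worker threads pulling items off a queue, where
process() stands in for whatever nogil cdef function does the real work.

import threading
import queue

def process(item):
    # stand-in for a nogil cdef function doing the real work
    return item * item

def run_pool(items, num_threads=4):
    q = queue.Queue()

    def worker():
        while True:
            item = q.get()
            if item is None:  # sentinel: time to shut down
                break
            process(item)     # in Cython, this call could release the GIL
            q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for item in items:
        q.put(item)
    q.join()                  # wait until every submitted item is processed
    for _ in threads:
        q.put(None)           # one sentinel per worker
    for t in threads:
        t.join()

run_pool(range(100))
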
>>>
>>> It's all definitely possible with normal threads, but the thing you
>>> lose is convenience and conciseness. For big problems the programmer
>>> might summon the courage and effort to implement it, but typically you
>>> will just stick to a serial version. This is really where OpenMP is
>>> powerful: you can take a simple sequential piece of code and make it
>>> parallel with minimal effort and without having to restructure,
>>> rethink and rewrite your algorithms.
>>
>> That is a very good point.
>>
>>> Something like concurrent.futures is definitely nice, but most people
>>> cannot afford to mandate Python 3.2 for their users.
>>>
>>> The most classical examples I can think of for tasks are
>>>
>>> 1) independent code sections, i.e. two or more pieces of code that
>>> don't depend on each other which you want to execute in parallel
>>> 2) traversal of some kind of custom data structure, like a tree or a linked list
>>> 3) some kind of other producer/consumer model
>>>
>>> e.g. using the with task syntax:
>>>
>>> cdef postorder_traverse(tree *t): # bullet 1) and 2)
>>>     with task:
>>>         postorder_traverse(t.left)
>>>     with task:
>>>         postorder_traverse(t.right)
>>>
>>>     taskwait() # wait until we traversed our subtrees
>>>     use(t.data)
>>
>> Is there an implicit parallel block here? Perhaps in the caller?
>
> Yes, it was implicit in my example. If you'd use that code, you'd call
> it from a parallel section. Depending on what semantics you'd define
> (see below), you'd call it either from one thread in the team, or with
> all of them.
>
>>> cdef list_traverse(linkedlist *L): # bullet 2)
>>>     with nogil, parallel():
>>>         if threadid() == 0:
>>>             while L:
>>>                 with task:
>>>                     do_something(L.data)
>>>                 L = L.next # advance to the next node
>>>
>>> In the latter case we don't need a taskwait as we don't care about any
>>> particular order. Only one thread generates the tasks, while the others
>>> just hit the barrier and pick up the tasks they can execute.
>>
>> I guess the fact that Python doesn't have a nice syntax for
>> anonymous functions or blocks makes this syntax more appealing
>> than an explicit closure.
>>
>> Perhaps we could come up with a more pythonic/natural name which would
>> make the intent clear. Makes me want to do something like
>>
>> pool = ThreadPool(10)
>> for item in L:
>>     with pool:
>>         process(item)
>>
>> but then you get into issues of passing the pool around. OpenMP has
>> the implicit pool of the nesting parallel block, so "with one thread"
>> or "with cython.parallel.pool" or something like that might be more
>> readable.
>
> I think "with pool" would be good; it must be clear that the task is
> submitted to a thread pool and hence may be executed asynchronously.
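
For comparison, this is roughly what the closure-based equivalent looks
like today with concurrent.futures (Python 3.2+); process() is again just
a placeholder for real work:

from concurrent.futures import ThreadPoolExecutor

def process(item):
    return item * item  # placeholder

with ThreadPoolExecutor(max_workers=10) as pool:
    futures = [pool.submit(process, item) for item in range(100)]
    results = [f.result() for f in futures]  # the explicit "taskwait"
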
>
>>> The good thing is that the OpenMP runtime can decide at task
>>> generation point (not only at taskwait or barrier points!) to
>>> stop generating more tasks and start executing them. So you won't
>>> exhaust memory if you have lots of tasks.
>>
>> Often threadpools have queues that block when their buffer gets full
>> to achieve the same goal.
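
With the standard library that is just a bounded queue:

import queue

# put() blocks once 128 tasks are pending, throttling the producer
# much like the OpenMP runtime pausing task generation
q = queue.Queue(maxsize=128)
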
>>
>>>> As for critical and barrier, the notion of a critical block as a with
>>>> statement is very useful. Creating/naming locks (rather than them being
>>>> implicit in the file/line number) is more powerful, but is a larger
>>>> burden on the user and more difficult to support with the OpenMP
>>>> backend.
>>>
>>> Actually, as I mentioned before, critical sections do not at all
>>> depend on their line or file number. All they depend on is their implicit
>>> or explicit name (the name is implicit when you simply omit it, so all
>>> unnamed critical sections exclude each other).
>>
>> Ah, yes. In this case "with cython.parallel.lock([optional name])"
>> could be obvious enough.
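
A minimal sketch of that lookup-by-name idea at the Python level
(named_lock is a hypothetical helper here, not the proposed
cython.parallel API):

import threading

_locks = {}
_locks_guard = threading.Lock()

def named_lock(name="<unnamed>"):
    # look up, and lazily create, the lock registered under this name
    with _locks_guard:
        return _locks.setdefault(name, threading.Lock())

# all unnamed sections exclude each other, like OpenMP's unnamed critical:
with named_lock():
    pass  # ... critical section ...
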
>>
>>> Indeed, supporting creation of locks dynamically and allowing them to
>>> be passed around arbitrarily would be hard (and likely not worth the
>>> effort). Naming them is trivial though, which might not be incredibly
>>> pythonic but is very convenient, easy and readable.
>>
>> You can view this as a lookup by name, not a lock creation. Not
>> allowing them to be used outside of a with clause is a reasonable
>> restriction, and does not preclude a (possibly very distant) extension
>> to being able to pass them around.
>>
>>>> barrier, if supported, should be a function call, not a
>>>> context. Not as critical as in the tasks case, but a good example to
>>>> see how it flows would be useful here as well.
>>>
>>> I agree, it really doesn't have any associated code, and trying to
>>> associate code with it is likely more confusing than meaningful. It
>>> was just an idea.
>>> Often you can rely on implicit barriers from e.g. prange, but not
>>> always. I can't think of any real-world example, but you usually need
>>> it to ensure that everyone gets a sane view of some shared data, e.g.
>>>
>>> with nogil, parallel():
>>>     array[threadid()] = func(threadid())
>>>     barrier()
>>>     use array[(threadid() + 1) % omp_get_num_threads()] # access a neighbour's data
>>>
>>> This is a rather contrived example, but (see below) it would be
>>> especially useful if you use single/master/once/first to set some
>>> shared data everyone will operate on (for instance in a prange). To
>>> ensure the data is sane before you use it, you have to put in the barrier
>>> to 1) ensure the data has been written and 2) ensure the data has been
>>> flushed.
>>>
>>> Basically, you'll always know when you need a barrier, but it's pretty
>>> hard to come up with a real-world example for it when you have to :)
>>
>> Yes, I think barriers are self-explanatory enough.
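
The neighbour-exchange pattern above is easy to try out in pure Python
with threading.Barrier (Python 3.2+), which may help pin down the
intended semantics:

import threading

NUM_THREADS = 4
array = [0] * NUM_THREADS
barrier = threading.Barrier(NUM_THREADS)

def work(tid):
    array[tid] = tid * 10                       # everyone fills its own slot
    barrier.wait()                              # all writes done and visible
    neighbour = array[(tid + 1) % NUM_THREADS]  # safe to read a neighbour now
    print(tid, "sees", neighbour)

threads = [threading.Thread(target=work, args=(i,)) for i in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
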
>>
>>>> As for single, I see doing this manually does require boilerplate
>>>> locking, so what about
>>>>
>>>> if cython.parallel.once(): # will return True once per thread group
>>>>     ...
>>>>
>>>> we could implement this via our own locking/checking/flushing to allow
>>>> it to occur in arbitrary expressions, e.g.
>>>>
>>>> special_worker = cython.parallel.once()
>>>> if special_worker:
>>>>     ...
>>>> [common code]
>>>> if special_worker: # single wouldn't work here
>>>>     ...
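
The locking/checking part could look roughly like this at the Python
level (just a sketch; a real implementation would presumably use a
C-level flag plus an OpenMP flush rather than a Python lock):

import threading

class Once:
    # one instance per parallel region; only the first caller gets True
    def __init__(self):
        self._lock = threading.Lock()
        self._taken = False

    def __call__(self):
        with self._lock:
            if self._taken:
                return False
            self._taken = True
            return True
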
>>>
>>> That looks OK. I've actually been thinking that if we have barriers we
>>> don't really need is_master(), once() or single() or anything. We
>>> already have threadid(), and you usually don't care which thread gets
>>> there first; you only care about doing it once. So one could just
>>> write
>>>
>>> if parallel.threadid() == 0:
>>>     ...
>>>
>>> parallel.barrier() # if required
>>
>> Perhaps you want the first free thread to take it up, to minimize idle
>> threads. I agree that "if parallel.threadid() == 0" is a synonym for
>> is_master(), so that's probably not needed. However, what are the OpenMP
>> semantics of
>>
>> cdef f():
>>     with parallel():
>>         g()
>>         g()
>>
>> cdef g():
>>     with single():
>>         ... # executed once, right?
>>     with task:
>>         ... # executed twice, right?
>
> Hmm, not quite. The thing is that function g is called by every thread
> in the team, say N threads, and each time the team encounters the
> single directive it will execute it once. So in total it will execute
> the code in the single block twice, as the team encounters it twice.
>
> It will however create 2N tasks to execute, as every thread that
> encounters the task construct creates a task. This is probably not what
> you want, so you usually want
>
> with parallel():
>     if threadid() == 0:
>         g()
>
> and have the code in g (executed by one thread only) create the tasks.
>
> Note also how 'for _ in prange(1):' would not have the same semantics
> here, as it generates a 'parallel for' and not a worksharing for in
> the function (because we don't support orphaned pranges).
>
> I think this may all be confusing for users. Usually you will
> want to create just a single task, irrespective of whether you are in a
> parallel or a prange, and not "however many threads are in the team for
> parallel and just one for prange because we're sharing work". This
> would also work for orphaned tasks, e.g. you expect 2 tasks in your
> snippet above, not 2N. Fortunately, that would be easy to support.
> We would however have to introduce the same restriction as with
> (implicit) barriers: either all or none of the threads must encounter
> the construct (or maybe loosen it to "if you actually want to create
> the task, make sure at least thread 0 encounters it", which may lead
> users to write more efficient code).
>
>>> It might also be convenient to declare variables explicitly shared
>>> here, e.g. this code will not work:
>>>
>>> cdef int *buf
>>>
>>> with nogil, parallel.parallel():
>>>     if parallel.threadid() == 0:
>>>         buf = ...
>>>
>>>     parallel.barrier()
>>>
>>>     # this will likely segfault, as buf is private because we assigned
>>>     # to it; it's only valid in thread 0
>>>     use buf[...]
>>>
>>> So basically you'd have to do something like (&buf)[0][...], which
>>> frankly looks pretty weird. However, I do think such cases are rather
>>> uncommon.
>>
>> True. Perhaps this could be declared via "with nogil,
>> parallel.parallel(), parallel.shared(buf)" or something like that.
>
> That looks elegant enough.

Likewise, I think something like parallel.private(buf) would also be
really nice for arrays, especially if we also allow arrays with
runtime sizes (behind the scenes we could malloc and free). I think
those cases are much more common than parallel.shared().
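
For reference, this is roughly the by-hand pattern that a runtime-sized
parallel.private(buf) could take over (assigning to buf inside the
parallel block already makes it thread-private):

from libc.stdlib cimport malloc, free
from cython.parallel cimport parallel

def work(Py_ssize_t n):
    cdef int *buf
    cdef Py_ssize_t i
    with nogil, parallel():
        buf = <int *> malloc(n * sizeof(int))  # one private buffer per thread
        for i in range(n):                     # each thread fills its own buffer
            buf[i] = <int> i
        free(buf)
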
>> - Robert