Pools

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

The basic use is to throw an iterable at a pool object:

span style="color: #a05050; font-style: italic;"># pool size
# map is one choice, there are others you may want

Notes:

pools are reusable, and hold on to their worker processes, so when you are done with them you should close() them

close() is the clean way: it lets outstanding work finish, then removes the processes

if your code wants to wait until things are done, call close(), then join()

terminate() is the heavy-handed way: it kills current workers immediately, discarding outstanding work

it is also what the garbage collector does to pool objects, which is often not exactly what you want

A worker process will be reused for multiple tasks

You can limit the number of tasks a worker handles before it is replaced with a new worker (see maxtasksperchild, below)

the chunksize argument controls how many jobs go to an individual worker at a time (see also the map/imap difference)
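The notes above can be sketched as a minimal lifecycle (the work function here is made up):

```python
import multiprocessing

def work(x):
    return x * 2

pool = multiprocessing.Pool(4)          # four worker processes
results = pool.map(work, range(10))     # hand out jobs, wait for all results

pool.close()                            # no more tasks; let outstanding work finish
pool.join()                             # wait for the worker processes to exit
print(results)
```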

on the choice of map:

if you can determine all jobs before starting, a single map is probably easiest

apply hands in a single job; map hands in multiple

map_async does not block the calling process, but only gives the results in a single chunk once everything is done

imap variants give you results as they finish

for more details, see below

Map variant details

The options here are:

pool.map(func, iterable[, chunksize])

waits around for all results before returning

order of returned values is the same as the input

pool.map_async(func, iterable[, chunksize[, callback]])

does not wait around for all results

you can poll the returned AsyncResult, which will only give data when all tasks are done (verify)

pool.imap(func, iterable[, chunksize])

yields results as they come in

takes items from the source iterable to the workers chunksize at a time (default 1)

lets you work with a generator and avoid the memory cost of settling a very large list before chunking it up (a small chunksize also means measurably more CPU overhead, so don't leave it at 1 unless you have a reason to)
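To make the differences concrete, a small comparison (the work function is made up):

```python
import multiprocessing

def work(x):
    return x * x

pool = multiprocessing.Pool(4)

# map: blocks until all results are in, in input order
mapped = pool.map(work, range(6))
print(mapped)                          # [0, 1, 4, 9, 16, 25]

# map_async: returns an AsyncResult immediately; .get() blocks for all results at once
async_res = pool.map_async(work, range(6))
print(async_res.get())                 # [0, 1, 4, 9, 16, 25]

# imap: yields results one at a time, still in input order
for r in pool.imap(work, range(6), chunksize=2):
    print(r)

# imap_unordered: yields results in completion order, which may differ from input order
unordered = sorted(pool.imap_unordered(work, range(6)))

pool.close()
pool.join()
```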

Counting finished jobs

The easiest way is to use imap_unordered (or imap), as they yield results as they finish.

Example:

```python
import time
import multiprocessing

def work(item):
    time.sleep(0.5)
    return 'bla'

pool = multiprocessing.Pool(4)
starttime = time.time()
count = 0
for res in pool.imap_unordered(work, range(8)):
    count += 1
    print("[%5.2f] res:%s count:%s" % (time.time() - starttime, res, count))
pool.close()
pool.join()
```

Adding jobs while it works


That is, use the same pool object for jobs you did not initially know about.

...and avoid the zombie problem.

Generator to imap()

When you don't mind the call blocking, it is probably most convenient to:

hand in a generator

...to an imap variant

The pool keeps doing work as long as the generator yields things.
When the generator is exhausted (that is, when it raises StopIteration),
and the imap function has finished the jobs for everything it yielded, the imap call returns.
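A minimal sketch of that pattern (the generator and work function are made up):

```python
import multiprocessing

def work(x):
    return x + 1

def job_source():
    # the pool pulls from this generator as workers become free,
    # so the full job list never has to exist in memory at once
    for i in range(20):
        yield i

pool = multiprocessing.Pool(4)
# blocks until the generator is exhausted and all its jobs are done
results = list(pool.imap(work, job_source(), chunksize=4))
pool.close()
pool.join()
print(results)
```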

Abrupt kills messing with communication

Process.terminate, os.kill, and such can break the communication channels a bit, so you may wish to avoid them when they are not necessary.

It's mentioned in the documentation; just worth repeating.

Pool zombies

If you have a script that creates pools, then forgets about them, it leaves zombies behind (won't show as Z state(verify), but same concept).

This should rarely be noticeable because:

a single-run script will clean up when it quits

a pool will reuse its processes

The easiest way to create a problem is to create a new pool for each set of jobs and never clean up (it turns out thousands of zombies will crash the system).

The full story is that cleanup happens during a pool.join(), which can only happen after pool.close(), so:

```python
# create pool, do stuff, then:
pool.close()
pool.join()
```

Another way is to abuse the fact that multiprocessing.active_children() has the side effect of join()ing its children to check on them (so does Process.is_alive(), but that is per-process work). So you could do:

```python
import time
import multiprocessing

while len(multiprocessing.active_children()) > 0:
    print("Waiting for %d workers to finish" % len(multiprocessing.active_children()))
    time.sleep(0.5)
```

Relevant:

Pool.join()
Wait for the worker processes to exit
One must call close() or terminate() before using join().
Pool.close()
Prevents any more tasks from being submitted to the pool.
Once all the tasks have been completed the worker processes will exit.
Pool.terminate()
Stops the worker processes immediately without completing outstanding work.
When the pool object is garbage collected terminate() will be called immediately.
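Related detail: in Python 3, pool objects can also be used as context managers, and leaving the with-block calls terminate(), not close()/join(). A minimal sketch (the work function is made up):

```python
import multiprocessing

def work(x):
    return x * x

# on leaving the with-block, terminate() is called; that is fine here
# because map() has already collected every result before we leave
with multiprocessing.Pool(4) as pool:
    results = pool.map(work, range(5))
print(results)
```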

Memory leaks

Processes are reused.

If that eventually leads to memory leak problems,
you may wish to look at maxtasksperchild (argument to the Pool constructor)
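A sketch of that option (the limit of 50 tasks and the pid-returning work function are made up, to make the worker replacement visible):

```python
import multiprocessing
import os

def work(x):
    # return the worker's pid so we can see workers being replaced
    return os.getpid()

# each worker process is retired after 50 tasks and replaced by a fresh one,
# which also discards whatever memory it had leaked by then
pool = multiprocessing.Pool(processes=2, maxtasksperchild=50)
pids = pool.map(work, range(200), chunksize=1)
pool.close()
pool.join()
print("distinct worker processes used: %d" % len(set(pids)))
```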