I have a task that processes a list of files on stdin. The start-up time of the program is substantial, and the amount of time each file takes varies widely. I want to spawn a substantial number of these processes, then dispatch work to whichever ones are not busy. There are several different commandline tools that almost do what I want; I've narrowed it down to two almost-working options:
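
Roughly, the two near-misses look like this (sketched; the worker count of 4 is arbitrary):

    # GNU split: deal input lines round-robin to 4 running copies of myjob
    find . -type f | split -n r/4 -u --filter="myjob"

    # GNU parallel: --pipe with -l 1 forks one myjob per input line
    find . -type f | parallel -j 4 --pipe -l 1 myjob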

The problem is that split does a pure round-robin, so one of the processes gets behind and stays behind, delaying the completion of the entire operation; while parallel wants to spawn one process per N lines or bytes of input and I wind up spending way too much time on startup overhead.

Is there something like this that will re-use the processes and feed lines to whichever processes have unblocked stdins?

Where is that split command from? The name conflicts with the standard text processing utility.
– Gilles Oct 9 '12 at 22:27

@Gilles, it's the GNU one: "split (GNU coreutils) 8.13". Using it as a weird alternative to xargs is probably not the intended use, but it's the closest thing to what I want that I've found.
– BCoates Oct 9 '12 at 22:44

I've been thinking about that, and a fundamental problem is knowing that an instance of myjob is ready to receive more input. There is no way to know that a program is ready to process more input; all you can know is that some buffer somewhere (a pipe buffer, an stdio buffer) is ready to receive more input. Can you arrange for your program to send some kind of request (e.g. display a prompt) when it's ready?
– Gilles Oct 10 '12 at 1:21

Assuming that the program isn't using buffering on stdin, a FUSE filesystem that reacts to read calls would do the trick. That's a fairly large programming endeavor.
– Gilles Oct 10 '12 at 1:42

Why are you using -l 1 in the parallel args? IIRC, that tells parallel to process one line of input per job (i.e. one filename per fork of myjob, so lots of startup overhead).
– cas Oct 10 '12 at 4:40

8 Answers

I don't think so. My favorite magazine once ran an article on bash programming that did what you want. I'm willing to believe that if there were tools to do that, it would have mentioned them. So you want something along the lines of:
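
Something along these lines (a sketch; myjob stands in for your program, and note that this variant still forks one myjob per input file, it merely caps how many run at once):

    #!/bin/bash
    numprocs=4

    while read -r file; do
        # block until fewer than $numprocs jobs are running
        while [ "$(jobs -pr | wc -l)" -ge "$numprocs" ]; do
            sleep 0.1
        done
        myjob "$file" &
    done
    wait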

Obviously you may change the invocation of the actual worker script to your liking. The magazine article I mentioned initially does things like setting up pipes and actually starting worker threads. Check out mkfifo for that, but that route is far more complicated, as the worker processes need to signal the master process that they are ready to receive more data. So you need one fifo per worker process, for the master to send it data, and one fifo for the master process to receive stuff from the workers.
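
To illustrate the shape of that (a sketch; the 'ready' fifo, the per-worker 'workN' fifos, and the DONE sentinel are all naming conventions I just made up):

    #!/bin/bash
    numprocs=4
    mkfifo ready
    exec 3<> ready    # hold both ends open so announcements are never dropped

    for i in $(seq "$numprocs"); do
        mkfifo "work$i"
        {
            while true; do
                echo "work$i" > ready          # announce: this worker is idle
                read -r file < "work$i"
                [ "$file" = DONE ] && break
                myjob "$file"
            done
        } &
    done

    # dispatch each input line to whichever worker announced itself first
    while read -r file; do
        read -r w <&3
        printf '%s\n' "$file" > "$w"
    done

    # tell every worker to shut down, then clean up
    for i in $(seq "$numprocs"); do
        read -r w <&3
        echo DONE > "$w"
    done
    wait
    rm -f ready work[0-9]*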

DISCLAIMER
I wrote that script off the top of my head. It may have some syntax issues.

Depending on your application (myjob), you might be able to use jobs -s to find stopped jobs. Otherwise, list the processes sorted by CPU and select the one consuming the fewest resources. Or have the job itself report, e.g. by setting a flag in the file system when it wants more work.

Assuming the job stops when waiting for input, use jobs -sl to find out the PID of a stopped job and assign it work, for example:
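
A sketch of that (assuming the workers are stopped by SIGTTIN because they read the terminal from the background; how the assigned line actually reaches the chosen worker's stdin is up to your plumbing, e.g. a per-worker fifo):

    # background jobs that read the terminal are stopped with SIGTTIN
    myjob &
    myjob &

    # the PID is the second field of the long jobs listing
    pid=$(jobs -sl | awk '{print $2; exit}')

    # once its stdin has been fed, let the chosen job continue
    kill -CONT "$pid"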

That doesn't look possible in such a general case. It implies you have a buffer for each process and can watch the buffers from the outside to decide where to put the next entry (scheduling)... Of course, you could write something yourself, or use a batch system like Slurm.

But depending on what the process is, you might be able to pre-process the input. For example, if you want to download files, update entries in a DB, or similar, but 50% of them will end up being skipped (and you therefore have a large processing-time difference depending on the input), then just set up a pre-processor that verifies which entries are going to take long (file exists, data was changed, etc.), so that whatever comes out the other side is guaranteed to take a fairly equal amount of time. Even if the heuristic is not perfect, you might end up with a considerable improvement. You can dump the others to a file and process them afterwards in the same manner.
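
As a sketch of that idea (the cache-existence test, file_list, and the path are just stand-ins for whatever your real "will this be skipped?" check is):

    # route the cheap, skippable entries to a file for a later pass and
    # feed only the expensive ones to the workers
    while read -r file; do
        if [ -e "/var/cache/myjob/$file" ]; then
            echo "$file" >> skipped.list    # fast case: process afterwards
        else
            printf '%s\n' "$file"           # slow case: goes to the workers
        fi
    done < file_list | split -n r/4 -u --filter="myjob"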

Expounding on @ash's answer, you can use a SysV message queue to distribute the work. If you don't want to write your own program in C, there is a utility called ipcmd that can help. Here's what I put together to pass the output of find $DIRECTORY -type f to $PARALLEL processes:
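
Roughly like this (a reconstruction: it assumes ipcmd exposes msgget/msgsnd/msgrcv subcommands mirroring the SysV calls and picks up the queue id via IPCMD_MSQID, and the DONE sentinel is my own convention; check ipcmd's documentation for the exact usage):

    #!/bin/bash
    set -o errexit -o nounset

    DIRECTORY=$1
    PARALLEL=$2

    # create a message queue; the other ipcmd calls find it via this variable
    export IPCMD_MSQID=$(ipcmd msgget)
    trap 'ipcrm -q "$IPCMD_MSQID"' EXIT    # remove the queue when we exit

    # start $PARALLEL workers, each pulling one filename at a time
    for i in $(seq "$PARALLEL"); do
        while true; do
            file=$(ipcmd msgrcv)
            [ "$file" = DONE ] && break    # sentinel: no more work
            myjob "$file"
        done &
    done

    # enqueue the work, then one shutdown sentinel per worker
    find "$DIRECTORY" -type f | while read -r f; do ipcmd msgsnd "$f"; done
    for i in $(seq "$PARALLEL"); do ipcmd msgsnd DONE; done
    wait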

No, there isn't a generic solution. Your dispatcher needs to know when each program is ready to read another line, and there's no standard I'm aware of that allows for that. All you can do is put a line on STDOUT and wait for something to consume it; there's no good way for the producer in a pipeline to tell whether the next consumer is ready or not.

Unless you can estimate how long a particular input file will take to process, or the worker processes can report back to the scheduler (as they do in normal parallel computing scenarios, often through MPI), you are generally out of luck: either pay the penalty of some workers processing input for longer than others (because of the inequality of the input), or pay the penalty of spawning a single new process for every input file.