I was just running a few commands in a terminal and I started wondering, does Unix/Linux take shortcuts when running piped commands?

For example, let's say I have a file with one million lines, the first 10 of which contain hello world. If you run the command grep "hello world" file | head, does the first command stop as soon as it finds 10 lines, or does it continue to search the entire file first?

4 Answers

Sort of. The shell has no idea what the commands you are running will do; it just connects the output of one to the input of the other.

If grep finds more than 10 lines that say "hello world", head will have all 10 lines it wants and will exit, closing its end of the pipe. The next time grep writes to the pipe it will be killed with a SIGPIPE, so it does not need to continue scanning a very large file.

So I guess, due to race conditions, grep might have read the 11th or 12th match already, but probably not the 100 thousandth?
– user unknown, Jan 25 '12 at 16:12


This depends in part on the length of the lines and the size of the pipe buffer, but the short answer is that grep will read some reasonably limited amount of extra data before being killed.
– dmckee, Jan 25 '12 at 17:47

When a program tries to write to a pipe and there is no process reading from that pipe, then the writer program receives a SIGPIPE signal. The default action when a program receives SIGPIPE is to terminate the program. A program can choose to ignore the SIGPIPE signal, in which case the write returns an error (EPIPE).
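The two outcomes described above can be observed with a small experiment. This is only a sketch: the child program, the exit code 42, and the name write_to_closed_pipe are made up for illustration, and it assumes a POSIX system. The child writes to a pipe whose read end has already been closed, once with the default SIGPIPE action and once with SIGPIPE ignored.

```python
import signal
import subprocess
import sys

# Hypothetical child program: it sets its SIGPIPE disposition from argv,
# waits for EOF on stdin, then writes to a stdout pipe that has no reader.
CHILD = r"""
import errno, os, signal, sys
signal.signal(signal.SIGPIPE, getattr(signal, sys.argv[1]))
sys.stdin.read()                 # returns once the parent closes our stdin
try:
    os.write(1, b"data\n")       # no reader: killed by SIGPIPE, or EPIPE error
except OSError as exc:
    sys.exit(42 if exc.errno == errno.EPIPE else 1)
"""


def write_to_closed_pipe(disposition):
    # disposition is the name of a handler constant: "SIG_DFL" or "SIG_IGN".
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD, disposition],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )
    child.stdout.close()   # close the only read end of the child's stdout pipe
    child.stdin.close()    # EOF on stdin tells the child to go ahead and write
    return child.wait()    # negative value means "killed by that signal"


if __name__ == "__main__":
    print("default action:", write_to_closed_pipe("SIG_DFL"))
    print("signal ignored:", write_to_closed_pipe("SIG_IGN"))
```

With the default action the child should be reported as killed by SIGPIPE (a return value of -signal.SIGPIPE); with the signal ignored, the write fails with EPIPE and the child exits with the illustrative status 42.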

In your example, here's a timeline of what happens:

The grep and head commands start up in parallel.

grep reads some input, starts processing it.

At some point, grep produces a first chunk of output.

head reads that first chunk and writes it out.

Assuming there are enough lines after the first 10 matches (otherwise grep might terminate first), eventually head will have printed out the desired number of lines. At this point, head exits.

Depending on the relative speed of the grep and head processes, grep may have accumulated some data and not printed it out yet. At the time head exits, grep may be reading input or doing internal processing, in which case it'll continue to do so.

Soon grep will write out the data it's processed. At that point, it'll receive a SIGPIPE and die.
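To get a feel for how far the writer runs ahead, the timeline above can be imitated roughly as follows. This is a sketch under assumptions, not a model of grep itself: the producer is a hypothetical stand-in that emits one line per write, and because Python ignores SIGPIPE at startup it sees a BrokenPipeError (EPIPE) instead of being killed, which lets it report how many lines it managed to write. The parent plays the role of head.

```python
import subprocess
import sys

# Hypothetical stand-in for grep: emits "matching" lines forever, one
# write() per line, and reports on stderr how many lines it wrote before
# the pipe broke. A real grep would be killed by SIGPIPE instead.
PRODUCER = r"""
import os
n = 0
try:
    while True:
        os.write(1, b"hello world %d\n" % n)
        n += 1
except BrokenPipeError:
    os.write(2, str(n).encode())
"""


def run_pipeline(wanted=10):
    producer = subprocess.Popen(
        [sys.executable, "-c", PRODUCER],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # Play the role of head: take the first `wanted` lines, then hang up.
    lines = [producer.stdout.readline() for _ in range(wanted)]
    producer.stdout.close()
    # The producer notices the broken pipe on its next write and reports.
    written = int(producer.stderr.read().decode())
    producer.stderr.close()
    producer.wait()
    return lines, written


if __name__ == "__main__":
    lines, written = run_pipeline()
    print("reader kept", len(lines), "lines; writer produced", written)
```

The number of lines written varies from run to run and from system to system, since it depends on the pipe buffer size and on scheduling, but it stays bounded rather than growing with the size of the input.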

It's likely that grep will process a little more input than strictly necessary, but typically only a small amount, from a few kilobytes up to roughly the size of the pipe buffer:

head typically reads in chunks of a few kilobytes (this is called buffering, and it is more efficient than issuing a read system call for each byte), so the remainder of the last chunk after the desired last line is discarded.

There may be some data in transit, as pipes have an associated buffer managed by the kernel. Its size is system-dependent: POSIX only guarantees that writes of up to PIPE_BUF bytes (at least 512) are atomic, while the default pipe capacity on modern Linux is 64 KiB. This data will be discarded.

grep may have accumulated some data that's ready to become an output chunk (buffering again). It'll receive SIGPIPE when it's trying to flush its output buffer.
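The size of the kernel's pipe buffer on a particular machine can be estimated by filling a pipe that nobody reads from. The following is a minimal sketch, assuming a POSIX system; pipe_capacity is a name invented here, and the result is system-dependent, so no expected output is shown.

```python
import os


def pipe_capacity(chunk_size=4096):
    # Fill a fresh pipe without ever reading from it and count the bytes
    # the kernel accepts before a write would have to block.
    read_fd, write_fd = os.pipe()
    os.set_blocking(write_fd, False)   # a full pipe raises instead of blocking
    chunk = b"x" * chunk_size
    total = 0
    try:
        while True:
            total += os.write(write_fd, chunk)
    except BlockingIOError:
        pass                           # EAGAIN: the kernel buffer is full
    finally:
        os.close(read_fd)
        os.close(write_fd)
    return total


if __name__ == "__main__":
    print("this pipe accepted", pipe_capacity(), "bytes before filling up")
```

This figure is an upper bound on the "data in transit" item above; it does not include whatever the writer and reader hold in their own user-space buffers.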

All in all, the system is carefully designed so that filtering utilities naturally behave efficiently. Programs that need to keep going when their output channel dies out must take the step of ignoring the SIGPIPE signal.

Sort of. In a pipeline, both commands are started and run concurrently, connected by a pipe; the first command does not run to completion before the second one begins.

That is, let A|B be the command given. It is uncertain whether A or B starts first. They might start at exactly the same time if there are multiple CPUs. A pipe can hold an unspecified but finite amount of data.

If B tries to read from the pipe, but no data is available, B will wait until the data arrives. If B was reading from a disk, B might have the same problem and need to wait until a disk read finishes. A closer analogy would be reading from a keyboard. There, B would need to wait for a user to type. But in all of these cases, B has started a "read" operation and must wait until it finishes. If B needs only part of A's output, B exits once it has read enough, and A is then killed by SIGPIPE the next time it tries to write to the pipe.

If A tries to write to the pipe and the pipe is full, A must wait for some room in the pipe to become free. A could have the same problem if it was writing to a terminal. A terminal has flow control and can moderate the pace of data. In any event, to A, it has started a "write" operation and will wait until the write operation finishes.
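The waiting behaviour on both ends can be made visible without actually blocking, by asking the kernel whether a read or a write would have to wait. The sketch below assumes a POSIX system and uses select with a zero timeout; pipe_states and the dictionary keys are names invented for this illustration.

```python
import os
import select


def would_not_block(read_fds, write_fds):
    # Zero timeout: just report which descriptors are ready right now.
    readable, writable, _ = select.select(read_fds, write_fds, [], 0)
    return bool(readable), bool(writable)


def pipe_states():
    r, w = os.pipe()
    states = {}

    # Empty pipe: a read by B would have to wait.
    states["empty_readable"] = would_not_block([r], [])[0]

    # Once A has written something, B's read can return immediately.
    os.write(w, b"some output from A\n")
    states["after_write_readable"] = would_not_block([r], [])[0]

    # Fill the pipe completely: a further write by A would have to wait.
    os.set_blocking(w, False)
    try:
        while True:
            os.write(w, b"x" * 4096)
    except BlockingIOError:
        pass
    states["full_writable"] = would_not_block([], [w])[1]

    # After B drains the pipe, A can write again.
    os.set_blocking(r, False)
    try:
        while os.read(r, 65536):
            pass
    except BlockingIOError:
        pass
    states["drained_writable"] = would_not_block([], [w])[1]

    os.close(r)
    os.close(w)
    return states


if __name__ == "__main__":
    for name, value in pipe_states().items():
        print(name, value)
```

On Linux this should report that the empty pipe is not readable, the pipe with data is readable, the full pipe is not writable, and the drained pipe is writable again, which is the flow control described in the two paragraphs above.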

A and B are behaving as co-processes, although not all co-processes will be communicating with a pipe. Neither is in full control of the other.

grep has no direct control of the pipe (it just writes data into it), and the pipe has no direct control of grep (it just carries the data to the reader)...

What grep, or any other program, does is entirely up to that program's internal logic. If you tell grep via command line options to exit early once enough matches are found (for example -m NUM), then it will; otherwise it will chug on to the very end of the file looking for the pattern, unless a signal such as SIGPIPE stops it first...

The Terminal is likewise quite disconnected from the internal workings of grep and the shell's piping actions... The Terminal is basically just a launching pad and an output display...