Thursday, September 04, 2008

I write a lot of code in sed whenever I need to do some kind of filtering, and I realised that there are several patterns that emerge. Sed is a Stream EDitor, and its capabilities are somewhat limited, yet it does provide for some of the more important things required in a programming language. It has sequence, selection, iteration, variables and debugging statements. In this post, I'll go over each of these.

1. Sequence

I've started with sequence because it's always the easiest to explain. Unless branching is involved, a sed script flows from top to bottom. All statements are executed in sequence, and that's pretty much all I have to say about it. Let's move on.
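As a small illustration of top-to-bottom execution, here is a sketch with two substitutions whose order matters. The input text and the variable name seq_out are made up for this example.

```shell
# Commands run in order on each line: the first s turns "cat" into "dog",
# and the second s then sees "dog" and turns it into "bird".
seq_out=$(printf 'cat\n' | sed -e 's/cat/dog/' -e 's/dog/bird/')
printf '%s\n' "$seq_out"
```

Swapping the two -e expressions would leave "dog" in the output, since the second command would then run before there is any "dog" to replace.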

2. Selection

Selection is where things start happening. There are a few ways to execute a statement based on a condition. That condition almost always deals with a pattern in the current input line, but we'll see later how that can be changed. For now, here's how you do selection:

/pattern/ command

s/pattern/replace/
t label

s/pattern/replace/
T label

The first is a simple "execute this command if the current pattern space matches /pattern/". That's akin to saying if(line.match(/pattern/)) { command; } in more common programming languages. Command could even be a block of commands enclosed in braces like this:

/pattern/ {
    command1
    command2
    command3
}

Let's take a few examples. We'll assume that sed is called without the -n option, so each line is printed once by default.

If the line starts with "hello", add "world" after it:

/^hello/ s/^hello/hello world/

If the current line number is 3, print out the line twice:

3 p

Since each line is printed once by default, the p prints it a second time.

If the line starts with "next", swap it with the next line and print both out:

/^next\>/ {
    N
    s/\(.*\)\n\(.*\)/\2\n\1/
}
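To see the three selection examples run, here is a rough sketch that feeds each one some made-up input. The variable names and sample lines are mine, and the third example assumes GNU sed, since it relies on \> and on \n in the replacement.

```shell
# Example 1: pattern address. Only the line starting with "hello" changes.
sel_hello=$(printf 'hello\ngoodbye\n' | sed '/^hello/ s/^hello/hello world/')

# Example 2: line-number address. Line 3 is printed a second time by p.
sel_three=$(printf 'a\nb\nc\nd\n' | sed '3 p')

# Example 3: a block of commands. N appends the following line to the
# pattern space, and the s command swaps the two halves around the newline.
sel_swap=$(printf 'first\nnext one\nthen this\nlast\n' | sed '/^next\>/ {
    N
    s/\(.*\)\n\(.*\)/\2\n\1/
}')

printf '%s\n---\n%s\n---\n%s\n' "$sel_hello" "$sel_three" "$sel_swap"
```

In the third case "then this" comes out before "next one", while the lines around them are untouched.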

The second and third types of selection are similar: they branch to a label if a preceding substitution succeeded (t) or failed (T). Note that T is a GNU sed extension. These make more sense when looking at iteration, so that's what we'll do now.

3. Iteration

Things always get interesting when you iterate. You can execute the same set of statements over a group of data without knowing in advance what that data is. The b, t and T commands come into play here, along with labels defined with the : command, similar to other programming languages. We'll look at some common loop types from other languages:

While(condition) {...} (loop executed 0 or more times)

:loopstart
/condition/ {
    command1
    command2
    command3
    b loopstart
}

For example, while the input contains ==, append the contents of the file named equals.txt:

:loopstart
/==/ {
    s/==//
    r equals.txt
    b loopstart
}

We can also do this with the T command:

:loopstart
s/==//
T loopend
r equals.txt
b loopstart
:loopend

Though it's a little more clumsy this way because you need two labels. The first method is the code pattern that I use for a while loop.
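Here is a sketch of the while loop actually running. The scratch directory, the one-line contents of equals.txt and the sample input are all assumptions made for the demonstration.

```shell
# Work in a scratch directory so that "r equals.txt" finds the file.
while_dir=$(mktemp -d)
printf 'EQUALS\n' > "$while_dir/equals.txt"

# Each pass through the loop removes one == and queues equals.txt for output.
# Queued files are written after the line itself, at the end of the cycle.
while_out=$(cd "$while_dir" && printf 'a==b==c\nplain\n' | sed ':loopstart
/==/ {
    s/==//
    r equals.txt
    b loopstart
}')
printf '%s\n' "$while_out"
rm -r "$while_dir"
```

The first input line holds two ==, so the file is queued twice and shows up twice after "abc". The line "plain" never enters the loop body and gets no copy at all, which is the "0 or more times" behaviour.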

Do {...} While(condition) (loop executed 1 or more times)

:loopstart
command1
command2
command3
/condition/ b loopstart

This is almost the same as the first loop, except that the condition is tested at the end of the block of statements. Let's take the same example, but this time, we read in the file at least once:

:loopstart
r equals.txt
/==/ {
    s/==//
    b loopstart
}

In this case, using the t command makes it less messy since we need to do the replacement anyway:

:loopstart
r equals.txt
s/==//
t loopstart
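A sketch of the do-while form with t, run against made-up input, shows the "at least once" behaviour. As before, the scratch directory and the contents of equals.txt are assumptions for the demonstration.

```shell
dowhile_dir=$(mktemp -d)
printf 'EQUALS\n' > "$dowhile_dir/equals.txt"

# The r command runs before the test, so every line queues the file once,
# plus once more for each == that the s command manages to remove.
dowhile_out=$(cd "$dowhile_dir" && printf 'plain\na==b\n' | sed ':loopstart
r equals.txt
s/==//
t loopstart')
printf '%s\n' "$dowhile_out"
rm -r "$dowhile_dir"
```

Here "plain" is followed by one copy of the file even though it contains no ==, and "ab" is followed by two.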

The third type of loop is a for loop, which is harder because you can't really do math in sed. Still, if one tries, one can figure out weird ways to count. In this case, we use the hold space:

# We want to print the current line 10 times:

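# 1. Grab the current line into the hold space
h
# 2. Replace the pattern space with one = per repetition
#    (s overwrites the pattern space in place)
s/.*/==========/
# 3. Print the line as long as there are = left:
:loopstart
s/^=//
T loopend
x
p
x
b loopstart
# 4. The counter is used up; delete it so the default print adds nothing
:loopend
d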

In this code, we need to constantly swap between the pattern space and the hold space, since all our operations are done in the pattern space. Which brings us to variables.
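Since T is a GNU extension, here is a sketch of the same counting idea written in the while pattern from earlier, which needs only b. It counts to 3 rather than 10 to keep the output short, and it sets the counter with s, which changes the pattern space directly. The sample input and variable name are made up.

```shell
# h saves the line, s turns the pattern space into a counter of three =,
# and the loop strips one = per pass while printing the saved line.
for_out=$(printf 'hi\nyo\n' | sed 'h
s/.*/===/
:loopstart
/^=/ {
    s/^=//
    x
    p
    x
    b loopstart
}
d')
printf '%s\n' "$for_out"
```

The final d throws away the empty counter, so the default print at the end of the cycle adds nothing and each input line appears exactly three times.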

4. Variables

Well, make that variable, since sed has only one piece of memory that can hold something, and that's called the hold space. The good news though, is that it has no size limit - well, theoretically at least. This means that using your own delimiters, you could store anything in there. JSON anyone? I generally use the newline character as a delimiter, since sed strips the trailing newline from each input line, so one can only appear in the pattern space if your script puts it there, but you can use anything that you think is unique to your application. Here's one way to do it:

# 1. Swap the hold and pattern space
x
# 2. Set the pattern space to the value of your variable using the s, g or G commands:
s/$/\nfoo\n/
G
# 3. Swap the hold and pattern space again
x

# 1. Append the current line to the hold space
H
# 2. Pull the hold space into the pattern space
g
# At this point the pattern space has the newline separated list of
# variables followed by a newline and the current input line
# We can use these variables if we know their position, and replace
# them into the input line:

# 3. Append the previous input line to the first word of the current input line:
s/.*\n.*\n\(.*\)\n\([A-Za-z][A-Za-z]*\)\>/\2\1/

# Now the input line has been modified, and the hold space remains the same
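For a smaller taste of the same idea, here is a sketch that keeps just one value in the hold space: the first input line is stored as the "variable" and then used on every later line. The sample input and the NAME=value output format are invented for the example.

```shell
# Line 1: h copies it into the hold space, d drops it from the output.
# Other lines: G appends a newline plus the stored value to the pattern
# space, and s uses the newline as the delimiter to pick the two parts.
var_out=$(printf 'NAME\nalice\nbob\n' | sed '1 {
    h
    d
}
G
s/\(.*\)\n\(.*\)/\2=\1/')
printf '%s\n' "$var_out"
```

Because only G reads the hold space here and nothing writes to it after line 1, the stored value stays the same for the rest of the input.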

5. Debugging

Finally, we come to debugging, which is extremely useful when writing code in such a strange language. sed has two commands that make debugging possible, though I won't say easy. The = command prints out the current line number of the input file. Note that this is not the line number of the sed script, but of the input that the script reads. The l command prints out the current pattern space in a visually unambiguous way, showing non-printing characters as escapes and marking the end with a $. It's up to you to scatter your code with these commands to figure out what's happening internally. You can always swap the pattern and hold spaces and use the l command to find out what's in your hold space.
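Here is a sketch of what that looks like in practice, using a made-up input line that contains a tab. The -n option is used so that only the debugging output appears.

```shell
# = prints the input line number, l dumps the pattern space with escapes.
# H then appends the line to the hold space, and x / l / x peeks at it.
dbg_out=$(printf 'a\tb\n' | sed -n '=
l
H
x
l
x')
printf '%s\n' "$dbg_out"
```

The tab shows up as \t, and the second dump starts with \n because H always puts a newline in front of what it appends, even when the hold space was empty.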

Apart from the above, sed also has methods to read and write files. We've seen reading above with the r command, and similarly, writing is handled by the w command. There are also R and W commands, but you can read the manual to figure those out. I'll leave sed here.
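For completeness, a minimal sketch of the w command: lines that match are written to a file instead of the usual output. The file name kept.txt, the scratch directory and the sample input are assumptions for the demonstration.

```shell
w_dir=$(mktemp -d)

# -n silences the default print; matching lines go to kept.txt only,
# and cat then shows what ended up in the file.
w_out=$(cd "$w_dir" && printf 'keep 1\ndrop\nkeep 2\n' | sed -n '/^keep/ w kept.txt' && cat kept.txt)
printf '%s\n' "$w_out"
rm -r "$w_dir"
```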