Escape the scripter mentality if reliability matters

I'd like to present a problem to trigger the gut instinct of those
reading it. Given the problem statement, and reading this linearly, how
might you handle it?

Let's say I want you to get the fifth word of the fourth paragraph of
the third column of the second page of the first edition of the local
paper in a town to be determined. It's going to be a color, and we have
a little deal with that paper to get our data plugged in every day. You
can get the feed from their web site.

At this point, you're probably thinking of something like this: wget or
curl, then either dump it in a temp file, or just go straight on and
pipe it into some processing tools. Maybe you go for grep, cut, sed and
friends. Perhaps you break out perl or python.
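That quick-hack instinct might look something like this minimal sketch.
The page text here is a stand-in for whatever curl or wget would fetch;
the layout and the word are made up for illustration:

```python
# One-off hack: fifth word of the fourth paragraph, no error handling.
# In real life this string would come from curl/wget; this sample is
# hypothetical.
page = """First paragraph here.

Second one.

Third one.

The answer is plainly mauve today friends."""

paragraphs = page.split("\n\n")
word = paragraphs[3].split()[4]   # fourth paragraph, fifth word
print(word)                       # mauve
```

It works once, on this input, and that's all it promises.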

The point is simple enough: I asked you to get one piece of data one
time. You should probably put something simple together that'll run once
and that's it. It's not worth the trouble to go much beyond that. It
doesn't matter how much baling wire or duct tape it takes, since once I
get my answer (orange, blue, lavender, mauve, teal?), I'm done.

Okay, so, now that you're comfortable with that, I'm going to change it
up to make a point.

Now I want you to be able to do this reliably every day for the next two
years. I need this data on a regular delivery schedule and it can't
rely on some human being there to constantly fine-tune things.

At this point, the gears in your head should start turning. Whatever
you come up with, I hope it doesn't resemble the mass of duct-taped gunk
you thought of before. This is a different problem with a completely
different set of requirements, even though the very core is unchanged!
I still need "5w, 4p, 3c, 2p, 1e", but now it has to hold up to all
kinds of craziness that a year might bring.

If you think your shell script abomination is going to work flawlessly
for 730 days (or 731 if we're talking about a leap year!), you have far
more confidence than I do. You have to go beyond that and actually
think about all of the corner cases.

What if there's no first edition that day? Maybe the town has a massive
earthquake and they fail to put out a paper. Or perhaps they do manage
to put one out, but it's the front side of a single sheet of paper. How
about some more failure modes? That second page may exist, but perhaps
it doesn't have a third column, or a fourth paragraph, or maybe that
fourth paragraph doesn't have five words!

Short paragraphs happen!

Okay, what if all of those things do exist, but then the word you find
isn't actually a color? Maybe it's "twenty-seven" or "127.0.0.1" or
"0xF00FC7C8" or anything else. You might have found some data, but it's
completely useless to me.
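If you wanted to survive those failure modes, the extraction would have
to check each assumption and fail loudly with a specific reason. Here's
a sketch of that idea; the color whitelist is a hypothetical stand-in
for whatever validation the real job would need:

```python
# Hypothetical set of acceptable answers; the real list would come from
# whoever defined the task.
KNOWN_COLORS = {"orange", "blue", "lavender", "mauve", "teal"}

def extract_color(page_text):
    """Return the fifth word of the fourth paragraph, or raise a
    ValueError naming exactly which assumption broke, so the failure
    can be logged and alerted on instead of silently producing junk."""
    paragraphs = [p for p in page_text.split("\n\n") if p.strip()]
    if len(paragraphs) < 4:
        raise ValueError("fewer than four paragraphs")
    words = paragraphs[3].split()
    if len(words) < 5:
        raise ValueError("fourth paragraph has fewer than five words")
    word = words[4].strip(".,;:").lower()
    if word not in KNOWN_COLORS:
        raise ValueError(f"expected a color, got {word!r}")
    return word

good_page = "p1\n\np2\n\np3\n\nThe sky looked quite teal today."
print(extract_color(good_page))  # teal
```

Every `raise` in there corresponds to one of the failure modes above:
no fourth paragraph, a short paragraph, or a word that isn't a color.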

If handling this gracefully matters to you, a simple hack will not cut
it. You're going to have to invest the time to do it right up front.
If you don't, you run the risk of having to spend at least as much time
after the fact cleaning up a mess and making excuses for your own
disaster.

By now you're probably wondering where this rant came from. The other
night, I spotted a big chunk of code which was put forth as a
demonstration of someone's abilities. It was a collection of shell
scripts which had a ton of glue.

Script #1 was to be run on a distant host. It ran a data capture tool
and sent it through things like grep and sed to filter some of the
details, and then it sent it into a small network tool. Think of "hose"
and "faucet", if you are familiar with them.

Script #2 then used the companion network tool to call over to script
#1's machine. It then sent the whole mess through another bunch of grep
and sed pipelines to filter it some more, and piped it into script #3.

Script #3 was a loop to read stdin and break it into chunks. Each chunk
was then filtered through things like tail, awk, cat and sort, and then
wound up in a series of temporary files.

Finally, script #4 contained a call out to some data processing tool
which would build a graphical representation of the input. It actually
created a FIFO, started that tool listening to one end, then ran its own
loop and wrote to the other end. This ran until you aborted it.

The results look impressive enough. You can start this suite up on your
test environment and start getting numbers back. They'll come back over
the network and will show up on a local display. If that's all you want
to do, and you want to do it right now, you're done! Yay!

The problem is what happens when this thing outlives its welcome. What
if you need this day after day? Or what happens if it's supposed to run
continuously? We're no longer talking about a series of things you can
just start in a bunch of xterms or even screen sessions. It needs to
graduate to a higher level of service.

At that point, it is no longer appropriate to handle this problem that
way. For one thing, all of this parsing has to go right out the
window. The data in question is already in a nice form on the source
machine. Out there, values like 123 are in fact a single byte with the
value of 123! They haven't been turned into a human-readable stream of
characters which happens to include "foo bar 123 blah blah" somewhere.

What happens if this thing was written some time before October 1st? It
might be parsing "3/14 12:17 foo bar 123 blah blah". It'll work fine...
until October 10th rolls around, and then you have "10/10 12:17 foo bar
123 blah blah", and suddenly your data is offset by one character!
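The breakage is easy to demonstrate. Here's a sketch using those two
sample lines, with a fixed-column extraction of the kind a cut-based
pipeline would do (the column offsets are whatever happened to work on
the March data):

```python
# Two log lines in the same format; only the date width differs.
march = "3/14 12:17 foo bar 123 blah blah"
october = "10/10 12:17 foo bar 123 blah blah"

# Fixed-column parsing: grab characters 19..21, which happened to land
# on "123" when the script was written in March.
print(march[19:22])    # 123  -- works by luck
print(october[19:22])  # the wider date shifted everything over by one
```

The October line yields garbage because the two-digit month pushed the
whole line one character to the right.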

"Oh, split on spaces and use fields instead of using raw columns" you
might say. To that, I just say this: why even let it become ASCII in
the first place if you're going to operate on it as data?
Nicely-formatted lines of text are for humans. Let computers talk to
each other in something that's not going to be (as) ambiguous, already.
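For a concrete contrast, here's a sketch of keeping a value as data
end to end, using Python's struct module as one possible encoding:

```python
import struct

# Sender: the value 123 stays a number -- one byte on the wire, not the
# three-character string "123" embedded in a line of prose.
payload = struct.pack("!B", 123)
assert payload == b"\x7b"    # 123 is a single byte, 0x7b

# Receiver: no splitting, no column counting, no ambiguity.
(value,) = struct.unpack("!B", payload)
print(value)  # 123
```

The receiver never has to guess where the number starts or how wide the
date field is this month; the encoding says exactly what's there.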

Incidentally, there is also a small amount of software written in the
last three months of any year which will break in the next couple of
hours as December 31st rolls over to January 1st.

None of this should matter to you! If you are having to worry about
better ways to parse data which is coming from a machine, then you have
already lost. Instead of building better parsers, focus on finding a
way to transport the data in its native form. Everyone will be happier
for it.

This gets into a whole thing I call "scripter mentality". It seems like
some people would rather call (say) tcpdump and parse the results
instead of writing their own little program which uses libpcap. Calling
tcpdump means you have to do the pipe, fork, dup2, exec, parse thing.
Using libpcap means you just have to deal with a stream of data arriving
that you'd have to chew on anyway.
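To make the "scripter" path concrete, here's a sketch of the
fork-and-parse dance in Python, using echo as a harmless stand-in for
tcpdump so it actually runs (the output line is made up):

```python
import subprocess

# The scripter path: fork a tool, capture its text output, then parse
# that text back into data. With the real tcpdump you'd also be on the
# hook for partial lines, exit codes, signals, and whatever format
# changes the tool's next version brings.
proc = subprocess.run(
    ["echo", "12:17:03 IP 10.0.0.1 > 10.0.0.2: length 123"],
    capture_output=True, text=True, check=True,
)
fields = proc.stdout.split()
length = int(fields[-1])   # fragile: depends on the tool's exact wording
print(length)              # 123
```

With libpcap you'd get the packet as a structure from the start and skip
this round trip through text entirely.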

I've read The Unix Philosophy. I even agree with most of it. But
there is a time and a place for
elaborate | series | of | pipelines, and building long-lived reliable
server operations which can handle errors well without human
interactions is not one of them.

I might be persuaded to look the other way on text-based communications
if you can control both ends strictly and you aren't a bozo who will
create an ambiguous grammar. Otherwise, stick to whatever kind of
encoding floats your boat.

Or you can just sit there and work on parsers for the rest of your
career. I know what I'd rather be doing.