Arjan van der Gaag is a thirtysomething software developer,
historian and all-round geek. This is his blog about Ruby, Rails,
Javascript, Git, CSS, software and the web.

Shell scripting to the rescue

I love Ruby and tend to use it for everything I can use it for. But
I’ve been reading up on Unix recently, and I decided to test my newfound
knowledge by using standard Unix programs to solve a problem. Those who do not
know Unix are doomed to re-implement it badly (or so I have been told).

I needed to copy a lot of images from a remote server to my local machine.
Since images were constantly being added to the remote server, I wanted to have
a repeatable script to download only those images that were listed in a YAML
file from another application. So I needed to read the YAML file, find the
files listed inside it, and collect those in an archive for easy downloading.

01. Reading input

My input file was in YAML, so the first step was reading that. Since the
file is several thousand lines long, I piped it into head to print just the
first few lines:
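A minimal sketch of this first step. The file name images.yml is the one used elsewhere in this post; the sample entries and the example.com host are invented for illustration, since the real file is not shown.

```shell
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-in for the real images.yml: a YAML array of image URLs.
for n in 1 2 3 4 5 6 7 8 9 10 11 12; do
  echo "- http://example.com/uploads/photo-$n.jpg"
done > images.yml

# Dump the file and keep only the first lines (head prints 10 by default).
first_lines=$(cat images.yml | head)
printf '%s\n' "$first_lines"
```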

That works, but there’s a saying along the lines of: “if you cat a
file and immediately pipe it into something else, something’s wrong”. So, I
rewrote it like so:

$ sed '' images.yml | head
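For comparison, a small sketch showing that a sed call with an empty script passes its input through untouched, and that head can also open the file by itself. The sample data is invented; only the file name images.yml comes from the post.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-in for the real images.yml.
for n in 1 2 3 4 5 6 7 8 9 10 11 12; do
  echo "- http://example.com/uploads/photo-$n.jpg"
done > images.yml

# An empty sed script applies no edits and prints every line as-is.
via_sed=$(sed '' images.yml | head)

# head accepts a file name, so no upstream command is strictly needed.
via_head=$(head images.yml)

[ "$via_sed" = "$via_head" ] && echo "same output"
```

Starting the chain with sed has one practical upside: later substitutions can be added to that first command instead of lengthening the pipeline.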

02. “Parsing” YAML

Then, I needed to get rid of the YAML array element indicators – the dashes
starting each line. I could have used sed for that, but I chose cut, which
extracts fields from a line, splitting the line on a given delimiter into
columns. I wanted the second column with a space as delimiter:
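A sketch of what that cut call could look like, run on two invented entries (the hosts are placeholders, not the real ones from the file):

```shell
# Two hypothetical YAML array entries.
entries='- http://example.com/uploads/photo-1.jpg
- http://cdn.example.org/banner.png'

# -d ' ' splits each line on a space; -f 2 keeps only the second field,
# which drops the leading dash.
urls=$(printf '%s\n' "$entries" | cut -d ' ' -f 2)
printf '%s\n' "$urls"
```

Note that -f 2 would truncate any value that itself contains a space; -f 2- keeps everything from the second field onwards.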

This gives a new problem: there are several different external hosts in the
file. I only wanted our own. I decided to rewrite the command and use grep to
filter out all lines that do not contain our own host, and then remove the
domain:
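One way the rewritten command could look, as a sketch. The host http://example.com stands in for "our own host", which the post does not name, and the entries are invented.

```shell
# Hypothetical entries: two on our own host, one external.
entries='- http://example.com/uploads/photo-1.jpg
- http://cdn.example.org/banner.png
- http://example.com/uploads/photo-2-150x75.jpg'

# grep -F matches the host as a fixed string and drops every other line,
# cut removes the YAML dash, and sed strips the scheme and domain.
paths=$(printf '%s\n' "$entries" |
  grep -F 'http://example.com/' |
  cut -d ' ' -f 2 |
  sed 's|^http://example\.com||')
printf '%s\n' "$paths"
```

Using | as the sed delimiter avoids having to escape the slashes in the URL.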

Alas, that doesn’t work. I started investigating possible solutions, such as
using xargs – which mashes a bunch of lines into a single line and feeds them
as arguments to another program, with some intelligence about the number of
arguments a program accepts. After some fiddling, I got frustrated that zip
just didn’t read filenames from standard input, so I finally decided to open
the zip manual with man zip. Searching the manual for stdin, I found out
zip indeed does not read input filenames from standard input by default, but
on Mac OS X, there’s the --names-stdin option, while on most other systems
there’s -@. There you go, it pays to RTFM.
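A short sketch of both ideas, using invented file names and a hypothetical archive name, dump.zip. It assumes Info-ZIP's zip, and the zip step is skipped when that program is not installed.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical empty stand-ins for the real images, plus a list of their names.
touch photo-1.jpg photo-2.jpg photo-3.jpg
printf '%s\n' photo-1.jpg photo-2.jpg photo-3.jpg > files.txt

# xargs turns the lines on standard input into arguments for another
# program; echo makes the resulting argument list visible.
as_args=$(xargs echo < files.txt)
printf '%s\n' "$as_args"

# With -@, zip reads the names of the files to add from standard input.
if command -v zip >/dev/null 2>&1; then
  zip -q dump.zip -@ < files.txt
fi
```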

Then, I wanted to use only the original images, not the generated thumbnails. I
happened to know that generated thumbnails have filenames like
original-filename-150x75.jpg. Removing the dimensions at the end of the
filename would give me the regular file. My list could very well contain that
original file already, but uniq would sort that out. So, there’s one more
sed to add:
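A sketch of such a substitution, assuming thumbnail names always end in -WIDTHxHEIGHT followed by the extension. The paths are invented. Because uniq only collapses adjacent duplicate lines, the sketch sorts the list first; whether the original chain did the same is an assumption.

```shell
# Hypothetical list: one original, its thumbnail, and a lone thumbnail.
files='/uploads/photo-1.jpg
/uploads/photo-1-150x75.jpg
/uploads/photo-2-300x200.jpg'

# Strip a trailing "-<digits>x<digits>" that sits right before the
# extension, then sort so uniq can drop the resulting duplicates.
originals=$(printf '%s\n' "$files" |
  sed -E 's/-[0-9]+x[0-9]+(\.[A-Za-z]+)$/\1/' |
  sort | uniq)
printf '%s\n' "$originals"
```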

That gave me a dump archive file containing all my images. As I was happy with
the result, I tacked on a -9 to enable maximum compression for the archive,
shaving a couple of percentage points off the final file size.

Conclusion

This post might seem long, but the process of developing this command chain was
actually rather quick. Feedback is almost instant and there’s a rich collection
of tools to get the job done. I’m pretty sure developing a Ruby script doing
the same thing would have involved more manual tweaking and looking up
documentation.
