Monday, October 10, 2011

Shell scripts make it easy to pass data between external commands. But shell script as a programming language lacks features like non-trivial data structures and easy, robust concurrency. These would be useful in building quick solutions to system administration and automation problems.

As others have noted [1][2][3][4][5], Haskell is an interesting alternative for these scripting tasks. I wrote the shqq library to make it a little easier to invoke external programs from Haskell. With the sh quasiquoter, you write a shell command that embeds Haskell variables, execute it as an IO action, and get the command's standard output as a String. In other words, it's a bit like the backtick operator from Perl or Ruby.
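A minimal sketch of what that looks like in practice. It assumes the shqq package is installed, that its `System.ShQQ` module exports the `sh` quasiquoter, and that a Haskell variable is embedded as `$name` and shell-escaped on interpolation, as I understand the package documentation to describe. The variable `dir` and the `ls` command are only illustrations.

```haskell
{-# LANGUAGE QuasiQuotes #-}

import System.ShQQ (sh)

main :: IO ()
main = do
  -- An ordinary Haskell String that the shell command will refer to.
  let dir = "/tmp"
  -- The quoted text is run by the shell; $dir is replaced by the
  -- (escaped) value of the Haskell variable above.
  listing <- [sh| ls -ld $dir |]
  -- The action's result is whatever the command wrote to standard output.
  putStr listing
```

Because the result is a plain String, the usual list functions (`lines`, `words`, and so on) can take over from there.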

As an example, consider a script that finds duplicate files in a directory tree. For efficiency, we find potential duplicates by size, and then checksum only these files. We use external shell commands for checksumming as well as the initial directory traversal. At the end we print the names of duplicated files, one per line, with a blank line after each group of duplicates.

I included type signatures for clarity, but you wouldn't need them in a one-off script. Not counting imports and the LANGUAGE pragma, that makes 10 lines of code total. I'm pretty happy with the expressiveness of this solution, especially the use of parallel IO for an easy speedup.
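For readers who want to experiment without shqq, here is a rough sketch of the same approach (group by size first, then checksum only the candidates). It is not the shqq-based script discussed above: it swaps the quasiquoter for `readProcess` from the process package, runs sequentially rather than in parallel, and the helper names `groupDupes`, `splitNul` and `sha1` are made up for illustration. It assumes `find` (with `-print0`) and `sha1sum` are on the PATH, and a version of the directory package that provides `getFileSize`.

```haskell
import qualified Data.Map.Strict as M
import System.Directory (getFileSize)
import System.Environment (getArgs)
import System.Process (readProcess)

-- | Group values by key and keep only groups with at least two members.
-- 'flip (++)' keeps the members of each group in their input order.
groupDupes :: Ord k => [(k, v)] -> [[v]]
groupDupes =
  filter (not . null . drop 1)
    . M.elems
    . M.fromListWith (flip (++))
    . map (fmap (: []))

-- | Split NUL-terminated output, such as that of @find -print0@.
splitNul :: String -> [String]
splitNul s = case break (== '\0') s of
  (name, _ : rest) -> name : splitNul rest
  (name, [])       -> [name | not (null name)]

-- | Checksum one file with the external sha1sum command.
-- sha1sum prefixes the line with a backslash when it has to escape
-- the file name, so that marker is dropped before taking the digest.
sha1 :: FilePath -> IO String
sha1 f = do
  out <- readProcess "sha1sum" ["--", f] ""
  return (take 40 (dropWhile (== '\\') out))

main :: IO ()
main = do
  args <- getArgs
  let dir = case args of
        (d : _) -> d
        []      -> "."
  -- External command for the directory traversal.
  listing <- readProcess "find" [dir, "-type", "f", "-print0"] ""
  let files = splitNul listing
  sizes <- mapM getFileSize files
  -- Only files that share a size with another file can be duplicates.
  let candidates = concat (groupDupes (zip sizes files))
  sums <- mapM sha1 candidates
  -- One name per line, with a blank line after each group.
  mapM_ (putStrLn . unlines) (groupDupes (zip sums candidates))
```

The `mapM sha1` step is where a parallel map would go if checksumming turns out to dominate the run time.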

That would have taken one line in bash. And be nicer to the eye / simpler to read. Something like this would probably do the trick:

    find -type f | while read F; do echo "$(sha1sum "$F" | cut -c 1-40) $(stat -c %s "$F") $F"; done | uniq -d -f 3 -c # -c is optional

(Files with backslashes in their names won’t work.)

Sure, it won’t be as fast to run, but it’s just a one-off thing, and the hard disks are the speed bottleneck anyway, so the difference will be negligible.