How awk can make your life easier. April 4, 2009

Sometimes I have trouble imagining how to get things done efficiently without all the wonderful little (or not so little) Unix command line tools, perhaps chained together with a few pipes inside a bash loop. However, if I had to choose a single utility to take with me to a desert island, I would not hesitate to pick awk. If you haven’t ever used awk, chances are that you really should. Hopefully, the following will help you decide if awk could help make your computing life a bit easier.

In this first installment I will only provide a high-level introduction to awk; next time we’ll dive in a bit deeper. Before we get started it is probably prudent to point out that there are several implementations of awk floating around these days. If you use Linux, chances are that you will have GNU awk, or gawk for short, available. I will use gawk in my examples.

After these preliminaries, let’s talk about what awk actually does. As it turns out, awk is not simply a tool but an actual programming language designed to efficiently deal with the content of (plain text) files in a line-by-line fashion. Whereas other languages (Python, Ruby, C, ...) require one to explicitly open files and read their content line by line, awk provides this infrastructure for free and furthermore splits each line into fields according to a given separator (whitespace by default). Hence, whenever you find yourself reading files and then doing something to each line, awk may help make your life considerably easier. Let’s look at an example data file, my_data.txt.

Here, each line contains three columns separated by a space. The first one is a date, the second one the dollar amount for a certain sales item, and the third the number of items sold. How would we go about converting this file into one that lists the total sales for each date in the first column and the corresponding date in the second one?

Sure, you could write a little Python script, but let’s use awk instead. I mentioned above that awk splits each line into fields. These can be accessed via $0, $1, $2, …, where $0 holds the complete line, $1 the first field, $2 the second one, and so on. Hence,

# gawk '{ print $0 }' my_data.txt

will print each line of my_data.txt to stdout. awk automagically walks through all lines of the input file and applies the commands inside the curly braces to each line. That said, it’s now dead simple to write down the code to transform our sales example. Here it is:

# gawk '{print $2*$3,$1}' my_data.txt

For each line we simply print field $2 (per-item price) multiplied by field $3 (number of items), followed by $1, which is the date. Our output therefore consists of two columns; the comma in the print statement makes awk insert its output field separator, a single space by default, between them.
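
To see the transformation end to end, here is a minimal sketch that can be pasted into a shell. The file name sample_sales.txt and the three rows in it are made up for illustration (they are not the contents of my_data.txt); they only follow the layout described above: date, price per item, number of items. The sketch calls plain awk so that it runs with whatever implementation is installed; gawk should behave the same way for this one-liner.

```shell
# Hypothetical sample data (invented for this sketch, not the original my_data.txt).
# Columns: date, price per item, number of items sold.
printf '%s\n' \
  '2009-04-01 2.50 4' \
  '2009-04-02 10 3' \
  '2009-04-03 0.75 8' > sample_sales.txt

# For every line, print price * quantity followed by the date.
awk '{ print $2*$3, $1 }' sample_sales.txt
```

With these made-up rows the command prints one line per input line: the total for that date (10, 30, and 6 here), a space, and then the date.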

As you can see, if you need to process column-centric data files, awk really can make your life a lot easier. If this has whetted your appetite for awk, you might be interested in the next installment on this blog. If you’re too impatient to wait, Google is your friend, or have a look at the excellent book by Dougherty and Robbins [1].