Friday, 30 March 2018

Awk field separator and field references: This is the third article in our awk tutorial series. The first article introduced awk, and in the second we wrote a Hello World program in awk. In this article, we will learn how to separate fields and reference them in awk.

Referencing Fields and Records

In the first article of this tutorial series, Introduction to awk, we covered the following points:

awk assumes that its input is structured data.

It interprets each line of the input file(s) as a record.

Each line contains strings/words separated (or delimited) by whitespace or by some other character. These separators are referred to as delimiters.

Each of the strings/words separated by a delimiter is called a field.

In a file such as /etc/passwd, each line is interpreted as a record. Because the words/strings on each line are separated by a colon ( : ), the colon is the delimiter, and the strings it separates, e.g. foouser, 1001, /bin/bash, etc., are the fields.

In awk, we reference each field using the $ operator, followed by a number or an awk variable. To keep things simple here, we will learn more about awk variables in later articles. Thus, we can reference the first field of the record with $1, the second field with $2, the third field with $3, and so on. $0 references the whole record (the input line).
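A minimal sketch of these field references, run from the shell. The two input lines are made up for illustration and are not the contents of any file used in this series; the variable name n is also just an example.

```shell
# Two hypothetical whitespace-separated records (illustrative data only).
records='alice 72 85
bob 64 90'

# $1 and $3 are the first and third fields of each record.
first_and_third=$(printf '%s\n' "$records" | awk '{ print $1, $3 }')

# $ can also be followed by an awk variable: here n holds the field number.
second_via_var=$(printf '%s\n' "$records" | awk -v n=2 '{ print $n }')

# $0 is the whole record, i.e. the unmodified input line.
whole_record=$(printf '%s\n' "$records" | awk '{ print $0 }')

printf '%s\n' "$first_and_third" "$second_via_var" "$whole_record"
```

Each awk command here has no pattern, so its action runs on every input line.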

Let's take a look at the following example. We have an input file result.txt with contents as below [snipped]:

In the above example, we did not specify a field separator or delimiter anywhere in the awk command, so we can conclude that awk treats whitespace as the default field separator. awk also lets us set a field separator of our own choice with the -F option followed by the delimiter. Let's check this with the /etc/passwd file, whose fields are delimited by a colon.
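A minimal sketch of that check, using one made-up passwd-style line instead of the real /etc/passwd so that the result does not depend on the system. Only foouser, 1001 and /bin/bash come from the text above; the remaining fields are assumptions.

```shell
# Hypothetical passwd-style record; real /etc/passwd entries differ per system.
line='foouser:x:1001:1001::/home/foouser:/bin/bash'

# Default separator is whitespace: the line contains none,
# so the whole line ends up in $1.
default_split=$(printf '%s\n' "$line" | awk '{ print $1 }')

# -F: tells awk to split fields on the colon instead.
colon_split=$(printf '%s\n' "$line" | awk -F: '{ print $1, $3, $7 }')

printf '%s\n' "$default_split" "$colon_split"
```

On a real system the same command would be run as awk -F: '{ print $1, $3, $7 }' /etc/passwd.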

While writing an awk script, we can change the field separator with the awk variable FS. We need to tell awk about the custom delimiter before it starts reading lines from the input file. Here the BEGIN block comes in handy: it is executed before any input lines are read. Similarly, the END block is executed once all lines of the input file have been read. Both BEGIN and END blocks are optional.
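The order in which these blocks run can be sketched as follows. The input lines and the "start"/"done" messages are made up purely to make the ordering visible.

```shell
# Hypothetical colon-delimited input (illustrative data only).
line_data='a:1
b:2
c:3'

begin_end_out=$(printf '%s\n' "$line_data" | awk '
BEGIN { FS = ":"; print "start" }   # runs once, before the first line is read
      { print $2 }                  # runs for every input line
END   { print "done" }              # runs once, after the last line is read
')

printf '%s\n' "$begin_end_out"
```

Because FS is assigned in BEGIN, the very first line is already split on the colon.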

So, we can write an awk script passwd.awk as:

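BEGIN { FS = ":" }      # set the field separator before any input is read
{ print $3, $1, $7 }    # for every record, print the third, first and seventh fields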

As covered in our first tutorial, we can run the instructions from this script with the -f option, as below:
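A minimal sketch of such a run. To keep it reproducible, it writes passwd.awk and a one-line, made-up passwd-style file into a temporary directory; the directory, the sample file name and the sample record are assumptions for illustration.

```shell
workdir=$(mktemp -d)

# The passwd.awk script from above.
cat > "$workdir/passwd.awk" <<'EOF'
BEGIN { FS = ":" }
{ print $3, $1, $7 }
EOF

# Hypothetical passwd-style input; on a real system, pass /etc/passwd instead.
printf '%s\n' 'foouser:x:1001:1001::/home/foouser:/bin/bash' > "$workdir/passwd.sample"

# -f reads the awk program from the given file.
script_out=$(awk -f "$workdir/passwd.awk" "$workdir/passwd.sample")
printf '%s\n' "$script_out"

rm -r "$workdir"
```

Note the difference between the lowercase -f (program file) and the uppercase -F (field separator).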

By default, all the instructions in the script are executed on every line of the input file. To execute them only on selected lines, we can add a pattern by enclosing a regular expression in slashes ( /[REGEX]/ ). The instructions are then executed only on the lines that match the regex.

To verify this, we use our results.txt file again. From the entire list of students and their marks in certain subjects, we can filter only the records of students who got exactly 50 marks, whatever the subject. So, we use 50 as the pattern to match, as shown below:
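A minimal sketch of this filter. The student names and marks below are invented for illustration and are not the actual contents of the results file.

```shell
# Hypothetical student records: name followed by marks in two subjects.
marks='John 50 72
Maria 65 81
Joanna 90 50
Peter 77 68'

# /50/ selects only the lines that contain the text 50.
fifty_out=$(printf '%s\n' "$marks" | awk '/50/ { print $0 }')

printf '%s\n' "$fifty_out"
```

A bare /50/ is matched against the whole line, so it would also select a line containing 150 or 502. With marks like the ones assumed here that does not matter, but it is worth keeping in mind for other data.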

Or we can filter only the records of students whose names start with the string Jo. For this, we use the regex ^Jo with the tilde ( ~ ) operator to match against the first field ( $1 ), which is the name of the student.
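A minimal sketch of the field-level match, again with invented student records. The name MaryJo is included only to show the effect of the ^ anchor.

```shell
# Hypothetical student records: name followed by marks in two subjects.
students='John 50 72
Maria 65 81
Joanna 90 50
MaryJo 55 61'

# $1 ~ /^Jo/ is true only when the first field starts with Jo.
jo_out=$(printf '%s\n' "$students" | awk '$1 ~ /^Jo/ { print $0 }')

printf '%s\n' "$jo_out"
```

MaryJo is not selected because ^ anchors the match to the start of the first field.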