Gawking Hex Files

2 years ago

Last time I talked about how to use AWK (or, more probably the GNU AWK known as GAWK) to process text files. You might be thinking: why did I care? Hardware hackers don’t need text files, right? Maybe they do. I want to talk about a few common cases where AWK can process things that are more up the hardware hacker’s alley.

The Simple: Data Logs

If you look around, a lot of data loggers and test instruments do produce text files. If you have a text file from your scope or a program like SIGROK, it is simple to slice and dice it with AWK. Your machines might not always put out nicely formatted text files. That’s what AWK is for.

AWK makes the default assumption that fields break on whitespace and end with line feeds. However, you can change those assumptions in lots of ways. You can set FS and RS to change the field separator and record separator, respectively. Usually, you’ll set this in the BEGIN action although you can also change it on the command line.

For example, suppose your test file uses semicolons between fields. No problem. Just set FS to “;” and you are ready to go. Setting FS to a newline will treat the entire line as a single field. Instead of delimited fields, you might also run into fixed-width fields. For a file like that, you can set FIELDWIDTHS.

If the records aren’t delimited, but a fixed length, things are a bit trickier. Some people use the Linux utility dd to break the file apart into lines by the number of bytes in each record. You can also set RS to a limited number of any character and then use the RT variable (see below) to find out what those characters were. There are other options and even ways to read multiple lines. The GAWK manual is your friend if you have these cases.

Once you have records and fields sorted, it is easy to do things like average values, detect values that are out of limit, and just about anything else you can think of.

Spreadsheet Data Logs

Some tools output spreadsheets. AWK isn’t great at handling spreadsheets directly. However, a spreadsheet can be saved as a CSV file and then AWK can chew those up easily. It is also an easy format to produce from an AWK file that you can then read into a spreadsheet. You can then easily produce nice graphs, if you don’t want to use GNUPlot.

Simplistically, setting FS to a comma will do the job. If all you have is numbers, this is probably enough. If you have strings, though, some programs put quotes around strings (that may contain commas or spaces). Some only put quotes around strings that have commas in them.

To work around this problem cleanly, AWK offers an alternate way to define fields. Normally, FS tells you what characters separate a field. However, you can set FPAT to define what a field looks like. In the case of CSV file, a field is any character other than a comma or a double quote and then anything up to the next double quote.

This isn’t perfect. For example, escaped quotes don’t work right. Quoted text with new lines in it don’t either. The manual has some changes that remove quotes and handle empty fields, but the example above works for most common cases. Often the easiest approach is to change the delimiter in the source program to something unique, instead of a comma.

Hex Files

Another text file common in hardware circles is a hex file. That is a text file that represents the hex contents of a programmable memory (perhaps embedded in a microcontroller). There are two common formats: Intel hex files and Motorola S records. AWK can handle both, but we’ll focus on the Intel variant.

Old versions of AWK didn’t work well with hex input, so you’d have to resort to building arrays to convert hex digits to numbers. You still see that sometimes in old code or code that strives to be compatible. However, GNU AWK has the strtonum function that explicitly converts a string to a number and understands the 0x prefix. So a highly compatible two digit hex function looks like this (not including the code to initialize the hexdigit array):

In fact, the last function is a little better (and misnamed) because it can handle any hex number regardless of length (up to whatever limit is in GAWK).

Hex output is simple since you have printf and the X format specifier is available. Below is an AWK script that chews through a hex file and provides a count of the entire file, plus shows a breakdown of the segments (that is, non-contiguous memory regions).

This shows a few AWK features: the BEGIN action, user-defined functions, the use of named character classes (:xdigit: is a hex digit) and arrays (block and blockct use numeric indices even though they don’t have to). In the END action, the summary uses printf statements for both decimal and hex output.

Binary Files

Text files are fine, but real hardware uses binary files that people (and AWK) can’t easily read, right? Well, maybe people, but AWK can read binary files in a few ways. You can use getline in the BEGIN part of the script and control how things are read directly. You can also use the RS/RT trick mentioned above to read a specific number of bytes. There are a few other AWK-only methods you can read about if you are interested.

However, the easiest way to deal with binary files in AWK is to convert them to text files using something like the od utility. This is a program available with Linux (or Cygwin, and probably other Windows toolkits) that converts a binary file to different readable formats. You probably want hex bytes, so that’s the -t x2 option (or use x4 for 16-bit words). However, the output is made for humans, not machines, so when a long run of the same output occurs, od omits them replacing all the missing lines with a single asterisk. For AWK use, you want to use the -v option to turn that behavior off. There are other options to change the output radix of the address, swap bytes, and more.

This is dead simple to parse with AWK. The address will be $1 and each field will be $2, $3, etc. You can just convert the file yourself, use a pipe in the shell, or–if you want a clean solution–have AWK run od as a subprocess. Since the input is text, all of AWK’s regular expression features still work, which is useful.

Writing binary files is easy, too, since printf can output nearly anything. An alternative is to use xxd instead of od. It can convert binary files to text, but also can do the reverse.

Full Languages

There’s an old saying that if all you have is a hammer, everything looks like a nail. I doubt that AWK is the best tool to build full languages, but it can be a component of some quick and dirty hacks. For example, the universal cross assembler uses AWK to transform assembly language files into an internal format the C preprocessor can handle

Since AWK can call out to external programs easily, it would be possible to write things that, for example, processed a text file of commands and used them to drive a robot arm. The regular expression matching makes text processing easy and external programs could actually handle the hardware interface.

Think that’s far fetched? We’ve covered stranger AWK use cases, including a Wolfenstien-like game that uses 600 lines of AWK script (as seen to the right).

So, sure it is software, but it is a tool that has that Swiss Army knife quality that makes it a useful tool for software and hardware hackers alike. Of course, other tools like Perl, Python, and even C or C++ can do more. But often with a price in complexity and learning curve. AWK isn’t for every job, but when it works, it works well.