Fixing aberrant files using R and the shell: a case study

Once in a while, you embark on what looks like a simple computational procedure only to encounter frustration very early on. “I can’t even read my file into R!” you cry.

Step back, take a deep breath and take note of what the software is trying to tell you. Most of the time, you’ve just missed something very straightforward. Here’s an example.

Recently, I was trying to retrieve some data describing characteristics of microbial genomes from the NCBI FTP site. The file, lproks_0.txt (direct link), looked like a regular tab-delimited file with a couple of header lines:

Sharp eyes will notice a problem right there, on the first line of data. Less sharp-eyed users like me will open an R console to read the file, expecting no issues:

genomes

“I can’t even read my file into R!”

My first mistake: reaching for the tool that I know best, not the tool that is appropriate. In this case, my instinct was to count the fields in a line using awk. Since we skipped the first 2 lines in R, the line numbers that R reports are offset by 2, so we want to examine line 378 in the original file:
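A minimal sketch of that kind of awk check follows. It runs on a small made-up stand-in file (the rows are invented for illustration, not taken from lproks_0.txt), and line 4 of the stand-in plays the role of line 378.

```shell
# Build a tiny stand-in file (made-up rows, NOT the real lproks_0.txt):
# two header lines, then tab-delimited data with 4 fields per row.
sample=$(mktemp)
printf 'header line 1\nheader line 2\n' > "$sample"
printf '1\tEscherichia coli\t4.6\tcomplete\n' >> "$sample"
printf "2\tStrain 5'\t2.1\tdraft\n" >> "$sample"

# Print the number of tab-separated fields (NF) on one line of the raw file.
# NR is the line number, counted from the top of the file, headers included.
fields=$(awk -F'\t' 'NR == 4 { print NF }' "$sample")
echo "$fields"

rm -f "$sample"
```

On the real file, the equivalent would be awk -F'\t' 'NR == 378 { print NF }' lproks_0.txt. Note that awk splits purely on tabs and knows nothing about quoting or comment characters, so its count need not match what R sees on the same line.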

As it happens, R comes with several useful functions to examine the structure of files. One of these is count.fields(), which does what it says on the tin. We can combine it with table() to tally how many lines have each field count:

Problem solved. Some of the lines contain a single-quote character, which read.table() treats by default as the start of a quoted field. Others contain the “#” symbol, which read.table() interprets by default as a comment character. No wonder the file could not be read correctly.
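To get a feel for how many lines are affected, the shell can count them. This is only a sketch on a made-up stand-in file (invented rows, not the real data); the bracket expression matches any line containing a single quote or a “#”.

```shell
# Made-up stand-in file: one header line and three tab-delimited rows,
# two of which contain a character that read.table() treats specially.
sample=$(mktemp)
printf 'id\tname\tsize\n' > "$sample"
printf '1\tEscherichia coli\t4.6\n' >> "$sample"
printf "2\tStrain 5'\t2.1\n" >> "$sample"
printf '3\tIsolate #7\t3.3\n' >> "$sample"

# Count the lines that contain a single quote or a '#'.
suspect=$(grep -c "['#]" "$sample")
echo "$suspect"

# List those lines with their line numbers for a closer look.
grep -n "['#]" "$sample"

rm -f "$sample"
```

Run against the real file, grep -n "['#]" lproks_0.txt would list every line worth inspecting.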

How to fix the file? Ideally, you would find a way to read it “as is”; in R, passing quote = "" and comment.char = "" to read.table() turns off both interpretations. Alternatively, if altering the file contents is not an issue, we can bring sed into play:
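One possible sed approach is sketched below. It is not necessarily the command originally used here, and it assumes that simply deleting the two offending characters is acceptable. The stand-in file is made up for illustration.

```shell
# Made-up stand-in for the real file: tab-delimited rows, two of which
# contain the characters that trip up read.table().
sample=$(mktemp)
fixed=$(mktemp)
printf 'id\tname\tsize\n' > "$sample"
printf '1\tEscherichia coli\t4.6\n' >> "$sample"
printf "2\tStrain 5'\t2.1\n" >> "$sample"
printf '3\tIsolate #7\t3.3\n' >> "$sample"

# Delete every single quote and '#'. Tabs are left alone, so the number
# of fields per line does not change. Write to a new file rather than
# editing in place, so the original stays available.
sed "s/['#]//g" "$sample" > "$fixed"

# Confirm that no offending characters remain (grep -c exits non-zero
# when the count is 0, hence the "|| true").
remaining=$(grep -c "['#]" "$fixed" || true)
echo "lines still affected: $remaining"

# Inspect the name field of the two repaired rows.
name3=$(awk -F'\t' 'NR == 3 { print $2 }' "$fixed")
name4=$(awk -F'\t' 'NR == 4 { print $2 }' "$fixed")
echo "$name3 | $name4"

rm -f "$sample" "$fixed"
```

For the real file this would be something like sed "s/['#]//g" lproks_0.txt > lproks_fixed.txt, where lproks_fixed.txt is a hypothetical output name. Bear in mind that this changes the data: a name such as “Strain 5'” becomes “Strain 5”.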