Hell is other people’s data

I present what follows as “a typical day in the life of a bioinformatician”.

I click through to the Download page. At first sight, it’s reasonably promising. Excel files – not perfect, but they beat a PDF – and ooh, TXT files. OK, I grab the “US version” (which uses decimal points, not commas, for decimal numbers) of the uncurated database:

less ._UMDTP53_all_2012_R1_US.txt
# "._UMDTP53_all_2012_R1_US.txt" may be a binary file. See it anyway?
# no, I don't want to do that
strings ._UMDTP53_all_2012_R1_US.txt
# TEXTXCEL

At this point, alarm bells are ringing. I’m thinking “Mac user”. Don’t get me wrong; I own a Macbook Air myself, I see smart people doing good work with Macs. But in my experience, biologist + Mac + text files = trouble.

I push on and check the number of records in the main text file.

wc -l UMDTP53_all_2012_R1_US.txt
# 0 UMDTP53_all_2012_R1_US.txt

Zero lines. That can’t be right. I look at the file using less.

Mmm – oh, ^M

Aha, the old “line endings” issue: classic Mac line endings are a bare carriage return (CR), and wc -l only counts line feeds. Easy enough to fix with the classic utility dos2unix – in this case, called as mac2unix:

cat UMDTP53_all_2012_R1_US.txt | mac2unix > /tmp/UMDTP53_all_2012_R1_US.txt
# on we go; overwriting the original since we can always get it from the zip
mv /tmp/UMDTP53_all_2012_R1_US.txt UMDTP53_all_2012_R1_US.txt
wc -l UMDTP53_all_2012_R1_US.txt
# 36248

The web page said there should be 36 249 records…but the file looks OK. Right, let’s get on and parse this file. It’s tab-delimited, I like Ruby, so I try to do the right thing and use a library – in this case, the Ruby CSV library. First, just a run-through to see that it all works as expected:
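Something like the following sketch reproduces the failure. The two-line sample file is a hypothetical stand-in (invented column names and UID) for the real data, containing the byte 0x8F, which is “è” in Mac Roman but invalid UTF-8:

```ruby
require 'csv'
require 'tempfile'

# Hypothetical stand-in for the real file: tab-delimited, containing
# the byte 0x8F ("è" in Mac Roman, invalid in UTF-8)
sample = Tempfile.new('umd')
sample.write("Name\tMedline\n".b)
sample.write("Mol\x8Fs\t12345\n".b)
sample.close

error = nil
begin
  # the naive run-through: read with the default UTF-8 and print each row
  CSV.foreach(sample.path, col_sep: "\t") { |row| p row }
rescue StandardError => e
  # depending on the CSV version this surfaces as ArgumentError or
  # CSV::MalformedCSVError; either way it complains about an
  # invalid byte sequence in UTF-8
  error = e
  puts "#{e.class}: #{e.message}"
end
```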

Not all as expected, then. Colour me astonished. Frankly, this is why people resort to splitting on the delimiter instead of using libraries…anyway, search the web for that error and you’ll find a lot of confusing, conflicting and sometimes just plain wrong explanations and advice. Let me save you the trouble: the Ruby CSV library expects UTF-8 encoding, and this file is not encoded in UTF-8. So what is it? A surprisingly tricky question to answer.

First step: another visual examination using less, which shows some odd characters. The file contains a “Medline” column, so I search PubMed with the UID and see that author Mol<8F>s is supposed to be author Molès. I am not an expert in character encoding but when I post my frustration to Twitter, I find someone who is. The verdict: the file is encoded as Mac Roman.

All good! I can parse the file and start using the fields to do something useful. Once again the interesting part (analysis) takes minutes, getting to the analysis takes hours or days.
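Concretely, knowing the encoding is Mac Roman, the fix is to hand Ruby the encoding when reading and let it transcode to UTF-8 on the fly. A sketch against the same kind of hypothetical stand-in file (“macRoman” is Ruby’s name for Mac Roman; the column layout is an assumption):

```ruby
require 'csv'
require 'tempfile'

# the same hypothetical stand-in: tab-delimited, with the Mac Roman
# byte 0x8F ("è")
sample = Tempfile.new('umd')
sample.write("Name\tMedline\n".b)
sample.write("Mol\x8Fs\t12345\n".b)
sample.close

# "macRoman:UTF-8" reads the file as Mac Roman and transcodes
# it to UTF-8 before parsing
rows = CSV.read(sample.path, col_sep: "\t", encoding: "macRoman:UTF-8")
puts rows[1][0]   # prints "Molès"
```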

It’s always tempting to say “well all of these problems could be avoided if people only did [insert better approach here]”. The thing is, they don’t. Dealing with it is just part of the job and having the skills to deal with it, I think, deserves more recognition than it gets.

Simply describing what the data is would be a big step forward. Just saying “text” doesn’t cut it. You need to say something like: “text encoded with Mac Roman, lines ended with carriage return (CR), fields separated by tabs, tabs within fields quoted by a preceding backslash”. Or similar.

This of course is why we need more explicit, self-describing formats. I think XML is pretty horrible but it is at least a foundation for creating formats that express data more cleanly.
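For illustration, a hypothetical record in that spirit (the element names are invented): even the smallest XML document declares its own character encoding in the prolog, which is precisely the metadata this file lacked.

```xml
<?xml version="1.0" encoding="macintosh"?>
<!-- hypothetical record; "macintosh" is the IANA name for Mac Roman -->
<record>
  <author>Molès</author>
  <medline>12345</medline>
</record>
```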

I just downloaded the file on my Mac PowerBook, unzipped it, and dragged it onto the Excel icon in the Dock. It popped right up. 36249 lines (first line is a header). Took less than 3 minutes from start to finish. :-)

This is not a Mac problem. It’s very specifically an “Excel on Mac” problem, and its buggy CSV export. And it’s not as if this class of problem is limited to the Mac – on the contrary (in fact, Mac line endings are the usual Unix-y LF, except in the aforementioned buggy Excel export). On Windows, the line endings would likewise have been wrong (CR+LF), and the encoding would likewise have been non-Unicode (Windows-125*).

As for `mac2unix` not working as expected: that’s quite simply a gotcha in the utility, which silently skips files it considers binary (here, because of an invalid character in line 418; where that character comes from is not obvious, though the `--force` flag would convert the file anyway). Incidentally, the missing last record is due to a missing line termination on the last line. Again, a deficiency of the specific tool (Excel) rather than the operating system.

So why am I making a big point out of this? Because I don’t really see what the alternatives are. Windows? Same issue. Linux? Too complicated for the biologist to learn, and lacking some tools (Illustrator …). A Mac, on the other hand, actually offers all the tools, both competent graphical tools and the necessary command line utilities to perform the grunt work. There’s simply no substitute for learning the tools, and this is where criticism should be levelled. But if we want to blame tools – and I love doing that – let’s blame Excel, and get its use for biologists banned: it’s simply irredeemably broken.

Good points. Not saying it’s a “Mac problem”, though. It’s a “people using Mac” problem :) or more generally, a “people using computers” problem. As you suggest, not something we can easily fix.

To expand a little: in “my day” (the 1990s), the Mac was the favoured tool of molecular biologists because, they said, it was “easy to use”. Unfortunately, “easy to use” is often synonymous with “we don’t know what we’re doing but we can create things anyway”. And that is why I tend to roll my eyes when I see “created by biologists on a Mac”.

In fact, in many ways I see bioinformatics as a response to 1990s molecular biology, which was computer-illiterate, innumerate and devoid of statistics – but I think that is another blog post.

(also, pretty sure the missing record is just down to them forgetting to subtract 1 for the header row :) )