Martin Johnsson's blog

Meny

Using R: reading tables that need a little cleaning

Sometimes one needs to read tables that are a bit messy, so that read.table doesn’t immediately recognize the content as numerical. Maybe some weird characters are sprinkled in the table (ever been given a table with significance stars in otherwise numerical columns?). Some search and replace is needed. You can do this by hand, and I know this is a task for which many people would turn to a one-liner of perl or awk, but I like to do my scripting in R.

Again, this is in the ”trivial of you only think it out” category of problems. But it gives a starting point for more subtle changes to text files that one might have to do from time to time.

Assume the data look something like this (I made this up):

id;col1;col2
id1; 0,2;0.55*
id2; -,1;,23

Thankfully, R’s conversion from numbers to text is pretty good. It will understand, for instance, that both .5 and 0.5 are numbers. If the problem was just an odd decimal separator, like the Swedish decimal comma, we could just change the dec parameter to read.table. In the above, we have mixed decimal separators. Let’s start out by reading the table as character.

This also stores away the sample ids in row names, and removes the first column with the ids. This is optional, but for this example I assume that all columns should be numeric. If that is not the case, the code below can of course be restricted to the numeric columns, and the character (or factor) columns can be added later.

A data.frame is pretty much a list of vectors, so we use plyr to apply over the list and stringr to search and replace in the vectors. After removing characters that aren’t numbers, decimal separators, or the minus sign, we change the decimal separator, and convert the vector to numeric. Finally, we make the list a data.frame, and propagate the row names.