package in R is awesome for solving most data read problems, except it doesn't parse fixed width files. However, I can read each line in as a single character string (~500,000 rows, 1 column). This takes 3-5 seconds. (I love data.table.)

What I am wondering is whether this can be improved any further? I know I am not the only one who has to read fixed width files, so if this could be made faster, it would make loading even larger files (with millions of rows) more tolerable. I tried using

parallel

with

mclapply

and

data.table

instead of

lapply

, but those didn't change anything. (Likely due to my inexperience in R.) I imagine that an Rcpp function could be written to do this really fast, but that is beyond my skill set. Also, I may not be using lapply and apply appropriately.

Can anyone make suggestions to improve the speed of this? Or is this about as good as it gets?

Here is code to create a similar data.table within R (rather than linking to actual data). It should have 331 characters, and 500,000 rows. There are spaces to simulate missing fields in the data, but this is NOT space delimited data. (I am reading raw SEER data, in case anyone is interested.) Also including column widths (cols) and variable names (seervars) in case this helps someone else. These are the actual column and variable definitions for SEER data.

Also, the updated stringi::stri_sub function is at least twice as fast as base::substr(). So, in the code above that uses fread to read the file (about 4 seconds), followed by apply to parse each line, the extraction of 143 variables took about 8 seconds with stringi::stri_sub compared to 19 for base::substr. So, fread plus stri_sub is still only about 12 seconds to run. Not bad.

10 Dec 2015 update:

You can use the LaF package, which was written to handle large fixed width files (also too large to fit into memory). To use it you first need to open the file using laf_open_fwf. You can then index the resulting object as you would a normal data frame to read the data you need. In the example below, I read the entire file, but you can also read specific columns and/or lines: