reads the file. We read the file in big chunks, by default 50 MB at a time. The exact size of the buffer doesn't matter much: on my computer, 1 MB was just as fast, but 10 kB was much slower. Since the loop over these chunks runs in R code, we want to keep the number of iterations small.

The rest of the function preserves the part of the buffer that may contain a partial line, since we read in constant-sized chunks. (I hope this is done correctly.)
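Something along these lines (a minimal sketch, not the original function; the name read_lines_chunked and the bookkeeping details are my own) illustrates the idea: read fixed-size chunks with readChar() from a binary connection, split each chunk on newlines, and carry any trailing partial line over to the next chunk.

    read_lines_chunked <- function(path, chunk.size = 50e6) {
      con <- file(path, open = "rb")
      on.exit(close(con))
      pieces   <- list()
      leftover <- ""
      repeat {
        chunk <- readChar(con, chunk.size, useBytes = TRUE)
        if (length(chunk) == 0 || nchar(chunk) == 0) break
        # prepend whatever was left from the previous chunk
        chunk <- paste0(leftover, chunk)
        parts <- strsplit(chunk, "\n", fixed = TRUE)[[1]]
        if (substring(chunk, nchar(chunk)) == "\n") {
          # the chunk ended exactly on a line boundary
          leftover <- ""
        } else {
          # the last element is an incomplete line; keep it for the next chunk
          leftover <- parts[length(parts)]
          parts    <- parts[-length(parts)]
        }
        pieces[[length(pieces) + 1L]] <- parts
      }
      if (nchar(leftover) > 0) pieces[[length(pieces) + 1L]] <- leftover
      unlist(pieces)
    }

Calling read_lines_chunked("file") should then give roughly the same character vector as readLines("file"), with the chunk size controlling how many times the R-level loop runs.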

The result is that we read the file in about a third of the time. The function is based on the thread "Reading large files quickly" on the R-help mailing list, in particular Rob Steele's note. Does anyone have a faster solution? Can we get as fast as rev? As fast as grep?

Update.

I got comments that it isn't fair to compare to grep and wc, because they don't need to keep the whole file in memory. So I tried the following primitive program:

    #include <stdio.h>
    #include <stdlib.h>

    int main() {
        FILE *f;
        char *a;
        f = fopen("file", "r");
        a = malloc(283000001);
        fread(a, 1, 283000000, f);
        fclose(f);
        return 0;
    }

That finished reading the file in 0.5 seconds. It is even possible to write a version of this function that uses R_alloc and can be dynamically loaded; that, too, is quite fast. I then turned to studying readChar(), and why it is slower than the C program. Maybe I'll write about that sometime. But it seems that reading the whole file in at once with readChar() is much faster, though it takes more memory…
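For reference, the whole-file approach can look like this (a minimal sketch under my own naming, not the code from the post): ask file.info() for the size in bytes, pull the entire file in with a single readChar() call, and split it into lines with strsplit().

    read_lines_whole <- function(path) {
      size <- file.info(path)$size            # number of bytes to request
      txt  <- readChar(path, size, useBytes = TRUE)
      strsplit(txt, "\n", fixed = TRUE)[[1]]
    }

The trade-off is the one mentioned above: the full text and the resulting character vector are held in memory at the same time.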

Much better. readChar() could be made a bit faster, but strsplit() is where more gain can be made. This code can be used instead of readLines() if needed. I haven't shown a better scan() or read.table() yet…

This thread will continue with other read functions.

