Thursday, May 27, 2010

readFile and lazy I/O

Recently, I came across a problem in a Haskell script I run frequently. Every so often, I drop a report file into a designated folder. Then I run my script, which performs an operation like the following.

getReportFiles >>= mapM readFile

This worked fine for months - until this morning, when the program crashed with an error indicating that too many files had been opened.

The problem is that readFile uses lazy I/O. Generally, when we write code like getLine >>= putStrLn, we expect these calls to happen in order - indeed, that's one of the primary purposes of the IO monad. But readFile uses hGetContents internally, which is an exception to the strict I/O found elsewhere in Haskell. So readFile opens the file and then returns a thunk instead of actually reading the file. Only when the thunk is evaluated is the I/O performed and the file read into memory. And only when the thunk has been fully evaluated will the open file be closed.

So in my snippet, I was reading in hundreds of files as thunks, and until the full contents of those thunks were evaluated, the files all remained open. This was no problem until the number of reports I had to process grew large enough to hit the limit on open files and expose the bug.

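One fix is the strict ByteString readFile (for example Data.ByteString.Char8.readFile), which is eager: it reads the whole file and closes it before returning, so you get back the complete file contents. Note that the readFile in Data.ByteString.Lazy is lazy, like the Prelude one.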
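
As a rough sketch of the eager approach, here is how a loop over report files might look with Data.ByteString.Char8. The readReports helper and the report-N.txt file names are made up for illustration and are not the names from my script; the bytestring package is assumed to be available (it ships with GHC).

```haskell
import qualified Data.ByteString.Char8 as B

-- Hypothetical stand-in for the script's loop over report files.
-- B.readFile is strict: each file is read in full and closed
-- before the next one is opened.
readReports :: [FilePath] -> IO [B.ByteString]
readReports = mapM B.readFile

main :: IO ()
main = do
  -- Create a few sample files so the example is self-contained.
  let paths = ["report-" ++ show n ++ ".txt" | n <- [1 :: Int .. 3]]
  mapM_ (\p -> B.writeFile p (B.pack ("contents of " ++ p ++ "\n"))) paths
  reports <- readReports paths
  mapM_ B.putStr reports
```

The trade-off is that every file is held fully in memory, which is fine for small reports but not for very large inputs.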

Update: In the comments, Chris points out System.IO.Strict (from the strict package), which provides a strict readFile that can be used as a drop-in replacement for Prelude.readFile.
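
If adding a dependency is not an option, a strict String read can be approximated with base alone by forcing the contents while the handle is still open. This is only a sketch of the idea, not how System.IO.Strict is actually implemented, and readFileStrict is a name I made up for it.

```haskell
import Control.Exception (evaluate)
import System.IO (IOMode (ReadMode), hGetContents, withFile)

-- Hypothetical helper: read a file completely before returning.
readFileStrict :: FilePath -> IO String
readFileStrict path =
  withFile path ReadMode $ \h -> do
    contents <- hGetContents h
    -- Computing the length walks the whole string, which pulls the
    -- entire file into memory while the handle is still open.
    _ <- evaluate (length contents)
    return contents

main :: IO ()
main = do
  writeFile "sample.txt" "first line\nsecond line\n"
  contents <- readFileStrict "sample.txt"
  -- The file is already closed here, so it can be rewritten safely.
  writeFile "sample.txt" "overwritten\n"
  putStr contents
```

With this in place, mapM readFileStrict opens at most one file at a time.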

Lazy I/O can be very useful: instead of reading in the complete contents of a large file, you can read it lazily using a function in the hGetContents family and then process it without having to read the entire contents into memory at once. But lazy I/O can surprise you if you're not expecting it.
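
To illustrate the useful side, here is a small sketch that counts the lines of a file with hGetContents. Because the string is consumed as it is produced, the whole file should not need to be resident in memory at once. The countLines helper and the numbers.txt file are assumptions for the example. The result is forced inside withFile, since the handle is closed as soon as that block returns.

```haskell
import Control.Exception (evaluate)
import System.IO (IOMode (ReadMode), hGetContents, withFile)

-- Hypothetical helper: count lines while reading the file lazily.
countLines :: FilePath -> IO Int
countLines path =
  withFile path ReadMode $ \h -> do
    contents <- hGetContents h
    -- Force the count before withFile closes the handle.
    evaluate (length (lines contents))

main :: IO ()
main = do
  writeFile "numbers.txt" (unlines (map show [1 :: Int .. 1000]))
  n <- countLines "numbers.txt"
  print n -- prints 1000
```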

(thanks to #haskell for pointing out the eager Data.ByteString.Char8.readFile)