Processing large CSV files with Ruby

Processing large files is a memory-intensive operation and can cause servers to run out of RAM and swap to disk. Let's look at a few ways to process CSV files with Ruby and measure their memory consumption and speed.

Prepare CSV data sample

Before we start, let's prepare a CSV file data.csv with 1 million rows (~75 MB) to use in the tests.
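
A minimal sketch of such a generator, together with simple helpers for measuring memory and time (the column names and values are illustrative, and the memory helper assumes a Unix-like system with `ps` available):

```ruby
require 'csv'
require 'benchmark'

# Measure how much the process's resident set size (RSS) grows while
# the block runs. Shells out to `ps`, so it assumes a Unix-like system.
def print_memory_usage
  memory_before = `ps -o rss= -p #{Process.pid}`.to_i
  yield
  memory_after = `ps -o rss= -p #{Process.pid}`.to_i
  puts "Memory: #{((memory_after - memory_before) / 1024.0).round(2)} MB"
end

def print_time_spent
  time = Benchmark.realtime { yield }
  puts "Time: #{time.round(2)} s"
end

headers = %w[id name email city street country]

print_memory_usage do
  print_time_spent do
    # Stream rows straight to disk; only one row lives in memory at a time.
    CSV.open('data.csv', 'w', write_headers: true, headers: headers) do |csv|
      1_000_000.times do |i|
        csv << [i, 'Pink Panther', 'pink.panther@example.com',
                'Pink City', 'Pink Road', 'Pink Country']
      end
    end
  end
end
```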

Output can vary between machines, but the point is that while building the CSV file, the Ruby process did not spike in memory usage, because the garbage collector (GC) was reclaiming memory as it went. The memory increase of the process is about 1 MB, and it created a CSV file 75 MB in size.
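
Reading CSV from a file at once (CSV.read)

First, let's read the whole file into a CSV object in one go and iterate over it. A minimal sketch (summing the `id` column is just an illustrative way to force every row to be visited):

```ruby
require 'csv'

# Loads and parses the entire file, building the full CSV::Table in memory.
csv = CSV.read('data.csv', headers: true)

sum = 0
csv.each do |row|
  sum += row['id'].to_i
end
puts "Sum: #{sum}"
```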

Important to note here is the big memory spike to 920 MB. That is because we build the whole CSV object in memory: the CSV library creates lots of String objects, and the memory used ends up much higher than the actual size of the CSV file.

Parsing CSV from an in-memory String (CSV.parse)

Let's parse CSV from content already loaded in memory and iterate over it. A minimal sketch (passing a block to CSV.parse makes it yield rows one by one instead of building the whole table; the `id` column is illustrative):
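
```ruby
require 'csv'

# The whole file content is held in memory as one ~75 MB String.
content = File.read('data.csv')

# With a block, CSV.parse yields each row as it is parsed instead of
# building the whole CSV::Table in memory.
sum = 0
CSV.parse(content, headers: true) do |row|
  sum += row['id'].to_i
end
puts "Sum: #{sum}"
```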

From the results we can see that the memory used is about the file size (75 MB), because the whole file content is loaded into memory, and the processing time is about twice as fast. This approach is useful when we already have the content in memory, rather than in a file, and just want to iterate over it line by line.

Parsing a CSV file line by line from an IO object

Can we do better than the previous script? Yes, if the CSV content is in a file: instead of reading it into a String first, we can hand an IO object to the CSV library directly. A minimal sketch (same illustrative `id` sum as above):
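
```ruby
require 'csv'

File.open('data.csv', 'r') do |file|
  # CSV.new wraps the IO object and reads it lazily, row by row.
  csv = CSV.new(file, headers: true)

  sum = 0
  while (row = csv.shift)
    sum += row['id'].to_i
  end
  puts "Sum: #{sum}"
end
```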

In the last script we see less than 1 MB of memory increase. The time is a little slower compared to the previous script because there is more IO involved. The CSV library has a built-in mechanism for this, CSV.foreach, sketched below:
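
```ruby
require 'csv'

# CSV.foreach opens the file, yields each row, and closes the file for us,
# keeping the same constant memory profile as the IO-based script above.
sum = 0
CSV.foreach('data.csv', headers: true) do |row|
  sum += row['id'].to_i
end
puts "Sum: #{sum}"
```

Besides being the shortest version, it also takes care of opening and closing the file, which makes it an idiomatic choice for processing large CSV files line by line.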